Every foundation model has the potential to generate output that includes incorrect or even harmful content. Understand the types of undesirable output that can be generated, the reasons for the undesirable output, and steps that you can take to reduce the risk of harm.
The foundation models that are available in IBM watsonx.ai can generate output that contains hallucinations, personal information, hate speech, abuse, profanity, and bias. The following techniques can help reduce the risk, but do not guarantee that generated output will be free of undesirable content.
The following sections describe techniques to help you avoid each type of undesirable content in foundation model output.
Hallucinations
When a foundation model generates off-topic, repetitive, or incorrect content or fabricates details, that behavior is sometimes called hallucination.
Off-topic hallucinations can happen because of pseudo-randomness in the decoding of the generated output. In the best cases, that randomness can result in wonderfully creative output. But randomness can also result in nonsense output that is not useful.
The model might return hallucinations in the form of fabricated details when it is prompted to generate text but is not given enough related text to draw upon. For example, if you include correct details in the prompt, the model is less likely to make up details.
Techniques for avoiding hallucinations
To avoid hallucinations, test one or more of these techniques:
- Choose a model with pretraining and fine-tuning that match your domain and the task that you are doing.
- Provide context in your prompt. If you instruct a foundation model to generate text on a subject that is not common in its pretraining data and you don't add information about the subject to the prompt, the model is more likely to hallucinate.
- Specify conservative values for the Min tokens and Max tokens parameters, and specify one or more stop sequences. When you specify a high value for the Min tokens parameter, you force the model to generate a longer response than it would naturally return for the prompt. The model is more likely to hallucinate as it adds words to reach the required minimum. A request sketch that combines these parameter settings follows this list.
- For use cases that don't require much creativity in the generated output, use greedy decoding. If you prefer to use sampling decoding, be sure to specify conservative values for the temperature, top-p, and top-k parameters.
- To reduce repetitive text in the generated output, try increasing the repetition penalty parameter.
- If you see repetitive text in the generated output when you use greedy decoding, and if some creativity is acceptable for your use case, try sampling decoding instead. Be sure to set moderately low values for the temperature, top-p, and top-k parameters.
- In your prompt, instruct the model what to do when it has no confident or high-probability answer. For example, in a question-answering scenario, you can include the instruction “If the answer is not in the article, say ‘I don't know’.”
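The following Python sketch shows how several of these techniques might come together in the body of a text generation request: grounding context, an “I don't know” instruction, conservative token limits, a stop sequence, and greedy decoding with an optional switch to conservative sampling. It is a minimal illustration that assumes the parameter names that are documented in the watsonx.ai API reference; the model ID, project ID, article text, and specific parameter values are placeholders, not recommendations.

```python
# A minimal sketch of a grounded question-answering request with conservative
# generation parameters. Parameter names follow the watsonx.ai API reference;
# the model ID, project ID, and article text are placeholders.
import json

article = "<the reference text that grounds the answer>"

prompt = (
    "Answer the question by using only the article below.\n"
    "If the answer is not in the article, say \"I don't know\".\n\n"
    f"Article:\n{article}\n\n"
    "Question: What were the study's findings?\n"
    "Answer:"
)

payload = {
    "model_id": "<model of your choice>",
    "project_id": "<your project ID>",
    "input": prompt,
    "parameters": {
        "decoding_method": "greedy",   # deterministic output; no sampling randomness
        "min_new_tokens": 1,           # don't force a longer answer than needed
        "max_new_tokens": 200,
        "repetition_penalty": 1.1,     # discourage repetitive text
        "stop_sequences": ["\n\n"],    # stop when the answer is complete
    },
}

# If some creativity is acceptable, switch to sampling with moderate values:
# payload["parameters"].update(
#     {"decoding_method": "sample", "temperature": 0.5, "top_p": 0.85, "top_k": 50}
# )

print(json.dumps(payload, indent=2))  # request body for the text generation endpoint
```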
Personal information
A foundation model's vocabulary is formed from words in its pretraining data. If pretraining data includes web pages that are scraped from the internet, the model's vocabulary might contain the following types of information:
- Names of article authors
- Contact information from company websites
- Personal information from questions and comments that are posted in open community forums
If you use a foundation model to generate text for part of an advertising email, the generated content might include contact information for another company!
If you ask a foundation model to write a paper with citations, the model might include references that look legitimate but aren't. It might even attribute those made-up references to real authors from the correct field. A foundation model is likely to generate imitation citations, correct in form but not grounded in fact, because models are good at stringing together words (including names) that have a high probability of appearing together. Because the model lends the output a touch of legitimacy by naming real people as authors, this form of hallucination is compelling and believable, and also dangerous: people can get into trouble if they believe that the citations are real, and real authors can be harmed when they are listed as authors of works that they did not write.
Techniques for excluding personal information
To exclude personal information, try these techniques:
- In your prompt, instruct the model to refrain from mentioning names, contact details, or personal information. For example, when you prompt a model to generate an advertising email, instruct the model to include your company name and phone number, and also instruct the model to “include no other company or personal information”.
- From the watsonx.ai API, you can enable the PII filter in the moderations field when you submit an inference request. For more information, see the API reference documentation. A request sketch follows this list.
- In your larger application, pipeline, or solution, post-process the content that is generated by the foundation model to find and remove personal information.
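For reference, the following sketch shows the general shape of an inference request body with the PII filter enabled through the moderations field. The field layout is based on the watsonx.ai API reference, but treat the exact nesting of the filter options as an assumption to confirm against that documentation; the model ID and project ID are placeholders.

```python
# A minimal sketch of an inference request body that enables the PII filter
# through the moderations field. Confirm the exact field layout against the
# watsonx.ai API reference; the model ID and project ID are placeholders.
import json

payload = {
    "model_id": "<model of your choice>",
    "project_id": "<your project ID>",
    "input": "Write a short advertising email for our spring sale.",
    "parameters": {"decoding_method": "greedy", "max_new_tokens": 250},
    "moderations": {
        "pii": {
            "input": {"enabled": True},    # screen the prompt for personal information
            "output": {"enabled": True},   # screen the generated text as well
        }
    },
}

print(json.dumps(payload, indent=2))  # request body for the text generation endpoint
```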
Hate speech, abuse, and profanity
As with personal information, when pretraining data includes hateful or abusive terms or profanity, a foundation model that is trained on that data has those problematic terms in its vocabulary. If inappropriate language is in the model's vocabulary, the foundation model might generate text that includes undesirable content.
When you use foundation models to generate content for your business, you must do the following things:
- Recognize that this kind of output is always possible.
- Take steps to reduce the likelihood of triggering the model to produce this kind of harmful output.
- Build human review and verification processes into your solutions.
Techniques for reducing the risk of hate speech, abuse, and profanity
To avoid hate speech, abuse, and profanity, test one or more of these techniques:
- In the Prompt Lab, set the AI guardrails switch to On. When this feature is enabled, any sentence in the input prompt or generated output that contains harmful language is replaced with a message that says that potentially harmful text was removed.
- Do not include hate speech, abuse, or profanity in your prompt, to prevent the model from responding in kind.
- In your prompt, instruct the model to use clean language. For example, depending on the tone that you need for the output, instruct the model to use “formal”, “professional”, “PG”, or “friendly” language.
- From the watsonx.ai API, you can enable the HAP filter in the moderations field when you submit an inference request. For more information, see the API reference documentation.
- In your larger application, pipeline, or solution, post-process the content that is generated by the foundation model to remove undesirable content. A post-processing sketch follows this list.
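As a starting point for that kind of post-processing, the following sketch flags and replaces sentences that contain denylisted terms before the generated content reaches users. The denylist and the sentence splitting are deliberately simplistic placeholders; a production pipeline would rely on the HAP filter or a dedicated classifier rather than a hard-coded word list.

```python
# A minimal post-processing sketch: flag sentences that contain denylisted
# terms and replace them before the content reaches users. The denylist and
# the sentence splitting are simplistic placeholders.
import re

DENYLIST = {"badword1", "badword2"}  # placeholder terms to screen for

def redact_flagged_sentences(text: str) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text)
    cleaned = []
    for sentence in sentences:
        words = {word.lower().strip(".,!?\"'") for word in sentence.split()}
        if words & DENYLIST:
            cleaned.append("[Potentially harmful text removed.]")
        else:
            cleaned.append(sentence)
    return " ".join(cleaned)

print(redact_flagged_sentences("This sentence is fine. This one contains badword1."))
```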
Reducing the risk of bias in model output
During pretraining, a foundation model learns the statistical probability that certain words follow other words based on how those words appear in the training data. Any bias in the training data is trained into the model.
For example, if the training data more frequently refers to doctors as men and nurses as women, that bias is likely to be reflected in the statistical relationships between those words in the model. As a result, the model is likely to generate output that more frequently refers to doctors as men and nurses as women. Sometimes, people believe that algorithms can be more fair and unbiased than humans because the algorithms are “just using math to decide”. But bias in training data is reflected in content that is generated by foundation models that are trained on that data.
Techniques for reducing bias
It is difficult to debias output that is generated by a foundation model that was pretrained on biased data. However, you might improve results by including content in your prompt to counter bias that might apply to your use case.
For example, instead of instructing a model to “list heart attack symptoms”, you might instruct the model to “list heart attack symptoms, including symptoms that are common for men and symptoms that are common for women”.
Parent topic: Prompt tips