Every foundation model has the potential to generate output that includes incorrect or even harmful content. Review this topic to understand the types of undesirable output that can be generated, the underlying reasons for that output, and steps you can take to reduce the risk of harm.
Types of undesirable or harmful output
Hallucinations
When a foundation model generates off-topic, repetitive, or incorrect content or fabricates details, that behavior is sometimes called hallucination. Off-topic hallucinations can happen because of pseudo-randomness in the decoding of the generated output. In the best cases, that randomness can result in wonderfully creative output. But it can also result in nonsense output that is not useful. Hallucinations that fabricate details can happen when a model is prompted to generate text but doesn't have enough related text to draw upon (in the prompt, for example) to produce a result that contains the correct details.
Personal information
A foundation model's vocabulary is formed from words in its pretraining data. If the pretraining data includes web pages scraped from the internet, the model's vocabulary can contain the names of article authors, contact information from company websites, and personal information posted in questions and comments on open community forums. If you use a foundation model to generate text for part of an advertising email, the generated content might include contact information for another company! If you ask a foundation model to write a paper with citations, the model might include likely-sounding references and might even attribute those references to real authors from the correct field. But those "citations" are not real. They are an imitation of citations, correct in form but not grounded in facts, produced by stringing together words (including names) that have a high probability of appearing together. That touch of personal information, the names of real people as authors in citations, makes this form of hallucination compelling and believable, which can get people into serious trouble if they believe the citations are real. It can also harm the people who are listed as authors of works they never wrote.
Hate speech, abuse, and profanity
As with personal information, when pretraining data includes hateful or abusive terms or profanity, a foundation model trained on that data will have those problematic terms in its vocabulary and can generate text that includes that undesirable content. When using foundation models to generate content for your business, you need to recognize that this kind of output is always possible, take steps to reduce the likelihood of triggering the model to produce this kind of harmful output, and build human review and verification processes into your solutions.
Bias
During pretraining, a foundation model learns the statistical probability that certain words follow other words based on how those words appear in the training data. Any bias in the training data will be trained into the model. For example, if the training data more frequently refers to doctors as men and nurses as women, that bias will be reflected in the statistical relationships between those words in the model, and so the model will generate output that more frequently refers to doctors as men and nurses as women. Sometimes, people believe algorithms can be more fair and unbiased than humans because the algorithms are "just using math to reach decisions." But bias in training data will be reflected in content generated by foundation models trained on that data.
Reducing the potential for undesirable and harmful output
Any foundation model available in IBM watsonx.ai is capable of generating output containing hallucinations, personal information, hate speech, abuse, profanity, and bias. The best practices listed below can help reduce the risk, but there is no way to guarantee that generated output will not contain undesirable content.
Reducing the risk of hallucinations
- Choose a model with pretraining and fine-tuning that matches the domain and task you are performing.
- Provide context in your prompt. Instructing a foundation model to generate text on a subject that is not common in its pretraining data and for which no context has been provided in the prompt increases the likelihood of hallucinations.
- Specify conservative values for the Min tokens and Max tokens parameters, and define one or more stop sequences. Forcing a response that is longer than the model would naturally produce for a given prompt by setting a high Min tokens value increases the likelihood of hallucinations. For an example of these parameter settings, see the sketch after this list.
- To reduce the risk of off-topic hallucination with use cases that don't require much creativity in the generated output, use greedy decoding or specify conservative values for the temperature, top-p, and top-k parameters with sampling decoding.
- To reduce repetitive text in the generated output, try increasing the repetition penalty parameter.
- If you are using greedy decoding and you see repetitive text in the generated output, and if some creativity is acceptable for your use case, use sampling decoding with moderately low values for the temperature, top-p, and top-k parameters.
- In your prompt, instruct the model what to do when it has no confident or high-probability answer. For example, in a question-answering scenario, you could include the instruction: "If the answer is not in the article, say 'I don't know.'"
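The following minimal sketch shows how the parameter settings described above might be applied programmatically. It assumes the ibm-watsonx-ai Python SDK and uses placeholder values for the API key, project ID, and model ID; exact class names and parameter keys can vary between SDK versions, so check the documentation for your release.

```python
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

# Placeholder credentials and identifiers; replace with values for your environment.
credentials = Credentials(
    url="https://us-south.ml.cloud.ibm.com",
    api_key="YOUR_API_KEY",
)

# Conservative generation parameters: greedy decoding, modest token limits,
# a stop sequence, and a mild repetition penalty to discourage repetitive text.
params = {
    "decoding_method": "greedy",
    "min_new_tokens": 1,
    "max_new_tokens": 200,
    "stop_sequences": ["\n\n"],
    "repetition_penalty": 1.1,
}

model = ModelInference(
    model_id="YOUR_MODEL_ID",
    credentials=credentials,
    project_id="YOUR_PROJECT_ID",
    params=params,
)

# Provide context in the prompt and tell the model what to do when it has no answer.
prompt = (
    "Answer the question using only the article below. "
    "If the answer is not in the article, say 'I don't know.'\n\n"
    "Article: ...\n\nQuestion: ..."
)

print(model.generate_text(prompt=prompt))
```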
Reducing the risk of personal information in generated output
- In your prompt, instruct the model to refrain from mentioning names, contact details, or personal information. For example, when prompting a model to generate an advertising email, instruct the model to include your company name and phone number and also instruct the model to "include no other company or personal information."
- In your larger application, pipeline, or solution, post-process the content generated by the foundation model to remove personal information.
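As a starting point for that post-processing step, the following sketch uses simple regular expressions to redact email addresses and phone numbers from generated text. The patterns and the redact_personal_info helper are illustrative only; production solutions typically combine pattern matching with a dedicated PII-detection service.

```python
import re

# Hypothetical post-processing step: redact common personal-information patterns
# (email addresses and North American phone numbers) from generated text before
# it reaches downstream systems. Extend the patterns to match your requirements.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def redact_personal_info(text: str, replacement: str = "[REDACTED]") -> str:
    """Replace matches of each personal-information pattern with a placeholder."""
    for pattern in PII_PATTERNS.values():
        text = pattern.sub(replacement, text)
    return text

print(redact_personal_info("Reach us at 555-123-4567 or sales@example.com."))
# Reach us at [REDACTED] or [REDACTED].
```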
Reducing the risk of hate speech, abuse, and profanity in generated output
- In the Prompt Lab, set the AI guardrails toggle on. When this toggle is on, any sentence in the input prompt or generated output that contains harmful language will be replaced with a message saying potentially harmful text has been removed.
- Do not include hate speech, abuse, or profanity in your prompt, because the model can respond in kind.
- In your prompt, instruct the model to use clean language. For example, depending on the tone you need for the output, instruct the model to use "formal", "professional", "PG", or "friendly" language.
- In your larger application, pipeline, or solution, post-process the content generated by the foundation model to remove undesirable content.
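The following sketch illustrates one simple form of that post-processing: dropping any generated sentence that contains a term from a blocklist. The BLOCKLIST contents and the filter_generated_text helper are hypothetical; a production solution would more likely call a dedicated content-moderation model or service.

```python
import re

# Hypothetical post-processing filter: remove any generated sentence that contains
# a blocklisted term. BLOCKLIST holds placeholder entries; supply your own list.
BLOCKLIST = {"badword1", "badword2"}

def filter_generated_text(text: str) -> str:
    """Return the text with sentences that contain blocklisted terms removed."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    clean = [s for s in sentences if not any(term in s.lower() for term in BLOCKLIST)]
    return " ".join(clean)

print(filter_generated_text("This sentence is fine. This one mentions badword1."))
# This sentence is fine.
```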
Reducing the risk of bias in generated output
It is very difficult to debias output generated by a foundation model pretrained on biased data. However, you might improve results by including content in your prompt that counters bias likely to apply to your use case. For example, instead of instructing a model to "list heart attack symptoms", you might instruct the model to "list heart attack symptoms, including symptoms common for men and symptoms common for women."
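To apply that kind of counter-bias wording consistently, you can build it into a reusable prompt template, as in this small sketch. The template text and the topic placeholder are illustrative only.

```python
# Hypothetical template that embeds the counter-bias wording from the example above
# so that every request applies it consistently.
COUNTER_BIAS_TEMPLATE = (
    "List {topic}, including symptoms common for men and symptoms common for women."
)

prompt = COUNTER_BIAS_TEMPLATE.format(topic="heart attack symptoms")
print(prompt)
# List heart attack symptoms, including symptoms common for men and symptoms common for women.
```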
Parent topic: Prompt tips