Foundation models sometimes generate output that is not factually accurate. If factual accuracy is important for your project, set yourself up for success by learning how and why these models can get facts wrong and how you can ground generated output in correct facts.
Why foundation models get facts wrong
Foundation models can get facts wrong for a few reasons:
Pre-training builds word associations, not facts
Pre-training data sets contain out-of-date facts
Pre-training data sets do not contain esoteric or domain-specific facts and jargon
Sampling decoding is more likely to stray from the facts
Pre-training builds word associations, not facts
During pre-training, a foundation model builds up a vocabulary of the words (tokens) encountered in the pre-training data sets, and statistical relationships between those words become encoded in the model weights.
For example, "Mount Everest" often appears near "tallest mountain in the world" in many articles, books, speeches, and other common pre-training sources. As a result, a pre-trained model will probably correctly complete
the prompt "The tallest mountain in the world is " with the output "Mount Everest."
These word associations can make it seem that facts have been encoded into these models too. For very common knowledge and immutable facts, you might have good luck generating factually accurate output from pre-trained foundation models with simple prompts like the tallest-mountain example. However, relying only on pre-trained word associations is a risky strategy when you use foundation models in applications where accuracy matters.
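As a minimal sketch of what this looks like in practice, the following Python snippet inspects the next-token probabilities of a pre-trained model. It assumes the open source Hugging Face transformers library and GPT-2 purely as illustrative stand-ins; neither is specific to this documentation, and any pre-trained causal language model behaves similarly.
# Sketch: inspecting the next-token probabilities of a pre-trained model.
# Assumes the Hugging Face transformers library and GPT-2 as illustrative
# stand-ins for any pre-trained foundation model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The tallest mountain in the world is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # scores for the next token

probs = torch.softmax(next_token_logits, dim=-1)
top = torch.topk(probs, k=5)
for p, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}: {p.item():.3f}")
# The highest-probability tokens reflect word associations learned from the
# pre-training data, not facts retrieved from a verified source.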
Pre-training data sets contain out-of-date facts
Collecting pre-training data sets and performing pre-training runs can take a significant amount of time, sometimes months. If a model was pre-trained on a data set from several years ago, the model vocabulary and word associations encoded
in the model weights won't reflect current world events or newly popular themes. For this reason, if you submit the prompt "The most recent winner of the World Cup of football (soccer) is " to a model pre-trained on information
a few years old, the generated output will be out of date.
Pre-training data sets do not contain esoteric or domain-specific facts and jargon
Common foundation model pre-training data sets, such as The Pile, contain hundreds of millions of documents. Given how famous Mount
Everest is, it's reasonable to expect a foundation model to have encoded a relationship between "tallest mountain in the world" and "Mount Everest". However, if a phenomenon, person, or concept is mentioned in only a
handful of articles, chances are slim that a foundation model would have any word associations about that topic encoded in its weights. Prompting a pre-trained model about information that was not in its pre-training data sets is unlikely
to produce factually accurate generated output.
Sampling decoding is more likely to stray from the facts
Decoding is the process a model uses to choose the words (tokens) in the generated output:
Greedy decoding always selects the token with the highest probability
Sampling decoding selects tokens pseudo-randomly from a probability distribution
Greedy decoding generates output that is more predictable and more repetitive. Sampling decoding is more random, which can feel more "creative". If, based on the pre-training data sets, the most likely words to follow "The tallest mountain in the world is " are "Mount Everest", then greedy decoding can reliably generate that factually correct output, whereas sampling decoding might sometimes generate the name of some other mountain, or something that isn't a mountain at all.
How to ground generated output in correct facts
Rather than relying only on pre-trained word associations for factual accuracy, provide context in your prompt text.
Use context in your prompt text to establish facts
When you prompt a foundation model to generate output, the words (tokens) in the generated output are influenced by the words in the model vocabulary and the words in the prompt text. You can use your prompt text to boost factually accurate
word associations.
Example 1
Here's a prompt that asks a model to complete a sentence stating your favorite color:
My favorite color is
Given that only you know what your favorite color is, there's no way the model could reliably generate the correct output.
Instead, a color will be selected from colors mentioned in the model's pre-training data:
If greedy decoding is used, whichever color appears most frequently with statements about favorite colors in pre-training content will be selected.
If sampling decoding is used, a color will be selected randomly from colors mentioned most often as favorites in the pre-training content.
Example 2
Here's a prompt that includes context to establish the facts:
I recently painted my kitchen yellow, which is my favorite color.
My favorite color is
If you prompt a model with text that includes factually accurate context like this, the output the model generates is more likely to be correct.
When you include context with the needed facts in your prompt, using greedy decoding is likely to generate accurate output. If you need some variety in the output, you can experiment with sampling decoding with low values for parameters like
Temperature, Top P, and Top K. However, using sampling decoding increases the risk of inaccurate output.
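As a sketch of this advice, the following assumes the same Hugging Face transformers setup used in the earlier examples; in other libraries and products the decoding parameters have different names and default values.
# Prompt that carries the needed fact as context.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
context = "I recently painted my kitchen yellow, which is my favorite color.\n"
prompt = context + "My favorite color is"

# Greedy decoding: most likely to echo the fact supplied in the context.
print(generator(prompt, max_new_tokens=3, do_sample=False)[0]["generated_text"])

# Conservative sampling: some variety in the output, with a lower risk of
# drifting from the supplied fact than aggressive sampling settings.
print(generator(prompt, max_new_tokens=3, do_sample=True,
                temperature=0.3, top_k=10, top_p=0.5)[0]["generated_text"])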
Retrieval-augmented generation
The retrieval-augmented generation pattern scales out the technique of pulling context into prompts. If you have a knowledge base, such as process documentation in web pages, legal contracts in PDF files, a database of products for sale, a GitHub
repository of C++ code files, or any other collection of information, you can use the retrieval-augmented generation pattern to generate factually accurate output based on the information in that knowledge base.
Retrieval-augmented generation involves three basic steps:
Search for relevant content in your knowledge base
Pull the most relevant content into your prompt as context
Send the combined prompt text to the model to generate output
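Here is a minimal sketch of those three steps in Python. The in-memory document list, the keyword search function, and the send_to_model call are hypothetical placeholders; a real implementation would typically use a vector or keyword search index over your knowledge base and whichever text-generation API your project uses.
# Step 1: search for relevant content in your knowledge base.
# A toy keyword search over an in-memory list of documents (placeholder
# for a real search index such as a vector database).
knowledge_base = [
    "Refunds are processed within 5 business days of receiving the return.",
    "Our headquarters are located in Armonk, New York.",
    "Support is available Monday through Friday, 9am to 5pm Eastern.",
]

def search(query, documents, top_n=2):
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(doc.lower().split())), doc) for doc in documents]
    scored.sort(reverse=True)
    return [doc for score, doc in scored[:top_n] if score > 0]

question = "How long do refunds take?"
relevant = search(question, knowledge_base)

# Step 2: pull the most relevant content into the prompt as context.
prompt = "Answer the question using only the context below.\n\n"
prompt += "Context:\n" + "\n".join(relevant) + "\n\n"
prompt += "Question: " + question + "\nAnswer:"

# Step 3: send the combined prompt text to the model to generate output.
# send_to_model is a hypothetical stand-in for whichever generation API
# your project uses.
# answer = send_to_model(prompt)
print(prompt)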