When you build a retrieval-augmented generation solution in AutoAI, you can customize experiment settings to tailor your results.
If you run a RAG experiment based on default settings, the AutoAI process selects:
- The optimization metric to be maximized when searching for the best RAG pipeline
- The embedding models to try, based on the available list
- The foundation models to try, based on the available list
To exercise more control over the RAG experiment, you can customize the experiment settings. After entering the required experiment definition information, click Experiment settings to customize options before running the experiment. Settings you can review or edit fall into three categories:
- Retrieval & generation: choose which metric to use to optimize the choice of RAG pattern, how much data to retrieve, and the models AutoAI can use for the experiment.
- Indexing: choose how the data is broken down into chunks, the metric used to measure semantic similarity, and which embedding model AutoAI can use for experimentation.
- Additional information: review the watsonx.ai Runtime instance and the environment to use for the experiment.
Retrieval and generation settings
View or edit the settings that are used to generate the RAG pipelines.
Optimization metric
Choose the metric to maximize when searching for the optimal RAG patterns. For more information about optimization metrics and their implementation details, see RAG metrics.
- Answer faithfulness measures how closely the generated response aligns with the context retrieved from the vector store. The score is calculated using a lexical metric that counts how many of the generated response tokens are included in the context retrieved from the vector store. A high score indicates that the response represents the retrieved context well. Note that a high faithfulness score does not necessarily indicate correctness of the response. For more information on how the metric is implemented, see Faithfulness.
- Answer correctness measures the correctness of the generated answer compared to the correct answer provided in the benchmark files. This includes the relevance of the retrieved context and the quality of the generated response. The score is calculated using a lexical metric that counts how many of the ground-truth response tokens are included in the generated response. For more information on how the metric is implemented, see Correctness.
- Context correctness indicates to what extent the context retrieved from the vector store aligns with the ground truth context provided in the benchmark. The score is calculated based on the rank of the ground truth context among the retrieved chunks. The closer the ground truth context is to the top of the list, the higher the score. For more information on how the metric is implemented, see Context correctness.
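To make these definitions concrete, the following Python sketch shows simplified versions of a lexical token-overlap score and a rank-based context score. It is illustrative only and relies on assumptions (whitespace tokenization, a reciprocal-rank formula for context correctness); it is not the AutoAI implementation, which is described in RAG metrics.

```python
# Illustrative only: simplified lexical and rank-based scoring, not the
# exact AutoAI RAG implementation. Tokenization here is a plain
# whitespace split.

def token_recall(reference: str, candidate: str) -> float:
    """Fraction of reference tokens that also appear in the candidate."""
    ref_tokens = reference.lower().split()
    cand_tokens = set(candidate.lower().split())
    if not ref_tokens:
        return 0.0
    return sum(t in cand_tokens for t in ref_tokens) / len(ref_tokens)

# Answer faithfulness: how many generated-response tokens are covered by the
# retrieved context (a high score means the answer is grounded in the context).
def faithfulness(answer: str, retrieved_context: str) -> float:
    return token_recall(answer, retrieved_context)

# Answer correctness: how many ground-truth response tokens appear in the
# generated response.
def correctness(ground_truth_answer: str, answer: str) -> float:
    return token_recall(ground_truth_answer, answer)

# Context correctness: assumed here to be a reciprocal-rank score for the
# ground-truth context among the retrieved chunks (1.0 if ranked first,
# lower the further down the list it appears).
def context_correctness(ground_truth_context: str, retrieved_chunks: list[str]) -> float:
    for rank, chunk in enumerate(retrieved_chunks, start=1):
        if ground_truth_context in chunk:
            return 1.0 / rank
    return 0.0
```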
Retrieval methods
Choose the method for retrieving relevant data. Retrieval methods differ in the ways that they filter and rank documents.
- Window retrieval method surrounds each retrieved chunk with additional chunks that come before and after it in the original document. This method is useful for including context that might be missing from the originally retrieved chunk. Window retrieval works as follows (see the sketch after this list):
  - Search: Finds the most relevant document chunks in the vector store.
  - Expand: For each found chunk, retrieves surrounding chunks to provide context.
    - Each chunk stores its sequence number in its metadata.
    - After retrieving a chunk, the chunk metadata is used to fetch neighboring chunks from the same document. For example, if window_size is 2, the method adds the 2 chunks before and the 2 chunks after the retrieved chunk.
  - Merge: Combines overlapping text within the window to remove redundancy.
  - Metadata handling: Merges metadata dictionaries by keeping the same keys and grouping values into lists.
  - Return: Outputs the merged window as a new chunk, replacing the original one.
- Simple retrieval method finds the most relevant chunks in the vector store.
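The following Python sketch illustrates the expand and merge steps of window retrieval under stated assumptions: chunks are represented as dictionaries with a `text` field and `metadata` that stores `document_id` and `sequence_number`. These structures and helper names are hypothetical, not the AutoAI internals.

```python
# Illustrative sketch of the expand-and-merge step in window retrieval.
# The chunk structure and helper names are assumptions for this example.

def expand_window(chunk, all_chunks, window_size=2):
    """Return the retrieved chunk plus its neighbors from the same document."""
    doc_id = chunk["metadata"]["document_id"]
    seq = chunk["metadata"]["sequence_number"]
    neighbors = [
        c for c in all_chunks
        if c["metadata"]["document_id"] == doc_id
        and abs(c["metadata"]["sequence_number"] - seq) <= window_size
    ]
    return sorted(neighbors, key=lambda c: c["metadata"]["sequence_number"])

def merge_window(window_chunks):
    """Merge windowed chunks into one chunk, grouping metadata values into lists."""
    # A production version would also strip the character overlap between
    # consecutive chunks instead of simply joining their text.
    merged_text = " ".join(c["text"] for c in window_chunks)
    merged_metadata = {}
    for c in window_chunks:
        for key, value in c["metadata"].items():
            merged_metadata.setdefault(key, []).append(value)
    return {"text": merged_text, "metadata": merged_metadata}
```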
Foundation models to include
By default, all available foundation models that support AutoAI for RAG are selected for experimentation. You can manually edit the list of foundation models that AutoAI can consider for generating RAG patterns. For each model, you can click Model details to view or export details about the model.
For the list of available foundation models along with descriptions, see Foundation models by task.
Max RAG patterns to complete
You can specify the number of RAG patterns to complete in the experimentation phase, up to a maximum of 20. A higher number compares more patterns, and might result in higher-scoring patterns, but consumes more compute resources.
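As a rough illustration of how these choices fit together, the following dictionary sketches the retrieval and generation settings described above. The keys and values are hypothetical and chosen for readability; they are not the watsonx.ai SDK or REST API parameter names, and the model IDs are examples only.

```python
# Hypothetical settings sketch, for illustration only; these are not the
# actual watsonx.ai parameter names.
experiment_settings = {
    "optimization_metric": "answer_correctness",   # metric to maximize
    "foundation_models": [                          # models AutoAI may try
        "ibm/granite-3-8b-instruct",
        "mistralai/mixtral-8x7b-instruct-v01",
    ],
    "max_rag_patterns": 8,                          # up to 20; more patterns use more compute
}
```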
Indexing settings
View or edit the settings for creating the text vector database from the document collection.
Chunking
Chunking settings determine how indexed documents are broken down into smaller pieces before ingestion into a vector store. Chunking the data allows search and retrieval of the chunks of a document that are most relevant to a query, so the generation model processes only the most relevant data.
AutoAI RAG uses LangChain's recursive text splitter to break down the documents into chunks. This method decomposes the document hierarchically, trying to keep paragraphs (and then sentences, and then words) together for as long as possible, until each chunk is smaller than the requested chunk size. For more information about the recursive chunking method, see Retrieval recursively split by character in the LangChain documentation.
How to best chunk your data depends on your use case. Smaller chunks provide a more granular interaction with text, enabling more focused search for relevant content, whereas larger chunks can provide more context. For your chunking use case, specify one or more options for:
- The number of characters to include in each chunk of data.
- The number of characters to overlap between consecutive chunks. The number must be smaller than the chunk size.
The selected options are explored and compared in the experimentation phase.
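Because AutoAI RAG uses LangChain's recursive text splitter, you can preview the effect of a given chunk size and overlap on your own documents with a short standalone script. The values below are examples only, not AutoAI defaults, and the file name is a placeholder.

```python
# Standalone sketch of the recursive chunking strategy using LangChain's
# RecursiveCharacterTextSplitter directly.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,     # maximum characters per chunk
    chunk_overlap=64,   # overlapping characters; must be smaller than chunk_size
)

with open("my_document.txt", encoding="utf-8") as f:
    chunks = splitter.split_text(f.read())

print(f"Produced {len(chunks)} chunks")
```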
Embedding models
Embedding models are used in retrieval-augmented generation solutions to encode chunks and queries as vectors that capture their semantic meaning. The vectorized input data chunks are ingested into a vector store. When a query is submitted, its vectorized representation is used to search the vector store for relevant chunks.
For a list of embedding models available for use with AutoAI RAG experiments, see Supported encoder models available with watsonx.ai.
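The following sketch shows the basic encode, ingest, and search flow that an embedding model supports. It uses the open-source sentence-transformers library and an in-memory list as stand-ins for a watsonx.ai embedding model and a real vector store; the model name and sample texts are illustrative.

```python
# Minimal sketch of the encode -> ingest -> search flow.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example encoder, not an AutoAI default

chunks = [
    "AutoAI selects embedding and foundation models for RAG patterns.",
    "Chunking splits documents before they are ingested into a vector store.",
]
chunk_vectors = model.encode(chunks, normalize_embeddings=True)  # "ingest" step

query = "How are documents prepared for the vector store?"
query_vector = model.encode([query], normalize_embeddings=True)[0]

# Cosine similarity reduces to a dot product on normalized vectors.
scores = chunk_vectors @ query_vector
best = int(np.argmax(scores))
print(chunks[best])
```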
Additional information
Review the watsonx.ai Runtime instance used for this experiment and the environment definition.
Learn more
Retrieval-Augmented Generation (RAG)
Parent topic: Creating a RAG experiment