Use AutoAI to automate and accelerate the search for an optimized, production-quality retrieval-augmented generation (RAG) pattern based on your data and use case.
Data format
- Document collection files of type PDF, HTML, DOCX, MD, or plain text
- Test data with questions and answers in JSON format
Data file limits
- Up to 20 files or folders for the document collection. For larger document collections, AutoAI runs the experiment with a sample of 1 GB.
- 1 JSON file for test data (see the sample sketch after this list). An experiment uses up to 25 question-and-answer pairs to evaluate patterns.
Environment size
- Large: 8 CPU and 32 GB RAM
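For illustration, here is one plausible shape for the JSON test file, written as a small Python script. The field names follow the AutoAI RAG sample notebooks but are an assumption here; verify the exact schema against the product documentation.

```python
import json

# Hypothetical benchmark file with question-and-answer pairs. The field
# names follow the AutoAI RAG sample notebooks but may differ by release;
# verify the schema against the product documentation.
benchmark = [
    {
        "question": "What types of files can the document collection contain?",
        "correct_answer": "PDF, HTML, DOCX, MD, or plain text files.",
        "correct_answer_document_ids": ["data_formats.html"],  # hypothetical ID
    },
]

with open("benchmark.json", "w") as f:
    json.dump(benchmark, f, indent=2)
```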
Estimating the costs for a RAG pattern
AutoAI RAG experiments consume both Capacity Unit Hours (CUH) for training the experiment and Resource Units (RU) to measure the tokens consumed for embedding documents and inferencing the foundation models.
CUH accrues at a standard rate of 20 CUH per hour for the supported environment. The CUH consumed for an experiment depends on the complexity of the experiment and the time required to train the patterns.
RU consumption depends on a number of factors, including:
- the size of the document collection for embedding
- the number of evaluation questions and answers to embed
- the chunking configuration, calculated with this formula: ((Chunk overlap + Chunk size) * Chunk count) * Evaluation records
- the number of patterns the experiment creates
This sample demonstrates how resources are calculated for a single RAG pattern.
Experiment input:
- 100 document pages
- 25 evaluation question/answer records
| Activity | Tokens consumed |
| --- | --- |
| Embedding documents | 3,000,000 |
| Embedding evaluation records | 25,000 |
| Retrieving context for prompts | 192,000 |
| Retrieving context for evaluation records | 25,000 |
| Generating answers | 25,000 |
| Total tokens consumed | 3,267,000 |
The total Resource Units consumed for this experiment = 3,267,000 tokens / 1,000 tokens per RU = 3,267 RU.
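To make the sample arithmetic explicit, here is a small sketch that reproduces the totals above. The chunking helper implements the formula given earlier in this section, and the 30-minute duration used for the CUH line is a hypothetical value, not a measured one.

```python
# Rough cost estimate for the sample experiment above.

def chunking_tokens(chunk_overlap: int, chunk_size: int,
                    chunk_count: int, evaluation_records: int) -> int:
    """Token estimate for the chunking configuration:
    ((Chunk overlap + Chunk size) * Chunk count) * Evaluation records."""
    return (chunk_overlap + chunk_size) * chunk_count * evaluation_records

# Token counts from the sample table
tokens = {
    "embedding_documents": 3_000_000,
    "embedding_evaluation_records": 25_000,
    "retrieving_context_for_prompts": 192_000,
    "retrieving_context_for_evaluation_records": 25_000,
    "generating_answers": 25_000,
}

total_tokens = sum(tokens.values())    # 3,267,000
resource_units = total_tokens / 1_000  # 3,267 RU at 1,000 tokens per RU

# CUH accrues at 20 CUH per hour on the Large environment; a hypothetical
# 30-minute experiment would therefore consume 10 CUH.
cuh = 20 * 0.5

print(f"{total_tokens=} {resource_units=} {cuh=}")
```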
Providing accurate answers with retrieval-augmented generation
Retrieval-augmented generation (RAG) combines the generative power of a large language model with the accuracy of a collection of grounding documents. Interaction with a RAG application follows this pattern:
1. A user submits a question to the app.
2. A search retrieves relevant context from the set of grounding documents.
3. The large language model generates an answer that includes the relevant information.
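A minimal sketch of this retrieve-then-generate loop, where `vector_store.search` and `llm.generate` are hypothetical stand-ins for a real vector database query and a foundation model inference call:

```python
# Minimal retrieve-then-generate sketch. `vector_store` and `llm` are
# hypothetical objects standing in for a real vector database and
# foundation-model client.

def answer(question: str, vector_store, llm, top_k: int = 3) -> str:
    # 1. Retrieve the most relevant chunks from the grounding documents.
    chunks = vector_store.search(question, top_k=top_k)
    # 2. Ground the prompt in the retrieved context.
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        f"Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    # 3. Generate a natural-language answer from the grounded prompt.
    return llm.generate(prompt)
```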
For example, the sample notebooks that are provided for this feature use the product documentation for the watsonx.ai Python client library as the grounding documents for a Q&A app about coding watsonx.ai solutions. Pattern users get the
benefit of specific, relevant information from the documentation, with the generative AI model adding context and presenting the answers in natural language.
For a complete description and examples of how retrieval-augmented generation can improve your question and answer applications, see Retrieval-augmented generation (RAG).
Automating the search for the best RAG configuration
RAG comes with many configuration parameters, including which large language model to choose, how to chunk the grounding documents, and how many documents to retrieve. Configuration choices that work well for another use case might not be the
best choice for your data. To create the best possible RAG pattern for your dataset, you might explore all the possible combinations of RAG configuration options to find, evaluate, and deploy the best solution. This part of the process can
require a significant investment of time and resources. Just as you can use AutoAI to rapidly train and optimize machine learning models, you can use AutoAI capabilities to automate the search for the optimal RAG solution based on your data
and use case. Accelerating the experimentation can dramatically reduce the time to production.
Key features of the AutoAI approach include:
- Full exploration and evaluation of a constrained set of configuration options.
- Rapid reevaluation and modification of the configuration when something changes. For example, you can easily rerun the training process when a new model is available or when evaluation results signal a change in the quality of responses.
AutoAI automates the end-to-end flow from experimentation to deployment. The following diagram illustrates the AutoAI approach to finding an optimized RAG pattern for your data and use case in three layers:
- At the base level, parameterized RAG pipelines are used to populate a vector store (index) and to retrieve data from the vector store for the large language model to use when generating responses.
- Next, RAG evaluation metrics and benchmarking tools evaluate response quality.
- Finally, a hyper-parameter optimization algorithm searches for the best possible RAG configuration for your data.
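To make the idea of a parameterized pipeline concrete, here is a sketch of what a RAG configuration search space might contain. The dimensions and values are illustrative assumptions, not the exact options that AutoAI exposes:

```python
from math import prod

# Illustrative RAG configuration search space. Dimensions and values are
# assumptions for the sake of example, not the exact options AutoAI explores.
search_space = {
    "embedding_model": ["embed-model-a", "embed-model-b"],  # hypothetical names
    "foundation_model": ["llm-a", "llm-b"],                 # hypothetical names
    "chunk_size": [256, 512, 1024],  # tokens per chunk
    "chunk_overlap": [64, 128],      # tokens shared between adjacent chunks
    "retrieved_chunks": [3, 5],      # top-k chunks retrieved per query
}

# A full grid over even this small space is 2*2*3*2*2 = 48 configurations;
# each new dimension multiplies the count, which is why AutoAI uses
# hyper-parameter optimization instead of exhaustive search.
print(prod(len(v) for v in search_space.values()))  # 48
```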
AutoAI RAG optimization process
Running experiments with AutoAI RAG avoids testing all RAG configuration options (for example, a full grid search) by using a hyper-parameter optimization algorithm. The following diagram shows a subset of the RAG configuration search space with 16 RAG patterns to choose from. If the experiment evaluated them all, they would be ranked 1 to 16, with the three highest-ranking configurations tagged as best performing. The optimization algorithm determines which subset of the RAG patterns to evaluate and stops processing the others, which are shown in gray. This process avoids exploring an exponential search space while still selecting better-performing RAG patterns in practice.
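The search algorithm is the Tree-structured Parzen Estimator (TPE) from the hyperopt library (see the experiment details later in this section). A minimal standalone sketch of TPE-driven search over the kind of space shown above, with a stubbed scoring function standing in for real pattern evaluation:

```python
# Standalone sketch of TPE search with hyperopt. evaluate_pattern is a stub
# standing in for building and scoring a real RAG pattern.
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

space = {
    "chunk_size": hp.choice("chunk_size", [256, 512, 1024]),
    "chunk_overlap": hp.choice("chunk_overlap", [64, 128]),
    "retrieved_chunks": hp.choice("retrieved_chunks", [3, 5]),
}

def evaluate_pattern(config):
    # Stub: a real experiment would build the pattern and compute an
    # evaluation metric such as answer correctness. Here we fake a score.
    score = 1.0 / (1 + abs(config["chunk_size"] - 512))
    # hyperopt minimizes, so return the negated score as the loss.
    return {"loss": -score, "status": STATUS_OK}

trials = Trials()
best = fmin(fn=evaluate_pattern, space=space, algo=tpe.suggest,
            max_evals=10, trials=trials)
print(best)  # indices of the best-performing configuration
```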
Use the fast path for automating the search for a RAG pattern
AutoAI provides a no-code solution for automating the search for a RAG pattern. To use the fast path, start from a project and use the AutoAI interface to upload your grounding and test documents. Accept the default configuration or update the experiment settings. Run the experiment to create the RAG patterns best suited for your use case.
Use the AutoAI SDK for coding a RAG pattern
Use the sample notebooks to learn how to use the watsonx.ai Python client library (version 1.1.11 or later) to code an automated RAG solution for your use case.
The sample notebooks demonstrate two approaches:
- A fast path notebook that uses the watsonx.ai Python SDK documentation files as the grounding documents for a RAG pattern and stores the vectorized content in the default, in-memory Chroma database.
- A notebook that uses the watsonx.ai Python SDK documentation files as the grounding documents for a RAG pattern and stores the vectorized content in an external Milvus database.
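A condensed sketch of the flow that the sample notebooks follow. The class and parameter names are based on the ibm-watsonx-ai sample notebooks and may differ across SDK versions; treat them as assumptions and consult the notebooks for the authoritative code.

```python
# Condensed AutoAI RAG flow based on the sample notebooks. Names and
# parameters are assumptions that may differ by ibm-watsonx-ai version;
# the sample notebooks are the authoritative reference.
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.experiment import AutoAI

credentials = Credentials(url="https://us-south.ml.cloud.ibm.com",
                          api_key="<your-api-key>")
experiment = AutoAI(credentials, project_id="<your-project-id>")

rag_optimizer = experiment.rag_optimizer(
    name="AutoAI RAG experiment",
    description="Search for an optimized RAG pattern",
    max_number_of_rag_patterns=5,  # constrained pattern count
    optimization_metrics=["answer_correctness"],
)

# The data references point at the grounding documents and the JSON
# benchmark file; both connection objects are hypothetical placeholders
# for DataConnection objects built earlier in the notebooks.
rag_optimizer.run(
    input_data_references=[grounding_documents_connection],  # hypothetical
    test_data_references=[benchmark_connection],             # hypothetical
    background_mode=False,
)

rag_optimizer.summary()  # ranked leaderboard of evaluated patterns
best_pattern = rag_optimizer.get_best_pattern()
```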
Scaling a RAG experiment
AutoAI automates the search for an optimized RAG pattern based on your grounding documents. If your documentation knowledge base exceeds the data limits allowed for an experiment, you can still use AutoAI to find the best RAG pattern, and then use the auto-generated indexing notebook that is created when you save a pattern to index more documents. The RAG pattern then applies to the larger body of indexed documents.
Sampling
Benchmark-driven: first select the questions, then the documents that answer them, and fill with randomly selected documents up to the data limit.
Search algorithm
The Tree-structured Parzen Estimator (TPE) from the hyperopt library is used for hyper-parameter optimization.
Metrics
Answer correctness, faithfulness, and context correctness. For more information, see Unitxt lexical RAG metrics.
Optimization metric
The metric that is used as the optimization target. Answer correctness and faithfulness are supported.
Customizable user constraints
Embedding model, generative model, and configuration count limit (maximum number of output patterns: 4 to 20).
Deployment
- Milvus: AutoAI notebooks for indexing and inference by using the external Milvus vector database; deployable AI service asset.
- Chroma: a single AutoAI notebook for indexing and inferencing by using the in-memory Chroma vector database.
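The auto-generated notebooks contain the full indexing and inference code. As a rough illustration only, here is a minimal example of in-memory indexing and retrieval with the chromadb library; it is not the code that AutoAI generates.

```python
# Minimal in-memory Chroma example -- an illustration of the indexing and
# retrieval steps, not the notebook code that AutoAI generates.
import chromadb

client = chromadb.Client()  # in-memory by default
collection = client.create_collection(name="grounding_docs")

# Index a few document chunks. Chroma embeds them with its default
# embedding function unless one is supplied.
collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "AutoAI automates the search for an optimized RAG pattern.",
        "Resource Units measure the tokens consumed by an experiment.",
    ],
)

# Retrieve the chunk most relevant to a question.
results = collection.query(query_texts=["What do Resource Units measure?"],
                           n_results=1)
print(results["documents"][0][0])
```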