Adding extracted text to a RAG solution

Last updated: May 03, 2025

The structure and format of the extracted output depend on the file type that you configure in your text extraction request. You might need to post-process the result before you can use its contents as grounding data in your RAG solution.

You can convert the generated Markdown file into a text file by changing the file extension from .md to .txt. The resulting text file still contains the Markdown syntax, such as heading markers and table delimiters. If you want to remove the syntax, you can use a Markdown parser library to convert the content to plain text.
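As a rough illustration of removing Markdown syntax, the following Python sketch strips the most common markers with regular expressions from the standard library. It is a simplified approximation, not a full Markdown parser: the function name strip_markdown and the sample text are hypothetical, and constructs such as nested emphasis, underscore emphasis, or HTML embedded in Markdown are not handled.

```python
import re

def strip_markdown(md: str) -> str:
    """Approximate removal of common Markdown syntax (regex-based, not a real parser)."""
    text = md
    # Drop fenced-code delimiter lines but keep the code lines themselves.
    text = re.sub(r"^```.*$", "", text, flags=re.MULTILINE)
    # Images ![alt](url) become alt; links [label](url) become label.
    text = re.sub(r"!\[([^\]]*)\]\([^)]*\)", r"\1", text)
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Remove heading markers, blockquote markers, and list bullets at line start.
    text = re.sub(r"^[ ]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    text = re.sub(r"^[ ]{0,3}>[ \t]?", "", text, flags=re.MULTILINE)
    text = re.sub(r"^[ \t]*[-*+][ \t]+", "", text, flags=re.MULTILINE)
    # Remove table separator rows such as | --- | :---: |.
    text = re.sub(r"^[ \t]*\|?[ \t:|-]*-{3,}[ \t:|-]*\|?[ \t]*$", "", text, flags=re.MULTILINE)
    # Replace table cell delimiters with spaces.
    text = text.replace("|", " ")
    # Unwrap bold, italic (asterisk form only), and inline code.
    text = re.sub(r"(\*\*|\*|`)(.+?)\1", r"\2", text)
    # Collapse runs of spaces and of blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return "\n".join(line.strip() for line in text.splitlines()).strip()

# Hypothetical sample input for illustration only.
sample_md = (
    "# Title\n\n"
    "Some **bold** text with a [link](https://example.com).\n\n"
    "| Item | Price |\n"
    "| --- | --- |\n"
    "| Pen | 2 |\n"
)
plain_text = strip_markdown(sample_md)
print(plain_text)
```

For production use, a dedicated Markdown parser is likely to be more reliable than regular expressions, because it handles edge cases that this sketch ignores.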

You can use a JSON processor library to extract text from the generated JSON file and store it as plain text. For example, the following command extracts the text from each token for all structures in the document and stores the text in a file named parsed_output_text.txt:

jq -r '[.all_structures.tokens[].text] | join(" ")' output_retail.json > parsed_output_text.txt
Note: This command uses jq, which is a command-line JSON processor that needs to be installed separately.
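If jq is not available, a similar result can be sketched in Python with the standard json module. The example assumes the same structure that the jq command relies on, namely a top-level all_structures object with a tokens array whose entries have a text field. The function names and the sample data are hypothetical. Unlike the jq command, this sketch skips tokens that have no text.

```python
import json
from pathlib import Path

def tokens_to_text(document: dict) -> str:
    """Join the text of every token under all_structures.tokens with spaces."""
    tokens = document.get("all_structures", {}).get("tokens", [])
    # Skip tokens whose text is missing or empty.
    return " ".join(token["text"] for token in tokens if token.get("text"))

def convert_json_to_txt(json_path: str, txt_path: str) -> None:
    """Read an extraction result in JSON format and write its token text to a file."""
    document = json.loads(Path(json_path).read_text(encoding="utf-8"))
    Path(txt_path).write_text(tokens_to_text(document), encoding="utf-8")

# Hypothetical in-memory sample that mimics the assumed structure.
sample_document = {
    "all_structures": {
        "tokens": [{"text": "Retail"}, {"text": "sales"}, {"text": "rose"}]
    }
}
print(tokens_to_text(sample_document))
```

To process a real result file, you could call convert_json_to_txt("output_retail.json", "parsed_output_text.txt").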

After you convert the generated file to a TXT file, you can use the extracted text as contextual information in a foundation model prompt.
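One simple way to use the extracted text as context can be sketched as follows. The prompt wording, the function name build_grounded_prompt, and the character limit are illustrative assumptions, not values that the text extraction service or any particular foundation model requires.

```python
def build_grounded_prompt(context: str, question: str, max_context_chars: int = 4000) -> str:
    """Combine extracted text and a user question into a single prompt string."""
    # Truncate the context so that the prompt stays within an assumed size budget.
    trimmed_context = context.strip()[:max_context_chars]
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{trimmed_context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical extracted text and question for illustration only.
extracted_text = "Retail sales rose in the fourth quarter."
prompt = build_grounded_prompt(extracted_text, "When did retail sales rise?")
print(prompt)
```

In practice, the context is usually split into chunks and only the most relevant chunks are added to the prompt. The truncation here is a stand-in for that retrieval step.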

Markdown output

The extracted text is written to a Markdown file with the name that you specified in the results_reference.location.file_name field.

The Markdown content captures structures in the document, such as sections and tables. For example, the following image shows how a table from the original PDF file is represented in Markdown after the text is extracted. A preview of the rendered Markdown table is included to show that the text from the original table in the PDF remains intact after extraction.

[Image: Three screenshots. The first shows a table in a PDF document, the second shows the table text extracted as Markdown, and the third shows a preview of the rendered table.]
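Because tables survive extraction as Markdown tables, you can turn them into structured rows before you add them to your grounding data. The following Python sketch parses a simple pipe-delimited table into a list of dictionaries. The function name and the sample table are hypothetical and are not taken from the PDF in the image. The sketch assumes a header row followed by a separator row, and it does not handle escaped pipe characters or empty edge cells.

```python
def parse_markdown_table(md_table: str) -> list[dict[str, str]]:
    """Parse a simple pipe-delimited Markdown table into one dictionary per data row."""
    lines = [line.strip() for line in md_table.strip().splitlines() if line.strip()]

    def split_row(line: str) -> list[str]:
        # Remove the outer pipes, then split the cells on the inner pipes.
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = split_row(lines[0])
    rows = []
    # lines[1] is the separator row (for example | --- | --- |), so skip it.
    for line in lines[2:]:
        rows.append(dict(zip(header, split_row(line))))
    return rows

# Hypothetical sample table for illustration only.
sample_table = """
| Item | Price |
| --- | --- |
| Pen | 2 |
| Book | 10 |
"""
table_rows = parse_markdown_table(sample_table)
print(table_rows)
```

Keeping each row together with its column names can help a retriever return a complete table row instead of isolated cell values.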

JSON output

When text is extracted to a JSON file, the resulting file contains details about the different data structures in the document, such as sections, paragraphs, table structures, and tokens.

For more information about how to work with text that is extracted in JSON format, see Parsing JSON structures generated by text extraction.

What to do next

You can now use the refined extracted text files as input for your AutoAI RAG experiment to automate a RAG pattern. For details, see Coding an AutoAI RAG experiment with text extraction.
