Adding extracted text to a RAG solution
The structure and format of the extracted output differ depending on the output file type that you configure in your text extraction request. You might need to post-process the result before you can use the contents as grounding data in your RAG solution.
You can convert the generated Markdown file into a text file by changing the file extension from .md to .txt. The resulting text file includes the Markdown tags. If you want to remove the tagging, you can use a parser library to find and convert the tags.
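As a rough illustration of removing the tagging, the following Python sketch strips a few common Markdown constructs with regular expressions from the standard library. It is not a full Markdown parser: the function name strip_markdown and the set of patterns are assumptions made for this example, and a dedicated parser library handles nested or unusual markup more reliably.

```python
import re


def strip_markdown(md_text: str) -> str:
    """Remove common Markdown markup and keep the readable text (simplified)."""
    text = md_text
    # Images and links: keep the alt text or link text, drop the target.
    text = re.sub(r"!\[([^\]]*)\]\([^)]*\)", r"\1", text)
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)
    # Heading markers at the start of a line.
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    # Table separator rows such as |------|:-----:| (also drops --- rules).
    text = re.sub(r"^[ \t]*\|?[ \t:|-]*-{3,}[ \t:|-]*$\n?", "", text,
                  flags=re.MULTILINE)
    # Paired bold, italic, and inline-code markers on a single line.
    text = re.sub(r"(\*\*|\*|`)(.+?)\1", r"\2", text)
    # Table cell delimiters become single spaces.
    text = re.sub(r"[ \t]*\|[ \t]*", " ", text)
    # Trim the whitespace left over at the edges of each line.
    return "\n".join(line.strip() for line in text.splitlines())


sample_markdown = "\n".join([
    "# Sales report",
    "",
    "Revenue grew in **Q2**. See [details](https://example.com).",
    "",
    "| Item | Price |",
    "|------|-------|",
    "| Pen  | 2     |",
])
print(strip_markdown(sample_markdown))
```

Because the patterns are applied line by line, plain text that happens to contain paired asterisks or pipe characters is also altered, so review the output before you use it as grounding data.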
You can use a JSON processor library to extract text from the generated JSON file and store it as plain text. For example, the following command extracts the text from each token for all structures in the document and stores the text in a file named parsed_output_text.txt:
jq -r '[.all_structures.tokens[].text] | join(" ")' output_retail.json > parsed_output_text.txt
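If you prefer to do the same conversion in Python, a minimal sketch follows. It assumes the JSON file has the layout implied by the jq filter, that is, an all_structures object with a tokens array whose entries have a text field. The function names tokens_to_text and convert_file are hypothetical helpers for this example.

```python
import json
from pathlib import Path


def tokens_to_text(document: dict) -> str:
    """Join the text of every token under all_structures.tokens with spaces."""
    tokens = document.get("all_structures", {}).get("tokens", [])
    # Skip tokens without a text field instead of failing on them.
    return " ".join(token["text"] for token in tokens if "text" in token)


def convert_file(json_path: str, txt_path: str) -> None:
    """Read the extraction JSON file and write the joined token text as plain text."""
    document = json.loads(Path(json_path).read_text(encoding="utf-8"))
    Path(txt_path).write_text(tokens_to_text(document), encoding="utf-8")


# Small in-memory example that mirrors the assumed layout.
example = {"all_structures": {"tokens": [{"text": "Total"}, {"text": "sales"}]}}
print(tokens_to_text(example))
```

To process a real file, you might call convert_file("output_retail.json", "parsed_output_text.txt").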
After you convert the generated file to a TXT file, you can use the extracted text as contextual information for a foundation model prompt in the following ways:
- Reference the extracted text from a Python notebook. For example, you can use your TXT file instead of the state_of_the_union.txt file in the Use watsonx, Chroma, and LangChain to answer questions (RAG) sample notebook.
- Use the TXT file as a grounding document in Prompt Lab. For more information, see Grounding foundation model prompts in contextual information.
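To illustrate the notebook option, the following sketch reads a TXT file and places its contents in a prompt as contextual information. The prompt wording, the max_chars limit, and the function names load_context and build_grounded_prompt are assumptions for this example. The call that sends the prompt to a foundation model is intentionally left out because it depends on the client library that you use.

```python
from pathlib import Path


def load_context(txt_path: str) -> str:
    """Read the extracted text file that serves as grounding data."""
    return Path(txt_path).read_text(encoding="utf-8")


def build_grounded_prompt(context: str, question: str, max_chars: int = 4000) -> str:
    """Combine extracted text and a question into one prompt string."""
    # Truncate long documents so that the prompt stays within a chosen size
    # (an arbitrary limit for this sketch, not a model requirement).
    context = context.strip()[:max_chars]
    return (
        "Answer the question by using only the document text that follows.\n\n"
        f"Document:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_grounded_prompt("Item Price\nPen 2", "How much is a pen?"))
```

For longer documents, simple truncation loses content, which is why RAG patterns usually split the text into chunks and retrieve only the relevant ones.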
Markdown output
The extracted text is written to a Markdown file with the name that you specified in the results_reference.location.file_name field.
The Markdown content captures structures in the document, such as sections and tables. For example, the following image shows how a table from the original PDF file is represented in Markdown after the text is extracted. A preview of the Markdown table is included to show that the text from the original table in the PDF remains intact after extraction.
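Because tables survive extraction as pipe-delimited Markdown rows, you can read them back into structured records when needed. The following sketch assumes a simple table with one header row followed by one separator row and no escaped pipe characters inside cells. The function name parse_markdown_table and the sample table are hypothetical.

```python
from typing import Dict, List


def parse_markdown_table(lines: List[str]) -> List[Dict[str, str]]:
    """Turn pipe-delimited Markdown table lines into one dict per data row."""

    def split_row(line: str) -> List[str]:
        # Drop the outer pipes, then split the remaining text into cells.
        return [cell.strip() for cell in line.strip().strip("|").split("|")]

    rows = [split_row(line) for line in lines if line.strip().startswith("|")]
    if len(rows) < 2:
        return []
    # rows[0] is the header and rows[1] is the |---|---| separator row.
    header, body = rows[0], rows[2:]
    return [dict(zip(header, row)) for row in body]


table_lines = [
    "| Item | Price |",
    "|------|-------|",
    "| Pen  | 2     |",
    "| Ink  | 5     |",
]
print(parse_markdown_table(table_lines))
```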
JSON output
When text is extracted to a JSON file, the resulting file contains details about the data structures in the document, such as sections, paragraphs, table structures, tokens, and more.
For more information about how to work with text that is extracted in JSON format, see Parsing JSON structures generated by text extraction.
What to do next
You can now use the refined extracted text files as input for your AutoAI RAG experiment to automate a RAG pattern. For details, see Coding an AutoAI RAG experiment with text extraction.