Extract text to convert high-quality business PDF documents into a simpler file format that can be used by AI models or to find and isolate key pieces of information from documents such as contracts.
Simplifying your business documents in this way is especially useful for retrieval-augmented generation tasks where you want to find information that is relevant to a user query and include it with the input to a foundation model. Including accurate contextual information in model input helps the foundation model to incorporate factual and up-to-date information in the model output. For more information, see Retrieval-augmented generation (RAG).
Text extraction is also powerful for use cases where you want to extract specific entities or categories of information from a document based on the document structure.
Ways to develop
You can extract text from documents by using these programming methods:
REST API
You can use the document text extraction method of the watsonx.ai REST API to convert highly structured PDF files, which use diagrams, images, and tables to convey information, into a file format that is easier to work with programmatically, such as markdown or JSON.
Text extraction is available with paid plans only. Billing is based on the number of pages that are processed. For more information, see Billing details for generative AI assets.
The text extraction API applies natural language understanding technology developed by IBM to identify document structures.
Text extraction is an asynchronous process that converts one file at a time. To extract text from a set of documents, you can submit multiple requests in parallel.
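Because each request converts a single file, a batch of documents can be handled by sending one request per file concurrently. The following Python sketch shows one possible pattern that uses a thread pool. The `submit_extraction` callable is a hypothetical placeholder for your own code that sends one text extraction request and returns the ID of the resulting job; it is not part of the watsonx.ai API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def extract_many(
    file_names: Iterable[str],
    submit_extraction: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Submit one extraction request per file in parallel.

    submit_extraction is a hypothetical callable that sends a single
    request for one file and returns the ID of the extraction job.
    """
    names = list(file_names)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so names and job IDs line up.
        job_ids = list(pool.map(submit_extraction, names))
    return dict(zip(names, job_ids))
```

The returned mapping of file name to job ID can then be used to check the status of each extraction separately.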
You can use the API to extract text from the following file types:
PDF
GIF
JPG
PNG
TIFF
You can store the extracted text in the following formats:
JSON
Markdown
Supported languages
The capability that extracts text from images is called optical character recognition (OCR). This capability is useful for preserving information that is depicted in images, diagrams, or in text that is embedded in files such as scanned PDFs.
Although optical character recognition can extract text from noisy images, the quality of the image files must meet the minimum requirement of 80 DPI (dots per inch).
If the document with images that you want to convert is in a language other than English, you must specify the language by its ISO 639 language code in the languages_list parameter of your request.
"languages_list":["fr"]
If the document has a mix of languages, list each language separately. Optical character recognition can convert images in a mixed-language document only when the languages share a common script. For example, you can extract text from images in a document with a mix of English and French text because both languages use the Latin script. However, you cannot use OCR to extract text from images in a document with a mix of Japanese and English text.
The optical character recognition function can extract text from images in documents that are written in the following languages:
| Language | ISO 639 language code | Script |
|---|---|---|
| Chinese (Simplified) | zh-CN | Chinese |
| Chinese (Traditional) | zh-TW | Chinese |
| Danish | da | Latin |
| Dutch | nl | Latin |
| English | en | Latin |
| English handwriting | en_hw | Latin |
| Finnish | fi | Latin |
| French | fr | Latin |
| German | de | Latin |
| Greek | el | Greek |
| Hebrew | he | Hebrew |
| Italian | it | Latin |
| Japanese | ja | Japanese |
| Korean | ko | Korean |
| Norwegian (Bokmål) | nb | Latin |
| Norwegian (Nynorsk) | nn | Latin |
| Polish | pl | Latin |
| Portuguese | pt | Latin |
| Spanish | es | Latin |
| Swedish | sv | Latin |
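The shared-script rule can be checked on the client side before a request is sent. The following Python sketch is one way to do that, using the language codes and scripts listed above. The helper name `build_languages_list` is hypothetical, and the sketch returns only the `languages_list` fragment; where that fragment belongs in the request body is not shown here.

```python
from typing import Dict, List

# Script for each supported OCR language code, from the list of supported languages.
OCR_SCRIPTS: Dict[str, str] = {
    "zh-CN": "Chinese", "zh-TW": "Chinese", "da": "Latin", "nl": "Latin",
    "en": "Latin", "en_hw": "Latin", "fi": "Latin", "fr": "Latin",
    "de": "Latin", "el": "Greek", "he": "Hebrew", "it": "Latin",
    "ja": "Japanese", "ko": "Korean", "nb": "Latin", "nn": "Latin",
    "pl": "Latin", "pt": "Latin", "es": "Latin", "sv": "Latin",
}


def build_languages_list(codes: List[str]) -> Dict[str, List[str]]:
    """Return a languages_list fragment after checking script compatibility."""
    unknown = [code for code in codes if code not in OCR_SCRIPTS]
    if unknown:
        raise ValueError(f"Unsupported language codes: {unknown}")
    scripts = {OCR_SCRIPTS[code] for code in codes}
    if len(scripts) > 1:
        # Mixed-language OCR works only when all languages share a script.
        raise ValueError(f"Languages must share one script, got: {sorted(scripts)}")
    return {"languages_list": list(codes)}
```

For example, `build_languages_list(["en", "fr"])` passes the check, whereas `build_languages_list(["ja", "en"])` raises a `ValueError` because Japanese and English use different scripts.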
Extracting text from tables
Convert tabular data within a document into consumable text that captures the table information. Many large language models have difficulty with interpreting tabular information correctly.
To enable table conversion, specify the following parameter in your request:
"steps":{"tables_processing":{"enabled":true}}
Choosing the output file format
By default, the extracted text is written in JSON syntax. If you want the extracted text to be written in markdown instead, specify the following parameter in the API request body:
"assembly_md":{}
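The table-processing and output-format settings can be assembled programmatically. The following Python sketch builds only these optional parts of a request body. The function name `build_options` is hypothetical, and placing `steps` at the same level as `assembly_md` is an assumption; check the API reference for the full request schema.

```python
import json
from typing import Any, Dict


def build_options(tables: bool = False, markdown: bool = False) -> Dict[str, Any]:
    """Build the optional parts of a text extraction request body."""
    options: Dict[str, Any] = {}
    if tables:
        # Turns on conversion of tabular data into text.
        options["steps"] = {"tables_processing": {"enabled": True}}
    if markdown:
        # An empty object requests markdown output instead of the JSON default.
        options["assembly_md"] = {}
    return options


# Serialize the fragment to see the JSON that would be merged into the body.
print(json.dumps(build_options(tables=True, markdown=True), indent=2))
```

Leaving both flags at their defaults returns an empty dictionary, which corresponds to JSON output without table conversion.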
Managing your documents
Add the documents that you want to process to IBM Cloud Object Storage so that you can reference them from the API.
Only connection assets that use the Access key and Secret key pair for credentials are supported. For more information about how to set up the connection, see Referencing files from the API.
For example, you reference the file that you add to IBM Cloud Object Storage as follows:
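As a rough illustration, the following Python sketch builds reference objects for a source file and a results file in one bucket. Apart from `results_reference.location.file_name`, which this page mentions, every field name and value here (`document_reference`, `type`, `connection_asset`, `connection.id`, `location.bucket`) is an assumption about the request schema and must be verified against the API reference. The function name `build_references` is hypothetical.

```python
from typing import Any, Dict


def build_references(
    connection_id: str, bucket: str, source_file: str, result_file: str
) -> Dict[str, Any]:
    """Build source and result references for files in Cloud Object Storage.

    Apart from results_reference.location.file_name, the field names
    used here are assumptions, not a confirmed schema.
    """

    def reference(file_name: str) -> Dict[str, Any]:
        return {
            "type": "connection_asset",           # assumed field and value
            "connection": {"id": connection_id},  # assumed field
            "location": {"bucket": bucket, "file_name": file_name},
        }

    return {
        "document_reference": reference(source_file),  # assumed key name
        "results_reference": reference(result_file),
    }
```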
The following diagram shows the workflow you use to extract structural information about a business document with the document text extraction API.
Procedure
Follow these high-level steps to extract text from a business document by using the REST API:
Add the file from which you want to extract text to an IBM Cloud Object Storage bucket, and then define a connection from your watsonx.ai project to the IBM Cloud Object Storage service instance.
From the response, copy the metadata.id, such as 64162e0a-b05d-4ba6-a688-422893f58663. Specify this ID in the endpoint that you use to check the status of the extraction process.
The extracted text is written to a markdown file with the name that you specified in the results_reference.location.file_name field.
The markdown captures structures in the document, such as sections and tables. For example, the following image shows how a table from the original PDF file is represented in markdown after the text is extracted. A preview of the markdown table is included to show that the text from the original table in the PDF remains intact after extraction.
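Because extraction is asynchronous, the status check in the procedure is typically repeated until the job finishes. The following Python sketch shows one way to structure that loop. The `fetch_status` callable is a hypothetical placeholder for your own code that calls the status endpoint with the `metadata.id` value and returns the parsed JSON response. The response shape (`entity.results.status`) and the status values `completed` and `failed` are assumptions that must be checked against the API reference.

```python
import time
from typing import Any, Callable, Dict


def wait_for_extraction(
    extraction_id: str,
    fetch_status: Callable[[str], Dict[str, Any]],
    interval_seconds: float = 5.0,
    max_attempts: int = 60,
) -> Dict[str, Any]:
    """Poll until the extraction job reaches a terminal state.

    fetch_status is a hypothetical callable that queries the status
    endpoint for the given ID. The response shape and the status
    values checked below are assumptions.
    """
    for _ in range(max_attempts):
        response = fetch_status(extraction_id)
        status = response.get("entity", {}).get("results", {}).get("status")
        if status in ("completed", "failed"):
            return response
        # Not finished yet; wait before asking again.
        time.sleep(interval_seconds)
    raise TimeoutError(f"Extraction {extraction_id} did not finish in time")
```

Passing the status call in as a function keeps the loop independent of any particular HTTP client or authentication setup.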
Example JSON output
When text is extracted to a JSON file, the resulting file contains details about different data structures in the document, such as sections, paragraphs, table structures, tokens, and more.
You can convert the generated markdown file into a text file by changing the file extension from .md to .txt. The resulting text file includes the markdown tags. If you want to remove the tagging, you can use a parser library to find and convert the tags.
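For simple documents, a few regular expressions can stand in for a parser library. The following Python sketch is a deliberate simplification, not a full markdown parser: it removes heading markers, replaces links with their link text, and drops emphasis and inline-code marks. The function name `strip_markdown` is hypothetical, and tables and nested structures are left as they are.

```python
import re


def strip_markdown(md: str) -> str:
    """Remove common markdown tagging with regular expressions.

    A rough simplification: only headings, links, emphasis, and
    inline code marks are handled.
    """
    # Drop leading "#" heading markers at the start of a line.
    text = re.sub(r"^#{1,6}\s+", "", md, flags=re.MULTILINE)
    # Replace [label](target) links with the label only.
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)
    # Remove bold, italic, and inline-code markers.
    text = re.sub(r"(\*\*|__|\*|`)", "", text)
    return text
```

Note that this sketch also removes the asterisk from `*` bullet markers, so a real parser library is the safer choice for documents with lists or tables.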
You can use a JSON processor library to extract text from the generated JSON file and store it as plain text. For example, the following command extracts the text from each token for all structures in the document and stores the text in a file named parsed_output_text.txt:
Note: This command uses jq, which is a command-line JSON processor that needs to be installed separately.
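If jq is not available, the same idea can be sketched in Python with the standard `json` module. The following example walks the parsed JSON recursively and collects the text of every token. It assumes that tokens are objects with a `text` field stored in lists under a `tokens` key; the actual output schema might differ, so inspect your generated file first. The function names are hypothetical.

```python
import json
from typing import Any, Iterator


def iter_token_text(node: Any) -> Iterator[str]:
    """Yield the text of every token found anywhere in the parsed JSON.

    Assumes tokens are objects with a "text" field stored in lists
    under a "tokens" key; the real output schema may differ.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "tokens" and isinstance(value, list):
                for token in value:
                    if isinstance(token, dict) and isinstance(token.get("text"), str):
                        yield token["text"]
            else:
                yield from iter_token_text(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_token_text(item)


def json_to_text(json_path: str, text_path: str) -> None:
    """Write the space-joined token text from a JSON file to a text file."""
    with open(json_path, encoding="utf-8") as src:
        document = json.load(src)
    with open(text_path, "w", encoding="utf-8") as dst:
        dst.write(" ".join(iter_token_text(document)))
```

For example, `json_to_text("output.json", "parsed_output_text.txt")` would write the collected text to parsed_output_text.txt, where `output.json` stands for the name of your generated file.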
After you convert the generated file to a TXT file, you can use the extracted text as contextual information for a foundation model prompt in the following ways:
Reference the extracted text from a Python notebook.
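In a notebook, referencing the extracted text can be as simple as reading the file and placing its content ahead of the question in the prompt. The following Python sketch shows one possible approach. The prompt wording, the character limit, and the function names are all illustrative assumptions; the call that sends the prompt to a foundation model is not shown.

```python
def load_context(path: str) -> str:
    """Read the extracted text that was saved as a plain text file."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def build_prompt(context: str, question: str, max_context_chars: int = 4000) -> str:
    """Combine extracted text and a user question into one prompt string."""
    # Naive character cut that keeps the prompt size bounded.
    trimmed = context[:max_context_chars]
    return (
        "Answer the question by using only the following document text.\n\n"
        f"Document:\n{trimmed}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

For example, `build_prompt(load_context("parsed_output_text.txt"), "What is the contract end date?")` returns a prompt string that can be passed to the model of your choice. For long documents, selecting only the relevant passages is usually preferable to a character cut.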