Convert high-quality business PDF documents into a simpler file format that can be used by AI models.
Use the document text extraction method of the watsonx.ai REST API to convert highly structured PDF files, which use diagrams, images, and tables to convey information, into a file format that is easier to work with programmatically, such as markdown or JSON.
Text extraction is available with paid plans only. Billing is based on the number of pages that are processed. For more information, see Billing details for generative AI assets.
Simplifying your business documents in this way is especially useful for retrieval-augmented generation tasks where you want to find information that is relevant to a user query and include it with the input to a foundation model. Including accurate contextual information in model input helps the foundation model to incorporate factual and up-to-date information in the model output. For more information, see Retrieval-augmented generation (RAG).
The text extraction API applies natural language understanding technology developed by IBM to identify document structures.
Text extraction is an asynchronous process that converts one file at a time. To extract text from a set of documents, submit a separate request for each file; the requests can run in parallel.
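Because each request handles a single file, a set of documents can be submitted from a small thread pool. The following is a minimal sketch, not an official client: `start_all` and the `start_extraction` callable are hypothetical names, and the callable is assumed to submit one request and return the extraction ID for that file.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def start_all(
    file_names: Iterable[str],
    start_extraction: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Start one extraction per file in parallel.

    start_extraction is a hypothetical callable supplied by the caller.
    It is assumed to submit a single request for one file and to return
    the extraction ID for that request.
    """
    names = list(file_names)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so each ID lines up with its file name.
        ids = list(pool.map(start_extraction, names))
    return dict(zip(names, ids))
```

The pool size of 4 is an arbitrary default; choose a value that suits your plan and workload.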
Supported file types
You can use the API to extract text from the following file types:
You can store the extracted text in the following formats:
- JSON
- Markdown
Managing your documents
You add the documents that you want to process into IBM Cloud Object Storage so you can reference them from the API.
Only connection assets that use the Access key and Secret key pair for credentials are supported. For more information about how to set up the connection, see Referencing files from the API.
For example, you reference the file that you add to IBM Cloud Object Storage as follows:
"document_reference": {
  "type": "connection_asset",
  "connection": {
    "id": "6f5688fd-f3bf-42c2-a18b-49c0d8a1920d"
  },
  "location": {
    "file_name": "document.pdf",
    "bucket": "janessandbox"
  }
}
You define the location where you want to store the generated output file as follows:
"results_reference": {
  "type": "connection_asset",
  "connection": {
    "id": "6f5688fd-f3bf-42c2-a18b-49c0d8a1920d"
  },
  "location": {
    "file_name": "extracted_document.json"
  }
}
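The source and results objects share the same shape, so they can be generated by one helper. This is an illustrative sketch; `connection_reference` is a hypothetical function name, and the IDs and file names are the sample values from the fragments above.

```python
import json
from typing import Any, Dict, Optional


def connection_reference(
    connection_id: str, file_name: str, bucket: Optional[str] = None
) -> Dict[str, Any]:
    """Build a connection_asset object that points at one file."""
    location: Dict[str, str] = {"file_name": file_name}
    if bucket is not None:
        # The bucket is included only when the caller provides one.
        location["bucket"] = bucket
    return {
        "type": "connection_asset",
        "connection": {"id": connection_id},
        "location": location,
    }


source = connection_reference(
    "6f5688fd-f3bf-42c2-a18b-49c0d8a1920d", "document.pdf", "janessandbox"
)
target = connection_reference(
    "6f5688fd-f3bf-42c2-a18b-49c0d8a1920d", "extracted_document.json"
)
print(json.dumps(source, indent=2))
print(json.dumps(target, indent=2))
```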
Supported languages
The capability that extracts text from images is called optical character recognition (OCR). This capability is useful for preserving information that is depicted in images, diagrams, or in text that is embedded in files such as scanned PDFs.
Although optical character recognition can extract text from noisy images, the quality of the image files must meet the minimum requirement of 80 DPI (dots per inch).
If the document with images that you want to convert is in a language other than English, you must specify the language by its ISO 639 language code in the languages_list parameter of your request.
"languages_list": [
  "fr"
]
If the document has a mix of languages, list each language separately. Optical character recognition can convert images in a mixed-language document only when the languages share a common script. For example, you can extract text from images in a document with a mix of English and French text because both languages are Latin based. However, you cannot use OCR to extract text from images in a document with a mix of Japanese and English text.
The optical character recognition function can extract text from images in documents that are written in the following languages:
Language | ISO 639 language code | Script |
---|---|---|
Chinese (Simplified) | zh-CN | Chinese |
Chinese (Traditional) | zh-TW | Chinese |
Danish | da | Latin |
Dutch | nl | Latin |
English | en | Latin |
English handwriting | en_hw | Latin |
Finnish | fi | Latin |
French | fr | Latin |
German | de | Latin |
Greek | el | Greek |
Hebrew | he | Hebrew |
Italian | it | Latin |
Japanese | ja | Japanese |
Korean | ko | Korean |
Norwegian (Bokmål) | nb | Latin |
Norwegian (Nynorsk) | nn | Latin |
Polish | pl | Latin |
Portuguese | pt | Latin |
Spanish | es | Latin |
Swedish | sv | Latin |
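Before you submit a mixed-language document, you might check locally that the languages you plan to list share a script. The sketch below copies the codes and scripts from the table above; `common_script` is a hypothetical helper, not part of the API, and the service performs its own validation regardless.

```python
from typing import Dict, List

# Script for each supported language code, copied from the table above.
SCRIPT_BY_CODE: Dict[str, str] = {
    "zh-CN": "Chinese", "zh-TW": "Chinese", "da": "Latin", "nl": "Latin",
    "en": "Latin", "en_hw": "Latin", "fi": "Latin", "fr": "Latin",
    "de": "Latin", "el": "Greek", "he": "Hebrew", "it": "Latin",
    "ja": "Japanese", "ko": "Korean", "nb": "Latin", "nn": "Latin",
    "pl": "Latin", "pt": "Latin", "es": "Latin", "sv": "Latin",
}


def common_script(codes: List[str]) -> str:
    """Return the script shared by all codes, or raise ValueError."""
    if not codes:
        raise ValueError("At least one language code is required")
    unknown = [code for code in codes if code not in SCRIPT_BY_CODE]
    if unknown:
        raise ValueError(f"Unsupported language codes: {unknown}")
    scripts = {SCRIPT_BY_CODE[code] for code in codes}
    if len(scripts) > 1:
        raise ValueError(f"Languages use different scripts: {sorted(scripts)}")
    return scripts.pop()
```

For example, `common_script(["en", "fr"])` returns `"Latin"`, while `common_script(["ja", "en"])` raises `ValueError`.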
Extracting text from tables
Convert tabular data within a document into consumable text that captures the table information. Many large language models have difficulty with interpreting tabular information correctly.
To enable table conversion, specify the following parameter in your request.
"steps": {
  "tables_processing": {
    "enabled": true
  }
}
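If you assemble requests in code, the OCR languages and the table setting can be combined into one `steps` object. This is a minimal sketch under one assumption: it nests `languages_list` under an `ocr` key, as the request example later in this topic does. `build_steps` is a hypothetical helper name.

```python
import json
from typing import Any, Dict, List, Optional


def build_steps(
    languages: Optional[List[str]] = None, tables: bool = False
) -> Dict[str, Any]:
    """Build the steps object for a text extraction request body."""
    steps: Dict[str, Any] = {}
    if languages:
        # ISO 639 codes for OCR, for example ["fr"].
        steps["ocr"] = {"languages_list": list(languages)}
    if tables:
        steps["tables_processing"] = {"enabled": True}
    return steps


print(json.dumps({"steps": build_steps(["fr"], tables=True)}, indent=2))
```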
Choosing the output file format
By default, the extracted text is written in JSON syntax. If you want the extracted text to be written in markdown instead, specify the following parameter in the API request body:
"assembly_md": {}
API details
The following diagram shows the workflow you use to extract structural information about a business document with the document text extraction API.
Follow these high-level steps to extract text from a business document by using the REST API:
- Add the file from which you want to extract text to an IBM Cloud Object Storage bucket, and then define a connection from your watsonx.ai project to the IBM Cloud Object Storage service instance.
  For more information, see Referencing files from the API.
- Use the Start a text extraction request method to start the text extraction process.
  Note the ID that is returned in the metadata.id field. You use this ID as the extraction ID to check the status of your request in the next step.
- Use the Get the results of the request method to check the status of your request.
  Checking the status is the only way to find out whether the process failed for any reason.
  When the status is Completed, the extracted text file is available in the specified IBM Cloud Object Storage bucket.
- Download the generated file from Cloud Object Storage.
For API method details, see the API reference documentation.
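Because the process is asynchronous, the status check in the steps above is typically repeated until the job finishes. The following polling loop is a sketch only. `wait_for_extraction` and `fetch_status` are hypothetical names; `fetch_status` is assumed to call the status method and return the status string. The `failed` value is also an assumption, so confirm the actual status values in the API reference.

```python
import time
from typing import Callable


def wait_for_extraction(
    fetch_status: Callable[[], str],
    interval_seconds: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the extraction completes, fails, or attempts run out."""
    for attempt in range(max_attempts):
        # Compare case-insensitively because the exact casing is not confirmed here.
        status = fetch_status().lower()
        if status == "completed":
            return status
        if status == "failed":  # assumed failure value
            raise RuntimeError("Text extraction failed")
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError("Extraction did not finish within the allotted attempts")
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.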
Request example
For example, the following command submits a request to extract text from the retail_guidebook.pdf file and save it in markdown format.
curl -X POST \
'https://{region}.ml.cloud.ibm.com/ml/v1/text/extractions?version=2024-10-18' \
--header 'Accept: application/json' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJraWQiOi...'
The request body looks as follows:
{
  "project_id": "e40e5895-ce4d-42a3-b699-8ac764b89a09",
  "document_reference": {
    "type": "connection_asset",
    "connection": {
      "id": "5c0cefce-da57-408b-b47d-58f7785de3ee"
    },
    "location": {
      "bucket": "my-cloud-object-storage-bucket",
      "file_name": "retail_guidebook.pdf"
    }
  },
  "results_reference": {
    "type": "connection_asset",
    "connection": {
      "id": "5c0cefce-da57-408b-b47d-58f7785de3ee"
    },
    "location": {
      "bucket": "my-cloud-object-storage-bucket",
      "file_name": "output_retail.md"
    }
  },
  "steps": {
    "ocr": {
      "languages_list": [
        "en"
      ]
    },
    "tables_processing": {
      "enabled": true
    }
  },
  "assembly_md": {}
}
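The same POST request can be prepared from Python with only the standard library. This sketch builds the request object without sending it. `build_start_request` is a hypothetical helper, and the endpoint, headers, and default version date are copied from the curl example above.

```python
import json
import urllib.request
from typing import Any, Dict


def build_start_request(
    region: str, token: str, body: Dict[str, Any], version: str = "2024-10-18"
) -> urllib.request.Request:
    """Prepare the POST request that starts a text extraction."""
    url = f"https://{region}.ml.cloud.ibm.com/ml/v1/text/extractions?version={version}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),  # request body as JSON bytes
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# To send the request and read the extraction ID (requires valid credentials):
# with urllib.request.urlopen(request) as response:
#     extraction_id = json.load(response)["metadata"]["id"]
```

Pass your own region and bearer token; `body` is a dictionary with the same structure as the JSON request body shown above.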
From the response, copy the metadata.id value, such as 64162e0a-b05d-4ba6-a688-422893f58663. Specify this ID in the endpoint that you use to check the status of the extraction process.
curl -X GET \
'https://{region}.ml.cloud.ibm.com/ml/v1/text/extractions/64162e0a-b05d-4ba6-a688-422893f58663?project_id=e40e5895-ce4d-42a3-b699-8ac764b89a09&version=2024-09-23' \
--header 'Accept: application/json' \
--header 'Authorization: Bearer eyJraWQiOi...'
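When the status URL is built in code, encoding the query parameters avoids quoting mistakes. The sketch below reproduces the URL from the curl command above; `status_url` is a hypothetical helper name, and the default version date is the one used in that command.

```python
from urllib.parse import quote, urlencode


def status_url(
    region: str, extraction_id: str, project_id: str, version: str = "2024-09-23"
) -> str:
    """Build the URL that returns the status of one extraction request."""
    query = urlencode({"project_id": project_id, "version": version})
    # quote() guards against unexpected characters in the ID path segment.
    path_id = quote(extraction_id, safe="")
    return f"https://{region}.ml.cloud.ibm.com/ml/v1/text/extractions/{path_id}?{query}"
```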
Output details
The extracted text is written to a markdown file with the name that you specified in the results_reference.location.file_name field.
The markdown captures structures in the document, such as sections and tables. For example, the following image shows how a table from the original PDF file is represented in markdown after the text is extracted. A preview of the markdown table is included to show that the text from the original table in the PDF remains intact after extraction.
Example JSON output
When text is extracted to a JSON file, the resulting file contains details about the data structures in the document, such as sections, paragraphs, table structures, and tokens.
For more information about how to work with text that is extracted in JSON format, see Parsing JSON structures generated by text extraction.
Using the text you extract from the PDF file
You can convert the generated markdown file into a text file by changing the file extension from .md to .txt. The resulting text file includes the markdown tags. If you want to remove the tagging, you can use a parser library to find and convert the tags.
You can use a JSON processor library to extract text from the generated JSON file and store it as plain text. For example, the following command extracts the text from each token for all structures in the document and stores the text in a file named parsed_output_text.txt:
cat output_retail.json | jq '[.all_structures.tokens[].text] | join(" ")' > parsed_output_text.txt
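A rough Python equivalent of the jq command is sketched below. It assumes the same structure that the jq filter reads, an `all_structures.tokens` array whose items have a `text` field. `tokens_to_text` is a hypothetical helper name.

```python
import json
from typing import Any, Dict


def tokens_to_text(document: Dict[str, Any]) -> str:
    """Join the text of every token with single spaces."""
    tokens = document["all_structures"]["tokens"]
    return " ".join(token["text"] for token in tokens)


# To convert the generated file (file names taken from the jq example):
# with open("output_retail.json", encoding="utf-8") as source:
#     text = tokens_to_text(json.load(source))
# with open("parsed_output_text.txt", "w", encoding="utf-8") as target:
#     target.write(text)
```

Unlike the jq command, which prints the result as a quoted JSON string unless you add the `-r` option, this sketch produces the text without surrounding quotation marks.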
After you convert the generated file to a TXT file, you can use the extracted text as contextual information for a foundation model prompt in the following ways:
- Reference the extracted text from a Python notebook.
  For example, you can use your TXT file instead of the state_of_the_union.txt file in the Use watsonx, Chroma, and LangChain to answer questions (RAG) sample notebook.
- Use the TXT file as a grounding document in Prompt Lab. For more information, see Grounding foundation model prompts in contextual information.
Learn more
- Extracting text from a file programmatically
- Parsing JSON structures generated by text extraction
- Credentials for programmatic access