Extracting text from documents

Last updated: Feb 07, 2025

Extract text to convert high-quality business PDF documents into a simpler file format that AI models can use, or to find and isolate key pieces of information in documents such as contracts.

Ways to develop

You can extract text from documents programmatically by using the REST API.

Overview

Simplifying your business documents in this way is especially useful for retrieval-augmented generation tasks where you want to find information that is relevant to a user query and include it with the input to a foundation model. Including accurate contextual information in model input helps the foundation model to incorporate factual and up-to-date information in the model output. For more information, see Retrieval-augmented generation (RAG).

Text extraction is also powerful for use cases where you want to extract specific entities or categories of information from a document based on the document structure.

REST API

You can use the document text extraction method of the watsonx.ai REST API to convert highly structured PDF files, which use diagrams, images, and tables to convey information, into a file format that is easier to work with programmatically, such as markdown or JSON.

Text extraction is available with paid plans only. Billing is based on the number of pages that are processed. For more information, see Billing details for generative AI assets.

The text extraction API applies natural language understanding technology developed by IBM to identify document structures.

Text extraction is an asynchronous process that converts one file at a time. You can make parallel method requests to extract text from a set of documents.
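
Because each request converts one file, a set of documents can be submitted with one request per file, issued in parallel. The following Python sketch illustrates the pattern only. The start_extraction callable is a hypothetical placeholder for your own function that calls the Start a text extraction request method and returns the extraction ID. It is not part of the watsonx.ai API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def submit_all(
    file_names: List[str],
    start_extraction: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Start one extraction request per file in parallel.

    Returns a mapping of file name to the extraction ID that
    start_extraction (a hypothetical, caller-supplied function) returned.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so IDs line up with file names.
        extraction_ids = list(pool.map(start_extraction, file_names))
    return dict(zip(file_names, extraction_ids))


# Stand-in for a real API call, used here only to show the flow.
def fake_start_extraction(file_name: str) -> str:
    return f"id-for-{file_name}"


ids = submit_all(["a.pdf", "b.pdf"], fake_start_extraction)
print(ids)
```

Each returned ID still needs to be checked separately for completion, because the extraction itself runs asynchronously.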

API reference

For API method details, see the API reference documentation.

Supported file types

You can use the API to extract text from the following file types:

PDF
GIF
JPG
PNG
TIFF

You can store the extracted text in the following formats:

JSON
Markdown

Supported languages

The capability that extracts text from images is called optical character recognition (OCR). This capability is useful for preserving information that is depicted in images, diagrams, or in text that is embedded in files such as scanned PDFs.

Although optical character recognition can extract text from noisy images, the quality of the image files must meet the minimum requirement of 80 DPI (dots per inch).
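
If you know the physical size of a scanned page, you can estimate whether an image meets the 80 DPI minimum before you submit it. The following Python sketch is an illustration only. The function names are hypothetical, and the check assumes that you supply the page dimensions in inches yourself.

```python
MIN_DPI = 80  # minimum image quality stated in this topic


def effective_dpi(width_px: int, height_px: int,
                  width_in: float, height_in: float) -> float:
    """Return the lower of the horizontal and vertical dots per inch."""
    return min(width_px / width_in, height_px / height_in)


def meets_minimum(width_px: int, height_px: int,
                  width_in: float, height_in: float) -> bool:
    """Check an image against the 80 DPI minimum."""
    return effective_dpi(width_px, height_px, width_in, height_in) >= MIN_DPI


# A US Letter page (8.5 x 11 inches) scanned at 680 x 880 pixels
# works out to exactly 80 DPI in both directions.
print(meets_minimum(680, 880, 8.5, 11))
print(meets_minimum(600, 800, 8.5, 11))
```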

If the document with images that you want to convert is in a language other than English, you must specify the language by its ISO 639 language code in the languages_list parameter of your request.

    "languages_list": [
      "fr"
    ]

If the document has a mix of languages, list each language separately. Optical character recognition can convert images in a mixed-language document only when the languages share a common script. For example, you can extract text from images in a document with a mix of English and French text because both languages use the Latin script. However, you cannot use OCR to extract text from images in a document with a mix of Japanese and English text because those languages use different scripts.

The optical character recognition function can extract text from images in documents that are written in the following languages:

Language	ISO 639 language code	Script
Chinese (Simplified)	`zh-CN`	Chinese
Chinese (Traditional)	`zh-TW`	Chinese
Danish	`da`	Latin
Dutch	`nl`	Latin
English	`en`	Latin
English handwriting	`en_hw`	Latin
Finnish	`fi`	Latin
French	`fr`	Latin
German	`de`	Latin
Greek	`el`	Greek
Hebrew	`he`	Hebrew
Italian	`it`	Latin
Japanese	`ja`	Japanese
Korean	`ko`	Korean
Norwegian (Bokmål)	`nb`	Latin
Norwegian (Nynorsk)	`nn`	Latin
Polish	`pl`	Latin
Portuguese	`pt`	Latin
Spanish	`es`	Latin
Swedish	`sv`	Latin
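
The shared-script rule can be checked before a request is sent. The following Python sketch copies the language codes and scripts from the preceding table and builds the ocr step in the shape that the request example in this topic uses. The build_ocr_step helper is hypothetical, and the validation is a client-side convenience, not something the API requires.

```python
from typing import Dict, List

# Language code -> script, copied from the table in this topic.
SCRIPT_BY_LANGUAGE: Dict[str, str] = {
    "zh-CN": "Chinese", "zh-TW": "Chinese", "da": "Latin", "nl": "Latin",
    "en": "Latin", "en_hw": "Latin", "fi": "Latin", "fr": "Latin",
    "de": "Latin", "el": "Greek", "he": "Hebrew", "it": "Latin",
    "ja": "Japanese", "ko": "Korean", "nb": "Latin", "nn": "Latin",
    "pl": "Latin", "pt": "Latin", "es": "Latin", "sv": "Latin",
}


def build_ocr_step(languages: List[str]) -> dict:
    """Return the "ocr" step for a request, or raise ValueError if a code
    is not in the table or the languages do not share one script."""
    unknown = [code for code in languages if code not in SCRIPT_BY_LANGUAGE]
    if unknown:
        raise ValueError(f"Unsupported language codes: {unknown}")
    scripts = {SCRIPT_BY_LANGUAGE[code] for code in languages}
    if len(scripts) > 1:
        raise ValueError(f"Languages use different scripts: {sorted(scripts)}")
    return {"ocr": {"languages_list": list(languages)}}


print(build_ocr_step(["en", "fr"]))
```

A call such as build_ocr_step(["ja", "en"]) raises ValueError because Japanese and English do not share a script.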

Extracting text from tables

Convert tabular data within a document into consumable text that captures the table information. This conversion is useful because many large language models have difficulty interpreting tabular information correctly.

To enable table conversion, specify the following parameter in your request.

"steps": {
  "tables_processing": {
    "enabled": true
  }
}

Choosing the output file format

By default, the extracted text is written in JSON syntax. If you want the extracted text to be written in markdown instead, specify the following parameter in the API request body:

"assembly_md": {}

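The table-processing and output-format parameters can be assembled in code before they are merged into a request body. The following Python sketch is an illustration only. The extraction_options helper and its argument names are hypothetical, and it emits only the two parameters that are shown in this topic.

```python
import json


def extraction_options(tables: bool = True, markdown: bool = False) -> dict:
    """Build the optional parts of a text extraction request body."""
    options: dict = {}
    if tables:
        # Enables conversion of tabular data, as shown in this topic.
        options["steps"] = {"tables_processing": {"enabled": True}}
    if markdown:
        # An empty assembly_md object requests markdown instead of
        # the default JSON output.
        options["assembly_md"] = {}
    return options


print(json.dumps(extraction_options(markdown=True), indent=2))
```

Leaving markdown set to False omits assembly_md, so the extracted text is written in the default JSON syntax.
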
Managing your documents

You add the documents that you want to process into IBM Cloud Object Storage so you can reference them from the API.

Only connection assets that use the Access key and Secret key pair for credentials are supported. For more information about how to set up the connection, see Referencing files from the API.

For example, you reference the file that you add to IBM Cloud Object Storage as follows:

"document_reference": {
  "type": "connection_asset",
  "connection": {
    "id": "6f5688fd-f3bf-42c2-a18b-49c0d8a1920d"
  },
  "location": {
    "file_name": "document.pdf",
    "bucket":"janessandbox"
  }
}

You define the location where you want to store the generated output file as follows:

"results_reference": {
  "type": "connection_asset",
  "connection": {
    "id": "6f5688fd-f3bf-42c2-a18b-49c0d8a1920d"
  },
  "location": {
    "file_name": "extracted_document.json"
  }
}
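
The input and output references have the same shape, so one helper can build both. The following Python sketch is an illustration only. The cos_reference helper is hypothetical, the field names follow the request example later in this topic, and the connection ID and bucket are the sample values shown above.

```python
def cos_reference(connection_id: str, bucket: str, file_name: str) -> dict:
    """Build a connection-asset reference to a file in a
    Cloud Object Storage bucket."""
    return {
        "type": "connection_asset",
        "connection": {"id": connection_id},
        "location": {"bucket": bucket, "file_name": file_name},
    }


connection_id = "6f5688fd-f3bf-42c2-a18b-49c0d8a1920d"  # sample ID from this topic
references = {
    "document_reference": cos_reference(connection_id, "janessandbox", "document.pdf"),
    "results_reference": cos_reference(
        connection_id, "janessandbox", "extracted_document.json"
    ),
}
print(references)
```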

The following diagram shows the workflow you use to extract structural information about a business document with the document text extraction API.

watsonx.ai document text extraction API workflow

Procedure

Follow these high-level steps to extract text from a business document by using the REST API:

1. Add the file from which you want to extract text to an IBM Cloud Object Storage bucket, and then define a connection from your watsonx.ai project to the IBM Cloud Object Storage service instance.

   For more information, see Referencing files from the API.

2. Use the Start a text extraction request method to start the text extraction process.

   Note the ID that is returned in the metadata.id field. You use this ID as the extraction ID to check the status of your request in the next step.

3. Use the Get the results of the request method to check the status of your request.

   Checking the status is the only way to find out whether the process failed for any reason.

   When the status is Completed, the extracted text file is available in the specified IBM Cloud Object Storage bucket.

4. Download the generated file from Cloud Object Storage.
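
Because the process is asynchronous, the status check is typically repeated until the request finishes. The following Python sketch shows one way to poll. It is not a definitive implementation. The get_status callable is a hypothetical placeholder for your own function that calls the Get the results of the request method and returns the status string. Only the Completed status is named in this topic; the failed and running values in the sketch are assumptions.

```python
import time
from typing import Callable

# "completed" comes from this topic; "failed" is an assumed status name.
TERMINAL_STATUSES = ("completed", "failed")


def wait_for_extraction(
    get_status: Callable[[], str],
    poll_seconds: float = 5.0,
    max_polls: int = 120,
) -> str:
    """Poll until the extraction reaches a terminal status and return it."""
    for _ in range(max_polls):
        status = get_status()
        if status.lower() in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"No terminal status after {max_polls} checks")


# Stand-in for real API calls: reports an assumed in-progress
# status twice, then the Completed status.
statuses = iter(["running", "running", "Completed"])
final_status = wait_for_extraction(lambda: next(statuses), poll_seconds=0)
print(final_status)
```

The max_polls limit keeps the loop from running indefinitely if the request never reaches a terminal status.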

Request example

For example, the following command submits a request to extract text from the retail_guidebook.pdf file and save it in markdown format.

curl -X POST \
  'https://{region}.ml.cloud.ibm.com/ml/v1/text/extractions?version=2024-10-18' \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer eyJraWQiOi...'

The request body looks as follows:

{
    "project_id": "e40e5895-ce4d-42a3-b699-8ac764b89a09",
    "document_reference": {
      "type": "connection_asset",
      "connection": {
        "id": "5c0cefce-da57-408b-b47d-58f7785de3ee"
      },
      "location": {
        "bucket":"my-cloud-object-storage-bucket",
        "file_name": "retail_guidebook.pdf"
      }
    },
    "results_reference": {
      "type": "connection_asset",
      "connection": {
        "id": "5c0cefce-da57-408b-b47d-58f7785de3ee"
      },
      "location": {
        "bucket":"my-cloud-object-storage-bucket",
        "file_name": "output_retail.md"
      }
    },
    "steps": {
      "ocr": {
        "languages_list": [
          "en"
        ]
      },
      "tables_processing": {
        "enabled": true
      }
    },
    "assembly_md": {}
  }
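
The same request can be prepared from Python. The following sketch uses only the standard library and mirrors the URL and headers of the curl command. It is an illustration under stated assumptions: the build_start_request and extraction_id helpers are hypothetical, us-south is a placeholder region, and the token value is a placeholder for a real bearer token. The sketch builds the request without sending it.

```python
import json
import urllib.request


def build_start_request(region: str, bearer_token: str, body: dict,
                        version: str = "2024-10-18") -> urllib.request.Request:
    """Prepare the POST request that starts a text extraction."""
    url = f"https://{region}.ml.cloud.ibm.com/ml/v1/text/extractions?version={version}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Bearer {bearer_token}",
        },
        method="POST",
    )


def extraction_id(response_json: dict) -> str:
    """Read the extraction ID from the metadata.id field of the response."""
    return response_json["metadata"]["id"]


# The body is shortened here; a real request needs the full body shown above.
body = {"project_id": "e40e5895-ce4d-42a3-b699-8ac764b89a09"}
request = build_start_request("us-south", "<your-token>", body)
print(request.get_method(), request.full_url)
```

To send the request, pass it to urllib.request.urlopen and parse the response with json.load before you call extraction_id.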

From the response, copy the metadata.id, such as 64162e0a-b05d-4ba6-a688-422893f58663. Specify this ID in the endpoint that you use to check the status of the extraction process.

curl -X GET \
  'https://{region}.ml.cloud.ibm.com/ml/v1/text/extractions/64162e0a-b05d-4ba6-a688-422893f58663?project_id=e40e5895-ce4d-42a3-b699-8ac764b89a09&version=2024-09-23' \
  --header 'Accept: application/json' \
  --header 'Authorization: Bearer eyJraWQiOi...'
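
When you check many requests, it can help to build the status endpoint in code rather than by hand. The following Python sketch reproduces the URL pattern of the preceding curl command. The status_url helper is hypothetical, and us-south is a placeholder region.

```python
from urllib.parse import quote, urlencode


def status_url(region: str, extraction_id: str, project_id: str,
               version: str = "2024-09-23") -> str:
    """Build the endpoint that reports the status of one extraction request."""
    query = urlencode({"project_id": project_id, "version": version})
    return (
        f"https://{region}.ml.cloud.ibm.com/ml/v1/text/extractions/"
        f"{quote(extraction_id, safe='')}?{query}"
    )


url = status_url(
    "us-south",
    "64162e0a-b05d-4ba6-a688-422893f58663",
    "e40e5895-ce4d-42a3-b699-8ac764b89a09",
)
print(url)
```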

Output details

The extracted text is written to a markdown file with the name that you specified in the results_reference.location.file_name field.

The markdown captures structures in the document, such as sections and tables. For example, the following image shows how a table from the original PDF file is represented in markdown after the text is extracted. A preview of the markdown table is included to show that the text from the original table in the PDF remains intact after extraction.

Three screenshots where the first one shows a table in a PDF document, the next shows the table text extracted as markdown, and the third shows a preview of the table

Example JSON output

When text is extracted to a JSON file, the resulting file contains details about different data structures in the document such as sections, paragraphs, table structures, tokens and more.

For more information about how to work with text that is extracted in JSON format, see Parsing JSON structures generated by text extraction.

Using the text you extract from the PDF file

You can convert the generated markdown file into a text file by changing the file extension from .md to .txt. The resulting text file includes the markdown tags. If you want to remove the tagging, you can use a parser library to find and convert the tags.

You can use a JSON processor library to extract text from the generated JSON file and store it as plain text. For example, the following command extracts the text from each token for all structures in the document and stores the text in a file named parsed_output_text.txt:

jq -r '[.all_structures.tokens[].text] | join(" ")' output_retail.json > parsed_output_text.txt

Note: This command uses jq, which is a command-line JSON processor that needs to be installed separately.
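
If jq is not available, the same token text can be collected with the Python standard library. The following sketch assumes only the structure that the jq filter implies, which is an all_structures object that contains a tokens array whose items have a text field. The tokens_to_text helper is hypothetical, and the sample data is made up for illustration.

```python
import json


def tokens_to_text(extraction: dict) -> str:
    """Join the text of every token with single spaces."""
    tokens = extraction["all_structures"]["tokens"]
    return " ".join(token["text"] for token in tokens)


# Made-up sample with the same shape as the generated JSON file.
sample = json.loads(
    '{"all_structures": {"tokens": [{"text": "Retail"}, {"text": "guidebook"}]}}'
)
print(tokens_to_text(sample))  # prints: Retail guidebook
```

For a real file, load the JSON with json.load from the downloaded output file and write the returned string to parsed_output_text.txt.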

After you convert the generated file to a TXT file, you can use the extracted text as contextual information for a foundation model prompt in the following ways:

Reference the extracted text from a Python notebook.

For example, you can use your TXT file instead of the state_of_the_union.txt file in the Use watsonx, Chroma, and LangChain to answer questions (RAG) sample notebook.
Use the TXT file as a grounding document in Prompt Lab. For more information, see Grounding foundation model prompts in contextual information.
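
In a notebook, using the extracted text as context usually comes down to placing it in the prompt ahead of the question. The following Python sketch is a generic illustration, not the approach of the sample notebook. The build_grounded_prompt helper, the prompt wording, and the character limit are all assumptions.

```python
def build_grounded_prompt(context: str, question: str,
                          max_context_chars: int = 4000) -> str:
    """Combine extracted document text and a question into one prompt.

    The context is cut off at max_context_chars, an arbitrary limit chosen
    for this sketch, to keep the prompt within a model's input size.
    """
    trimmed = context[:max_context_chars]
    return (
        "Answer the question by using only the following document text.\n\n"
        f"Document:\n{trimmed}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_grounded_prompt("Returns are accepted within 30 days.",
                               "What is the return window?")
print(prompt)
```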
