Extracting text from documents

Last updated: May 03, 2025

Extract text from a complex, highly-structured document to a simpler text-based file format that you can easily incorporate into your RAG solution.

Text extraction is useful when you want to extract specific entities or categories of information from a document based on the document's structure.

You can use the text extraction REST API to extract text from an input file that is stored in your project as an asset. Text extraction is an asynchronous process that converts one file at a time. To process a set of documents, submit parallel requests, one per file. The text extraction results are stored in your project in one or more files, depending on the output formats that you specify in your request.
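Because each request converts a single file, a set of documents can be handled by starting one extraction per file from a thread pool. The following is a minimal Python sketch of that pattern, not an official client: `extract_many` and `fake_start` are hypothetical names, and `fake_start` stands in for whatever function you write to send the real start request and return its extraction ID.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def extract_many(
    file_names: List[str],
    start_extraction: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Start one extraction per file in parallel; return {file_name: extraction_id}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each file with its own ID.
        ids = list(pool.map(start_extraction, file_names))
    return dict(zip(file_names, ids))


def fake_start(file_name: str) -> str:
    # Hypothetical stand-in for the real POST request; returns a made-up ID.
    return f"id-for-{file_name}"


submitted = extract_many(["a.pdf", "b.pdf"], fake_start)
print(submitted)  # {'a.pdf': 'id-for-a.pdf', 'b.pdf': 'id-for-b.pdf'}
```

Keeping the network call behind a callable makes the fan-out logic easy to test and lets you cap concurrency with `max_workers`.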

Before you begin

  1. Prepare your documents as follows before you add them to your project:

    • Remove any password protection from your document.
    • If your PDF document is digitally certified, convert your document to another file format, like DOC or DOCX.
  2. Decide the parameters to include in your text extraction request to fit your specific use case. For details, see Text extraction parameters.

Procedure

Follow these high-level steps to extract text from a business document by using the REST API:

  1. Add the file from which you want to extract text to an IBM Cloud Object Storage bucket, and then define a connection from your watsonx.ai project to the IBM Cloud Object Storage service instance.

    Your document must be stored as a connection asset in your project, and then referenced by its connection ID. Only connection assets that use the Access key and Secret key pair for credentials are supported. For details, see Adding files to reference from the API.

  2. Use the Start a text extraction request method to start the text extraction process.

    Note the ID that is returned in the metadata.id field. You use this ID as the extraction ID to check the status of your request in the next step.

  3. Use the Get the results of the request method to check the status of your request.

    Checking the status is the only way to find out whether the process failed for any reason.

    When the status is Completed, the extracted text file is available in the specified IBM Cloud Object Storage bucket.

  4. Download the generated file from Cloud Object Storage.

Request example

The following command submits a request to extract text from the retail_guidebook.pdf file and save it in Markdown format in the output_retail.md file.

curl -X POST \
  'https://{region}.ml.cloud.ibm.com/ml/v1/text/extractions?version=2024-10-18' \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer eyJraWQiOi...'

Send the following JSON as the request body, for example with the curl --data option:

{
    "project_id": "e40e5895-ce4d-42a3-b699-8ac764b89a09",
    "document_reference": {
      "type": "connection_asset",
      "connection": {
        "id": "5c0cefce-da57-408b-b47d-58f7785de3ee"
      },
      "location": {
        "bucket":"my-cloud-object-storage-bucket",
        "file_name": "retail_guidebook.pdf"
      }
    },
    "results_reference": {
      "type": "connection_asset",
      "connection": {
        "id": "5c0cefce-da57-408b-b47d-58f7785de3ee"
      },
      "location": {
        "bucket":"my-cloud-object-storage-bucket",
        "file_name": "output_retail.md"
      }
    },
    "parameters": {
      "requested_outputs": {
        "languages_list": [
          "assembly",
          "md"
        ]
      },
      "mode": "standard",
      "ocr_mode": "enabled",
      "create_embedded_images": "disabled"
    }
  }
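The same request can be assembled in Python with only the standard library. This is a sketch under stated assumptions: `build_extraction_request` is a hypothetical helper, `us-south` is only an example region, the token is a placeholder, and the `parameters` dictionary is passed through unchanged so you can supply whichever options fit your use case. The function builds the request without sending it.

```python
import json
import urllib.request


def build_extraction_request(region, token, project_id, connection_id,
                             bucket, input_file, output_file,
                             parameters=None, version="2024-10-18"):
    """Build (but do not send) the POST request that starts a text extraction."""

    def reference(file_name):
        # Input and output share one connection asset and bucket, as in the body above.
        return {
            "type": "connection_asset",
            "connection": {"id": connection_id},
            "location": {"bucket": bucket, "file_name": file_name},
        }

    body = {
        "project_id": project_id,
        "document_reference": reference(input_file),
        "results_reference": reference(output_file),
    }
    if parameters:
        body["parameters"] = parameters

    return urllib.request.Request(
        url=f"https://{region}.ml.cloud.ibm.com/ml/v1/text/extractions?version={version}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_extraction_request(
    region="us-south",                      # example region, replace with your own
    token="<your-bearer-token>",            # placeholder, not a real token
    project_id="e40e5895-ce4d-42a3-b699-8ac764b89a09",
    connection_id="5c0cefce-da57-408b-b47d-58f7785de3ee",
    bucket="my-cloud-object-storage-bucket",
    input_file="retail_guidebook.pdf",
    output_file="output_retail.md",
    parameters={"mode": "standard", "ocr_mode": "enabled"},
)
print(request.get_method(), request.full_url)
```

To submit the request, pass it to `urllib.request.urlopen` and read the `metadata.id` field from the JSON response.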

From the response, copy the metadata.id value, such as 64162e0a-b05d-4ba6-a688-422893f58663. Specify this ID in the endpoint that you use to check the status of the extraction process.

curl -X GET \
  'https://{region}.ml.cloud.ibm.com/ml/v1/text/extractions/64162e0a-b05d-4ba6-a688-422893f58663?project_id=e40e5895-ce4d-42a3-b699-8ac764b89a09&version=2024-09-23' \
  --header 'Accept: application/json' \
  --header 'Authorization: Bearer eyJraWQiOi...'
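Because extraction is asynchronous, the status check is usually repeated until the job finishes. The sketch below shows one way to poll in Python. The names `fetch_details` and `wait_for_extraction` are hypothetical, and two details are assumptions rather than documented behavior: that the response JSON carries the status at `results.status`, and that a failed job reports the value `failed`. The `running` value in the demo is made up.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable


def fetch_details(region, token, project_id, extraction_id, version="2024-09-23") -> dict:
    """Send the GET request shown above and return the parsed JSON (network call)."""
    query = urllib.parse.urlencode({"project_id": project_id, "version": version})
    url = f"https://{region}.ml.cloud.ibm.com/ml/v1/text/extractions/{extraction_id}?{query}"
    req = urllib.request.Request(
        url,
        headers={"Accept": "application/json", "Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_extraction(get_details: Callable[[], dict], interval_s: float = 5.0,
                        max_polls: int = 60, sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the status is terminal, then return the results object."""
    for _ in range(max_polls):
        results = get_details().get("results", {})
        # "failed" is an assumed terminal value; adjust to what your responses contain.
        if results.get("status") in ("completed", "failed"):
            return results
        sleep(interval_s)
    raise TimeoutError(f"extraction still not finished after {max_polls} polls")


# Offline demo with canned responses instead of real network calls.
canned = iter([{"results": {"status": "running"}}, {"results": {"status": "completed"}}])
final = wait_for_extraction(lambda: next(canned), sleep=lambda s: None)
print(final["status"])  # completed
```

In real use, pass something like `lambda: fetch_details(region, token, project_id, extraction_id)` as `get_details`.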

The results attribute of the response lists the locations of the extracted files:

"results": {
  "completed_at": "2025-04-28T09:05:42.880Z",
  "location": ["/results_data1/assembly.html", "/results_data1/assembly.json",
               "/results_data1/assembly.md",
               "/results_data1/embedded_images_assembly/*.png",
               "/results_data1/page_images/*.png", "/results_data1/tables.json"],
  "number_pages_processed": 1,
  "running_at": "2025-04-28T09:05:27.345Z",
  "status": "completed"
}
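Once the response is parsed, the results object can be filtered for the output you want to download. The following Python sketch uses the sample response above as input. `pick_outputs` is a hypothetical helper, and it only selects paths from the `location` list; the download from Cloud Object Storage itself is not shown.

```python
import posixpath


def pick_outputs(results: dict, extension: str) -> list:
    """Return the result paths that end with the given extension, such as '.md'."""
    if results.get("status") != "completed":
        raise ValueError(f"extraction not completed: status={results.get('status')!r}")
    return [
        path for path in results.get("location", [])
        if posixpath.splitext(path)[1] == extension
    ]


# Sample values copied from the response shown above.
results = {
    "completed_at": "2025-04-28T09:05:42.880Z",
    "location": [
        "/results_data1/assembly.html",
        "/results_data1/assembly.json",
        "/results_data1/assembly.md",
        "/results_data1/embedded_images_assembly/*.png",
        "/results_data1/page_images/*.png",
        "/results_data1/tables.json",
    ],
    "number_pages_processed": 1,
    "running_at": "2025-04-28T09:05:27.345Z",
    "status": "completed",
}

print(pick_outputs(results, ".md"))  # ['/results_data1/assembly.md']
```

Entries such as `/results_data1/page_images/*.png` are wildcard patterns rather than single files, so treat them as prefixes to list in the bucket rather than object names to fetch directly.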
