Extracting text from a file programmatically

Last updated: Feb 21, 2025

You can extract text from files programmatically in IBM watsonx.ai by using the watsonx.ai Python library.

With the ibm-watsonx-ai Python SDK, you can run a text extraction job on a file that is stored in IBM Cloud Object Storage and retrieve the results as a JSON file.

Sample notebook

The "Use watsonx.ai Text Extraction service to extract text from file" sample Python notebook contains code that runs a text extraction job in watsonx.ai.

Use a text extraction job to extract text from a file

This notebook uses the TextExtractions class of the watsonx.ai Python library.

At a high level, you set up a source document from which text is extracted, set up an output file to collect the extracted results, and then run a text extraction job to generate the results:

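Upload a source document to IBM Cloud Object Storage, along with a JSON file to be populated with the extracted data. The following variables name the local source file and the object storage paths of both files.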

from ibm_watsonx_ai.helpers import DataConnection, S3Location

# Name of the source document on the local file system
local_source_file_name = "granite_code_models_paper.pdf"
# Paths of the source document and the results file in the bucket
source_file_name = "./files/granite_code_models_paper.pdf"
results_file_name = "./files/text_extraction_granite_code_models_paper.json"
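
In this example the results file sits next to the source document and is named after it. As a minimal sketch, such a path could be derived with the standard library instead of being typed by hand. The helper name `results_path_for` and its `prefix` parameter are illustrative assumptions, not part of the SDK.

```python
import posixpath


def results_path_for(source_path: str, prefix: str = "text_extraction_") -> str:
    """Build a results path in the same folder as the source document.

    Hypothetical helper: the naming scheme mirrors the example above,
    it is not something the service requires.
    """
    folder, file_name = posixpath.split(source_path)
    stem, _extension = posixpath.splitext(file_name)
    return posixpath.join(folder, f"{prefix}{stem}.json")


print(results_path_for("./files/granite_code_models_paper.pdf"))
# ./files/text_extraction_granite_code_models_paper.json
```

Using `posixpath` keeps forward slashes on every operating system, which matches how object storage paths are written.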

Create data connection objects that represent the source document and results file.

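# connection_asset_id and bucketname must already be defined:
# the ID of the Cloud Object Storage connection asset and the name of the bucket
document_reference = DataConnection(connection_asset_id=connection_asset_id,
                                    location=S3Location(bucket=bucketname,
                                                        path=source_file_name))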

results_reference = DataConnection(connection_asset_id=connection_asset_id,
                                   location=S3Location(bucket=bucketname,
                                                       path=results_file_name))

Initialize a text extraction manager object by using the TextExtractions class.

from ibm_watsonx_ai.foundation_models.extractions import TextExtractions

# client is an authenticated API client object; project_id is the ID of your project
extraction = TextExtractions(api_client=client,
                             project_id=project_id)

Define the processing steps for the text extraction job. In this example, optical character recognition (OCR) is configured for English-language text, and any tables in the document are processed.
```
from ibm_watsonx_ai.metanames import TextExtractionsMetaNames

steps = {TextExtractionsMetaNames.OCR: {'language_list': ['en']},
         TextExtractionsMetaNames.TABLE_PROCESSING: {'enabled': True}}
```

Run the text extraction job and retrieve the job ID.

details = extraction.run_job(document_reference=document_reference,
                             results_reference=results_reference,
                             steps=steps)
extraction_job_id = extraction.get_id(extraction_details=details)
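
The job runs asynchronously, so the results are not available the moment the job is submitted. The following sketch shows one generic way to wait for a terminal state. It is not SDK code: `wait_for_job`, the `get_status` callable, and the status names `completed` and `failed` are assumptions for illustration. In practice, `get_status` would wrap whichever SDK call returns the current job details, and the status names should be checked against the actual responses.

```python
import time
from typing import Callable

# Assumed terminal status names; verify them against the service responses.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(get_status: Callable[[], str],
                 poll_seconds: float = 5.0,
                 timeout_seconds: float = 600.0,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Call get_status() until it returns a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Usage with a stand-in status source (hypothetical values)
fake_statuses = iter(["submitted", "running", "completed"])
print(wait_for_job(lambda: next(fake_statuses), poll_seconds=0))
# completed
```

Passing the status lookup in as a callable keeps the waiting logic independent of the SDK, and the `sleep` parameter makes the loop easy to exercise without real delays.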

After the job finishes, download the results file and process the extracted data.

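# Get a reference to the results file and download it to the local file system
results_reference = extraction.get_results_reference(extraction_id=extraction_job_id)
filename = "text_extraction_results_granite_code_models_paper.json"
results_reference.download(filename=filename)

import json

# Load the results and show the first 10 extracted tokens
with open(filename, 'r') as results_file:
    metadata = json.load(results_file)
metadata.get('all_structures').get('tokens')[:10]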
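
The chained `get` calls above raise an error if either key is missing from the results. A more defensive way to read the tokens might look like the following sketch. The helper `token_texts` is hypothetical, and the assumption that each token is a dictionary with a `text` field is not confirmed by this page; inspect a real results file before relying on it.

```python
from typing import Any


def token_texts(results: dict[str, Any], limit: int = 10) -> list[str]:
    """Return the text of the first `limit` tokens, tolerating missing keys."""
    tokens = (results.get("all_structures") or {}).get("tokens") or []
    # Assumption: each token is a dict that carries its string under "text".
    return [token.get("text", "") for token in tokens[:limit] if isinstance(token, dict)]


# Hypothetical sample shaped like the keys used in the example above
sample = {"all_structures": {"tokens": [{"text": "Granite"}, {"text": "Code"}]}}
print(token_texts(sample))
# ['Granite', 'Code']
print(token_texts({}))
# []
```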

Learn more

Extracting text from documents
