You can extract text from files in IBM watsonx.ai programmatically by using the Python library.
You can use the ibm-watsonx-ai Python SDK to run a document text extraction job on a file that is stored in IBM Cloud Object Storage,
and then retrieve the results in a JSON file.
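The code snippets in this topic use the names client, project_id, connection_asset_id, and bucketname without defining them. The following is a minimal sketch of one way to supply those values, assuming that they are kept in environment variables. The environment variable names and the load_settings and build_client helpers are hypothetical and are not part of the SDK; only the APIClient and Credentials classes come from the ibm-watsonx-ai package.

```python
import os
from typing import Dict, Mapping

# Hypothetical environment variable names; adjust them to your own setup.
REQUIRED_VARIABLES = {
    "url": "WATSONX_URL",
    "api_key": "WATSONX_APIKEY",
    "project_id": "WATSONX_PROJECT_ID",
    "connection_asset_id": "COS_CONNECTION_ASSET_ID",
    "bucketname": "COS_BUCKET_NAME",
}


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the values that the snippets in this topic expect."""
    missing = [var for var in REQUIRED_VARIABLES.values() if not env.get(var)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in REQUIRED_VARIABLES.items()}


def build_client(settings: Mapping[str, str]):
    """Create an API client; requires the ibm-watsonx-ai package and valid credentials."""
    from ibm_watsonx_ai import APIClient, Credentials

    credentials = Credentials(url=settings["url"], api_key=settings["api_key"])
    return APIClient(credentials=credentials, project_id=settings["project_id"])


if __name__ == "__main__":
    # Placeholder values only, so that the sketch runs without real credentials.
    example_env = {
        "WATSONX_URL": "https://example.invalid",
        "WATSONX_APIKEY": "placeholder-key",
        "WATSONX_PROJECT_ID": "placeholder-project",
        "COS_CONNECTION_ASSET_ID": "placeholder-connection",
        "COS_BUCKET_NAME": "placeholder-bucket",
    }
    settings = load_settings(example_env)
    print(sorted(settings))
```

With real values in place, client = build_client(load_settings()) would provide the client object, and the remaining settings would supply project_id, connection_asset_id, and bucketname.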
Sample notebook
The "Use watsonx.ai Text Extraction service to extract text from file" sample Python notebook contains code to run a text extraction job in watsonx.ai.
Use a text extraction job to extract text from a file
The following high-level steps set up a source document and an output file for the extracted results, and then run a text extraction job to generate the results:
-
Upload the source document to IBM Cloud Object Storage, along with a JSON file to be populated with the extracted data.
    from ibm_watsonx_ai.helpers import DataConnection, S3Location

    local_source_file_name = "granite_code_models_paper.pdf"
    source_file_name = "./files/granite_code_models_paper.pdf"
    results_file_name = "./files/text_extraction_granite_code_models_paper.json"
-
Create data connection objects that represent the source document and results file.
    document_reference = DataConnection(
        connection_asset_id=connection_asset_id,
        location=S3Location(bucket=bucketname, path=source_file_name),
    )
    results_reference = DataConnection(
        connection_asset_id=connection_asset_id,
        location=S3Location(bucket=bucketname, path=results_file_name),
    )
-
Initialize a text extraction manager object by using the TextExtractions class.
    from ibm_watsonx_ai.foundation_models.extractions import TextExtractions

    extraction = TextExtractions(api_client=client, project_id=project_id)
-
Set the processing steps for the text extraction job. In this example, Optical Character Recognition (OCR) detects English-language text, and any tables in the document are processed.
    from ibm_watsonx_ai.metanames import TextExtractionsMetaNames

    steps = {
        TextExtractionsMetaNames.OCR: {'language_list': ['en']},
        TextExtractionsMetaNames.TABLE_PROCESSING: {'enabled': True},
    }
-
Run the text extraction job and retrieve the job ID.
    details = extraction.run_job(
        document_reference=document_reference,
        results_reference=results_reference,
        steps=steps,
    )
    extraction_job_id = extraction.get_id(extraction_details=details)
-
After the job finishes running, you can download the results output file and process the extracted data.
    import json

    results_reference = extraction.get_results_reference(extraction_id=extraction_job_id)
    filename = "text_extraction_results_granite_code_models_paper.json"
    results_reference.download(filename=filename)

    with open(filename, 'r') as f:
        metadata = json.load(f)
    metadata.get('all_structures').get('tokens')[:10]
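The last step depends on the job being finished before you download the results, and it returns only the first ten raw token entries. The following is a minimal sketch of two helpers that can support that step. Both wait_for_job and tokens_to_text are hypothetical and are not part of the SDK. The sketch assumes that the job reports a status string that ends in "completed" or "failed", and that each entry under all_structures.tokens is a dictionary with a "text" key. Check both assumptions against your own job details and results file.

```python
import time
from typing import Any, Callable, Dict, Optional


def wait_for_job(
    get_status: Callable[[], str],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> str:
    """Call get_status() repeatedly until the job completes, fails, or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status == "completed":  # assumed terminal success state
            return status
        if status == "failed":  # assumed terminal failure state
            raise RuntimeError("Text extraction job failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still '{status}' after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


def tokens_to_text(results: Dict[str, Any], limit: Optional[int] = None) -> str:
    """Join the text of the first `limit` tokens (all tokens if limit is None)."""
    tokens = results.get("all_structures", {}).get("tokens", [])
    words = [token.get("text", "") for token in tokens[:limit]]
    return " ".join(word for word in words if word)


if __name__ == "__main__":
    # Simulated status sequence, so that the sketch runs without a real job.
    fake_statuses = iter(["submitted", "running", "completed"])
    print(wait_for_job(lambda: next(fake_statuses), poll_seconds=0))

    # Made-up results structure that follows the assumed layout.
    sample_results = {
        "all_structures": {"tokens": [{"text": "Granite"}, {"text": "Code"}, {"text": "Models"}]}
    }
    print(tokens_to_text(sample_results, limit=2))
```

In a real run, get_status would be a function that looks up the current status of extraction_job_id from the job details that the SDK returns, and tokens_to_text would be called with the metadata dictionary that is loaded from the downloaded JSON file.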