Extracting text from a file programmatically
Last updated: Nov 27, 2024

You can extract text from files in IBM watsonx.ai programmatically by using the Python library.

With the ibm-watsonx-ai Python SDK, you can run a text extraction job on a file that is stored in IBM Cloud Object Storage. The job writes its results to a JSON file that you can then retrieve.

Sample notebook

The sample Python notebook "Use watsonx.ai Text Extraction service to extract text from file" contains code to run a text extraction job in watsonx.ai.

Use a text extraction job to extract text from a file

To run a text extraction job, you set up a source document and an output file for the extracted results, and then start the job. The high-level steps are as follows:

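  1. Upload a source document to IBM Cloud Object Storage, along with a JSON file to be populated with the extracted data. The following code defines the file names that are used in the later steps.

    from ibm_watsonx_ai.helpers import DataConnection, S3Location
    
    # Name of the source document on the local file system
    local_source_file_name = "granite_code_models_paper.pdf"
    # Paths of the source document and the results file within the bucket
    source_file_name = "./files/granite_code_models_paper.pdf"
    results_file_name = "./files/text_extraction_granite_code_models_paper.json"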
    
  2. Create data connection objects that represent the source document and results file.

    # connection_asset_id and bucketname must already be defined for your
    # Cloud Object Storage connection and bucket
    document_reference = DataConnection(connection_asset_id=connection_asset_id,
                                        location=S3Location(bucket=bucketname,
                                                            path=source_file_name))
    
    results_reference = DataConnection(connection_asset_id=connection_asset_id,
                                       location=S3Location(bucket=bucketname,
                                                           path=results_file_name))
    
  3. Initialize a text extraction manager object by using the TextExtractions class.

    from ibm_watsonx_ai.foundation_models.extractions import TextExtractions
    
    # client and project_id must already be defined for your watsonx.ai project
    extraction = TextExtractions(api_client=client,
                                 project_id=project_id)
    
  4. Define the processing steps for the text extraction job. In this example, optical character recognition (OCR) detects English-language text, and any tables that are present in the document are processed.

    from ibm_watsonx_ai.metanames import TextExtractionsMetaNames
    
    steps = {TextExtractionsMetaNames.OCR: {'language_list': ['en']},
            TextExtractionsMetaNames.TABLE_PROCESSING: {'enabled': True}}
    
  5. Run the text extraction job and retrieve the job ID.

    details = extraction.run_job(document_reference=document_reference, 
                                results_reference=results_reference, 
                                steps=steps)
    extraction_job_id = extraction.get_id(extraction_details=details)
    
  6. After the job finishes running, download the results file and process the extracted data.

    results_reference = extraction.get_results_reference(extraction_id=extraction_job_id)
    filename = "text_extraction_results_granite_code_models_paper.json"
    results_reference.download(filename=filename)
    
    import json
    
    # Load the results and preview the first 10 extracted tokens
    with open(filename, 'r') as results_file:
        metadata = json.load(results_file)
    metadata.get('all_structures').get('tokens')[:10]
    
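Step 6 assumes that the job has already finished. One way to wait for that is a small polling loop, sketched below. The `wait_for_job` helper is hypothetical and not part of the SDK, and the status names `completed` and `failed` are assumptions. The helper takes any callable that returns the current status, so how you read the status from the job details is left to you.

```python
import time
from typing import Callable, Sequence


def wait_for_job(get_status: Callable[[], str],
                 poll_seconds: float = 5.0,
                 timeout_seconds: float = 600.0,
                 done_states: Sequence[str] = ("completed",),
                 failed_states: Sequence[str] = ("failed",)) -> str:
    """Poll get_status() until it reports a terminal state or the timeout expires.

    The default state names are assumptions; adjust them to match the
    status values that your job details actually contain.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in done_states:
            return status
        if status in failed_states:
            raise RuntimeError(f"Text extraction job ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still {status!r} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Hypothetical usage. It assumes that the SDK exposes the job details through
# a get_job_details method and that the status is found under
# ["entity"]["results"]["status"]; verify both against the SDK reference.
# wait_for_job(lambda: extraction.get_job_details(extraction_job_id)
#              ["entity"]["results"]["status"])
```

Passing the status lookup as a callable keeps the loop independent of the exact shape of the job details, so only the lambda changes if the response format differs.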

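As a sketch of how the extracted data might be processed further, the following helper joins token text into a single string. It assumes that each entry under `all_structures` and `tokens` is a dictionary with a `text` key, which the example above does not confirm. The `tokens_to_text` name and the sample data are made up for illustration.

```python
from typing import Any, Dict, Optional


def tokens_to_text(results: Dict[str, Any], limit: Optional[int] = None) -> str:
    """Join the text of the first `limit` tokens (all tokens if limit is None).

    Assumes each token is a dict with a "text" key; entries without one are skipped.
    """
    tokens = (results.get("all_structures") or {}).get("tokens") or []
    words = [token["text"] for token in tokens[:limit]
             if isinstance(token, dict) and token.get("text")]
    return " ".join(words)


# Made-up stand-in for the downloaded results, not real job output
sample = {"all_structures": {"tokens": [{"text": "Granite"},
                                        {"text": "Code"},
                                        {"text": "Models"}]}}
print(tokens_to_text(sample, limit=2))  # prints "Granite Code"
```

With a real results file, you would pass the dictionary returned by `json.load` instead of `sample`.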