Text extraction parameters
When you submit a text extraction request by using the watsonx.ai REST API, you include a payload that specifies configuration details for the text extraction operation.
In the REST API request body, set the text extraction parameters that meet your requirements:
- File type in which to store the extracted text
- Quality and speed of text extraction
- Language of the input text
- Include text from images in the extracted output
- Include key-value pairs in the extracted output
For details about the different parameters you can set to customize your text extraction REST API request, see the watsonx.ai API reference documentation.
Specifying the output file format
By default, the extracted text is written in plain text. If you want the extracted text to be written in another format, like Markdown, specify the following parameter in the API request body:
"parameters": {
  "requested_outputs": [
    "md"
  ]
}
The following table provides details about the different output formats generated by the text extraction process when you specify the requested_outputs parameter in your API request:
Requested output | Generated file type | Description |
---|---|---|
md | Markdown File | Extract text into a Markdown file |
html | HTML File | Extract text in HTML format |
plain_text | Plain Text File | Extract all information into an unstructured text representation |
assembly | Assembly File | Extract text into a JSON format |
page_images | Serialized Images | Extract each page of the document into a separate image |
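The format identifiers in the table can be checked before a request is sent. The following Python sketch builds the parameters fragment and rejects identifiers that are not listed in the table. The function name build_output_parameters is hypothetical, and the sketch assumes that several formats can be requested in one list, which the array form of requested_outputs suggests but this topic does not state.

```python
import json

# Output format identifiers copied from the table in this topic.
SUPPORTED_OUTPUTS = {"md", "html", "plain_text", "assembly", "page_images"}


def build_output_parameters(requested_outputs):
    """Return a `parameters` fragment after checking each requested format.

    Hypothetical helper for illustration; it is not part of any SDK.
    """
    unknown = [fmt for fmt in requested_outputs if fmt not in SUPPORTED_OUTPUTS]
    if unknown:
        raise ValueError(f"Unsupported output format(s): {unknown}")
    return {"parameters": {"requested_outputs": list(requested_outputs)}}


if __name__ == "__main__":
    # Print the fragment as JSON so it can be pasted into a request body.
    print(json.dumps(build_output_parameters(["md", "html"]), indent=2))
```

A typo such as "markdown" instead of "md" raises a ValueError locally instead of reaching the service.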
Processing mode
You can control the speed at which your text extraction request is processed by setting the mode parameter in your API request.
"parameters": {
  "mode": "standard"
}
The high quality processing mode preserves all data structures in your document but may take longer to process than the standard mode. In the standard mode, the extraction request completes faster but generates lower quality output that may lack details.
For details about the different processing modes, see the watsonx.ai API reference documentation.
Supported languages
If your document is in a language other than English, you must specify the language by its ISO 639 language code in the languages parameter of your API request.
"parameters": {
  "languages": "de"
}
If the document has a mix of languages, list each language separately.
For example, you can extract text from images in a document with a mix of English and French text because both languages are Latin based. However, you cannot extract text from images in a document with a mix of Japanese and French text.
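The English and French example can be sketched as a check on scripts. The following Python snippet assumes the rule is that all listed languages must share one script, which is how the example above reads; the service may apply a different rule. The lookup is a small subset of the machine-printed language table in this topic, and the function name share_one_script is hypothetical.

```python
# Subset of the machine-printed language table in this topic:
# ISO 639 language code -> script.
SCRIPT_BY_LANGUAGE = {
    "en": "Latin",
    "fr": "Latin",
    "de": "Latin",
    "ru": "Cyrillic",
    "hi": "Devanagari",
    "ja": "Japanese",
}


def share_one_script(language_codes):
    """Return True when every listed language maps to the same script.

    Assumption: a shared script is what makes a language mix usable for
    extracting text from images, as in the English and French example.
    """
    scripts = {SCRIPT_BY_LANGUAGE[code] for code in language_codes}
    return len(scripts) == 1


if __name__ == "__main__":
    print(share_one_script(["en", "fr"]))  # both Latin
    print(share_one_script(["ja", "fr"]))  # Japanese and Latin
```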
The language code you specify differs based on whether your document contains machine-printed text or handwriting.
Supported handwritten languages
If your document contains text in English handwriting, use the en_hw language code in your API request body.
Supported machine-printed languages
The following table provides details about the languages supported by the text extraction API for printed text recognition:
Language | ISO 639 language code | API script code | Script |
---|---|---|---|
Acehnese | ‐ | latn | Latin |
Afrikaans | af | latn | Latin |
Albanian | sq | latn | Latin |
Araucanian/Mapuche | ‐ | latn | Latin |
Awadhi | ‐ | deva | Devanagari |
Aymara | ay | latn | Latin |
Balinese | ‐ | latn | Latin |
Baso Minangkabau | ‐ | latn | Latin |
Basque | eu | latn | Latin |
Belarusian | be | cyrl | Cyrillic |
Bemba | ‐ | latn | Latin |
Bikol | ‐ | latn | Latin |
Bislama | bi | latn | Latin |
Bhojpuri | ‐ | deva | Devanagari |
Bulgarian | bg | cyrl | Cyrillic |
Catalan | ca | latn | Latin |
Cebuano | ‐ | latn | Latin |
Chechen | ‐ | cyrl | Cyrillic |
Chinese (Simplified) | zh_cn | cjk | Han (Simplified) |
Chinese (Traditional) | zh_tw | cjk | Han (Traditional) |
Choctaw | ‐ | latn | Latin |
Cree | cr | latn | Latin |
Dakota | ‐ | latn | Latin |
Danish | da | latn | Latin |
Dogri | ‐ | deva | Devanagari |
Dutch | nl | latn | Latin |
English | en | latn | Latin |
Estonian | et | latn | Latin |
Fijian | fj | latn | Latin |
Filipino | fil | latn | Latin |
Finnish | fi | latn | Latin |
French | fr | latn | Latin |
Galician | gl | latn | Latin |
Gayo | ‐ | latn | Latin |
German | de | latn | Latin |
Gilbertese | ‐ | latn | Latin |
Greek | el | el | Greek |
Haitian Creole | ht | latn | Latin |
Hebrew | he | he | Hebrew |
Hiligaynon | ‐ | latn | Latin |
Hindi | hi | deva | Devanagari |
Iban | ‐ | latn | Latin |
Iloko | ‐ | latn | Latin |
Indonesian | id | latn | Latin |
Irish | ga | latn | Latin |
Italian | it | it | Latin |
Japanese | ja | cjk | Japanese |
Javanese | jv | latn | Latin |
Kachin | ‐ | latn | Latin |
Kalaallisut | kl | latn | Latin |
Kanienʼkéha | ‐ | latn | Latin |
Khasi | ‐ | latn | Latin |
Kinyarwanda | rw | latn | Latin |
Konkani | ‐ | deva | Devanagari |
Kongo | kg | latn | Latin |
Korean | ko | cjk | Korean |
Kosraean | ‐ | latn | Latin |
Kuanyama | kj | latn | Latin |
Latin | la | latn | Latin |
Lozi | ‐ | latn | Latin |
Low German | ‐ | latn | Latin |
Luo | ‐ | latn | Latin |
Malagasy | mg | latn | Latin |
Maithili | ‐ | deva | Devanagari |
Manx | gv | latn | Latin |
Marathi | mr | deva | Devanagari |
Middle English | ‐ | latn | Latin |
Mittelhochdeutsch | ‐ | latn | Latin |
Macedonian | mk | cyrl | Cyrillic |
Ndonga | ng | latn | Latin |
Nepali | ne | deva | Devanagari |
North Ndebele | nd | latn | Latin |
Norwegian | no | no | Latin |
Nyankole | ‐ | latn | Latin |
Occitan | oc | latn | Latin |
Ojibwa | oj | latn | Latin |
Old English | ‐ | latn | Latin |
Old French | ‐ | latn | Latin |
Old High German | ‐ | latn | Latin |
Old Norse | ‐ | latn | Latin |
Old Provençal | ‐ | latn | Latin |
Pampanga | ‐ | latn | Latin |
Pangasinan | ‐ | latn | Latin |
Papiamento | ‐ | latn | Latin |
Polish | pl | latn | Latin |
Portuguese | pt | pt | Latin |
Quechua | qu | latn | Latin |
Romansh | rm | latn | Latin |
Rundi | rn | latn | Latin |
Russian | ru | cyrl | Cyrillic |
Sango | sg | latn | Latin |
Sanskrit | sa | deva | Devanagari |
Scots | ‐ | latn | Latin |
Serbian | sr | cyrl | Cyrillic |
Shona | sn | latn | Latin |
Spanish | es | es | Latin |
Sundanese | su | latn | Latin |
Swahili | sw | latn | Latin |
Swati | ss | latn | Latin |
Swedish | sv | sv | Latin |
Tamil | ta | deva | Tamil |
Telugu | te | deva | Telugu |
Tsonga | ts | latn | Latin |
Tswana | tn | latn | Latin |
Ukrainian | uk | cyrl | Cyrillic |
Uzbek | uz | cyrl (Note: latn API script code.) | Cyrillic |
Xhosa | xh | latn | Latin |
Zulu | zu | latn | Latin |
Extracting text from images
You can specify how to process text in images in your document by using optical character recognition (OCR). Specify the following parameter in the API request body:
"parameters": {
  "ocr_mode": "enabled"
}
For details about the different OCR modes, see the watsonx.ai API reference documentation.
You can also configure how to process images embedded in your document and convert them to Markdown and JSON formats.
The embedded image is the area on a page of the document that represents only the picture without including portions of the page that contain text or tables. Text and tables in the original document are processed with OCR. The embedded images extraction mode is used to specify how to serialize images in the document and preserve them in the extracted output.
Based on the embedded images extraction mode you specify, you can choose how embedded images are represented in the output:
- Whether to include images in the extracted output. If images are included, they are stored in the embedded_images_assembly folder as .png files.
- Whether generic placeholder text or the text extracted by OCR from the image appears in the Markdown and JSON output formats.
- Whether the image is verbalized by describing it in natural language. For example, an image of a cat may be verbalized as "The image displays a cat resting on the floor."
To extract embedded images together with text that describes the images, specify the following parameter in the API request body:
"parameters": {
  "create_embedded_images": "enabled_verbalization"
}
The following table provides details about the different modes you can use in your API request to extract embedded images:
Mode | Image (in bytes) in output | Markdown output details | JSON output details |
---|---|---|---|
disabled | No | None | List of token IDs that represent the text in the image |
enabled_placeholder | ✓ | Link to image location | • Image • List of token IDs that represent the text in the image |
enabled_text | ✓ | Text is extracted from the image | • Image • List of token IDs that represent the text in the image |
enabled_verbalization | ✓ | • Link to image location • Textual description of the image | • Image • List of token IDs that represent the text in the image |
enabled_verbalization_all | ✓ | • Link to image location • Textual description of the image | • Image • List of token IDs that represent the text in the image |
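The mode table can be read as a lookup from a desired Markdown result to the modes that produce it. The following Python sketch encodes only what the table states; the dictionary layout, the feature labels ("link", "text", "description"), and the function name modes_with_markdown are hypothetical choices made for illustration.

```python
# Each mode from the table above, with whether the image is stored in the
# output and what the Markdown output contains for the image.
EMBEDDED_IMAGE_MODES = {
    "disabled": {"image_in_output": False, "markdown": []},
    "enabled_placeholder": {"image_in_output": True, "markdown": ["link"]},
    "enabled_text": {"image_in_output": True, "markdown": ["text"]},
    "enabled_verbalization": {
        "image_in_output": True,
        "markdown": ["link", "description"],
    },
    "enabled_verbalization_all": {
        "image_in_output": True,
        "markdown": ["link", "description"],
    },
}


def modes_with_markdown(feature):
    """List the modes whose Markdown output includes the given feature."""
    return [
        mode
        for mode, details in EMBEDDED_IMAGE_MODES.items()
        if feature in details["markdown"]
    ]


if __name__ == "__main__":
    # Modes that add a natural-language description of the image.
    for mode in modes_with_markdown("description"):
        print({"parameters": {"create_embedded_images": mode}})
```

The table does not say how enabled_verbalization and enabled_verbalization_all differ, so the sketch treats them identically.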
Extracting text in key-value pairs
You can choose to extract text as key-value pairs from documents that contain domain-specific structured data. The extracted text is stored in a format where each piece of data (the value) is associated with a unique identifier (the key). Key-value pair data is extracted by using a general-purpose foundation model or a model that is tuned for specific document formats.
Based on the contents of your input document, you can extract key-value pair data with one of the following methods:
- Generic key-value pair extraction
- The generic extraction process identifies and extracts all key-value pairs in a document. This method is useful for extracting labeled information without needing to know details about specific fields in advance.
- Schema-based (Fixed) extraction
- The schema-based process targets specific, pre-defined fields in documents by using built-in schemas for common document types like invoices, utility bills, passports, and more. Every page is classified into one of the supported schema types. Based on the classification, text is extracted into the key-value pair format defined in the schema for the specific document type. By classifying the document first, this method increases accuracy for known document types without requiring dedicated model training.
For example, if you want to extract text as key-value pair data by using a model tuned for invoices, specify the following parameter in the API request body:
"parameters": {
  "kvp_mode": "invoice"
}
If you do not specify the kvp_mode parameter in your text extraction API request, labeled data in your document is not stored in a key-value pair format in the extracted output.
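The parameters described in this topic can be combined in one parameters object. The following Python sketch collects whichever options are set and omits the rest, so an unset kvp_mode is absent from the fragment. The function name build_parameters is hypothetical, the parameter names and values are the ones shown in this topic, and a complete request body needs additional fields that are documented in the API reference rather than here.

```python
import json


def build_parameters(requested_outputs=None, mode=None, languages=None,
                     ocr_mode=None, create_embedded_images=None,
                     kvp_mode=None):
    """Collect the options that were set into one `parameters` object.

    Hypothetical helper for illustration; it does not call the service.
    """
    options = {
        "requested_outputs": requested_outputs,
        "mode": mode,
        "languages": languages,
        "ocr_mode": ocr_mode,
        "create_embedded_images": create_embedded_images,
        "kvp_mode": kvp_mode,
    }
    # Drop anything the caller did not set so that it is absent from the request.
    return {"parameters": {k: v for k, v in options.items() if v is not None}}


if __name__ == "__main__":
    fragment = build_parameters(
        requested_outputs=["md"],
        mode="standard",
        kvp_mode="invoice",
    )
    print(json.dumps(fragment, indent=2))
```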
Key-value pairs extraction modes
You can specify one of the following modes in your API request to extract key-value pair data from your document:
- invoice
  Extract text from an invoice with a specialized model in a key-value pair format. The model is trained with datasets that contain various invoices.
  For details about the schema in which the key-value pairs are stored in this mode, see Invoice schema.
- ubill
  Extract text from a utility bill with a specialized model in a key-value pair format. The model is trained with datasets that contain various utility bills.
  For details about the schema in which the key-value pairs are stored in this mode, see Utility bill schema.
- generic_with_semantic
  Extract generic labeled data and domain-specific data with a general-purpose model into a key-value pair format. Domain-specific data extracted from several common document types is stored in pre-defined schemas. The foundation model generates key-value pairs from the extracted text based on the provided schema. The pixtral-12b model is used in this mode.
  Restriction: The generic_with_semantic mode setting is not available in the Toronto and Sydney regions.
  The following document types use pre-defined schemas:
- Mortgage lending document
- Bill of lading
- Customs form
- Delivery receipt
- Expense report
- Receipt
- Purchase order
- Tax form
- Financial statement
- Remittance or Payment Advice
- Bank statement
- Credit card statement
- Driver's license
- Passport
- National ID card
- W-4 form
- I-9 form
- Patient intake form
- Insurance claim
- Transcript
- Diploma or certification
- Life insurance standard disability claim form
- Standard life insurance authorization form
- Association for Cooperative Operations Research and Development (ACORD) standardized insurance form
- Claimant's statement - death claim form
- Business license and permit
If your documents contain unique structured content, you can provide a custom schema that defines specific data and unique identifiers. When you specify a custom schema, the text extraction process overrides the pre-defined common document schemas and uses only the schema you provide.
You can provide a custom schema for key-value pair extraction by specifying the semantic_config parameter in your API request. For more information about how to configure custom schema parameters, see the watsonx.ai API reference documentation.