Text extraction parameters

Last updated: May 13, 2025

When you submit a text extraction request by using the watsonx.ai REST API, you include a payload that specifies configuration details for the text extraction operation.

Choose the text extraction parameters that meet your requirements and set them in the REST API request body.

For details about the different parameters you can set to customize your text extraction REST API request, see the watsonx.ai API reference documentation.

Specifying the output file format

By default, the extracted text is written in plain text. If you want the extracted text to be written in another format, like Markdown, specify the following parameter in the API request body:

"parameters": {
  "requested_outputs": [
    "md"
  ]
}

The following table provides details about the different output formats generated by the text extraction process when you specify the requested_outputs parameter in your API request:

Requested output formats in the text extraction API
Requested output | Generated file type | Description
---------------- | ------------------- | -----------
md               | Markdown file       | Extract text into a Markdown file
html             | HTML file           | Extract text in HTML format
plain_text       | Plain text file     | Extract all information into an unstructured text representation
assembly         | Assembly file       | Extract text into a JSON format
page_images      | Serialized images   | Extract each page of the document into a separate image
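
As a minimal sketch, the parameters fragment for output formats can be assembled and checked in Python before it is added to a request body. The helper name build_output_parameters is hypothetical and not part of any IBM library; the allowed values are the ones listed in the table above.

```python
import json

# Output format values listed in the table above.
SUPPORTED_OUTPUTS = {"md", "html", "plain_text", "assembly", "page_images"}


def build_output_parameters(requested_outputs):
    """Return the "parameters" fragment for the requested output formats."""
    unknown = [fmt for fmt in requested_outputs if fmt not in SUPPORTED_OUTPUTS]
    if unknown:
        raise ValueError(f"Unsupported output format(s): {unknown}")
    return {"parameters": {"requested_outputs": list(requested_outputs)}}


if __name__ == "__main__":
    # Request both a Markdown file and the JSON assembly file.
    print(json.dumps(build_output_parameters(["md", "assembly"]), indent=2))
```

Validating locally like this catches a misspelled format name before the request is submitted.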

Processing mode

You can control the speed at which your text extraction request is processed by setting the mode parameter in your API request.

"parameters": {
  "mode": "standard"
}

The high-quality processing mode preserves all data structures in your document but can take longer to process than the standard mode. In the standard mode, the extraction request completes faster but generates lower-quality output that might lack details.

For details about the different processing modes, see the watsonx.ai API reference documentation.

Supported languages

If your document is in a language other than English, you must specify the language by its ISO 639 language code in the languages parameter of your API request.

"parameters": {
  "languages": [
    "de"
  ]
}

If the document has a mix of languages, list each language separately.

Note: You cannot extract text from a mixed-language document when the languages do not share a common script. However, you can use documents with a mix of English and one other language in any script.

For example, you can extract text from images in a document with a mix of English and French text because both languages use the Latin script. However, you cannot extract text from images in a document with a mix of Japanese and French text because the two languages use different scripts and neither is English.
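
The mixed-language rule can be expressed as a short check. The following Python sketch is illustrative only: the function name can_extract_mixed is hypothetical, and the script lookup covers just a few languages taken from the machine-printed languages table in this topic.

```python
# Script for a few language codes, copied from the machine-printed languages table.
SCRIPT_BY_LANGUAGE = {
    "en": "Latin",
    "fr": "Latin",
    "de": "Latin",
    "ru": "Cyrillic",
    "hi": "Devanagari",
    "ja": "Japanese",
}


def can_extract_mixed(languages):
    """Return True if the language mix satisfies the rule in the note above."""
    scripts = {SCRIPT_BY_LANGUAGE[code] for code in languages}
    if len(scripts) <= 1:
        return True  # all languages share a common script
    others = {code for code in languages if code != "en"}
    # English plus exactly one other language is allowed in any script.
    return "en" in languages and len(others) == 1


if __name__ == "__main__":
    print(can_extract_mixed(["en", "fr"]))  # English and French
    print(can_extract_mixed(["ja", "fr"]))  # Japanese and French
```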

The language code you specify differs based on whether your document contains machine-printed text or handwriting.

Supported handwritten languages

If your document contains text in English handwriting, use the en_hw language code in your API request body.

Supported machine-printed languages

The following table provides details about the languages supported by the text extraction API for printed text recognition:

Note: If your document language does not have an ISO 639 language code listed, use the API script code.
Machine-printed languages supported in the text extraction API
Language ISO 639 language code API script code Script
Acehnese latn Latin
Afrikaans af latn Latin
Albanian sq latn Latin
Araucanian/Mapuche latn Latin
Awadhi deva Devanagari
Aymara ay latn Latin
Balinese latn Latin
Baso Minangkabau latn Latin
Basque eu latn Latin
Belarusian be cyrl Cyrillic
Bemba latn Latin
Bikol latn Latin
Bislama bi latn Latin
Bhojpuri deva Devanagari
Bulgarian bg cyrl Cyrillic
Catalan ca latn Latin
Cebuano latn Latin
Chechen cyrl Cyrillic
Chinese (Simplified) zh_cn cjk Han (Simplified)
Chinese (Traditional) zh_tw cjk Han (Traditional)
Choctaw latn Latin
Cree cr latn Latin
Dakota latn Latin
Danish da latn Latin
Dogri deva Devanagari
Dutch nl latn Latin
English en latn Latin
Estonian et latn Latin
Fijian fj latn Latin
Filipino fil latn Latin
Finnish fi latn Latin
French fr latn Latin
Galician gl latn Latin
Gayo latn Latin
German de latn Latin
Gilbertese latn Latin
Greek el el Greek
Haitian Creole ht latn Latin
Hebrew he he Hebrew
Hiligaynon latn Latin
Hindi hi deva Devanagari
Iban latn Latin
Iloko latn Latin
Indonesian id latn Latin
Irish ga latn Latin
Italian it it Latin
Japanese ja cjk Japanese
Javanese jv latn Latin
Kachin latn Latin
Kalaallisut kl latn Latin
Kanienʼkéha latn Latin
Khasi latn Latin
Kinyarwanda rw latn Latin
Konkani deva Devanagari
Kongo kg latn Latin
Korean ko cjk Korean
Kosraean latn Latin
Kuanyama kj latn Latin
Latin la latn Latin
Lozi latn Latin
Low German latn Latin
Luo latn Latin
Malagasy mg latn Latin
Maithili deva Devanagari
Manx gv latn Latin
Marathi mr deva Devanagari
Middle English latn Latin
Mittelhochdeutsch latn Latin
Macedonian mk cyrl Cyrillic
Ndonga ng latn Latin
Nepali ne deva Devanagari
North Ndebele nd latn Latin
Norwegian no no Latin
Nyankole latn Latin
Occitan oc latn Latin
Ojibwa oj latn Latin
Old English latn Latin
Old French latn Latin
Old High German latn Latin
Old Norse latn Latin
Old Provençal latn Latin
Pampanga latn Latin
Pangasinan latn Latin
Papiamento latn Latin
Polish pl latn Latin
Portuguese pt pt Latin
Quechua qu latn Latin
Romansh rm latn Latin
Rundi rn latn Latin
Russian ru cyrl Cyrillic
Sango sg latn Latin
Sanskrit sa deva Devanagari
Scots latn Latin
Serbian sr cyrl Cyrillic
Shona sn latn Latin
Spanish es es Latin
Sundanese su latn Latin
Swahili sw latn Latin
Swati ss latn Latin
Swedish sv sv Latin
Tamil ta deva Tamil
Telugu te deva Telugu
Tsonga ts latn Latin
Tswana tn latn Latin
Ukrainian uk cyrl Cyrillic
Uzbek uz cyrl Cyrillic
Note: If you want to process Uzbek language documents written in a Latin script, use the latn API script code.
Xhosa xh latn Latin
Zulu zu latn Latin
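
The fallback described in the note before the table (use the ISO 639 code when one is listed, otherwise the API script code) can be sketched as a lookup. The function name language_code is hypothetical, and the dictionary holds only a handful of rows from the table for illustration.

```python
# A few rows from the table above: language -> (ISO 639 code or None, API script code).
LANGUAGE_ROWS = {
    "Afrikaans": ("af", "latn"),
    "Acehnese": (None, "latn"),
    "Awadhi": (None, "deva"),
    "Chechen": (None, "cyrl"),
    "German": ("de", "latn"),
    "Russian": ("ru", "cyrl"),
}


def language_code(language):
    """Return the value to use in the languages parameter for a table row."""
    iso_code, script_code = LANGUAGE_ROWS[language]
    # Prefer the ISO 639 code; fall back to the script code when none is listed.
    return iso_code if iso_code is not None else script_code


if __name__ == "__main__":
    for name in ("German", "Acehnese", "Chechen"):
        print(name, "->", language_code(name))
```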

Extracting text from images

You can specify how to process text in images in your document by using optical character recognition (OCR). Specify the following parameter in the API request body:

"parameters": {
  "ocr_mode": "enabled"
}

For details about the different OCR modes, see the watsonx.ai API reference documentation.

You can also configure how to process images embedded in your document and convert them to Markdown and JSON formats.

The embedded image is the area on a page of the document that represents only the picture without including portions of the page that contain text or tables. Text and tables in the original document are processed with OCR. The embedded images extraction mode is used to specify how to serialize images in the document and preserve them in the extracted output.

Based on the embedded images extraction mode you specify, you can choose how embedded images are represented in the output:

  • Whether to include images in the extracted output. If images are included, they are stored in the embedded_images_assembly folder as .png files.
  • Whether generic placeholder text or the text extracted by OCR from the image appears in the Markdown and JSON output formats.
  • Whether the image is verbalized, that is, described in natural language. For example, an image of a cat might be verbalized as "The image displays a cat resting on the floor."

To extract embedded images including text that describes the images, specify the following parameter in the API request body:

"parameters": {
  "create_embedded_images": "enabled_verbalization"
}

The following table provides details about the different modes you can use in your API request to extract embedded images:

Embedded images extraction modes in the text extraction API
Mode | Image (in bytes) in output | Markdown output details | JSON output details
---- | -------------------------- | ----------------------- | -------------------
disabled | No | None | List of token IDs that represent the text in the image
enabled_placeholder | Yes | Link to image location | Image; list of token IDs that represent the text in the image
enabled_text | Yes | Text is extracted from the image | Image; list of token IDs that represent the text in the image
enabled_verbalization | Yes | Link to image location; textual description of the image | Image; list of token IDs that represent the text in the image
enabled_verbalization_all | Yes | Link to image location; textual description of the image | Image; list of token IDs that represent the text in the image
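
To pick a mode by what you need in the Markdown output, the table can be treated as a lookup. This is a small Python sketch; the names MARKDOWN_OUTPUT_BY_MODE and modes_with are hypothetical, and the entries restate the Markdown output details column of the table above.

```python
# Markdown output details per mode, restated from the table above.
MARKDOWN_OUTPUT_BY_MODE = {
    "disabled": [],
    "enabled_placeholder": ["Link to image location"],
    "enabled_text": ["Text is extracted from the image"],
    "enabled_verbalization": [
        "Link to image location",
        "Textual description of the image",
    ],
    "enabled_verbalization_all": [
        "Link to image location",
        "Textual description of the image",
    ],
}


def modes_with(detail):
    """List the modes whose Markdown output includes the given detail."""
    return [mode for mode, details in MARKDOWN_OUTPUT_BY_MODE.items() if detail in details]


if __name__ == "__main__":
    # Modes that add a natural-language description of each embedded image.
    print(modes_with("Textual description of the image"))
```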

Extracting text in key-value pairs

You can choose to extract text as key-value pairs from documents that contain domain-specific structured data. The extracted text is stored in a format where each piece of data (the value) is associated with a unique identifier (the key). Key-value pair data is extracted by using a general-purpose foundation model or a model that is tuned for specific document formats.

Note: Key-value pair data extraction is only supported for English language documents.

Based on the contents of your input document, you can extract key-value pair data with one of the following methods:

Generic key-value pair extraction
The generic extraction process identifies and extracts all key-value pairs in a document. This method is useful for extracting labeled information without needing to know details about specific fields in advance.
Schema-based (Fixed) extraction
The schema-based process targets specific, pre-defined fields in documents by using built-in schemas for common document types like invoices, utility bills, passports, and more. Every page is classified into one of the supported schema types. Based on the classification, text is extracted into the key-value pair format defined in the schema for the specific document type. By classifying the document first, this method increases accuracy for known document types without requiring dedicated model training.

For example, if you want to extract text as key-value pair data by using a model tuned for invoices, specify the following parameter in the API request body:

"parameters": {
  "kvp_mode": "invoice"
}

If you do not specify the kvp_mode parameter in your text extraction API request, labeled data in your document is not stored in a key-value pair format in the extracted output.
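
The individual parameters described in this topic can be combined into a single parameters object. The following Python sketch shows one way to assemble it; the helper build_parameters is hypothetical, passing languages as a list is an assumption, and treating only the en code as English is a simplification. The rest of the request body is not covered here; see the watsonx.ai API reference documentation for the required fields.

```python
import json


def build_parameters(requested_outputs=None, mode=None, languages=None,
                     ocr_mode=None, create_embedded_images=None, kvp_mode=None):
    """Combine the parameters described in this topic into one request fragment."""
    # Key-value pair extraction is documented as English-only.
    # Assumption: only the "en" code is treated as English here.
    if kvp_mode is not None and languages and set(languages) != {"en"}:
        raise ValueError("kvp_mode is only supported for English documents")
    options = {
        "requested_outputs": requested_outputs,
        "mode": mode,
        "languages": languages,
        "ocr_mode": ocr_mode,
        "create_embedded_images": create_embedded_images,
        "kvp_mode": kvp_mode,
    }
    # Leave out anything that was not set so that the service defaults apply.
    return {"parameters": {k: v for k, v in options.items() if v is not None}}


if __name__ == "__main__":
    body = build_parameters(requested_outputs=["md"], mode="standard",
                            ocr_mode="enabled", kvp_mode="invoice")
    print(json.dumps(body, indent=2))
```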

Key-value pairs extraction modes

You can specify one of the following modes in your API request to extract key-value pair data from your document:
