Last updated: February 21, 2025

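Text that you extract from document or image files by using the text extraction API is written to a JSON file that contains details about the various text and visual elements in the document. You can process the generated JSON further to extract the information that you need.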

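When you extract text by using the text extraction API, the output always contains the same root-level JSON objects: metadata, top_level_structures, and all_structures. The structures within these root objects are optional and might or might not be returned in the output.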

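The styles key is another root-level object that is commonly returned, but it is optional. The styles object contains a list of dictionaries, and each dictionary contains details about a font that is used in the document, such as the font size and font style.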
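As a minimal sketch of how the root objects might be read, the following Python snippet parses a small hand-written sample and indexes the optional styles list by style_id. The sample values and the STYLE_1 ID format are assumptions made for illustration; only the key names come from the schema at the end of this topic.

```python
import json

# Hypothetical sample output; the values and the STYLE_1 ID format are made up.
sample_json = """
{
  "metadata": {"num_pages": 1},
  "styles": [
    {"style_id": "STYLE_1", "font_size": "12.0", "font_name": "Helvetica",
     "is_bold": "false", "is_italic": "false"}
  ],
  "top_level_structures": [],
  "all_structures": {}
}
"""
output = json.loads(sample_json)

# styles is optional, so fall back to an empty list when the key is absent.
styles_by_id = {style["style_id"]: style for style in output.get("styles", [])}

for style_id, style in styles_by_id.items():
    print(style_id, style.get("font_name"), style.get("font_size"))
```

Indexing the styles by ID makes it cheap to look up the font details for any token that carries a style_id.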

You can write code to extract text from the structures that interest you. For details, see the following sections:




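  • num_pages: Number of pages in the document.
  • title: Title of the document.
  • keywords: Keywords that are associated with the document.
  • author: Author of the document.
  • publication_date: Date when the document was created or published.
  • subject: Subject of the document.
  • charset: Character set standard that is used for the document.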


  "title":"Put AI to work for HR and talent transformation for the retail industry",
  "subject":"Apply AI capabilities to drive your HR and talent transformation and generate better business outcomes in the retail industry.",



  • top_level_structures: List of the IDs of the top-level data structures.
  • all_structures: Lists of all of the data structure types.

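The all_structures key contains a list for every type of data structure that can occur in a parsed document. These structures are optional, so they might or might not be included in the output. A parsed document can contain the following data structures: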

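  • sections: List of all of the sections in the document.
  • section_titles: List of the section titles that were detected.
  • lists: Collection of all of the lists in the document.
  • list_items: Collection of the list items that are present in the detected list objects.
  • list_identifiers: Collection of the list identifiers for the detected list objects.
  • tables: List of all of the tables in the document.
  • table_rows: List of the table rows that are present in the detected tables.
  • table_cells: List of the table cells that are present in the detected table rows.
  • tokens: List of plain text tokens.
  • subscripts: List of instances of subscript text that is associated with tokens detected in the document.
  • superscripts: List of instances of superscript text that is associated with tokens detected in the document.
  • footnotes: List of footnotes.
  • paragraphs: List of paragraphs.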
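Because every structure type is optional, it can help to check which types a particular output actually contains before you process it. The following sketch counts the entries per type; the all_structures sample, its IDs, and its text are hypothetical.

```python
# Structure types that are listed in this section.
STRUCTURE_TYPES = [
    "sections", "section_titles", "lists", "list_items", "list_identifiers",
    "tables", "table_rows", "table_cells", "tokens", "subscripts",
    "superscripts", "footnotes", "paragraphs",
]

# Hypothetical all_structures object in which only two optional types are present.
all_structures = {
    "paragraphs": [{"id": "PARA_1", "children_ids": ["TOKEN_1", "TOKEN_2"]}],
    "tokens": [
        {"id": "TOKEN_1", "parent_id": "PARA_1", "text": "Hello"},
        {"id": "TOKEN_2", "parent_id": "PARA_1", "text": "world"},
    ],
}

# A missing key counts as zero entries instead of raising KeyError.
counts = {name: len(all_structures.get(name, [])) for name in STRUCTURE_TYPES}
present = [name for name, count in counts.items() if count > 0]
print(present)
```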




cat output_retail.json | jq '.metadata.num_pages'
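The same lookup as the jq command can be done in Python. This is a sketch that uses a hypothetical inline metadata object instead of the output file; fields that are absent are read as None because most metadata fields are optional.

```python
import json

# Hypothetical metadata object; in practice, load it from the output file.
sample_json = '{"metadata": {"num_pages": 4, "title": "Sample title"}}'
metadata = json.loads(sample_json)["metadata"]

# Field names as listed for the metadata object.
METADATA_FIELDS = [
    "num_pages", "title", "keywords", "author",
    "publication_date", "subject", "charset",
]

# dict.get returns None for any optional field that is missing.
summary = {field: metadata.get(field) for field in METADATA_FIELDS}
print(summary["num_pages"])
```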

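For some structures, such as tables and lists, the extracted text is stored in several different objects within the generated JSON. You can use code to traverse the objects and extract the text that interests you.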
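One possible way to traverse the objects is to index every structure by its ID and then follow the children_ids references recursively. The sketch below assumes that IDs are unique across all structure types; the helper names build_index and collect_text, the SECTION_1 ID, and the sample text are hypothetical.

```python
# Hypothetical output fragment: one section with one paragraph of two tokens.
sample = {
    "top_level_structures": ["SECTION_1"],
    "all_structures": {
        "sections": [{"id": "SECTION_1", "children_ids": ["PARA_1"]}],
        "paragraphs": [{"id": "PARA_1", "parent_id": "SECTION_1",
                        "children_ids": ["TOKEN_1", "TOKEN_2"]}],
        "tokens": [
            {"id": "TOKEN_1", "parent_id": "PARA_1", "text": "Hello"},
            {"id": "TOKEN_2", "parent_id": "PARA_1", "text": "world"},
        ],
    },
}

def build_index(all_structures):
    """Map every structure ID to its structure, across all structure types."""
    return {item["id"]: item for items in all_structures.values() for item in items}

def collect_text(structure_id, index):
    """Return the text of a structure and its descendants in document order."""
    node = index[structure_id]
    parts = [node["text"]] if "text" in node else []
    # children_ids is optional for several structure types.
    for child_id in node.get("children_ids", []):
        parts.extend(collect_text(child_id, index))
    return parts

index = build_index(sample["all_structures"])
for top_id in sample["top_level_structures"]:
    print(top_id, " ".join(collect_text(top_id, index)))
```

A dictionary index avoids rescanning the flattened structure lists for every ID lookup, which matters for large documents.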





Screenshot of a PDF in which the sentence "Collect organize grow data" is highlighted.

//The section is listed in the top_level_structures array.

//The section has a list of paragraphs.

//The paragraph contains a section title.

//Token IDs listed for the section title.

//Consecutive tokens with a shared parent_id contain the text from the sentence of interest.
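The last point can be sketched in Python by grouping consecutive tokens that share a parent_id. The token list is a hypothetical sample: the IDs are made up, and the token boundaries are assumed to fall on word boundaries.

```python
from itertools import groupby

# Hypothetical tokens; IDs and token boundaries are assumptions.
tokens = [
    {"id": "TOKEN_1", "parent_id": "SECTION_TITLE_1", "text": "Collect"},
    {"id": "TOKEN_2", "parent_id": "SECTION_TITLE_1", "text": "organize"},
    {"id": "TOKEN_3", "parent_id": "SECTION_TITLE_1", "text": "grow"},
    {"id": "TOKEN_4", "parent_id": "SECTION_TITLE_1", "text": "data"},
    {"id": "TOKEN_5", "parent_id": "PARA_2", "text": "Next"},
    {"id": "TOKEN_6", "parent_id": "PARA_2", "text": "paragraph"},
]

# groupby only merges neighbors, so each run of tokens with the same
# parent_id becomes one (parent_id, text) pair.
runs = [
    (parent_id, " ".join(token["text"] for token in group))
    for parent_id, group in groupby(tokens, key=lambda token: token["parent_id"])
]

for parent_id, text in runs:
    print(parent_id, text)
```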


When you submit a PDF file with images or an image file to the text extraction method of the watsonx.ai API, the text that is extracted from the images is represented as tokens. The tokens are typically contained in a paragraph or section object.


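The following JSON excerpt shows how a PNG file that was submitted to the text extraction method is represented in the JSON output. The paragraph object that contains the text tokens is available from both the top_level_structures object and the all_structures root object.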


The extracted text is specified in the tokens within the paragraph. The following tokens represent the words The AI Ladder® from the image:




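  • lists: A set of list items that are formatted as a bulleted or numbered list.
  • list_items: A single item in a list, which can contain tokens with text, paragraphs, and nested lists.
  • list_identifiers: Contains tokens with the symbol that identifies a list item, such as a hyphen or a number.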



//The lists object contains the list where the list item is located.

//The list_item object contains the list item which contains a list ID followed by several tokens.

//The list_identifiers object contains list IDs with tokens.

//The list ID token includes a token with a hyphen.

//The list item tokens include the text *Providing transparency* in them.


# Import required libraries
import json
import numpy as np
import pandas as pd

# Define helper functions

## Function, which finds entry in collection by key-value pair
def find_by_key(key: str, value, collection: list, unique=True):
  find = list(filter(lambda x: x[key] == value, collection))
  if unique:
    if len(find) > 1:
      raise ValueError(f"Found non-unique key-value pair.\n{find}")
    return find[0]
  # When unique is False, return every matching entry
  return find

## Function, which flattens iterable collection of dicts
def flatten_collection(collection):
  result = []
  for val in collection.values():
    # Each value is a list of structures; add them all to one flat list
    result.extend(val)
  return result

# Load the file with the extracted text
with open("/Users/janedoe/Downloads/output_retail.json") as f:
  raw_output = json.load(f)

# Get all list-related structures
all_lists = raw_output['all_structures']['lists']
all_list_items = raw_output['all_structures']['list_items']
all_list_identifiers = raw_output['all_structures']['list_identifiers']

# Get all list items from the first list in the file
list_1 = all_lists[0]
list_1_items = []

for list_item_id in list_1['children_ids']:
  list_1_items.append(find_by_key('id', list_item_id, all_list_items))

# Reconstruct the list
recon_list = []

flat_col = flatten_collection(raw_output['all_structures'])
for list_item in list_1_items:
  val = []
  for list_value_id in list_item['children_ids']:
    list_value = find_by_key('id', list_value_id, flat_col)
    if list_value['id'].startswith("LIST_ID"):
      # List identifier: collect the text of its tokens, such as a hyphen
      for list_id_value_id in list_value['children_ids']:
        list_id_value = find_by_key('id', list_id_value_id, flat_col)
        if 'text' in list_id_value:
          val.append(list_id_value['text'])
    elif list_value['id'].startswith("PARA"):
      # Paragraph: collect the text of the tokens that it contains
      for para_value_id in list_value['children_ids']:
        para_value = find_by_key('id', para_value_id, flat_col)
        if 'text' in para_value:
          val.append(para_value['text'])
    elif list_value['id'].startswith("TOKEN"):
      # Token that sits directly under the list item
      val.append(list_value['text'])
  print(' '.join(val))



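  • tables: Each table is associated with multiple table rows.
  • table_rows: Each table row is associated with multiple table cells.
  • table_cells: Each table cell contains a sequence of tokens, a mixed sequence of paragraphs and tokens, or a mixed sequence of lists, paragraphs, and tokens.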



//The all_structures root object contains the table, which has many rows.

//A separate table rows array contains table cells.

//One of the table cells is identified as a column header and contains a paragraph.

//The paragraph has a token.

//The token contains the text *Workflows*.


# Import required libraries
import json
import numpy as np
import pandas as pd

# Define helper functions
## Function, which finds entry in collection by key-value pair
def find_by_key(key: str, value, collection: list, unique=True):
  find = list(filter(lambda x: x[key] == value, collection))
  if unique:
    if len(find) > 1:
      raise ValueError(f"Found non-unique key-value pair.\n{find}")
    return find[0]
  # When unique is False, return every matching entry
  return find

## Function, which flattens iterable collection of dicts
def flatten_collection(collection):
  result = []
  for val in collection.values():
    # Each value is a list of structures; add them all to one flat list
    result.extend(val)
  return result

# Load the file with the extracted text
with open("/Users/janedoe/Downloads/output_retail.json") as f:
  raw_output = json.load(f)

# Get all table-related structures
all_tables = raw_output['all_structures']['tables']
all_table_rows = raw_output['all_structures']['table_rows']
all_table_cells = raw_output['all_structures']['table_cells']

# Get all of the cells from the first table
table_1 = all_tables[0]
table_1_cells = []

for row_id in table_1['children_ids']:
  row = find_by_key('id', row_id, all_table_rows)
  for cell_id in row['children_ids']:
    table_1_cells.append(find_by_key('id', cell_id, all_table_cells))

# Reconstruct the first table
last_col = table_1_cells[-1]['col_start']
last_row = table_1_cells[-1]['row_start']

recon_table = np.empty([last_row, last_col], dtype=object)

flat_col = flatten_collection(raw_output['all_structures'])
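# Iterate over the cells of the first table only, so that cells from
# other tables cannot overwrite entries or fall outside the array
for cell in table_1_cells:
  cell_col, cell_row = cell['col_start'], cell['row_start']
  entries = []
  # children_ids is optional for a table cell, so default to an empty list
  for cell_value in cell.get('children_ids', []):
    value = find_by_key('id', cell_value, flat_col)
    if 'text' in value:
      entries.append(value['text'])
    for cell_entry in value.get('children_ids', []):
      entry = find_by_key('id', cell_entry, flat_col)
      if 'text' in entry:
        entries.append(entry['text'])
  recon_table[cell_row-1][cell_col-1] = " ".join(entries)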

pd.DataFrame(data=recon_table[1:,:], columns=recon_table[0,:])




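Structures whose description mentions Not required might be removed in future iterations of the schema. If you choose to reference optional structures in your code, you might need to update your code later when the schema changes.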
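One way to limit the impact of such changes is to read optional fields defensively. The following sketch reads the optional style_id and bbox fields of a token and tolerates their absence; the helper name page_of and all sample values are hypothetical, and the field names are taken from the schema that follows.

```python
# Hypothetical tokens: the first carries the optional fields, the second does not.
tokens = [
    {"id": "TOKEN_1", "parent_id": "PARA_1", "text": "Hello", "style_id": "STYLE_1",
     "bbox": {"page_number": 1, "x": 72.0, "y": 90.5, "width": 30.0, "height": 12.0}},
    {"id": "TOKEN_2", "parent_id": "PARA_1", "text": "world"},
]

def page_of(token):
    """Return the page number from the optional bbox, or None when it is absent."""
    return (token.get("bbox") or {}).get("page_number")

pages = [page_of(token) for token in tokens]
style_ids = [token.get("style_id") for token in tokens]
print(pages, style_ids)
```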

{ "$defs": {
    "AssemblyJsonOutput": {
      "type": "object",
      "properties": {
        "metadata": {
          "description": "Metadata about this document.",
          "$ref": "#/$defs/Metadata"
        },
        "styles": {
          "description": "Font styles used in this document. Not required.",
          "type": "array",
          "items": {
            "$ref": "#/$defs/Style"
          }
        },
        "top_level_structures": {
          "type": "array",
          "description": "Array of ids of the top level structures which belong directly under the document",
          "items": {
            "type": "string"
          }
        },
        "all_structures": {
          "type": "object",
          "description": "An object containing lists of all structures identified in this document.",
          "$ref": "#/$defs/Structures"
        }
      },
      "required": [
        "metadata",
        "top_level_structures",
        "all_structures"
      ]
    },
    "Metadata": {
      "type": "object",
      "additionalProperties": true,
      "title": "Metadata",
      "properties": {
        "num_pages": {
          "type": "integer",
          "description": "Total number of pages in the document"
        },
        "title": {
          "type": "string",
          "description": "Document title as obtained from source document. Not required."
        },
        "keywords": {
          "type": "string",
          "description": "Keywords associated with document. Not required."
        },
        "author": {
          "type": "string",
          "description": "Author of the document. Not required."
        },
        "publication_date": {
          "type": "string",
          "description": "Best effort bases for a publication date (may be the creation date). Not required."
        },
        "subject": {
          "type": "string",
          "description": "Subject as obtained from the source document. Not required."
        },
        "charset": {
          "type": "string",
          "description": "Character set used for the output"
        }
      },
      "required": []
    },
    "Style": {
      "type": "object",
      "title": "Style",
      "properties": {
        "style_id": {
          "type": "string",
          "description": "Style Identifier which will be used for reference in other objects"
        },
        "font_size": {
          "type": "string",
          "description": "Font size"
        },
        "font_name": {
          "type": "string",
          "description": "Font name"
        },
        "is_bold": {
          "type": "string",
          "description": "Whether or not the font is bold"
        },
        "is_italic": {
          "type": "string",
          "description": "Whether or not the font is italic"
        }
      }
    },
    "Structures": {
      "type": "object",
      "description": "An object containing all of the flattened structures identified in the document. None of the items in this object are required.",
      "sections": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/Section"
        }
      },
      "section_titles": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/SectionTitle"
        }
      },
      "lists": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/List"
        }
      },
      "list_items": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/ListItem"
        }
      },
      "list_identifiers": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/ListIdentifier"
        }
      },
      "tables": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/Table"
        }
      },
      "table_rows": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/TableRow"
        }
      },
      "table_cells": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/TableCell"
        }
      },
      "subscripts": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/Subscript"
        }
      },
      "superscripts": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/Superscript"
        }
      },
      "footnotes": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/Footnote"
        }
      },
      "paragraphs": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/Paragraph"
        }
      },
      "tokens": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/Token"
        }
      }
    },
    "Section": {
      "type": "object",
      "title": "Section",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the section"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence"
        },
        "section_number": {
          "type": "string",
          "description": "Section identifier identified in the document"
        },
        "section_level": {
          "type": "string",
          "description": "Nesting level of section identified in the document"
        }
      }
    },
    "SectionTitle": {
      "type": "object",
      "title": "SectionTitle",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the section title"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
        },
        "text_alignment": {
          "type": "string",
          "description": "Text alignment of the section title. Not required."
        }
      }
    },
    "List": {
      "type": "object",
      "title": "List",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the list"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence"
        }
      }
    },
    "ListItem": {
      "type": "object",
      "title": "ListItem",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the list item"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
        }
      }
    },
    "ListIdentifier": {
      "type": "object",
      "title": "ListIdentifier",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the list identifier"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence"
        }
      }
    },
    "Table": {
      "type": "object",
      "title": "Table",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the table"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence, in this case, table rows"
        }
      }
    },
    "TableRow": {
      "type": "object",
      "title": "TableRow",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the table row"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence, in this case, table cells"
        }
      }
    },
    "TableCell": {
      "type": "object",
      "title": "TableCell",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the table cell"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "is_row_header": {
          "type": "boolean",
          "description": "Whether the cell is part of row header or not"
        },
        "is_column_header": {
          "type": "boolean",
          "description": "Whether the cell is part of column header or not"
        },
        "col_span": {
          "type": "integer",
          "description": "column span of the cell"
        },
        "row_span": {
          "type": "integer",
          "description": "row span of the cell"
        },
        "col_start": {
          "type": "integer",
          "description": "column start of the cell within the table"
        },
        "row_start": {
          "type": "integer",
          "description": "row start of the cell within the table"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence, in this case, underlying paragraphs. Not required."
        }
      }
    },
    "Subscript": {
      "type": "object",
      "title": "Subscript",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the subscript"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
        },
        "token_id_ref": {
          "type": "string",
          "description": "Id of the token to which the subscript belongs"
        }
      }
    },
    "Superscript": {
      "type": "object",
      "title": "Superscript",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the superscript"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "footnote_ref": {
          "type": "string",
          "description": "Matching footnote id found on the page"
        },
        "token_id_ref": {
          "type": "string",
          "description": "Id of the token to which the superscript belongs"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
        }
      }
    },
    "Footnote": {
      "type": "object",
      "title": "Footnote",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the footnote"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
        }
      }
    },
    "Paragraph": {
      "type": "object",
      "title": "Paragraph",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the paragraph"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence, in this case, tokens. Not required."
        },
        "text_alignment": {
          "type": "string",
          "description": "Text alignment of the paragraph. Not required."
        },
        "indentation": {
          "type": "integer",
          "description": "Paragraph indentation. Not required."
        }
      }
    },
    "Token": {
      "type": "object",
      "title": "Token",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the token"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "style_id": {
          "type": "string",
          "description": "Identifier of the style object associated with this token. Not required."
        },
        "text": {
          "type": "string",
          "description": "Actual text of the token"
        },
        "bbox": {
          "description": "Not required.",
          "$ref": "#/$defs/BoundingBox"
        }
      }
    },
    "BoundingBox": {
      "type": "object",
      "title": "BoundingBox",
      "properties": {
        "page_number": {
          "description": "which page this represents",
          "type": "integer"
        },
        "x": {
          "description": "X coordinate of the bounding box",
          "type": "float"
        },
        "y": {
          "description": "Y coordinate of the bounding box",
          "type": "float"
        },
        "width": {
          "description": "width of the bounding box",
          "type": "float"
        },
        "height": {
          "description": "height of the bounding box",
          "type": "float"
        }
      }
    }
  },
  "$ref": "#/$defs/AssemblyJsonOutput"
}

