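Parsing the JSON structure generated by text extraction
Text that you extract from document or image files by using the text extraction API is written to a JSON file that contains details about the various text and visual elements in the document. You can process the generated JSON further to extract the information that you need.
When you extract text by using the text extraction API, the following JSON objects are always returned: metadata, top_level_structures, and all_structures (the properties that the schema lists as required). The structures inside these root objects are optional and might or might not be returned in the output.
The styles key is another root-level object that is often returned, but it is optional. The styles object contains a list of dictionaries, where each dictionary has details about a font that is used in the document, such as the font size and font style.
You can write code to extract text from the structures that interest you. For details, see the sections that follow.
For details about the JSON schema, see Text extraction JSON schema.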
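Metadata
The metadata key is a dictionary that stores the following metadata details about the processed document:
num_pages: Number of pages in the document.
title: Title of the document.
keywords: Keywords that are associated with the document.
author: Author of the document.
publication_date: Date when the document was created or published.
subject: Subject of the document.
charset: Character set standard that is used in the document.
The following JSON output is an example of the metadata object structure for a PDF file.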
"metadata":{
"num_pages":28,
"title":"Put AI to work for HR and talent transformation for the retail industry",
"keywords":"",
"author":"IBM",
"publication_date":"",
"subject":"Apply AI capabilities to drive your HR and talent transformation and generate better business outcomes in the retail industry.",
"charset":"UTF-8"
}
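As a minimal sketch, the metadata object can be read in Python as follows. The inline sample mirrors the example above (the subject field is left out for brevity); in practice you would load the generated file with json.load. The describe_metadata helper is a hypothetical name, and skipping empty strings is an assumption about how you want to treat fields such as keywords that are returned empty.

```python
import json

# Inline sample that mirrors the metadata example above (subject omitted).
sample = json.loads("""
{"metadata": {"num_pages": 28,
              "title": "Put AI to work for HR and talent transformation for the retail industry",
              "keywords": "", "author": "IBM", "publication_date": "",
              "charset": "UTF-8"}}
""")

def describe_metadata(doc: dict) -> dict:
    """Return only the metadata fields that are present and non-empty."""
    metadata = doc.get("metadata", {})
    return {key: value for key, value in metadata.items() if value not in ("", None)}

summary = describe_metadata(sample)
# Optional fields are read with .get because they might be missing.
print(summary["num_pages"], summary.get("author"), summary.get("keywords", "(none)"))
```

Reading optional fields with .get keeps the code working when a document has no author, keywords, or publication date.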
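Structures
Two keys reference the data structures of the parsed document:
top_level_structures: List of IDs of the top-level data structures.
all_structures: Lists of all data structures, grouped by type.
The all_structures key contains a list for every possible type of data structure in the parsed document. These structures are optional, so they might or might not be included in the output. The parsed document can contain the following data structures:
sections: List of all sections in the document.
section_titles: List of the detected section titles.
lists: Collection of all lists in the document.
list_items: Collection of the list items that are present in the detected list objects.
list_identifiers: Collection of the list identifiers of the detected list objects.
tables: List of all tables in the document.
table_rows: List of the table rows that are present in the detected tables.
table_cells: List of the table cells that are present in the detected table rows.
tokens: List of plain text tokens.
subscripts: List of instances of subscript text that are associated with tokens detected in the document.
superscripts: List of instances of superscript text that are associated with tokens detected in the document.
footnotes: List of footnotes.
paragraphs: List of paragraphs.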
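Because every structure type is optional, a quick inventory of what a given output actually contains can be useful before you write extraction code. The following is a minimal sketch under that assumption; count_structures is a hypothetical helper name, and the sample document uses made-up IDs.

```python
def count_structures(doc: dict) -> dict:
    """Count the entries of each structure type that is present and non-empty."""
    all_structures = doc.get("all_structures", {})
    return {name: len(items) for name, items in all_structures.items() if items}

# Hand-made sample with made-up IDs, for illustration only.
sample = {
    "top_level_structures": ["PARA_1"],
    "all_structures": {
        "sections": [],
        "tables": [],
        "paragraphs": [{"id": "PARA_1", "parent_id": "root", "children_ids": ["TOKEN_1"]}],
        "tokens": [{"id": "TOKEN_1", "parent_id": "PARA_1", "text": "Hello"}],
    },
}

counts = count_structures(sample)
print(counts)
```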
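Working with the extracted JSON
To extract text from the various structures in the generated JSON file, you can use a JSON processor library.
The following command returns the number of pages in the PDF, a value that is stored in a single JSON object:
cat output_retail.json | jq '.metadata.num_pages'
For some structures, such as tables and lists, the extracted text is stored in several objects in the generated JSON. You can use code to traverse the objects and extract the text that interests you.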
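One way to traverse the objects is to first build a lookup table from each structure ID to its object, and then follow the parent_id or children_ids links. The sketch below shows the upward direction. The build_index and path_to_root names are hypothetical, the sample uses made-up IDs, and the code assumes that top-level structures use "root" as their parent_id, as the examples in this topic do.

```python
def build_index(doc: dict) -> dict:
    """Map every structure ID to its object, across all structure types."""
    index = {}
    for items in doc.get("all_structures", {}).values():
        for item in items:
            index[item["id"]] = item
    return index

def path_to_root(structure_id: str, index: dict) -> list:
    """Follow parent_id links upward and return the chain of IDs."""
    path = [structure_id]
    while structure_id in index:
        structure_id = index[structure_id]["parent_id"]
        path.append(structure_id)
    return path

# Hand-made sample with made-up IDs, for illustration only.
doc = {
    "top_level_structures": ["SECTION_1"],
    "all_structures": {
        "sections": [{"id": "SECTION_1", "parent_id": "root", "children_ids": ["PARA_1"]}],
        "paragraphs": [{"id": "PARA_1", "parent_id": "SECTION_1", "children_ids": ["TOKEN_1"]}],
        "tokens": [{"id": "TOKEN_1", "parent_id": "PARA_1", "text": "Hello"}],
    },
}

index = build_index(doc)
chain = path_to_root("TOKEN_1", index)
print(" -> ".join(chain))
```

A dictionary lookup avoids scanning every list each time an ID must be resolved, which matters for documents with many tokens.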
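How paragraphs are represented
A paragraph typically consists of a sequence of tokens, where each token represents a single word.
In some cases, a paragraph is associated with other structures, such as sections or lists.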
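Because a structure lists its tokens in children_ids, joining the token text in that order rebuilds the original wording. The following minimal sketch reuses the section title and token IDs from the JSON output in this section, with the tokens deliberately shuffled to show that the order comes from children_ids. The join_child_tokens helper is a hypothetical name, and joining with a single space is an assumption about word spacing.

```python
def join_child_tokens(structure: dict, tokens: list) -> str:
    """Join the text of the structure's child tokens in children_ids order."""
    by_id = {token["id"]: token for token in tokens}
    words = [by_id[child]["text"]
             for child in structure.get("children_ids", [])
             if child in by_id and "text" in by_id[child]]
    return " ".join(words)

section_title = {
    "id": "SECTION_TITLE_a5e3c2",
    "parent_id": "PARA_09384c",
    "children_ids": ["TOKEN_48bbae", "TOKEN_cc0b9c", "TOKEN_d57d27", "TOKEN_a7d6da"],
}

# Tokens in arbitrary order; style_id and bbox are left out for brevity.
tokens = [
    {"id": "TOKEN_a7d6da", "parent_id": "SECTION_TITLE_a5e3c2", "text": "data"},
    {"id": "TOKEN_48bbae", "parent_id": "SECTION_TITLE_a5e3c2", "text": "Collect,"},
    {"id": "TOKEN_d57d27", "parent_id": "SECTION_TITLE_a5e3c2", "text": "grow"},
    {"id": "TOKEN_cc0b9c", "parent_id": "SECTION_TITLE_a5e3c2", "text": "organize,"},
]

title_text = join_child_tokens(section_title, tokens)
print(title_text)
```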
The following JSON output shows how paragraphs and tokens are related for the sentence "Collect, organize, grow data" when the text is extracted from a PDF.
//The section is listed in the top_level_structures array.
"top_level_structures":["PARA_fbdcdd",...,"SECTION_a2ab08",...],
//The section has a list of paragraphs.
{"id":"SECTION_9a3dda","parent_id":"SECTION_a2ab08","children_ids":["PARA_09384c",...
//The paragraph contains a section title.
{"id":"PARA_09384c","parent_id":"SECTION_9a3dda",
"text_alignment":"left","children_ids":["SECTION_TITLE_a5e3c2"],
//Token IDs listed for the section title.
{"id":"SECTION_TITLE_a5e3c2","parent_id":"PARA_09384c",
"text_alignment":"TBD","children_ids":[
"TOKEN_48bbae","TOKEN_cc0b9c","TOKEN_d57d27","TOKEN_a7d6da"
]},
//Consecutive tokens with a shared parent_id contain the text from the sentence of interest.
{"id":"TOKEN_48bbae","parent_id":"SECTION_TITLE_a5e3c2","style_id":"IBM_Plex_Sans_Light_Black_32_0",
"text":"Collect,",
"bbox":{"page_number":8,"x":283.0,"y":775.2945,"width":106.43201,"height":21.44}},
{"id":"TOKEN_cc0b9c","parent_id":"SECTION_TITLE_a5e3c2","style_id":"IBM_Plex_Sans_Light_Black_32_0",
"text":"organize,",
"bbox":{"page_number":8,"x":396.984,"y":775.2945,"width":126.78082,"height":21.44}},
{"id":"TOKEN_d57d27","parent_id":"SECTION_TITLE_a5e3c2","style_id":"IBM_Plex_Sans_Light_Black_32_0",
"text":"grow",
"bbox":{"page_number":8,"x":531.31683,"y":775.2945,"width":69.823975,"height":21.44}},
{"id":"TOKEN_a7d6da","parent_id":"SECTION_TITLE_a5e3c2","style_id":"IBM_Plex_Sans_Light_Black_32_0",
"text":"data",
"bbox":{"page_number":8,"x":608.6928,"y":775.2945,"width":62.880005,"height":21.44}},
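How text from images is represented
When you submit a PDF file with images or an image file to the text extraction method of the watsonx.ai API, the text that is extracted from the image is represented in tokens. The tokens are typically contained in paragraph or section objects.
The following JSON excerpt shows how a PNG file that was submitted to the text extraction method is represented in the JSON output. The paragraph objects that contain the text tokens are available from both the top_level_structures object and the all_structures root object.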
"top_level_structures":
[
"PARA_bc9320","PARA_8e9e62","PARA_b7f5cc","PARA_c75980","PARA_61a6a5","PARA_c8c2a8","PARA_8b8dd6","PARA_8c7c77","PARA_61aa92","PARA_1e6d2a","PARA_6eaa8d","PARA_cc6df5","PARA_4a9fb2"
],
"all_structures":{"sections":[],"section_titles":[],"lists":[],
"list_items":[],"list_identifiers":[],"tables":[],"table_rows":[],
"table_cells":[],"subscripts":[],"superscripts":[],"footnotes":[],
"paragraphs":
[
{"id":"PARA_bc9320","parent_id":"root","text_alignment":"center",
"children_ids":["TOKEN_132783","TOKEN_f0e333","TOKEN_dd48c3",
"TOKEN_c9b25e","TOKEN_080303","TOKEN_ce1aa0","TOKEN_97bf62"]...
{"id":"PARA_8e9e62","parent_id":"root",...
...
{"id":"PARA_4a9fb2","parent_id":"root",...
]
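For image input, it can help to know where each top-level paragraph sits on the page. The sketch below walks top_level_structures, resolves each paragraph, and computes the smallest box that encloses its tokens. It uses the first paragraph from the excerpt above, trimmed to the three tokens whose details appear in this section. The paragraph_bbox helper is a hypothetical name, and the code assumes that all tokens of a paragraph are on one page and that y values increase downward from the top-left corner that the schema describes.

```python
def paragraph_bbox(paragraph: dict, tokens_by_id: dict):
    """Return the smallest box that encloses all tokens of the paragraph."""
    boxes = [tokens_by_id[t]["bbox"] for t in paragraph.get("children_ids", [])
             if t in tokens_by_id and "bbox" in tokens_by_id[t]]
    if not boxes:
        return None
    left = min(b["x"] for b in boxes)
    top = min(b["y"] for b in boxes)
    right = max(b["x"] + b["width"] for b in boxes)
    bottom = max(b["y"] + b["height"] for b in boxes)
    return {"page_number": boxes[0]["page_number"], "x": left, "y": top,
            "width": right - left, "height": bottom - top}

doc = {
    "top_level_structures": ["PARA_bc9320"],
    "all_structures": {
        "paragraphs": [{"id": "PARA_bc9320", "parent_id": "root",
                        "children_ids": ["TOKEN_132783", "TOKEN_f0e333", "TOKEN_dd48c3"]}],
        "tokens": [
            {"id": "TOKEN_132783", "parent_id": "PARA_bc9320", "text": "The",
             "bbox": {"page_number": 1, "x": 250.65, "y": 109.3,
                      "width": 38.880005, "height": 21.48999}},
            {"id": "TOKEN_f0e333", "parent_id": "PARA_bc9320", "text": "AI",
             "bbox": {"page_number": 1, "x": 295.82, "y": 114.67,
                      "width": 24.109985, "height": 16.290009}},
            {"id": "TOKEN_dd48c3", "parent_id": "PARA_bc9320", "text": "Ladder®",
             "bbox": {"page_number": 1, "x": 325.74, "y": 110.24,
                      "width": 82.66, "height": 22.030006}},
        ],
    },
}

paragraphs_by_id = {p["id"]: p for p in doc["all_structures"]["paragraphs"]}
tokens_by_id = {t["id"]: t for t in doc["all_structures"]["tokens"]}

for structure_id in doc["top_level_structures"]:
    paragraph = paragraphs_by_id.get(structure_id)
    if paragraph is None:
        continue  # The top-level structure is not a paragraph.
    box = paragraph_bbox(paragraph, tokens_by_id)
    text = " ".join(tokens_by_id[t]["text"]
                    for t in paragraph["children_ids"] if t in tokens_by_id)
    print(text, box)
```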
The extracted text is specified in the tokens within the paragraphs. The following tokens represent the words "The AI Ladder®" from the image:
"tokens":[
{"id":"TOKEN_132783","parent_id":"PARA_bc9320","style_id":"Arial_Black_10_0",
"text":"The","bbox":{"page_number":1,"x":250.65,"y":109.3,"width":38.880005,"height":21.48999}},
{"id":"TOKEN_f0e333","parent_id":"PARA_bc9320","style_id":"Arial_Black_10_0",
"text":"AI","bbox":{"page_number":1,"x":295.82,"y":114.67,"width":24.109985,"height":16.290009}},
{"id":"TOKEN_dd48c3","parent_id":"PARA_bc9320","style_id":"Arial_Black_10_0",
"text":"Ladder®","bbox":{"page_number":1,"x":325.74,"y":110.24,"width":82.66,"height":22.030006}}
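How lists are represented
The structure of a list is represented by three separate objects that are part of the all_structures root object:
lists: A set of list items that are formatted as a bulleted or numbered list.
list_items: A single item in a list, which can contain tokens with text, paragraphs, and nested lists.
list_identifiers: Contains a token with the symbol, such as a hyphen or a number, that identifies the list item.
The following JSON output shows how the text "Providing transparency" in the first bullet of a list is represented.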
//The lists object contains the list where the listitem is located.
"lists":[{"id":"LIST_ed036e","parent_id":"SECTION_9a3dda","children_ids":[
"LISTITEM_c802c4",...
//The list_item object contains the list item which contains a list ID followed by several tokens.
"list_items":[{"id":"LISTITEM_c802c4","parent_id":"LIST_ed036e","children_ids":[
"LIST_ID_781ee7","TOKEN_1df44f","TOKEN_1bcdbf",...
//The list_identifiers object contains list IDs with tokens.
"list_identifiers":[{"id":"LIST_ID_781ee7","parent_id":"LISTITEM_c802c4",
"children_ids":["TOKEN_4a66cb"]}
//The list ID token includes a token with a hyphen.
{"id":"TOKEN_4a66cb","parent_id":"LIST_ID_781ee7","style_id":"IBM_Plex_Sans_Black_20_0",
"text":"–","bbox":{"page_number":10,"x":994.0,"y":500.36,"width":11.76001,"height":13.639999}}
//The list item tokens include the text *Providing transparency* in them.
{"id":"TOKEN_1df44f","parent_id":"LISTITEM_c802c4","style_id":"IBM_Plex_Sans_Black_20_0",
"text":"Providing","bbox":{"page_number":10,"x":1014.0,"y":500.36,"width":83.55994,"height":13.639999}},
{"id":"TOKEN_1bcdbf","parent_id":"LISTITEM_c802c4","style_id":"IBM_Plex_Sans_Black_20_0",
"text":"transparency","bbox":{"page_number":10,"x":1102.2799,"y":500.36,"width":117.95801,"height":13.639999}}...
The following Python code shows how to extract text from a list: it gets the list-related structures, and then loops over the list items to reconstruct the list from the token text.
# Import required libraries
import json

# Define helper functions
## Function, which finds entry in collection by key-value pair
def find_by_key(key: str, value, collection: list, unique=True):
    find = list(filter(lambda x: x[key] == value, collection))
    if unique:
        if len(find) > 1:
            raise ValueError(f"Found non-unique key-value pair.\n{find}")
        return find[0]
    else:
        return find

## Function, which flattens iterable collection of dicts
def flatten_collection(collection):
    result = []
    for val in collection.values():
        result.extend(val)
    return result

# Load the file with the extracted text
with open("/Users/janedoe/Downloads/output_retail.json") as f:
    raw_output = json.load(f)

# Get all list-related structures
all_lists = raw_output['all_structures']['lists']
all_list_items = raw_output['all_structures']['list_items']
all_list_identifiers = raw_output['all_structures']['list_identifiers']

# Get all list items from the first list in the file
list_1 = all_lists[0]
list_1_items = []
for list_item_id in list_1['children_ids']:
    list_1_items.append(find_by_key('id', list_item_id, all_list_items))

# Reconstruct the list
recon_list = []
flat_col = flatten_collection(raw_output['all_structures'])
for list_item in list_1_items:
    val = []
    for list_value_id in list_item['children_ids']:
        list_value = find_by_key('id', list_value_id, flat_col)
        if list_value['id'].startswith("LIST_ID"):
            # List identifier: collect the text of its tokens (for example, a hyphen)
            for list_id_value_id in list_value['children_ids']:
                list_id_value = find_by_key('id', list_id_value_id, flat_col)
                if 'text' in list_id_value:
                    val.append(list_id_value['text'])
        elif list_value['id'].startswith("PARA"):
            # Paragraph inside the list item: start a new line, then add its tokens
            val.append("\n")
            for para_value_id in list_value['children_ids']:
                para_value = find_by_key('id', para_value_id, flat_col)
                if 'text' in para_value:
                    val.append(para_value['text'])
        elif list_value['id'].startswith("TOKEN"):
            # Token that belongs directly to the list item
            val.append(list_value['text'])
        else:
            pass
    # Keep the reconstructed item and print it
    recon_list.append(' '.join(val))
    print(recon_list[-1])
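How tables are represented
The structure of a table is represented by three separate objects that are part of the all_structures root object:
tables: Each table is associated with multiple table rows.
table_rows: Each table row is associated with multiple table cells.
table_cells: Each table cell contains a sequence of tokens, a mixed sequence of paragraphs and tokens, or a mixed sequence of lists, paragraphs, and tokens.
The following JSON output shows how the table column title "Workflows" is represented.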
//The all_structures root object contains the table, which has many rows.
"all_structures":{
...
"tables":[{"id":"TABLE_3bfabb","children_ids":[
"ROW_39aa6f",...,"ROW_63472c"]}
//A separate table rows array contains table cells.
"all_structures":{
...
"table_rows":[{"id":"ROW_39aa6f","parent_id":"TABLE_3bfabb","children_ids":[
"CELL_bc1c4b","CELL_3a8cdd","CELL_03b6d3"]}
//One of the table cells is identified as a column header and contains a paragraph.
{"id":"CELL_3a8cdd","parent_id":"ROW_39aa6f","is_row_header":false,
"is_col_header":true,"col_span":1,"row_span":1,"col_start":2,"row_start":1,
"children_ids":["PARA_088d08"]}
//The paragraph has a token.
{"id":"PARA_088d08","parent_id":"CELL_3a8cdd","children_ids":[
"TOKEN_b99851"],"indentation":1}
//The token contains the text *Workflows*.
{"id":"TOKEN_b99851","parent_id":"PARA_088d08","style_id":"IBM_Plex_Sans_SmBld_Black_20_0_bold",
"text":"Workflows","bbox":{"page_number":14,"x":757.0,"y":291.44003,"width":99.15997,"height":13.96}}
The following Python code shows how to extract text from a table: it gets the table-related structures, and then loops over the table rows and cells to reconstruct the table from the token text.
# Import required libraries
import json
import numpy as np
import pandas as pd

# Define helper functions
## Function, which finds entry in collection by key-value pair
def find_by_key(key: str, value, collection: list, unique=True):
    find = list(filter(lambda x: x[key] == value, collection))
    if unique:
        if len(find) > 1:
            raise ValueError(f"Found non-unique key-value pair.\n{find}")
        return find[0]
    else:
        return find

## Function, which flattens iterable collection of dicts
def flatten_collection(collection):
    result = []
    for val in collection.values():
        result.extend(val)
    return result

# Load the file with the extracted text
with open("/Users/janedoe/Downloads/output_retail.json") as f:
    raw_output = json.load(f)

# Get all table-related structures
all_tables = raw_output['all_structures']['tables']
all_table_rows = raw_output['all_structures']['table_rows']
all_table_cells = raw_output['all_structures']['table_cells']

# Get all of the cells from the first table
table_1 = all_tables[0]
table_1_cells = []
for row_id in table_1['children_ids']:
    row = find_by_key('id', row_id, all_table_rows)
    for cell_id in row['children_ids']:
        table_1_cells.append(find_by_key('id', cell_id, all_table_cells))

# Reconstruct the first table
# The last cell is assumed to be the bottom-right cell of the table
last_col = table_1_cells[-1]['col_start']
last_row = table_1_cells[-1]['row_start']
recon_table = np.empty([last_row, last_col], dtype=object)
flat_col = flatten_collection(raw_output['all_structures'])
# Loop over the cells of the first table only, so that every index fits recon_table
for cell in table_1_cells:
    cell_col, cell_row = cell['col_start'], cell['row_start']
    # Collect the text of all children of the cell
    entries = []
    for cell_value in cell['children_ids']:
        value = find_by_key('id', cell_value, flat_col)
        if value['id'].startswith("TOKEN"):
            # Token that belongs directly to the cell
            entries.append(value['text'])
            continue
        for cell_entry in value.get('children_ids', []):
            entry = find_by_key('id', cell_entry, flat_col)
            if 'text' in entry:
                entries.append(entry['text'])
    recon_table[cell_row-1][cell_col-1] = " ".join(entries)

# Use the first row as the column titles
pd.DataFrame(data=recon_table[1:,:], columns=recon_table[0,:])
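Instead of assuming that the first row holds the column titles, you can also read the header flag of each cell. The following is a minimal sketch of that idea. The is_column_header and cell_text helpers are hypothetical names. The sample output in this topic names the flag is_col_header while the schema names it is_column_header, so the sketch checks both. The second cell in the sample is made up for illustration.

```python
def is_column_header(cell: dict) -> bool:
    """Read the header flag under either of the two names seen in this topic."""
    return bool(cell.get("is_col_header", cell.get("is_column_header", False)))

def cell_text(cell: dict, index: dict) -> str:
    """Join the token text found under the paragraphs of a cell."""
    words = []
    for child_id in cell.get("children_ids", []):
        child = index.get(child_id, {})
        for token_id in child.get("children_ids", []):
            token = index.get(token_id, {})
            if "text" in token:
                words.append(token["text"])
    return " ".join(words)

structures = {
    "table_cells": [
        {"id": "CELL_3a8cdd", "parent_id": "ROW_39aa6f", "is_row_header": False,
         "is_col_header": True, "col_start": 2, "row_start": 1,
         "children_ids": ["PARA_088d08"]},
        # Made-up body cell, for illustration only.
        {"id": "CELL_example", "parent_id": "ROW_example", "is_row_header": False,
         "is_col_header": False, "col_start": 2, "row_start": 2, "children_ids": []},
    ],
    "paragraphs": [{"id": "PARA_088d08", "parent_id": "CELL_3a8cdd",
                    "children_ids": ["TOKEN_b99851"]}],
    "tokens": [{"id": "TOKEN_b99851", "parent_id": "PARA_088d08", "text": "Workflows"}],
}

index = {item["id"]: item for items in structures.values() for item in items}
headers = {cell["col_start"]: cell_text(cell, index)
           for cell in structures["table_cells"] if is_column_header(cell)}
print(headers)
```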
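How key-value pairs are represented
Labeled data is represented as key-value pairs in the kvps array, which the schema defines at the root level of the output. Each key-value pair contains the following objects:
id: Unique ID of the key-value pair.
key: Unique label for a piece of data.
value: Data that is associated with the label.
The following JSON output shows how the Contact name and Phone fields of a California Personal Auto Application form are represented.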
"kvps": [
{
"id": "KVP_000034",
"type": "key_value",
"key": {
"id": "KEY_000034",
"semantic_label": "contact_name",
"raw_text": "CONTACT NAME",
"normalized_text": null,
"confidence_score": null,
"bbox": {
"x": 26.406426231269133,
"y": 178.04464285714283,
"width": 42.25028197003061,
"height": 15.482142857142861,
"page_number": 1
}
},
"value": {
"id": "VALUE_000034",
"raw_text": "John Smith",
"normalized_text": null,
"confidence_score": null,
"bbox": {
"x": 76.57863607068049,
"y": 178.04464285714283,
"width": 60.73478033191901,
"height": 10.321428571428584,
"page_number": 1
}
}
},
{
"id": "KVP_000035",
"type": "key_value",
"key": {
"id": "KEY_000035",
"semantic_label": "contact_phone",
"raw_text": "PHONE (A/C. No. Ext)",
"normalized_text": null,
"confidence_score": null,
"bbox": {
"x": 26.406426231269133,
"y": 196.10714285714283,
"width": 42.250283760672005,
"height": 14.837047751844239,
"page_number": 1
}
},
"value": {
"id": "VALUE_000035",
"raw_text": "(917) 555-2843",
"normalized_text": null,
"confidence_score": null,
"bbox": {
"x": 95.06313727469147,
"y": 196.10715158651936,
"width": 75.91847683596005,
"height": 12.256690608987071,
"page_number": 1
}
}
},
...
]
The following Python code shows how to extract text from the key-value pairs in the assembled JSON output file and loop over the structured data to reconstruct the content.
def extract_kvps(assembly_dict):
    """
    Extract and print key-value pairs from the assembly dict
    Works with both 'only_value' and 'key_value' type KVPs
    Includes coordinate information and page dimensions
    """

    def print_bbox(label, bbox):
        # Print the bounding box of a key or a value, if one is available
        print(f"{label} Coordinates:")
        if bbox != "N/A":
            print(f" x: {bbox['x']}, y: {bbox['y']}")
            print(f" width: {bbox['width']}, height: {bbox['height']}")
            print(f" page: {bbox['page_number']}")
        else:
            print(" No coordinates available")

    try:
        data = assembly_dict

        # Get page metadata for dimensions
        page_metadata = data.get("metadata", {}).get("pages_metadata", [])
        if page_metadata:
            print("Document Page Information:")
            for page in page_metadata:
                page_num = page.get("page_number", "Unknown")
                page_width = page.get("page_pdf_width", "Unknown")
                page_height = page.get("page_pdf_height", "Unknown")
                page_image_width = page.get("page_image_width", "Unknown")
                page_image_height = page.get("page_image_height", "Unknown")
                print(f"Page {page_num}:")
                print(f" PDF Dimensions: {page_width} x {page_height}")
                print(f" Image Dimensions: {page_image_width} x {page_image_height}")
            print()
        else:
            print("No page metadata found in the document\n")

        # Extract KVPs if they exist in the data
        kvps = data.get("kvps", [])
        if not kvps:
            print("No KVPs found in the JSON data")
            return
        print(f"Found {len(kvps)} Key-Value Pairs\n")
        print("=" * 80)

        # Process each KVP
        for i, kvp in enumerate(kvps, 1):
            kvp_id = kvp.get("id", "Unknown ID")
            kvp_type = kvp.get("type", "Unknown type")

            # Get key and value information
            key_info = kvp.get("key", {})
            value_info = kvp.get("value", {})

            # Get semantic label, key text, and value text (if any)
            semantic_label = key_info.get("semantic_label", "N/A")
            key_text = key_info.get("raw_text", "N/A")
            value_text = value_info.get("raw_text", "N/A")

            # Get coordinates (bounding boxes)
            key_bbox = key_info.get("bbox", "N/A")
            value_bbox = value_info.get("bbox", "N/A")

            # Print the information
            print(f"KVP #{i}: {kvp_id}")
            print(f"Type: {kvp_type}")
            if kvp_type == "only_value":
                print(f"Semantic Label: {semantic_label}")
                print(f"Value: {value_text}")
                print_bbox("Value", value_bbox)
            else:  # key_value type
                print(f"Key Text: {key_text}")
                print(f"Normalized key: {semantic_label}")
                print(f"Value: {value_text}")
                print_bbox("Key", key_bbox)
                print_bbox("Value", value_bbox)
            print("-" * 80)
    except Exception as e:
        print(f"Error processing KVPs: {e}")
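When you need the extracted fields as data and not as printed text, a compact mapping can be enough. The sketch below turns the kvps array into a dictionary from label to value text. The kvps_to_dict helper is a hypothetical name, the sample is trimmed from the JSON output shown earlier in this section, and falling back from semantic_label to raw_text to the pair ID is an assumption about which label you prefer.

```python
def kvps_to_dict(doc: dict) -> dict:
    """Map each key-value pair's label to the raw text of its value."""
    result = {}
    for kvp in doc.get("kvps", []):
        key_info = kvp.get("key") or {}
        value_info = kvp.get("value") or {}
        # Prefer the semantic label, then the key text, then the pair ID.
        label = key_info.get("semantic_label") or key_info.get("raw_text") or kvp.get("id")
        result[label] = value_info.get("raw_text")
    return result

# Trimmed from the sample output above; bounding boxes are left out.
doc = {
    "kvps": [
        {"id": "KVP_000034", "type": "key_value",
         "key": {"id": "KEY_000034", "semantic_label": "contact_name",
                 "raw_text": "CONTACT NAME"},
         "value": {"id": "VALUE_000034", "raw_text": "John Smith"}},
        {"id": "KVP_000035", "type": "key_value",
         "key": {"id": "KEY_000035", "semantic_label": "contact_phone",
                 "raw_text": "PHONE (A/C. No. Ext)"},
         "value": {"id": "VALUE_000035", "raw_text": "(917) 555-2843"}},
    ]
}

fields = kvps_to_dict(doc)
print(fields)
```

If two pairs share a label, the later one overwrites the earlier one in this sketch; collect the values in a list instead if a form can repeat a field.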
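Text extraction JSON schema
You can refer to the JSON schema when you write code to extract information from the JSON that is generated for a document.
Structures whose description mentions "Not required" might be removed in a future iteration of the schema. If you choose to reference optional structures in your code, you might need to update your code later when changes are made to the schema.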
{ "$defs": {
"AssemblyJsonOutput": {
"type": "object",
"properties": {
"metadata": {
"description": "Metadata about this document.",
"$ref": "#/$defs/Metadata"
},
"styles": {
"description": "Font styles used in this document. Not required.",
"type": "array",
"items": {
"$ref": "#/$defs/Style"
}
},
"kvps": {
"description": "Key value pairs found in the document.",
"type": "array",
"items": {
"$ref": "#/$defs/Kvp"
}
},
"top_level_structures": {
"type": "array",
"description": "Array of ids of the top level structures which belong directly under the document",
"items": {
"type": "string"
}
},
"all_structures": {
"type": "object",
"description": "An object containing lists of all structures identified in this document.",
"$ref": "#/$defs/Structures"
}
},
"required": [
"metadata",
"top_level_structures",
"all_structures"
]
},
"Metadata": {
"type": "object",
"additionalProperties": true,
"title": "Metadata",
"properties": {
"num_pages": {
"type": "integer",
"description": "Total number of pages in the document"
},
"title": {
"type": "string",
"description": "Document title as obtained from source document. Not required."
},
"language": {
"type": "string",
"description": "Determined by the lang specifier in the <html> tag, or <meta> tag"
},
"url": {
"type": "string",
"description": "url of the document"
},
"keywords": {
"type": "string",
"description": "Keywords associated with document. Not required."
},
"author": {
"type": "string",
"description": "Author of the document. Not required."
},
"publication_date": {
"type": "string",
"description": "Best effort bases for a publication date (may be the creation date). Not required."
},
"subject": {
"type": "string",
"description": "Subject as obtained from the source document. Not required."
},
"charset": {
"type": "string",
"description": "Character set used for the output"
},
"output_tokens_flag": {
"type": "boolean",
"description": "Whether individual tokens are output, as specified in the input to the API"
},
"output_bounding_boxes_flag": {
"type": "boolean",
"description": "Whether bounding boxes are output, as requested in the input to the API"
},
"pages_metadata": {
"type": "array",
"items": {
"$ref": "#/$defs/PageMetadata"
}
}
},
"required": [
"num_pages",
"charset"
]
},
"PageMetadata": {
"type": "object",
"title": "PageMetadata",
"properties": {
"page_number": {
"type": "integer",
"description": "Page number, starting from 1"
},
"page_image_width": {
"type": "integer",
"description": "Width of the page in pixels, assuming the page is an image with the DPI as specified in the dpi property "
},
"page_image_height": {
"type": "integer",
"description": "Height of the page in pixels, assuming the page is an image with DPI as specified in the dpi property"
},
"dpi": {
"type": "integer",
"description": "The DPI to use for the page image, as specified in the input to the API"
}
}
},
"Style": {
"type": "object",
"title": "Style",
"properties": {
"style_id": {
"type": "string",
"description": "Style Identifier which will be used for reference in other objects"
},
"font_size": {
"type": "string",
"description": "Font size"
},
"font_name": {
"type": "string",
"description": "Font name"
},
"is_bold": {
"type": "string",
"description": "Whether or not the font is bold"
},
"is_italic": {
"type": "string",
"description": "Whether or not the font is italic"
}
},
"required": [
"style_id",
"font_size",
"font_name",
"is_bold",
"is_italic"
]
},
"Kvp": {
"type": "object",
"title": "KVP",
"properties": {
"id": {
"type": "string",
"description": "A unique ID of the KVP prefixed with KVP_"
},
"type": {
"type": "string",
"description": "The type of the KVP"
},
"key": {
"type": "object",
"description": "The key data of the KVP",
"$ref": "#/$defs/KvpKey"
},
"value": {
"type": "object",
"description": "The value data of the KVP",
"$ref": "#/$defs/KvpValue"
}
},
"required": [
"id",
"type",
"value"
]
},
"KvpKey": {
"type": "object",
"title": "KvpKey",
"properties": {
"id": {
"type": "string",
"description": "A unique ID of the KVP key prefixed with KEY_"
},
"semantic_label": {
"type": "string",
"description": "The semantic label of the KVP"
},
"raw_text": {
"type": "string",
"description": "The original text of the key extracted in the document"
},
"normalized_text": {
"type": "string",
"description": "The normalized text of the key"
},
"confidence_score": {
"type": "float",
"description": "The confidence score of the key"
},
"bbox": {
"type": "object",
"description": "The bounding box of the key",
"$ref": "#/$defs/BoundingBox"
}
},
"required": [
"id",
"raw_text"
]
},
"KvpValue": {
"type": "object",
"title": "KvpValue",
"properties": {
"id": {
"type": "string",
"description": "A unique ID of the KVP value prefixed with VALUE_"
},
"raw_text": {
"type": "string",
"description": "The original text of the value extracted in the document"
},
"normalized_text": {
"type": "string",
"description": "The normalized text of the value"
},
"confidence_score": {
"type": "float",
"description": "The confidence score of the value"
},
"bbox": {
"type": "object",
"description": "The bounding box of the value",
"$ref": "#/$defs/BoundingBox"
}
},
"required": [
"id",
"raw_text"
]
},
"Structures": {
"type": "object",
"description": "An object containing all flattened structures identified in the document. None of the items in this object are required.",
"sections": {
"type": "array",
"items": {
"$ref": "#/$defs/Section"
}
},
"section_titles": {
"type": "array",
"items": {
"$ref": "#/$defs/SectionTitle"
}
},
"lists": {
"type": "array",
"items": {
"$ref": "#/$defs/List"
}
},
"list_items": {
"type": "array",
"items": {
"$ref": "#/$defs/ListItem"
}
},
"list_identifiers": {
"type": "array",
"items": {
"$ref": "#/$defs/ListIdentifier"
}
},
"tables": {
"type": "array",
"items": {
"$ref": "#/$defs/Table"
}
},
"table_rows": {
"type": "array",
"items": {
"$ref": "#/$defs/TableRow"
}
},
"table_cells": {
"type": "array",
"items": {
"$ref": "#/$defs/TableCell"
}
},
"subscripts": {
"type": "array",
"items": {
"$ref": "#/$defs/Subscript"
}
},
"superscripts": {
"type": "array",
"items": {
"$ref": "#/$defs/Superscript"
}
},
"footnotes": {
"type": "array",
"items": {
"$ref": "#/$defs/Footnote"
}
},
"paragraphs": {
"type": "array",
"items": {
"$ref": "#/$defs/Paragraph"
}
},
"code_snippets": {
"type": "array",
"items": {
"$ref": "#/$defs/CodeSnippet"
}
},
"pictures":{
"type": "array",
"items": {
"$ref": "#/$defs/Picture"
}
},
"tokens": {
"type": "array",
"items": {
"$ref": "#/$defs/Token"
}
}
},
"Section": {
"type": "object",
"title": "Section",
"properties": {
"id": {
"type": "string",
"description": "Unique identifier for the section"
},
"parent_id": {
"type": "string",
"description": "Unique identifier which denotes parent of this structure"
},
"children_ids": {
"type": "array",
"description": "Unique Ids of first level children structures under this structure in correct sequence"
},
"section_number": {
"type": "string",
"description": "Section identifier identified in the document"
},
"section_level": {
"type": "string",
"description": "Nesting level of section identified in the document"
}
},
"required": [
"id",
"parent_id",
"children_ids",
"section_number",
"section_level"
]
},
"SectionTitle": {
"type": "object",
"title": "SectionTitle",
"properties": {
"id": {
"type": "string",
"description": "Unique identifier for the section title"
},
"parent_id": {
"type": "string",
"description": "Unique identifier which denotes parent of this structure"
},
"children_ids": {
"type": "array",
"description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
},
"text_alignment": {
"type": "string",
"description": "Text alignment of the section title. Not required."
},
"text": {
"type": "string",
"description": "Text property added to all objects"
}
},
"required": [
"id",
"parent_id",
"text"
]
},
"List": {
"type": "object",
"title": "List",
"properties": {
"id": {
"type": "string",
"description": "Unique identifier for the list "
},
"parent_id": {
"type": "string",
"description": "Unique identifier which denotes parent of this structure"
},
"children_ids": {
"type": "array",
"description": "Unique Ids of first level children structures under this structure in correct sequence"
}
},
"required": [
"id",
"parent_id",
"children_ids"
]
},
"ListItem": {
"type": "object",
"title": "ListItem",
"properties": {
"id": {
"type": "string",
"description": "Unique identifier for the list item"
},
"parent_id": {
"type": "string",
"description": "Unique identifier which denotes parent of this structure"
},
"children_ids": {
"type": "array",
"description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
},
"text": {
"type": "string",
"description": "Text property added to all objects"
}
},
"required": [
"id",
"parent_id",
"text"
]
},
"ListIdentifier": {
"type": "object",
"title": "ListIdentifier",
"properties": {
"id": {
"type": "string",
"description": "Unique identifier for the list identifier"
},
"parent_id": {
"type": "string",
"description": "Unique identifier which denotes parent of this structure"
},
"children_ids": {
"type": "array",
"description": "Unique Ids of first level children structures under this structure in correct sequence"
}
},
"required": [
"id",
"parent_id",
"children_ids"
]
},
"Table": {
"type": "object",
"title": "Table",
"properties": {
"id": {
"type": "string",
"description": "Unique identifier for the table"
},
"parent_id": {
"type": "string",
"description": "Unique identifier which denotes parent of this structure"
},
"children_ids": {
"type": "array",
"description": "Unique Ids of first level children structures under this structure in correct sequence, in this case, table rows"
}
},
"required": [
"id",
"parent_id",
"children_ids"
]
},
"TableRow": {
"type": "object",
"title": "TableRow",
"properties": {
"id": {
"type": "string",
"description": "Unique identifier for the table row"
},
"parent_id": {
"type": "string",
"description": "Unique identifier which denotes parent of this structure"
},
"children_ids": {
"type": "array",
"description": "Unique Ids of first level children structures under this structure in correct sequence, in this case, table cells"
}
},
"required": [
"id",
"parent_id",
"children_ids"
]
},
"TableCell": {
"type": "object",
"title": "TableCell",
"properties": {
"id": {
"type": "string",
"description": "Unique identifier for the table cell"
},
"parent_id": {
"type": "string",
"description": "Unique identifier which denotes parent of this structure"
},
"is_row_header": {
"type": "boolean",
"description": "Whether the cell is part of row header or not"
},
"is_column_header": {
"type": "boolean",
"description": "Whether the cell is part of column header or not"
},
"col_span": {
"type": "integer",
"description": "column span of the cell"
},
"row_span": {
"type": "integer",
"description": "row span of the cell"
},
"col_start": {
"type": "integer",
"description": "column start of the cell within the table"
},
"row_start": {
"type": "integer",
"description": "row start of the cell within the table"
},
"children_ids": {
"type": "array",
"description": "Unique Ids of first level children structures under this structure in correct sequence, in this case, underlying paragraphs. Not required."
},
"text": {
"type": "string",
"description": "Text property added to all objects"
}
},
"required": [
"id",
"parent_id",
"is_row_header",
"is_column_header",
"col_span",
"row_span",
"col_start",
"row_start",
"text"
]
},
"Subscript": {
"type": "object",
"title": "Subscript",
"properties": {
"id": {
"type": "string",
"description": "Unique identifier for the subscript"
},
"parent_id": {
"type": "string",
"description": "Unique identifier which denotes parent of this structure"
},
"children_ids": {
"type": "array",
"description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
},
"token_id_ref": {
"type": "string",
"description": "Id of the token to which the subscript belongs"
},
"text": {
"type": "string",
"description": "Text property added to all objects"
}
},
"required": [
"id",
"parent_id",
"text"
]
},
"Superscript": {
"type": "object",
"title": "Superscript",
"properties": {
"id": {
"type": "string",
"description": "Unique identifier for the superscript"
},
"parent_id": {
"type": "string",
"description": "Unique identifier which denotes parent of this structure"
},
"footnote_ref": {
"type": "string",
"description": "Matching footnote id found on the page"
},
"token_id_ref": {
"type": "string",
"description": "Id of the token to which the superscript belongs"
},
"children_ids": {
"type": "array",
"description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
},
"text": {
"type": "string",
"description": "Text property added to all objects"
}
},
"required": [
"id",
"parent_id",
"footnote_ref",
"text"
]
},
"Footnote": {
"type": "object",
"title": "Footnote",
"properties": {
"id": {
"type": "string",
"description": "Unique identifier for the footnote"
},
"parent_id": {
"type": "string",
"description": "Unique identifier which denotes parent of this structure"
},
"children_ids": {
"type": "array",
"description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
},
"text": {
"type": "string",
"description": "Text property added to all objects"
}
},
"required": [
"id",
"parent_id",
"text"
]
},
"Paragraph": {
"type": "object",
"title": "Paragraph",
"properties": {
"id": {
"type": "string",
"description": "Unique identifier for the paragraph"
},
"parent_id": {
"type": "string",
"description": "Unique identifier which denotes parent of this structure"
},
"children_ids": {
"type": "array",
"description": "Unique Ids of first level children structures under this structure in correct sequence, in this case, tokens. Not required."
},
"text_alignment": {
"type": "string",
"description": "Text alignment of the paragraph. Not required."
},
"indentation": {
"type": "integer",
"description": "Paragraph indentation. Not required."
},
"text": {
"type": "string",
"description": "Text property added to all objects"
}
},
"required": [
"id",
"parent_id",
"text"
]
},
"CodeSnippet": {
"type": "object",
"title": "CodeSnippet",
"properties": {
"id": {
"type": "string",
"description": "Unique identifier for the code snippet"
},
"parent_id": {
"type": "string",
"description": "Unique identifier which denotes parent of this structure"
},
"children_ids": {
"type": "array",
"description": "Unique Ids of first level children structures under this structure in correct sequence, in this case, tokens",
"items": {
"type": "string"
}
},
"text": {
"type": "string",
"description": "Text of the code snippet. It can contain multiple lines, including empty lines or lines with leading spaces."
}
},
"required": [
"id",
"parent_id",
"text"
]
},
"Picture": {
"type": "object",
"title": "Picture",
"properties": {
"id": {
"type": "string",
"description": "Unique identifier for the picture"
},
"parent_id": {
"type": "string",
"description": "Unique identifier which denotes parent of this structure"
},
"children_ids": {
"type": "array",
"description": "Unique identifiers of the tokens extracted from this picture, if any"
},
"text": {
"type": "string",
"description": "Text extracted from this picture"
},
"verbalization": {
"type": "string",
"description": "Verbalization of this picture"
},
"page_number": {
"type": "integer",
"description": "Page that contains this picture"
},
"path": {
"type": "string",
"description": "Path in the output location where the picture itself was saved"
},
"bbox": {
"type":"object",
"description": "The bounding box of the picture in the context of the page, expressed as pixel coordinates with respect to pages_metadata.page_image_height and pages_metadata.page_image_width",
"$ref": "#/$defs/BoundingBox"
}
},
"required": [
"id",
"parent_id"
]
},
"Token": {
"type": "object",
"title": "Token",
"properties": {
"id": {
"type": "string",
"description": "Unique identifier for the token"
},
"parent_id": {
"type": "string",
"description": "Unique identifier which denotes parent of this structure"
},
"style_id": {
"type": "string",
"description": "Identifier of the style object associated with this token. Not required."
},
"text": {
"type": "string",
"description": "Actual text of the token"
},
"bbox": {
"type": "object",
"description": "The bounding box of the token in the context of the page, expressed as pixel coordinates with respect to pages_metadata.page_image_height and pages_metadata.page_image_width",
"$ref": "#/$defs/BoundingBox"
}
},
"required": [
"id",
"parent_id",
"text"
]
},
"BoundingBox": {
"type": "object",
"title": "BoundingBox",
"properties": {
"page_number": {
"description": "Which page this represents",
"type": "integer"
},
"x": {
"description": "X coordinate of the top left corner of the bounding box",
"type": "float"
},
"y": {
"description": "Y coordinate of the top left corner of the bounding box",
"type": "float"
},
"width": {
"description": "The width of the bounding box",
"type": "float"
},
"height": {
"description": "The height of the bounding box",
"type": "float"
}
},
"required": [
"page_number",
"x",
"y",
"width",
"height"
]
}
},
"$ref": "#/$defs/AssemblyJsonOutput"
}
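Before you process an output file, a lightweight check against the properties that the schema lists as required can catch truncated or unexpected files early. The sketch below checks only the root object and the metadata object. The missing_required_keys helper and the two constant names are hypothetical, and this is a hand-written check, not full JSON Schema validation.

```python
# Required properties as listed in the schema above.
REQUIRED_ROOT_KEYS = ("metadata", "top_level_structures", "all_structures")
REQUIRED_METADATA_KEYS = ("num_pages", "charset")

def missing_required_keys(doc: dict) -> list:
    """Return the required root and metadata keys that are absent from doc."""
    missing = [key for key in REQUIRED_ROOT_KEYS if key not in doc]
    metadata = doc.get("metadata", {})
    missing += [f"metadata.{key}" for key in REQUIRED_METADATA_KEYS
                if key not in metadata]
    return missing

# Hand-made, deliberately incomplete sample, for illustration only.
incomplete = {"metadata": {"num_pages": 28}, "all_structures": {}}
problems = missing_required_keys(incomplete)
print(problems)
```

Optional structures are not checked here on purpose; read them with .get so that your code keeps working when they are absent.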