텍스트 추출로 생성된 JSON 구조 구문 분석하기

영어 버전 문서로 돌아가기

마지막 업데이트 날짜: 2024년 11월 07일

텍스트 추출로 생성된 JSON 구조 구문 분석하기

텍스트 추출 API를 사용하여 문서 또는 이미지 파일에서 추출한 텍스트는 문서의 다양한 텍스트 및 시각적 요소에 대한 세부 정보가 포함된 JSON 파일에 기록됩니다. 생성된 JSON을 추가로 처리하여 원하는 정보를 추출할 수 있습니다.

텍스트 추출 API를 사용하여 텍스트를 추출할 때 항상 다음 JSON 객체가 반환됩니다. 이러한 루트 객체 내의 구조는 선택 사항이므로 출력에 반환될 수도 있고 반환되지 않을 수도 있습니다.

all_structures
metadata
top_level_structures

' styles ' 키는 종종 반환되는 또 다른 루트 수준 객체이지만 선택 사항입니다. ' styles ' 객체에는 문서에 사용된 글꼴 크기 및 글꼴 스타일과 같은 글꼴에 대한 세부 정보가 각각 포함된 사전 목록이 포함되어 있습니다.

관심 있는 구조에서 텍스트를 추출하는 코드를 작성할 수 있습니다. 자세한 내용은 다음 섹션을 참조하세요:

단락이 표시되는 방식
이미지의 텍스트가 표현되는 방식
목록이 표시되는 방식
테이블이 표현되는 방식

JSON 스키마에 대한 자세한 내용은 텍스트 추출 JSON 스키마로 이동하세요.

메타데이터

' metadata ' 키는 다음과 같이 처리된 문서의 메타데이터에 대한 세부 정보가 포함된 사전입니다:

num_pages': 문서의 페이지 수입니다.
title: 문서의 제목입니다.
keywords': 문서와 관련된 키워드입니다.
author: 문서 작성자.
publication_date': 문서가 생성 또는 게시된 날짜입니다.
subject' : 문서의 제목.
charset': 문서에 사용되는 문자 집합의 표준입니다.

다음 JSON 출력은 PDF 파일의 메타데이터 객체 구조의 예입니다.

"metadata":{
  "num_pages":28,
  "title":"Put AI to work for HR and talent transformation for the retail industry",
  "keywords":"",
  "author":"IBM",
  "publication_date":"",
  "subject":"Apply AI capabilities to drive your HR and talent transformation and generate better business outcomes in the retail industry.",
  "charset":"UTF-8"
}

구조

구문 분석된 문서의 데이터 구조를 참조하는 두 개의 키가 있습니다:

top_level_structures': 최상위 데이터 구조의 ID 목록입니다.
all_structures': 모든 데이터 구조 유형 목록입니다.

' all_structures ' 키에는 구문 분석된 문서에서 가능한 모든 유형의 데이터 구조 목록이 포함되어 있습니다. 구문 분석된 문서에 포함될 수 있는 일부 데이터 구조는 다음과 같습니다:

sections': 문서의 모든 섹션 목록입니다.
section_titles': 감지된 섹션의 제목 목록입니다.
lists': 문서에 있는 모든 목록의 모음입니다.
list_items': 감지된 목록 객체에 존재하는 목록 항목의 모음입니다.
list_identifiers': 감지된 목록 객체의 목록 식별자 모음입니다.
tables': 문서에 있는 모든 테이블 목록입니다.
table_rows': 감지된 테이블에 존재하는 테이블 행의 목록입니다.
table_cells': 감지된 테이블 행에 존재하는 테이블 셀 목록입니다.
tokens': 일반 텍스트 토큰 목록입니다.
subscripts': 문서에서 감지된 토큰과 관련된 아래 첨자 텍스트의 인스턴스 목록입니다.
superscripts: 문서에서 감지된 토큰과 관련된 위 첨자 텍스트의 인스턴스 목록입니다
footnotes': 각주 목록입니다.
paragraphs: 단락 목록입니다.

추출된 JSON으로 작업하기

JSON 프로세서 라이브러리를 사용하여 생성된 JSON 파일의 다양한 구조에서 텍스트를 추출할 수 있습니다.

다음 명령은 단일 JSON 객체에 저장된 값인 PDF의 페이지 수를 반환합니다:

cat output_retail.json | jq '.metadata.num_pages'

참고: 이 명령은 별도로 설치해야 하는 명령줄 JSON 프로세서인 jq를 사용합니다.

테이블 및 목록과 같은 일부 구조의 경우, 추출된 텍스트는 생성된 JSON 내의 다양한 객체에 저장됩니다. 코드를 사용하여 개체를 가로지르며 관심 있는 텍스트를 추출할 수 있습니다.

단락이 표시되는 방식

하나의 단락은 여러 개의 토큰을 순서대로 연결하는 것이 가장 일반적이며, 각 토큰은 하나의 단어를 나타냅니다.

경우에 따라 단락은 섹션 및 목록과 같은 다른 구조와 연결되어 있습니다.

다음 JSON 출력은 문장에서 단락과 토큰이 어떻게 연관되어 있는지, PDF에서 텍스트를 추출할 때 데이터를 수집, 정리, 성장시키는 방법을 보여줍니다.

문장이 포함된 PDF 스크린샷, 정리 성장 데이터를 수집하여 강조 표시합니다.

//The section is listed in the top_level_structures array.
"top_level_structures":["PARA_fbdcdd",...,"SECTION_a2ab08",...],

//The section has a list of parapraphs.
{"id":"SECTION_9a3dda","parent_id":"SECTION_a2ab08","children_ids":["PARA_09384c",...

//The paragraph contains a section title.
{"id":"PARA_09384c","parent_id":"SECTION_9a3dda",
"text_alignment":"left","children_ids":["SECTION_TITLE_a5e3c2"],

//Token IDs listed for the section title.
{"id":"SECTION_TITLE_a5e3c2","parent_id":"PARA_09384c",
"text_alignment":"TBD","children_ids":[
  "TOKEN_48bbae","TOKEN_cc0b9c","TOKEN_d57d27","TOKEN_a7d6da"
]},

//Consecutive tokens with a shared parent_id contain the text from the sentence of interest.
{"id":"TOKEN_48bbae","parent_id":"SECTION_TITLE_a5e3c2","style_id":"IBM_Plex_Sans_Light_Black_32_0",
"text":"Collect,",
"bbox":{"page_number":8,"x":283.0,"y":775.2945,"width":106.43201,"height":21.44}},
{"id":"TOKEN_cc0b9c","parent_id":"SECTION_TITLE_a5e3c2","style_id":"IBM_Plex_Sans_Light_Black_32_0",
"text":"organize,",
"bbox":{"page_number":8,"x":396.984,"y":775.2945,"width":126.78082,"height":21.44}},
{"id":"TOKEN_d57d27","parent_id":"SECTION_TITLE_a5e3c2","style_id":"IBM_Plex_Sans_Light_Black_32_0",
"text":"grow",
"bbox":{"page_number":8,"x":531.31683,"y":775.2945,"width":69.823975,"height":21.44}},
{"id":"TOKEN_a7d6da","parent_id":"SECTION_TITLE_a5e3c2","style_id":"IBM_Plex_Sans_Light_Black_32_0",
"text":"data",
"bbox":{"page_number":8,"x":608.6928,"y":775.2945,"width":62.880005,"height":21.44}},

이미지의 텍스트가 표현되는 방식

이미지가 포함된 PDF 파일 또는 이미지 파일을 watsonx.ai API의 텍스트 추출 메서드에 제출하면 이미지의 텍스트는 ' tokens로 표시됩니다. ' tokens '은 일반적으로 ' paragraph ' 또는 ' section ' 객체에 포함되어 있습니다.

텍스트가 포함된 PNG 파일의 스크린샷입니다.

다음 JSON 발췌문은 텍스트 추출 방법으로 제출한 PNG 파일이 JSON 출력에서 어떻게 표현되는지 보여줍니다. 텍스트 토큰이 포함된 단락 객체는 ' top_level_structures 객체와 ' all_structures ' 루트 객체 모두에서 사용할 수 있습니다.

"top_level_structures":
[
  "PARA_bc9320","PARA_8e9e62","PARA_b7f5cc","PARA_c75980","PARA_61a6a5","PARA_c8c2a8","PARA_8b8dd6","PARA_8c7c77","PARA_61aa92","PARA_1e6d2a","PARA_6eaa8d","PARA_cc6df5","PARA_4a9fb2"
],
"all_structures":{"sections":[],"section_titles":[],"lists":[],
  "list_items":[],"list_identifiers":[],"tables":[],"table_rows":[],
  "table_cells":[],"subscripts":[],"superscripts":[],"footnotes":[],
  "paragraphs":
  [
    {"id":"PARA_bc9320","parent_id":"root","text_alignment":"center",
    "children_ids":["TOKEN_132783","TOKEN_f0e333","TOKEN_dd48c3",
    "TOKEN_c9b25e","TOKEN_080303","TOKEN_ce1aa0","TOKEN_97bf62"]...
    {"id":"PARA_8e9e62","parent_id":"root",...
    ...
    {"id":"PARA_4a9fb2","parent_id":"root",...
  ]

추출된 텍스트는 단락 내의 토큰에 지정됩니다. 다음 토큰은 다음과 같이 이미지에서 The AI Ladder®라는 단어를 나타냅니다:

"tokens":[
  {"id":"TOKEN_132783","parent_id":"PARA_bc9320","style_id":"Arial_Black_10_0",
    "text":"The","bbox":{"page_number":1,"x":250.65,"y":109.3,"width":38.880005,"height":21.48999}},
  {"id":"TOKEN_f0e333","parent_id":"PARA_bc9320","style_id":"Arial_Black_10_0",
    "text":"AI","bbox":{"page_number":1,"x":295.82,"y":114.67,"width":24.109985,"height":16.290009}},
  {"id":"TOKEN_dd48c3","parent_id":"PARA_bc9320","style_id":"Arial_Black_10_0",
    "text":"Ladder®","bbox":{"page_number":1,"x":325.74,"y":110.24,"width":82.66,"height":22.030006}}

목록이 표시되는 방식

목록의 구조는 ' all_structures 루트 객체의 일부인 세 개의 개별 객체로 표현됩니다:

lists': 글머리 기호 또는 번호 매기기 목록으로 형식이 지정된 목록 항목의 집합입니다.
list_items': 텍스트, 단락 또는 중첩된 목록이 있는 토큰을 포함할 수 있는 목록의 단일 항목입니다.
list_identifiers': 목록 항목을 식별하는 하이픈이나 숫자 등의 기호가 있는 토큰을 포함합니다.

다음 JSON 출력은 목록의 첫 글머리 기호에 투명도 제공이라는 텍스트가 어떻게 표시되는지 보여줍니다.

목록의 첫 번째 항목에 강조 표시된 단어 투명성 제공이 포함된 글머리 기호 목록이 있는 PDF의 스크린샷입니다.

//The lists object contains the list where the listitem is located.
"lists":[{"id":"LIST_ed036e","parent_id":"SECTION_9a3dda","children_ids":[
  "LISTITEM_c802c4",...

//The list_item object contains the list item which contains a list ID followed by several tokens.
"list_items":[{"id":"LISTITEM_c802c4","parent_id":"LIST_ed036e","children_ids":[
  "LIST_ID_781ee7","TOKEN_1df44f","TOKEN_1bcdbf",...

//The list_identifiers object contains list IDs with tokens.
"list_identifiers":[{"id":"LIST_ID_781ee7","parent_id":"LISTITEM_c802c4",
  "children_ids":["TOKEN_4a66cb"]}

//The list ID token includes a token with a hyphen.
{"id":"TOKEN_4a66cb","parent_id":"LIST_ID_781ee7","style_id":"IBM_Plex_Sans_Black_20_0",
  "text":"–","bbox":{"page_number":10,"x":994.0,"y":500.36,"width":11.76001,"height":13.639999}}

//The list item tokens include the text *Providing transparency* in them.
{"id":"TOKEN_1df44f","parent_id":"LISTITEM_c802c4","style_id":"IBM_Plex_Sans_Black_20_0",
  "text":"Providing","bbox":{"page_number":10,"x":1014.0,"y":500.36,"width":83.55994,"height":13.639999}},
{"id":"TOKEN_1bcdbf","parent_id":"LISTITEM_c802c4","style_id":"IBM_Plex_Sans_Black_20_0",
  "text":"transparency","bbox":{"page_number":10,"x":1102.2799,"y":500.36,"width":117.95801,"height":13.639999}}...

다음 Python 코드는 목록에서 텍스트를 추출하고 목록을 다시 작성하여 목록 항목을 반복하여 토큰 텍스트를 추출하는 방법을 설명합니다.

# Import required libraries
import json
import numpy as np
import pandas as pd

# Define helper functions

## Function, which finds entry in collection by key-value pair
def find_by_key(key: str, value, collection: list, unique=True):
  find = list(filter(lambda x: x[key] == value, collection))
  if unique:
    if len(find) > 1:
      raise ValueError(f"Found non-unique key-value pair.\n{find}")
    return find[0]
  else:
    return find

## Function, which flattens iterable collection of dicts
def flatten_collection(collection):
  result = []
  for val in collection.values():
    result.extend(val)
  return result

# Load the file with the extracted text
with open("/Users/janedoe/Downloads/output_retail.json") as f:
  raw_output = json.load(f)

# Get all list-related structures
all_lists = raw_output['all_structures']['lists']
all_list_items = raw_output['all_structures']['list_items']
all_list_identifiers = raw_output['all_structures']['list_identifiers']

# Get all list items from the first list in the file
list_1 = all_lists[0]
list_1_items = []

for list_item_id in list_1['children_ids']:
  list_1_items.append(find_by_key('id', list_item_id, all_list_items))

# Reconstruct the list
recon_list = []

flat_col = flatten_collection(raw_output['all_structures'])
for list_item in list_1_items:
  val = []
  for list_value_id in list_item['children_ids']:
    list_value = find_by_key('id', list_value_id, flat_col)
    #print(list_value['id'])
    if list_value['id'].startswith("LIST_ID"):
      for list_id_value_id in list_value['children_ids']:
        list_id_value = find_by_key('id', list_id_value_id, flat_col)
        if 'text' in list_id_value:
          val.append(list_id_value['text'])
    elif list_value['id'].startswith("PARA"):
      val.append("\n")
      for para_value_id in list_value['children_ids']:
        para_value = find_by_key('id', para_value_id, flat_col)
        if 'text' in para_value:
          val.append(para_value['text'])
    elif list_value['id'].startswith("TOKEN"):
      val.append(list_value['text'])
    else:
      pass
  print(' '.join(val))

테이블이 표현되는 방식

테이블의 구조는 ' all_structures 루트 객체의 일부인 세 개의 개별 객체로 표현됩니다:

tables' : 여러 테이블 행과 연결됩니다.
table_rows' : 각 테이블 행은 여러 테이블 셀과 연결됩니다.
table_cells: 각 테이블 셀에는 토큰 시퀀스, 단락과 토큰의 혼합 시퀀스 또는 목록, 단락, 토큰의 혼합 시퀀스가 포함되어 있습니다.

다음 JSON 출력은 테이블 열 제목인 워크플로우가 어떻게 표시되는지 보여줍니다.

첫 번째 열 제목인 워크플로우라는 단어가 강조 표시된 두 개의 열로 된 표가 있는 PDF의 스크린샷입니다


//The all_structures root object contains the table, which has many rows.
"all_structures":{
  ...
  "tables":[{"id":"TABLE_3bfabb","children_ids":[
    "ROW_39aa6f",...,"ROW_63472c"]}

//A separate table rows array contains table cells.
"all_structures":{
  ...
  "table_rows":[{"id":"ROW_39aa6f","parent_id":"TABLE_3bfabb","children_ids":[
    "CELL_bc1c4b","CELL_3a8cdd","CELL_03b6d3"]}

//One of the table cells is identified as a column header and contains a paragraph.
{"id":"CELL_3a8cdd","parent_id":"ROW_39aa6f","is_row_header":false,
  "is_col_header":true,"col_span":1,"row_span":1,"col_start":2,"row_start":1,
  "children_ids":["PARA_088d08"]}

//The paragraph has a token.
{"id":"PARA_088d08","parent_id":"CELL_3a8cdd","children_ids":[
  "TOKEN_b99851"],"indentation":1}

//The token contains the text *Workflows*.
{"id":"TOKEN_b99851","parent_id":"PARA_088d08","style_id":"IBM_Plex_Sans_SmBld_Black_20_0_bold",
  "text":"Workflows","bbox":{"page_number":14,"x":757.0,"y":291.44003,"width":99.15997,"height":13.96}}

다음 Python 코드는 테이블에서 텍스트를 추출하고 테이블을 재구성하여 테이블 행과 셀을 반복하여 토큰 텍스트를 추출하는 방법을 설명합니다.

# Import required libraries
import json
import numpy as np
import pandas as pd

# Define helper functions
## Function, which finds entry in collection by key-value pair
def find_by_key(key: str, value, collection: list, unique=True):
  find = list(filter(lambda x: x[key] == value, collection))
  if unique:
    if len(find) > 1:
      raise ValueError(f"Found non-unique key-value pair.\n{find}")
    return find[0]
  else:
    return find

## Function, which flattens iterable collection of dicts
def flatten_collection(collection):
  result = []
  for val in collection.values():
    result.extend(val)
  return result

# Load the file with the extracted text
with open("/Users/janedoe/Downloads/output_retail.json") as f:
  raw_output = json.load(f)

# Get all table-related structures
all_tables = raw_output['all_structures']['tables']
all_table_rows = raw_output['all_structures']['table_rows']
all_table_cells = raw_output['all_structures']['table_cells']

# Get all of the cells from the first table
table_1 = all_tables[0]
table_1_cells = []

for row_id in table_1['children_ids']:
  row = find_by_key('id', row_id, all_table_rows)
  for cell_id in row['children_ids']:
    table_1_cells.append(find_by_key('id', cell_id, all_table_cells))

# Reconstruct the first table
last_col = table_1_cells[-1]['col_start']
last_row = table_1_cells[-1]['row_start']

recon_table = np.empty([last_row, last_col], dtype=object)

flat_col = flatten_collection(raw_output['all_structures'])
for cell in all_table_cells:
  cell_col, cell_row = cell['col_start'], cell['row_start']
  for cell_value in cell['children_ids']:
    value = find_by_key('id', cell_value, flat_col)
    entries = []
    for cell_entry in value['children_ids']:
      entry = find_by_key('id', cell_entry, flat_col)
      if 'text' in entry:
        entries.append(entry['text'])
    cell_content = " ".join(entries)
  recon_table[cell_row-1][cell_col-1] = str(cell_content)

pd.DataFrame(data=recon_table[1:,:], columns=recon_table[0,:])

텍스트 추출 JSON 스키마

문서에 대해 생성된 JSON에서 정보를 추출하는 코드를 작성할 때 JSON 스키마를 참조할 수 있습니다.

참고:

' Not required '을 언급하는 설명이 있는 모든 구조는 향후 스키마 반복에서 제거될 수 있습니다. 코드에서 선택적 구조를 참조하도록 선택한 경우 나중에 스키마가 변경될 때 코드를 업데이트해야 할 수 있습니다.

{ "$defs": {
    "AssemblyJsonOutput": {
      "type": "object",
      "properties": {
        "metadata": {
          "description": "Metadata about this document.",
          "$ref": "#/$defs/Metadata"
        },
        "styles": {
          "description": "Font styles used in this document. Not required.",
          "type": "array",
          "items": {
            "$ref": "#/$defs/Style"
          }
        },
        "top_level_structures": {
          "type": "array",
          "description": "Array of ids of the top level structures which belong directly under the document",
          "items": {
            "type": "string"
          }
        },
        "all_structures": {
          "type": "object",
          "description": "An object containing lists of all structures identified in this document.",
          "$ref": "#/$defs/Structures"
        }
      },
      "required": [
        "metadata",
        "top_level_structures",
        "all_structures"
      ]
    },
    "Metadata": {
      "type": "object",
      "additionalProperties": true,
      "title": "Metadata",
      "properties": {
        "num_pages": {
          "type": "integer",
          "description": "Total number of pages in the document"
        },
        "title": {
          "type": "string",
          "description": "Document title as obtained from source document. Not required."
        },
        "keywords": {
          "type": "string",
          "description": "Keywords associated with document. Not required."
        },
        "author": {
          "type": "string",
          "description": "Author of the document. Not required."
        },
        "publication_date": {
          "type": "string",
          "description": "Best effort bases for a publication date (may be the creation date). Not required."
        },
        "subject": {
          "type": "string",
          "description": "Subject as obtained from the source document. Not required."
        },
        "charset": {
          "type": "string",
          "description": "Character set used for the output"
        }
      },
      "required": []
    },
    "Style": {
      "type": "object",
      "title": "Style",
      "properties": {
        "style_id": {
          "type": "string",
          "description": "Style Identifier which will be used for reference in other objects"
        },
        "font_size": {
          "type": "string",
          "description": "Font size"
        },
        "font_name": {
          "type": "string",
          "description": "Font name"
        },
        "is_bold": {
          "type": "string",
          "description": "Whether or not the the font is bold"
        },
        "is_italic": {
          "type": "string",
          "description": "Whether or not the the font is italic"
        }
      }
    },
    "Structures": {
      "type": "object",
      "description": "An object containing all of the flattened structures identified in the document. None of the items in this object are required.",
      "sections":    {
        "type": "array",
        "items": {
          "$ref": "#/$defs/Section"
        }
      },
      "section_titles":    {
        "type": "array",
        "items": {
          "$ref": "#/$defs/SectionTitle"
        }
      },
      "lists":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/List"
        }
      },
      "list_items":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/ListItem"
        }
      },
      "list_identifiers":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/ListIdentifier"
        }
      },
      "tables":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/Table"
        }
      },
      "table_rows":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/TableRow"
        }
      },
      "table_cells":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/TableCell"
        }
      },
      "subscripts":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/Subscript"
        }
      },
      "superscripts":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/Superscript"
        }
      },
      "footnotes":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/Footnote"
        }
      },
      "paragraphs":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/Paragraph"
        }
      },
      "tokens":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/Token"
        }
      }
    },
    "Section": {
      "type": "object",
      "title": "Section",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the section"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence"
        },
        "section_number": {
          "type": "string",
          "description": "Section identifier identified in the document"
        },
        "section_level": {
          "type": "string",
          "description": "Nesting level of section identified in the document"
        }
      }
    },
    "SectionTitle": {
      "type": "object",
      "title": "SectionTitle",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the section"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
        },
        "text_alignment": {
          "type": "string",
          "description": "Text alignment of the section title. Not required."
        }
      }
    },
    "List": {
      "type": "object",
      "title": "List",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the list "
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence"
        }
      }
    },
    "ListItem": {
      "type": "object",
      "title": "ListItem",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the list item"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
        }
      }
    },
    "ListIdentifier": {
      "type": "object",
      "title": "ListIdentifier",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the list item"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence"
        }
      }
    },
    "Table": {
      "type": "object",
      "title": "Table",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the table"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence, in this case, table rows"
        }
      }
    },
    "TableRow": {
      "type": "object",
      "title": "TableRow",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the table row"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence, in this case, table cells"
        }
      }
    },
    "TableCell": {
      "type": "object",
      "title": "TableCell",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the table cell"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "is_row_header": {
          "type": "boolean",
          "description": "Whether the cell is part of row header or not"
        },
        "is_column_header": {
          "type": "boolean",
          "description": "Whether the cell is part of column header or not"
        },
        "col_span": {
          "type": "integer",
          "description": "column span of the cell"
        },
        "row_span": {
          "type": "integer",
          "description": "row span of the cell"
        },
        "col_start": {
          "type": "integer",
          "description": "column start of the cell within the table"
        },
        "row_start": {
          "type": "integer",
          "description": "row start of the cell within the table"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence, in this case, underlying paragraphs. Not required."
        }
      }
    },
    "Subscript": {
      "type": "object",
      "title": "Subscript",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the subscript"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
        },
        "token_id_ref": {
          "type": "string",
          "description": "Id of the token to which the subscript belongs"
        }
      }
    },
    "Superscript": {
      "type": "object",
      "title": "Superscript",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the superscript"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "footnote_ref": {
          "type": "string",
          "description": "Matching footnote id found on the page"
        },
        "token_id_ref": {
          "type": "string",
          "description": "Id of the token to which the superscript belongs"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
        }
      }
    },
    "Footnote": {
      "type": "object",
      "title": "Footnote",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the footnote"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
        }
      }
    },
    "Paragraph": {
      "type": "object",
      "title": "Paragraph",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the paragraph"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence, in this case, tokens. Not required."
        },
        "text_alignment": {
          "type": "string",
          "description": "Text alignment of the paragraph. Not required."
        },
        "indentation": {
          "type": "integer",
          "description": "Paragraph indentation. Not required."
        }
      }
    },
    "Token": {
      "type": "object",
      "title": "Token",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the list identifier"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "style_id": {
          "type": "string",
          "description": "Identifier of the style object associated with this token. Not required."
        },
        "text": {
          "type": "string",
          "description": "Actual text of the token"
        },
        "bbox": {
          "type":"object",
          "description": "Not required.",
          "$ref": "#/$defs/BoundingBox"
        }
      }
    },
    "BoundingBox": {
      "type": "object",
      "title": "BoundingBox",
      "properties": {
        "page_number": {
          "description": "which page this represents",
          "type": "integer"
        },
        "x": {
          "description": "X coordinate of the bounding box",
          "type": "float"
        },
        "y": {
          "description": "X coordinate of the bounding box",
          "type": "float"
        },
        "width": {
          "description": "width of the bounding box",
          "type": "float"
        },
        "height": {
          "description": "height of the bounding box",
          "type": "float"
        }
      }
    }
  },
  "$ref": "#/$defs/AssemblyJsonOutput"
}

자세히 알아보기

프로그래밍 방식으로 파일에서 텍스트 추출하기

상위 주제: 문서에서 텍스트 추출하기