Parsing JSON structures generated by text extraction

Last updated: 07 Nov 2024

Text that is extracted from a document or an image file by using the text extraction API is written to a JSON file that contains details about the various textual and visual elements of the document. You can process the generated JSON further to extract the information that you need.

The following JSON objects are always returned when text is extracted with the text extraction API. The structures within these root objects are optional, meaning that they might or might not be returned in the output.

all_structures
metadata
top_level_structures

The styles key is another root-level object that is often returned, but it is optional. The styles object contains a list of dictionaries, each of which contains details about the fonts that are used in the document, such as the font size and style.
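
As a minimal sketch of how you might check the root objects before further processing, the following Python code verifies that the three required keys are present and indexes the optional styles list by style_id. The sample dictionary is a hypothetical, trimmed-down output with made-up font values, and the helper names missing_root_keys and index_styles are illustrative only.

```python
# Hypothetical, trimmed-down extraction output used only for illustration.
# The font_name, font_size, is_bold, and is_italic values are made up.
sample_output = {
    "metadata": {"num_pages": 1, "charset": "UTF-8"},
    "styles": [
        {
            "style_id": "Arial_Black_10_0",
            "font_name": "Arial",
            "font_size": "10",
            "is_bold": "false",
            "is_italic": "false",
        }
    ],
    "top_level_structures": ["PARA_bc9320"],
    "all_structures": {"paragraphs": [], "tokens": []},
}

REQUIRED_ROOT_KEYS = ("metadata", "top_level_structures", "all_structures")


def missing_root_keys(doc):
    """Return the required root keys that are absent from the parsed output."""
    return [key for key in REQUIRED_ROOT_KEYS if key not in doc]


def index_styles(doc):
    """Map each style_id to its style dictionary; empty if 'styles' is absent."""
    return {style["style_id"]: style for style in doc.get("styles", [])}


missing = missing_root_keys(sample_output)
styles_by_id = index_styles(sample_output)

print(missing)
print(styles_by_id["Arial_Black_10_0"]["font_name"])
```

Reading styles with a default of an empty list keeps the code working for outputs that omit the optional key.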

You can write code to extract text from the structures of interest. For more information, see the following sections:

How paragraphs are represented
How text from images is represented
How lists are represented
How tables are represented

For more details about the JSON schema, see Text extraction JSON schema.

Metadata

The metadata key is a dictionary that contains the following details about the processed document:

num_pages: Number of pages in the document.
title: Title of the document.
keywords: Keywords associated with the document.
author: Author of the document.
publication_date: Creation or publication date of the document.
subject: Subject of the document.
charset: Character set standard used in the document.

The following JSON output is an example of the structure of the metadata object for a PDF file.

"metadata":{
  "num_pages":28,
  "title":"Put AI to work for HR and talent transformation for the retail industry",
  "keywords":"",
  "author":"IBM",
  "publication_date":"",
  "subject":"Apply AI capabilities to drive your HR and talent transformation and generate better business outcomes in the retail industry.",
  "charset":"UTF-8"
}
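
The following Python sketch shows one way to read these fields defensively, given that most of them can be empty or absent. The dictionary is a subset of the example above, and the names METADATA_FIELDS and summarize_metadata are hypothetical helpers introduced for illustration.

```python
# Subset of the metadata example above; "subject" is left out on purpose
# to show how a missing field is handled.
metadata = {
    "num_pages": 28,
    "title": "Put AI to work for HR and talent transformation for the retail industry",
    "keywords": "",
    "author": "IBM",
    "publication_date": "",
    "charset": "UTF-8",
}

METADATA_FIELDS = (
    "num_pages", "title", "keywords", "author",
    "publication_date", "subject", "charset",
)


def summarize_metadata(metadata):
    """Return every known field, with None for missing or empty values."""
    summary = {}
    for field in METADATA_FIELDS:
        value = metadata.get(field)
        summary[field] = None if value == "" else value
    return summary


summary = summarize_metadata(metadata)
for field, value in summary.items():
    print(f"{field}: {value if value is not None else '(not available)'}")
```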

Structures

Two keys refer to the data structures in the parsed document:

top_level_structures: A list of IDs of the top-level data structures.
all_structures: An object that contains lists of all structures, grouped by structure type.

The all_structures key contains a list for each type of data structure that can occur in the parsed document. Some of the data structures that might be present in a parsed document are as follows:

sections: A list of all sections in the document.
section_titles: A list of the detected section titles.
lists: A collection of all lists in the document.
list_items: A collection of the list items that are present in the detected list objects.
list_identifiers: A collection of the list identifiers of the detected list objects.
tables: A list of all tables in the document.
table_rows: A list of the table rows that are present in the detected tables.
table_cells: A list of the table cells that are present in the detected table rows.
tokens: A list of plain text tokens.
subscripts: A list of subscript text instances that relate to tokens detected in the document.
superscripts: A list of superscript text instances that relate to tokens detected in the document.
footnotes: A list of footnotes.
paragraphs: A list of paragraphs.
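
To get a quick overview of what a given output contains, you can count the entries in each list and record which list each ID belongs to. The following Python code is a rough sketch under the assumption that every entry carries an id field; the sample is a trimmed, partly simplified fragment, and the function names are illustrative.

```python
# Trimmed sample of an all_structures object, for illustration only.
all_structures = {
    "sections": [
        {"id": "SECTION_9a3dda", "parent_id": "SECTION_a2ab08",
         "children_ids": ["PARA_09384c"]},
    ],
    "paragraphs": [
        {"id": "PARA_09384c", "parent_id": "SECTION_9a3dda",
         "children_ids": ["SECTION_TITLE_a5e3c2"]},
    ],
    "section_titles": [
        {"id": "SECTION_TITLE_a5e3c2", "parent_id": "PARA_09384c",
         "children_ids": ["TOKEN_48bbae"]},
    ],
    "tables": [],
    "tokens": [
        {"id": "TOKEN_48bbae", "parent_id": "SECTION_TITLE_a5e3c2",
         "text": "Collect,"},
    ],
}


def count_structures(all_structures):
    """Return the number of entries found for each structure type."""
    return {kind: len(items) for kind, items in all_structures.items()}


def kind_by_id(all_structures):
    """Map each structure ID to the name of the list that contains it."""
    return {
        item["id"]: kind
        for kind, items in all_structures.items()
        for item in items
    }


counts = count_structures(all_structures)
kinds = kind_by_id(all_structures)

print(counts)
print(kinds["PARA_09384c"])
```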

Working with the extracted JSON

You can use a JSON processing library to extract text from the different structures in the generated JSON file.

The following command returns the number of pages in the PDF, a value that is stored in a single JSON object:

cat output_retail.json | jq '.metadata.num_pages'

Note: This command uses jq, a command-line JSON processor that must be installed separately.

For some structures, such as tables and lists, the extracted text is stored in various objects within the generated JSON. You can use code to traverse the objects and extract the text of interest.
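
One possible way to traverse the objects is to index every structure by its ID and then follow children_ids recursively until you reach objects that carry a text field. The following Python code is a sketch of that idea, not a definitive implementation. The sample is trimmed from a table output, two of the referenced cells are deliberately absent, and build_index and collect_text are hypothetical helper names.

```python
# Trimmed sample: only one of the three cells referenced by the row is present.
all_structures = {
    "tables": [{"id": "TABLE_3bfabb", "children_ids": ["ROW_39aa6f"]}],
    "table_rows": [
        {"id": "ROW_39aa6f", "parent_id": "TABLE_3bfabb",
         "children_ids": ["CELL_bc1c4b", "CELL_3a8cdd", "CELL_03b6d3"]},
    ],
    "table_cells": [
        {"id": "CELL_3a8cdd", "parent_id": "ROW_39aa6f",
         "children_ids": ["PARA_088d08"]},
    ],
    "paragraphs": [
        {"id": "PARA_088d08", "parent_id": "CELL_3a8cdd",
         "children_ids": ["TOKEN_b99851"]},
    ],
    "tokens": [
        {"id": "TOKEN_b99851", "parent_id": "PARA_088d08", "text": "Workflows"},
    ],
}


def build_index(all_structures):
    """Map every structure ID to its object for constant-time lookups."""
    return {
        item["id"]: item
        for items in all_structures.values()
        for item in items
    }


def collect_text(structure_id, index):
    """Return the text found under a structure, in children_ids order."""
    node = index.get(structure_id)
    if node is None:
        return []  # unknown ID: nothing to collect
    words = [node["text"]] if "text" in node else []
    for child_id in node.get("children_ids", []):
        words.extend(collect_text(child_id, index))
    return words


index = build_index(all_structures)
table_words = collect_text("TABLE_3bfabb", index)
print(" ".join(table_words))
```

Skipping unknown IDs rather than raising an error is a design choice that suits partial or trimmed outputs; stricter code might prefer to fail fast.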

How paragraphs are represented

A single paragraph is most commonly associated with multiple tokens in sequence, where each token represents a word.

In some cases, paragraphs are associated with other structures, such as sections and lists.
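
Because the words of a sentence are stored as consecutive tokens that share a parent_id, one way to rebuild the text is to group neighboring tokens by their parent. The following Python code is a minimal sketch of that approach. The first four tokens are trimmed from a real output, the last token is hypothetical, and join_consecutive_tokens is an illustrative name.

```python
from itertools import groupby

# Tokens trimmed to the fields that matter here.
tokens = [
    {"id": "TOKEN_48bbae", "parent_id": "SECTION_TITLE_a5e3c2", "text": "Collect,"},
    {"id": "TOKEN_cc0b9c", "parent_id": "SECTION_TITLE_a5e3c2", "text": "organize,"},
    {"id": "TOKEN_d57d27", "parent_id": "SECTION_TITLE_a5e3c2", "text": "grow"},
    {"id": "TOKEN_a7d6da", "parent_id": "SECTION_TITLE_a5e3c2", "text": "data"},
    # Hypothetical token that belongs to a different parent.
    {"id": "TOKEN_000001", "parent_id": "PARA_000001", "text": "Next"},
]


def join_consecutive_tokens(tokens):
    """Return (parent_id, text) pairs for runs of tokens that share a parent."""
    runs = []
    for parent_id, run in groupby(tokens, key=lambda token: token["parent_id"]):
        runs.append((parent_id, " ".join(token["text"] for token in run)))
    return runs


runs = join_consecutive_tokens(tokens)
for parent_id, text in runs:
    print(f"{parent_id}: {text}")
```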

The following JSON output illustrates how a paragraph and its tokens are related for the sentence Collect, organize, grow data when the text is extracted from a PDF.

Screenshot of a PDF with the sentence "Collect, organize, grow data" highlighted.

//The section is listed in the top_level_structures array.
"top_level_structures":["PARA_fbdcdd",...,"SECTION_a2ab08",...],

//The section has a list of paragraphs.
{"id":"SECTION_9a3dda","parent_id":"SECTION_a2ab08","children_ids":["PARA_09384c",...

//The paragraph contains a section title.
{"id":"PARA_09384c","parent_id":"SECTION_9a3dda",
"text_alignment":"left","children_ids":["SECTION_TITLE_a5e3c2"],

//Token IDs listed for the section title.
{"id":"SECTION_TITLE_a5e3c2","parent_id":"PARA_09384c",
"text_alignment":"TBD","children_ids":[
  "TOKEN_48bbae","TOKEN_cc0b9c","TOKEN_d57d27","TOKEN_a7d6da"
]},

//Consecutive tokens with a shared parent_id contain the text from the sentence of interest.
{"id":"TOKEN_48bbae","parent_id":"SECTION_TITLE_a5e3c2","style_id":"IBM_Plex_Sans_Light_Black_32_0",
"text":"Collect,",
"bbox":{"page_number":8,"x":283.0,"y":775.2945,"width":106.43201,"height":21.44}},
{"id":"TOKEN_cc0b9c","parent_id":"SECTION_TITLE_a5e3c2","style_id":"IBM_Plex_Sans_Light_Black_32_0",
"text":"organize,",
"bbox":{"page_number":8,"x":396.984,"y":775.2945,"width":126.78082,"height":21.44}},
{"id":"TOKEN_d57d27","parent_id":"SECTION_TITLE_a5e3c2","style_id":"IBM_Plex_Sans_Light_Black_32_0",
"text":"grow",
"bbox":{"page_number":8,"x":531.31683,"y":775.2945,"width":69.823975,"height":21.44}},
{"id":"TOKEN_a7d6da","parent_id":"SECTION_TITLE_a5e3c2","style_id":"IBM_Plex_Sans_Light_Black_32_0",
"text":"data",
"bbox":{"page_number":8,"x":608.6928,"y":775.2945,"width":62.880005,"height":21.44}},

How text from images is represented

When you submit a PDF file that contains images, or an image file, to the text extraction method of the watsonx.ai API, the text from the image is represented by tokens. The tokens are typically contained by paragraph or section objects.
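
As a rough illustration, the following Python code rebuilds one line of text per top-level paragraph by joining the paragraph's tokens in children_ids order. It assumes that the top-level structures of interest are paragraphs whose children are tokens, which might not hold for every output. The sample is trimmed to three tokens, and paragraph_lines is a hypothetical helper name.

```python
# Trimmed sample: the paragraph lists only the three tokens shown here.
top_level_structures = ["PARA_bc9320"]
all_structures = {
    "paragraphs": [
        {"id": "PARA_bc9320", "parent_id": "root", "text_alignment": "center",
         "children_ids": ["TOKEN_132783", "TOKEN_f0e333", "TOKEN_dd48c3"]},
    ],
    "tokens": [
        {"id": "TOKEN_132783", "parent_id": "PARA_bc9320", "text": "The"},
        {"id": "TOKEN_f0e333", "parent_id": "PARA_bc9320", "text": "AI"},
        {"id": "TOKEN_dd48c3", "parent_id": "PARA_bc9320", "text": "Ladder®"},
    ],
}


def paragraph_lines(top_level_structures, all_structures):
    """Return one string per top-level paragraph, built from its tokens."""
    paragraphs = {p["id"]: p for p in all_structures.get("paragraphs", [])}
    tokens = {t["id"]: t for t in all_structures.get("tokens", [])}
    lines = []
    for structure_id in top_level_structures:
        paragraph = paragraphs.get(structure_id)
        if paragraph is None:
            continue  # not a paragraph, for example a section
        words = [
            tokens[token_id]["text"]
            for token_id in paragraph.get("children_ids", [])
            if token_id in tokens
        ]
        lines.append(" ".join(words))
    return lines


lines = paragraph_lines(top_level_structures, all_structures)
print(lines)
```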

Screenshot of a PNG file with text.

The following JSON excerpt illustrates how a PNG file that is submitted to the text extraction method is represented in the JSON output. The paragraph objects that contain the text tokens are available from both the top_level_structures and all_structures root objects.

"top_level_structures":
[
  "PARA_bc9320","PARA_8e9e62","PARA_b7f5cc","PARA_c75980","PARA_61a6a5","PARA_c8c2a8","PARA_8b8dd6","PARA_8c7c77","PARA_61aa92","PARA_1e6d2a","PARA_6eaa8d","PARA_cc6df5","PARA_4a9fb2"
],
"all_structures":{"sections":[],"section_titles":[],"lists":[],
  "list_items":[],"list_identifiers":[],"tables":[],"table_rows":[],
  "table_cells":[],"subscripts":[],"superscripts":[],"footnotes":[],
  "paragraphs":
  [
    {"id":"PARA_bc9320","parent_id":"root","text_alignment":"center",
    "children_ids":["TOKEN_132783","TOKEN_f0e333","TOKEN_dd48c3",
    "TOKEN_c9b25e","TOKEN_080303","TOKEN_ce1aa0","TOKEN_97bf62"]...
    {"id":"PARA_8e9e62","parent_id":"root",...
    ...
    {"id":"PARA_4a9fb2","parent_id":"root",...
  ]

The extracted text is specified in the tokens within the paragraph. The following tokens represent the words The AI Ladder® from the image:

"tokens":[
  {"id":"TOKEN_132783","parent_id":"PARA_bc9320","style_id":"Arial_Black_10_0",
    "text":"The","bbox":{"page_number":1,"x":250.65,"y":109.3,"width":38.880005,"height":21.48999}},
  {"id":"TOKEN_f0e333","parent_id":"PARA_bc9320","style_id":"Arial_Black_10_0",
    "text":"AI","bbox":{"page_number":1,"x":295.82,"y":114.67,"width":24.109985,"height":16.290009}},
  {"id":"TOKEN_dd48c3","parent_id":"PARA_bc9320","style_id":"Arial_Black_10_0",
    "text":"Ladder®","bbox":{"page_number":1,"x":325.74,"y":110.24,"width":82.66,"height":22.030006}}

How lists are represented

The structure of a list is represented by three distinct objects that are part of the all_structures root object:

lists: A set of list items that is formatted as a bulleted or numbered list.
list_items: A single item in a list, which can contain tokens with text, paragraphs, or nested lists.
list_identifiers: Contains a token with a symbol, such as a hyphen or a number, that identifies the list item.

The following JSON output illustrates how the text - Providing transparency in the first bullet point of a list is represented.

Screenshot of a PDF with a bulleted list in which the first list item includes the highlighted words Providing transparency.

//The lists object contains the list where the list item is located.
"lists":[{"id":"LIST_ed036e","parent_id":"SECTION_9a3dda","children_ids":[
  "LISTITEM_c802c4",...

//The list_items object contains the list item, which contains a list identifier followed by several tokens.
"list_items":[{"id":"LISTITEM_c802c4","parent_id":"LIST_ed036e","children_ids":[
  "LIST_ID_781ee7","TOKEN_1df44f","TOKEN_1bcdbf",...

//The list_identifiers object contains list IDs with tokens.
"list_identifiers":[{"id":"LIST_ID_781ee7","parent_id":"LISTITEM_c802c4",
  "children_ids":["TOKEN_4a66cb"]}

//The list ID token includes a token with a hyphen.
{"id":"TOKEN_4a66cb","parent_id":"LIST_ID_781ee7","style_id":"IBM_Plex_Sans_Black_20_0",
  "text":"–","bbox":{"page_number":10,"x":994.0,"y":500.36,"width":11.76001,"height":13.639999}}

//The list item tokens include the text *Providing transparency* in them.
{"id":"TOKEN_1df44f","parent_id":"LISTITEM_c802c4","style_id":"IBM_Plex_Sans_Black_20_0",
  "text":"Providing","bbox":{"page_number":10,"x":1014.0,"y":500.36,"width":83.55994,"height":13.639999}},
{"id":"TOKEN_1bcdbf","parent_id":"LISTITEM_c802c4","style_id":"IBM_Plex_Sans_Black_20_0",
  "text":"transparency","bbox":{"page_number":10,"x":1102.2799,"y":500.36,"width":117.95801,"height":13.639999}}...

The following Python code extracts text from a list and reconstructs the list to illustrate how you can loop through the list items to extract the token text.

# Import required libraries
import json

# Define helper functions

## Function, which finds entry in collection by key-value pair
def find_by_key(key: str, value, collection: list, unique=True):
  find = list(filter(lambda x: x[key] == value, collection))
  if unique:
    if len(find) > 1:
      raise ValueError(f"Found non-unique key-value pair.\n{find}")
    return find[0]
  else:
    return find

## Function, which flattens iterable collection of dicts
def flatten_collection(collection):
  result = []
  for val in collection.values():
    result.extend(val)
  return result

# Load the file with the extracted text
with open("/Users/janedoe/Downloads/output_retail.json") as f:
  raw_output = json.load(f)

# Get all list-related structures
all_lists = raw_output['all_structures']['lists']
all_list_items = raw_output['all_structures']['list_items']
all_list_identifiers = raw_output['all_structures']['list_identifiers']

# Get all list items from the first list in the file
list_1 = all_lists[0]
list_1_items = []

for list_item_id in list_1['children_ids']:
  list_1_items.append(find_by_key('id', list_item_id, all_list_items))

# Reconstruct the list
recon_list = []

flat_col = flatten_collection(raw_output['all_structures'])
for list_item in list_1_items:
  val = []
  for list_value_id in list_item['children_ids']:
    list_value = find_by_key('id', list_value_id, flat_col)
    #print(list_value['id'])
    if list_value['id'].startswith("LIST_ID"):
      for list_id_value_id in list_value['children_ids']:
        list_id_value = find_by_key('id', list_id_value_id, flat_col)
        if 'text' in list_id_value:
          val.append(list_id_value['text'])
    elif list_value['id'].startswith("PARA"):
      val.append("\n")
      for para_value_id in list_value['children_ids']:
        para_value = find_by_key('id', para_value_id, flat_col)
        if 'text' in para_value:
          val.append(para_value['text'])
    elif list_value['id'].startswith("TOKEN"):
      val.append(list_value['text'])
    else:
      pass
  print(' '.join(val))
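
The code above scans the flattened collection for every lookup. As an alternative sketch, the following Python code builds a dictionary index once and returns each list item as a pair of marker and text, which keeps the list identifier separate from the item body. The sample is trimmed from the list output shown earlier, and the helper names are hypothetical.

```python
# Trimmed sample of the list-related structures.
all_structures = {
    "lists": [
        {"id": "LIST_ed036e", "parent_id": "SECTION_9a3dda",
         "children_ids": ["LISTITEM_c802c4"]},
    ],
    "list_items": [
        {"id": "LISTITEM_c802c4", "parent_id": "LIST_ed036e",
         "children_ids": ["LIST_ID_781ee7", "TOKEN_1df44f", "TOKEN_1bcdbf"]},
    ],
    "list_identifiers": [
        {"id": "LIST_ID_781ee7", "parent_id": "LISTITEM_c802c4",
         "children_ids": ["TOKEN_4a66cb"]},
    ],
    "tokens": [
        {"id": "TOKEN_4a66cb", "parent_id": "LIST_ID_781ee7", "text": "–"},
        {"id": "TOKEN_1df44f", "parent_id": "LISTITEM_c802c4", "text": "Providing"},
        {"id": "TOKEN_1bcdbf", "parent_id": "LISTITEM_c802c4", "text": "transparency"},
    ],
}


def build_index(all_structures):
    """Map every structure ID to its object."""
    return {item["id"]: item for items in all_structures.values() for item in items}


def text_of(structure_id, index):
    """Collect the text under a structure by following children_ids."""
    node = index.get(structure_id, {})
    words = [node["text"]] if "text" in node else []
    for child_id in node.get("children_ids", []):
        words.extend(text_of(child_id, index))
    return words


def list_entries(list_obj, index):
    """Return (marker, text) pairs for the items of one list."""
    entries = []
    for item_id in list_obj.get("children_ids", []):
        item = index.get(item_id)
        if item is None:
            continue  # item not present in this (possibly trimmed) output
        marker, body = [], []
        for child_id in item.get("children_ids", []):
            target = marker if child_id.startswith("LIST_ID") else body
            target.extend(text_of(child_id, index))
        entries.append((" ".join(marker), " ".join(body)))
    return entries


index = build_index(all_structures)
entries = list_entries(all_structures["lists"][0], index)
print(entries)
```

Because text_of is recursive, paragraphs and nested lists inside a list item are flattened into the item body; code that needs to preserve nesting would have to treat those children separately.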

How tables are represented

The structure of a table is represented by three separate objects that are part of the all_structures root object:

tables: Each table is associated with multiple table rows.
table_rows: Each table row is associated with multiple table cells.
table_cells: Each table cell contains a sequence of tokens, a mixed sequence of paragraphs and tokens, or a mixed sequence of lists, paragraphs, and tokens.

The following JSON output illustrates how the table column titled Workflows is represented.

Screenshot of a PDF with a two-column table in which the title of the first column, the word Workflows, is highlighted.

//The all_structures root object contains the table, which has many rows.
"all_structures":{
  ...
  "tables":[{"id":"TABLE_3bfabb","children_ids":[
    "ROW_39aa6f",...,"ROW_63472c"]}

//A separate table rows array contains table cells.
"all_structures":{
  ...
  "table_rows":[{"id":"ROW_39aa6f","parent_id":"TABLE_3bfabb","children_ids":[
    "CELL_bc1c4b","CELL_3a8cdd","CELL_03b6d3"]}

//One of the table cells is identified as a column header and contains a paragraph.
{"id":"CELL_3a8cdd","parent_id":"ROW_39aa6f","is_row_header":false,
  "is_col_header":true,"col_span":1,"row_span":1,"col_start":2,"row_start":1,
  "children_ids":["PARA_088d08"]}

//The paragraph has a token.
{"id":"PARA_088d08","parent_id":"CELL_3a8cdd","children_ids":[
  "TOKEN_b99851"],"indentation":1}

//The token contains the text *Workflows*.
{"id":"TOKEN_b99851","parent_id":"PARA_088d08","style_id":"IBM_Plex_Sans_SmBld_Black_20_0_bold",
  "text":"Workflows","bbox":{"page_number":14,"x":757.0,"y":291.44003,"width":99.15997,"height":13.96}}

The following Python code extracts text from a table and reconstructs the table to illustrate how you can loop through the rows and cells of the table to extract the token text.

# Import required libraries
import json
import numpy as np
import pandas as pd

# Define helper functions
## Function, which finds entry in collection by key-value pair
def find_by_key(key: str, value, collection: list, unique=True):
  find = list(filter(lambda x: x[key] == value, collection))
  if unique:
    if len(find) > 1:
      raise ValueError(f"Found non-unique key-value pair.\n{find}")
    return find[0]
  else:
    return find

## Function, which flattens iterable collection of dicts
def flatten_collection(collection):
  result = []
  for val in collection.values():
    result.extend(val)
  return result

# Load the file with the extracted text
with open("/Users/janedoe/Downloads/output_retail.json") as f:
  raw_output = json.load(f)

# Get all table-related structures
all_tables = raw_output['all_structures']['tables']
all_table_rows = raw_output['all_structures']['table_rows']
all_table_cells = raw_output['all_structures']['table_cells']

# Get all of the cells from the first table
table_1 = all_tables[0]
table_1_cells = []

for row_id in table_1['children_ids']:
  row = find_by_key('id', row_id, all_table_rows)
  for cell_id in row['children_ids']:
    table_1_cells.append(find_by_key('id', cell_id, all_table_cells))

# Reconstruct the first table
# Size the grid from the largest row and column index, not from the last cell
last_col = max(cell['col_start'] for cell in table_1_cells)
last_row = max(cell['row_start'] for cell in table_1_cells)

recon_table = np.empty([last_row, last_col], dtype=object)

flat_col = flatten_collection(raw_output['all_structures'])
# Loop over the cells of the first table only, not over every cell in the document
for cell in table_1_cells:
  cell_col, cell_row = cell['col_start'], cell['row_start']
  entries = []  # text of all children of this cell
  for cell_value in cell.get('children_ids', []):
    value = find_by_key('id', cell_value, flat_col)
    if 'text' in value:  # the child is itself a token
      entries.append(value['text'])
    for cell_entry in value.get('children_ids', []):
      entry = find_by_key('id', cell_entry, flat_col)
      if 'text' in entry:
        entries.append(entry['text'])
  recon_table[cell_row-1][cell_col-1] = " ".join(entries)

# The first row supplies the column names of the DataFrame
pd.DataFrame(data=recon_table[1:,:], columns=recon_table[0,:])
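
The code above treats the first row as the header. As an alternative sketch, you can look for cells that are flagged as column headers. The sample output names the flag is_col_header while the schema names it is_column_header, so the following Python code checks both spellings; which one a given output uses is an assumption to verify against your own data. The second cell in the sample is hypothetical, and the helper names are illustrative.

```python
# Trimmed sample; the second cell is hypothetical and not from a real output.
all_structures = {
    "table_cells": [
        {"id": "CELL_3a8cdd", "parent_id": "ROW_39aa6f", "is_row_header": False,
         "is_col_header": True, "col_span": 1, "row_span": 1,
         "col_start": 2, "row_start": 1, "children_ids": ["PARA_088d08"]},
        {"id": "CELL_000001", "parent_id": "ROW_000001", "is_row_header": False,
         "is_col_header": False, "col_span": 1, "row_span": 1,
         "col_start": 2, "row_start": 2, "children_ids": []},
    ],
    "paragraphs": [
        {"id": "PARA_088d08", "parent_id": "CELL_3a8cdd",
         "children_ids": ["TOKEN_b99851"], "indentation": 1},
    ],
    "tokens": [
        {"id": "TOKEN_b99851", "parent_id": "PARA_088d08", "text": "Workflows"},
    ],
}


def is_column_header(cell):
    """Accept either spelling of the column-header flag."""
    return bool(cell.get("is_col_header", cell.get("is_column_header", False)))


def column_headers(cells, index):
    """Map col_start to header text for the cells flagged as column headers."""
    headers = {}
    for cell in cells:
        if not is_column_header(cell):
            continue
        words = []
        for child_id in cell.get("children_ids", []):
            child = index.get(child_id, {})
            if "text" in child:  # the child is a token
                words.append(child["text"])
            for grandchild_id in child.get("children_ids", []):
                grandchild = index.get(grandchild_id, {})
                if "text" in grandchild:
                    words.append(grandchild["text"])
        headers[cell["col_start"]] = " ".join(words)
    return headers


index = {item["id"]: item for items in all_structures.values() for item in items}
headers = column_headers(all_structures["table_cells"], index)
print(headers)
```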

Text extraction JSON schema

You can refer to the JSON schema when you write code to extract information from the JSON that is generated for the document.

Note:

Any structure with a description that mentions Not required might be removed in future iterations of the schema. If you choose to reference optional structures in your code, you might need to update your code later when changes are made to the schema.

{ "$defs": {
    "AssemblyJsonOutput": {
      "type": "object",
      "properties": {
        "metadata": {
          "description": "Metadata about this document.",
          "$ref": "#/$defs/Metadata"
        },
        "styles": {
          "description": "Font styles used in this document. Not required.",
          "type": "array",
          "items": {
            "$ref": "#/$defs/Style"
          }
        },
        "top_level_structures": {
          "type": "array",
          "description": "Array of ids of the top level structures which belong directly under the document",
          "items": {
            "type": "string"
          }
        },
        "all_structures": {
          "type": "object",
          "description": "An object containing lists of all structures identified in this document.",
          "$ref": "#/$defs/Structures"
        }
      },
      "required": [
        "metadata",
        "top_level_structures",
        "all_structures"
      ]
    },
    "Metadata": {
      "type": "object",
      "additionalProperties": true,
      "title": "Metadata",
      "properties": {
        "num_pages": {
          "type": "integer",
          "description": "Total number of pages in the document"
        },
        "title": {
          "type": "string",
          "description": "Document title as obtained from source document. Not required."
        },
        "keywords": {
          "type": "string",
          "description": "Keywords associated with document. Not required."
        },
        "author": {
          "type": "string",
          "description": "Author of the document. Not required."
        },
        "publication_date": {
          "type": "string",
          "description": "Best-effort basis for a publication date (may be the creation date). Not required."
        },
        "subject": {
          "type": "string",
          "description": "Subject as obtained from the source document. Not required."
        },
        "charset": {
          "type": "string",
          "description": "Character set used for the output"
        }
      },
      "required": []
    },
    "Style": {
      "type": "object",
      "title": "Style",
      "properties": {
        "style_id": {
          "type": "string",
          "description": "Style Identifier which will be used for reference in other objects"
        },
        "font_size": {
          "type": "string",
          "description": "Font size"
        },
        "font_name": {
          "type": "string",
          "description": "Font name"
        },
        "is_bold": {
          "type": "string",
          "description": "Whether or not the font is bold"
        },
        "is_italic": {
          "type": "string",
          "description": "Whether or not the font is italic"
        }
      }
    },
    "Structures": {
      "type": "object",
      "description": "An object containing all of the flattened structures identified in the document. None of the items in this object are required.",
      "sections":    {
        "type": "array",
        "items": {
          "$ref": "#/$defs/Section"
        }
      },
      "section_titles":    {
        "type": "array",
        "items": {
          "$ref": "#/$defs/SectionTitle"
        }
      },
      "lists":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/List"
        }
      },
      "list_items":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/ListItem"
        }
      },
      "list_identifiers":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/ListIdentifier"
        }
      },
      "tables":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/Table"
        }
      },
      "table_rows":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/TableRow"
        }
      },
      "table_cells":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/TableCell"
        }
      },
      "subscripts":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/Subscript"
        }
      },
      "superscripts":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/Superscript"
        }
      },
      "footnotes":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/Footnote"
        }
      },
      "paragraphs":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/Paragraph"
        }
      },
      "tokens":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/Token"
        }
      }
    },
    "Section": {
      "type": "object",
      "title": "Section",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the section"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence"
        },
        "section_number": {
          "type": "string",
          "description": "Section identifier identified in the document"
        },
        "section_level": {
          "type": "string",
          "description": "Nesting level of section identified in the document"
        }
      }
    },
    "SectionTitle": {
      "type": "object",
      "title": "SectionTitle",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the section title"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
        },
        "text_alignment": {
          "type": "string",
          "description": "Text alignment of the section title. Not required."
        }
      }
    },
    "List": {
      "type": "object",
      "title": "List",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the list "
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence"
        }
      }
    },
    "ListItem": {
      "type": "object",
      "title": "ListItem",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the list item"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
        }
      }
    },
    "ListIdentifier": {
      "type": "object",
      "title": "ListIdentifier",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the list identifier"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence"
        }
      }
    },
    "Table": {
      "type": "object",
      "title": "Table",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the table"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence, in this case, table rows"
        }
      }
    },
    "TableRow": {
      "type": "object",
      "title": "TableRow",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the table row"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence, in this case, table cells"
        }
      }
    },
    "TableCell": {
      "type": "object",
      "title": "TableCell",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the table cell"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "is_row_header": {
          "type": "boolean",
          "description": "Whether the cell is part of row header or not"
        },
        "is_column_header": {
          "type": "boolean",
          "description": "Whether the cell is part of column header or not"
        },
        "col_span": {
          "type": "integer",
          "description": "column span of the cell"
        },
        "row_span": {
          "type": "integer",
          "description": "row span of the cell"
        },
        "col_start": {
          "type": "integer",
          "description": "column start of the cell within the table"
        },
        "row_start": {
          "type": "integer",
          "description": "row start of the cell within the table"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence, in this case, underlying paragraphs. Not required."
        }
      }
    },
    "Subscript": {
      "type": "object",
      "title": "Subscript",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the subscript"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
        },
        "token_id_ref": {
          "type": "string",
          "description": "Id of the token to which the subscript belongs"
        }
      }
    },
    "Superscript": {
      "type": "object",
      "title": "Superscript",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the superscript"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "footnote_ref": {
          "type": "string",
          "description": "Matching footnote id found on the page"
        },
        "token_id_ref": {
          "type": "string",
          "description": "Id of the token to which the superscript belongs"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
        }
      }
    },
    "Footnote": {
      "type": "object",
      "title": "Footnote",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the footnote"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
        }
      }
    },
    "Paragraph": {
      "type": "object",
      "title": "Paragraph",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the paragraph"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence, in this case, tokens. Not required."
        },
        "text_alignment": {
          "type": "string",
          "description": "Text alignment of the paragraph. Not required."
        },
        "indentation": {
          "type": "integer",
          "description": "Paragraph indentation. Not required."
        }
      }
    },
    "Token": {
      "type": "object",
      "title": "Token",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the token"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "style_id": {
          "type": "string",
          "description": "Identifier of the style object associated with this token. Not required."
        },
        "text": {
          "type": "string",
          "description": "Actual text of the token"
        },
        "bbox": {
          "type":"object",
          "description": "Not required.",
          "$ref": "#/$defs/BoundingBox"
        }
      }
    },
    "BoundingBox": {
      "type": "object",
      "title": "BoundingBox",
      "properties": {
        "page_number": {
          "description": "which page this represents",
          "type": "integer"
        },
        "x": {
          "description": "X coordinate of the bounding box",
          "type": "float"
        },
        "y": {
          "description": "Y coordinate of the bounding box",
          "type": "float"
        },
        "width": {
          "description": "width of the bounding box",
          "type": "float"
        },
        "height": {
          "description": "height of the bounding box",
          "type": "float"
        }
      }
    }
  },
  "$ref": "#/$defs/AssemblyJsonOutput"
}
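
Because fields that are marked Not required might be missing now or removed later, code that reads them is more robust when it supplies defaults. The following Python code is a small sketch of that pattern for tokens, whose style_id and bbox fields are optional according to the schema. The second token is hypothetical, and describe_token is an illustrative name.

```python
tokens = [
    {"id": "TOKEN_b99851", "parent_id": "PARA_088d08",
     "style_id": "IBM_Plex_Sans_SmBld_Black_20_0_bold", "text": "Workflows",
     "bbox": {"page_number": 14, "x": 757.0, "y": 291.44003,
              "width": 99.15997, "height": 13.96}},
    # Hypothetical token without the optional style_id and bbox fields.
    {"id": "TOKEN_000001", "parent_id": "PARA_088d08", "text": "Example"},
]


def describe_token(token):
    """Read a token without assuming that optional fields are present."""
    bbox = token.get("bbox") or {}
    return {
        "text": token.get("text", ""),
        "style_id": token.get("style_id"),        # None when absent
        "page_number": bbox.get("page_number"),   # None when bbox is absent
    }


descriptions = [describe_token(token) for token in tokens]
for description in descriptions:
    print(description)
```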

Learn more

Extracting text from a file programmatically
