
Data builder pipelines and seed data formats

Last updated: May 08, 2025

Use various data builder pipelines to create synthetic datasets with unstructured data in different formats for tuning and evaluating models for your use case.

Choose one of the following data builder pipelines to generate synthetic datasets: Tool calling, Text to SQL, or Knowledge.

You must provide the following inputs for the data builder pipeline you specify in your unstructured data generation request:

Seed data
Provide seed data in the form of question and answer pairs that are used as inputs to the foundation model that generates synthetic datasets. The seed data guides the model to generate additional synthetic data in the same format.
Reference documents
Some data pipelines, such as the Tool calling and Knowledge pipelines, require domain-specific documents that serve as grounding documents when the foundation model is prompted to generate synthetic datasets. For example, you can provide an API specification or multiple Markdown files that contain information specific to your use case or business.

Data builder pipeline comparison

To help you choose the data builder pipeline that best fits your use case, review the comparison table.

Table 1. Differences between data builder pipelines
Tool calling
  Seed data format:
  • Instruction and response pairs
  • API specification files containing function definitions for tools
  Generated synthetic data usage: Used to fine-tune LLMs to automate workflows, interact with databases, tackle complex problem-solving tasks, make real-time decisions, and more. Best suited for agentic AI applications.

Text to SQL
  Seed data format:
  • Database operation in plain text
  • SQL statement
  • Database schema
  Generated synthetic data usage: Used to train LLMs to translate a human-readable prompt into a precise database query that applications can use directly.

Knowledge
  Seed data format:
  • Question and answer (QnA) pairs based on a knowledge base
  Generated synthetic data usage: Used to train LLMs to perform question answering, summarization, and conversational tasks based on topics in a business taxonomy.

Tool calling data pipeline

The tool calling data pipeline generates datasets that contain sample instruction and response pairs and an API specification that defines the tools a foundation model can use to generate a response. The API specification contains the list of available tools and the parameters that each function accepts.

Seed data format

Create input YAML files in the following format to define seed data and reference documents when you use the tool calling pipeline:

  • task.yaml containing seed data.

    The task YAML file contains sample question and answer pairs that are used to train a foundation model to generate synthetic datasets as follows:

    task_description: <Description of this task>
    min_func_count: <Integer. Minimum value 1>
    max_func_count: <Integer. Maximum value 4>
    created_by: <Your organization name>
    fc_spec_loaders:
      - type: fc
        file_path: <Path to API spec YAML file>
    seed_examples:
      - domain: <Your domain name>
        input: <Sample prompt 1>
        output: '<Sample response 1>'
      - domain: <Your domain name>
        input: <Sample prompt 2>
        output: '<Sample response 2>'
    
  • api-spec.yaml as a reference document.

    The API specification YAML file contains an API specification for your domain that defines the tools the foundation model uses to generate synthetic datasets.

    <Your domain-name>:
      <function-1-name>:
        description: <function-1-description>
        name: <function-1-name>
        parameters:
            properties:
              <parameter-1-name>:
                  description: <parameter-1-description>
                  type: <parameter-1-type>
              <parameter-2-name>:
                  description: <parameter-2-description>
                  type: <parameter-2-type>
            required:
            - <required parameter 1>
            - <required parameter 2>
      <function-2-name>:
        description: <function-2-description>
        name: <function-2-name>
        parameters:
            properties:
              <parameter-1-name>:
                description: <parameter-1-description>
                type: <parameter-1-type>
              <parameter-2-name>:
                description: <parameter-2-description>
                type: <parameter-2-type>
            required:
            - <required parameter 1>
            - <required parameter 2>
    
Note: You must specify the same domain name in the `task.yaml` and `api-spec.yaml` files.
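
For example, the following pair of seed files defines a hypothetical weather domain. All file names, function names, and values are illustrative only; replace them with the tools and examples for your own use case.

A sample `task.yaml`:

```yaml
# Hypothetical seed data for a weather domain (illustrative only)
task_description: Generate tool calls that answer weather questions
min_func_count: 1
max_func_count: 2
created_by: example_org
fc_spec_loaders:
  - type: fc
    file_path: api-spec.yaml
seed_examples:
  - domain: weather
    input: What is the current temperature in Paris?
    output: '[get_current_weather(location="Paris", unit="celsius")]'
  - domain: weather
    input: Will it rain in Tokyo tomorrow?
    output: '[get_forecast(location="Tokyo", days=1)]'
```

The matching `api-spec.yaml`, which uses the same `weather` domain name:

```yaml
# Hypothetical API specification for the weather domain (illustrative only)
weather:
  get_current_weather:
    description: Return the current weather conditions for a location
    name: get_current_weather
    parameters:
        properties:
          location:
              description: City name, for example "Paris"
              type: string
          unit:
              description: Temperature unit, celsius or fahrenheit
              type: string
        required:
        - location
  get_forecast:
    description: Return a multi-day weather forecast for a location
    name: get_forecast
    parameters:
        properties:
          location:
              description: City name
              type: string
          days:
              description: Number of days to forecast
              type: integer
        required:
        - location
        - days
```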

Text to SQL data pipeline

The text to SQL data pipeline generates synthetic SQL data triplets. Each triplet contains a natural-language instruction for interacting with a database, the corresponding SQL query, and a database schema.

Seed data format

Create an input YAML file that contains sample plain text statements describing operations to perform on data in a relational database, the corresponding SQL queries that execute those operations, and a database schema that defines how the data is organized and stored:

task_description: <Description of this task>
seed_examples:
   - utterance: <input question 1>
     query: <sample SQL 1>
   - utterance: <input question 2>
     query: <sample SQL 2>
database:
   schema: "<Data Definition Language (DDL) statements for one or more tables. Separate each DDL statement with a semicolon>"
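
For example, a hypothetical seed file for a retail orders database might look like the following. The table names, columns, and queries are illustrative only.

```yaml
# Hypothetical text to SQL seed data (illustrative only)
task_description: Translate natural-language questions into SQL for an orders database
seed_examples:
   - utterance: How many orders were placed in 2024?
     query: SELECT COUNT(*) FROM orders WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31';
   - utterance: List the names of customers who spent more than 1000 dollars in total.
     query: SELECT c.name FROM customers c JOIN orders o ON c.id = o.customer_id GROUP BY c.name HAVING SUM(o.total) > 1000;
database:
   schema: "CREATE TABLE customers (id INT PRIMARY KEY, name VARCHAR(100)); CREATE TABLE orders (id INT PRIMARY KEY, customer_id INT, order_date DATE, total DECIMAL(10,2));"
```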

Knowledge data pipeline

The knowledge data pipeline generates instruction and response pairs based on examples in the knowledge branch in the training taxonomy of a tuned foundation model.

Seed data format

Create an input YAML file that contains sample question and answer (QnA) pairs that a person learning the subject might ask, along with references to grounding documents whose content serves as a knowledge base:

Tip: Only use content that is available from the associated grounding document text to draft the answers in the QnA pairs.
domain: <A phrase denoting your use case's domain>
task_description: "<Description of this task>"
seed_examples:
  - answer: <sample answer 1>
    question: <sample question 1>
  - answer: <sample answer 2>
    question: <sample question 2>
include:
  documents:
    <doc-set-1-name>: <Name of the knowledge document. Specify either a single document or a wildcard that matches multiple documents>
    <doc-set-2-name>: <Name of the knowledge document. Specify either a single document or a wildcard that matches multiple documents>
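
For example, a hypothetical seed file for an employee benefits knowledge base might look like the following. The domain, document names, and QnA content are illustrative only; the answers are assumed to come directly from the referenced grounding documents.

```yaml
# Hypothetical knowledge pipeline seed data (illustrative only)
domain: employee benefits
task_description: "Answer questions about company benefit policies"
seed_examples:
  - answer: Full-time employees accrue 1.5 vacation days per month.
    question: How many vacation days do full-time employees earn each month?
  - answer: Benefits enrollment opens on November 1 and closes on November 30.
    question: When is the annual benefits enrollment period?
include:
  documents:
    benefits-policies: benefits-*.md
    onboarding-guide: onboarding.md
```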

Parent topic: Generating synthetic unstructured data