Data builder pipelines and seed data formats
Use various data builder pipelines to create synthetic datasets with unstructured data in different formats for tuning and evaluating models for your use case.
Choose the data builder pipeline that best fits your use case. The comparison table later in this topic describes each pipeline.
You must provide the following inputs for the data builder pipeline you specify in your unstructured data generation request:
- Seed data
- Provide seed data in the form of question and answer pairs that are used as inputs to the foundation model that generates synthetic datasets. The seed data trains the model to generate additional synthetic datasets in the same format.
- Reference documents
- Some data pipelines, like Tool calling and Knowledge pipelines, require domain-specific documents which serve as grounding documents when the foundation model is prompted to generate synthetic datasets. For example, you can provide an API specification or multiple Markdown files containing information specific to your use case or business.
Data builder pipeline comparison
To help you choose the data builder pipeline that best fits your use case, review the comparison table.
| Data builder pipeline | Seed data format | Generated synthetic data usage |
|---|---|---|
| Tool calling | • Instruction and response pairs<br>• API specification files containing function definitions for tools | Used to fine-tune LLMs to automate workflows, interact with databases, tackle complex problem-solving tasks, make real-time decisions, and more. Best suited for agentic AI applications. |
| Text to SQL | • Database operation in plain text<br>• SQL statement<br>• Database schema | Used to train LLMs to translate a human-readable prompt into a precise database query that applications can use directly. |
| Knowledge | • Question and answer (QnA) pairs based on a knowledge base | Used to train LLMs to perform question-answering, summarization, and conversational tasks based on topics in a business taxonomy. |
Tool calling data pipeline
The tool calling data pipeline generates datasets that contain sample instruction and response pairs and an API specification that defines the tools a foundation model can use to generate a response. The API specification contains the list of available tools and the parameters that each function accepts.
Seed data format
Create input YAML files in the following format to define seed data and reference documents when you use the tool calling pipeline:
- `task.yaml` containing seed data. The task YAML file contains sample question and answer pairs that are used to train a foundation model to generate synthetic datasets, as follows:

  ```yaml
  task_description: <Description of this task>
  min_func_count: <Integer. Minimum value 1>
  max_func_count: <Integer. Max value 4>
  created_by: <Your organization name>
  fc_spec_loaders:
    - type: fc
      file_path: <Path to API spec YAML file>
  seed_examples:
    - domain: <Your domain name>
      input: <Sample prompt 1>
      output: '<Sample response 1>'
    - domain: <Your domain name>
      input: <Sample prompt 2>
      output: '<Sample response 2>'
  ```
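For illustration, a filled-in `task.yaml` for a hypothetical weather domain might look like the following sketch. The domain, prompts, file path, and the bracketed function-call response format are invented examples, not values prescribed by the pipeline:

```yaml
task_description: Generate tool-calling examples for a weather assistant
min_func_count: 1
max_func_count: 2
created_by: Example Corp
fc_spec_loaders:
  - type: fc
    file_path: api-spec.yaml
seed_examples:
  - domain: weather
    input: What is the current temperature in Toronto?
    output: '[get_current_weather(location="Toronto", unit="celsius")]'
  - domain: weather
    input: Will it rain in Paris tomorrow?
    output: '[get_forecast(location="Paris", days=1)]'
```

Match the format of your `output` values to the response format that your target model expects for tool calls.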
- `api-spec.yaml` as a reference document. The API specification YAML file contains an API specification for your domain that defines the tools the foundation model uses to generate synthetic datasets:

  ```yaml
  <Your domain-name>:
    <function-1-name>:
      description: <function-1-description>
      name: <function-1-name>
      parameters:
        properties:
          <parameter-1-name>:
            description: <parameter-1-description>
            type: <parameter-1-type>
          <parameter-2-name>:
            description: <parameter-2-description>
            type: <parameter-2-type>
        required:
          - <required parameter 1>
          - <required parameter 2>
    <function-2-name>:
      description: <function-2-description>
      name: <function-2-name>
      parameters:
        properties:
          <parameter-1-name>:
            description: <parameter-1-description>
            type: <parameter-1-type>
          <parameter-2-name>:
            description: <parameter-2-description>
            type: <parameter-2-type>
        required:
          - <required parameter 1>
          - <required parameter 2>
  ```
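As a sketch, an `api-spec.yaml` for the same hypothetical weather domain could define two tools. The function and parameter names here are invented for illustration:

```yaml
weather:
  get_current_weather:
    description: Returns the current weather for a location
    name: get_current_weather
    parameters:
      properties:
        location:
          description: City name to look up
          type: string
        unit:
          description: Temperature unit, celsius or fahrenheit
          type: string
      required:
        - location
  get_forecast:
    description: Returns the weather forecast for a location
    name: get_forecast
    parameters:
      properties:
        location:
          description: City name to look up
          type: string
        days:
          description: Number of days to forecast
          type: integer
      required:
        - location
```

List a parameter under `required` only when the function cannot run without it; optional parameters stay in `properties` alone.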
Text to SQL data pipeline
The text to SQL data pipeline generates synthetic SQL data triplets. Each triplet contains an instruction for interacting with a database written in natural language, a SQL query, and a database schema.
Seed data format
Create an input YAML file that contains sample plain-text statements describing operations to perform on data stored in a relational database, the corresponding SQL queries that execute those operations, and a database schema that defines how the data is organized and stored, as follows:
```yaml
task_description: <Description of this task>
seed_examples:
  - utterance: <input question 1>
    query: <sample SQL 1>
  - utterance: <input question 2>
    query: <sample SQL 2>
database:
  schema: "<Data Definition Language (DDL) statements of one or more tables. Separate each DDL statement with a semicolon>"
```
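For illustration, a filled-in seed file for a hypothetical employee database might look like the following. The table, columns, and queries are invented examples:

```yaml
task_description: Generate text-to-SQL pairs for an employee database
seed_examples:
  - utterance: List the names of all employees in the Sales department
    query: SELECT name FROM employees WHERE department = 'Sales';
  - utterance: Count how many employees were hired after 2020
    query: SELECT COUNT(*) FROM employees WHERE hire_year > 2020;
database:
  schema: "CREATE TABLE employees (id INT PRIMARY KEY, name VARCHAR(100), department VARCHAR(50), hire_year INT);"
```

Keep each `query` consistent with the tables and columns declared in `schema`, because the generated triplets inherit that schema.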
Knowledge data pipeline
The knowledge data pipeline generates instruction and response pairs based on examples in the knowledge branch in the training taxonomy of a tuned foundation model.
Seed data format
Create an input YAML file that contains sample question and answer (QnA) pairs that a person who is learning the subject might ask, along with grounding documents whose content serves as a knowledge base, as follows:
```yaml
domain: <A phrase denoting your use case's domain>
task_description: "<Description of this task>"
seed_examples:
  - answer: <sample answer 1>
    question: <sample question 1>
  - answer: <sample answer 2>
    question: <sample question 2>
include:
  documents:
    <doc-set-1-name>: <name of the knowledge document(s). Specify either one document or a wildcard to refer to multiple documents>
    <doc-set-2-name>: <name of the knowledge document(s). Specify either one document or a wildcard to refer to multiple documents>
```
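As a sketch, a filled-in knowledge seed file for a hypothetical employee-benefits knowledge base might look like the following. The domain, QnA pairs, document-set name, and wildcard path are invented examples:

```yaml
domain: employee benefits
task_description: "Generate QnA pairs about company benefit policies"
seed_examples:
  - answer: Full-time employees accrue 20 vacation days per year.
    question: How many vacation days do full-time employees receive?
  - answer: Dental coverage begins on the first day of the month after hiring.
    question: When does dental coverage start?
include:
  documents:
    benefits-handbook: benefits/*.md
```

Each answer in `seed_examples` should be verifiable from the grounding documents listed under `include`, so that the generated QnA pairs stay anchored to the knowledge base.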
Parent topic: Generating synthetic unstructured data