Generating synthetic unstructured data (beta)
With the watsonx.ai synthetic data generation API, you can create large, high-quality unstructured text datasets that mimic your organization's real data. Use the generated synthetic datasets to tune and evaluate foundation models for your specific use case.
Overview
You can use large language models (LLMs) that are trained with large datasets to generate output that is customized for your organization. However, you must tune the models with a large amount of helpful and accurate training data. A small or low-quality dataset is insufficient to successfully train models to generate output that is relevant to your specific use case.
Use the synthetic data generation API to create large unstructured text datasets by using data builder pipelines and data validators that are optimized for generating data for tuning and evaluating foundation models.
A data builder pipeline generates synthetic data in various formats that mimic the sample seed data and reference documents that you provide as input to the pipeline. Based on your use case, you can choose from the following data builder pipelines:
- Tool calling
- The tool calling data builder pipeline creates training datasets that can be used to train AI models to interact with external tools, application programming interfaces (APIs), or systems to enhance their capabilities.
- Text to SQL
- The text to SQL data builder pipeline generates synthetic SQL data triplets that contain a natural language statement describing a database operation, an equivalent SQL statement to perform the database operation, and the database schema.
- Knowledge
- The knowledge data pipeline generates question and answer (QnA) pairs based on examples in documents that are specific to a business domain.
For more information about seed data formats and choosing a data builder pipeline, see Data builder pipelines and seed data formats.
REST API
You can use the synthetic data generation (SDG) API to administer synthetic unstructured data generation. The synthetic data is generated with foundation models that are provided in watsonx.ai. The format of the generated data is based on sample seed data you provide and the data builder pipeline you use. After the foundation model generates the dataset, the data is validated against the data builder pipeline's quality requirements and stored in your project asset.
For API method details, see the watsonx.ai API reference documentation.
For more information about best practices to follow when you tune and evaluate foundation models by using the data that is generated with the API, see Best practices.
The following diagram shows the REST API workflow to generate synthetic unstructured data by providing sample seed data in a format that suits your use case.
Before you begin
To generate synthetic unstructured data programmatically, you must first complete the following setup:
- Create a project and have the Admin or Editor role in the project. Your project must have an associated watsonx.ai Runtime service instance.
- Create an IBM Cloud user API key and IBM Cloud Identity and Access Management (IAM) token. For details, see Credentials for programmatic access.
- Create a task credential.
A task credential is an API key that is used to authenticate long-running jobs that are started during the synthetic data generation procedure. You do not need to pass the task credential in the API request. For details, see Creating task credentials.
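The IAM bearer token from the previous step can also be obtained programmatically by exchanging your IBM Cloud API key at the public IAM token endpoint. The following Python sketch (standard library only) builds and sends that exchange; the api_key value is a placeholder that you must supply.

```python
import json
import urllib.parse
import urllib.request

IAM_TOKEN_URL = "https://iam.cloud.ibm.com/identity/token"

def build_iam_token_request(api_key: str) -> urllib.request.Request:
    """Build the HTTP request that exchanges an IBM Cloud API key for an IAM bearer token."""
    body = urllib.parse.urlencode({
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": api_key,
    }).encode("utf-8")
    return urllib.request.Request(
        IAM_TOKEN_URL,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/json",
        },
        method="POST",
    )

def fetch_iam_token(api_key: str) -> str:
    """Send the request and return the access_token field of the JSON response."""
    with urllib.request.urlopen(build_iam_token_request(api_key)) as resp:
        return json.load(resp)["access_token"]
```

The returned access token is what you pass in the Authorization: Bearer header of the synthetic data generation request.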
- Optional: Choose a foundation model to use to generate synthetic datasets.
The following models are certified for use with the Synthetic Data Generator service:
- granite-3-8b-instruct
- mistral-large
The API uses the granite-3-8b-instruct model by default. For model details, including billing information and API model IDs, see Supported foundation models.
Procedure
Follow these high-level steps to generate synthetic unstructured text data by using the REST API:
- Choose a data builder pipeline and upload the input seed data files to your project asset.
The format of the sample input data depends on the data builder pipeline you select. For all data builders, you must provide seed data as an input for the data generation request. For some pipelines, you must also provide reference documents. For details, see Data builder pipelines and seed data formats.
- Use the Create a synthetic unstructured data generation job REST API method to create the job configuration for your synthetic data generator asset type. You must specify the following settings in your request:
- The data builder pipeline
- A reference to your input seed data
- The number of QnA pairs to generate
You can optionally specify the API model ID of a foundation model to override the default model setting.
- Run the synthetic unstructured data generation job in one of the following ways:
- From the project's Jobs page in the UI. For details, see Creating and managing jobs.
- By using the Job Runs REST API. For details, see Data and AI Common Core API.
A job run can take a few minutes to several hours to complete, depending on the volume of the generated output, the data builder pipeline, and the model. You can monitor the status of the synthetic unstructured data generation job by clicking the job run to access the log from the Job run details page.
Attention: You incur charges for tokens that the foundation model generates. For details, see Supported foundation models.
- Download the generated output JSONL files that contain the synthetic unstructured data from your project's data asset. The generated data is formatted according to the data builder pipeline you specified in the API request to create the synthetic unstructured data generation job.
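The downloaded output files are JSONL, that is, one JSON record per line. A minimal sketch for parsing them, assuming only the JSONL convention (the exact record fields depend on the data builder pipeline, as described under Best practices):

```python
import json

def parse_jsonl(text: str) -> list:
    """Parse JSONL content: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def load_jsonl(path: str) -> list:
    """Read a downloaded output file and parse its records."""
    with open(path, encoding="utf-8") as fh:
        return parse_jsonl(fh.read())
```

Reviewing a few parsed records is a quick way to confirm that the generated data matches the format you expect before you feed it into tuning.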
Request example
For example, the following command submits a synthetic unstructured data generation request:
curl -X POST \
'https://api.{region}.dai.cloud.ibm.com/v1/synthetic_data/generation/unstructured?version=2025-04-17' \
--header 'Accept: application/json' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJraWQiOi...' \
--data @payload.json
The following is an example payload.json file that contains a request body that overrides the default foundation model:
{
"project_id": "<Your project ID>",
"name": "<Name of the job that you want to create>",
"description": "<Description of your project>",
"pipeline": "<Data builder pipeline>",
"model_id": "mistralai/mistral-large",
"parameters": {
"num_outputs_to_generate": < A value between 1 to 1000 >,
},
"seed_data_reference": {
"type": "container",
"location": {
"path": "<Input seed data file name in project asset>"
}
},
"results_reference": {
"type": "container",
"location": {
"path": "<Generated data output file name in project asset>"
}
}
}
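The request body above can be assembled and validated programmatically before you submit it. A minimal Python sketch, using the field names from the example payload and the 1 to 1000 beta limit described under Output details:

```python
def build_sdg_payload(project_id: str, name: str, pipeline: str,
                      seed_path: str, results_path: str, num_outputs: int,
                      model_id: str = None, description: str = "") -> dict:
    """Assemble the request body for a synthetic unstructured data generation job."""
    if not 1 <= num_outputs <= 1000:  # beta limit per request
        raise ValueError("num_outputs must be between 1 and 1000")
    payload = {
        "project_id": project_id,
        "name": name,
        "description": description,
        "pipeline": pipeline,
        "parameters": {"num_outputs_to_generate": num_outputs},
        "seed_data_reference": {"type": "container",
                                "location": {"path": seed_path}},
        "results_reference": {"type": "container",
                              "location": {"path": results_path}},
    }
    if model_id:  # omit to use the default granite-3-8b-instruct model
        payload["model_id"] = model_id
    return payload
```

Serialize the returned dictionary with json.dumps and write it to payload.json for the curl command shown above.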
Output details
During the beta period, you can generate a maximum of 1000 QnA pairs of synthetic data with each REST API request. To generate a larger dataset, contact the support team by opening a case in the IBM Cloud Support portal. For details, see Creating support cases in the IBM Cloud documentation.
Best practices
Follow these guidelines when you work with the synthetic data generation API:
- To select the foundation model best suited for your use case, experiment by generating a small number of QnA pairs with multiple certified foundation models. Change the following setting in your API request to adjust the number of QnA pairs that are generated:
"parameters": { "num_outputs_to_generate": 10 }
After you verify the quality of the generated output, choose a certified foundation model and proceed with generating larger datasets.
- Review the synthetic unstructured data that is produced with the API before you use the data to train your models.
- To use synthetic data to train models in the Tuning Studio, the dataset must contain input and output attributes. Based on the data builder pipeline that you used to generate the synthetic data, complete the following steps to make your dataset compatible with Tuning Studio:
- Tool calling pipeline: No changes are required; the dataset is ready to use.
- Text to SQL pipeline: Rename the utterance attribute to input and the query attribute to output.
- Knowledge pipeline: Rename the question attribute to input and the answer attribute to output.
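The renaming rules above can be applied mechanically to each record. A minimal sketch; note that the dictionary keys used to select a pipeline here (tool_calling, text_to_sql, knowledge) are illustrative labels, not the API's pipeline identifiers:

```python
# Attribute renames needed per data builder pipeline, per the list above.
# The pipeline keys are illustrative labels chosen for this sketch.
RENAMES = {
    "tool_calling": {},  # already compatible with Tuning Studio
    "text_to_sql": {"utterance": "input", "query": "output"},
    "knowledge": {"question": "input", "answer": "output"},
}

def to_tuning_studio(record: dict, pipeline: str) -> dict:
    """Return a copy of the record with attributes renamed for Tuning Studio."""
    mapping = RENAMES[pipeline]
    return {mapping.get(key, key): value for key, value in record.items()}
```

Run each parsed JSONL record through this function, then write the results back out as JSONL before uploading the dataset to Tuning Studio.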
Learn more
- Creating a project
- Adding associated services to a project
- IBM watsonx.ai API reference documentation
- Supported foundation models
- Tuning Studio
Parent topic: Preparing data