Creating synthetic data from production data

Last updated: Jun 21, 2024

Using the Synthetic Data Generator graphical flow editor tool, you can generate a structured synthetic data set based on your production data. You can import, anonymize, mimic (to generate synthetic data), export, and review your data.

Before you can use mimic and mask to create synthetic data, you need to create a task.

1. When the Generate synthetic tabular data flow window opens, select the use case Leverage your existing data. Click Next.

2. Select Import data. You can also drag and drop a data file into your project, or select data from a project. For more information, see Importing data.

3. After you import your data, you can use the Synthetic Data Generator graphical flow editor tool to anonymize your production data by masking it. You can disguise column names, column values, or both, when working with data that is to be included in a model downstream of the node. For example, you can use bank customer data and hide marital status.

4. You can then use the Synthetic Data Generator tool to mimic your production data. This generates synthetic data based on your production data, using a set of candidate statistical distributions to modify each column in your data.

5. You can export your synthetic data and review it. For more information, see Exporting synthetic data.
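The anonymize and mimic steps above can be illustrated with a minimal Python sketch. It does not show how the Synthetic Data Generator works internally: the function names (`mask_values`, `fit_candidates`, `mimic_column`), the three candidate distributions, and the choice of log-likelihood as the selection criterion are all assumptions made for illustration. The sketch also assumes the numeric column is not constant.

```python
import math
import random
import statistics


def mask_values(values, prefix="value"):
    """Replace each distinct value with a stable placeholder such as value_1."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, f"{prefix}_{len(mapping) + 1}")
    return [mapping[v] for v in values]


def fit_candidates(xs):
    """Return {name: (log_likelihood, sampler)} for a few candidate distributions.

    Assumes xs holds at least two different values (so sigma > 0 and hi > lo).
    """
    n = len(xs)
    mu, sigma = statistics.fmean(xs), statistics.pstdev(xs)
    lo, hi = min(xs), max(xs)
    candidates = {
        "normal": (
            sum(
                -0.5 * math.log(2 * math.pi * sigma**2)
                - (x - mu) ** 2 / (2 * sigma**2)
                for x in xs
            ),
            lambda rng: rng.gauss(mu, sigma),
        ),
        "uniform": (-n * math.log(hi - lo), lambda rng: rng.uniform(lo, hi)),
    }
    if lo >= 0 and mu > 0:  # the exponential distribution only covers values >= 0
        rate = 1 / mu
        candidates["exponential"] = (
            sum(math.log(rate) - rate * x for x in xs),
            lambda rng: rng.expovariate(rate),
        )
    return candidates


def mimic_column(xs, n_rows, seed=0):
    """Pick the candidate with the highest log-likelihood and sample from it."""
    rng = random.Random(seed)
    candidates = fit_candidates(xs)
    best = max(candidates, key=lambda name: candidates[name][0])
    sampler = candidates[best][1]
    return best, [sampler(rng) for _ in range(n_rows)]


# Hypothetical bank customer columns, used only for illustration.
marital_status = ["married", "single", "married"]
masked_status = mask_values(marital_status)

balances = [i / 100 for i in range(101)]  # evenly spread values in [0, 1]
best_name, synthetic_balances = mimic_column(balances, n_rows=5)
```

In this sketch, masking keeps the relationship between equal values (two customers with the same status still share a placeholder) while hiding the original labels, and mimicking produces new values drawn from a fitted distribution rather than copies of the production rows.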

Using differential privacy

Differential privacy protects user data from being traced back to individual users. The parameters involved are known as the privacy budget. This is a metric of privacy loss based on adding or removing one entry in a data set.
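For reference, the standard (epsilon, delta) definition of differential privacy can be written as follows. This is the general textbook formulation, not a description of the tool's internals. Here M is a randomized mechanism, D and D' are any two data sets that differ in one entry, and S is any set of possible outputs:

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon}\,\Pr[\,M(D') \in S\,] + \delta
```

A smaller epsilon forces the output distributions for D and D' closer together, so a single entry has less influence on what is released. Delta is a small additive slack on that guarantee.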

To implement differential privacy in your synthetic data created from production data:

1. Select the Mimic node. Select Edit.

2. Scroll down and select Privacy. In the Privacy section, turn on Enable differential privacy. This will ensure that no sensitive data specific to any individual is exposed in the synthetic output. You can control the level of privacy protection by adjusting the privacy budget (epsilon) and leakage (delta) parameters.

3. Adjust the Privacy budget (epsilon). The privacy budget allows you to tune the level of privacy protection required in your synthetic output. A smaller value provides greater privacy protection, with some loss in accuracy. A larger value provides greater accuracy, with less privacy protection.

4. Adjust the Privacy leakage probability (delta). Delta is usually referred to as the maximum allowable probability of a privacy leakage. Delta should be less than or equal to 1/(n*n), where n is the sample size. The smaller the delta, the better privacy is preserved.

5. Generate a Random seed. When differential privacy is enabled, this random seed value will enable you to reproduce your differentially private synthetic output. When differential privacy is disabled, the random seed value can be adjusted in the Generate node.

6. Manually adjust the Column bounds (optional). Column bounds are automatically applied, but you can manually adjust these bounds to restrict the range of values used for fitting. You can only select numeric columns.

7. After updating the Privacy options, select Save.

8. Select Run all.
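To make the roles of epsilon, delta, the random seed, and the column bounds more concrete, here is a minimal Python sketch of a differentially private mean using the Laplace mechanism. The source does not state which mechanism the Mimic node uses, so this is a generic textbook technique swapped in for illustration, and every function name here (`max_delta`, `laplace_scale`, `dp_mean`) is hypothetical. The sensitivity calculation assumes the number of rows is fixed and that one row is replaced.

```python
import random


def max_delta(n):
    """Upper limit for delta described above: 1 / (n * n) for n rows."""
    return 1 / (n * n)


def laplace_scale(lower, upper, n, epsilon):
    """Noise scale for a bounded mean: sensitivity / epsilon."""
    # Largest change that replacing one row can cause in the mean
    # once every value is restricted to [lower, upper].
    sensitivity = (upper - lower) / n
    return sensitivity / epsilon


def dp_mean(values, lower, upper, epsilon, seed):
    """Noisy mean of values after clipping them to the column bounds."""
    rng = random.Random(seed)  # same seed, same noise, same output
    clipped = [min(max(v, lower), upper) for v in values]
    scale = laplace_scale(lower, upper, len(clipped), epsilon)
    # The difference of two exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return sum(clipped) / len(clipped) + noise


# Hypothetical column of customer ages, used only for illustration.
ages = [23, 35, 41, 29, 52, 67, 38, 44, 31, 58]

strict = dp_mean(ages, lower=0, upper=100, epsilon=0.1, seed=7)
relaxed = dp_mean(ages, lower=0, upper=100, epsilon=1.0, seed=7)
delta_limit = max_delta(len(ages))
```

In this sketch, a smaller epsilon gives a larger noise scale (more privacy, less accuracy), tighter column bounds reduce the sensitivity and therefore the noise, and reusing the same seed reproduces the same noisy result.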

Note that parameters based on a synthetically generated dataset for which differential privacy was enabled will differ from the parameters in your original dataset.

Note that, after a flow run, the column bounds shown in the Generate node results are not updated, even though they were set in the differential privacy settings. This is expected behavior. If you enter a value larger or smaller than the real data column bounds, the differential privacy values are adjusted to the new values. However, the minimum and maximum column bounds are applied only to the real data, not to the generated synthetic data. The benefit is that the differential privacy results are not disrupted by specified minimum and maximum column bounds in the Generate node. Manually set minimum and maximum values could potentially result in privacy leakage.
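The behavior described in this note, where bounds restrict the real data used for fitting but not the generated output, can be sketched as follows. This is an illustrative assumption about the general pattern, not the tool's implementation: the function names are hypothetical, and a single normal distribution stands in for whatever distribution the Mimic node actually fits.

```python
import random
import statistics


def clip_to_bounds(values, lower, upper):
    """Restrict every value to the range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


def fit_and_generate(real_values, lower, upper, n_rows, seed=0):
    """Fit on bounded real data, then generate without re-applying bounds."""
    rng = random.Random(seed)
    bounded = clip_to_bounds(real_values, lower, upper)  # bounds hit real data
    mu = statistics.fmean(bounded)
    sigma = statistics.pstdev(bounded)
    # The generated values are not clipped, so some may fall outside the bounds.
    synthetic = [rng.gauss(mu, sigma) for _ in range(n_rows)]
    return bounded, synthetic


# Hypothetical account balances, used only for illustration.
real = [120.0, 80.5, 300.0, 45.0, 9000.0, 150.0, 60.0, 210.0]
bounded, synthetic = fit_and_generate(real, lower=50.0, upper=500.0, n_rows=1000)
outside = sum(1 for v in synthetic if v < 50.0 or v > 500.0)
```

Here `outside` counts the generated values that fall beyond the bounds. A nonzero count is consistent with the note above: the bounds shape the fit, but the generation step does not enforce them.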

Learn more

Creating synthetic data from a custom data schema