Evaluating synthetic data

Last updated: Aug 22, 2024

The effectiveness of synthetic data depends on its quality, so you need appropriate metrics to evaluate it. Synthetic data metrics assess the fidelity, diversity, and utility of the generated data.

In data science and machine learning, high-quality data is essential for building accurate and robust models. However, in many real-world scenarios it is difficult to obtain sufficient and diverse data because of privacy concerns, data scarcity, or expensive data acquisition processes. Synthetic data generation addresses these challenges by augmenting or replacing real-world data with artificially generated data.

Synthetic Data Generator uses quality, privacy, and utility metrics to help you evaluate your synthetic data.

How to evaluate your synthetic data

To evaluate your synthetic data, connect your Evaluate node between an Import node and a Generate node.

You can also connect your Evaluate node between two Import nodes or between two Generate nodes.

After you have connected your Evaluate node, click the Edit button.

Evaluate node options

The following subtopics explain how to choose the options for evaluating your synthetic data.

Important: Duplicate records can occur in synthetic data. If you select the Remove duplicate records option, duplicate records are removed when they exceed 5 percent of the dataset, and only the first occurrence of each record is kept.
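The duplicate-removal rule can be sketched in a few lines of Python. This is an illustration only, not the product's implementation: the function name remove_duplicates_if_excessive, the threshold parameter, and the choice to count every repeat of an earlier row as a duplicate are assumptions.

```python
def remove_duplicates_if_excessive(rows, threshold=0.05):
    """Drop repeated rows (keeping the first occurrence) only when the
    share of repeats exceeds `threshold`. Hypothetical helper: how the
    product measures the duplicate share is an assumption here."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row)
        if key not in seen:          # first time this record appears
            seen.add(key)
            unique_rows.append(row)
    if not rows:
        return []
    duplicate_fraction = (len(rows) - len(unique_rows)) / len(rows)
    # Below or at the threshold, the data is returned unchanged.
    return unique_rows if duplicate_fraction > threshold else list(rows)


if __name__ == "__main__":
    sample = [(1, "a"), (1, "a"), (2, "b")]
    print(remove_duplicates_if_excessive(sample))
```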

Important: If the nodes are not connected properly, you get the error: Baseline input is required

Quality metrics

Fidelity score

Aggregates multiple metrics that reflect how closely the synthetic data matches the real data: the similarity of the distributions of individual columns, along with the similarity of the correlations for all pairs of columns.
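As a rough illustration of how such an aggregate could be built, the following Python sketch averages a per-column distribution similarity with a per-pair correlation similarity. The specific choices are assumptions and are not documented as the product's formula: the Kolmogorov-Smirnov statistic for columns, Pearson correlation for pairs, and an unweighted mean. All function names are hypothetical, and only numeric, non-constant columns are handled.

```python
from bisect import bisect_right
from itertools import combinations
from math import sqrt


def ks_similarity(real, synth):
    """1 minus the two-sample Kolmogorov-Smirnov statistic
    (1.0 means the two empirical distributions coincide)."""
    r, s = sorted(real), sorted(synth)
    d = max(abs(bisect_right(r, x) / len(r) - bisect_right(s, x) / len(s))
            for x in r + s)
    return 1.0 - d


def pearson(x, y):
    """Pearson correlation; assumes neither column is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def fidelity_score(real, synth):
    """real, synth: dict mapping column name -> list of numbers.
    Returns the unweighted mean of column and pair similarities."""
    cols = list(real)
    column_scores = [ks_similarity(real[c], synth[c]) for c in cols]
    # Correlations differ by at most 2, so dividing by 2 maps to [0, 1].
    pair_scores = [
        1.0 - abs(pearson(real[a], real[b]) - pearson(synth[a], synth[b])) / 2.0
        for a, b in combinations(cols, 2)
    ]
    scores = column_scores + pair_scores
    return sum(scores) / len(scores)
```

Under these assumptions, a score of 1.0 means the column distributions and pairwise correlations match exactly, and lower values indicate larger differences.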

Data distinguishability

Captures the ability of a binary classifier to separate real data from synthetic data. The harder it is to train such a classifier, the better the quality of the synthetic data with respect to its ability to reflect the statistical properties of the real data.
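The idea can be sketched with a deliberately simple stand-in classifier. The example below labels real rows 0 and synthetic rows 1, then measures the leave-one-out accuracy of a 1-nearest-neighbor classifier on the combined data. The classifier, the function name distinguishability, and the use of Euclidean distance on numeric rows are assumptions made for illustration; the classifier that the product uses is not specified here.

```python
from math import dist


def distinguishability(real_rows, synth_rows):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier that
    tries to tell real rows (label 0) from synthetic rows (label 1).
    Values near 1.0 mean the two datasets are easy to separate;
    values near 0.5 mean the classifier does little better than chance."""
    points = [(row, 0) for row in real_rows] + [(row, 1) for row in synth_rows]
    correct = 0
    for i, (row, label) in enumerate(points):
        # Nearest neighbor among all other points (the point itself is held out).
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(row, points[j][0]),
        )
        correct += points[nearest][1] == label
    return correct / len(points)
```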

Privacy metrics

Leakage score

Measures the fraction of rows in the synthetic data that are identical to a row in the real data.
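A minimal Python sketch of this fraction follows. It assumes that rows are compared by exact equality across all columns; the function name leakage_score is hypothetical.

```python
def leakage_score(real_rows, synth_rows):
    """Fraction of synthetic rows that exactly match some real row."""
    if not synth_rows:
        return 0.0
    real_set = {tuple(row) for row in real_rows}   # set gives fast membership tests
    leaked = sum(tuple(row) in real_set for row in synth_rows)
    return leaked / len(synth_rows)


if __name__ == "__main__":
    real = [(1, "a"), (2, "b")]
    synth = [(1, "a"), (3, "c"), (2, "b"), (4, "d")]
    print(leakage_score(real, synth))  # 2 of the 4 synthetic rows also appear in real
```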

Proximity score

Computed from the distance between points in the synthetic data and the real data. The smaller this distance, the easier it is to isolate some rows from the real data, which increases privacy risk.
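One way to picture this is to compute, for each synthetic row, the distance to its nearest real row. The sketch below does that with Euclidean distance on numeric rows and summarizes the result with the minimum and the mean. The distance measure, the summary statistics, and the function names are assumptions; the product's exact formula is not given here.

```python
from math import dist
from statistics import mean


def nearest_real_distances(real_rows, synth_rows):
    """For each synthetic row, the Euclidean distance to the closest real row."""
    return [min(dist(s, r) for r in real_rows) for s in synth_rows]


def proximity_summary(real_rows, synth_rows):
    """Smaller values mean synthetic rows sit closer to real rows,
    which suggests a higher privacy risk."""
    distances = nearest_real_distances(real_rows, synth_rows)
    return {"min": min(distances), "mean": mean(distances)}
```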

Utility metrics

Predictive utility

Measures the usefulness of the synthetic data for downstream predictive tasks. Predictive models are trained on the synthetic data, and their ability to accurately predict a selected target is then evaluated using real data as test data.
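This train-on-synthetic, test-on-real procedure can be sketched as follows. A 1-nearest-neighbor predictor stands in for the predictive model, and accuracy stands in for the performance measure; both are assumptions chosen to keep the example short, and the function names are hypothetical.

```python
from math import dist


def predict_1nn(train_X, train_y, x):
    """Predict the target of x as the target of its nearest training row."""
    nearest = min(range(len(train_X)), key=lambda i: dist(train_X[i], x))
    return train_y[nearest]


def predictive_utility(synth_X, synth_y, real_X, real_y):
    """Accuracy on real data of a model that has only seen synthetic data."""
    hits = sum(
        predict_1nn(synth_X, synth_y, x) == y
        for x, y in zip(real_X, real_y)
    )
    return hits / len(real_X)
```

A value close to the accuracy that the same model reaches when trained on real data would suggest that the synthetic data preserves the relationship between the features and the target.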

Assessment level

Simple assessment

In simple assessment mode, metrics are run on a single ML (machine learning) model.

Full assessment

In full assessment mode, metrics are evaluated on multiple ML (machine learning) models and averaged across them whenever possible.
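The difference between the two assessment levels can be sketched as a choice between scoring one model and averaging the scores of several. In the example below, two k-nearest-neighbor models (k=1 and k=3) stand in for the ML models, and accuracy stands in for the metric. The models, the level parameter, and all function names are assumptions for illustration, not the product's configuration.

```python
from collections import Counter
from math import dist
from statistics import mean


def knn_model(k):
    """Build a k-nearest-neighbor predictor (a stand-in for any ML model)."""
    def predict(train_X, train_y, x):
        nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))[:k]
        return Counter(train_y[i] for i in nearest).most_common(1)[0][0]
    return predict


def accuracy(predict, train_X, train_y, test_X, test_y):
    hits = sum(predict(train_X, train_y, x) == y for x, y in zip(test_X, test_y))
    return hits / len(test_X)


def assess(models, train_X, train_y, test_X, test_y, level="simple"):
    """'simple' scores only the first model; 'full' averages over all models."""
    chosen = models[:1] if level == "simple" else models
    return mean(accuracy(m, train_X, train_y, test_X, test_y) for m in chosen)


if __name__ == "__main__":
    models = [knn_model(1), knn_model(3)]
    train_X, train_y = [(0,), (1,), (2,), (10,)], ["a", "a", "a", "b"]
    test_X, test_y = [(0.5,), (9,)], ["a", "b"]
    print(assess(models, train_X, train_y, test_X, test_y, level="simple"))
    print(assess(models, train_X, train_y, test_X, test_y, level="full"))
```

Averaging over several models makes the result less dependent on the quirks of any single model, at the cost of more computation.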