In data science and machine learning, high-quality data is essential for building accurate and robust models. In many real-world scenarios, however, obtaining sufficient and diverse data is difficult because of constraints such as privacy concerns, data scarcity, or expensive data acquisition. Synthetic data generation addresses these challenges by augmenting or replacing real-world data with artificially generated data.
The effectiveness of synthetic data hinges on its quality, which must be measured with appropriate metrics. Synthetic data metrics assess the fidelity, diversity, and utility of the generated data.
Synthetic Data Generator uses quality, privacy, and utility metrics to help you evaluate your synthetic data.
How to evaluate your synthetic data
To evaluate your synthetic data, you can connect your Evaluate node between an Import node and a Generate node.
You can also connect your Evaluate node between two Import nodes or between two Generate nodes.
After you have connected your Evaluate node, click the Edit button.
The following subtopics explain how to choose the options for evaluating your synthetic data.
Quality metrics
Fidelity score
Aggregates multiple metrics that reflect how closely the synthetic data matches the real data: the similarity of the distributions for individual columns, along with the similarity of the correlations for all pairs of columns.
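The following Python sketch illustrates the idea behind an aggregated fidelity score. It is not the formula that Synthetic Data Generator uses. The function names, the histogram-based column similarity, the rescaled correlation gap, and the unweighted average are all assumptions chosen for illustration, and the sketch handles numeric columns only.

```python
from itertools import combinations

def histogram(values, lo, hi, bins):
    """Relative frequencies of values over equal-width bins spanning [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant columns
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return [c / len(values) for c in counts]

def column_similarity(real_col, synth_col, bins=5):
    """1 minus the total variation distance between the two column histograms."""
    lo = min(min(real_col), min(synth_col))
    hi = max(max(real_col), max(synth_col))
    p = histogram(real_col, lo, hi, bins)
    q = histogram(synth_col, lo, hi, bins)
    return 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric columns."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / (var_x * var_y) ** 0.5

def fidelity_score(real_rows, synth_rows, bins=5):
    """Unweighted mean of per-column and per-pair similarities, each in [0, 1]."""
    real_cols, synth_cols = list(zip(*real_rows)), list(zip(*synth_rows))
    scores = [column_similarity(r, s, bins) for r, s in zip(real_cols, synth_cols)]
    for i, j in combinations(range(len(real_cols)), 2):
        gap = abs(pearson(real_cols[i], real_cols[j])
                  - pearson(synth_cols[i], synth_cols[j]))
        scores.append(1.0 - gap / 2.0)  # correlations lie in [-1, 1], so gap <= 2
    return sum(scores) / len(scores)

# Toy data: both columns have identical distributions in the two data sets,
# but the correlation between the columns is reversed in the synthetic data.
real = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
synthetic = [(1.0, 8.0), (2.0, 6.0), (3.0, 4.0), (4.0, 2.0)]
score = fidelity_score(real, synthetic)
```

In this toy example, the two per-column similarities are perfect, but the reversed correlation pulls the aggregate down. This shows why pairwise correlations are evaluated in addition to individual column distributions.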
Data distinguishability
Captures the ability of a binary classifier to separate real data from synthetic data. The harder it is to train such a classifier, the better the quality of the synthetic data with respect to its ability to reflect the statistical properties of the real data.
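As a rough illustration of this idea, the following Python sketch trains a very simple classifier to tell real rows from synthetic rows and reports its accuracy on held-out rows. The nearest-centroid classifier, the 30% holdout, and all function names are assumptions made for this sketch. They are not the classifier or settings that Synthetic Data Generator uses. The sketch expects numeric rows and at least two rows in each data set.

```python
import math
import random

def centroid(rows):
    """Column-wise mean of a list of numeric rows."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def split(rows, test_fraction, rng):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rows = list(rows)
    rng.shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def distinguishability_accuracy(real_rows, synthetic_rows, test_fraction=0.3, seed=0):
    """Held-out accuracy of a nearest-centroid real-versus-synthetic classifier."""
    rng = random.Random(seed)
    real_train, real_test = split(real_rows, test_fraction, rng)
    synth_train, synth_test = split(synthetic_rows, test_fraction, rng)
    real_center, synth_center = centroid(real_train), centroid(synth_train)

    correct = 0
    for row in real_test:
        # A real row is classified correctly if it is at least as close to the real centroid.
        if math.dist(row, real_center) <= math.dist(row, synth_center):
            correct += 1
    for row in synth_test:
        # A synthetic row is classified correctly if it is strictly closer to the synthetic centroid.
        if math.dist(row, synth_center) < math.dist(row, real_center):
            correct += 1
    return correct / (len(real_test) + len(synth_test))

# Toy data: the synthetic rows are far from the real rows, so they are
# trivially separable and the classifier is always right.
real = [(i / 10, 1.0) for i in range(20)]
far_synthetic = [(10 + i / 10, 1.0) for i in range(20)]
accuracy_far = distinguishability_accuracy(real, far_synthetic)
```

With equally sized data sets, an accuracy close to 0.5 means that the classifier does no better than guessing, which indicates synthetic data that is hard to distinguish from real data. An accuracy close to 1.0, as in the toy example, indicates synthetic data that is easy to tell apart.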
Privacy metrics
Leakage score
Measures the fraction of rows in the synthetic data that are identical to rows in the real data.
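A minimal Python sketch of this definition follows. The function name and the representation of rows as tuples are assumptions for illustration. The product might treat details such as missing values or floating-point comparison differently.

```python
def leakage_score(real_rows, synthetic_rows):
    """Fraction of synthetic rows that exactly match at least one real row."""
    if not synthetic_rows:
        return 0.0
    real_set = {tuple(row) for row in real_rows}
    matches = sum(1 for row in synthetic_rows if tuple(row) in real_set)
    return matches / len(synthetic_rows)

real = [(34, "F", 52000), (45, "M", 61000), (29, "F", 48000)]
synthetic = [(34, "F", 52000), (41, "M", 58500), (30, "F", 47000), (45, "M", 61000)]

# Two of the four synthetic rows are exact copies of real rows.
score = leakage_score(real, synthetic)
```

A lower value is better for privacy: a score of 0 means that no synthetic row reproduces a real row exactly.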
Proximity score
Computed from the distance between points in the synthetic data and the real data. The smaller this distance, the easier it is to isolate some rows from the real data, which increases privacy risk.
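One common way to measure this kind of proximity is the distance from each synthetic row to its closest real row. The following Python sketch shows that approach under stated assumptions: it uses Euclidean distance on numeric columns that are already on comparable scales, and the function name is hypothetical. The source does not state the exact distance measure or aggregation that Synthetic Data Generator uses.

```python
import math

def nearest_real_distances(real_rows, synthetic_rows):
    """For each synthetic row, the Euclidean distance to its closest real row."""
    return [min(math.dist(s, r) for r in real_rows) for s in synthetic_rows]

real = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
synthetic = [(0.0, 0.0), (1.0, 2.0), (4.0, 3.0)]

distances = nearest_real_distances(real, synthetic)

# Assumption: summarize with the mean; the minimum or a low percentile
# would put more weight on the riskiest rows.
mean_distance = sum(distances) / len(distances)
```

The first synthetic row has a distance of 0 because it coincides with a real row. Rows with very small distances are the ones that contribute most to privacy risk.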
Utility metrics
Predictive utility
Measures the usefulness of the synthetic data for predictive downstream tasks. It trains predictive models on the synthetic data and evaluates how accurately they predict a selected target when real data is used as test data.
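The train-on-synthetic, test-on-real pattern can be sketched in Python as follows. The 1-nearest-neighbor model, the accuracy metric, and the function names are assumptions chosen to keep the example self-contained. They do not represent the models or scores that Synthetic Data Generator uses.

```python
import math

def predict_1nn(train_features, train_labels, row):
    """Label of the training row that is closest to `row` (1-nearest-neighbor)."""
    nearest = min(
        range(len(train_features)),
        key=lambda i: math.dist(train_features[i], row),
    )
    return train_labels[nearest]

def predictive_utility(synth_features, synth_labels, real_features, real_labels):
    """Accuracy on real data of a model that was fitted only on synthetic data."""
    hits = sum(
        predict_1nn(synth_features, synth_labels, row) == label
        for row, label in zip(real_features, real_labels)
    )
    return hits / len(real_features)

# The synthetic data is used for training only.
synth_X = [(1.0, 1.0), (1.5, 0.5), (8.0, 9.0), (9.0, 8.5)]
synth_y = ["low", "low", "high", "high"]

# The real data is used for testing only.
real_X = [(0.5, 1.2), (2.0, 1.0), (8.5, 8.0), (4.0, 4.0)]
real_y = ["low", "low", "high", "high"]

utility = predictive_utility(synth_X, synth_y, real_X, real_y)
```

In this toy example, the model misclassifies the real row (4.0, 4.0) because the synthetic data contains no "high" examples in that region, so the utility is lower than it would be for synthetic data that covers the real distribution well.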
Assessment level
Simple assessment
In simple assessment mode, metrics are run on a single ML (machine learning) model.
Full assessment
In full assessment mode, metrics are evaluated and averaged against multiple ML (machine learning) models whenever possible.
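The difference between the two assessment levels can be sketched in Python as follows. The two toy models, the accuracy metric, and the `assess` function with its `full` flag are hypothetical and exist only for this illustration. The source does not state which models Synthetic Data Generator uses in each mode.

```python
import math
from statistics import mean

def nearest_neighbor(train_x, train_y, row):
    """Predict the label of the closest training row."""
    nearest = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], row))
    return train_y[nearest]

def nearest_centroid(train_x, train_y, row):
    """Predict the label whose class centroid is closest to the row."""
    groups = {}
    for features, label in zip(train_x, train_y):
        groups.setdefault(label, []).append(features)
    centroids = {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in groups.items()
    }
    return min(centroids, key=lambda label: math.dist(centroids[label], row))

def accuracy(model, train_x, train_y, test_x, test_y):
    """Fraction of test rows for which the model predicts the correct label."""
    hits = sum(model(train_x, train_y, row) == label for row, label in zip(test_x, test_y))
    return hits / len(test_x)

def assess(models, train_x, train_y, test_x, test_y, full=False):
    """Simple mode scores the first model only; full mode averages over all models."""
    selected = models if full else models[:1]
    return mean(accuracy(m, train_x, train_y, test_x, test_y) for m in selected)

models = [nearest_neighbor, nearest_centroid]
synth_X, synth_y = [(0.0,), (6.0,), (10.0,), (12.0,)], ["low", "low", "high", "high"]
real_X, real_y = [(1.0,), (7.5,), (11.0,), (9.0,)], ["low", "low", "high", "high"]

simple_score = assess(models, synth_X, synth_y, real_X, real_y)
full_score = assess(models, synth_X, synth_y, real_X, real_y, full=True)
```

In this toy example, the two models disagree on the real row (7.5,), so the averaged score differs from the single-model score. Averaging over several models makes the result less dependent on the behavior of any one model, at the cost of more computation.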