In data science and machine learning, high-quality data is paramount for building accurate and robust models. In many real-world scenarios, however, obtaining sufficient and diverse data is difficult because of constraints such as privacy concerns, data scarcity, or expensive data acquisition. Synthetic data generation has gained traction as a promising way to augment or replace real-world data with artificially generated data.
The effectiveness of synthetic data hinges on its quality, which makes appropriate evaluation metrics essential. Synthetic data metrics play a crucial role in assessing the fidelity, diversity, and utility of generated data.
Synthetic Data Generator uses quality, privacy, and utility metrics to help you evaluate your synthetic data.
How to evaluate your synthetic data
To evaluate your synthetic data, you can connect your Evaluate node between an Import node and a Generate node.
You can also connect your Evaluate node between two Import nodes or between two Generate nodes.
After you have connected your Evaluate node, click the Edit button.
The following subtopics explain how to choose the options for evaluating your synthetic data.
Important: Duplicate records can occur in synthetic data. You can choose the option Remove duplicate records, which removes duplicate records if they exceed 5 percent of the dataset, keeping only the first occurrence.
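The Remove duplicate records behavior described above can be sketched in plain Python. This is an illustrative sketch only, not the product's implementation: it assumes rows are hashable tuples, and the function name and the way the 5 percent threshold is applied are assumptions.

```python
def remove_duplicates(rows, threshold=0.05):
    """Drop duplicate rows, keeping only the first occurrence, but only
    when duplicates exceed the given fraction of the dataset."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    dup_fraction = 1 - len(unique) / len(rows)
    # Below the threshold, the dataset is returned unchanged.
    return unique if dup_fraction > threshold else rows
```

For example, a dataset where 40 percent of rows are duplicates is deduplicated, while one with under 5 percent duplicates passes through untouched.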
Important: If the nodes are not connected properly, you will get the error: Baseline input is required
Quality metrics
Fidelity score
Aggregates multiple metrics that reflect the similarity between the distributions of real and synthetic data for individual columns, along with the similarity of correlations for all pairs of columns.
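To make the idea concrete, here is a minimal pure-Python sketch of one way such a score could be aggregated. The exact metrics the product combines are not documented here; this sketch assumes numeric columns, uses total variation distance between histograms for per-column similarity, and Pearson correlation differences for column pairs. All function names and formulas are illustrative assumptions.

```python
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def column_similarity(real_col, synth_col, bins=10):
    """1 minus the total variation distance between binned histograms."""
    lo = min(min(real_col), min(synth_col))
    hi = max(max(real_col), max(synth_col))
    width = (hi - lo) / bins or 1.0
    def hist(col):
        h = [0] * bins
        for v in col:
            h[min(int((v - lo) / width), bins - 1)] += 1
        return [c / len(col) for c in h]
    hr, hs = hist(real_col), hist(synth_col)
    return 1 - 0.5 * sum(abs(a - b) for a, b in zip(hr, hs))

def fidelity_score(real, synth):
    """Average per-column distribution similarity and pairwise
    correlation similarity; inputs are dicts of column name -> values."""
    cols = list(real)
    parts = [column_similarity(real[c], synth[c]) for c in cols]
    parts += [
        1 - abs(pearson(real[a], real[b]) - pearson(synth[a], synth[b])) / 2
        for a, b in combinations(cols, 2)
    ]
    return sum(parts) / len(parts)
```

Identical real and synthetic data score 1.0; data drawn from a shifted range scores markedly lower because the column histograms no longer overlap.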
Data distinguishability
Captures the ability of a binary classifier to separate real data from synthetic data. The harder it is to train such a classifier, the better the synthetic data reflects the statistical properties of the real data.
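The classifier-based idea can be illustrated with a minimal stand-in. The product's actual classifier is not documented here; this sketch substitutes a leave-one-out 1-nearest-neighbor classifier on real-versus-synthetic labels and rescales its accuracy so that 0.0 means indistinguishable (chance-level accuracy) and 1.0 means fully separable. Function name and scaling are illustrative assumptions.

```python
def distinguishability(real, synth):
    """Leave-one-out 1-NN accuracy on real-vs-synthetic labels,
    rescaled so 0.0 means indistinguishable and 1.0 fully separable."""
    data = [(row, 0) for row in real] + [(row, 1) for row in synth]
    correct = 0
    for i, (row, label) in enumerate(data):
        # Find the closest other point (squared Euclidean distance).
        nearest = min(
            (other for j, other in enumerate(data) if j != i),
            key=lambda o: sum((a - b) ** 2 for a, b in zip(row, o[0])),
        )
        correct += nearest[1] == label
    acc = correct / len(data)
    return max(0.0, 2 * acc - 1)
```

Well-separated clusters score 1.0; synthetic points interleaved with real ones score near 0.0, which corresponds to better statistical fidelity.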
Privacy metrics
Leakage score
Measures the fraction of rows in the synthetic data that are identical to some rows in the real data.
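The definition above translates almost directly into code. A minimal sketch, assuming rows are hashable tuples (the function name is illustrative):

```python
def leakage_score(real, synth):
    """Fraction of synthetic rows that exactly match some real row."""
    real_rows = set(real)
    return sum(row in real_rows for row in synth) / len(synth)
```

A score of 0.0 means no synthetic row copies a real record; higher values indicate that real records leaked verbatim into the synthetic data.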
Proximity score
Computed from the distance between points in the synthetic data and the real data. The smaller this distance, the easier it is to isolate some rows from the real data, which increases privacy risk.
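One plausible reading of this metric is the mean distance from each synthetic point to its nearest real point. The product's exact aggregation is not documented here, so this pure-Python sketch is an illustrative assumption:

```python
def proximity_score(real, synth):
    """Mean Euclidean distance from each synthetic row to its nearest
    real row; smaller values mean synthetic points sit close to real
    records, which increases privacy risk."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return sum(min(dist(s, r) for r in real) for s in synth) / len(synth)
```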
Utility metrics
Predictive utility
Measures the usefulness of the synthetic data for downstream predictive tasks. It evaluates how accurately predictive models that are trained on the synthetic data predict a selected target, using real data as the test data.
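The train-on-synthetic, test-on-real pattern can be sketched with a deliberately simple stand-in model. The product evaluates proper predictive models; here a 1-nearest-neighbor classifier substitutes for them, and the function name, row layout, and accuracy reporting are illustrative assumptions.

```python
def predictive_utility(synth, real, target_index):
    """Train a 1-NN classifier on synthetic rows and report its
    accuracy when predicting the target column on real rows."""
    def split(row):
        features = tuple(v for i, v in enumerate(row) if i != target_index)
        return features, row[target_index]

    train = [split(row) for row in synth]  # fit on synthetic data
    correct = 0
    for row in real:                       # evaluate on real data
        features, label = split(row)
        _, pred = min(
            train,
            key=lambda t: sum((a - b) ** 2 for a, b in zip(features, t[0])),
        )
        correct += pred == label
    return correct / len(real)
```

High utility means a model that never saw real data still predicts the real target well.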
Assessment level
Simple assessment
In simple assessment mode, metrics are run on a single machine learning (ML) model.
Full assessment
In full assessment mode, metrics are evaluated and averaged across multiple machine learning (ML) models whenever possible.