The Simulation Fitting node fits a set of candidate statistical distributions to each
field in the data. The fit of each distribution to a field is assessed using a goodness of fit
criterion. When a Simulation Fitting node runs, a Simulation Generate node is built (or an existing
node is updated). Each field is assigned its best fitting distribution. The Simulation Generate node
can then be used to generate simulated data for each field.
Although the Simulation Fitting node is a terminal node, it does not add output to the Outputs
panel, or export data.
Note: If the historical data is sparse (that is, there are many missing values), it may be difficult
for the fitting component to find enough valid values to fit distributions to the data. In cases
where the data is sparse, before fitting you should either remove the sparse fields if they are not
required, or impute the missing values. Using the QUALITY options in the Data
Audit node, you can view the number of complete records, identify which fields are sparse, and
select an imputation method. If there is an insufficient number of records for distribution
fitting, you can use a Balance node to increase the number of records.
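As an illustration of the kind of check described in this note, the following minimal Python sketch counts complete records and flags sparse fields outside the product. It is not the Data Audit node and does not use any SPSS Modeler API. The function name `audit_missing`, the `sparse_threshold` parameter, and the sample data are assumptions made for this example, and a value is treated as missing when it is `None`.

```python
def audit_missing(records, sparse_threshold=0.5):
    """Summarise missing values in a list of dict records.

    A value counts as missing when it is None. The 0.5 threshold is an
    arbitrary choice for this sketch, not a product default.
    """
    fields = sorted({name for rec in records for name in rec})
    total = len(records)
    missing = {name: 0 for name in fields}
    complete = 0
    for rec in records:
        row_missing = [name for name in fields if rec.get(name) is None]
        for name in row_missing:
            missing[name] += 1
        if not row_missing:
            complete += 1
    fraction = {
        name: (missing[name] / total if total else 0.0) for name in fields
    }
    sparse = [name for name in fields if fraction[name] > sparse_threshold]
    return {
        "complete_records": complete,
        "missing_fraction": fraction,
        "sparse_fields": sparse,
    }


# Hypothetical historical data with missing values.
history = [
    {"age": 34, "income": 52000.0, "region": "north"},
    {"age": None, "income": None, "region": "south"},
    {"age": 41, "income": None, "region": "north"},
    {"age": 29, "income": None, "region": None},
]
report = audit_missing(history)
```

In this sample, `income` is missing in three of four records, so it is the field you would either remove or impute before fitting.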
Using a Sim Fit node to automatically create a Sim Gen node
The first time the Simulation Fitting node is run, a Simulation Generate node is generated with
an update link to the Simulation Fitting node. If the Simulation Fitting node is run again, a new
Simulation Generate node will be generated only if the update link has been removed. You can also
use a Simulation Fitting node to update a connected Simulation Generate node. The result depends on
whether the same fields are present in both nodes, and whether the fields are unlocked in the
Simulation Generate node. See Sim Gen node for more information.
A Simulation Fitting node can only have an update link to a Simulation Generate node. To define
an update link to a Simulation Generate node, follow these steps:
1. Right-click the Simulation Fitting node and select Define Update Link.
2. Click the Simulation Generate node to which you want to define an update link.
To remove an update link between a Simulation Fitting node and a Simulation Generate node,
right-click the update link and select Remove Link.
Distribution fitting
A statistical distribution is the theoretical frequency of the occurrence of values that a
variable can take. In the Simulation Fitting node, a set of theoretical statistical distributions is
compared to each field of data.
The parameters of the theoretical distribution are adjusted to give the best fit to the data
according to a measurement of the goodness of fit: either the Anderson-Darling criterion or the Kolmogorov-Smirnov criterion. The
results of the distribution fitting by the Simulation Fitting node show which distributions were
fitted, the best estimates of the parameters for each distribution, and how well each distribution
fits the data. During distribution fitting, correlations between fields with numeric storage types,
and contingencies between fields with a categorical distribution, are also calculated. The results
of the distribution fitting are used to create a Simulation Generate node.
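To make the idea of ranking candidate distributions by a goodness of fit criterion concrete, the following Python sketch scores three candidates with the Kolmogorov-Smirnov statistic. It is a simplified illustration under stated assumptions, not the algorithm used by the Simulation Fitting node: parameters are estimated with simple sample estimators (mean, standard deviation, minimum, maximum) rather than adjusted to optimize the criterion, only three distributions are tried, and all function names are hypothetical.

```python
import math
import statistics


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def exponential_cdf(x, rate):
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0


def uniform_cdf(x, low, high):
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def ks_statistic(values, cdf):
    """Largest gap between the empirical CDF and a fitted CDF."""
    ordered = sorted(values)
    n = len(ordered)
    gap = 0.0
    for i, x in enumerate(ordered):
        f = cdf(x)
        # Compare against the empirical CDF just after and just before x.
        gap = max(gap, (i + 1) / n - f, f - i / n)
    return gap


def fit_candidates(values):
    """Return (name, KS statistic) pairs, best fit (smallest gap) first."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    low, high = min(values), max(values)
    if sigma == 0:
        # This sketch does not handle single-valued fields.
        raise ValueError("need at least two distinct values")
    candidates = {
        "normal": lambda x: normal_cdf(x, mu, sigma),
        "uniform": lambda x: uniform_cdf(x, low, high),
    }
    if low >= 0 and mu > 0:
        candidates["exponential"] = lambda x: exponential_cdf(x, 1.0 / mu)
    scores = {name: ks_statistic(values, cdf) for name, cdf in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Evenly spaced sample values, so the uniform candidate ranks first.
values = [i / 20 for i in range(1, 20)]
ranking = fit_candidates(values)
best_name, best_score = ranking[0]
```

A smaller statistic means a closer fit, so the first entry of `ranking` plays the role of the best fitting distribution for the field.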
Before any distributions are fitted to your data, the first 1000 records are examined for missing
values. If there are too many missing values, distribution fitting is not possible. In that case,
you must decide whether either of the following options is appropriate:
- Use an upstream node to remove records with missing values.
- Use an upstream node to impute values for the missing values.
Distribution fitting does not exclude user-missing values. If your data has user-missing values
and you want those values to be excluded from distribution fitting, then you should set those values
to system missing.
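The conversion described above can be sketched in Python as follows. This is a standalone illustration, not SPSS Modeler functionality: `None` stands in for a system-missing value, and the function name `to_system_missing`, the field names, and the codes `-99` and `-1` are assumptions chosen for the example.

```python
def to_system_missing(records, user_missing):
    """Return copies of records with user-missing codes replaced by None.

    user_missing maps a field name to the set of codes that mean
    "missing" for that field, for example {"age": {-99}}.
    """
    cleaned = []
    for rec in records:
        cleaned.append({
            name: None if value in user_missing.get(name, ()) else value
            for name, value in rec.items()
        })
    return cleaned


# Hypothetical survey data in which -99 and -1 are user-missing codes.
survey = [
    {"age": 34, "income": 52000},
    {"age": -99, "income": 48000},
    {"age": 41, "income": -1},
]
cleaned = to_system_missing(survey, {"age": {-99}, "income": {-1}})
```

Without this step, a code such as `-99` would be treated as a genuine observation and would distort the fitted distribution.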
The role of a field is not taken into account when the distributions are fitted. For example,
fields with the role Target are treated the same as fields with roles of Input, None, Both,
Partition, Split, Frequency, and ID.
Fields are treated differently during distribution fitting according to their storage type and
measurement level. The treatment of fields during distribution fitting is described in the following
table.
Table 1. Distribution fitting according to storage type and measurement level of fields

Measurement levels: Continuous, Categorical, Flag, Nominal, Ordinal, Typeless

Storage type: String
  Continuous: Impossible.
  Categorical, Flag, Nominal, Ordinal: Categorical, dice and fixed distributions are fitted.

Storage type: Integer, Real, Time, Date, Timestamp
  Continuous: All distributions are fitted. Correlations and contingencies are calculated.
  Categorical, Flag, Nominal: The categorical distribution is fitted. Correlations are not calculated.
  Ordinal: Binomial, negative binomial and Poisson distributions are fitted, and correlations are calculated.
  Typeless: Field is ignored and not passed to the Simulation Generate node.

Storage type: Unknown
  Appropriate storage type is determined from the data.
Fields with the measurement level ordinal are treated like continuous fields and are included in
the correlations table in the Simulation Generate node. If you want a distribution other than
binomial, negative binomial, or Poisson to be fitted to an ordinal field, you must change the
measurement level of the field to continuous. If you have previously defined a label for each value
of an ordinal field, and then change the measurement level to continuous, the labels will be
lost.
Fields that contain a single value are treated the same as fields with multiple values during
distribution fitting. Fields with the storage type time, date, or timestamp are treated as
numeric.
Fitting distributions to split fields
If your data contains a split field, and you want distribution fitting to be carried out
separately for each split, you must transform the data by using an upstream Restructure node. Using
the Restructure node, generate a new field for each value of the split field. You can then use this
restructured data for distribution fitting in the Simulation Fitting node.
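The effect of this restructuring can be pictured with the following Python sketch, which adds one new field for each value of a split field. It only illustrates the shape of the restructured data and is not the Restructure node itself. The function name `restructure_by_split`, the `field_value` naming pattern for the generated fields, and the sample data are assumptions made for this example.

```python
def restructure_by_split(records, split_field, value_field):
    """Add one new field per split value, for example sales_north.

    Each record keeps its value only in the generated field that matches
    its own split value; the other generated fields are set to None.
    """
    split_values = sorted({rec[split_field] for rec in records})
    result = []
    for rec in records:
        row = dict(rec)
        for split_value in split_values:
            name = f"{value_field}_{split_value}"
            row[name] = (
                rec[value_field] if rec[split_field] == split_value else None
            )
        result.append(row)
    return result


# Hypothetical data in which "region" is the split field.
data = [
    {"region": "north", "sales": 120},
    {"region": "south", "sales": 95},
    {"region": "north", "sales": 130},
]
restructured = restructure_by_split(data, "region", "sales")
```

Each generated field, such as `sales_north` and `sales_south`, can then be fitted as a separate field, which gives one distribution per split.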