The Simulation Fitting node fits a set of candidate statistical distributions to each field in the data. The fit of each distribution to a field is assessed using a goodness-of-fit criterion. When a Simulation Fitting node runs, a Simulation Generate node is built (or an existing node is updated). Each field is assigned its best-fitting distribution. The Simulation Generate node can then be used to generate simulated data for each field.
Although the Simulation Fitting node is a terminal node, it neither adds output to the Outputs panel nor exports data.
Using a Simulation Fitting node to automatically create a Simulation Generate node
The first time the Simulation Fitting node is run, a Simulation Generate node is generated with an update link to the Simulation Fitting node. If the Simulation Fitting node is run again, a new Simulation Generate node is generated only if the update link has been removed. You can also use a Simulation Fitting node to update a connected Simulation Generate node. The result depends on whether the same fields are present in both nodes, and on whether the fields are unlocked in the Simulation Generate node. See Sim Gen node for more information.
A Simulation Fitting node can only have an update link to a Simulation Generate node. To define an update link to a Simulation Generate node, follow these steps:
- Right-click the Simulation Fitting node and select Define Update Link.
- Click the Simulation Generate node to which you want to define an update link.
To remove an update link between a Simulation Fitting node and a Simulation Generate node, right-click the update link and select Remove Link.
Distribution fitting
A statistical distribution is the theoretical frequency of the occurrence of values that a variable can take. In the Simulation Fitting node, a set of theoretical statistical distributions is compared to each field of data. The parameters of each theoretical distribution are adjusted to give the best fit to the data according to a measure of goodness of fit: either the Anderson-Darling criterion or the Kolmogorov-Smirnov criterion. The results of the distribution fitting by the Simulation Fitting node show which distributions were fitted, the best estimates of the parameters for each distribution, and how well each distribution fits the data. During distribution fitting, correlations between fields with numeric storage types, and contingencies between fields with a categorical distribution, are also calculated. The results of the distribution fitting are used to create a Simulation Generate node.
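The idea of scoring candidate distributions with a goodness-of-fit criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the node's actual algorithm: the three candidates, the simple parameter estimates (sample mean, standard deviation, minimum, and maximum), and the function names `ks_statistic` and `fit_candidates` are all hypothetical. It computes the Kolmogorov-Smirnov statistic, where a smaller value indicates a closer fit.

```python
import math
from statistics import NormalDist, fmean, stdev

def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of the sample and a theoretical CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def fit_candidates(sample):
    """Estimate parameters for each candidate and score it with the KS statistic."""
    mean = fmean(sample)
    normal = NormalDist(mean, stdev(sample))
    lo, hi = min(sample), max(sample)
    candidates = {
        "normal": normal.cdf,
        "uniform": lambda x: (x - lo) / (hi - lo),
        "exponential": lambda x: 1.0 - math.exp(-x / mean) if x > 0 else 0.0,
    }
    return {name: ks_statistic(sample, cdf) for name, cdf in candidates.items()}

# Evenly spaced values: the uniform candidate should give the smallest statistic.
sample = [i / 10 for i in range(1, 101)]
scores = fit_candidates(sample)
best = min(scores, key=scores.get)
print(best, scores)
```

A real fitting procedure would typically optimize the parameters rather than plug in sample moments, and would consider many more candidate distributions.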
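If your data contains missing values, you can handle them before distribution fitting in one of the following ways:
- Use an upstream node to remove records with missing values.
- Use an upstream node to impute values for missing values.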
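The two options above can be illustrated outside of the product with plain Python. This is a minimal sketch, assuming records are dictionaries and missing values are represented by `None`; the helper names `drop_incomplete` and `impute_mean` and the sample data are hypothetical, and mean imputation is just one possible imputation rule.

```python
from statistics import fmean

def drop_incomplete(records):
    """Keep only the records in which every field has a value."""
    return [r for r in records if all(v is not None for v in r.values())]

def impute_mean(records, field):
    """Replace missing values of one numeric field with the mean of the present values."""
    present = [r[field] for r in records if r[field] is not None]
    fill = fmean(present)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]

records = [
    {"age": 30, "income": 100.0},
    {"age": None, "income": 200.0},
    {"age": 50, "income": 300.0},
]
complete = drop_incomplete(records)    # two records remain
imputed = impute_mean(records, "age")  # the missing age becomes the mean of 30 and 50
```

Removing records shrinks the sample, while imputing keeps every record but can distort the shape of the fitted distribution, so the choice depends on how much data is missing.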
The role of a field is not taken into account when the distributions are fitted. For example, fields with the role Target are treated the same as fields with roles of Input, None, Both, Partition, Split, Frequency, and ID.
Fields are treated differently during distribution fitting according to their storage type and measurement level. The treatment of fields during distribution fitting is described in the following table.
Storage type (rows) / Measurement level (columns) | Continuous | Categorical, Flag, or Nominal | Ordinal | Typeless
---|---|---|---|---
String | Impossible | Categorical, dice and fixed distributions are fitted. | |
Integer, Real, Time, Date, or Timestamp | All distributions are fitted. Correlations and contingencies are calculated. | The categorical distribution is fitted. Correlations are not calculated. | Binomial, negative binomial and Poisson distributions are fitted, and correlations are calculated. | Field is ignored and not passed to the Simulation Generate node.

For fields with the storage type Unknown, the appropriate storage type is determined from the data.
Fields with the measurement level ordinal are treated like continuous fields and are included in the correlations table in the Simulation Generate node. If you want a distribution other than binomial, negative binomial, or Poisson to be fitted to an ordinal field, you must change the measurement level of the field to continuous. If you have previously defined a label for each value of an ordinal field, and then change the measurement level to continuous, the labels will be lost.
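As a rough illustration of fitting one of the count distributions named above to an ordinal field, the following sketch estimates a Poisson rate from integer codes. It assumes the usual maximum-likelihood estimate (the sample mean); the data and the function names `fit_poisson` and `poisson_pmf` are hypothetical, and nothing here is taken from the node's implementation.

```python
import math
from collections import Counter
from statistics import fmean

def fit_poisson(values):
    """Maximum-likelihood estimate of the Poisson rate, which is the sample mean."""
    return fmean(values)

def poisson_pmf(k, lam):
    """Probability of observing the count k under a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

ratings = [0, 1, 1, 2, 2, 2, 3, 3, 4, 2]  # hypothetical ordinal codes
lam = fit_poisson(ratings)
observed = Counter(ratings)

# Compare observed counts with the counts expected under the fitted distribution.
for k in sorted(observed):
    expected = len(ratings) * poisson_pmf(k, lam)
    print(k, observed[k], round(expected, 2))
```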
Fields that have a single value are treated the same as fields with multiple values during distribution fitting. Fields with the storage type time, date, or timestamp are treated as numeric.
Fitting distributions to split fields
If your data contains a split field, and you want distribution fitting to be carried out separately for each split, you must transform the data by using an upstream Restructure node. Using the Restructure node, generate a new field for each value of the split field. You can then use this restructured data for distribution fitting in the Simulation Fitting node.
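The restructuring step can be pictured with a small Python sketch. It is an illustration only, assuming records are dictionaries; the function name `restructure`, the sample data, and the generated field naming scheme (`<split field>_<value>_<value field>`) are assumptions made for this example and may not match what the Restructure node produces.

```python
def restructure(records, split_field, value_field):
    """Add one new field per split value; records from other splits get None in it."""
    split_values = sorted({r[split_field] for r in records})
    out = []
    for r in records:
        row = dict(r)
        for s in split_values:
            name = f"{split_field}_{s}_{value_field}"
            row[name] = r[value_field] if r[split_field] == s else None
        out.append(row)
    return out

records = [
    {"region": "north", "sales": 10.0},
    {"region": "south", "sales": 7.0},
    {"region": "north", "sales": 12.0},
]
wide = restructure(records, "region", "sales")

# Each generated field now holds the values of a single split and can be fitted on its own.
north = [r["region_north_sales"] for r in wide if r["region_north_sales"] is not None]
```

Because each generated field contains values from only one split, fitting a distribution to each generated field gives one fit per split.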