Handling fields with missing values

If the majority of missing values are concentrated in a small number of fields, you can address them at the field level rather than at the record level. This approach also allows you to experiment with the relative importance of particular fields before deciding on an approach for handling missing values. If a field is unimportant in modeling, it probably isn't worth keeping, regardless of how many missing values it has.

For example, a market research company may collect data from a general questionnaire containing 50 questions. Two of the questions address age and political persuasion, information that many people are reluctant to give. In this case, Age and Political_persuasion have many missing values.

Field measurement level

In determining which method to use, you should also consider the measurement level of fields with missing values.

Numeric fields. For numeric field types, such as Continuous, you should always eliminate any non-numeric values before building a model, because many models won't function if blanks are included in numeric fields.
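Outside of a stream, the same cleanup can be sketched in plain Python. This is a minimal illustration rather than the product's own mechanism: it assumes the field's values are available as a Python list, and the helper name to_numeric_or_none is hypothetical. Anything that cannot be parsed as a finite number is converted to None so that it can be treated as missing.

```python
import math

def to_numeric_or_none(value):
    """Return value as a float, or None if it is blank or non-numeric.

    Hypothetical helper for illustration only.
    """
    # Booleans are technically numbers in Python; treat them as invalid here.
    if value is None or isinstance(value, bool):
        return None
    try:
        number = float(value)
    except (TypeError, ValueError):
        # Blanks and strings such as "n/a" end up here.
        return None
    # Treat NaN and infinities as missing as well.
    return number if math.isfinite(number) else None

# Example values for an Age field, including blanks and text.
ages = ["34", 51, "", "n/a", None, "27.5"]
cleaned = [to_numeric_or_none(v) for v in ages]
```

After this step, every entry in cleaned is either a float or None, so a later step can decide whether to drop or fill the missing entries.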

Categorical fields. For categorical fields, such as Nominal and Flag, altering missing values isn't required, but it generally improves the accuracy of the model. For example, a model that uses the field Sex will still function with meaningless values, such as Y and Z, but removing all values other than M and F lets the model treat the field as the two-category field it is meant to be.
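As a rough sketch of the Sex example in plain Python, the following replaces any value outside an allowed set with None. The names VALID_SEX_VALUES and clean_category are hypothetical, and trimming whitespace and uppercasing before the comparison is an assumption added for the example, not something the text above prescribes.

```python
# Assumed set of valid categories for the Sex field.
VALID_SEX_VALUES = {"M", "F"}

def clean_category(value, valid_values):
    """Return the value if it is a valid category, otherwise None.

    Hypothetical helper for illustration only.
    """
    if isinstance(value, str):
        # Assumption: " f " and "F" should count as the same category.
        value = value.strip().upper()
    return value if value in valid_values else None

sex = ["M", "F", "Y", "Z", " f ", None]
cleaned_sex = [clean_category(v, VALID_SEX_VALUES) for v in sex]
```

Mapping invalid values to None, rather than deleting the records, keeps the remaining fields of each record available for modeling.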

Screening or removing fields

To screen out fields with too many missing values, you have several options:

  • You can use a Data Audit node to filter fields based on quality.
  • You can use a Feature Selection node to screen out fields with more than a specified percentage of missing values and to rank fields based on importance relative to a specified target.
  • Instead of removing the fields, you can use a Type node to set the field role to None. This will keep the fields in the data set but exclude them from the modeling processes.
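The idea of screening by percentage of missing values can be illustrated with a short Python sketch. It does not reproduce how the Data Audit or Feature Selection nodes work internally. It assumes records are held as a list of dictionaries, that None and empty strings count as missing, and that a 50% threshold is acceptable; the function names, the threshold, and the Region field are all hypothetical.

```python
def missing_percentages(records, fields):
    """Return {field: percent of records where the field is None or blank}."""
    total = len(records)
    result = {}
    for field in fields:
        missing = sum(1 for r in records if r.get(field) in (None, ""))
        result[field] = 100.0 * missing / total if total else 0.0
    return result

def screen_fields(records, fields, max_missing_pct=50.0):
    """Split fields into (kept, dropped) using an assumed missing-value threshold."""
    pct = missing_percentages(records, fields)
    kept = [f for f in fields if pct[f] <= max_missing_pct]
    dropped = [f for f in fields if pct[f] > max_missing_pct]
    return kept, dropped

# Made-up questionnaire records for illustration.
records = [
    {"Age": 34, "Political_persuasion": None, "Region": "North"},
    {"Age": None, "Political_persuasion": None, "Region": "South"},
    {"Age": None, "Political_persuasion": "Independent", "Region": "East"},
    {"Age": None, "Political_persuasion": None, "Region": "West"},
]
fields = ["Age", "Political_persuasion", "Region"]
kept, dropped = screen_fields(records, fields, max_missing_pct=50.0)
```

In this made-up data, Age and Political_persuasion are each missing in three of four records, so both land in dropped while Region is kept. Returning the dropped list rather than deleting the fields mirrors the third option above: the fields can stay in the data and simply be left out of modeling.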