Handling missing values

You should decide how to treat missing values in light of your business or domain knowledge. To reduce training time and improve accuracy, you may want to remove blanks from your data set. On the other hand, the presence of blank values can itself be informative and may lead to new business opportunities or additional insights.

In choosing the best technique, you should consider the following aspects of your data:

  • Size of the data set
  • Number of fields containing blanks
  • Amount of missing information
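
The three aspects listed above can be measured before you choose a technique. The following is a minimal sketch in plain Python, not part of the product: it assumes the data is held as a list of dictionaries and that a blank is represented by None. The function name summarize_missing and the sample data are hypothetical and used only for illustration.

```python
def summarize_missing(records, fields):
    """Summarize missing values, treating None (or an absent key) as missing."""
    # Count missing values separately for each field.
    missing_per_field = {
        f: sum(1 for r in records if r.get(f) is None) for f in fields
    }
    total_cells = len(records) * len(fields)
    missing_cells = sum(missing_per_field.values())
    return {
        # Size of the data set
        "n_records": len(records),
        # Number of fields containing blanks
        "fields_with_blanks": sum(1 for c in missing_per_field.values() if c > 0),
        # Amount of missing information, as a count and as a fraction of all cells
        "missing_cells": missing_cells,
        "missing_fraction": missing_cells / total_cells if total_cells else 0.0,
        "missing_per_field": missing_per_field,
    }


# Illustrative data only.
sample = [
    {"age": 34, "income": 52000.0, "region": "north"},
    {"age": None, "income": 61000.0, "region": "south"},
    {"age": 51, "income": None, "region": "east"},
    {"age": 29, "income": None, "region": "west"},
]

summary = summarize_missing(sample, ["age", "income", "region"])
print(summary)
```

A small data set with many blanks in a few fields usually argues for dropping or imputing those fields, whereas a large data set with a few scattered blanks can often afford to drop the affected records.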

In general terms, there are two approaches you can follow:

  • You can exclude fields or records with missing values
  • You can impute, replace, or coerce missing values using a variety of methods
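
The two approaches above can be sketched outside the product as follows. This is an illustration under stated assumptions rather than a description of how any node works: blanks are represented by None, the 0.5 cutoff is an arbitrary example value, mean imputation stands in for the "variety of methods", and all function names are hypothetical.

```python
from statistics import mean

# Hypothetical cutoff: drop a field when more than half of its values are missing.
MAX_MISSING_FRACTION = 0.5


def exclude_sparse_fields(records, fields, max_missing=MAX_MISSING_FRACTION):
    """Exclusion by field: keep fields whose missing fraction is <= max_missing."""
    n = len(records)
    if n == 0:
        return list(fields)
    return [
        f for f in fields
        if sum(1 for r in records if r.get(f) is None) / n <= max_missing
    ]


def exclude_incomplete_records(records, fields):
    """Exclusion by record: keep records with a value in every listed field."""
    return [r for r in records if all(r.get(f) is not None for f in fields)]


def impute_mean(records, field):
    """Imputation: replace missing values in a numeric field with the observed mean."""
    observed = [r[field] for r in records if r.get(field) is not None]
    if not observed:
        # Nothing to compute a mean from; return unmodified copies.
        return [dict(r) for r in records]
    fill = mean(observed)
    # Build new dictionaries so the original records are left untouched.
    return [
        {**r, field: fill if r.get(field) is None else r[field]}
        for r in records
    ]


# Illustrative data only.
records = [
    {"age": 34, "income": 52000.0, "notes": None},
    {"age": None, "income": 61000.0, "notes": None},
    {"age": 51, "income": None, "notes": "vip"},
    {"age": 29, "income": 58000.0, "notes": None},
]
fields = ["age", "income", "notes"]

kept = exclude_sparse_fields(records, fields)        # drops "notes"
complete = exclude_incomplete_records(records, kept)  # drops records with any blank
imputed = impute_mean(records, "age")                 # fills the blank age
```

Exclusion is simpler but discards information, while imputation keeps every record at the cost of introducing estimated values, so the choice depends on how much data you can afford to lose.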

Both of these approaches can be largely automated using the Data Audit node. For example, you can generate a Filter node that excludes fields with too many missing values to be useful in modeling, and generate a SuperNode that imputes missing values for any or all of the fields that remain. This is the main strength of the audit: it lets you not only assess the current state of your data but also act on that assessment.