Understanding and preparing data

Last updated: Dec 20, 2024

Before you start mining data and building models in SPSS Modeler, you need to prepare your data. Preparing your data means taking the time to understand the data and processing it so that it is optimized for use in data mining.

The quality of your data can determine the quality of your models. Preparing your data ensures that your data is clean and ready for analysis.

SPSS Modeler is built around the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology, which has the following phases:

Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment

The first three of these phases are where data is collected, assessed, and prepared. Some of this work can be done in SPSS Modeler, but part of it happens before you start working in SPSS Modeler.

Business understanding

Before starting in SPSS Modeler, it is important to gain as much insight as possible into the business goals for data mining. For example, understand the business perspective to determine pain points, project requirements, business objectives for data mining, and how data mining can provide useful information that solves business problems.

This phase of data collection and preparation happens outside of SPSS Modeler, but it can determine what data needs to be collected and what data is worth focusing on.

Data understanding

Understanding your data involves assessing the data and exploring it to determine the quality of the data. Take the time to understand the data structure, relationships, and patterns by using techniques such as data visualization, summary statistics, and correlation analysis. This step is critical in avoiding unexpected problems during data preparation.

SPSS Modeler has a Data Audit node, which you can use for a comprehensive first look at the data. It can generate information such as summary statistics, histograms, box plots, bar charts, pie charts, and more. This information can be useful in gaining a preliminary understanding of the data. The node can also report on outliers, extremes, and missing values.
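To make the idea of a first look at the data concrete, the following is a minimal sketch in plain Python of the kind of per-field summary described above: counts, missing values, basic statistics, and outliers. It is not how SPSS Modeler computes its results. The function name `audit_field`, the sample values, and the 1.5 × IQR outlier rule are all assumptions chosen for illustration.

```python
import statistics

def audit_field(values):
    """Summarize one numeric field: counts, basic statistics, and outliers.

    Outliers are flagged with the common 1.5 * IQR rule, which is an
    assumption of this sketch and not necessarily the rule a tool applies.
    """
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)  # quartiles
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
        "min": min(present),
        "max": max(present),
        "outliers": [v for v in present if v < low or v > high],
    }

# Hypothetical field: ages with one missing value and one extreme value.
ages = [23, 25, 31, None, 28, 35, 27, 120]
report = audit_field(ages)

for name, value in report.items():
    print(f"{name}: {value}")
```

A summary like this shows quickly whether a field needs attention (here, one missing value and one suspicious extreme) before any preparation work begins.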

If you have access to these other services on Cloud Pak for Data, they can also be useful:

Data Refinery: You can use Data Refinery to understand and visualize your data.
MANTA Automated Data Lineage: You can use MANTA Automated Data Lineage for tracking and finding the origin of data.
RStudio®: RStudio is helpful for running commands in R to explore your data.

Data preparation

Data preparation is one of the most important parts of data mining, and it can account for a significant share of the work required for the overall project. Putting effort into the earlier business understanding and data understanding phases can minimize some of this work, but you still need to expend effort preparing and packaging the data for mining.

Work through the following activities to prepare your data. These activities are required to ensure that the data is well-prepared, clean, and ready for analysis.

Data Cleaning: It's essential to handle missing values, remove duplicates, and correct formatting issues.
Data Transformation: Standardize and normalize your data to ensure consistency and reduce noise. These steps can involve scaling, z-score normalization, or one-hot encoding.
Data Reduction: Reduce the dimensionality of your data by selecting the most relevant features. You can use techniques such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), or t-distributed Stochastic Neighbor Embedding (t-SNE).
Data Integration: Merge data from different sources to create a more comprehensive view of your data. You might need to join tables, merge data sets, or use data fusion techniques.
Data Validation: Validate your data to ensure that it is accurate and reliable. You can check for outliers, assess variability, or compare the data to external sources.
Data Storage: Store your data in a secure, accessible, and reproducible manner. You can use databases, data warehouses, or cloud storage solutions to store your data.
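As an illustration of the cleaning and transformation activities in the list above, the following plain Python sketch removes duplicate records, fills missing values with the field mean, applies z-score normalization, and one-hot encodes a categorical field. The records, field names, and helper functions are hypothetical, and mean imputation is only one of several possible ways to handle missing values. In SPSS Modeler you would typically do this work with nodes rather than code.

```python
import statistics

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"id": 1, "income": 42000.0, "region": "north"},
    {"id": 2, "income": None, "region": "south"},
    {"id": 2, "income": None, "region": "south"},  # exact duplicate
    {"id": 3, "income": 58000.0, "region": "north"},
    {"id": 4, "income": 50000.0, "region": "east"},
]

def remove_duplicates(rows):
    """Data cleaning: keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def impute_mean(rows, field):
    """Data cleaning: replace missing values with the mean of present values."""
    mean = statistics.mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: mean if r[field] is None else r[field]} for r in rows]

def z_score(rows, field):
    """Data transformation: rescale a field to mean 0, standard deviation 1."""
    values = [r[field] for r in rows]
    mean, sd = statistics.mean(values), statistics.pstdev(values)
    return [{**r, field: (r[field] - mean) / sd} for r in rows]

def one_hot(rows, field):
    """Data transformation: replace a categorical field with 0/1 indicators."""
    categories = sorted({r[field] for r in rows})
    encoded = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != field}
        for c in categories:
            new[f"{field}_{c}"] = int(r[field] == c)
        encoded.append(new)
    return encoded

cleaned = impute_mean(remove_duplicates(records), "income")
prepared = one_hot(z_score(cleaned, "income"), "region")

for row in prepared:
    print(row)
```

The order matters in this sketch: duplicates are removed before imputation so that repeated records do not distort the mean, and missing values are filled before normalization so that every record has a value to rescale.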

SPSS Modeler has several nodes that you can use for these data preparation activities. You can use a combination of Record Operations nodes and Field Operations nodes to create flows that prepare the data.

If you have access to the following services, they can also be used to prepare data.

Data Refinery: You can use Data Refinery for cleaning and transforming data without requiring programming skills.
DataStage: You can use DataStage for data integration and developing flows that process and transform data.
IBM® Knowledge Catalog: You can use IBM Knowledge Catalog for analyzing and improving the quality of the data, and for assigning classifications, data classes, and business terms to your data assets.
RStudio: You can use RStudio for running commands in R to explore your data.

Even if the data is not your own, work through the same activities so that you understand it.