Before you start mining data and building models in SPSS Modeler, you need to prepare your data. Preparing your data
means taking the time to understand the data and processing it so that it is optimized for
use in data mining.
The quality of your data can determine the quality of your models. Preparing your data
ensures that your data is clean and ready for analysis.
SPSS Modeler is built around the Cross-Industry Standard
Process for Data Mining (CRISP-DM) methodology, which has the following phases:
Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment
The first three of these phases are where data is collected, assessed, and prepared. Some
of this work can be done in SPSS Modeler, but part of the
work in these phases happens even before working in SPSS Modeler.
Business understanding
Before starting in SPSS Modeler, it is important to gain
as much insight as possible into the business goals for data mining. For example,
understand the business perspective to determine pain points, project requirements,
business objectives for data mining, and how data mining can provide useful
information that solves business problems.
This phase happens outside of SPSS Modeler, but it can determine what data needs to
be collected and what data might be worth focusing on.
Data understanding
Understanding your data involves assessing the data and exploring it to determine the
quality of the data. Take the time to understand the data structure, relationships,
and patterns by using techniques such as data visualization, summary statistics, and
correlation analysis. This step is critical in avoiding unexpected problems during
data preparation.
SPSS Modeler has an Audit node,
which you can use for a comprehensive first look at the data. It can generate
information such as summary statistics, histograms, box plots, bar charts, pie
charts, and more. This information can be useful in gaining a preliminary
understanding of the data. It is also able to generate information about outliers,
extremes, and missing values.
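The same kind of first look can also be sketched in code. The following is a minimal illustration in plain Python, not a reproduction of what the Audit node does internally. The sample records, the field names (`age`, `income`), and the helper functions `summarize`, `iqr_outliers`, and `pearson` are all hypothetical, and the 1.5 x IQR outlier rule is one common convention chosen for this example.

```python
import math
import statistics

# Hypothetical sample records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": 41, "income": 61000},
    {"age": None, "income": 58000},
    {"age": 29, "income": 45000},
    {"age": 38, "income": 250000},
    {"age": 45, "income": 66000},
]


def summarize(rows, field):
    """Return basic summary statistics and the missing-value count for one field."""
    values = [r[field] for r in rows if r[field] is not None]
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


age_summary = summarize(rows, "age")
income_outliers = iqr_outliers([r["income"] for r in rows])

# Correlation uses only the records where both fields are present.
complete = [r for r in rows if r["age"] is not None and r["income"] is not None]
age_income_r = pearson([r["age"] for r in complete], [r["income"] for r in complete])

print(age_summary)
print("income outliers:", income_outliers)
print("age/income correlation:", round(age_income_r, 3))
```

In this sample, the missing `age` value and the unusually large `income` value are the kinds of issues worth noting before you move on to data preparation.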
If you have access to the following services on watsonx.ai, they can also be
useful:
Data Refinery
You can use Data Refinery to understand
and visualize your data.
RStudio®
RStudio is helpful for running commands
in R to explore your data.
Data preparation
Data preparation is one of the most important parts of data mining, and it can be a
significant amount of the work required for the overall project. Putting effort into
the earlier business understanding and data understanding phases can minimize some
of this work, but you still need to expend effort preparing and packaging the data
for mining.
Work through the following activities to prepare your data. These activities are
required to ensure that the data is well-prepared, clean, and ready for
analysis.
Data Cleaning
It's essential to handle missing values, remove duplicates, and correct
formatting issues.
Data Transformation
Standardize and normalize your data to ensure consistency and reduce noise.
These steps can involve scaling, z-score normalization, or one-hot
encoding.
Data Reduction
Reduce the dimensionality of your data by selecting the most relevant
features. You can use techniques such as Principal Component Analysis (PCA),
Linear Discriminant Analysis (LDA), or t-distributed Stochastic Neighbor
Embedding (t-SNE).
Data Integration
Merge data from different sources to create a more comprehensive view of
your data. You might need to join tables, merge data sets, or use data
fusion techniques.
Data Validation
Validate your data to ensure that it is accurate and reliable. You can check
for outliers, assess variability, or compare the data to external
sources.
Data Storage
Store your data in a secure, accessible, and reproducible manner. You can
use databases, data warehouses, or cloud storage solutions to store your
data.
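As a rough illustration of how some of these activities fit together, the following plain Python sketch chains cleaning, transformation, and integration on a tiny data set. It is not how SPSS Modeler nodes are implemented. The records, the `segments` lookup, and the helper functions `clean`, `zscore`, `one_hot`, and `left_join` are hypothetical, and median imputation is only one of several reasonable ways to handle missing values.

```python
import statistics

# Hypothetical raw records with inconsistent formatting, a duplicate, and a missing value.
customers = [
    {"id": 1, "region": " north ", "spend": 120.0},
    {"id": 2, "region": "South", "spend": None},
    {"id": 2, "region": "South", "spend": None},  # exact duplicate
    {"id": 3, "region": "north", "spend": 80.0},
    {"id": 4, "region": "EAST", "spend": 100.0},
]
# Hypothetical second source, keyed by customer id.
segments = {1: "retail", 3: "wholesale", 4: "retail"}


def clean(rows):
    """Normalize text formatting, drop exact duplicates, impute missing spend with the median."""
    normalized = [{**r, "region": r["region"].strip().lower()} for r in rows]
    seen, unique = set(), []
    for r in normalized:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    median = statistics.median(r["spend"] for r in unique if r["spend"] is not None)
    return [{**r, "spend": median if r["spend"] is None else r["spend"]} for r in unique]


def zscore(rows, field):
    """Add a z-score column (population standard deviation) for a numeric field."""
    values = [r[field] for r in rows]
    mean = statistics.mean(values)
    sd = statistics.pstdev(values) or 1.0  # avoid division by zero for constant fields
    return [{**r, f"{field}_z": (r[field] - mean) / sd} for r in rows]


def one_hot(rows, field):
    """Add one 0/1 indicator column per category of a categorical field."""
    categories = sorted({r[field] for r in rows})
    return [
        {**r, **{f"{field}_{c}": int(r[field] == c) for c in categories}}
        for r in rows
    ]


def left_join(rows, lookup, key, new_field):
    """Attach a value from a second source; unmatched records get None."""
    return [{**r, new_field: lookup.get(r[key])} for r in rows]


prepared = left_join(
    one_hot(zscore(clean(customers), "spend"), "region"),
    segments, "id", "segment",
)
for record in prepared:
    print(record)
```

The order matters in this sketch: formatting is normalized before duplicates are removed so that records that differ only in whitespace or case are compared consistently, and imputation happens before scaling so that the z-scores are computed on complete data.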
SPSS Modeler has several nodes that you can use for these
data preparation activities. You can use a combination of Record
Operations nodes and Field Operations nodes
to create flows that prepare the data.
If you have access to the following services, they can also be used to prepare
data.
Data Refinery
You can use Data Refinery for cleaning
and transforming data without requiring programming skills.
RStudio
You can use RStudio for running
commands in R to explore and prepare your data.
Even if the data is not their own, users should perform the same activities to
understand that data.