This tutorial provides an example of preparing data for analysis. Preparing data is one
of the most important steps in any data-mining project and, traditionally, one of the most
time-consuming. The Auto Data Prep node handles the task for you, analyzing your data and
identifying fixes, screening out fields that are problematic or not likely to be useful, deriving
new attributes when appropriate, and improving performance through intelligent screening
techniques.
You can use the Auto Data Prep node in a fully automated fashion, allowing the node to
choose and apply fixes, or you can preview the changes before they're made and accept or reject
them. With this node, you can ready your data for data mining quickly and easily, without the need
for prior knowledge of the statistical concepts involved. If you run the node with the default
settings, models tend to build and score more quickly.
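To make the idea concrete, the following Python sketch hand-codes a few of the kinds of fixes that the Auto Data Prep node applies automatically: screening out fields that are unlikely to be useful, filling missing values, and deriving a new attribute. It is illustrative only, not the node's actual algorithm, and the telco.csv column names used for the derived field are assumptions.

```python
# Illustrative only: a hand-rolled sketch of the kinds of fixes that the
# Auto Data Prep node applies automatically. Column names are assumptions.
import pandas as pd

df = pd.read_csv("telco.csv")

# Screen out fields that are problematic or not likely to be useful:
# columns that are mostly missing and columns with a single constant value.
mostly_missing = df.columns[df.isna().mean() > 0.5]
constant = df.columns[df.nunique(dropna=True) <= 1]
df = df.drop(columns=mostly_missing.union(constant))

# Fix missing values: median for numeric fields, mode for categorical fields.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

# Derive a new attribute when appropriate (hypothetical example field names).
if {"tenure", "longmon"}.issubset(df.columns):
    df["spend_per_month_of_tenure"] = df["longmon"] / df["tenure"].clip(lower=1)
```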
Preview the tutorial
Watch this video to preview the steps in this tutorial. There might
be slight differences in the user interface that is shown in the video. The video is intended to be
a companion to the written tutorial. This video provides a visual method to learn the concepts and
tasks in this documentation.
This tutorial uses the Automated Data Preparation flow in the sample project. The data
file used is telco.csv. This example demonstrates the increased accuracy that you can find by
using the default Auto Data Prep node settings when building models. The following image
shows the sample modeler flow.
Figure 1. Sample modeler flow
The following image shows the sample data set.
Figure 2. Sample data set
Task 1: Open the sample project
The sample project contains several data sets and sample modeler flows. If you don't already have
the sample project, refer to the Tutorials topic to create it. Then follow these steps to open the sample
project:
In Cloud Pak for Data, from the Navigation menu, choose
Projects > View all Projects.
Click SPSS Modeler Project.
Click the Assets tab to see the data sets and modeler flows.
Check your progress
The following image shows the project Assets tab. You are now ready to work with the sample
modeler flow associated with this tutorial.
Task 2: Examine the Data Asset and Type nodes
The Automated Data Preparation flow includes several nodes. Follow these steps to examine the
Data Asset and Type nodes:
From the Assets tab, open the Automated Data Preparation modeler
flow, and wait for the canvas to load.
Double-click the telco.csv node. This node is a Data Asset node that points to the
telco.csv file in the project.
Review the File format properties.
Optional: Click Preview data to see the full data set.
Double-click the Type node. Notice that the measure for the churn field
is set to Flag, and the role is set to Target.
Make sure that the role for all other fields is set to Input. (A pandas sketch of this target and input split follows this task.)
Figure 3. Set the measurement level and role
Optional: Click Preview data to see the data set with the Type properties
applied.
Check your progress
The following image shows the Type node. You are now ready to build the model.
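If you prefer to think of the Type node's role assignments in code, the following pandas sketch shows the equivalent split into a target and inputs. It assumes the flag field in telco.csv is named churn; adjust the name to match your data.

```python
# A rough pandas analogue of the Type node's role assignments,
# assuming the target column in telco.csv is named "churn".
import pandas as pd

df = pd.read_csv("telco.csv")

# Target (role = Target, measure = Flag): a two-valued field to predict.
y = df["churn"].astype("category")

# Inputs (role = Input): every remaining field.
X = df.drop(columns=["churn"])

print(y.value_counts())  # a flag field has exactly two categories
print(X.shape)
```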
Task 3: Build the models
You will build two models: one without automated data preparation and one with it.
Follow these steps to build the models:
Double-click the No ADP - churn node that is connected to the Type node to see its properties.
Expand the Model Settings section.
Verify that the Procedure is set to Binomial.
Verify that the Model Name is set to Custom, and
the name is No ADP - churn.
Figure 4. Logistic node Model Settings section
Hover over the No ADP - churn node, and click the Run icon.
In the Outputs and models pane, click the model with the name No ADP - churn to
view the results.
View the Model summary page, which shows the predictor fields that are used by the model and
the percentage of the predictions that are correct. (A code sketch of this accuracy statistic follows this task.)
View the Case Processing Summary, which shows the number and percentage of records that
are included in the analysis. In addition, it lists the number of missing cases (if any) where one
or more of the input fields are unavailable and any cases that were not selected.
Close the model details.
Double-click the Auto Data Prep node that is connected to the Type node to see its
properties. Automated Data Preparation handles the data preparation task for you, analyzing your
data and identifying fixes, screening out fields that are problematic or not likely to be useful,
deriving new attributes when appropriate, and improving performance through intelligent screening techniques.
In the Objectives section, leave the default settings in place to analyze
and prepare your data by balancing both speed and accuracy. Other Auto Data Prep node
properties provide the option to specify that you want to concentrate more on accuracy, more on the
speed of processing, or to fine-tune many of the processing steps for data preparation.
Note: If you want to adjust the node properties and run the flow again later, you must
first click Clear old analysis under Objectives because the model already
exists.
Optional: Click Preview data to see the data set with the Auto Data
Prep properties that are applied.
Click Cancel.
Double-click the After ADP - churn node that is connected to the Auto Data Prep
node to see its properties.
Expand the Model Settings section.
Verify that the Procedure is set to Binomial.
Verify that the Model Name is set to Custom, and
the name is After ADP - churn.
Hover over the After ADP - churn node, and click the Run icon.
In the Outputs and models pane, click the model with the name After ADP - churn to
view the results.
View the Model summary page, which shows predictor fields that are used by the model and
the percentage of the predictions that are correct.
View the Case Processing Summary, which shows the number and percentage of records that
are included in the analysis. In addition, it lists the number of missing cases (if any) where one
or more of the input fields are unavailable and any cases that were not selected.
Close the model details.
Check your progress
The following image shows model details. You are now ready to compare the models.
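Both Logistic nodes fit a binomial (two-outcome) logistic regression, and the Model summary reports the percentage of predictions that are correct on the training data. The following scikit-learn sketch shows the same idea outside of SPSS Modeler. It is not the Logistic node's implementation; the churn column name and the numeric-only predictor selection are simplifying assumptions.

```python
# Illustrative sketch of a binomial logistic regression and its
# "percentage correct" statistic, using scikit-learn rather than
# the Logistic node itself. Column names are assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("telco.csv")
y = df["churn"]
# Keep it simple: use only numeric predictors for this sketch.
X = df.drop(columns=["churn"]).select_dtypes("number").fillna(0)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Training accuracy, comparable to "percentage correct" in the Model summary.
pred = model.predict(X)
print(f"Percentage correct: {accuracy_score(y, pred):.1%}")
```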
Task 4: Compare the models
Now that both models are configured, follow these steps to generate and compare the models:
Hover over the No ADP - LogReg (Analysis) node, and click the Run icon.
Hover over the After ADP - LogReg (Analysis) node, and click the Run icon.
In the Outputs and models pane, click the output results with the name No ADP -
LogReg to view the results.
Compare the models:
Click Compare.
In the Select output field, select After ADP - LogReg.
The analysis of the model that was built without Auto Data Prep shows that running the data
through the Logistic Regression node with its default settings gives a model with low
accuracy: just 10.6% of predictions are correct.
Figure 5. Non ADP-derived model results
The analysis of the Auto Data Prep-derived model shows that by running the data
through the default Auto Data Prep settings, you built a much more accurate model that's
78.3% correct.
Figure 6. ADP-derived model results
By running the Auto Data Prep node to fine-tune the processing of your data, you were able
to build a more accurate model with little direct data manipulation.
Obviously, if you're interested in proving or disproving a certain theory, or want to build
specific models, you might find it beneficial to work directly with the model settings. However, if
you have limited time or a large amount of data to prepare, the Auto Data Prep node may give
you an advantage.
The results in this example are based on the training data only. To assess how well models
generalize to other data in the real world, you can use a Partition node to hold out a subset of
records for purposes of testing and validation.
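The following scikit-learn sketch shows that holdout idea in code: split the records into training and testing partitions, fit on the training partition, and report accuracy on the held-out partition. It is a rough analogue of a Partition node, not its implementation, and the churn column name is an assumption.

```python
# A rough code analogue of using a Partition node: hold out a subset of
# records so you can check how well the model generalizes.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("telco.csv")
y = df["churn"]
X = df.drop(columns=["churn"]).select_dtypes("number").fillna(0)

# Roughly equivalent to a 70% training / 30% testing partition.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Training accuracy: {model.score(X_train, y_train):.1%}")
print(f"Holdout accuracy:  {model.score(X_test, y_test):.1%}")
```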