This tutorial uses the Feature Selection node to help you identify the fields that
are most important in predicting a certain outcome. From a set of hundreds or even thousands of
predictors, the Feature Selection node screens, ranks, and selects the predictors that might
be most important. Ultimately, you might end up with a more efficient model: one that uses
fewer predictors, runs more quickly, and might be easier to understand.
Preview the tutorial
Watch this video to preview the steps in this tutorial. There might
be slight differences in the user interface that is shown in the video. The video is intended to be
a companion to the written tutorial. This video provides a visual method to learn the concepts and
tasks in this documentation.
This tutorial uses the Screening Predictors flow in the sample project. The data file used
is customer_dbase.csv. The following image shows the sample modeler flow.
Figure 1. Sample modeler flow
This example focuses on only one of the offers as a target. It uses the CHAID
tree-building node to develop a model to describe which customers are most likely to respond to the
promotion. It contrasts two approaches:
Without feature selection. All predictor fields in the dataset are used
as inputs to the CHAID tree.
With feature selection. The Feature Selection node is used to
select the best 10 predictors. These predictors are input into the CHAID tree.
By comparing the two resulting tree models, you can see how feature selection can produce
effective results.
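Outside of SPSS Modeler, you can see the same screen-rank-select idea in a few lines of code. The sketch below is a hypothetical, simplified analogue in plain Python: it ranks synthetic predictors by absolute correlation with the target and keeps the strongest ones. The Feature Selection node itself uses statistical importance tests rather than raw correlation, so treat this only as an illustration of the concept.

```python
import math
import random

def rank_predictors(X, y, top_k=10):
    """Rank predictor columns by absolute correlation with the target
    and return the indices of the strongest top_k columns.

    This mirrors the screen-rank-select idea of the Feature Selection
    node, but the node uses statistical importance tests, not plain
    correlation."""
    n = len(y)
    mean_y = sum(y) / n
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mean_x = sum(col) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(col, y))
        var_x = sum((a - mean_x) ** 2 for a in col)
        var_y = sum((b - mean_y) ** 2 for b in y)
        r = cov / math.sqrt(var_x * var_y) if var_x and var_y else 0.0
        scores.append((abs(r), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:top_k]]

# Hypothetical synthetic data: 100 rows, 5 predictors; column 0 drives the target.
random.seed(0)
X = [[random.random() for _ in range(5)] for _ in range(100)]
y = [1 if row[0] > 0.5 else 0 for row in X]
top = rank_predictors(X, y, top_k=2)
print(top)  # column 0 should rank first
```

In a real flow, the node's importance ranking plays the role of `rank_predictors`, and only the selected fields are passed downstream to the modeling node.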
The following image shows the sample data set.
Figure 2. Sample data set
Task 1: Open the sample project
The sample project contains several data sets and sample modeler flows. If you don't already have
the sample project, then refer to the Tutorials topic to create the sample project. Then follow these steps to open the sample
project:
In Cloud Pak for Data, from the Navigation menu, choose
Projects > View all Projects.
Click SPSS Modeler Project.
Click the Assets tab to see the data sets and modeler flows.
Check your progress
The following image shows the project Assets tab. You are now ready to work with the sample
modeler flow associated with this tutorial.
Double-click the response_01 (Feature Selection) node to see its properties.
Expand the Build Options section to see the defined rules and criteria that are used for
screening or disqualifying fields.
Figure 4. Feature Selection Build Options
Hover over the response_01 (Feature Selection) node, and click the Run icon.
In the Outputs and models pane, click the model with the name response_01 to view
the model. The results show the fields that are found to be useful in the prediction, ranked by
importance. By examining these fields, you can decide which ones to use in subsequent modeling
sessions.
To compare results with and without feature selection, the flow uses two CHAID modeling
nodes: one that uses feature selection and one that doesn't.
Double-click the With All Fields (CHAID) node to see its properties.
Under Objectives, verify that Build new model and
Create a standard model are selected.
Expand the Basic section, and verify that Maximum Tree Depth is set to
Custom and the number of levels is set to 5.
Click Save.
Double-click the Using Top 10 Fields (CHAID) node to see its properties.
Verify that the properties match those of the With All Fields (CHAID) node.
Click Save.
Check your progress
The following image shows the Modeling node. You are now ready to run the flow and view
the results.
Follow these steps to run the flow and view the results of the two models with and without
feature selection:
Click Run all. As the flow runs, notice how long it takes each model to finish building.
In the Outputs and models pane, click the model with the name With All fields to
view the results.
Click the Tree Diagram page.
Zoom out to see the scope of the tree diagram.
Close the model details window.
In the Outputs and models pane, click the model with the name Using Top 10
fields to view the results.
Click the Tree Diagram page.
Zoom out to see the scope of the tree diagram.
It might be hard to tell, but the second model ran faster than the first one. Because this
dataset is relatively small, the difference in run times is probably only a few seconds; but for
larger real-world datasets, the difference might be noticeable: minutes or even hours. Using feature
selection might speed up your processing times dramatically.
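To see why fewer predictors build faster, consider that a tree algorithm scores candidate splits on every input field at every node, so its cost grows with the number of predictors. The sketch below is a hypothetical pure-Python illustration (not CHAID): it times an exhaustive single-split search over 200 synthetic fields versus only the first 10 of those fields, standing in for the "top 10" selected predictors.

```python
import random
import time

def best_stump(X, y):
    """Exhaustively score a single-split 'stump' on every feature,
    as a tree algorithm does at each node. The work grows linearly
    with the number of predictor columns."""
    best = (-1.0, None)
    n = len(y)
    for j in range(len(X[0])):
        for t in (0.25, 0.5, 0.75):          # a few candidate thresholds
            pred = [1 if row[j] > t else 0 for row in X]
            acc = sum(p == v for p, v in zip(pred, y)) / n
            if acc > best[0]:
                best = (acc, (j, t))
    return best

random.seed(1)
# Hypothetical data: 500 rows with 200 predictor fields; field 0 drives the target.
X = [[random.random() for _ in range(200)] for _ in range(500)]
y = [1 if row[0] > 0.5 else 0 for row in X]

t0 = time.perf_counter(); best_stump(X, y); t_all = time.perf_counter() - t0
X_top = [row[:10] for row in X]              # stand-in for the "top 10" fields
t0 = time.perf_counter(); best_stump(X_top, y); t_top = time.perf_counter() - t0
print(f"200 fields: {t_all:.3f}s   10 fields: {t_top:.3f}s")
```

The 10-field search does roughly one-twentieth of the work, which is the same effect you observe when the CHAID node builds from the selected fields only.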
You might instead use a
tree-building algorithm to do the feature selection work, allowing the tree to identify the most
important predictors for you. In fact, the CHAID algorithm is often used for this purpose, and it's
even possible to grow the tree level-by-level to control its depth and complexity. However, the
Feature Selection node is faster and easier to use. It ranks all predictors in one fast step,
helping you quickly identify the most important fields.
Check your progress
The following image shows the tree diagram from the model.
The second tree also contains fewer tree nodes than the first, which makes it easier to
comprehend. Using fewer predictors is also less expensive: you have less data to collect, process,
and feed into your models, and computing time is improved. In this example, even with the extra
feature selection step, model building was faster with the smaller set of predictors. With a larger
real-world dataset, the time savings might be greatly amplified.
Using fewer predictors results in simpler scoring. For example, you might identify only four
profiles of customers who are likely to respond to the promotion. With larger numbers of predictors,
you run the risk of overfitting your model. The simpler model might generalize better to other
datasets (although you need to test this approach to be sure).
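The overfitting risk can be illustrated with a small, hypothetical experiment (plain Python, not CHAID): a nearest-neighbor classifier trained on one informative field plus 50 noise fields generalizes worse to held-out data than the same classifier trained on the informative field alone, because the noise fields dominate the distance calculation.

```python
import random

def knn_accuracy(train_X, train_y, test_X, test_y):
    """Classify each test row with its single nearest training row
    (squared Euclidean distance) and return the test accuracy."""
    correct = 0
    for q, label in zip(test_X, test_y):
        dists = [sum((a - b) ** 2 for a, b in zip(q, row)) for row in train_X]
        nearest = dists.index(min(dists))
        correct += train_y[nearest] == label
    return correct / len(test_y)

random.seed(2)
# Hypothetical data: field 0 fully determines the target; the other 50 are pure noise.
def make_row():
    return [random.random() for _ in range(51)]

train_X = [make_row() for _ in range(100)]
test_X = [make_row() for _ in range(100)]
train_y = [1 if r[0] > 0.5 else 0 for r in train_X]
test_y = [1 if r[0] > 0.5 else 0 for r in test_X]

acc_all = knn_accuracy(train_X, train_y, test_X, test_y)       # all 51 fields
acc_few = knn_accuracy([[r[0]] for r in train_X], train_y,
                       [[r[0]] for r in test_X], test_y)       # selected field only
print(f"all fields: {acc_all:.2f}   selected field only: {acc_few:.2f}")
```

The reduced model scores markedly higher on the held-out rows, which is the generalization benefit the paragraph above describes; as with any such claim, you would verify it on your own data.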