Quick start: Build a model using SPSS Modeler
You can create, train, and deploy models using SPSS Modeler. Read about SPSS Modeler, then watch a video and follow a tutorial that’s suitable for beginners and requires no coding.
Required service: Watson Studio (which includes SPSS Modeler)
Your basic workflow includes these tasks:
- Create a project. Projects are where you can collaborate with others to work with data.
- Add an SPSS Modeler flow to the project.
- Configure the nodes on the canvas, and run the flow.
- Review the model details and save the model.
- Deploy and test your model.
Read about SPSS Modeler
With SPSS Modeler flows, you can quickly develop predictive models using business expertise and deploy them into business operations to improve decision making. Designed around the long-established SPSS Modeler client software and the industry-standard CRISP-DM model it uses, the flows interface supports the entire data mining process, from data to better business results.
SPSS Modeler offers a variety of modeling methods taken from machine learning, artificial intelligence, and statistics. The methods available on the node palette allow you to derive new information from your data and to develop predictive models. Each method has certain strengths and is best suited for particular types of problems.
Watch a video about creating a model using SPSS Modeler
Watch this video to see how to create and run an SPSS Modeler flow to train a machine learning model.
This video provides a visual method as an alternative to following the written steps in this documentation.
Try a tutorial to create a model using SPSS Modeler
In this tutorial, you will complete these tasks:
- Create a project.
- Add a data set to your project.
- Create the SPSS Modeler flow.
- Add the nodes to the SPSS Modeler flow.
- Run the SPSS Modeler flow and explore the model details.
- Evaluate the model.
- Deploy and test the model with new data.
This tutorial will take approximately 30 minutes to complete.
The data set used in this tutorial is from the University of California, Irvine, and is the result of an extensive study based on hospital admissions over a period of time. The model will use three important factors to help predict chronic kidney disease.
Task 1: Create a project
You need a project to store the SPSS Modeler flow.
- If you have an existing project, open it. If you don't have an existing project, click Create a project on the home page or click New project on your Projects page.
- Select Create an empty project.
- On the Create a project screen, add a name and optional description for the project.
- Choose an existing object storage service instance or create a new one.
- Click Create.
For more information or to watch a video, see Creating a project.
Task 2: Add the data set to your project
The data set used in this tutorial is available in the Gallery.
- Access the UCI ML Repository: Chronic Kidney Disease Data Set in the Gallery.
- Click Preview. Three important factors that help predict chronic kidney disease are available as part of this analysis: the age of the test subject, the serum creatinine test results, and the diabetes test results. The class value indicates whether the patient was previously diagnosed with kidney disease.
- Click Add to project.
- Select the project from the list, and click Add.
- Click View Project.
- From your project's Assets page, locate the UCI ML Repository Chronic Kidney Disease Data Set.csv file.
Task 3: Create the SPSS Modeler flow
Now add the SPSS Modeler flow to the project.
- Click Add to project, and select Modeler flow.
- Type a name and description for the flow.
- For the runtime definition, accept the Default SPSS Modeler S definition.
- Click Create. This opens up the Flow Editor that you'll use to create the flow.
Task 4: Add the nodes to the SPSS Modeler flow
After you load the data, you must transform the data. You'll be creating a simple flow by dragging transformers and estimators onto the canvas and connecting them to the data source. Use the following nodes from the palette:
- Data Asset: loads the csv file from the project
- Partition: divides the data into training and testing segments
- Type: sets the data type. Use it to designate the class field as the target
- C5.0: a classification algorithm
- Analysis: view the model and check its accuracy
- Table: preview the data with predictions
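To make the role of the Partition node concrete, the following is a minimal Python sketch of dividing records into training and testing segments. It assumes an even 50/50 split and a fixed random seed; the function name `partition` and the sample record IDs are hypothetical, and this is not how SPSS Modeler implements the node.

```python
import random


def partition(records, train_fraction=0.5, seed=42):
    """Randomly split records into (training, testing) lists."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(records)       # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical record IDs standing in for rows of the CSV file.
rows = list(range(10))
train, test = partition(rows)
print(len(train), len(test))
```

Every record lands in exactly one segment, so the model can be trained on one half and checked against data it has not seen.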
- From the Import section, drag the Data Asset node onto the canvas.
- Double-click the Data Asset node to select the data set.
- Select UCI ML Repository Chronic Kidney Disease Data Set.csv.
- Click Select.
- View the Data Asset properties.
- Click Save.
- From the Field Operations section, drag the Partition node onto the canvas.
- Connect the Data Asset node to the Partition node.
- Double-click the Partition node to view its properties. By default, the node assigns half of the data to training and the other half to testing.
- Click Save.
- From the Field Operations section, drag the Type node onto the canvas.
- Connect the Partition node to the Type node.
- Double-click the Type node to view its properties. The Type node specifies the measurement level for each field. This source data file uses these measurement levels: Continuous, Categorical, Nominal, Ordinal, and Flag.
- Search for the class field. For each field, the role indicates the part that the field plays in modeling. Change the Role for class to Target, which is the field you want to predict.
- Click Save.
- From the Modeling section, drag the C5.0 node onto the canvas.
- Connect the Type node to the C5.0 node.
- Double-click the C5.0 node to view its properties. By default, the C5.0 algorithm builds a decision tree. A C5.0 model works by splitting the sample based on the field that provides the maximum information gain. Each subsample defined by the first split is then split again, usually based on a different field, and the process repeats until the subsamples can't be split any further. Finally, the lowest-level splits are reexamined, and those that don't contribute significantly to the value of the model are removed.
- Check Use custom field roles.
- For Target, select class.
- In the Inputs section, click Add columns.
- Select age, sc, dm.
- Click OK.
- Click Save.
When you're done creating the flow, it should look like the following image.
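The C5.0 step in this task mentions splitting on the field that provides the maximum information gain. As a rough illustration, the sketch below computes the textbook entropy-based information gain for a candidate split. The function names and the `ckd` and `notckd` labels are hypothetical, and the code is not the C5.0 implementation that SPSS Modeler uses.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(labels, groups):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder


# Hypothetical labels: one split separates the classes, the other does not.
parent = ["ckd", "ckd", "notckd", "notckd"]
perfect = [["ckd", "ckd"], ["notckd", "notckd"]]
useless = [["ckd", "notckd"], ["ckd", "notckd"]]
print(information_gain(parent, perfect))  # 1.0
print(information_gain(parent, useless))  # 0.0
```

A split that yields purer subsamples has a higher gain, which is why a tree builder of this kind would prefer it.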
Task 5: Run the SPSS Modeler flow and explore the model details
Now that you have designed the flow, you can run the flow and examine the tree diagram to see the decision points.
- Right-click the C5.0 node and select Run. Running the flow generates a new model nugget on the canvas.
- Right-click the model nugget and select View Model to view the model details.
- View the Model Information, which provides a model summary.
- Click Top Decision Rules. A table displays a series of rules that were used to assign individual records to child nodes based on the values of different input fields.
- Click Feature Importance. A chart shows the relative importance of each predictor in estimating the model. From this, you can see that serum creatinine is easily the most significant factor, with diabetes being the next most significant factor.
- Click Tree Diagram. The same model is displayed in the form of a tree, with a node at each decision point.
- Select the Display labels on branches option.
- Hover over Node 0, which provides a summary for all the records in the data set. Just under 40% of the cases in the data set are classified as not diagnosed with kidney disease. The tree can provide additional clues as to what factors might be responsible.
- Notice the two branches stemming from Node 0, which indicate a split by serum creatinine.
- Hover over Node 6, which shows records where the serum creatinine is greater than 1.25. In this case, 100% of those patients have a positive kidney disease diagnosis.
- Hover over Node 1, which shows records where the serum creatinine is less than or equal to 1.25. Almost 80% of those patients don't have a positive kidney disease diagnosis, but almost 20% with lower serum creatinine were still diagnosed with kidney disease.
- The branch from Node 1 is split by diabetes. Hover over Node 2, which shows patients with low serum creatinine and diagnosed diabetes. 100% of these patients were also diagnosed with kidney disease.
- Hover over Node 3. For patients with low serum creatinine and no diabetes, over 85% were not diagnosed with kidney disease, but 15% of them were still diagnosed with kidney disease.
- The branch from Node 3 is split by the last significant factor, age. Hover over Node 4 to see that 75% of patients aged 16 or younger with low serum creatinine and no diabetes were at risk of getting kidney disease.
- Hover over Node 5. Only 11% of patients over 16 years old with low serum creatinine and no diabetes were at risk of getting kidney disease.
- Close the model details.
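The decision points described in this task can be read as a chain of if-then rules. The sketch below restates them in Python, using the thresholds and percentages quoted above. The function name, the `ckd` and `notckd` labels, and the `"yes"` encoding for diabetes are assumptions made for illustration, and the age cutoff for Node 4 is inferred from the description of Node 5.

```python
def predict_kidney_disease(age, sc, dm):
    """Return (predicted class, share of records in the leaf with that class)."""
    if sc > 1.25:            # Node 6: high serum creatinine
        return "ckd", 1.00
    if dm == "yes":          # Node 2: low serum creatinine, diabetes
        return "ckd", 1.00
    if age <= 16:            # Node 4: cutoff inferred from Node 5
        return "ckd", 0.75
    return "notckd", 0.89    # Node 5: 11% of these patients were at risk


print(predict_kidney_disease(age=62, sc=1.8, dm="yes"))  # ('ckd', 1.0)
```

Writing the tree this way shows why serum creatinine dominates: it is tested first, and a high value settles the prediction without consulting the other fields.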
Task 6: Evaluate the model
Use the Analysis and Table nodes to evaluate the model.
- From the Outputs section, drag the Analysis node onto the canvas.
- Connect the Model nugget to the Analysis node.
- Right-click the Analysis node, and select Run.
- From the Outputs panel, open the Analysis, which shows that the model correctly predicted a kidney disease diagnosis almost 95% of the time. Close the Analysis.
- (Optional) On the toolbar, click the Download icon to save the model as an .str file.
- Right-click the Analysis node, and select Save branch as a model.
- For the Model name, type Kidney Disease Analysis.
- Click Save.
- From the Outputs section, drag the Table node onto the canvas.
- Connect the Model nugget to the Table node.
- Right-click the Table node, and select Preview.
- When the Preview displays, scroll to the last two columns. The $C-Class column contains the prediction of kidney disease, and the $CC-Class column indicates the confidence score for that prediction.
- Close the Preview.
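The accuracy figure that the Analysis node reports is, in essence, the share of records where the predicted class matches the actual class. The sketch below shows that calculation on a handful of hypothetical values standing in for the class and $C-Class columns. The function name and the sample data are illustrative and are not taken from the tutorial's data set.

```python
def accuracy(actual, predicted):
    """Fraction of records whose predicted class matches the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one record is required")
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


# Hypothetical values standing in for the class and $C-Class columns.
actual = ["ckd", "ckd", "notckd", "notckd", "ckd"]
predicted = ["ckd", "ckd", "notckd", "ckd", "ckd"]
print(f"{accuracy(actual, predicted):.0%}")  # 80%
```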
Task 7: Deploy and test the model with new data
Lastly, you can deploy this model and predict the outcome with new data.
- Return to the project's Assets tab.
- Scroll to the Models section, and open the Kidney Disease Analysis model.
- Click Promote to deployment space.
- Choose an existing deployment space. If you don't have a deployment space, you can create a new one:
- Provide a space name.
- Select a storage service.
- Select a machine learning service.
- Click Create.
- Click Close.
- Select Go to the model in the space after promoting it.
- Click Promote.
- When the model displays inside the deployment space, click New deployment.
- Select Online as the Deployment type.
- Specify a name for the deployment.
- Click Create.
- Go to the Deployments tab and wait for the model to be deployed.
- When the deployment is complete, click the deployment name to view the deployment details page.
- Go to the Test tab. You can test the deployed model from the deployment details page in two ways: test with a form or test with JSON code.
- Click the icon to Provide input data as JSON, then copy the following test data and paste it in the area for the JSON text:
- Click Predict to predict whether a 62-year-old with diabetes and a serum creatinine ratio of 1.8 would likely be diagnosed with kidney disease. The resulting prediction indicates that this patient has a high probability of a kidney disease diagnosis.
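If you prefer to assemble JSON test input in code, the following Python sketch builds a request body for the patient described above. It assumes the common Watson Machine Learning layout of `input_data` with `fields` and `values`, restricts the fields to the three model inputs (age, sc, dm), and encodes diabetes as `"yes"`. All of these are assumptions: your deployment might expect every column of the data set and different value encodings, and the helper name `build_payload` is hypothetical.

```python
import json


def build_payload(fields, rows):
    """Build a scoring request body with one list of values per record."""
    for row in rows:
        if len(row) != len(fields):
            raise ValueError("each row needs one value per field")
    return {
        "input_data": [
            {"fields": list(fields), "values": [list(row) for row in rows]}
        ]
    }


# Assumed inputs: a 62-year-old with diabetes and serum creatinine of 1.8.
payload = build_payload(["age", "sc", "dm"], [[62, 1.8, "yes"]])
print(json.dumps(payload, indent=2))
```

Keeping the field names and values in separate lists makes it straightforward to score several patients in one request by appending more rows.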
Now you can use this data set for further analysis. For example, you can perform tasks such as:
- Find more SPSS Modeler tutorials
- Try these other methods to build models:
- View videos about machine learning
- Find sample data sets and notebooks to gain hands-on experience building models in the Gallery
- Contribute to the SPSS Modeler community