Tutorial: Build a predictive analytic model to determine whether a person has chronic kidney disease by using IBM Watson Machine Learning and the Flow Editor
You will use classification transformers with publicly available data about metabolic diseases from the University of California, Irvine to determine whether someone has chronic kidney disease or not.
Prerequisites: Ensure that you have at least one Spark service instance and at least one project available. Some familiarity with Python and Apache Spark machine learning concepts, such as transformers, features, and labels, is recommended. For more detailed information about setting up your machine learning environment, see Setting up your machine learning environment.
This tutorial has the following main parts:
- Add research data to your Watson Studio project
- Load the data
- Transform and train the data
- Run and deploy the model
Watch this video to see how to build this data flow, then follow the tutorial steps in your own environment.
This tutorial contains steps to get data from an external source by using the Watson Studio machine learning flow creation tool, the Flow Editor. You can then explore the data before you deploy the model.
Add research data to your Watson Studio project
The original source for this data* is the University of California, Irvine (UCI), and the data is available at the UCI Machine Learning Repository.
You can download a CSV version of the data from the Watson Studio community. Be sure to log in to Watson Studio so you can download the file!
The data set is the result of an extensive study based on hospital admissions over a period of time. Because of the types of tests that were performed as part of hospital admissions, the following items are available as part of this analysis:
- serum creatinine (sc): A serum creatinine test measures the level of creatinine in your blood and provides an estimate of how well your kidneys filter.
- age (age): The age of the test subject.
- diabetes mellitus (dm): A group of metabolic diseases in which there are high blood sugar levels over a prolonged period.
These items are highly important when predicting chronic kidney disease. Some other fields, such as blood pressure, are also relatively important but were arbitrarily omitted from the features list.
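As a rough illustration of what restricting the analysis to these three fields means, the following Python sketch reads CSV text and keeps only the sc, age, and dm columns plus the class label. The sample rows are made up for illustration. The column layout, the yes/no encoding of dm, and the use of ? to mark missing values are assumptions about the CSV file, not details confirmed by this tutorial.

```python
import csv
import io

# Hypothetical sample rows; NOT real records from the UCI data set.
SAMPLE_CSV = """age,bp,sc,dm,class
48,80,1.2,yes,ckd
62,70,?,no,notckd
35,80,0.9,no,notckd
"""

FEATURES = ["sc", "age", "dm"]  # fields named in the list above
LABEL = "class"                 # assumed name of the label column


def select_fields(csv_text):
    """Return rows reduced to the feature fields and the label.

    Rows with a missing feature value (assumed to be marked '?') are skipped.
    """
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        if any(record[name].strip() in ("", "?") for name in FEATURES):
            continue
        rows.append({
            "sc": float(record["sc"]),
            "age": float(record["age"]),
            "dm": record["dm"].strip() == "yes",
            "class": record[LABEL].strip(),
        })
    return rows


selected = select_fields(SAMPLE_CSV)
print(selected)
```

In the Flow Editor, the same narrowing happens when you choose the input fields for the modeling node, so no code is required to follow the tutorial.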
- Download the
- Add the file to the data assets for the project. Click add data assets, click browse, select the data file, click Open, and then click Apply.
Load the data
After you add the research data to your project, you must load the data. To do this, use the IBM Watson Studio Machine Learning user interface.
- Create a machine learning flow. From the project overview page, click Add to project, and then click Modeler flow. Type a name and description for the machine learning flow. For the runtime environment, select IBM SPSS Modeler and click Create. This opens up the Flow Editor that you will use to create a machine learning flow.
- Click the Find and Add Data icon and drag the chronic_kidney_disease_full.csv data asset from the data palette to the Flow Editor.
Tip: If you don't see the chronic_kidney_disease_full.csv file, you can click Browse, go to the location where you unzipped the archive file, select the chronic_kidney_disease_full.csv data asset, and click Open. Then, click Apply.
Transform and train the data
After you load the data, you must transform the data by using transformers. You will be creating a simple machine learning flow by dragging transformers and estimators onto the Flow Editor and connecting them to the data source. Use the following nodes from the palette:
- Partition: divides the data into training and testing segments
- Type: sets the data type. Use it to designate the class field as a target
- C5.0: a classification algorithm
- Analysis: view the model and check its accuracy
- Open the palette by clicking the Palette icon.
- From the palette, click Field Operations, and then select and drag the Partition transformer onto the Flow Editor. By default, the partition assigns half of the data to training and the other half to testing.
- Connect the data source to the Partition node by clicking the output handle and dragging to the input handle.
- From the palette, click Field Operations, and then select and drag the Type node to the Flow Editor.
- Connect the Partition node to the Type node.
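The 50/50 split that the Partition node applies by default can be pictured with a short Python sketch. This is not how SPSS Modeler implements partitioning. The partition function, the fixed seed, and the integer stand-in records are hypothetical.

```python
import random


def partition(records, training_fraction=0.5, seed=42):
    """Shuffle a copy of the records and split it into (training, testing)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * training_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(10))  # stand-in for ten data rows
training, testing = partition(records)
print(len(training), len(testing))
```

Keeping the testing half out of training is what lets the later accuracy check say something about records the model has not seen.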
Use the Type node to set the target.
- Double-click the Type node.
- Expand the Settings section.
- Click Configure Types.
- Click Add Columns.
- Select all the columns by selecting the check box before the Field name and click OK.
- Scroll to the class row and, in the Role column, select Target, and then click OK.
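Conceptually, setting a role tells the modeling node which field to predict and which fields to learn from. A minimal Python sketch of that idea follows. The roles dictionary and the split_by_role helper are hypothetical, and the input field names are assumed from the feature list earlier in this tutorial.

```python
# Hypothetical role assignment, loosely mirroring the Role column of the Type node.
roles = {"sc": "input", "age": "input", "dm": "input", "class": "target"}


def split_by_role(record, roles):
    """Separate one record into (inputs, target) according to the roles."""
    inputs = {name: record[name] for name, role in roles.items() if role == "input"}
    targets = [record[name] for name, role in roles.items() if role == "target"]
    if len(targets) != 1:
        raise ValueError("exactly one target field is expected")
    return inputs, targets[0]


# Made-up example record, not taken from the data set.
inputs, target = split_by_role({"sc": 1.2, "age": 48, "dm": True, "class": "ckd"}, roles)
print(inputs, target)
```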
From the palette, click Modeling, and then select and drag the C5.0 node to the Flow Editor.
- Connect the Type node to the C5.0 node.
- Double-click the C5.0 node and expand the Fields section.
- For the Target select the
- Click Add Columns and select the
- Click OK.
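C5.0 builds a decision tree, and tree learners in this family choose splits by using information-based measures. The Python sketch below computes the information gain of one threshold split on a numeric field. It only illustrates the idea and is not the C5.0 implementation in SPSS Modeler. The function names, the threshold, and the six example values are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting at values <= threshold vs. > threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split does not separate anything
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


# Made-up serum creatinine values and labels, for illustration only.
sc_values = [0.8, 0.9, 1.1, 2.5, 3.0, 4.2]
labels = ["notckd", "notckd", "notckd", "ckd", "ckd", "ckd"]
print(information_gain(sc_values, labels, threshold=1.5))
```

A threshold that cleanly separates the two classes yields the largest gain, which is why a strongly predictive field tends to appear near the top of the tree.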
When you complete preparation of the data, your machine learning flow should look like the following screen:
Run and deploy the model
After you create the machine learning flow, you must run and then deploy it. This is also a good time to do a check on the data and the results. To check the data, use the Analysis node.
From the Flow Editor toolbar, click the Run icon. A model nugget appears. This is your trained model.
- From the palette, click Outputs, and then drag the Analysis node to the Flow Editor.
- Connect the model node to the Analysis node.
- To view the model, right-click the model nugget, and then click View Model.
- To view the model accuracy, right-click the Analysis node and click Run. Then click the View Output and Versions icon and double-click the Analysis output. Notice that the accuracy for this particular model is quite high.
- To save the machine learning flow as a model that you can deploy, right-click the model nugget, and then click Save as Model.
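The accuracy figure reported by the Analysis node is, in essence, the share of records whose predicted class matches the actual class. A small Python sketch of that calculation follows. The accuracy and confusion_counts helpers and the example labels are hypothetical and are not output from the model in this tutorial.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of records where the prediction equals the actual class."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def confusion_counts(actual, predicted):
    """Count each (actual, predicted) pair, similar to a coincidence matrix."""
    return Counter(zip(actual, predicted))


# Made-up labels for illustration only.
actual = ["ckd", "ckd", "notckd", "notckd", "ckd"]
predicted = ["ckd", "notckd", "notckd", "notckd", "ckd"]
print(accuracy(actual, predicted))
print(confusion_counts(actual, predicted))
```

Looking at the pair counts alongside the single accuracy number shows which kind of error the model makes, for example a ckd record predicted as notckd.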
To deploy the model, see Deploy an IBM SPSS Model from Flow Editor. After you deploy the model, you should have a tool for predicting the likelihood that someone has chronic kidney disease. To use this model to create predictions, go to the project Assets tab, find the Models section, and click the model. On the Predictions tab, you can test and use your model by entering ad hoc data in the input fields and then clicking Predict.
What does this tell us? Although this example is basic, it suggests that older people tend to have a higher probability of chronic kidney disease than younger people, controlling for other factors. It also shows the importance of serum creatinine in diagnosing kidney disease.
Summary and next steps
You successfully completed this machine learning tutorial! You learned how to use the Flow Editor to create a machine learning flow to predict the likelihood that someone has chronic kidney disease.
Check out our content pages for more samples, tutorials, documentation, how-tos, and blog posts.
*Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. For complete licensing information, see the Citation Policy page.