Tutorial: Build a predictive analytic model to determine whether a person has chronic kidney disease by using IBM Watson Machine Learning and the Flow Editor with Spark MLlib nodes
You will use classification transformers with publicly available data about metabolic diseases from the University of California, Irvine to determine whether someone has chronic kidney disease or not.
Prerequisites: Ensure that you have at least one Spark service instance and at least one project available. Some familiarity with Python and with Apache Spark machine learning concepts, such as transformers, features, and labels, is recommended. For more detailed information, see Setting up your machine learning environment.
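This tutorial has the following main parts:
- Add research data to your Watson Studio project
- Load the data
- Transform and train the data
- Add a classification algorithm
- Run and deploy the model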
This machine learning tutorial in IBM Watson Studio contains steps to get data from an external source by using the Watson Studio flow creation tool, the Flow Editor. You can then explore the data and the results before you deploy the model.
Add research data to your Watson Studio project
The original source for this data* is from the University of California, Irvine (UCI) and is available at the UCI Machine Learning Repository.
You can download a CSV version of the data from the Watson Studio community. Be sure to log in to Watson Studio so you can download the file!
The data set is the result of an extensive study based on hospital admissions over a period of time. Because of the types of tests that were performed as part of hospital admissions, the following items are available as part of this analysis:
- serum creatinine (sc): A serum creatinine test measures the level of creatinine in your blood and provides an estimate of how well your kidneys filter.
- age (age): The age of the test subject.
- diabetes mellitus (dm): A group of metabolic diseases in which there are high blood sugar levels over a prolonged period.
These items are important predictors of chronic kidney disease. Some other fields, such as blood pressure, are also relatively important, but were arbitrarily omitted from the features list.
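As a rough illustration of what "features" and "label" mean for this data set, the following sketch reads CSV text and keeps only the three features and the target column. It is plain Python, not part of the Flow Editor. The sample rows and their values are invented for illustration, and the column names (sc, age, dm, class) and the label values are assumptions based on the field abbreviations used in this tutorial.

```python
import csv
import io

# Feature and target column names as abbreviated in this tutorial (assumed
# to match the CSV header).
FEATURES = ["sc", "age", "dm"]
TARGET = "class"

# Invented sample rows, for illustration only. These are not real records.
SAMPLE_CSV = """age,bp,sc,dm,class
48,80,1.2,yes,ckd
35,70,0.9,no,notckd
"""

def load_features(text):
    """Parse CSV text and keep only the feature columns and the target."""
    reader = csv.DictReader(io.StringIO(text))
    wanted = FEATURES + [TARGET]
    return [{name: row[name] for name in wanted} for row in reader]

records = load_features(SAMPLE_CSV)
for record in records:
    print(record)
```

Note that csv.DictReader returns every value as a string, so numeric fields such as age and sc would need to be converted before any numeric comparison.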
- Download the
- Add the file to the data assets for the project. Click add data assets, click browse, select the data file, click Open, and then click Apply.
Load the data
After you add the research data to your project, you must add the data to your Flow Editor. To do this, use the IBM Watson Studio Machine Learning user interface.
- Open a project in IBM Watson Studio.
- Create a machine learning flow. From the project overview page, click Add to project, and then click SPSS Modeler flow. Type a name and description for the machine learning flow.
- For the runtime environment, select Spark ML 2.0.
- Choose a Machine Learning and Spark instance, and then click Create. This opens the Flow Editor that you use to create a machine learning flow.
- Click the Find and Add Data icon and drag the chronic_kidney_disease_full.csv data asset from the data palette to the Flow Editor.
Tip: If you don't see the chronic_kidney_disease_full.csv file, you must return to the Project Assets tab to add the CSV file to your data assets.
Transform and train the data
After you load the data, you must transform the data by using transformers. You will be creating a simple machine learning flow by dragging transformers and estimators onto the Flow Editor and connecting them to the data source.
Open the palette by clicking the palette icon.
Use the following nodes from the palette:
- Filter Rows: create logic to limit rows for processing to the high-risk group over the age of 40 years. Enter a conditional statement, such as age > 40, to limit the analysis.
- Select Columns: remove columns that are not needed for the analysis or that are missing data points.
- Decision Tree Classifier: add a classification algorithm to segment patients into groups that reflect their likelihood of having chronic kidney disease.
- Add a Filter Rows node.
- Click the Node palette icon.
- From the node palette, click the Transformations tab and drag the Filter Rows node to the Flow Editor.
- Connect the data source node to the Filter Rows node by hovering near the data source node, clicking the highlighted area and dragging a connector to the Filter Rows node.
- Add a condition to the Filter Rows node to limit the data to the high-risk group of people over the age of 40 years.
- Double-click the Filter Rows node and expand the Settings section.
- In the Condition box, type age > 40 and click OK.
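To make the effect of the condition concrete, here is a minimal plain-Python sketch of a row filter. It is not the Flow Editor's implementation. The patient rows are invented, and dropping rows with a missing age is an assumption made for this sketch, not documented node behavior.

```python
def filter_rows(rows, condition):
    """Keep only the rows for which the condition holds."""
    return [row for row in rows if condition(row)]

def over_40(row):
    """Mirror the condition 'age > 40'.

    Rows with a missing age are dropped here; that is an assumption of
    this sketch.
    """
    return row["age"] is not None and row["age"] > 40

# Invented rows, for illustration only.
patients = [
    {"age": 62, "sc": 1.8, "dm": "yes"},
    {"age": 35, "sc": 0.9, "dm": "no"},
    {"age": 40, "sc": 1.1, "dm": "no"},
    {"age": None, "sc": 1.0, "dm": "no"},
]

high_risk = filter_rows(patients, over_40)
print([row["age"] for row in high_risk])
```

Because the comparison is strict, a patient who is exactly 40 is excluded. In a Spark notebook, the DataFrame filter method accepts a similar condition.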
To clean up the data, you can remove some of the columns that are not necessary to the analysis or that have missing data points. For this, you use the Select Columns node.
- From the palette, add the Select Columns node and connect it to the Filter Rows node.
- Double-click the Select Columns node, expand the Settings section, and click Add Columns.
- Because the following fields have many missing values, remove them by selecting them in the list: sg, al, su, rbc, pc, pcc, ba, bgr, bu, sod, pot, hemo, pcv, wbcc, rbcc, htn, cad, appet, pe, and ane, and then click OK.
- In the Mode section, click the remove radio button and then click Save.
This removes these columns from the outgoing data set. You can check the result with the Preview function: for example, confirm that the ‘pot’ field is no longer present in the table.
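The remove mode can be pictured as dropping a fixed set of keys from every row. The following plain-Python sketch uses the column names listed in the step above. The example row and its values are invented, and the function name remove_columns is hypothetical.

```python
# Columns that the tutorial removes because they have many missing values.
REMOVED_COLUMNS = {
    "sg", "al", "su", "rbc", "pc", "pcc", "ba", "bgr", "bu", "sod",
    "pot", "hemo", "pcv", "wbcc", "rbcc", "htn", "cad", "appet", "pe", "ane",
}

def remove_columns(row, columns):
    """Return a copy of the row without the named columns."""
    return {name: value for name, value in row.items() if name not in columns}

# Invented row, for illustration only (only some of the columns are shown).
row = {"age": 48, "bp": 80, "sc": 1.2, "pot": None, "hemo": 15.4,
       "dm": "yes", "class": "ckd"}

cleaned = remove_columns(row, REMOVED_COLUMNS)
print(sorted(cleaned))
```

After this step, the ‘pot’ key is no longer present, which corresponds to the check with the Preview function. In a Spark notebook, the DataFrame drop method plays a similar role.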
When you finish this portion of the tutorial, your machine learning flow should look like the following image:
Add a classification algorithm
Because the Decision Tree Classifier node maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves), it is a good match for this analysis. It supports both binary and multiclass labels, as well as both continuous and categorical features.
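To show how a branch on a continuous feature can be chosen, the following sketch searches for the threshold that gives the lowest weighted Gini impurity. This is a simplified illustration of the general technique, not the Spark MLlib implementation. The serum creatinine values and labels are invented, and the function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, higher for a mixed group."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'.

    Returns (None, impurity of the unsplit data) if no split improves on it.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Invented values, for illustration only.
sc_values = [0.5, 1.0, 1.5, 3.0, 4.5, 6.0]
sc_labels = ["notckd", "notckd", "notckd", "ckd", "ckd", "ckd"]

threshold, impurity = best_threshold(sc_values, sc_labels)
print(threshold, impurity)
```

In this toy data the two groups separate cleanly, so the best split leaves both sides pure. Real data rarely separates this well, which is why a tree usually needs several levels of branches.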
- Add a Decision Tree Classifier node. From the Node Palette, click the Modelling tab and drag the Decision Tree Classifier node to the Flow Editor. Then, connect it to the Select Columns node.
Configure the node properties.
- Double-click the Decision Tree Classifier node and expand the Fields section.
- From the Target column box, select the class column, and in the Input columns list, click Add Columns.
- Select all the columns by selecting the check box before the Field name heading in the table header and click OK.
- Click Save.
Optional: You can build this machine learning flow with the default settings. To change the maximum depth of the tree, go to the Advanced Parameters tab and change the value. Use a higher value for a more complex model and a lower value for a less complex model.
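The effect of the maximum depth can be seen in a toy tree builder. The sketch below is a deliberately small, single-feature decision tree in plain Python, not the Spark MLlib algorithm. The data points and the labels "A" and "B" are invented and have no medical meaning. In Spark MLlib, the corresponding parameter of DecisionTreeClassifier is maxDepth.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def majority(labels):
    """Most frequent label in the list."""
    return Counter(labels).most_common(1)[0][0]

def build(points, max_depth):
    """Build a tree from (value, label) pairs. A leaf is simply a label."""
    labels = [label for _, label in points]
    if max_depth == 0 or len(set(labels)) == 1:
        return majority(labels)
    values = sorted({value for value, _ in points})
    best = None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [p for p in points if p[0] <= threshold]
        right = [p for p in points if p[0] > threshold]
        score = (len(left) * gini([l for _, l in left])
                 + len(right) * gini([l for _, l in right])) / len(points)
        if best is None or score < best[0]:
            best = (score, threshold, left, right)
    if best is None:  # all values are identical, so no split is possible
        return majority(labels)
    _, threshold, left, right = best
    return {"threshold": threshold,
            "left": build(left, max_depth - 1),
            "right": build(right, max_depth - 1)}

def predict(tree, value):
    """Follow the branches until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["left"] if value <= tree["threshold"] else tree["right"]
    return tree

def accuracy(tree, points):
    """Fraction of points whose label the tree predicts correctly."""
    return sum(predict(tree, v) == label for v, label in points) / len(points)

# Invented data: "B" occurs only in the middle of the range, so a single
# split cannot separate the two labels.
points = [(1, "A"), (2, "A"), (3, "B"), (4, "B"), (5, "A"), (6, "A")]

shallow = build(points, max_depth=1)
deeper = build(points, max_depth=2)
print(accuracy(shallow, points), accuracy(deeper, points))
```

With a depth of 1 the tree misclassifies some of the points, while a depth of 2 fits this toy data exactly. A deeper tree can also overfit real data, so a higher value is not automatically better.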
When you finish, your machine learning flow should look like the following image:
Run and deploy the model
After you create the machine learning flow, you must run it and then deploy the model. This is also a good time to check the data and the results.
- From the Flow Editor toolbar, click the Run icon. A model nugget appears. This is your trained model.
- To view the model, right-click the model nugget, and then click View Model.
- To save the model, right-click the model nugget node, click Save as a model, in the Model Name box type a name, and then click Save.
- To deploy this model, return to the project, from the Assets tab, in the Models section, click the model and then click Add Deployment.
When you complete this tutorial, you should have a tool for predicting the likelihood that someone has chronic kidney disease. To use this model to create predictions, go to the project Assets tab, go to the Models section, and click the model. On the Predictions tab, you can test and use your model by entering ad hoc data in the input fields and then clicking Predict.
What does this tell us? Although this example is basic, it suggests that older people tend to have a higher probability of getting chronic kidney disease than younger people, controlling for other factors. It also shows the importance of serum creatinine in diagnosing kidney disease.
Summary and next steps
You successfully completed this machine learning tutorial! You learned how to use the Flow Editor to create a machine learning flow to predict the likelihood that someone has chronic kidney disease.
Check out our content pages for more samples, tutorials, documentation, how-tos, and blog posts.
*Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. For complete licensing information, see the Citation Policy page.