This notebook explains how to use machine learning to classify tumor data. You develop your solution in three parts as follows:
An intuitive introduction to supervised learning concepts
A basic example of a machine learning model
A deep dive into model stacking and parameter tuning, both of which are used in practice to significantly improve predictive accuracy
Some guidelines for reading this notebook:
If you have no experience with machine learning: follow the entire notebook for a comprehensive walkthrough.
If you would like to read about decision trees, go to section 2.
If you would like to read about using XGBoost in practice, go to section 3.
Some familiarity with Python is recommended
1.0 Introduction to supervised learning
2.0 Basic model: decision trees
3.0 Ensemble model: gradient boosting
4.0 XGBoost: parameter tuning
1.1 What is machine learning?
1.2 Defining the task
1.3 How does an algorithm learn?
1.4 Data preview
1.5 Pre-processing
1.6 Create train and test sets
You use an algorithm (code) to create a model (mathematical function). The algorithm tries to find patterns in a sample of training data that can be used on future data. The model itself is a summary of these patterns, distilled into mathematical relationships between variables. The algorithm depends on hyperparameters that control how it looks for these patterns (that is, how it learns).
The difficult part is to find parameters that balance accuracy and precision of the model with its ability to generalize.
Note:
This balancing act is called the bias-variance tradeoff: a very complex model captures all the specific nuances of its training data and therefore fails to generalize, while a model that is too simple does not pick up enough of the pattern to make accurate and precise predictions on future data.
Your job* is to pick an efficient algorithm and suitable model for your data set and learning task, and fine-tune its parameters to produce a balanced model.
*apart from the arduous task of data pre-processing, which this notebook skims over for the sake of brevity (and sanity!)
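A quick synthetic illustration of this balancing act, using decision trees of two different depths (the data here is generated for illustration and is not the tumor data set):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# toy data: 500 observations, 20 features, binary target
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

# a very simple tree (high bias) vs. a fully grown tree (high variance)
for depth in [1, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(Xtr, ytr)
    print("max_depth=%s: train accuracy=%.2f, test accuracy=%.2f"
          % (depth, model.score(Xtr, ytr), model.score(Xte, yte)))

Typically the fully grown tree scores perfectly on its own training data but noticeably worse on the held-out data: overfitting in action.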
The goal is to use an augmented version of this data set to develop a predictive model which classifies breast tumors as malignant or benign based on measurements of the tumor cells. The augmented data set includes statistical analysis values, such as the mean, for some of the measurements. In machine learning terms, the target (what you want to predict) is the diagnosis, and the features are the measurements. The data sample contains observations of tumor cases described by these two components.
Remember that the algorithm is only using a training sample to perform step 3. If the algorithm makes an adjustment so that its predictions are exactly the targets, then it is overfitted to the training data. This means that it has lost the ability to generalize to new (unseen) observations, because it picked up a pattern that only applies to the data points it learned from.
To load the data:
1. Load the BreastCancerWisconsonDataSet.csv file into your notebook. To do that, click the Data icon on the notebook action bar. Drop the file into the box or browse to select the file. The file is loaded to your object storage and appears in the Data Assets section of the project. For more information, see Load and access data.
2. To load the BreastCancerWisconsonDataSet.csv file into a pandas DataFrame, click in the next code cell and select Code snippets > Read data. Then locate the file in your project. Finally, choose the option to download the file as a pandas DataFrame, and then click Insert code to cell.
3. Use df instead of df_data_1 in the last two lines of the Insert to Code block, and then run the cell.
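If you are not working in Watson Studio, a plain pandas read achieves the same result. The following is a stand-in for the generated Insert to Code block (the generated code reads from object storage instead of a local path):

import pandas as pd

# read the CSV into a DataFrame named df, as the rest of the notebook expects
df = pd.read_csv('BreastCancerWisconsonDataSet.csv')
df.head()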
In the following data table, each row corresponds to one observation of a tumor. The columns contain the measurement features, with one column containing the diagnosis target.
from IPython.display import display
print(df.shape)
display(df[200:210])
The data contains features extracted from 569 diagnostic images of breast tumors. The diagnosis column indicates whether the mass was benign (B) or malignant (M). The rest of the columns contain features, which are structured as follows:
10 variables describe the cell nuclei of each mass, and for each variable, the mean, standard error, and 'worst' value (the mean of the three largest measurements) are calculated. The variables are briefly described on the original data set page.
The following code cell lightly cleans the data and generates a summary that can be used to check for outliers. This data set is well prepared for analysis and does not require further manipulation.
If you don't have the scikit-learn library installed, install it:
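!pip install scikit-learn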
from sklearn.preprocessing import LabelEncoder
# check for missing values
df.isnull().any()
# drop the empty 'Unnamed: 32' column if it appears in your copy of the data
#df = df.drop(['Unnamed: 32'], axis=1)
# convert the M/B diagnosis labels to numerical values (B=0, M=1)
lenc = LabelEncoder()
df['diagnosis'] = lenc.fit_transform(df['diagnosis'])
# overview of the data sample
df.describe()
Take 80% of the total data to train the model, and reserve 20% for testing. After the model is built and tuned on the training set, you can use the testing set to observe how well the model generalizes to new 'unseen' data.
from sklearn.model_selection import train_test_split
## X are all the features (columns) that might be useful to the model
## y is the target (diagnosis column)
X, y = df.loc[:,'radius_mean':], df[['diagnosis']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
2.1 Mimicking human decision-making
2.2 Interpreting a tree graphic
2.3 How a decision tree learns
2.4 Evaluating the model
A decision tree is an algorithm that can be used for machine learning. It is analogous to the game of 20 questions: each 'question' is a splitting feature, chosen from the data based on how useful it is for identifying the target. The algorithm decides on these features by optimizing mathematical functions that quantify prediction error.
Trees are considered weak learners because their predictions are usually only slightly better than chance. The following code builds a tree based on the tumor data and displays the process graphically.
If you don't have the pydotplus library installed, install it:
!pip install pydotplus
from sklearn import tree
import pydotplus
from IPython.display import Image
# list of features we want to consider
splitting_features = [x for x in X_train.columns if x not in ['id']]
# initializing the tree model and training it
tree_model = tree.DecisionTreeClassifier()
tree_model = tree_model.fit(X_train, y_train)
# generating a graphic for the tree
dot_data = tree.export_graphviz(tree_model, out_file=None,
                                feature_names=splitting_features,
                                class_names=['Benign', 'Malignant'],
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
Each rectangle represents a node of the tree.
Applying the general learning outline (section 1.3) to a decision tree:
1. Initialize the algorithm to produce a default model, and give it the feature data. The algorithm makes its first set of predictions (close to a random guess).
The default model is the very first splitting feature - the algorithm chose 'concave points mean <= 0.0492'. The observations are divided into benign or malignant based on this decision, making up the first set of predictions!
2. The algorithm measures the error between its previous prediction and the true value of the targets.
This error is quantified by the Gini impurity value that appears in the first node of the tree (see the short calculation after this list).
3. The algorithm adjusts its model-building to make the error smaller. Each adjustment represents a part of the pattern it is learning, and is used to process future data.
The adjustment is the next splitting feature that is chosen for each subgroup that resulted from the concave points mean decision. The algorithm chose 'radius worst' for observations that had a concave points mean less than the threshold, and 'concavity worst' for observations with concave points mean greater than the threshold.
4. It continues to predict, calculate error, and adjust.
The next set of predictions is the classifications that result from the second round of decision making. Each time, a Gini value is calculated to indicate how far the classifications are from the true targets, and another set of splitting features is chosen to further refine the class groupings.
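To make the Gini value concrete, here is the calculation the tree performs at each node (the class counts below are made up for illustration):

# Gini impurity of a node: 1 minus the sum of squared class proportions
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# hypothetical node containing 300 benign and 100 malignant observations
print(gini([300, 100]))   # 0.375; a perfectly pure node scores 0.0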
Pick a metric to measure predictive accuracy or performance of your model, so that you can compare different models.
Models can differ in two ways: (1) they can be built from entirely different algorithms, or (2) they can be built from the same algorithm with different parameter values.
This case refers to (1): you want to know how the tree model performs so that it can be compared to the gradient boosting model in the next section.
The choice of evaluation metric depends on the type of learning problem. For binary classification, you can use a metric called the AUC score, which measures the probability that the model ranks a randomly chosen malignant case higher than a randomly chosen benign one. This metric is a good indication of performance here because misclassifying a malignant tumor as benign is the worst prediction scenario (compared to correct classification of benign/malignant and misclassification of benign).
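As a tiny illustration of this ranking interpretation (the labels and scores below are made up):

from sklearn.metrics import roc_auc_score

# two benign (0) and two malignant (1) cases with hypothetical model scores
print(roc_auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))   # 0.75: 3 of the 4 malignant/benign pairs are ranked correctly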
You also calculate the general accuracy of the model - the percentage of observations that have been correctly classified overall. The following code uses your simple tree model to predict on the testing set, and evaluates the accuracy and AUC score of its predictions:
from sklearn import metrics
#Predict test set:
test_predictions = tree_model.predict(X_test[splitting_features])
test_predprob = tree_model.predict_proba(X_test[splitting_features])[:,1]
#Print model report:
print("Accuracy : %.4g" % metrics.accuracy_score(y_test['diagnosis'].values, test_predictions))
print("AUC Score: %f" % metrics.roc_auc_score(y_test['diagnosis'], test_predprob))
The model classifies ~94% of cases correctly and achieves an AUC score of ~0.94. Each time the code is run, the algorithm may learn slightly differently (that is, select different splitting features), which results in slight variations in the accuracy and AUC scores. In the following section, you work on improving the model by stacking together many trees, and tuning parameters for better generalization ability.
Stacking, or meta-ensembling, is a method of joining multiple predictive models so that the strengths of each can cover the weaknesses of the others. Gradient boosting is one technique for combining weak learners into an ensemble: models are added one at a time, and each new model corrects the errors of the ensemble built so far, achieving better predictions with each additional model.
3.1 An intuitive explanation
3.2 XGBoost
3.3 K-fold cross validation
1. Build a base model (a single decision tree) and make an initial set of predictions.
2. Calculate the error between the predictions and the true targets. These per-observation errors are called pseudo-residuals.
3. Fit a new tree that predicts the pseudo-residuals.
4. Add the new tree's 'predicted error' to the base prediction to produce an updated prediction.
Note: target - prediction = error, so by adding 'predicted error' to the base prediction, you are getting closer to the target.
5. Another set of pseudo-residuals is calculated for the predictions of the updated model (base tree + first boosting tree). Each round of boosting corrects the base prediction by a predicted amount of error, gradually inching towards the true value.
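To make the idea concrete, here is a minimal sketch of a single boosting round, using two shallow regression trees on one feature of the tumor data. This illustrates the mechanism only; it is not how XGBoost is implemented internally.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# one feature and the (numeric) diagnosis target from the data loaded earlier
X_demo = df[['radius_mean']].values
y_demo = df['diagnosis'].values.astype(float)

# step 1: base model and its predictions
base = DecisionTreeRegressor(max_depth=1).fit(X_demo, y_demo)
pred = base.predict(X_demo)

# step 2: pseudo-residuals (target - prediction = error)
residuals = y_demo - pred

# steps 3-4: fit a tree to the residuals and add its scaled 'predicted error'
booster = DecisionTreeRegressor(max_depth=1).fit(X_demo, residuals)
pred = pred + 0.1 * booster.predict(X_demo)   # 0.1 plays the role of a learning rate

print("MSE after base tree:", np.mean((y_demo - base.predict(X_demo)) ** 2))
print("MSE after one boosting round:", np.mean((y_demo - pred) ** 2))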
You use the XGBoost library, an implementation of gradient tree boosting popularized by usage in machine learning competitions. It is a valuable addition to any machine learning toolkit, as it is fast and flexible, and often outperforms other algorithms in predictive accuracy.
If you don't have the xgboost library installed, install it:
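!pip install xgboost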
Note: The xgboost package uses an older version of sklearn. When you run import xgboost, ignore the DeprecationWarning.
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
Cross-validation is used to measure how well a model generalizes using the training set (no need to bring in the testing set yet!). The k-fold method divides the training data into even smaller train/test sets to gauge how the model performs on 'unseen' data.
You use cross-validation during parameter tuning so that you can directly observe the effect of a parameter on a model's generalization ability.
from sklearn.model_selection import cross_val_score  # replaces the deprecated sklearn.cross_validation module
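As a quick illustration of k-fold cross-validation, the following sketch reuses the decision tree from section 2 (five folds is an arbitrary choice):

# score the section 2 tree model on 5 train/test splits of the training data
scores = cross_val_score(tree_model, X_train[splitting_features],
                         y_train['diagnosis'], cv=5, scoring='roc_auc')
print("AUC per fold:", scores)
print("mean AUC:", scores.mean())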
4.1 Default model
4.2 Tuning number of estimators
4.3 Evaluating default model
4.4 Interpreting evaluation
4.5 Grid search
4.6 Tuning tree depth and min child weight
4.7 Tuning gamma
4.8 Evaluating updated model
4.9 Tuning sampling parameters
4.10 Tuning lambda & alpha
4.11 Evaluating final model
4.12 Predict on the test set
You need to initialize a model with some default parameters as a starting point. The selection here depends on the nature of your data and your experience. Recall that parameters tell the algorithm how to learn in two ways: how complex the model can become, and how quickly it should settle on a pattern in the data.
Luckily, XGBoost performs well with its default values, so you only need to define the following three parameters:
# initializing our first model with an objective and learning rate
xgb0 = XGBClassifier(
objective= 'binary:logistic',
learning_rate = 0.1,
n_estimators = 30)
The 'number of estimators' parameter in XGBoost refers to the maximum number of boosting trees (or rounds) to be used in building the model. Intuitively, the more trees that are added to the model, the more complex it is. This parameter is tuned first, so you can broadly adjust the complexity of your model before making adjustments that have smaller impacts on conservative learning.
The function below performs the following actions to find the best number of boosting trees to use on your data:
1. Runs XGBoost's built-in cross-validation with early stopping, adding boosting rounds until the AUC score stops improving.
2. Updates the model with the optimal number of estimators found.
3. Fits the model to the training data and prints its accuracy and AUC score on the training set.
4. Plots the features by their importance to the fitted model.
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

def evaluate_model(alg, train, target, predictors, cv_folds=5, early_stopping_rounds=1):
    xgb_param = alg.get_xgb_params()
    xgtrain = xgb.DMatrix(train[predictors].values, target['diagnosis'].values)
    cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'],
                      nfold=cv_folds, metrics='auc',
                      early_stopping_rounds=early_stopping_rounds, verbose_eval=True)
    # use the round at which cross-validation stopped as the number of estimators
    alg.set_params(n_estimators=cvresult.shape[0])
    # fit the algorithm on the data
    alg.fit(train[predictors], target['diagnosis'], eval_metric='auc')
    # predict the training set
    dtrain_predictions = alg.predict(train[predictors])
    dtrain_predprob = alg.predict_proba(train[predictors])[:, 1]
    # print the model report
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(target['diagnosis'].values, dtrain_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(target['diagnosis'], dtrain_predprob))
    # plot the features ordered by importance
    feat_imp = pd.Series(alg.get_booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importance', color='g')
    plt.ylabel('Feature Importance Score')
# a list of features to be used for training the model
features = [x for x in X_train.columns if x not in ['id']]
# evaluating the first model
evaluate_model(xgb0, X_train, y_train, features)
To tune the remaining XGBoost parameters, use grid search with cross-validation: the model is trained and scored with cross-validation for every combination of parameter values in a supplied grid, and the best-scoring combination is reported.
from sklearn.model_selection import GridSearchCV
max_depth: The maximum depth of each new decision tree. Smaller trees = less complexity.
min_child_weight: The minimum number of observations (more precisely, total observation weight) that must be present in a child node after a split. Larger weight = more conservative.
# updating our default model with the optimal number of estimators
xgb1 = XGBClassifier(
objective = 'binary:logistic',
learning_rate =0.1,
n_estimators=10)
# array of values for max_depth and min_child_weight parameters
param_test1 = {'max_depth':range(3,7,1),'min_child_weight':range(1,3,1)}
# grid search with cross-validation using the updated model and parameter value array
gsearch1 = GridSearchCV(estimator = xgb1, param_grid = param_test1, scoring='roc_auc', cv=5)
gsearch1.fit(X_train[features],y_train['diagnosis'], eval_metric='auc')
gsearch1.cv_results_['params'], gsearch1.best_params_, gsearch1.best_score_
The grid search found that the model works best with a max depth of 5 and a minimum child weight of 2. You can update your model accordingly and continue tuning.
gamma is the minimum loss reduction required to make a further split. Increasing gamma makes the algorithm more conservative (less prone to overfitting).
# updating our current model with the max_depth and min_child_weight parameter values found in the last grid search
xgb2 = XGBClassifier(
objective='binary:logistic',
learning_rate =0.1,
n_estimators=10,
max_depth=5,
min_child_weight=2,
gamma=0)
# array of values for the gamma parameter
gamma_test = {'gamma':[i/100.0 for i in range(0,6)]}
# grid search with cross-validation using the updated model and gamma value array
gsearch2 = GridSearchCV(estimator = xgb2, param_grid = gamma_test, scoring='roc_auc', cv=5)
gsearch2.fit(X_train[features],y_train['diagnosis'], eval_metric='auc')
gsearch2.cv_results_['params'], gsearch2.best_params_, gsearch2.best_score_
The grid search found that the model works best when gamma is 0.03.
Revisit your evaluation function to see if you need to update the number of boosting rounds. Use the parameter values found for max_depth, min_child_weight, and gamma, and set the number of estimators back to 30 to see where early stopping ends the run.
xgb_check = XGBClassifier(
objective='binary:logistic',
learning_rate =0.1,
n_estimators=30,
max_depth=5,
min_child_weight=2,
gamma=0.03)
evaluate_model(xgb_check, X_train, y_train, features)
subsample: The fraction of the training observations that is randomly sampled for growing each tree
colsample_bytree: The fraction of the features (columns) that is randomly sampled for building each tree
Note: You have reduced your data set several times now. Originally, you divided it into 80/20 train and test sets. You input only the training set to XGBoost, and it is sampled twice more by subsample and colsample_bytree, so each individual tree may be grown on well under 50% of the original data. This reduces the chance of overfitting and improves the model's ability to generalize, and it is one of the reasons why you see a better test score for XGBoost compared to the single decision tree model.
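To put rough numbers on this, assume for illustration the sampling values that the grid search below ends up selecting:

# fraction of the original rows that each boosting tree is grown on
train_fraction = 0.8   # from the 80/20 train/test split
subsample = 0.7        # row sampling ratio per tree
print(train_fraction * subsample)   # 0.56 -> roughly half of the original rows
# with colsample_bytree = 0.5, each tree also considers only half of the features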
# updating our current model with the most recent n_estimators
xgb3 = XGBClassifier(
objective='binary:logistic',
learning_rate =0.1,
n_estimators=23,
max_depth=5,
min_child_weight=2,
gamma=0.03)
# array of values for subsample and colsample_bytree parameters
sample_test = {
'subsample':[i/10.0 for i in range(5,9)],
'colsample_bytree':[i/10.0 for i in range(5,9)]
}
# grid search with cross validation for sampling parameters
gsearch3 = GridSearchCV(estimator = xgb3, param_grid = sample_test, scoring='roc_auc', cv=5)
gsearch3.fit(X_train[features],y_train['diagnosis'], eval_metric='auc')
gsearch3.cv_results_['params'], gsearch3.best_params_, gsearch3.best_score_
The grid search found that the model works best with a colsample_bytree of 0.5, and a subsample of 0.7.
Lambda and alpha are both regularization parameters, which reduce the impact of features (more precisely, leaf weights) that might otherwise be too dominant in the model. The difference between the two is in how they apply these penalties.
lambda: Applies L2 regularization, which shrinks all weights towards zero proportionally. The default value is 1.
alpha: Applies L1 regularization, which can shrink weights all the way down to 0 - essentially discarding terms that have little impact on the model. The default is 0.
These parameters have more significant effects on large models (with more features). Tune them to see how it impacts your model.
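For reference, the per-tree penalty in XGBoost's objective has roughly the following form, where $T$ is the number of leaves and $w_j$ are the leaf weights (this matches the parameter definitions in the XGBoost documentation):

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|$$

Lambda and alpha penalize large leaf weights, while gamma (tuned in section 4.7) penalizes additional leaves.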
# updating our current model with the sampling parameters found in the last grid search
xgb4 = XGBClassifier(
objective='binary:logistic',
learning_rate =0.1,
n_estimators=23,
max_depth=5,
min_child_weight=2,
gamma=0.03,
subsample=0.7,
colsample_bytree=0.5)
# array of values for the reg_alpha and reg_lambda regularization parameters
reg_test = {'reg_alpha':[0, 0.01, 0.1], 'reg_lambda':[1, 1.1, 1.2, 1.3]}
# grid search with cross validation for regularization parameters
gsearch4 = GridSearchCV(estimator = xgb4, param_grid = reg_test, scoring='roc_auc', cv=5)
gsearch4.fit(X_train[features],y_train['diagnosis'], eval_metric='auc')
gsearch4.cv_results_['params'], gsearch4.best_params_, gsearch4.best_score_
The grid search found that the model works best when reg_alpha is 0.01 and reg_lambda is 1.
Run the current model through your evaluation function a final time. Compared to the model with default parameters, tuning has improved accuracy from 0.9824 to 0.9912, and the AUC score from 0.998772 to 0.999195. You can see why XGBoost is so popular for competition use, where those fourth decimal places really count!
# updated model with alpha value found with last grid search. (lambda does not need to be set, since default is 1)
xgb_final = XGBClassifier(
objective='binary:logistic',
learning_rate =0.1,
n_estimators=30,
max_depth=5,
min_child_weight=2,
gamma=0.03,
reg_alpha=0.01)
evaluate_model(xgb_final, X_train, y_train, features)
Finally, predict on the test data you reserved at the beginning. The XGBoost ensemble gives you approximately a 3% increase in accuracy and a 5% increase in AUC score over a single decision tree.
test_features = [x for x in X_test.columns if x not in ['id']]
#Predict test set:
test_predictions = xgb_final.predict(X_test[test_features])
test_predprob = xgb_final.predict_proba(X_test[test_features])[:,1]
#Print model report:
print("Accuracy : %.4g" % metrics.accuracy_score(y_test['diagnosis'].values, test_predictions))
print("AUC Score (Test): %f" % metrics.roc_auc_score(y_test['diagnosis'], test_predprob))
If you want to increase the accuracy and AUC score even further, you can restart the parameter tuning process with a smaller learning rate. The next value to try is 0.01.
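A sketch of how that second pass might start (the n_estimators value below is a guess; a smaller learning rate generally needs many more boosting rounds):

# hypothetical starting point for re-tuning with a smaller learning rate
xgb_slow = XGBClassifier(
    objective='binary:logistic',
    learning_rate=0.01,
    n_estimators=300)
evaluate_model(xgb_slow, X_train, y_train, features)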
Natalie Ho studies statistics at the University of Waterloo.
Lichman, M. (2013). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.
Copyright © IBM Corp. 2017-2019. This notebook and its source code are released under the terms of the MIT License.