German credit risk prediction with Scikit-learn for model monitoring¶

This notebook should be run in a Watson Studio project, using the Default Python 3.11 runtime environment. It requires service credentials for the following Cloud service:

  • Watson Machine Learning

The notebook will train, store, and deploy a German Credit Risk model.

Learning goals¶

In this notebook, you will learn how to:

  • Explore data
  • Prepare data for training and evaluation
  • Create a scikit-learn pipeline
  • Train and evaluate a model
  • Store a model in the Watson Machine Learning (WML) repository
  • Deploy and score the model

Contents¶

  • Setup
  • Explore Data
  • Create a model
  • Publish the model
  • Deploy and score
  • Clean up
  • Summary

1. Set up the environment¶

Before you use the sample code in this notebook, you must perform the following setup tasks:

  • Create a Watson Machine Learning (WML) Service instance (a free plan is offered and information about how to create the instance can be found here).

Install and import the ibm-watsonx-ai package and its dependencies¶

Note: ibm-watsonx-ai documentation can be found here.

In [ ]:
!pip install -U ibm-watsonx-ai | tail -n 1
!pip install "scikit-learn==1.3.2" | tail -n 1

Connection to WML¶

Authenticate to the Watson Machine Learning service on IBM Cloud. You need to provide your platform api_key and the instance location.

You can use the IBM Cloud CLI to retrieve both the platform API key and the instance location.

An API key can be generated as follows:

ibmcloud login
ibmcloud iam api-key-create API_KEY_NAME

From the output, copy the value of api_key.

The location of your WML instance can be retrieved as follows:

ibmcloud login --apikey API_KEY -a https://cloud.ibm.com
ibmcloud resource service-instance WML_INSTANCE_NAME

From the output, copy the value of location.

Tip: Your Cloud API key can be generated by going to the Users section of the Cloud console. From that page, click your name, scroll down to the API Keys section, and click Create an IBM Cloud API key. Give your key a name, click Create, then copy the created key and paste it below. You can also get a service-specific URL by going to the Endpoint URLs section of the Watson Machine Learning docs. You can check your instance location in your Watson Machine Learning (WML) Service instance details.

You can also get a service-specific API key by going to the Service IDs section of the Cloud console. From that page, click Create, then copy the created key and paste it below.

Action: Enter your api_key and location in the following cell.

In [ ]:
api_key = 'PASTE YOUR PLATFORM API KEY HERE'
location = 'PASTE YOUR INSTANCE LOCATION HERE'
In [2]:
from ibm_watsonx_ai import Credentials

credentials = Credentials(
    api_key=api_key,
    url='https://' + location + '.ml.cloud.ibm.com'
)
In [3]:
from ibm_watsonx_ai import APIClient

client = APIClient(credentials)

Working with spaces¶

First of all, you need to create a space for your work. If you do not have a space already created, you can use the Deployment Spaces Dashboard to create one.

  • Click New Deployment Space
  • Create an empty space
  • Select Cloud Object Storage
  • Select a Watson Machine Learning instance and press Create
  • Copy the space_id and paste it below

Tip: You can also use the SDK to prepare the space for your work; a minimal sketch follows. More information can be found here.
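
The sketch below assumes client.spaces.store with a NAME metadata field, as in current ibm-watsonx-ai releases; the space name and description are hypothetical placeholders, and depending on your account you may also need to supply storage and compute details in the metadata:

In [ ]:
# Hedged sketch: create a deployment space programmatically.
# The NAME and DESCRIPTION values below are illustrative placeholders.
space_metadata = {
    client.spaces.ConfigurationMetaNames.NAME: "german_credit_risk_space",
    client.spaces.ConfigurationMetaNames.DESCRIPTION: "Space for the German credit risk demo"
}
space_details = client.spaces.store(meta_props=space_metadata)
new_space_id = client.spaces.get_id(space_details)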

Action: Assign the space ID below.

In [4]:
space_id = 'PASTE YOUR SPACE ID HERE'

You can use the list method to print all existing spaces.

In [ ]:
client.spaces.list(limit=10)

To be able to interact with all resources available in Watson Machine Learning, you need to set the space you will be using as the default space.

In [5]:
client.set.default_space(space_id)
Out[5]:
'SUCCESS'

Connection to COS¶

In the next cell, we read the Cloud Object Storage (COS) credentials from the space.

In [10]:
cos_credentials = client.spaces.get_details(space_id=space_id)['entity']['storage']['properties']
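
You can peek at which storage properties the space exposes; key names can vary by plan, so treat this as a quick sanity check:

In [ ]:
# Show the available storage property keys (e.g. bucket_name, endpoint_url).
print(sorted(cos_credentials.keys()))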

Run the notebook¶

At this point, the notebook is ready to run. You can either run the cells one at a time, or click the Kernel option above and select Restart and Run All to run all the cells.

In this section you will learn how to train a scikit-learn model and then deploy it as a web service using the Watson Machine Learning service.

Load the training data from GitHub¶

In [ ]:
!rm -f german_credit_data_biased_training.csv
!wget https://raw.githubusercontent.com/pmservice/ai-openscale-tutorials/master/assets/historical_data/german_credit_risk/wml/german_credit_data_biased_training.csv
In [12]:
import numpy as np
import pandas as pd 

training_data_file_name = "german_credit_data_biased_training.csv"
data_df = pd.read_csv(training_data_file_name)

Explore data ¶

In [13]:
data_df.head()
Out[13]:
CheckingStatus LoanDuration CreditHistory LoanPurpose LoanAmount ExistingSavings EmploymentDuration InstallmentPercent Sex OthersOnLoan ... OwnsProperty Age InstallmentPlans Housing ExistingCreditsCount Job Dependents Telephone ForeignWorker Risk
0 0_to_200 31 credits_paid_to_date other 1889 100_to_500 less_1 3 female none ... savings_insurance 32 none own 1 skilled 1 none yes No Risk
1 less_0 18 credits_paid_to_date car_new 462 less_100 1_to_4 2 female none ... savings_insurance 37 stores own 2 skilled 1 none yes No Risk
2 less_0 15 prior_payments_delayed furniture 250 less_100 1_to_4 2 male none ... real_estate 28 none own 2 skilled 1 yes no No Risk
3 0_to_200 28 credits_paid_to_date retraining 3693 less_100 greater_7 3 male none ... savings_insurance 32 none own 1 skilled 1 none yes No Risk
4 no_checking 28 prior_payments_delayed education 6235 500_to_1000 greater_7 3 male none ... unknown 57 none own 2 skilled 1 none yes Risk

5 rows × 21 columns

In [14]:
print('Columns: ', list(data_df.columns))
print('Number of columns: ', len(data_df.columns))
Columns:  ['CheckingStatus', 'LoanDuration', 'CreditHistory', 'LoanPurpose', 'LoanAmount', 'ExistingSavings', 'EmploymentDuration', 'InstallmentPercent', 'Sex', 'OthersOnLoan', 'CurrentResidenceDuration', 'OwnsProperty', 'Age', 'InstallmentPlans', 'Housing', 'ExistingCreditsCount', 'Job', 'Dependents', 'Telephone', 'ForeignWorker', 'Risk']
Number of columns:  21

As you can see, the data contains twenty-one fields. The Risk field is the target you will predict.
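
Before moving on, it can help to see which columns pandas read as strings, since those are exactly the columns the pipeline below will one-hot encode:

In [ ]:
# Inspect column dtypes: object-typed columns are the categorical features.
data_df.dtypes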

In [15]:
print('Number of records: ', data_df.Risk.count())
Number of records:  5000
In [16]:
target_count = data_df.groupby('Risk')['Risk'].count()
target_count
Out[16]:
Risk
No Risk    3330
Risk       1670
Name: Risk, dtype: int64

Visualize data¶

In [18]:
target_count.plot.pie(figsize=(8, 8));
(Output: pie chart of the Risk class distribution.)

Save training data to Cloud Object Storage¶

In [22]:
import ibm_boto3
from ibm_botocore.client import Config

cos_client = ibm_boto3.resource("s3",
    ibm_api_key_id=cos_credentials['credentials']['editor']['api_key'],
    ibm_service_instance_id=cos_credentials['resource_crn'],
    ibm_auth_endpoint='https://iam.cloud.ibm.com/identity/token',
    config=Config(signature_version="oauth"),
    endpoint_url=cos_credentials['endpoint_url']
)
In [23]:
with open(training_data_file_name, "rb") as file_data:
    cos_client.Object(cos_credentials['bucket_name'], training_data_file_name).upload_fileobj(
        Fileobj=file_data
    )
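
To confirm the upload, you can list the bucket's contents with the same ibm_boto3 resource; a quick check, assuming the bucket is small enough to list in full:

In [ ]:
# List the objects in the space's bucket to confirm the training file landed.
bucket = cos_client.Bucket(cos_credentials['bucket_name'])
for obj in bucket.objects.all():
    print(obj.key)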

Create a model ¶

In this section you will learn how to:

  • Prepare data for training a model
  • Create a machine learning pipeline
  • Train a model
In [32]:
MODEL_NAME = "Scikit German Risk Model WML V4"

DEPLOYMENT_NAME = "Scikit German Risk Deployment WML V4"

Start by importing the required libraries¶

In [24]:
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

Splitting the data into train and test sets¶

In [25]:
train_data, test_data = train_test_split(data_df, test_size=0.2)
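
The split above differs on every run; if you want it reproducible and class-balanced, an optional variant (the random_state value is arbitrary):

In [ ]:
# Optional variant: fix the seed and stratify on the target so both splits
# keep the roughly 2:1 'No Risk'/'Risk' ratio observed earlier.
train_data, test_data = train_test_split(
    data_df, test_size=0.2, random_state=42, stratify=data_df.Risk
)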

Preparing the pipeline¶

In [26]:
# Convenience index slices: the feature columns (all but the last column, 'Risk'),
# all records, and the first record.
features_idx = np.s_[0:-1]
all_records_idx = np.s_[:]
first_record_idx = np.s_[0]

In this step you identify the string-valued (categorical) feature columns and one-hot encode them with a ColumnTransformer, then chain the transformer with a classifier into a single pipeline.

In [27]:
# Flag which feature columns hold strings: these are the categorical fields.
string_fields = [type(fld) is str for fld in train_data.iloc[first_record_idx, features_idx]]
# One-hot encode the categorical columns (columns not listed are dropped by default).
ct = ColumnTransformer([("ohe", OneHotEncoder(), list(np.array(train_data.columns)[features_idx][string_fields]))])
# Linear classifier trained with SGD; 'log_loss' is the scikit-learn 1.3 name
# for the logistic loss (the old 'log' alias was removed in 1.3).
clf_linear = SGDClassifier(loss='log_loss', penalty='l2', max_iter=1000, tol=1e-5)

pipeline_linear = Pipeline([('ct', ct), ('clf_linear', clf_linear)])

Train a model¶

In [28]:
risk_model = pipeline_linear.fit(train_data.drop('Risk', axis=1), train_data.Risk)

Evaluate the model¶

In [29]:
from sklearn.metrics import roc_auc_score

# Predict on the held-out set and map the string labels to 0/1 integers.
predictions = risk_model.predict(test_data.drop('Risk', axis=1))
indexed_preds = [0 if prediction == 'No Risk' else 1 for prediction in predictions]

# Encode the ground-truth labels the same way for roc_auc_score.
real_observations = test_data.Risk.map({'No Risk': 0, 'Risk': 1}).values

auc = roc_auc_score(real_observations, indexed_preds)
print(auc)
0.7140884968445209
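
AUC gives a single-number summary; for per-class precision and recall you can optionally add scikit-learn's classification_report, which works on the string labels directly:

In [ ]:
# Optional: per-class precision, recall, and F1 on the held-out set.
from sklearn.metrics import classification_report

print(classification_report(test_data.Risk, predictions))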

Publish the model ¶

In this section, the notebook uses the Watson Machine Learning client to save the model (including the pipeline) to the WML repository in your deployment space.

In [33]:
software_spec_id = client.software_specifications.get_id_by_name("runtime-24.1-py3.11")
print("Software Specification ID: {}".format(software_spec_id))
model_props = {
    client.repository.ModelMetaNames.NAME: "{}".format(MODEL_NAME),
    client.repository.ModelMetaNames.TYPE: 'scikit-learn_1.3',
    client.repository.ModelMetaNames.SOFTWARE_SPEC_ID: software_spec_id
}
Software Specification ID: 336b29df-e0e1-5e7d-b6a5-f6ab722625b2
In [35]:
print("Storing model ...")

published_model_details = client.repository.store_model(
    model=risk_model,
    meta_props=model_props,
    training_data=data_df.drop(["Risk"], axis=1),
    training_target=data_df.Risk
)
model_id = client.repository.get_model_id(published_model_details)
print("Done")
print("Model ID: {}".format(model_id))
Storing model ...
Done
Model ID: 727c0550-b0aa-47c7-8ebc-dec52f2ff578

Deploy and score ¶

The next section of the notebook deploys the model as a RESTful web service in Watson Machine Learning. The deployed model will have a scoring URL you can use to send data to the model for predictions.

In [36]:
print("Deploying model...")
metadata = {
    client.deployments.ConfigurationMetaNames.NAME: DEPLOYMENT_NAME,
    client.deployments.ConfigurationMetaNames.ONLINE: {}
}
deployment = client.deployments.create(model_id, meta_props=metadata)
deployment_id = client.deployments.get_id(deployment)

print("Model id: {}".format(model_id))
print("Deployment id: {}".format(deployment_id))
Deploying model...


#######################################################################################

Synchronous deployment creation for uid: '727c0550-b0aa-47c7-8ebc-dec52f2ff578' started

#######################################################################################


initializing
Note: online_url and serving_urls are deprecated and will be removed in a future release. Use inference instead.

ready


------------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_uid='e27bd69c-47e3-4375-8a67-a0d0322f6053'
------------------------------------------------------------------------------------------------


Model id: 727c0550-b0aa-47c7-8ebc-dec52f2ff578
Deployment id: e27bd69c-47e3-4375-8a67-a0d0322f6053
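
If you need the raw scoring endpoint, for example to call the deployment over plain REST, a sketch using the client's get_scoring_href helper (the exact URL shape depends on your region and deployment):

In [ ]:
# Retrieve the deployment details and extract its scoring endpoint.
deployment_details = client.deployments.get_details(deployment_id)
scoring_url = client.deployments.get_scoring_href(deployment_details)
print(scoring_url)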

Score the model¶

In [37]:
fields = ["CheckingStatus", "LoanDuration", "CreditHistory", "LoanPurpose", "LoanAmount", "ExistingSavings",
                  "EmploymentDuration", "InstallmentPercent", "Sex", "OthersOnLoan", "CurrentResidenceDuration",
                  "OwnsProperty", "Age", "InstallmentPlans", "Housing", "ExistingCreditsCount", "Job", "Dependents",
                  "Telephone", "ForeignWorker"]
values = [
            ["no_checking", 13, "credits_paid_to_date", "car_new", 1343, "100_to_500", "1_to_4", 2, "female", "none", 3,
             "savings_insurance", 46, "none", "own", 2, "skilled", 1, "none", "yes"],
            ["no_checking", 24, "prior_payments_delayed", "furniture", 4567, "500_to_1000", "1_to_4", 4, "male", "none",
             4, "savings_insurance", 36, "none", "free", 2, "management_self-employed", 1, "none", "yes"],
        ]

scoring_payload = {"input_data": [{"fields": fields, "values": values}]}
In [38]:
predictions = client.deployments.score(deployment_id, scoring_payload)
predictions
Out[38]:
{'predictions': [{'fields': ['prediction', 'probability'],
   'values': [['No Risk', [0.569000245132717, 0.43099975486728304]],
    ['No Risk', [0.7041741561003128, 0.2958258438996873]]]}]}
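
The same payload can also be sent over plain REST with an IAM bearer token. A hedged sketch using the scoring_url retrieved above; the IAM token exchange is the standard IBM Cloud flow, while the version query parameter value is an assumption you should check against the current WML API docs:

In [ ]:
import requests

# Exchange the platform API key for an IAM bearer token (standard IBM Cloud flow).
token_response = requests.post(
    "https://iam.cloud.ibm.com/identity/token",
    data={"apikey": api_key, "grant_type": "urn:ibm:params:oauth:grant-type:apikey"}
)
iam_token = token_response.json()["access_token"]

# POST the scoring payload; the WML public API expects a version date parameter
# (the date used here is an assumption; check the current docs).
response = requests.post(
    scoring_url,
    json=scoring_payload,
    headers={"Authorization": "Bearer " + iam_token},
    params={"version": "2021-06-01"}
)
print(response.json())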

Clean up¶

If you want to clean up all created assets:

  • experiments
  • trainings
  • pipelines
  • model definitions
  • models
  • functions
  • deployments

please follow this sample notebook.

Summary and next steps¶

You successfully completed this notebook!

You have finished the hands-on lab for IBM Watson Machine Learning. You created, published, and deployed a scikit-learn German credit risk model.

Check out our Online Documentation for more samples, tutorials, documentation, how-tos, and blog posts.

You can now run the model monitoring notebook. You will need to pass the deployed model ID to that notebook.

Authors¶

Lukasz Cmielowski, PhD, is an Automation Architect and Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients' ability to turn data into actionable knowledge.

Szymon Kucharczyk, Software Engineer at IBM Watson Machine Learning.

Mateusz Szewczyk, Software Engineer at Watson Machine Learning.

Copyright © 2020-2024 IBM. This notebook and its source code are released under the terms of the MIT License.