Federated Learning XGBoost tutorial for UI
Last updated: Oct 09, 2024

This tutorial demonstrates the usage of Federated Learning with the goal of training a machine learning model with data from different users without having users share their data. The steps are done in a low code environment with the UI and with an XGBoost framework.

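In this tutorial you learn to:

  • Start Federated Learning
  • Train the model as a party
  • Save and deploy the model online
  • Score the model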

Notes:

  • This is a step-by-step tutorial for running a UI driven Federated Learning experiment. To see a code sample for an API driven approach, go to Federated Learning XGBoost samples.
  • In this tutorial, admin refers to the user that starts the Federated Learning experiment, and party refers to one or more users who send their model results after the experiment is started by the admin. While the tutorial can be done by the admin and multiple parties, a single user can also complete a full run through as both the admin and the party. To keep the demonstration simple, only one data set is submitted by one party in this tutorial. For more information on the admin and party, see Terminology.

Step 1: Start Federated Learning

In this section, you learn to start the Federated Learning experiment.

Before you begin

  1. Log in to IBM Cloud. If you don't have an account, create one with any email.

  2. Create a Watson Machine Learning service instance if you do not have it set up in your environment.

  3. Log in to watsonx.

  4. Use an existing project or create a new one. You must have at least admin permission.

  5. Associate the Watson Machine Learning service with your project.

    1. In your project, click Manage > Services & integrations.
    2. Click Associate service.
    3. Select your Watson Machine Learning instance from the list, and click Associate; or click New service if you do not have one to set up an instance.

    Screenshot of associating the service

Start the aggregator

  1. Create the Federated learning experiment asset:

    1. Click the Assets tab in your project.

    2. Click New task > Train models on distributed data.

    3. Type a Name for your experiment and optionally a description.

    4. Verify the associated Watson Machine Learning instance under Select a machine learning instance. If you don't see a Watson Machine Learning instance associated, follow these steps:

      1. Click Associate a Machine Learning Service Instance.

      2. Select an existing instance and click Associate, or create a New service.

      3. Click Reload to see the associated service.

        Screenshot of associating the service

      4. Click Next.

  2. Configure the experiment.

    1. On the Configure page, select a Hardware specification.

    2. Under the Machine learning framework dropdown, select scikit-learn.

    3. For the Model type, select XGBoost.

    4. For the Fusion method, select XGBoost classification fusion.

      Screenshot of selecting XGBoost classification

  3. Define the hyperparameters.

    1. Set the value for the Rounds field to 5.

    2. Accept the default values for the rest of the fields.

      Screenshot of selecting hyperparameters

    3. Click Next.

  4. Select remote training systems.

    1. Click Add new systems.

    Screenshot of Add RTS UI

    2. Give your Remote Training System a name.

    3. Under Allowed identities, select the user that will participate in the experiment, and then click Add. You can add as many allowed identities as there are participants in this Federated Experiment training instance. For this tutorial, choose only yourself.
      Any allowed identities must be part of the project and have at least Admin permission.

    4. When you are finished, click Add systems.

      Screenshot of creating an RTS

    5. Return to the Select remote training systems page, verify that your system is selected, and then click Next.

      Screenshot of selecting RTS

  5. Review your settings, and then click Create.

  6. Watch the status. Your Federated Learning experiment status is Pending when it starts. When your experiment is ready for parties to connect, the status will change to Setup – Waiting for remote systems. This may take a few minutes.

Step 2: Train model as a party

  1. Ensure that you are using the same Python version as the admin. Using a different Python version might cause compatibility issues. To see Python versions compatible with different frameworks, see Frameworks and Python version compatibility.

  2. Create a new local directory.

  3. Download the Adult data set into the directory with this command: wget https://api.dataplatform.cloud.ibm.com/v2/gallery-assets/entries/5fcc01b02d8f0e50af8972dc8963f98e/data -O adult.csv.

  4. Download the data handler by running wget https://raw.githubusercontent.com/IBMDataScience/sample-notebooks/master/Files/adult_sklearn_data_handler.py -O adult_sklearn_data_handler.py.

  5. Install Watson Machine Learning.

    • If you are using Linux, run pip install 'ibm-watson-machine-learning[fl-rt22.2-py3.10]'.
    • If you are using Mac OS with M-series CPU and Conda, download the installation script and then run ./install_fl_rt22.2_macos.sh <name for new conda environment>.
      You now have the Adult data set (adult.csv) and the data handler (adult_sklearn_data_handler.py) in the same directory.
  6. Go back to the Federated Learning experiment page, where the aggregator is running. Click View Setup Information.

  7. Click the download icon next to the remote training system, and select Party connector script.

  8. Ensure that you have the party connector script, the Adult data set, and the data handler in the same directory. If you run ls -l, you should see:

    adult.csv
    adult_sklearn_data_handler.py
    rts_<RTS Name>_<RTS ID>.py
    
  9. In the party connector script:

    1. Authenticate using any method.

    2. Put in these parameters for the "data" section:

      "data": {
              "name": "AdultSklearnDataHandler",
              "path": "./adult_sklearn_data_handler.py",
              "info": {
                      "txt_file": "./adult.csv"
              },
      },
      

      where:

      • name: Class name defined for the data handler.
      • path: Path of where the data handler is located.
      • info: Key-value pairs that describe the local data set, such as its file type or the path of the data set.
  10. Run the party connector script: python3 rts_<RTS Name>_<RTS ID>.py.

  11. When all participating parties connect to the aggregator, the aggregator facilitates the local model training and global model update. Its status is Training. You can monitor the status of your Federated Learning experiment from the user interface.

  12. When training is complete, the party receives a Received STOP message.

  13. Now, you can save the trained model and deploy it to a space.
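
Before running the party connector script, it can help to confirm the prerequisites from the steps above in one place. The following is a minimal sketch, not part of the IBM tooling: preflight is a hypothetical helper name, the file names are the ones used in this tutorial, and the default Python version (3, 10) is an assumption taken from the fl-rt22.2-py3.10 install extra.

```python
import sys
from pathlib import Path

# File names used in this tutorial; the connector script name varies
# per remote training system, so it is matched by pattern.
REQUIRED_FILES = ["adult.csv", "adult_sklearn_data_handler.py"]
CONNECTOR_PATTERN = "rts_*.py"


def preflight(directory, expected_python=(3, 10)):
    """Return a list of problems found; an empty list means ready to run.

    expected_python is an assumption: set it to the version the admin uses.
    """
    directory = Path(directory)
    problems = []

    # The party should use the same Python version as the admin.
    current = tuple(sys.version_info[:2])
    if current != tuple(expected_python):
        problems.append(
            "Python %d.%d in use, expected %d.%d"
            % (current + tuple(expected_python))
        )

    # The data set and data handler must sit next to the connector script.
    for name in REQUIRED_FILES:
        if not (directory / name).is_file():
            problems.append("missing file: " + name)

    if not list(directory.glob(CONNECTOR_PATTERN)):
        problems.append("missing party connector script (rts_*.py)")

    return problems
```

Calling preflight(".") from the directory that you created returns an empty list when everything is in place, and otherwise lists what is missing.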

Step 3: Save and deploy the model online

In this section, you learn how to save and deploy the model that you trained.

  1. Save your model.

    1. In your completed Federated Learning experiment, click Save model to project.
    2. Give your model a name and click Save.
    3. Go to your project home.
  2. Create a deployment space, if you don't have one.

    1. From the navigation menu, click Deployments.

    2. Click New deployment space.

    3. Fill in the fields, and click Create.

      Screenshot of creating a deployment

  3. Promote the model to a space.

    1. Return to your project, and click the Assets tab.
    2. In the Models section, click the model to view its details page.
    3. Click Promote to space.
    4. Choose a deployment space for your trained model.
    5. Select the Go to the model in the space after promoting it option.
    6. Click Promote.
  4. When the model displays inside the deployment space, click New deployment.

    1. Select Online as the Deployment type.
    2. Specify a name for the deployment.
    3. Click Create.

Step 4: Score the model

In this section, you learn to create a Python function that processes the scoring data so that it is in the same format that was used during training. For comparison, you also score the raw data set by calling the Python function that you created.
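
Both the raw scoring request and the processed request in this section use the same payload shape: a dictionary with an input_data list whose entries hold fields (column names) and values (rows). The sketch below illustrates that shape in plain Python. The helper names build_payload and payload_to_rows are hypothetical and are not part of any IBM library.

```python
# Illustrative helpers (hypothetical names) for the fields/values payload shape.

def build_payload(fields, values):
    """Wrap column names and row values in the scoring payload shape."""
    return {
        "input_data": [
            {"fields": list(fields), "values": [list(row) for row in values]}
        ]
    }


def payload_to_rows(payload):
    """Turn the first input_data entry back into a list of row dictionaries."""
    entry = payload["input_data"][0]
    return [dict(zip(entry["fields"], row)) for row in entry["values"]]


# Two sample rows with values in the style of the Adult data set.
payload = build_payload(["age", "sex"], [[39, " Male"], [28, " Female"]])
rows = payload_to_rows(payload)
```

Each row dictionary pairs a field name with the value at the same position, which is the same pairing that the scoring function relies on when it builds a DataFrame from fields and values.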

  1. Define the Python function as follows. The function loads the scoring data in its raw format, processes the data exactly as it was done during training, and then scores the processed data.

    def adult_scoring_function():

        import pandas as pd

        from ibm_watson_machine_learning import APIClient

        wml_credentials = {
            "url": "https://us-south.ml.cloud.ibm.com",
            "apikey": "<API KEY>"
        }
        client = APIClient(wml_credentials)
        client.set.default_space('<SPACE ID>')

        # converts scoring input data format to pandas dataframe
        def create_dataframe(raw_dataset):

            fields = raw_dataset.get("input_data")[0].get("fields")
            values = raw_dataset.get("input_data")[0].get("values")

            raw_dataframe = pd.DataFrame(
                columns = fields,
                data = values
            )

            return raw_dataframe

        # reuse preprocess definition from training data handler
        def preprocess(training_data):

            """
            Performs the following preprocessing on adult training and testing data:
            * Drop following features: 'workclass', 'fnlwgt', 'education', 'marital-status', 'occupation',
            'relationship', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country'
            * Map 'race', 'sex' and 'class' values to 0/1
                * ' White': 1, ' Amer-Indian-Eskimo': 0, ' Asian-Pac-Islander': 0, ' Black': 0, ' Other': 0
                * ' Male': 1, ' Female': 0
                * Further details in Kamiran, F. and Calders, T. Data preprocessing techniques for classification without discrimination
            * Split 'age' and 'education' columns into multiple columns based on value

            :param training_data: Raw training data
            :type training_data: `pandas.core.frame.DataFrame`
            :return: Preprocessed training data
            :rtype: `pandas.core.frame.DataFrame`
            """
            if len(training_data.columns)==15:
                # drop 'fnlwgt' column
                training_data = training_data.drop(training_data.columns[2], axis='columns')

            training_data.columns = ['age',
                                    'workclass',
                                    'education',
                                    'education-num',
                                    'marital-status',
                                    'occupation',
                                    'relationship',
                                    'race',
                                    'sex',
                                    'capital-gain',
                                    'capital-loss',
                                    'hours-per-week',
                                    'native-country',
                                    'class']

            # filter out columns unused in training, and reorder columns
            # (.copy() so that the assignments below modify an independent frame)
            training_dataset = training_data[['race', 'sex', 'age', 'education-num', 'class']].copy()

            # map 'sex' and 'race' feature values based on sensitive attribute privileged/unprivileged groups
            training_dataset['sex'] = training_dataset['sex'].map({' Female': 0,
                                                                    ' Male': 1})

            training_dataset['race'] = training_dataset['race'].map({' Asian-Pac-Islander': 0,
                                                                    ' Amer-Indian-Eskimo': 0,
                                                                    ' Other': 0,
                                                                    ' Black': 0,
                                                                    ' White': 1})

            # map 'class' values to 0/1 based on positive and negative classification
            training_dataset['class'] = training_dataset['class'].map({' <=50K': 0, ' >50K': 1})

            training_dataset['age'] = training_dataset['age'].astype(int)
            training_dataset['education-num'] = training_dataset['education-num'].astype(int)

            # split age column into category columns
            for i in range(8):
                if i != 0:
                    training_dataset['age' + str(i)] = 0

            for index, row in training_dataset.iterrows():
                if row['age'] < 20:
                    training_dataset.loc[index, 'age1'] = 1
                elif ((row['age'] < 30) & (row['age'] >= 20)):
                    training_dataset.loc[index, 'age2'] = 1
                elif ((row['age'] < 40) & (row['age'] >= 30)):
                    training_dataset.loc[index, 'age3'] = 1
                elif ((row['age'] < 50) & (row['age'] >= 40)):
                    training_dataset.loc[index, 'age4'] = 1
                elif ((row['age'] < 60) & (row['age'] >= 50)):
                    training_dataset.loc[index, 'age5'] = 1
                elif ((row['age'] < 70) & (row['age'] >= 60)):
                    training_dataset.loc[index, 'age6'] = 1
                elif row['age'] >= 70:
                    training_dataset.loc[index, 'age7'] = 1

            # split education-num column into category columns
            training_dataset['ed6less'] = 0
            for i in range(13):
                if i >= 6:
                    training_dataset['ed' + str(i)] = 0
            training_dataset['ed12more'] = 0

            for index, row in training_dataset.iterrows():
                if row['education-num'] < 6:
                    training_dataset.loc[index, 'ed6less'] = 1
                elif row['education-num'] == 6:
                    training_dataset.loc[index, 'ed6'] = 1
                elif row['education-num'] == 7:
                    training_dataset.loc[index, 'ed7'] = 1
                elif row['education-num'] == 8:
                    training_dataset.loc[index, 'ed8'] = 1
                elif row['education-num'] == 9:
                    training_dataset.loc[index, 'ed9'] = 1
                elif row['education-num'] == 10:
                    training_dataset.loc[index, 'ed10'] = 1
                elif row['education-num'] == 11:
                    training_dataset.loc[index, 'ed11'] = 1
                elif row['education-num'] == 12:
                    training_dataset.loc[index, 'ed12'] = 1
                elif row['education-num'] > 12:
                    training_dataset.loc[index, 'ed12more'] = 1

            training_dataset.drop(['age', 'education-num'], axis=1, inplace=True)

            # move class column to be last column
            label = training_dataset['class']
            training_dataset.drop('class', axis=1, inplace=True)
            training_dataset['class'] = label

            return training_dataset

        def score(raw_dataset):
            try:

                # create pandas dataframe from input
                raw_dataframe = create_dataframe(raw_dataset)

                # reuse preprocess from training data handler
                processed_dataset = preprocess(raw_dataframe)

                # drop class column
                processed_dataset.drop('class', inplace=True, axis='columns')

                # create data payload for scoring
                fields = processed_dataset.columns.values.tolist()
                values = processed_dataset.values.tolist()
                scoring_dataset = {client.deployments.ScoringMetaNames.INPUT_DATA: [{'fields': fields, 'values': values}]}
                print(scoring_dataset)

                # score data
                prediction = client.deployments.score('<MODEL DEPLOYMENT ID>', scoring_dataset)
                return prediction

            except Exception as e:
                return {'error': repr(e)}

        return score
    
  2. Replace the variables in the previous Python function:

    • API KEY: Your IAM API key. To create a new API key, go to the IBM Cloud website, and click Create an IBM Cloud API key under Manage > Access (IAM) > API keys.
    • SPACE ID: ID of the Deployment space where the adult income deployment is running. To see your space ID, go to Deployment spaces > YOUR SPACE NAME > Manage. Copy the Space GUID.
    • MODEL DEPLOYMENT ID: Online deployment ID for the adult income model. To find the ID, click the model in your project. The ID is shown in both the address bar and the information pane.
  3. Get the Software Spec ID for Python 3.9. For a list of other environments, run client.software_specifications.list().

    software_spec_id = client.software_specifications.get_id_by_name('default_py3.9')

  4. Store the Python function in your Watson Studio space.

    # stores python function in space
    meta_props = {
            client.repository.FunctionMetaNames.NAME: 'Adult Income Scoring Function',
            client.repository.FunctionMetaNames.SOFTWARE_SPEC_ID: software_spec_id
    }
    stored_function = client.repository.store_function(meta_props=meta_props, function=adult_scoring_function)
    function_id = stored_function['metadata']['id']
    
  5. Create an online deployment by using the Python function.

    # create online deployment for function
    meta_props = {
        client.deployments.ConfigurationMetaNames.NAME: "Adult Income Online Scoring Function",
        client.deployments.ConfigurationMetaNames.ONLINE: {}
    }
    online_deployment = client.deployments.create(function_id, meta_props=meta_props)
    function_deployment_id = online_deployment['metadata']['id']
    
  6. Load the Adult Income data set that you downloaded earlier. It is reused as the scoring data.

    import pandas as pd
    
    # read adult csv dataset
    adult_csv = pd.read_csv('./adult.csv', dtype='category')
    
    # use 10 random rows for scoring
    sample_dataset = adult_csv.sample(n=10)
    
    fields = sample_dataset.columns.values.tolist()
    values = sample_dataset.values.tolist()
    
  7. Score the adult income data by using the Python function created.

    raw_dataset = {client.deployments.ScoringMetaNames.INPUT_DATA: [{'fields': fields, 'values': values}]}
    
    prediction = client.deployments.score(function_deployment_id, raw_dataset)
    print(prediction)
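
The preprocess function in step 1 replaces the age and education-num columns with indicator columns, one per value range. The sketch below restates those bucketing rules in plain Python so the mapping is easier to check by hand. It assumes integer inputs, as in the tutorial code after the astype(int) casts. The names age_bucket, education_bucket, and encode are hypothetical helpers, not part of the data handler.

```python
# Pure-Python restatement of the bucketing rules used by preprocess().
# Helper names are hypothetical; inputs are assumed to be integers.

AGE_COLUMNS = ["age" + str(i) for i in range(1, 8)]
EDUCATION_COLUMNS = (
    ["ed6less"] + ["ed" + str(i) for i in range(6, 13)] + ["ed12more"]
)


def age_bucket(age):
    """Return the indicator column name ('age1'..'age7') for an age."""
    if age < 20:
        return "age1"
    if age >= 70:
        return "age7"
    # 20-29 -> age2, 30-39 -> age3, ..., 60-69 -> age6
    return "age" + str(age // 10)


def education_bucket(education_num):
    """Return the indicator column name for an education-num value."""
    if education_num < 6:
        return "ed6less"
    if education_num > 12:
        return "ed12more"
    # 6..12 each get their own column: ed6 .. ed12
    return "ed" + str(education_num)


def encode(age, education_num):
    """Build the indicator columns for one row, with exactly two set to 1."""
    columns = dict.fromkeys(AGE_COLUMNS + EDUCATION_COLUMNS, 0)
    columns[age_bucket(age)] = 1
    columns[education_bucket(education_num)] = 1
    return columns


encoded = encode(39, 13)
```

For a 39-year-old with education-num 13, only age3 and ed12more are set to 1. All other indicator columns stay 0.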
    

Next steps

Creating your Federated Learning experiment.

