Create the data handler | IBM Cloud Pak for Data as a Service

Create the data handler

Last updated: Nov 27, 2024

Each party in a Federated Learning experiment needs a data handler to process its data. You or a data scientist must create the data handler. A data handler is a Python class that loads and transforms data so that all data for the experiment is in a consistent format.

About the data handler class

The data handler performs the following functions:

- Accesses the data that is required to train the model. For example, it reads data from a CSV file into a Pandas data frame.
- Pre-processes the data so that it is in a consistent format across all parties. For example:
  - The Date column might be stored as a time epoch or as a timestamp.
  - The Country column might be encoded or abbreviated.
  The data handler ensures that the data formatting is in agreement.
- Optional: Performs feature engineering as needed.
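
The pre-processing cases in this list can be sketched as follows. This is a minimal, dependency-free illustration and not part of the Federated Learning API: the function names, the `COUNTRY_CODES` table, and the choice of ISO 8601 dates and two-letter country codes as the agreed format are all assumptions made for the example.

```python
from datetime import datetime, timezone

# Hypothetical lookup table; the parties would agree on their own codes.
COUNTRY_CODES = {
    "united states": "US", "usa": "US", "us": "US",
    "germany": "DE", "de": "DE",
}


def normalize_date(value):
    """Return a YYYY-MM-DD string from an epoch (seconds) or an ISO timestamp."""
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc).date().isoformat()
    text = str(value).strip()
    if text.isdigit():
        # Epoch seconds that were stored as text.
        return datetime.fromtimestamp(int(text), tz=timezone.utc).date().isoformat()
    # Otherwise assume an ISO 8601 timestamp such as 2024-11-27T10:00:00.
    return datetime.fromisoformat(text).date().isoformat()


def normalize_country(value):
    """Map a country name or abbreviation to the agreed two-letter code."""
    return COUNTRY_CODES[str(value).strip().lower()]
```

In a real data handler, helpers like these would typically be applied to whole columns of a data frame. An unknown country raises `KeyError` here, so that inconsistent data is noticed before training starts.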

The following illustration shows how a data handler is used to process data and make it consumable by the experiment:

A use case of the data handler unifying data formats

One party might store its data in multiple tables in a relational database, while another party uses a CSV file. After the data handler processes the data, both parties' data has a unified format. For example, all data is put into a single table, with data that was previously in separate tables joined together.
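
As a rough sketch of this use case, the following example brings a two-table database and a CSV file into the same flat row format. Everything specific in it is hypothetical: the `customers` and `orders` tables, the column names, and the two helper functions are invented for illustration, and an in-memory SQLite database stands in for the relational database.

```python
import csv
import io
import sqlite3


def rows_from_database(conn):
    """Party A: join the two tables into one flat list of rows."""
    query = """
        SELECT c.customer_id, c.country, o.amount
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.customer_id
        ORDER BY c.customer_id, o.amount
    """
    return [(int(cid), str(country), float(amount))
            for cid, country, amount in conn.execute(query)]


def rows_from_csv(text):
    """Party B: parse CSV text that already holds one flat table."""
    reader = csv.DictReader(io.StringIO(text))
    return [(int(r["customer_id"]), r["country"], float(r["amount"]))
            for r in reader]


# Party A's data, held in an in-memory database for the sake of the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'US'), (2, 'DE');
    INSERT INTO orders VALUES (1, 10.0), (1, 25.5), (2, 7.0);
""")
party_a_rows = rows_from_database(conn)

# Party B's data, already a single table.
party_b_csv = "customer_id,country,amount\n3,FR,12.0\n4,US,3.5\n"
party_b_rows = rows_from_csv(party_b_csv)
```

Both parties end up with rows of the form `(customer_id, country, amount)`, which is the property the data handler is meant to guarantee.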

Data handler template

A general data handler template is as follows:

# your import statements

from ibmfl.data.data_handler import DataHandler

class MyDataHandler(DataHandler):
    """
    Data handler for your dataset.
    """
    def __init__(self, data_config=None):
        super().__init__()
        self.file_name = None
        if data_config is not None:
            # This can be any string field.
            # For example, if your data set is in `csv` format,
            # <your_data_file_type> can be "CSV", ".csv", "csv", "csv_file" and more.
            if '<your_data_file_type>' in data_config:
                self.file_name = data_config['<your_data_file_type>']
            # Extract any additional parameters from `data_config` here.

        # load and preprocess the training and testing data
        self.load_and_preprocess_data()

        """
        # Example:
        # (self.x_train, self.y_train), (self.x_test, self.y_test) = self.load_dataset()
        """

    def load_and_preprocess_data(self):
        """
        Loads and pre-processes local datasets,
        and updates self.x_train, self.y_train, self.x_test, self.y_test.

        # Example:
        # return (self.x_train, self.y_train), (self.x_test, self.y_test)
        """

        pass
    
    def get_data(self):
        """
        Gets the prepared training and testing data.
        
        :return: ((x_train, y_train), (x_test, y_test)) # most built-in training modules expect data to be returned in this format
        :rtype: `tuple` 

        This function should be as brief as possible. Any pre-processing operations should be performed in a separate function and not inside get_data(), especially computationally expensive ones.

        # Example:
        # X, y = load_somedata()
        # x_train, x_test, y_train, y_test = \
        # train_test_split(X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE)
        # return (x_train, y_train), (x_test, y_test)
        """
        pass

    def preprocess(self, X, y):
        pass

Parameters

your_data_file_type: This can be any string field. For example, if your data set is in csv format, your_data_file_type can be "CSV", ".csv", "csv", "csv_file" and more.
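
To make the template more concrete, here is one way it might be filled in for a small CSV file. Treat it as a sketch under stated assumptions rather than a reference implementation: the class name `CsvDataHandler`, the `csv_file` and `test_fraction` keys, the column layout (numeric features followed by an integer label), and the min-max scaling step are all choices made for this example. It uses plain Python lists to stay dependency-free, whereas a real handler would more likely return NumPy arrays. The fallback base class exists only so that the sketch runs where the `ibmfl` package is not installed.

```python
import csv
import os
import tempfile

try:
    from ibmfl.data.data_handler import DataHandler
except ImportError:
    # Stand-in base class so the sketch runs where IBM FL is not installed.
    class DataHandler:
        def __init__(self, **kwargs):
            pass


class CsvDataHandler(DataHandler):
    """Hypothetical handler: numeric features, integer label in the last column."""

    def __init__(self, data_config=None):
        super().__init__()
        self.file_name = None
        self.test_fraction = 0.25  # assumed default, not an IBM FL setting
        if data_config is not None:
            self.file_name = data_config.get("csv_file", self.file_name)
            self.test_fraction = data_config.get("test_fraction", self.test_fraction)
        if self.file_name is None:
            raise ValueError("data_config must contain a 'csv_file' entry")
        self.load_and_preprocess_data()

    def load_and_preprocess_data(self):
        """Read the file once, scale the features, and store the split."""
        with open(self.file_name, newline="") as f:
            rows = list(csv.reader(f))[1:]  # skip the header row
        X = [[float(v) for v in row[:-1]] for row in rows]
        y = [int(row[-1]) for row in rows]
        X, y = self.preprocess(X, y)
        split = len(X) - int(len(X) * self.test_fraction)
        self.x_train, self.y_train = X[:split], y[:split]
        self.x_test, self.y_test = X[split:], y[split:]

    def get_data(self):
        """Return the prepared data without doing any further work."""
        return (self.x_train, self.y_train), (self.x_test, self.y_test)

    def preprocess(self, X, y):
        """Min-max scale every feature column to the range [0, 1]."""
        columns = list(zip(*X))
        lows = [min(c) for c in columns]
        spans = [(max(c) - min(c)) or 1.0 for c in columns]
        scaled = [[(v - lo) / span for v, lo, span in zip(row, lows, spans)]
                  for row in X]
        return scaled, y


# Usage: write a tiny CSV file, load it through the handler, then clean up.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("f1,f2,label\n0,10,0\n5,20,1\n10,30,0\n5,40,1\n")
handler = CsvDataHandler({"csv_file": tmp.name})
(x_train, y_train), (x_test, y_test) = handler.get_data()
os.remove(tmp.name)
```

All loading and scaling happens once in `load_and_preprocess_data()`, so `get_data()` stays as brief as the template recommends.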

Return a data generator defined by Keras or TensorFlow

To return a data generator that is defined by Keras or TensorFlow, include code like the following example in the get_data function:

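# The import path can differ between Keras and TensorFlow versions.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Configure the random augmentations that are applied to each training batch.
train_gen = ImageDataGenerator(rotation_range=8,
                               width_shift_range=0.08,
                               shear_range=0.3,
                               height_shift_range=0.08,
                               zoom_range=0.08)

# Wrap the prepared training arrays in a generator that yields batches of 64.
train_datagenerator = train_gen.flow(
    x_train, y_train, batch_size=64)

# Inside get_data(), return the generator instead of the raw arrays.
return train_datagenerator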
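
For readers unfamiliar with data generators, the following plain-Python sketch illustrates the general idea of yielding `(x_batch, y_batch)` pairs indefinitely. It is an illustration only and is not how Keras implements `flow`: the `batch_generator` function is hypothetical, and it performs no shuffling or image augmentation.

```python
import itertools


def batch_generator(x, y, batch_size):
    """Yield (x_batch, y_batch) pairs forever, restarting after each pass."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    while True:
        for start in range(0, len(x), batch_size):
            yield x[start:start + batch_size], y[start:start + batch_size]


# Five samples with a batch size of 2 give batches of 2, 2, and 1 per pass.
x_train = [[0.0], [0.1], [0.2], [0.3], [0.4]]
y_train = [0, 1, 0, 1, 0]
first_four = list(itertools.islice(batch_generator(x_train, y_train, 2), 4))
```

Because the generator never ends on its own, the consumer decides how many batches make up one epoch.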
