Customizing the data handler

Parties need to send data results in a consistent format, but each party might store its training data differently. For example, one party might keep its data in a database while another keeps its data in an Excel or CSV file. The data handler class accommodates such cases.

Cloud open beta

This is a Cloud open preview and is not supported for use in production environments.

Note: Each party needs to load its training data by implementing the following Python file on its remote server. Every participating party performs this step; the admin does not.

Before you begin

The data needs to be loaded and pre-processed. As part of the pre-processing step, each party needs to make sure that:

The feature order is the same for all parties.
All normalization and other pre-processing steps are performed by every party before learning begins. Implement these steps in the load_and_preprocess_data function, as shown in the following example.
You create the custom data handler for Federated Learning on your remote server, which needs an Anaconda environment set up and access to IBM Watson Machine Learning.
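
The first two requirements can be sketched with NumPy as follows. This is a minimal illustration and not part of the IBM Federated Learning API: the feature names, the agreed order, and the normalization statistics are hypothetical values that the parties would agree on beforehand.

```python
import numpy as np

# Hypothetical feature order that all parties agree on before training.
AGREED_FEATURE_ORDER = ["age", "income", "balance"]


def align_features(rows, local_columns, agreed_order=AGREED_FEATURE_ORDER):
    """Reorder the columns of `rows` so they follow the agreed feature order."""
    data = np.asarray(rows, dtype=float)
    # Position of each agreed feature in this party's local column layout.
    indices = [local_columns.index(name) for name in agreed_order]
    return data[:, indices]


def standardize(data, mean, std):
    """Apply the same pre-agreed standardization at every party."""
    return (data - np.asarray(mean)) / np.asarray(std)


# This party stores its columns in a different order than the agreed one.
local_columns = ["income", "balance", "age"]
rows = [[50000.0, 1200.0, 30.0],
        [64000.0, 300.0, 45.0]]

aligned = align_features(rows, local_columns)
scaled = standardize(aligned,
                     mean=[40.0, 60000.0, 1000.0],
                     std=[10.0, 10000.0, 500.0])
```

Using shared statistics rather than per-party statistics is one possible design choice; it keeps a given feature value on the same scale at every party.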

Example of Data Handler Class

Here is a general data handler class template that each party must customize.

# your import statements

from ibmfl.data.data_handler import DataHandler

class MyDataHandler(DataHandler):
    """
    Data handler for your dataset.
    """
    def __init__(self, data_config=None):
        super().__init__()
        self.file_name = None
        if data_config is not None:
            if '<your_data_file_name>' in data_config:
                self.file_name = data_config['<your_data_file_name>']
            # extract any additional parameters from `data_config`, if present.

        # load and preprocess the training and testing data
        self.load_and_preprocess_data()

        # Example:
        # (self.x_train, self.y_train), (self.x_test, self.y_test) = self.load_dataset()

    def load_and_preprocess_data(self):
        """
        Loads and pre-processes local datasets,
        and updates self.x_train, self.y_train, self.x_test, self.y_test.
 
        # Example:
        # return (self.x_train, self.y_train), (self.x_test, self.y_test)
        """

        pass
    
    def get_data(self):
        """
        Gets the prepared training and testing data.
        
        :return: ((x_train, y_train), (x_test, y_test)) # most built-in training modules expect data to be returned in this format
        :rtype: `tuple` 

        This function should be as brief as possible. Any pre-processing operations should be performed in a separate function and not inside get_data(), especially computationally expensive ones.

        # Example:
        # X, y = load_somedata()
        # x_train, x_test, y_train, y_test = \
        # train_test_split(X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE)
        # return (x_train, y_train), (x_test, y_test)
        """
        pass

    def preprocess(self, X, y):
        """
        Pre-processes the features X and the labels y.
        """
        pass

Where:
- __init__ function: Loads the dataset information from data_config. This argument is a dictionary that is received from the info field of the data section in the Federated Learning configuration file. In the example, the data file name is specified in the configuration file and passed to the data handler. Additional hyperparameters and other arguments can be passed from the configuration file to the data handler in the same way.
- get_data function: Must return the training data to the IBM Federated Learning framework. This function is called at each round, so keep it brief and limit it to returning data that was already prepared. For non-neural networks, get_data should return a numpy array that contains the data for training.
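
As one possible way to fill in the template, the following sketch reads a CSV file whose last column is the label. The class name CsvDataHandler and the configuration keys csv_file and test_fraction are hypothetical and chosen only for this illustration. The fallback DataHandler stub exists only so that the sketch runs where the ibmfl package is not installed.

```python
import csv
import os
import tempfile

import numpy as np

try:
    from ibmfl.data.data_handler import DataHandler
except ImportError:
    # Stand-in base class (assumption) so the sketch runs without ibmfl.
    class DataHandler:
        def __init__(self, **kwargs):
            pass


class CsvDataHandler(DataHandler):
    """Hypothetical handler: reads a CSV file whose last column is the label."""

    def __init__(self, data_config=None):
        super().__init__()
        self.file_name = None
        self.test_fraction = 0.25
        if data_config is not None:
            # 'csv_file' and 'test_fraction' are made-up keys for this sketch.
            self.file_name = data_config.get("csv_file")
            self.test_fraction = data_config.get("test_fraction", 0.25)
        self.load_and_preprocess_data()

    def load_and_preprocess_data(self):
        if self.file_name is None:
            raise ValueError("data_config must provide 'csv_file'")
        with open(self.file_name, newline="") as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row
            data = np.array([[float(v) for v in row] for row in reader])
        x, y = data[:, :-1], data[:, -1]
        # Hold out the last rows as the test split (no shuffling here).
        n_test = max(1, int(len(x) * self.test_fraction))
        self.x_train, self.y_train = x[:-n_test], y[:-n_test]
        self.x_test, self.y_test = x[-n_test:], y[-n_test:]

    def get_data(self):
        # Only returns data prepared earlier, so each round stays cheap.
        return (self.x_train, self.y_train), (self.x_test, self.y_test)


# Usage: write a small demo file, then build the handler from a dictionary
# shaped like the `info` field described above.
path = os.path.join(tempfile.mkdtemp(), "party_data.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(
        [["f1", "f2", "label"], [1, 2, 0], [3, 4, 1], [5, 6, 0], [7, 8, 1]]
    )

handler = CsvDataHandler(data_config={"csv_file": path, "test_fraction": 0.25})
(x_train, y_train), (x_test, y_test) = handler.get_data()
```

All loading and splitting happens once in load_and_preprocess_data, so get_data stays a plain accessor.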

To see an example of a data handler for the MNIST dataset, see here.

Returning data for neural networks

There are two ways to return data for frameworks that involve neural networks.

Return a data generator, where the generator is defined by Keras or TensorFlow 2 (this is useful when the amount of data in the party is large). For a code example, see here.
Return data by using numpy arrays. See here for an example of a data handler that uses numpy arrays to return data.
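
To illustrate the idea behind the generator option, here is a framework-agnostic sketch that yields batches from numpy arrays instead of returning the arrays whole. The function batch_generator is hypothetical. The source specifies a generator defined by Keras or TensorFlow 2, so a real data handler would wrap the same batching logic in one of those; whether the framework accepts a plain Python generator such as this one is not confirmed here.

```python
import numpy as np


def batch_generator(x, y, batch_size=32, shuffle=True, seed=0):
    """Yield (features, labels) batches forever, one epoch after another."""
    rng = np.random.default_rng(seed)
    n = len(x)
    while True:
        # Draw a new sample order at the start of every epoch.
        order = rng.permutation(n) if shuffle else np.arange(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            yield x[idx], y[idx]


# Usage with a tiny in-memory dataset of 10 samples and 2 features.
x = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10)
gen = batch_generator(x, y, batch_size=4, shuffle=False)
first_x, first_y = next(gen)
```

Because only one batch is materialized at a time, the same pattern extends to data that is read from disk in pieces rather than held in memory.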

Next steps

Proceed to Registering the parties for details on how to configure the yml file.