Each party in a Federated Learning experiment must have a data handler to process its data. You or a data scientist must create the data handler. A data handler is a Python class that loads and transforms data so that all data for the experiment is in a consistent format.
About the data handler class
The data handler performs the following functions:
Accesses the data that is required to train the model. For example, reads data from a CSV file into a Pandas data frame.
Pre-processes the data so that it is in a consistent format across all parties. Some example cases are as follows:
The Date column might be stored as a time epoch or timestamp.
The Country column might be encoded or abbreviated.
The data handler ensures that these formats agree across all parties.
Optional: performs feature engineering as needed.
The following illustration shows how a data handler is used to process data and make it consumable by the experiment:
One party might have multiple tables in a relational database while another party uses a CSV file. After the data is processed with the data handler, the data from all parties has a unified format. For example, all data is placed into a single table, with data that was previously in separate tables joined together.
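As a sketch of this unification step (the table and column names here are hypothetical), one party might join its relational tables while another simply reads a CSV file, with both handlers emitting the same column layout:

```python
import io

import pandas as pd

# Party A: data split across two relational tables, joined into one frame.
customers = pd.DataFrame({"customer_id": [1, 2], "country": ["US", "DE"]})
purchases = pd.DataFrame({"customer_id": [1, 2], "amount": [10.0, 20.0]})
party_a = customers.merge(purchases, on="customer_id")

# Party B: the same kind of information stored in a single CSV file.
csv_data = io.StringIO("customer_id,country,amount\n3,FR,30.0\n4,US,40.0\n")
party_b = pd.read_csv(csv_data)

# After each party's data handler runs, both frames share one schema,
# so the experiment can consume them in a consistent format.
assert list(party_a.columns) == list(party_b.columns)
```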
Data handler template
A general data handler template is as follows:
# your import statements
from ibmfl.data.data_handler import DataHandler

class MyDataHandler(DataHandler):
    """
    Data handler for your dataset.
    """

    def __init__(self, data_config=None):
        super().__init__()
        self.file_name = None
        if data_config is not None:
            # This can be any string field.
            # For example, if your data set is in `csv` format,
            # <your_data_file_type> can be "CSV", ".csv", "csv", "csv_file" and more.
            if '<your_data_file_type>' in data_config:
                self.file_name = data_config['<your_data_file_type>']
            # extract other additional parameters from `info` if any.

        # load and preprocess the training and testing data
        self.load_and_preprocess_data()

    def load_and_preprocess_data(self):
        """
        Loads and pre-processes local datasets,
        and updates self.x_train, self.y_train, self.x_test, self.y_test.

        # Example:
        # (self.x_train, self.y_train), (self.x_test, self.y_test) = self.load_dataset()
        """
        pass

    def get_data(self):
        """
        Gets the prepared training and testing data.

        :return: ((x_train, y_train), (x_test, y_test)) # most built-in training modules expect data to be returned in this format
        :rtype: `tuple`

        Keep this function as brief as possible. Perform any pre-processing
        operations, especially computationally expensive ones, in a separate
        function rather than inside get_data().

        # Example:
        # X, y = load_somedata()
        # x_train, x_test, y_train, y_test = \
        #     train_test_split(X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE)
        # return (x_train, y_train), (x_test, y_test)
        """
        pass

    def preprocess(self, X, y):
        pass
Parameters
your_data_file_type: Any string field that identifies the data file entry in the configuration. For example, if your data set is in csv format, your_data_file_type can be "CSV", ".csv", "csv", "csv_file", and more.
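As a minimal, hypothetical sketch of filling in this template for a CSV dataset whose last column is the label: a stand-in DataHandler base class and a 'csv_file' config key are assumptions made here so the example runs without ibmfl installed; in a real experiment you would subclass ibmfl.data.data_handler.DataHandler instead.

```python
import io

import numpy as np
import pandas as pd


class DataHandler:
    """Stand-in for ibmfl.data.data_handler.DataHandler (assumption for this sketch)."""


class CsvDataHandler(DataHandler):
    """Loads a CSV whose last column is the label, then splits train/test."""

    def __init__(self, data_config=None):
        super().__init__()
        self.file_name = None
        if data_config is not None and 'csv_file' in data_config:
            self.file_name = data_config['csv_file']
        self.load_and_preprocess_data()

    def load_and_preprocess_data(self):
        # All heavy lifting happens here, not in get_data().
        df = pd.read_csv(self.file_name)
        X = df.iloc[:, :-1].to_numpy()
        y = df.iloc[:, -1].to_numpy()
        # Simple 75/25 split; a real handler might shuffle or stratify.
        split = int(0.75 * len(df))
        self.x_train, self.y_train = X[:split], y[:split]
        self.x_test, self.y_test = X[split:], y[split:]

    def get_data(self):
        # Brief by design: just return the already-prepared data.
        return (self.x_train, self.y_train), (self.x_test, self.y_test)


# Usage with an in-memory CSV buffer (a file path works the same way):
csv = io.StringIO("f1,f2,label\n1,2,0\n3,4,1\n5,6,0\n7,8,1\n")
handler = CsvDataHandler(data_config={'csv_file': csv})
(x_tr, y_tr), (x_te, y_te) = handler.get_data()
```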
Return a data generator defined by Keras or Tensorflow
The following is a code example that needs to be included as part of the get_data function to return a data generator defined by Keras or TensorFlow:
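The official example does not appear in this excerpt. As a hedged sketch under assumed API usage, a plain Python generator that yields (x_batch, y_batch) NumPy tuples is one form that Keras Model.fit accepts directly and that tf.data.Dataset.from_generator can also wrap:

```python
import numpy as np


def batch_generator(x, y, batch_size=32):
    """Yield (x_batch, y_batch) tuples indefinitely, reshuffling each epoch.

    A handler's get_data() could return this generator instead of raw arrays;
    wrapping it with tf.data.Dataset.from_generator is an alternative
    (assumed API names, not taken from this document).
    """
    n = len(x)
    while True:
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            yield x[batch], y[batch]


# Usage: draw one batch from a toy dataset of 10 samples with 2 features.
x = np.arange(20, dtype=np.float32).reshape(10, 2)
y = np.arange(10)
gen = batch_generator(x, y, batch_size=4)
xb, yb = next(gen)
```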