Joining data sources
When you specify the data for an AutoAI experiment, you can choose to combine two or more data sources that share a common column, or key. You are creating a new data table by combining the data based on the specified join keys.
Attention: The AutoAI experiment feature for joining multiple data sources to create a single training data set is deprecated. Support for joining data in an AutoAI experiment will be removed on Dec 7, 2022. After Dec 7, 2022, AutoAI experiments with joined data and deployments of resulting models will no longer run. To join multiple data sources, use a data preparation tool such as Data Refinery or DataStage to join and prepare data, then use the resulting data set for training an AutoAI experiment. Redeploy the resulting model.
Notes about joining data sources
- Each data source must be a CSV file.
- You can join up to 20 files, with each file less than 4 GB and a combined maximum of 20 GB.
- If the total size of the joined data is more than 1 GB, a sample size of 1 GB is used to train the model.
- The max depth of connections is three. For example, the main source (A) can be connected to source B, which is connected to source C, which in turn is connected to source D. Source D cannot be connected to another source.
- The type of join created is a left join, which returns all records from the left table, and the matching records from the right table.
- Each join must have at least one join key, or common column, specified. If no key is specified, the join is ignored when the experiment runs.
- If you configure more than one join, AutoAI determines the best order for running the joins.
- After you run the experiment, you can download the joined data to review the schema and to see the new columns added as a result of feature engineering.
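The size and depth limits in the notes above can be checked before you upload files. The following is a minimal sketch in plain Python and is not part of AutoAI. The function names `check_sizes` and `join_depth`, and the representation of joins as (parent, child) pairs, are assumptions made for illustration. The sketch also assumes that 1 GB means 1024**3 bytes, which the notes do not specify.

```python
GB = 1024 ** 3  # assumption: binary gigabytes; the notes do not state the unit

MAX_FILES = 20
MAX_FILE_BYTES = 4 * GB
MAX_TOTAL_BYTES = 20 * GB
MAX_DEPTH = 3


def check_sizes(sizes):
    """Return a list of problems for a mapping of file name -> size in bytes."""
    problems = []
    if len(sizes) > MAX_FILES:
        problems.append(f"{len(sizes)} files exceeds the limit of {MAX_FILES}")
    for name, size in sizes.items():
        if not name.lower().endswith(".csv"):
            problems.append(f"{name} is not a CSV file")
        if size >= MAX_FILE_BYTES:
            problems.append(f"{name} is not smaller than 4 GB")
    if sum(sizes.values()) > MAX_TOTAL_BYTES:
        problems.append("combined size exceeds 20 GB")
    return problems


def join_depth(main, joins):
    """Return the longest chain of joins that starts at the main source.

    `joins` is a list of (parent, child) pairs, for example ("A", "B").
    Assumes the joins form a tree (no cycles).
    """
    children = {}
    for parent, child in joins:
        children.setdefault(parent, []).append(child)

    def depth(node):
        return max((1 + depth(c) for c in children.get(node, [])), default=0)

    return depth(main)


# The chain from the notes: A -> B -> C -> D has depth 3, the maximum allowed.
joins = [("A", "B"), ("B", "C"), ("C", "D")]
assert join_depth("A", joins) <= MAX_DEPTH
```

Connecting a fifth source to D would raise the depth to four, which the notes rule out.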
Joining data sources
1. Specify a name and description for your experiment.
2. Select a machine learning service instance and a compute configuration and click Create.
3. Choose two or more data files from your project, upload them from your file system, or select them from the asset browser, then click Continue. Tip: Click the Preview icon to review your data.
4. When you are done loading data sources, start the configuration process by selecting one of the sources as the main source for the data join.
5. Next, click Configure joins to open the canvas for connecting your data sources.
6. To create a join, hover over either end of the main source, and drag a connection to another source. A join displays.
7. Click the join icon to open the pane for specifying the key. A key is a common field that can connect the data sources. AutoAI identifies and suggests common fields.
8. Choose the key to complete the join.
9. Repeat steps 6 - 8 to create more joins and keys to connect data sources.
10. When you are done, click Save join to return to the experiment configuration, choose a column to predict, and run your experiment.
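To preview which columns could serve as a key before you open the join canvas, you can compare the column names of two files yourself. The following is a minimal sketch using pandas. It is not how AutoAI suggests keys; the helper `suggest_join_keys` and the sample columns other than `group_customer_id` are hypothetical.

```python
import pandas as pd


def suggest_join_keys(left: pd.DataFrame, right: pd.DataFrame) -> list[str]:
    """Return column names that appear in both tables with the same dtype."""
    return [
        col
        for col in left.columns
        if col in right.columns and left[col].dtype == right[col].dtype
    ]


# Hypothetical in-memory data; in practice you would load each CSV with pd.read_csv.
main = pd.DataFrame({"group_customer_id": [1, 2, 3], "purchase_total": [20.0, 35.5, 12.0]})
customers = pd.DataFrame({"group_customer_id": [1, 2], "age": [34, 51]})

candidate_keys = suggest_join_keys(main, customers)
print(candidate_keys)
```

A shared name and type only make a column a candidate. Whether it is a meaningful key still depends on the data.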
These examples show how to create a single join and multiple joins.
In this example, two data sources are uploaded: group_customer_main.csv and group_customer_customers.csv. The file group_customer_main.csv is designated as the main source for the data join.
The key for the join is the column group_customer_id. Tip: Use the Schema preview tab in the Join panel to view the column names to help you select a key.
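For readers who want to see what a left join on this key produces, here is a minimal pandas sketch. It illustrates left-join semantics only and is not AutoAI's implementation. The rows and the columns `purchase_total` and `age` are made up for illustration; only the key name `group_customer_id` comes from the example.

```python
import pandas as pd

# Hypothetical stand-ins for group_customer_main.csv and group_customer_customers.csv.
main = pd.DataFrame(
    {"group_customer_id": [1, 2, 3], "purchase_total": [20.0, 35.5, 12.0]}
)
customers = pd.DataFrame({"group_customer_id": [1, 2], "age": [34, 51]})

# A left join keeps every row of the main (left) table and adds the matching
# columns from the right table; rows without a match get missing values.
joined = main.merge(customers, on="group_customer_id", how="left")
print(joined)
```

Customer 3 has no match in the right table, so its `age` value is missing in the joined result while the row itself is kept.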
In this configuration, five tables are joined with four joins, as follows:
|Main source|Joined source|Key|
When you complete the joins, click Save join. AutoAI configures the join.
View and edit join settings
From the create experiment page, click Experiment settings to view and edit these settings for the data join.
Stratified sampling limit
Stratified sampling sorts data into subgroups, or strata, for a more accurate representation of your joined data sources. Optionally increase or decrease the number of rows to include in each stratum.
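The idea of a per-stratum row limit can be sketched with pandas as follows. This is an illustration under assumptions, not AutoAI's sampling code: the helper `stratified_sample`, its parameters, and the `segment` column are hypothetical.

```python
import pandas as pd


def stratified_sample(df, strata_col, rows_per_stratum, seed=0):
    """Keep at most `rows_per_stratum` randomly chosen rows from each stratum."""
    shuffled = df.sample(frac=1, random_state=seed)  # shuffle all rows
    return shuffled.groupby(strata_col).head(rows_per_stratum)


# Hypothetical data: stratum "a" has five rows, stratum "b" has two.
data = pd.DataFrame({"segment": list("aaaaabb"), "value": range(7)})

sample = stratified_sample(data, "segment", rows_per_stratum=3)
print(sample["segment"].value_counts())
```

Strata smaller than the limit are kept whole, so a small subgroup is not lost when a large one is cut down.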
Timestamps are used by AutoAI to extract time-related features. If your data set includes a date/time column and you enable the timestamp threshold, the join result includes only the data from rows before the timestamp threshold, to avoid data leakage.
To establish a threshold, enable the option, then choose the timestamp column and choose the type of date/time data it contains.
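As a rough illustration of the threshold idea, the pandas sketch below keeps only rows whose timestamp is earlier than a cutoff. It is one possible reading of the setting, not AutoAI's implementation; the `event_time` and `amount` columns and the cutoff date are hypothetical.

```python
import pandas as pd

# Hypothetical joined-source rows with a date/time column stored as text.
events = pd.DataFrame(
    {
        "group_customer_id": [1, 1, 2],
        "event_time": ["2021-01-05", "2021-03-10", "2021-02-01"],
        "amount": [10, 20, 30],
    }
)

# Parse the column so that comparisons are made on dates, not strings.
events["event_time"] = pd.to_datetime(events["event_time"])

# Hypothetical cutoff: rows at or after this time are excluded so that the
# model cannot learn from information recorded after the threshold.
threshold = pd.Timestamp("2021-03-01")
before_threshold = events[events["event_time"] < threshold]
print(before_threshold)
```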
Feature selectors are options that help to exclude irrelevant data and improve experiment run time. They include:
- Deduplication (enabled by default). Removes duplicated features.
- Inconsistency (enabled by default). Removes features with inconsistent distribution between random splits.
- Filter (disabled by default). Removes low correlation data for regression problems, or low information gains for classification problems.
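Two of these selectors can be approximated in pandas to show what they remove. The sketch below covers deduplication and a correlation filter for a regression target only; it does not reproduce the inconsistency selector or the information-gain filter, and it is not AutoAI's code. The helper names, the `min_abs_corr` threshold, and the sample columns are hypothetical.

```python
import pandas as pd


def drop_duplicate_features(df):
    """Drop columns whose values repeat an earlier column exactly."""
    return df.loc[:, ~df.T.duplicated()]


def filter_low_correlation(features, target, min_abs_corr=0.1):
    """Keep features whose absolute correlation with the target meets the threshold."""
    corr = features.corrwith(target).abs()
    return features.loc[:, corr >= min_abs_corr]


# Hypothetical data: x_copy duplicates x, and noise is uncorrelated with y.
data = pd.DataFrame(
    {
        "x": [1, 2, 3, 4],
        "x_copy": [1, 2, 3, 4],
        "noise": [1, -1, -1, 1],
        "y": [2, 4, 6, 8],
    }
)

deduplicated = drop_duplicate_features(data)
selected = filter_low_correlation(deduplicated.drop(columns="y"), deduplicated["y"])
print(list(selected.columns))
```

Fewer columns mean less work in later pipeline stages, which is how these options can shorten experiment run time.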
Run the experiment
Choose a prediction column and run the experiment. In addition to the infographic for viewing the creation of the pipelines, there is also an infographic and panel for examining the join.
Hover over a join path to view the join keys and the transformations applied to create the join.
From the experiment results page you can also download the joined data to review the schema and see the feature engineering columns.
After you review the results of your experiment, use your experiment to generate predictions.
- Save the best pipeline as a model.
- Promote the model to a deployment space.
- Promote or add the data sets you will use to test the model to the space. Note that you must have an input data source that corresponds to each of the training data sources you used to create the experiment.
- Deploy the model.
- Create a batch job, specifying the data sources for input and a single output location.
- Run the job.
- Review the results.
For an example of deploying an AutoAI experiment with joined data, see Tutorial: Build and deploy a data join experiment.
You can also save the entire experiment as a notebook so that you can review the transformations used to generate and rank the pipelines. Note that you cannot save an individual pipeline as a notebook. For details, see Saving an AutoAI generated notebook.
For more information about performing automated feature engineering on relational data, see this blog post.