Tutorial: AutoAI Data Join Experiment

Learn how to join several data sources that are related to a fictional outdoor store named Go, then build an experiment that uses the combined data to train a machine learning model. You can then deploy the resulting model and use it to predict daily sales for each product that Go sells.

Attention: The AutoAI experiment feature for joining multiple data sources to create a single training data set is deprecated. Support for joining data in an AutoAI experiment will be removed on Dec 7, 2022. After Dec 7, 2022, AutoAI experiments with joined data and deployments of resulting models will no longer run. To join multiple data sources, use a data preparation tool such as Data Refinery or DataStage to join and prepare data, then use the resulting data set for training an AutoAI experiment. Redeploy the resulting model.

Joining data also allows for a specialized set of feature transformations and advanced data aggregators. After building the pipelines, you can explore the factors that produced each pipeline.

Watch this video to see a preview of the tutorial steps. The video provides a visual alternative to following the written steps in this documentation and shows you how to join several data sources in an AutoAI experiment.

Data sets Overview

This figure shows the relationship between the data. You use the data join canvas to create the data connections that are required to combine the data for the experiment.

An image of tables connected by foreign keys

The data that you will join contains the following information:

  • Daily_sale: The GO company has many retailers selling its outdoor products. The daily sale table is a time series of sale records, where the DATE and QUANTITY columns indicate the sale date and the sale quantity for each product in a retail store.
  • Products: This table keeps product information, such as product types and product names.
  • Retailers: This table keeps retailer information, such as retailer names and addresses.
  • Methods: This table keeps order methods, such as Via Telephone, Online, or Email.
  • Go: The GO company is interested in using this data to predict its daily sales for every product in its retail stores. The prediction target column is QUANTITY in the go table, and the DATE column indicates the cutoff time when the prediction is made.

Tasks Overview

This tutorial presents the basic steps for building and training a machine learning model by using AutoAI:

  1. Create a Watson Studio project
  2. Create an AutoAI experiment
  3. Configure the experiment
  4. Train the experiment
  5. Deploy the trained model
  6. Create a batch job to score the model

Create a Watson Studio project

  1. From the Gallery, sign in with your IBM account, download the Go Sample data set file to your computer, and extract the CSV files.
  2. From the Projects page, select New Project to create a new project.
    • Select Create an empty project.
    • Enter a name for your project.
    • Click Create.

Create an AutoAI experiment

  1. In the Assets tab from within your project, click New asset and choose AutoAI.
  2. Specify a name and optional description for your new experiment, then select Create.
  3. Select the Associate a Machine Learning service instance link to associate the IBM Watson Machine Learning service instance with your project. Click Reload to confirm your configuration.
  4. To add a data source, you can choose one of the following:
    a. If you downloaded the files locally, upload the five CSV files in the Go Sample data set from your local computer by dragging the files onto the data pane or by clicking browse and then following the prompts.
    b. If you already uploaded the files to your project, click select from project, then select the data asset tab and add the five tables from the Go Sample data set.

Configure the experiment

Step 1: Choose the main data source

  1. Choose go_1k.csv as the main source (the table with the prediction target column).
  2. Click Configure Join to open the data join canvas.
    Joining configuration of the five tables

Step 2: Connect the data tables

To connect data tables, drag from the plus button on the end of one source to the source you want to connect. For each connection, you are prompted to specify a key, which is the common column. You can choose from suggested keys or specify the keys manually.

  1. Starting from the go_1k.csv table, drag the node to the go_products.csv table to create a connection.
    Dragging the node of the main source to the go_products table.

  2. In the pane for configuring the join, click (+) to add the suggested key Product Number as the join key.
    Adding Product Number as keys

  3. Click Done to complete the join.

  4. Using the details in this table, repeat steps 1-3 to create the remaining joins:

| Main source | Joined source  | Key                              |
|-------------|----------------|----------------------------------|
| go_1k       | go_retailer    | Retailer code                    |
| go_1k       | go_daily_sales | Product number, Retailer code    |
| go_1k       | go_methods     | Order method code                |

Your canvas looks like this when you complete the data joins:
Five tables after they are joined

Click Done and Save Join to finish the data join.
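Because the data join feature is deprecated, the same joins can be performed in a data preparation step instead. The following hedged sketch reproduces the canvas joins with pandas left merges; the tiny in-memory tables and their column names are assumptions that stand in for the real CSV files (go_1k.csv, go_products.csv, and so on), not the documented schema.

```python
import pandas as pd

# Toy stand-ins for the Go sample tables. Column names are assumed
# from the join keys in the table above.
go = pd.DataFrame({
    "Product number": [1, 2],
    "Retailer code": [10, 11],
    "Order method code": [100, 101],
    "Date": ["01/01/2018", "02/01/2018"],
    "Quantity": [5, 7],
})
products = pd.DataFrame({"Product number": [1, 2],
                         "Product type": ["Tent", "Lamp"]})
retailers = pd.DataFrame({"Retailer code": [10, 11],
                          "Retailer name": ["A", "B"]})
methods = pd.DataFrame({"Order method code": [100, 101],
                        "Order method": ["Online", "Email"]})

# Left joins from the main table mirror the canvas joins; a join to a
# daily-sales table would use both Product number and Retailer code.
joined = (
    go.merge(products, on="Product number", how="left")
      .merge(retailers, on="Retailer code", how="left")
      .merge(methods, on="Order method code", how="left")
)
```

Left joins keep every row of the main table, so the prediction target column QUANTITY is preserved even when a lookup table has no matching row.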

Train the experiment

To train the model, you choose a prediction column in the main source and use the combined data source to train the model to create the prediction.

  1. In Configuration details, select No for the option to create a Time Series Forecast. Choose Quantity as the column to predict. AutoAI analyzes your data and determines that the Quantity column contains a wide range of numeric information, making this data suitable for a regression model. The default metric for a regression model is Root mean squared error (RMSE).

    Your AutoAI configuration looks like this before you run the experiment.
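Root mean squared error, the default regression metric mentioned above, can be computed directly. This minimal pure-Python sketch shows the definition that the metric reports; the sample quantities are illustrative only.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error: the square root of the mean squared residual."""
    residuals = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(residuals) / len(residuals))

# Example: actual daily quantities vs. model predictions.
score = rmse([5, 7, 9], [4, 8, 9])  # sqrt((1 + 1 + 0) / 3)
```

Lower RMSE is better; a perfect model scores 0, and the metric is in the same units as the Quantity column.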

Step 1: Configure a timestamp threshold

For this tutorial, you specify a time threshold to limit the training data to a period of time. Setting a timestamp enables AutoAI to use time information to extract timeseries-related features. Data that is collected outside the prediction time cutoff is ignored during the feature engineering process.

  1. Click Experiment settings.

  2. On the Data Source page, click the Join tab.

  3. Toggle Enable the timestamp threshold.

    Enable timestamp threshold in the data source page in experiment settings.

  4. In the main data table, go_1k.csv, choose Date as the cutoff timestamp column. For Time type, select Time – Custom and enter dd/MM/yyyy as the date format. No data after the date in the cutoff column is considered for training the pipelines.

  5. In the data table go_daily_sales.csv, choose Date as a timestamp column so that AutoAI can enhance the set of features with time series-related features. For Time type, select Time – Custom and enter dd/MM/yyyy as the date format.

Note: The date format must exactly match the format in the data source or you get an error when you run the experiment.
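The strict-format behavior and the cutoff logic can be illustrated with Python's standard datetime module. In strptime notation, dd/MM/yyyy corresponds to %d/%m/%Y; the sample rows and cutoff date below are assumptions for illustration.

```python
from datetime import datetime

FMT = "%d/%m/%Y"  # dd/MM/yyyy, matching the format entered in the experiment

def parse_go_date(text):
    # Raises ValueError on any text that does not exactly match dd/MM/yyyy,
    # analogous to the experiment failing on a date-format mismatch.
    return datetime.strptime(text, FMT)

cutoff = parse_go_date("15/06/2018")
rows = [("01/06/2018", 5), ("20/06/2018", 7)]
# Keep only records on or before the cutoff, as the timestamp threshold does.
train = [(d, q) for d, q in rows if parse_go_date(d) <= cutoff]
```

A row dated 2018-06-20 would be rejected outright by the parser, because the format differs even though the date itself is valid.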

Step 2: Specify the runtime settings

After defining the experiment, you can allocate the resources for training the pipelines.

  1. Click Runtime to switch to the Runtime tab.

  2. Increase the number of executors to 10.

  3. Click Save settings to save the configuration changes.

Step 3: Run experiment

  1. Click Run experiment to start training the experiment and generating the pipelines. Click nodes in the infographic to explore how pipelines were created.

    Experiment summary generating pipelines

  2. After all the pipelines are created, you can compare their accuracy on the Pipeline leaderboard.

    Ranked pipeline leaderboard based on accuracy

  3. You can then click Pipeline comparison to see how they differ. For example:

    Metric chart of pipeline comparison

  4. From the action menu for the pipeline with rank 1, select Save as to create your model, then select Create. This saves the pipeline under the Models section in the Assets tab.

Deploy the trained model

  1. You can deploy the model from the model details page. You can access the model details page in one of these ways:
    • Click the model's name in the notification that is displayed when you save the model.
    • Open the Assets tab for the project, select the Models section, and click the model's name.
  2. Click Promote to Deployment Space then select or create the space where the model will be deployed.
    • To create a deployment space:
      • Enter a name
      • Associate it with a Machine Learning Service
      • Select Create.
  3. After you create your deployment space or select an existing one, select Promote.
  4. Click the deployment space link from the notification.
  5. From the Assets tab of the deployment space, hover over the model's name and click the deploy icon.
  6. In the page that opens, complete the fields:
    • Select Batch as the Deployment type.
    • Specify a name for the deployment.
    • Click Create.

After the deployment is complete, click the Deployments tab and select the deployment name to view the details page.

Create a batch job to score the model

Step 1: Upload files

For this tutorial, you submit the training files as the scoring files as a way to demonstrate the process and view results.

From the Assets page of the deployment space:

  1. Click Add to space, then choose Data.
  2. Upload the following files that you saved locally:

  • go_1k.csv
  • go_retailers.csv
  • go_methods.csv
  • go_daily_sales.csv
  • go_products.csv

Step 2: Create the batch job

The batch job runs the deployment. To create the job, you specify the input data and the name for the output file. You can set up a job to run on a schedule, or run it immediately.

  1. Click New job.
  2. Specify a name for the job.
  3. Select the smallest hardware specification.
  4. Optional: Set a schedule and choose whether to receive notifications.
  5. Select Data Source and specify the following input files:
    • go_1k.csv
    • go_retailers.csv
    • go_methods.csv
    • go_daily_sales.csv
    • go_products.csv
  6. Name the output file go-output.csv, then click Next.

    Selected data sources in batch job

  7. Review and click Create to run the job.

Step 3: View the output

When the deployment status changes to Deployed, return to the Assets page for the deployment space. You see that the file go-output.csv was created and added to your assets list.

Download the output file and open it in an editor. You can review the prediction results for the data that was submitted for batch processing.

Watch this short video to see a different use case of a call center analysis for a mobile company.

This video provides a visual method as an alternative to following the written steps in this documentation.

Learn more

AutoAI overview

Parent topic: AutoAI
