Batch deployment details

You can create a batch deployment using any of these interfaces:

  • Watson Studio user interface, from an Analytics deployment space
  • Watson Machine Learning Python Client
  • Watson Machine Learning REST APIs

Data sources

The input data sources for a batch deployment job differ by framework. Input data can be supplied to a batch job as:

  • Inline data - In this method, the input data for batch processing is specified in the batch deployment job’s payload, for example, as the value of the scoring.input_data parameter. When the batch deployment job completes, the output is written to the corresponding job’s scoring.predictions metadata parameter.
  • Data reference - In this method, the input and output data for batch processing are stored in a remote data source, such as a Cloud Object Storage bucket or an SQL or NoSQL database, or as a local or managed data asset in a deployment space. Details for data references include the following (a payload sketch follows this list):

    • input_data_references.type and output_data_reference.type must be data_asset

    • The references to input data must be specified as a /v2/assets href in the input_data_references.location.href parameter in the deployment job’s payload. The data asset specified here can be a reference to a local or connected data asset.

    • If the batch deployment job’s output data has to be persisted in a remote data source, the references to output data must be specified as a /v2/assets href in output_data_reference.location.href parameter in the deployment job’s payload.

    • If the batch deployment job’s output data has to be persisted in a deployment space as a local asset, output_data_reference.location.name must be specified. Once the batch deployment job is completed successfully, the asset with the specified name will be created in the space.

    • If the output data reference points to a data asset in a remote database, you can specify whether the batch output is appended to the table or whether the table is truncated before the output data is inserted. Use the output_data_references.location.write_mode parameter with the value truncate or append. Note the following:

      • Specifying truncate truncates the table and inserts the batch output data.
      • Specifying append appends the batch output data to the remote database table.
      • write_mode is applicable only to the output_data_references parameter.
      • write_mode is applicable only to remote database data assets. It does not apply to a local data asset or a COS-based data asset.
    • Any input and output data asset references must be in the same space id as the batch deployment.

    • If the connected data asset references a Cloud Object Storage instance as the source, for example, a file in a Cloud Object Storage bucket, you must supply the HMAC credentials for the COS bucket. Include an access key and a secret key in your IBM Cloud Object Storage connection to enable access to the stored files.
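
The sketch below shows both approaches with the Watson Machine Learning Python client. It is a minimal sketch, assuming the ibm_watson_machine_learning package; the credentials, asset IDs, field names, and row values are placeholders rather than values from a real deployment.

    from ibm_watson_machine_learning import APIClient

    # Placeholder credentials and space ID -- replace with your own values.
    wml_credentials = {"url": "https://us-south.ml.cloud.ibm.com", "apikey": "<api_key>"}
    client = APIClient(wml_credentials)
    client.set.default_space("<space_id>")

    # Inline data: rows are passed in scoring.input_data; when the job completes,
    # the predictions are written to the job's scoring.predictions metadata.
    inline_job_payload = {
        client.deployments.ScoringMetaNames.INPUT_DATA: [{
            "fields": ["AGE", "BP", "CHOLESTEROL"],   # hypothetical columns
            "values": [[23, "HIGH", "NORMAL"]]        # hypothetical row
        }]
    }

    # Data references: input is read from a data asset in the space; output is
    # written back to a connected database asset, truncating the table first.
    reference_job_payload = {
        client.deployments.ScoringMetaNames.INPUT_DATA_REFERENCES: [{
            "type": "data_asset",
            "connection": {},
            "location": {"href": "/v2/assets/<input_asset_id>?space_id=<space_id>"}
        }],
        client.deployments.ScoringMetaNames.OUTPUT_DATA_REFERENCE: {
            "type": "data_asset",
            "connection": {},
            "location": {
                "href": "/v2/assets/<output_asset_id>?space_id=<space_id>",
                "write_mode": "truncate"   # or "append"; remote database assets only
            }
        }
    }

    job = client.deployments.create_job("<deployment_id>", meta_props=reference_job_payload)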

Using data from a Cloud Object Storage connection

  1. Create a connection to IBM Cloud Object Storage by adding a connection to your project or space and selecting Cloud Object Storage (infrastructure) as the connection type. Provide the secret key, access key, and login URL.
  2. Add input and output files to the deployment space as connected data using the COS connection you created. (A Python client sketch of step 1 follows.)
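
If you prefer to create the Cloud Object Storage connection programmatically, the following is a hedged sketch with the Watson Machine Learning Python client; the data source type name and the HMAC property keys shown here are assumptions to verify against your client version.

    from ibm_watson_machine_learning import APIClient

    # Placeholder credentials and space ID -- replace with your own values.
    wml_credentials = {"url": "https://us-south.ml.cloud.ibm.com", "apikey": "<api_key>"}
    client = APIClient(wml_credentials)
    client.set.default_space("<space_id>")

    # Look up the data source type ID for Cloud Object Storage.
    # The type name below is an assumption; confirm it with
    # client.connections.list_datasource_types().
    cos_type_id = client.connections.get_datasource_type_uid_by_name("bluemixcloudobjectstorage")

    # Create the connection with HMAC credentials (assumed property keys).
    connection_details = client.connections.create({
        client.connections.ConfigurationMetaNames.NAME: "COS batch connection",
        client.connections.ConfigurationMetaNames.DATASOURCE_TYPE: cos_type_id,
        client.connections.ConfigurationMetaNames.PROPERTIES: {
            "bucket": "<bucket_name>",
            "access_key": "<access_key>",
            "secret_key": "<secret_key>",
            "url": "<cos_endpoint_url>"
        }
    })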

Specifying the compute requirements for the batch deployment job

The compute configuration for a batch deployment refers to the CPU and memory size allocated for a job. This information must be specified in the hardware_spec API parameter of either of these:

  • deployments payload
  • deployment jobs payload

In the case of a batch deployment of an AutoAI model, the compute configuration must be specified in the hybrid_pipeline_hardware_specs parameter instead of hardware_spec.

The compute configuration must be a reference to a predefined hardware specification. You can specify a hardware specification by name or ID with hardware_spec or, for AutoAI, hybrid_pipeline_hardware_specs. The list of predefined hardware specifications and their details can be accessed through the Watson Machine Learning Python client or the Watson Machine Learning REST APIs.
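
For example, a minimal Python client sketch of listing the predefined hardware specifications and referencing one by name in a batch deployment; the model ID and size name are placeholders, and the AutoAI variant shown in the comment is an assumption to verify against your client version.

    from ibm_watson_machine_learning import APIClient

    # Placeholder credentials and IDs -- replace with your own values.
    wml_credentials = {"url": "https://us-south.ml.cloud.ibm.com", "apikey": "<api_key>"}
    client = APIClient(wml_credentials)
    client.set.default_space("<space_id>")

    # List the predefined hardware specifications (name, ID, description).
    client.hardware_specifications.list()

    # Reference a predefined specification by name in the deployment payload.
    deployment = client.deployments.create("<model_id>", meta_props={
        client.deployments.ConfigurationMetaNames.NAME: "my batch deployment",
        client.deployments.ConfigurationMetaNames.BATCH: {},
        client.deployments.ConfigurationMetaNames.HARDWARE_SPEC: {"name": "S"}
    })

    # For an AutoAI model, the setting goes in hybrid_pipeline_hardware_specs
    # instead, for example (meta name and node_runtime_id value are assumptions):
    #   client.deployments.ConfigurationMetaNames.HYBRID_PIPELINE_HARDWARE_SPECS: [
    #       {"node_runtime_id": "auto_ai.kb", "hardware_spec": {"name": "S"}}
    #   ]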

Predefined hardware specifications

These are the predefined hardware specifications available by model type.

Watson Machine Learning models

Size | Hardware definition
---- | -------------------
XS   | 1 CPU and 4 GB RAM
S    | 2 CPU and 8 GB RAM
M    | 4 CPU and 16 GB RAM
ML   | 4 CPU and 32 GB RAM
L    | 8 CPU and 32 GB RAM
XL   | 8 CPU and 64 GB RAM

Decision Optimization

Size | Hardware definition
---- | -------------------
S    | 2 CPU and 8 GB RAM
M    | 4 CPU and 16 GB RAM
XL   | 16 CPU and 64 GB RAM

AutoAI with joined data

Note: These hardware definitions only apply if you are deploying an AutoAI model that uses a joined data set. For AutoAI models with a single data set, use the hardware definitions for Watson Machine Learning models.

Size     | Hardware definition
-------- | -------------------
XS-Spark | 1 CPU and 4 GB RAM, 1 master + 2 workers
S-Spark  | 2 CPU and 8 GB RAM, 1 master + 2 workers
M-Spark  | 4 CPU and 16 GB RAM, 1 master + 2 workers
L-Spark  | 4 CPU and 32 GB RAM, 1 master + 2 workers
XL-Spark | 8 CPU and 32 GB RAM, 1 master + 2 workers

Steps for submitting a batch deployment job (overview)

  1. Create a deployment of type batch.
  2. Submit a deployment job that references the batch deployment.
  3. Poll for the status of the deployment job by querying the details of the corresponding deployment job with the Watson Machine Learning Python client, the REST APIs, or the deployment space user interface (see the sketch after these steps).
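
A minimal sketch of these three steps with the Watson Machine Learning Python client; the credentials, model ID, asset href, and output asset name are placeholders, and the terminal job states checked in the loop are an assumption.

    import time
    from ibm_watson_machine_learning import APIClient

    # Placeholder credentials and IDs -- replace with your own values.
    wml_credentials = {"url": "https://us-south.ml.cloud.ibm.com", "apikey": "<api_key>"}
    client = APIClient(wml_credentials)
    client.set.default_space("<space_id>")

    # 1. Create a deployment of type batch.
    deployment = client.deployments.create("<model_id>", meta_props={
        client.deployments.ConfigurationMetaNames.NAME: "my batch deployment",
        client.deployments.ConfigurationMetaNames.BATCH: {},
        client.deployments.ConfigurationMetaNames.HARDWARE_SPEC: {"name": "S"}
    })
    deployment_id = client.deployments.get_id(deployment)

    # 2. Submit a deployment job that references the batch deployment.
    job = client.deployments.create_job(deployment_id, meta_props={
        client.deployments.ScoringMetaNames.INPUT_DATA_REFERENCES: [{
            "type": "data_asset",
            "connection": {},
            "location": {"href": "/v2/assets/<asset_id>?space_id=<space_id>"}
        }],
        client.deployments.ScoringMetaNames.OUTPUT_DATA_REFERENCE: {
            "type": "data_asset",
            "connection": {},
            "location": {"name": "batch_output.csv"}
        }
    })
    job_id = client.deployments.get_job_uid(job)

    # 3. Poll for the status of the deployment job.
    while True:
        status = client.deployments.get_job_status(job_id)
        print(status)
        if status.get("state") in ("completed", "failed", "canceled"):
            break
        time.sleep(30)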

Queuing and concurrent job executions

The maximum number of concurrent jobs that can be run for each deployment is handled internally by the deployment service. A maximum of two jobs per batch deployment can be executed concurrently. Any deployment job request for a specific batch deployment that already has two jobs in the running state is placed in a queue and executed later. When a running job completes, the next job in the queue is picked up for execution. There is no upper limit on the queue size.

Retention of deployment job metadata

The job-related metadata will be persisted and can be accessed as long as the job and its deployment are not deleted.

Input details by framework

Refer to your model type for details on what types of data are supported as input for a batch job.

Decision Optimization

Type: inline and data references

Data Sources:

Inline data:

  • Inline input data is converted to CSV files and used by the engine.
  • The engine’s CSV output data is converted to inline output data.
  • Raw data is not supported.

Local/managed assets in the deployment space:

  • Data reference type must be data_asset
  • File-based tabular input data in formats supported by the wdp-connect-library, such as CSV, XLS, XLSX, and JSON, is converted to CSV files and used by the engine.
  • Output is saved as a CSV file.
  • Raw data is not supported for input or output data.
  • A managed asset can be updated or created. If a new asset is created, you can set its name and description.
  • ZIP files are not supported.

Connected (remote) assets in the deployment space with a source such as Cloud Object Storage or Db2:

  • Data reference type must be data_asset
  • When the data source is an SQL database connection, table data is converted to CSV files and used by the engine.
  • The engine’s output CSV files are then converted to SQL insert commands against the target tables using the wdp-connect-library.
  • Output tables can be truncated or appended. By default, truncate mode is used.

Notes:

  • Data reference type must be s3 or Db2 if data is accessed directly from an S3 bucket or a Db2 database. (This applies to output_data_references as well.)
    • Connection details for the s3 or Db2 data source must be specified in the input_data_references.connection parameter of the deployment jobs payload.
    • Location details such as the table name, bucket name, or path must be specified in the input_data_references.location.path parameter of the deployment jobs payload.
  • Data reference type must be url if data must be accessed through a URL.
    • Connection details such as the REST method, URL, and other required parameters must be specified in the input_data_references.connection parameter of the deployment jobs payload.
    • Raw input and output data can be accessed through a URL with the associated REST headers.
  • You can use a pattern in the id or connection properties. For example:
    • To collect all CSV output as inline data: output_data: [{"id": ".*\.csv"}]
    • To collect the job output in a particular S3 folder: output_data_references: [{"id": ".*", "type": "s3", "connection": {...}, "location": {"bucket": "do-wml", "path": "${job_id}/${attachment_name}"}}]
  • The environment_variables parameter of deployment jobs is not applicable.

File Formats: Tabular formats supported by the wdp-connect-library, such as CSV, XLS, XLSX, and JSON (converted to CSV for the engine).
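
Putting these notes together, the following is a hedged sketch of a Decision Optimization jobs payload with the Python client's DecisionOptimizationMetaNames; the table name, its contents, and the deployment ID are placeholders, and the meta names should be verified against your client version.

    from ibm_watson_machine_learning import APIClient

    # Placeholder credentials and space ID -- replace with your own values.
    wml_credentials = {"url": "https://us-south.ml.cloud.ibm.com", "apikey": "<api_key>"}
    client = APIClient(wml_credentials)
    client.set.default_space("<space_id>")

    solve_payload = {
        # Inline input: each table is passed as fields/values and converted to a
        # CSV file (named after "id") for the engine.
        client.deployments.DecisionOptimizationMetaNames.INPUT_DATA: [{
            "id": "diet_food.csv",                            # hypothetical table
            "fields": ["name", "unit_cost", "qmin", "qmax"],
            "values": [["Roasted Chicken", 0.84, 0, 10]]
        }],
        # Collect every CSV file produced by the engine as inline output data.
        client.deployments.DecisionOptimizationMetaNames.OUTPUT_DATA: [{
            "id": ".*\\.csv"
        }]
    }

    job = client.deployments.create_job("<deployment_id>", meta_props=solve_payload)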

Spark

Type: inline

Notes: The environment_variables parameter of deployment jobs is not applicable.

SPSS

Type: inline and data references

Data Sources: Data reference type must be data_asset for the following assets:

  • Local/managed assets from the space
  • Connected (remote) assets with sources such as:
    • Cloud Object Storage
    • Mounted volume of a network file system (NFS)
    • Db2 Warehouse
    • Db2

File Formats: csv, xls, sas, sav

Notes:

  • To create a local or managed asset as an output data reference, specify the name field for output_data_reference so that a data asset is created with that name. Specifying an href that refers to an existing local data asset is not supported. Connected data assets that refer to Db2 or Db2 Warehouse databases can be created in output_data_references only when input_data_references also refers to one of these sources.
  • Table names provided in the input and output data references are ignored; the table names referenced in the SPSS model stream are used during the batch deployment.
  • The environment_variables parameter of deployment jobs is not applicable.

  • If you are creating a job via the Python client, provide the connection name referenced in the data nodes of the SPSS model stream in the "id" field, and the data asset href in "location.href", for the input and output data references of the deployment jobs payload. For example, you can construct the jobs payload like this:

     job_payload_ref = {
         client.deployments.ScoringMetaNames.INPUT_DATA_REFERENCES: [{
             # "id" must match the connection name used in the SPSS model stream data node
             "id": "DB2Connection",
             "name": "drug_ref_input1",
             "type": "data_asset",
             "connection": {},
             "location": {
                 "href": input_asset_href1
             }
         },{
             "id": "Db2 WarehouseConn",
             "name": "drug_ref_input2",
             "type": "data_asset",
             "connection": {},
             "location": {
                 "href": input_asset_href2
             }
         }],
         client.deployments.ScoringMetaNames.OUTPUT_DATA_REFERENCE: {
             "type": "data_asset",
             "connection": {},
             "location": {
                 "href": output_asset_href
             }
         }
     }
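
The payload can then be submitted against the batch deployment with create_job; this continues the example above, and the deployment ID is a placeholder.

     # Submit the job against an existing batch deployment (placeholder ID).
     job = client.deployments.create_job("<deployment_id>", meta_props=job_payload_ref)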
    

Supported combinations of input and output sources

You must specify compatible sources for the SPSS Modeler flow input, the batch job input, and the output. If you specify an incompatible combination of data source types, you get an error when you try to run the batch job.

These combinations are supported for batch jobs:

SPSS model stream input/output | Batch deployment job input | Batch deployment job output
------------------------------ | ---------------------------------------------- | ---------------------------------
File | Local/managed or referenced data asset (file) | Remote data asset (file) or name
Database | Remote data asset (database) | Remote data asset (database)

For details on how Watson Studio connects to data, see Accessing data.

Specifying multiple inputs with no schema

If you are specifying multiple inputs for an SPSS model stream deployment with no schema, specify an ID for each element in input_data_references. In this example, when you create the job, provide three input entries with the IDs "sample_db2_conn", "sample_teradata_conn", and "sample_googlequery_conn", and select the required connected data for each input.

{
  "deployment": {
    "href": "/v4/deployments/<deploymentID>"
  },
  "scoring": {
    "input_data_references": [{
      "id": "sample_db2_conn",
      "name": "DB2 connection",
      "type": "data_asset",
      "connection": {},
      "location": {
        "href": "/v2/assets/<asset_id>?space_id=<space_id>"
      },
      "schema": {}
    },
    {
      "id": "sample_teradata_conn",
      "name": "Teradata connection",
      "type": "data_asset",
      "connection": {},
      "location": {
        "href": "/v2/assets/<asset_id>?space_id=<space_id>"
      },
      "schema": {}
    },
    {
      "id": "sample_googlequery_conn",
      "name": "Google bigquery connection",
      "type": "data_asset",
      "connection": {},
      "location": {
        "href": "/v2/assets/<asset_id>?space_id=<space_id>"
      },
      "schema": {}
    }],
    "output_data_references": {
      "id": "sample_db2_conn",
      "type": "data_asset",
      "connection": {},
      "location": {
        "href": "/v2/assets/<asset_id>?space_id=<space_id>"
      },
      "schema": {}
    }
  }
}

AutoAI

Type: inline and data references

Data Sources: Data reference type must be data_asset for the following assets:

  • Local/managed assets from the space
  • Connected (remote) assets with sources such as Cloud Object Storage or NFS

File Formats: csv

Notes: The environment_variables parameter of deployment jobs is not applicable.

Scikit-Learn & XGBoost

Type: inline and data references

Data Sources: Data reference type must be data_asset for the following assets:

  • Local/managed assets from the space
  • Connected (remote) assets with sources such as Cloud Object Storage

File Formats: csv, ZIP containing .csv files

Notes: The environment_variables parameter of deployment jobs is not applicable.

TensorFlow

Type: inline and data references

Data Sources: Data reference type must be data_asset for the following assets:

  • Local/managed assets from the space
  • Connected (remote) assets with sources such as Cloud Object Storage

File Formats: ZIP containing JSON files

Notes: The environment_variables parameter of deployment jobs is not applicable.

Keras

Type: inline and data references

Data Sources: Data reference type must be data_asset for the following assets:

  • Local/managed assets from the space
  • Connected (remote) assets with sources such as Cloud Object Storage

File Formats: ZIP containing JSON files

Notes: The environment_variables parameter of deployment jobs is not applicable.

PyTorch

Type: inline and data references

Data Sources: Data reference type must be data_asset for the following assets:

  • Local/managed assets from the space
  • Connected (remote) assets with sources such as Cloud Object Storage

File Formats: ZIP containing JSON files

Notes: The environment_variables parameter of deployment jobs is not applicable.

Python function

Type: inline

Notes: The environment_variables parameter of deployment jobs is not applicable.

Python Scripts

Type: data references

Data Sources: Data reference type must be data_asset for the following assets:

  • Local/managed assets from the space
  • Connected (remote) assets with sources such as Cloud Object Storage

File Formats: any

Notes:

  • Environment variables that are required for executing the Python script can be specified as key-value pairs in the scoring.environment_variables parameter of the deployment jobs payload. The key must be the name of the environment variable and the value must be the corresponding value of the environment variable.
  • The deployment job’s payload is saved as a JSON file in the deployment container where the Python script is executed. The script can access the full path of this JSON file through the JOBS_PAYLOAD_FILE environment variable (see the sketch after these notes).
  • If input data is referenced as a local or managed data asset, the deployment service downloads the input data and places it in the deployment container where the Python script is executed. The location (path) of the downloaded input data can be accessed through the BATCH_INPUT_DIR environment variable.
  • If the input data is a connected data asset, the Python script must handle downloading the data. If a connected data asset reference is present in the deployment jobs payload, it can be accessed through JOBS_PAYLOAD_FILE, which contains the full path to the deployment jobs payload saved as a JSON file.
  • If output data must be persisted as a local or managed data asset in a space, you can specify the name of the asset to be created in scoring.output_data_reference.location.name. The Python script can place the output data in the path specified by the BATCH_OUTPUT_DIR environment variable; the deployment service compresses the data in BATCH_OUTPUT_DIR to ZIP format and uploads it.
  • If output data must be saved in a remote data store, you must specify the reference of the output data asset (for example, a connected data asset) in output_data_reference.location.href. The Python script must handle uploading the output data to the remote data source. If a connected data asset reference is present in the deployment jobs payload, it can be accessed through JOBS_PAYLOAD_FILE, which contains the full path to the deployment jobs payload saved as a JSON file.
  • If the Python script does not require any input or output data references to be specified in the deployment job’s payload, specify an empty object [{ }] for input_data_references and an empty object { } for output_data_references in the deployment jobs payload.
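
As an illustration of these conventions, the following is a minimal sketch of what the deployed Python script itself might look like. The environment variable names come from the notes above; the file names, the MY_THRESHOLD variable, and the pass-through processing are hypothetical placeholders.

    import json
    import os

    # The deployment job's payload is saved as a JSON file inside the container;
    # JOBS_PAYLOAD_FILE holds its full path.
    with open(os.environ["JOBS_PAYLOAD_FILE"]) as f:
        job_payload = json.load(f)

    # Connected data asset references (if any) can be read from the payload.
    input_refs = job_payload.get("scoring", {}).get("input_data_references", [])
    print("Input data references:", input_refs)

    # Values passed in scoring.environment_variables arrive as ordinary
    # environment variables; MY_THRESHOLD is a hypothetical example.
    threshold = float(os.environ.get("MY_THRESHOLD", "0.5"))
    print("Threshold:", threshold)

    # Local/managed input assets are downloaded to BATCH_INPUT_DIR by the service.
    input_path = os.path.join(os.environ["BATCH_INPUT_DIR"], "input.csv")   # placeholder name

    # Anything written to BATCH_OUTPUT_DIR is compressed to ZIP and uploaded as
    # the local/managed asset named in output_data_reference.location.name.
    output_path = os.path.join(os.environ["BATCH_OUTPUT_DIR"], "results.csv")

    with open(input_path) as src, open(output_path, "w") as dst:
        for line in src:
            # Placeholder processing: copy rows through unchanged.
            dst.write(line)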

R Scripts

Type: data references

Data Sources: Data reference type must be data_asset for the following assets:

  • Local/managed assets from the space
  • Connected (remote) assets with sources such as Cloud Object Storage

File Formats: any

Notes:

  • Environment variables that are required for executing the script can be specified as key-value pairs in the scoring.environment_variables parameter of the deployment jobs payload. The key must be the name of the environment variable and the value must be the corresponding value of the environment variable.
  • The deployment job’s payload is saved as a JSON file in the deployment container where the script is executed. The R script can access the full path of this JSON file through the JOBS_PAYLOAD_FILE environment variable.
  • If input data is referenced as a local or managed data asset, the deployment service downloads the input data and places it in the deployment container where the R script is executed. The location (path) of the downloaded input data can be accessed through the BATCH_INPUT_DIR environment variable.
  • If the input data is a connected data asset, the R script must handle downloading the data. If a connected data asset reference is present in the deployment jobs payload, it can be accessed through JOBS_PAYLOAD_FILE, which contains the full path to the deployment jobs payload saved as a JSON file.
  • If output data must be persisted as a local or managed data asset in a space, you can specify the name of the asset to be created in scoring.output_data_reference.location.name. The R script can place the output data in the path specified by the BATCH_OUTPUT_DIR environment variable; the deployment service compresses the data in BATCH_OUTPUT_DIR to ZIP format and uploads it.
  • If output data must be saved in a remote data store, you must specify the reference of the output data asset (for example, a connected data asset) in output_data_reference.location.href. The R script must handle uploading the output data to the remote data source. If a connected data asset reference is present in the deployment jobs payload, it can be accessed through JOBS_PAYLOAD_FILE, which contains the full path to the deployment jobs payload saved as a JSON file.
  • If the R script does not require any input or output data references to be specified in the deployment job’s payload, specify an empty object [{ }] for input_data_references and an empty object { } for output_data_references in the deployment jobs payload.
  • R Scripts are currently supported only with the default software spec default_r3.6; specifying a custom software specification is not supported.