SQL optimization

You can push many data preparation and mining operations directly in your database to improve performance.

One of the most powerful capabilities of SPSS Modeler is the ability to perform many data preparation and mining operations directly in the database. By generating SQL code that can be pushed back to the database for execution, many operations, such as sampling, sorting, deriving new fields, and certain types of graphing, can be performed in the database rather than on the client or server computer. When you're working with large datasets, these pushbacks can dramatically enhance performance in several ways:

  • By reducing the size of the result set to be transferred from the DBMS to Watson Studio. When large result sets are read through an ODBC driver, network I/O or driver inefficiencies may result. For this reason, the operations that benefit most from SQL optimization are row and column selection and aggregation (Select, Sample, Aggregate nodes), which typically reduce the size of the dataset to be transferred. Data can also be cached to a temporary table in the database at critical points in the flow (after a Merge or Select node, for example) to further improve performance.
  • By making use of the performance and scalability of the database. Efficiency is increased because a DBMS can often take advantage of parallel processing, more powerful hardware, more sophisticated management of disk storage, and the presence of indexes.

Given these advantages, Watson Studio is designed to maximize the amount of SQL generated by each SPSS Modeler flow so that only those operations that can't be compiled to SQL are executed by Watson Studio. Because of limitations in what can be expressed in standard SQL (SQL-92), however, certain operations may not be supported.

The following databases are currently supported across various product offerings:
Table 1. Databases supporting SQL pushback on Windows
Database Database version SQL pushback in Watson Studio Cloud? SQL pushback in Cloud Pak for Data? SQL pushback in Watson Studio Desktop? SQL pushback in Watson Machine Learning Server?
Amazon Redshift All versions and future fix packs Yes Yes Yes Yes
Cloudera Impala 2.8 No No Yes Yes
Databases for PostgreSQL 10.0 and future fix packs Yes Yes Yes Yes
Db2 11.1.0 Yes Yes Yes Yes
Google BigQuery Latest No No Yes No
Microsoft SQL Server 2016 Yes Yes Yes Yes
Netezza 7.2 Yes Yes Yes No
Oracle 12c Yes Yes Yes Yes
Teradata 16.2 No No Yes No
For Watson Studio Desktop, you must install the SPSS Data Access Pack to enable SQL pushback support for the following databases. See the SPSS Data Access Pack section below for more information.
  • Cloudera Impala
  • Oracle
  • Microsoft SQL Server
  • Amazon Redshift
Also, to enable SQL pushback for the following databases on Watson Studio Desktop, you must install their native ODBC drivers:
  • Db2
    You must install the IBM Data Server Driver Package available here.
  • Netezza
  • Teradata
    An additional ODBC driver is required; install the latest Teradata Tools and Utilities.
  • Google BigQuery
    You must install the required ODBC driver and then configure the JSON KeyFilePath. For Windows, you can install the latest ODBC driver from cloud.google.com (this version is free) or from simba.com. For macOS, install the latest ODBC driver from simba.com.
    Note that if you use the contents of a key file to define your BigQuery connection in the project, SQL pushback isn't supported. To enable SQL pushback, you must instead save the key file directly to your local computer and point to the absolute path in your connection definition.

macOS only supports Db2 and Google BigQuery (you must install the required native drivers).

Tips:
  • When running a flow, nodes that push back to your database are highlighted with a small SQL icon beside the node. When you start making edits to a flow after running it, the icons will be removed until the next time you run the flow.
    Figure 1. SQL pushback indicator
    SQL pushback indicator
  • If a node can't be pushed back, all subsequent nodes in the flow won't be pushed back either (pushback stops at that node). This may impact how you want to organize the order of nodes in your flow.
  • You don't need to define an ODBC data source to use your drivers with Watson Studio Desktop. Instead, create a connection in your project. See Adding connections to projects for more info.
Notes: Keep the following information in mind regarding SQL:
  • Because of minor differences in SQL implementation, flows that run in a database may return slightly different results when executed in Watson Studio. These differences may also vary depending on the database vendor, for similar reasons. For example, depending on the database configuration for case sensitivity in string comparison and string collation, SPSS Modeler flows that run using SQL pushback may produce different results from those that run without SQL pushback. Contact your database administrator for advice on configuring your database. To maximize compatibility with Watson Studio, database string comparisons should be case sensitive.
  • When using Watson Studio to generate SQL, it's possible the result using SQL pushback is not consistent on some platforms (Linux, for example). This is because floating point is handled differently on different platforms.

SPSS Data Access Pack

This section only applies to Watson Studio Desktop.

On Windows, you can obtain pushback to SQL Server and other non-Db2 databases if you install SPSS Data Access Pack version 7.1.2, which is available on IBM Passport Advantage. The part number is CNKK9EN.

For SPSS Data Access Pack installation instructions, see here. Note that some of the information in the PDF doesn't apply to Watson Studio.