SQL optimization

One of the most powerful capabilities of SPSS Modeler is the ability to perform many data preparation and mining operations directly in the database. By generating SQL code that can be pushed back to the database for execution, many operations, such as sampling, sorting, deriving new fields, and certain types of graphing, can be performed in the database rather than on the client or server computer. When you're working with large datasets, these pushbacks can dramatically enhance performance in several ways:

  • By reducing the size of the result set to be transferred from the DBMS to Watson Studio. When large result sets are read through an ODBC driver, network I/O or driver inefficiencies may result. For this reason, the operations that benefit most from SQL optimization are row and column selection and aggregation (Select, Sample, Aggregate nodes), which typically reduce the size of the dataset to be transferred. Data can also be cached to a temporary table in the database at critical points in the flow (after a Merge or Select node, for example) to further improve performance.
  • By making use of the performance and scalability of the database. Efficiency is increased because a DBMS can often take advantage of parallel processing, more powerful hardware, more sophisticated management of disk storage, and the presence of indexes.

Given these advantages, Watson Studio is designed to maximize the amount of SQL generated by each SPSS Modeler flow so that only those operations that can't be compiled to SQL are executed by Watson Studio. Because of limitations in what can be expressed in standard SQL (SQL-92), however, certain operations may not be supported.

Notes: Keep the following information in mind regarding SQL:
  • Because of minor differences in SQL implementation, flows that run in a database may return slightly different results when executed in Watson Studio. These differences may also vary depending on the database vendor, for similar reasons. For example, depending on the database configuration for case sensitivity in string comparison and string collation, SPSS Modeler flows that run using SQL pushback may produce different results from those that run without SQL pushback. Contact your database administrator for advice on configuring your database. To maximize compatibility with Watson Studio, database string comparisons should be case sensitive.
  • When using Watson Studio to generate SQL, it's possible the result using SQL pushback is not consistent on some platforms (Linux/zLinux, for example). This is because floating point is handled differently on different platforms.
  • When flows run in a Netezza database, date and time details are taken from that database. This may differ from your local date and time if, for example, the database is on a machine that's located in a different country or time zone.

Database requirements

For the latest information on which databases and ODBC drivers are supported and tested for use with Watson Studio, see the compatibility reports available at https://www.ibm.com/software/reports/compatibility/clarity/softwareReqsForProduct.html.

ODBC driver setup

To ensure that time details (such as HH:MM:SS) are processed correctly when using SQL 2012 on Windows 8 32-bit systems, when setting up your ODBC SQL Server Wire Protocol Driver, you should select both the Enable Quoted Identifiers and Fetch TWFS as Time options.