How does SQL pushback work?

Last updated: Apr 16, 2025

SQL optimization in SPSS Modeler

The main targets for SQL generation are the initial parts of a flow that come after the data import nodes. When SPSS Modeler finds a node that can't be compiled to SQL, SPSS Modeler extracts the data from the database and processes it.

During flow preparation and running, the process for generating SQL happens as follows:

SPSS Modeler reorders flows to move downstream nodes into the “SQL zone” where it can be proven safe to do so.
SPSS Modeler works from the import nodes toward the terminal nodes, constructing SQL expressions incrementally.
- This phase stops when either SPSS Modeler reaches a node that can't be converted to SQL or the terminal node in a flow is converted to SQL (for example, a Table node or a Graph node).
- To optimize performance, nodes might be excluded from the SQL generation process if the output of that node is not used downstream in later nodes. For example, if a Derive node creates a new column that is not used in a downstream Matrix node, then the Derive node is excluded from the SQL generation. An SQL icon does not appear next to the Derive node at the end of SQL generation.
  
  Figure 1. SQL generation excluding nodes
- At the end of this phase, each node is labeled with an SQL statement if the node and its predecessors have an SQL equivalent.
The SQL is checked for validity. SPSS Modeler works from the nodes that have the most complicated SQL equivalents backwards to the import nodes. The SQL that was successfully validated is chosen for execution.
Nodes for which all operations have generated SQL are highlighted with an SQL icon next to the node on the flow canvas. Based on the results, you might want to further reorganize your flow where appropriate to take full advantage of database execution.

Where do improvements occur?

SQL pushback improves performance for several data operations:

Joins (merge by key): Join operations can increase optimization within databases.
Aggregation: The Aggregate, Distribution, and Web nodes all use aggregation to produce their results. Summarized data uses considerably less bandwidth than the original data.
Selection: Choosing records based on certain criteria reduces the quantity of records.
Sorting: Sorting records is a resource-intensive activity that is performed more efficiently in a database.
Field derivation: New fields are generated more efficiently in a database.
Field projection: The software extracts only fields that are required for subsequent processing from the database, which minimizes bandwidth and memory requirements. The same is also true for superfluous fields in flat files: although the software must read the superfluous fields, it doesn't allocate any storage for them.
Scoring: SQL can be generated from decision trees, rulesets, linear regression, and factor-generated models.

Was the topic helpful?

0/1000

Where do improvements occur?Copy link to section

Where do improvements occur?