To optimize the running of flows, you can set up a cache on any nonterminal node. When
you set up a cache on a node, the cache is filled with the data that passes through the node the
next time you run the data flow. From then on, the data is read from the cache (which is stored
temporarily) rather than from the data source.
Caching is most useful following a time-consuming operation such as a sort, merge, or
aggregation. For example, suppose that you have an import node set to read sales data from a
database and an Aggregate node that summarizes sales by location. You can set up a cache on the
Aggregate node rather than on the import node because you want the cache to store the aggregated
data rather than the entire data set.
Note: Caching at import nodes, which simply stores a copy of
the original data as it is read into SPSS Modeler, won't improve
performance in most circumstances.
Nodes with caching enabled are displayed with a special circle-backslash icon. When the data is
cached at the node, the icon changes to a check mark.
Figure 1. Node with empty cache vs. node with full cache
To enable a cache
Copy link to section
Hover over the node in your flow, then click the overflow menu and select
Cache > Enable.
You can turn off the cache any time by disabling it.
Caching nodes in a database
Copy link to section
For flows that run in a database, you can cache data mid-flow to a temporary table in the
database rather than the file system. When combined with SQL optimization, this may result in
significant gains in performance. For example, the output from a flow that merges multiple tables to
create a data mining view may be cached and reused as needed. By automatically generating SQL for
all downstream nodes, performance can be further improved.
To take advantage of database caching, both SQL optimization and database caching must be
enabled.
With database caching enabled, you can cache data at any nonterminal node, and the cache will be
created automatically directly in the database the next time the flow runs is run.
If database caching or SQL optimization is not enabled, the cache will be written to
the file system instead.
Note: The following databases support temporary tables for the purpose of caching: Db2, Oracle, SQL
Server, and Teradata. Other databases, such as Netezza, will use a normal table for database
caching.
To flush a cache
Copy link to section
A circle-backslash icon by node indicates that its cache is empty. When the cache is full, the
icon becomes a check mark. If you want to replace the contents of the cache, you must first flush
the cache and then re-run the data flow to refill it.
Hover over the node in your flow, then click the overflow menu and select
Cache > Flush.
About cookies on this siteOur websites require some cookies to function properly (required). In addition, other cookies may be used with your consent to analyze site usage, improve the user experience and for advertising.For more information, please review your cookie preferences options. By visiting our website, you agree to our processing of information as described in IBM’sprivacy statement. To provide a smooth navigation, your cookie preferences will be shared across the IBM web domains listed here.