SPSS Modeler can run Python scripts that use the Apache Spark
framework to process data. This documentation describes the Python API for the interfaces
provided.
The SPSS Modeler installation includes a Spark distribution.
Accessing data
Data is transferred between a Python/Spark script and the execution context in the form of a
Spark SQL DataFrame. A script that consumes data (that is, any node except an Import node) must
retrieve the data frame from the
context:
inputData = asContext.getSparkInputData()
A script that produces data (that is, any node except a terminal node) must return a data frame
to the context:
asContext.setSparkOutputData(outputData)
You can use the SQL context to create an output data frame from an RDD where
required:
outputData = sqlContext.createDataFrame(rdd)
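Putting these calls together, the skeleton of a transformation script looks like this. In a real node the code runs at the top level with asContext supplied by SPSS Modeler; it is wrapped in a function here, with transform as a placeholder for the node's DataFrame work, purely for readability:

```python
# Skeleton of a Python/Spark transformation script. `asContext` is
# supplied by the execution environment; `transform` stands in for
# whatever DataFrame-to-DataFrame operations the node performs.

def run(asContext, transform):
    inputData = asContext.getSparkInputData()    # Spark SQL DataFrame in
    outputData = transform(inputData)            # the node's own processing
    asContext.setSparkOutputData(outputData)     # Spark SQL DataFrame out
```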
Defining the data model
A node that produces data must also define a data model that describes the
fields visible downstream of the node. In Spark SQL terminology, the data model is the schema.
A Python/Spark script defines its output data model in the form of a
pyspark.sql.types.StructType object. A StructType describes a row
in the output data frame and is constructed from a list of StructField objects.
Each StructField describes a single field in the output data model.
You can obtain the data model for the input data using the schema attribute of
the input data frame:
inputSchema = inputData.schema
Fields that are passed through unchanged can be copied from the input data model to the output
data model. Fields that are new or modified in the output data model can be created using the
StructField
constructor:
field = StructField(name, dataType, nullable=True, metadata=None)
Refer to your Spark documentation for information about the constructor.
You must provide at least the field name and its data type. Optionally, you can specify metadata
to provide a measure, role, and description for the field (see Data metadata).
DataModelOnly mode
SPSS Modeler needs to know the output data model for a node,
before the node runs, to enable downstream editing. To obtain the output data model for a
Python/Spark node, SPSS Modeler runs the script in a special
data model only mode where there is no data available. The script can identify this
mode using the isComputeDataModelOnly method on the Analytic Server context
object.
The script for a transformation node can follow this general
pattern:
if asContext.isComputeDataModelOnly():
    inputSchema = asContext.getSparkInputSchema()
    outputSchema = ... # construct the output data model
    asContext.setSparkOutputSchema(outputSchema)
else:
    inputData = asContext.getSparkInputData()
    outputData = ... # construct the output data frame
    asContext.setSparkOutputData(outputData)
Building a model
A node that builds a model must return to the execution context some content
that describes the model in enough detail that the node which applies the model can recreate it
exactly at a later time.
Model content is defined in terms of key/value pairs whose meaning is known only to the build
and score nodes; SPSS Modeler does not interpret them in any way. Optionally, the node can assign a MIME type to a value,
with the intent that SPSS Modeler might display values of
known types to the user in the model nugget.
A value in this context might be PMML, HTML, an image, and so on. To add a value to the model content (in
the build
script):
asContext.setModelContentFromString(key, value, mimeType=None)
To retrieve a value from the model content (in the score
script):
value = asContext.getModelContentToString(key)
As a shortcut, where a model or part of a model is stored to a file or folder in the file system,
you can bundle all the content stored to that location in one call (in the build
script):
asContext.setModelContentFromPath(key, path)
Note that in this case there is no option to specify a MIME type because the bundle may contain
various content types.
If you need a temporary location to store the content while building the model you can obtain an
appropriate location from the
context:
path = asContext.createTemporaryFolder()
To retrieve existing content to a temporary location in the file system (in the score
script):
path = asContext.getModelContentToPath(key)
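A minimal round trip of string-valued model content can be sketched as follows, assuming a string setter named setModelContentFromString (the counterpart of getModelContentToString; check your Modeler documentation for the exact signature). The model here is just a dict of coefficients serialized as JSON, and the key name 'model' is arbitrary:

```python
import json

# Sketch of the build/score round trip for model content. `asContext`
# is supplied by the execution environment; the setter name below is
# assumed as the counterpart of getModelContentToString.

def build(asContext, coefficients):
    # Build script: store the model as a JSON string under a key.
    asContext.setModelContentFromString('model', json.dumps(coefficients),
                                        mimeType='application/json')

def score(asContext):
    # Score script: retrieve the string and reconstruct the model.
    return json.loads(asContext.getModelContentToString('model'))
```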
Error handling
To raise an error, throw an exception from the script; the message is displayed to the SPSS Modeler user. Some exceptions are predefined in the module
spss.pyspark.exceptions. For
example:
from spss.pyspark.exceptions import ASContextException
if ... some error condition ...:
    raise ASContextException("message to display to user")