Reduce input data string length
Last updated: Dec 11, 2024

This tutorial provides an example of when you might need to reduce the input data string length. For binomial logistic regression, and auto classifier models that include a binomial logistic regression model, string fields are limited to a maximum of eight characters. Where strings are more than eight characters, you can recode them using a Reclassify node.

This example focuses on a small part of a flow to show the type of errors that might be generated with overlong strings, and explains how to use the Reclassify node to change the string details to an acceptable length. Although the example uses a binomial Logistic Regression node, you can also use the Auto Classifier node to generate a binomial Logistic Regression model.
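Before you build the flow, it can help to check which string values exceed the limit. The following Python sketch is not part of SPSS Modeler and does not use its API; it is a minimal illustration. The function name `find_overlong_values` and the inline sample rows are assumptions made for this example, and the long cholesterol values are taken from the wording of the reclassification steps later in this tutorial.

```python
MAX_STRING_LENGTH = 8  # limit for binomial logistic regression, as stated in this tutorial


def find_overlong_values(rows, field, max_len=MAX_STRING_LENGTH):
    """Return the distinct values of `field` that are longer than `max_len`, sorted."""
    return sorted({row[field] for row in rows if len(row[field]) > max_len})


# Illustrative rows only; the real drug_long_name.csv file has more fields and records.
sample_rows = [
    {"Sex": "F", "Cholesterol_long": "High level of cholesterol"},
    {"Sex": "M", "Cholesterol_long": "Normal level of cholesterol"},
    {"Sex": "M", "Cholesterol_long": "High level of cholesterol"},
]

print(find_overlong_values(sample_rows, "Cholesterol_long"))
print(find_overlong_values(sample_rows, "Sex"))
```

A field that returns an empty list already fits the limit. A field that returns any values is a candidate for recoding.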

Try the tutorial

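In this tutorial, you will complete these tasks:

  • Task 1: Open the sample project
  • Task 2: Examine the Data Asset and Type node
  • Task 3: Reclassify values
  • Task 4: Check the Filter node
  • Task 5: Define the target
  • Task 6: Generate the model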

Sample modeler flow and data set

This tutorial uses the Reducing Input Data String Length flow in the sample project. The data file used is drug_long_name.csv. The following image shows the sample modeler flow.

Figure 1. Sample modeler flow
Example flow showing string reclassification for binomial logistic regression
The following image shows the sample data set.
Figure 2. Sample data set
Sample data set

Task 1: Open the sample project

The sample project contains several data sets and sample modeler flows. If you don't already have the sample project, then refer to the Tutorials topic to create the sample project. Then follow these steps to open the sample project:

  1. In Cloud Pak for Data, from the Navigation menu, choose Projects > View all Projects.
  2. Click SPSS Modeler Project.
  3. Click the Assets tab to see the data sets and modeler flows.

Check your progress

The following image shows the project Assets tab. You are now ready to work with the sample modeler flow associated with this tutorial.

Sample project


Task 2: Examine the Data Asset and Type node

The Reducing Input Data String Length flow includes several nodes. Follow these steps to examine the Data Asset and Type node:

  1. From the Assets tab, open the Reducing Input Data String Length modeler flow, and wait for the canvas to load.
  2. Double-click the drug_long_name.csv node. This node is a Data Asset node that points to the drug_long_name.csv file in the project.
  3. Review the File format properties.
  4. Optional: Click Preview data to see the full data set.
  5. Double-click the Type node after the Data Asset node. This node specifies field properties, such as the measurement level, which is a category that indicates the type of data in the field, and the role of each field as a target or input in modeling. The source data file uses three different measurement levels:
    • A Continuous field (such as the Age field) contains continuous numeric values.
    • A Nominal field (such as the Drug field) has two or more distinct values; in this case, drugA or drugB.
    • A Flag field (such as the Sex field) describes data with two distinct values; in this case, F and M.
    Figure 3. Type node properties
    Type node

    For each field, the Type node also specifies a role to indicate the part that each field plays in modeling. The Role is set to Target for the Cholesterol_long field, which indicates whether a customer has a Normal or High level of cholesterol. The target is the field for which you want to predict the value.

    Role is set to Input for the other fields. Input fields are sometimes known as predictors, or fields whose values are used by the modeling algorithm to predict the value of the target field.

  6. Optional: Click Preview data to see the filtered data set.

Check your progress

The following image shows the Type node. You are now ready to view the Logistic node.

Type node


Task 3: Reclassify values

In this task, you run the model and discover an error. Follow these steps to reclassify the values to avoid the error:

  1. From the Modeling section in the palette, drag the Logistic node onto the canvas and connect it to the existing Type node after the Data Asset node.
  2. Double-click the Cholesterol_long node to see its properties.
  3. Select the Binomial procedure (instead of the default Multinomial procedure).
    • A Binomial model is used when the target field is a flag or nominal field with two discrete values.
    • A Multinomial model is used when the target field is a nominal field with more than two values.
  4. Click Save.
  5. Hover over the Cholesterol_long node, and click the Run icon. An error message warns you that the Cholesterol_long string values are too long. You can use a Reclassify node to transform the values and fix this issue. The Reclassify node is useful for collapsing categories or regrouping data for analysis.
    Figure 4. Notifications
    Error message
  6. Double-click the Cholesterol (Reclassify) node to see its properties. Notice that the Reclassify Field is set to Cholesterol_long and the New Field Name is Cholesterol.
  7. Click Get values and then expand the Automatically Reclassify section. Add the Cholesterol_long values to the original value column.
  8. In the new value column, for the High level of cholesterol original value, type High and for the Normal level of cholesterol original value, type Normal. These settings shorten the values to avoid the error message.
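Outside the canvas, the same recoding can be pictured as a lookup from each original value to its shorter replacement. The following Python sketch is illustrative only and is not the Reclassify node's implementation. The `reclassify` function and the `CHOLESTEROL_MAP` name are hypothetical; the field names and values come from the steps above.

```python
# Mapping from original values to new values, mirroring the Reclassify node settings above.
CHOLESTEROL_MAP = {
    "High level of cholesterol": "High",
    "Normal level of cholesterol": "Normal",
}


def reclassify(rows, source_field, new_field, mapping):
    """Return copies of `rows` with `new_field` derived from `source_field`.

    Values that are missing from `mapping` are kept unchanged. That is an
    assumption of this sketch, not a statement about the node's defaults.
    """
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not modified
        original = row[source_field]
        new_row[new_field] = mapping.get(original, original)
        result.append(new_row)
    return result


rows = [
    {"Cholesterol_long": "High level of cholesterol"},
    {"Cholesterol_long": "Normal level of cholesterol"},
]
recoded = reclassify(rows, "Cholesterol_long", "Cholesterol", CHOLESTEROL_MAP)
print([row["Cholesterol"] for row in recoded])
```

Like the node in the flow, the sketch writes the recoded values to a new Cholesterol field and leaves Cholesterol_long in place.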

Check your progress

The following image shows the Reclassify node. You are now ready to check the Filter node.

Reclassify node properties


Task 4: Check the Filter node

Follow these steps to see and check the Filter node:

  1. Double-click the Filter node to see its properties.
  2. Notice that this node filters out the Cholesterol_long field.
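Conceptually, the filter removes the original long-string field so that only the recoded field reaches the model. A minimal Python sketch of that idea follows. The `drop_fields` function is hypothetical, the Age value is made up for illustration, and the code is not the Filter node's implementation.

```python
def drop_fields(rows, fields_to_drop):
    """Return copies of `rows` without the fields named in `fields_to_drop`."""
    excluded = set(fields_to_drop)
    return [
        {name: value for name, value in row.items() if name not in excluded}
        for row in rows
    ]


# One illustrative record that holds both the long field and the recoded field.
rows = [
    {"Age": 23, "Cholesterol_long": "High level of cholesterol", "Cholesterol": "High"},
]
filtered = drop_fields(rows, ["Cholesterol_long"])
print(filtered)
```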

Check your progress

The following image shows the Filter node. You are now ready to define the target.

Filter node properties


Task 5: Define the target

You can specify field properties in a Type node. Follow these steps to define the target in the Type node:

  1. Double-click the Type node after the Filter node to view its properties.
  2. Click Read values to read the values from your data source and set the field measurement types. The Role tells modeling nodes whether fields are Input (predictor fields) or Target (predicted fields) for a machine-learning process. Both and None are also available roles, along with Partition, which indicates a field that is used to partition records into separate samples for training, testing, and validation. The value Split specifies that separate models are built for each possible value of the field.
  3. For the Cholesterol field, set the role to Target.
  4. Click Save.
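To make the Input and Target roles concrete, the following Python sketch separates one record into predictor values and the value to predict. It is an illustration under assumptions, not how the Type node works internally. The `split_by_role` function is hypothetical, only the fields named in this tutorial are listed, and the sample record is made up.

```python
# Roles for the fields named in this tutorial; the real flow may contain more fields.
roles = {
    "Age": "Input",
    "Sex": "Input",
    "Drug": "Input",
    "Cholesterol": "Target",
}


def split_by_role(rows, roles):
    """Split each row into its Input values and its single Target value."""
    targets = [name for name, role in roles.items() if role == "Target"]
    if len(targets) != 1:
        raise ValueError("this sketch expects exactly one Target field")
    inputs = [name for name, role in roles.items() if role == "Input"]
    features = [{name: row[name] for name in inputs} for row in rows]
    labels = [row[targets[0]] for row in rows]
    return features, labels


rows = [{"Age": 23, "Sex": "F", "Drug": "drugA", "Cholesterol": "High"}]
features, labels = split_by_role(rows, roles)
print(features, labels)
```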

Check your progress

The following image shows the Type node. You are now ready to generate the model.

Type node target


Task 6: Generate the model

Follow these steps to view the model output in table format:

  1. Hover over the Cholesterol (Logistic) node, and click the Run icon.
  2. From the Outputs section in the palette, drag the Table node onto the canvas, and connect it to the model nugget.
  3. Hover over the Table node that is connected to the Cholesterol model, and click the Run icon.
  4. In the Outputs and models pane, click the output results with the name Table to view the table output.

Check your progress

The following image shows the model output.

Model output


Summary

This example showed you the type of errors that might be generated with overlong strings, and explained how to use the Reclassify node to change the string details to an acceptable length. Although the example uses a binomial Logistic Regression node, the same approach applies when you use the Auto Classifier node to generate a binomial Logistic Regression model.

Next steps

You are now ready to try other SPSS® Modeler tutorials.