Creating machine learning flows with Spark MLlib nodes

You can create a machine learning flow by using SparkML nodes.

The SparkML node palette

The following categories and node types are available:

Import node

Data Asset
The Data Asset node contains the data for your Spark ML flows. Drag it from the Palette to the canvas area. Double-click the node, click the Change data asset button, select one of the data assets from your project, and then click OK.

Transformation nodes

These transformation nodes enable you to work with data directly so that you can create data sets that are easier to manipulate. You can filter unnecessary data rows or fill out a data set by adding a column of data. A code sketch that shows the equivalent Spark DataFrame operations follows this list.

Filter Rows
Filter rows based on criteria. Set a condition by using a Spark SQL expression to create a new data set.
Select Columns
Select specific columns of data for work later in the machine learning flow. Other columns are excluded from this working data set.
Add Column
Use the Spark SQL expression language to combine existing columns to create new columns or to create a new column of constant values.
Rename Column
Rename a column to better reflect the content or to match the same data in another table.
Sample Rows
Create a random sample of rows, with or without replacement, with the option to provide a limit on the sample size.
SQL Transform

Transform data by running an arbitrary SQL statement. For example, when two data sets have no key columns in common, you can fabricate a dummy key column so that a join can be constructed. The SQL Transform node can be connected to any node; it treats the incoming DataFrame as a SQL table that can be queried and augmented with new fields based on the SQL that is executed. Reference this incoming DataFrame in queries by using the table name __THIS__.

The following examples, which are based on the drug1n data set that you can find in this GitHub repository, are meant to help you create your own statements:

  • Filter for male subjects whose Na and K values sum to less than 0.5:

    SELECT * FROM __THIS__ WHERE (Na + K) < 0.5 AND Sex='M'

  • Get the count of the subjects who take the drugY drug, and their average age:

    SELECT count(*) AS total, int(mean(Age)+0.5) AS average_age FROM __THIS__ WHERE Drug='drugY'

  • Get the mean and variance of the Na variable for the drugY drug:

    SELECT mean(Na) AS Na_mean, variance(Na) AS Na_variance FROM __THIS__ WHERE Drug='drugY'

  • Get the sex-specific counts and averages of the Na variable:

    SELECT count(*) AS number_of_people, mean(Na), Sex FROM __THIS__ GROUP BY Sex

  • Get the distribution (percentage histogram) of the Age variable (the drug1n data set contains 200 records, which is why the count is divided by 200):

    SELECT concat( count(*)/200 * 100, '%') AS percentage, Age FROM __THIS__ GROUP BY Age ORDER BY Age

  • Randomly select 5 rows of people who are over 40 years old:

    SELECT rand() AS rand, * FROM __THIS__ WHERE Age>40 ORDER BY rand LIMIT 5

    A subsequent node might be needed to remove the extra rand column from the output.

  • Count the distinct rows:

    SELECT COUNT(DISTINCT *) FROM __THIS__

For more information, see SQLTransformer.
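
If you build the equivalent pipeline in code rather than in the flow editor, the same kind of statement can be applied through the SQLTransformer class. The following is a minimal PySpark sketch; the file path is an assumption, and the column names come from the drug1n examples above:

  from pyspark.sql import SparkSession
  from pyspark.ml.feature import SQLTransformer

  spark = SparkSession.builder.getOrCreate()

  # Hypothetical path; the drug1n data set provides the Age, Sex, Na, K, and Drug columns.
  df = spark.read.csv("drug1n.csv", header=True, inferSchema=True)

  # __THIS__ is replaced by the incoming DataFrame when the transformer runs.
  sql_trans = SQLTransformer(
      statement="SELECT *, Na / K AS Na_to_K FROM __THIS__ WHERE Age > 40")

  sql_trans.transform(df).show(5)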

Select Distinct Rows
Specify one or more key columns; one record is returned from each group of rows that share the same key values.
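
Outside the flow editor, the row and column transformations above map onto standard Spark DataFrame operations. The following is a minimal PySpark sketch that reuses the drug1n columns from the examples above; the file path and the sampling fraction are assumptions:

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.getOrCreate()
  df = spark.read.csv("drug1n.csv", header=True, inferSchema=True)  # hypothetical path

  filtered = df.filter("(Na + K) < 0.5 AND Sex = 'M'")                     # Filter Rows
  selected = filtered.select("Age", "Sex", "Na", "K", "Drug")              # Select Columns
  added    = selected.withColumn("Na_to_K", F.col("Na") / F.col("K"))      # Add Column
  renamed  = added.withColumnRenamed("Na_to_K", "sodium_potassium_ratio")  # Rename Column
  sampled  = renamed.sample(withReplacement=False, fraction=0.1, seed=42)  # Sample Rows
  distinct = renamed.dropDuplicates(["Sex", "Drug"])                       # Select Distinct Rows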

Modeling nodes

Logistic Regression
Logistic regression is a statistical technique for classifying records based on the values of input fields. It is analogous to linear regression but takes a categorical target field instead of a numeric range. Logistic regression can be used only for binary classification.
Decision Tree Classifier
Maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It supports both binary and multiclass labels, as well as both continuous and categorical features.
Random Forest Classifier
Constructs multiple decision trees and produces the label that is the mode of the individual trees' predictions. It supports both binary and multiclass labels, as well as both continuous and categorical features.
Gradient Boosted Tree Classifier
Produces a classification prediction model in the form of an ensemble of decision trees. It supports only binary labels, and both continuous and categorical features.
Linear Regression
Linear regression models predict a continuous target based on linear relationships between the target and one or more predictors.
Generalized Linear Regression
Generalization of ordinary linear regression that allows for target values that have error distribution models other than a normal distribution.
Decision Tree Regressor
Maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It supports both continuous and categorical features.
Random Forest Regressor
Constructs multiple decision trees and produces the mean of the individual trees' predictions. It supports both continuous and categorical features.
Gradient Boosted Tree Regressor
Produces a regression prediction model in the form of an ensemble of decision trees. It supports both continuous and categorical features.
Isotonic Regression
Models the isotonic relationship of a sequence of observations by fitting a free-form line to the observations under the following constraints: the fitted free-form line must be non-decreasing everywhere, and it must lie as close to the observations as possible.
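
Each modeling node corresponds to an estimator in the Spark ML library. The following is a minimal PySpark sketch that trains the Decision Tree Classifier on the drug1n data; the file path, the choice of feature columns, and the parameter values are assumptions for illustration:

  from pyspark.sql import SparkSession
  from pyspark.ml import Pipeline
  from pyspark.ml.feature import StringIndexer, VectorAssembler
  from pyspark.ml.classification import DecisionTreeClassifier

  spark = SparkSession.builder.getOrCreate()
  df = spark.read.csv("drug1n.csv", header=True, inferSchema=True)  # hypothetical path

  # Encode the categorical target and feature, then assemble a numeric feature vector.
  label_indexer = StringIndexer(inputCol="Drug", outputCol="label")
  sex_indexer   = StringIndexer(inputCol="Sex", outputCol="Sex_idx")
  assembler     = VectorAssembler(inputCols=["Age", "Sex_idx", "Na", "K"],
                                  outputCol="features")
  dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=5)

  pipeline = Pipeline(stages=[label_indexer, sex_indexer, assembler, dt])

  train, test = df.randomSplit([0.8, 0.2], seed=42)
  model = pipeline.fit(train)
  model.transform(test).select("Drug", "prediction").show(5)

The same pipeline structure applies to the other modeling nodes; for example, you can swap DecisionTreeClassifier for RandomForestClassifier, or use LinearRegression with a numeric label column.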

Check out our content pages for more samples, tutorials, documentation, how-tos, and blog posts.

For more detailed information about the Apache Spark SQL expression editor, see the Apache Spark Hive Wiki.

For more detailed information, see the Apache Spark machine learning website.