Creating machine learning flows with Spark MLlib nodes
You can create a machine learning flow by using SparkML nodes.
The SparkML node palette
The following categories and node types are available:
- Data Asset
- The Data Asset node contains the data for your Spark ML flows. Drag it from the palette to the canvas, double-click the node, click the Change data asset button, select one of the data assets from your project, and then click OK.
These transformation nodes enable you to work with data directly so that you can create files that are easier to manipulate. You can filter unnecessary data rows or fill out a data set by adding a column of data.
- Filter Rows
- Filter rows based on criteria. Set a condition by using a Spark SQL expression to create a new data set.
- Select Columns
- Select specific columns of data for work later in the machine learning flow. Other columns are excluded from this working data set.
- Add Column
- Use the Spark SQL expression language to combine existing columns to create new columns or to create a new column of constant values.
- Rename Column
- Rename a column to better reflect the content or to match the same data in another table.
- Sample Rows
- Create a random sample of rows, with or without replacement, with the option to provide a limit on the sample size.
- SQL Transform
- Transform data by running an arbitrary SQL statement. For example, when there aren't any key columns in common, you can fabricate a dummy key column so that a join can be constructed. The SQL Transform node can be connected to any node and treats the incoming dataframe as a SQL table that can be queried and augmented with new fields based on the SQL that is run. Reference this incoming dataframe in queries by using the table name __THIS__.
The following examples, which are based on the drug1n data set that you can find in this GitHub repository, are meant to help you with creating your own statements:
Filter records to males for whom the sum of the Na and K values is less than 0.5:
SELECT * FROM __THIS__ WHERE (Na + K) < 0.5 AND Sex='M'
Get the count of the subjects for the drugY drug and their average age:
SELECT count(*) AS total, int(mean(Age)+0.5) AS average_age FROM __THIS__ WHERE Drug='drugY'
Get the mean and variance of the Na variable for the drugY drug:
SELECT mean(Na) AS Na_mean, variance(Na) AS Na_variance FROM __THIS__ WHERE Drug='drugY'
Get the sex-specific counts and averages of the Na variable:
SELECT count(*) AS number_of_people, mean(Na), Sex FROM __THIS__ GROUP BY Sex
Get the distribution (percentage histogram) of the Age variable:
SELECT concat(count(*)/200 * 100, '%') AS percentage, Age FROM __THIS__ GROUP BY Age ORDER BY Age
Randomly get 5 rows for people who are over 40 years old:
SELECT rand() AS rand, * FROM __THIS__ WHERE Age>40 ORDER BY rand LIMIT 5
A following node might be needed to remove the extra rand column.
Count the distinct rows:
SELECT COUNT(DISTINCT *) FROM __THIS__
For more information, see SQLTransformer.
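The table-name placeholder can be tried outside the flow tool. The following is a minimal sketch, not the tool's implementation: it uses Python's built-in sqlite3 module as a stand-in for Spark SQL, and the helper name run_sql_transform, the table name incoming_rows, and the sample rows are all hypothetical. It substitutes the name of the table that holds the incoming rows for __THIS__ before running the statement. SQLite has no mean() function, so the sketch uses avg() instead.

```python
import sqlite3

# Hypothetical sample rows with the columns that the SQL examples refer to.
ROWS = [
    (23, "F", 0.79, 0.03, "drugY"),
    (47, "M", 0.74, 0.06, "drugC"),
    (61, "M", 0.41, 0.05, "drugY"),
    (49, "F", 0.63, 0.08, "drugX"),
    (52, "M", 0.38, 0.07, "drugA"),
]


def run_sql_transform(statement, rows, table="incoming_rows"):
    """Run `statement` over `rows`, with __THIS__ standing for the input table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            f"CREATE TABLE {table} "
            "(Age INTEGER, Sex TEXT, Na REAL, K REAL, Drug TEXT)"
        )
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?, ?, ?)", rows)
        # Swap the placeholder for the real table name, then run the query.
        return conn.execute(statement.replace("__THIS__", table)).fetchall()
    finally:
        conn.close()


# Males whose Na and K values sum to less than 0.5.
males_low_na_k = run_sql_transform(
    "SELECT * FROM __THIS__ WHERE (Na + K) < 0.5 AND Sex = 'M'", ROWS
)

# Count and average age for drugY (avg() is SQLite's equivalent of mean()).
drug_y_summary = run_sql_transform(
    "SELECT count(*) AS total, avg(Age) AS average_age "
    "FROM __THIS__ WHERE Drug = 'drugY'",
    ROWS,
)

print(males_low_na_k)
print(drug_y_summary)
```

Because the statement only ever names __THIS__, the same statement can be reused no matter which upstream node supplies the rows.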
- Select Distinct Rows
- Specify one or more key columns to return one record from each resulting group of rows that share the same key values.
- Logistic Regression
- Logistic regression is a statistical technique for classifying records based on values of input fields. It is analogous to linear regression but takes a categorical target field instead of a numeric range. Logistic regression can be used only for binary classification.
- Decision Tree Classifier
- Maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It supports both binary and multiclass labels, as well as both continuous and categorical features.
- Random Forest Classifier
- Constructs multiple decision trees and produces the label that is the mode of the labels predicted by the individual trees. It supports both binary and multiclass labels, as well as both continuous and categorical features.
- Gradient Boosted Tree Classifier
- Produces a classification prediction model in the form of an ensemble of decision trees. It supports only binary labels, with both continuous and categorical features.
- Linear Regression
- Linear regression models predict a continuous target based on linear relationships between the target and one or more predictors.
- Generalized Linear Regression
- Generalization of ordinary linear regression that allows for target values that have error distribution models other than a normal distribution.
- Decision Tree Regressor
- Maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It supports both continuous and categorical features.
- Random Forest Regressor
- Constructs multiple decision trees and produces the mean of the predictions of the individual trees. It supports both continuous and categorical features.
- Gradient Boosted Tree Regressor
- Produces a regression prediction model in the form of an ensemble of decision trees. It supports both continuous and categorical features.
- Isotonic Regression
- Models the isotonic relationship of a sequence of observations by fitting a free-form line to the observations under the following constraints: the fitted free-form line must be non-decreasing everywhere, and it must lie as close to the observations as possible.
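The two constraints in the Isotonic Regression description can be illustrated with a small example. The following is a minimal sketch that assumes equally weighted observations that are already sorted by their input value. It uses the pool adjacent violators approach, a common way to compute such a fit, and is not meant to show how the node is implemented. The function name isotonic_fit and the sample observations are hypothetical.

```python
def isotonic_fit(values):
    """Return the non-decreasing sequence closest (least squares) to `values`."""
    # Each block is [sum of values, number of values]; its level is sum / count.
    blocks = []
    for value in values:
        blocks.append([float(value), 1])
        # While the newest block sits below the one before it, the pair
        # violates the non-decreasing constraint, so pool them into one block.
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count

    # Expand every block back into one fitted value per observation.
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


observations = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0]
fitted = isotonic_fit(observations)
print(fitted)
```

Each run of observations that would force the line to dip is replaced by its average, which keeps the fitted line non-decreasing while moving it as little as possible from the observations.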
For more information about working with the machine learning flow tool by using SparkML controls, see this machine learning flow tutorial.
Check out our content pages for more samples, tutorials, documentation, how-tos, and blog posts.
For more detailed information about the Apache Spark SQL expression editor, see the Apache Spark Hive Wiki.
For more detailed information about the algorithms, see the Apache Spark machine learning website.