Linear regression

Linear regression is the simplest and most widely used model for supervised learning with continuous targets. It yields a linear prediction function that is particularly easy to interpret and to use for scoring observations. The IBM SPSS Spark Machine Learning Library implementation includes options for predictor (feature) selection, and a measure of relative predictor importance can be added to the model output.

The following steps show an example linear regression model that you might build, visualize, and interpret.

Step 1. Build a model

The following code shows an example of a linear regression model that you might build.


    import com.ibm.spss.ml.classificationandregression.LinearRegression

    // Specify the predictors, the target, and the model effects. Each Effect
    // names the fields in a model term; the last one is a two-way interaction
    // between gender and educational level.
    val linearRegression = LinearRegression()
        .setInputFieldList(Array("Beginning_salary", "Gender_of_employee", "Educational_level",
                                 "Minority_classification", "Current_salary"))
        .setTargetField("Work_experience")
        .setEffects(List(
            Effect(List("Beginning_salary"), List(0)),
            Effect(List("Gender_of_employee"), List(0)),
            Effect(List("Educational_level"), List(0)),
            Effect(List("Current_salary"), List(0)),
            Effect(List("Gender_of_employee", "Educational_level"), List(0, 0))))
        .setDistribution("NORMAL")
        .setLinkFunction("IDENTITY")

    // Fit the model to the training DataFrame
    val linearRegressionModel = linearRegression.fit(data)

For syntax options, see Algorithm syntax for linear regression.
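
After the model is fitted, you typically use it to score records. Assuming the fitted model follows the standard Spark ML Transformer convention (an assumption; check the library's API documentation for the exact scoring method), scoring might look like this:

    // Assumption: the fitted model exposes a Spark ML style transform method
    // that appends prediction columns to the input DataFrame.
    val scored = linearRegressionModel.transform(data)
    scored.show(5) // inspect the first few scored records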

Step 2. Create model visualizations

After you build your model, the best way to understand it is by calling the toHTML method of the ModelViewer class in the IBM SPSS Spark Machine Learning Library. To call this method, pass your model to ModelViewer.toHTML(), and then display the output by using kernel.magics.html(html).


    import com.ibm.spss.scala.ModelViewer

    // Render the model as HTML and display it in the notebook
    val html = ModelViewer.toHTML(pc, linearRegressionModel)
    kernel.magics.html(html)

where pc is the notebook's project context and linearRegressionModel is the predictive model that you created in the earlier cells of the notebook.

Step 3. Interpret visualizations

The output includes the following charts and tables, which you can review to evaluate the effectiveness of your model:

Model information table

This table contains information on how the model was fitted, so you can make sure that the model you have is what you intended. It contains input settings, such as the model selection method, as well as summary measures of prediction accuracy. R² gives the squared correlation between observed and predicted values; in a linear model with an intercept, it equals the proportion of variance in the target variable accounted for by the model, ranging from 0 for a model with no predictive ability to 1 for a perfectly fitting model. Adjusted R² shrinks, or penalizes, this proportion based on the number of parameters in the model, to facilitate comparisons among models with different numbers of predictors. The corrected Akaike Information Criterion (AICc) can be used to compare models with different numbers of parameters when they are fitted to the same target variable with the same data; smaller values are preferred. Because AICc is a function of the target variable values, unlike the R² measures it cannot be used to compare models for different targets or different sets of data.
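
The accuracy measures in this table are simple functions of the observed and predicted values. The following is a minimal sketch of how R² and adjusted R² are defined, in plain Scala; the helper and its arguments are hypothetical and are shown only to make the definitions concrete:

    // R-squared: 1 minus the ratio of residual to total variation.
    // Adjusted R-squared: penalized for the number of fitted parameters
    // (including the intercept). Hypothetical helper for illustration.
    def rSquared(observed: Seq[Double], predicted: Seq[Double],
                 numParams: Int): (Double, Double) = {
      val n = observed.length
      val meanY = observed.sum / n
      val ssTotal = observed.map(y => math.pow(y - meanY, 2)).sum
      val ssResidual = observed.zip(predicted)
        .map { case (y, yHat) => math.pow(y - yHat, 2) }.sum
      val r2 = 1.0 - ssResidual / ssTotal
      val adjR2 = 1.0 - (1.0 - r2) * (n - 1).toDouble / (n - numParams)
      (r2, adjR2)
    }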

Records summary table

This table shows you how many records were used to fit the model and whether any records were excluded due to missing data. If frequency weighting is in effect, it shows information about both unweighted and weighted numbers of records.

Predictor importance chart

This chart displays bars representing the predictors in descending order of relative importance for predicting the target, as determined by a variance-based sensitivity analysis algorithm. The values for each predictor are scaled so that they add to 1. Hovering over the bar for a particular predictor shows a table with its importance value and descriptive statistics about the predictor.
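
The library computes these values internally, but the general idea can be illustrated. In the sketch below (an illustration of first-order, variance-based sensitivity, not the library's exact algorithm), a predictor's raw score is the share of prediction variance explained by binned values of that predictor, and the raw scores are then scaled to sum to 1:

    // Illustration only: a simple variance-based sensitivity score.
    // The library's actual algorithm may differ in detail.
    def rawImportance(predictor: Seq[Double], prediction: Seq[Double],
                      numBins: Int = 10): Double = {
      val n = prediction.length
      val overallMean = prediction.sum / n
      val overallVar = prediction.map(p => math.pow(p - overallMean, 2)).sum / n
      val lo = predictor.min
      val width = (predictor.max - lo) / numBins
      // Average the predictions within each bin of the predictor
      val binMeans = predictor.zip(prediction)
        .groupBy { case (x, _) => math.min(((x - lo) / width).toInt, numBins - 1) }
        .values.map(pairs => (pairs.map(_._2).sum / pairs.size, pairs.size))
      // Var(E[Y | X]) / Var(Y): variance of bin means relative to total variance
      binMeans.map { case (m, size) => size * math.pow(m - overallMean, 2) }.sum / n / overallVar
    }

    // Scale the raw scores so that they add to 1, as in the chart
    def normalize(scores: Map[String, Double]): Map[String, Double] =
      scores.map { case (name, s) => name -> s / scores.values.sum }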

Tests of model effects table

This table gives a standard analysis of variance (ANOVA) table for each term in the linear model, including effects that represent multiple parameters for categorical predictors. The Sig. column provides the probability of observing an F statistic as large as or larger than the one observed in a sample when sampling from a population where the predictor has no effect, and can be used to identify “statistically significant” predictors. In large samples, predictors may be identified as statistically significant even though they are not important in practical terms. If specified when creating the model, columns containing the effect-size estimates η² and partial η², along with confidence intervals for these estimates, are provided. These measures are similar to R² in that they are based on proportions of variance in the target variable associated with the predictors.
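
Both effect-size measures are simple functions of the ANOVA sums of squares. As a brief sketch using the standard formulas (the variable names here are hypothetical):

    // Standard effect-size definitions from the ANOVA decomposition:
    // ssEffect is the sum of squares for the effect, ssError the residual
    // sum of squares, and ssTotal the total sum of squares.
    def etaSquared(ssEffect: Double, ssTotal: Double): Double =
      ssEffect / ssTotal

    def partialEtaSquared(ssEffect: Double, ssError: Double): Double =
      ssEffect / (ssEffect + ssError)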

Parameter estimates table

This table displays the parameter estimates (also known as regression coefficients, β coefficients, or β weights) for the fitted linear model, along with measures of sampling variation, tests of statistical significance, and confidence intervals. These coefficients form the linear prediction function: typically a constant or intercept coefficient plus each regression coefficient multiplied by its predictor value. As with the tests of model effects, individual predictors may be statistically significant without being practically important, so effect-size estimates can be requested when creating the model; if so, partial η² estimates and confidence intervals for them are also displayed in this table. If a regularization method (Lasso, ridge regression, or Elastic Net) was used to fit the model, only the regression coefficients are displayed.
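
In other words, the prediction function implied by this table is the intercept plus a weighted sum of the predictor values. A minimal sketch (a hypothetical helper; the library applies this rule internally when scoring):

    // Fitted linear prediction: intercept + sum of coefficient * predictor value
    def predict(intercept: Double, coefficients: Seq[Double],
                features: Seq[Double]): Double =
      intercept + coefficients.zip(features).map { case (b, x) => b * x }.sum

    // Example: intercept 1.2, two coefficients, one record's feature values
    val yHat = predict(1.2, Seq(0.5, -0.3), Seq(10.0, 2.0)) // 1.2 + 5.0 - 0.6 = 5.6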

Observed by predicted chart

This chart shows a scatterplot of predicted values against observed target values. The plotted points may represent averages of binned values. In a perfect-fitting model, the points would all fall exactly on the 45-degree line from lower left to upper right. Vertical departures from this line show the residuals or prediction errors for individual data points or averages of binned values. Points lying particularly far above or below this line are outliers that may warrant attention.

Residuals by predicted chart

This chart is a scatterplot of residuals or prediction errors vs. predicted values from the linear model. The plotted points may represent averages of binned values. If the prediction model is capturing all of the systematic variation in the target, you would expect to see these residuals randomly scattered around the horizontal prediction line. Patterns in the residuals indicate that the model is not capturing all of the systematic variation in the target and may indicate the need for additional predictors, a different functional form for the model (such as one including nonlinear terms) or unequal variances, all of which can cause statistical inferences to be incorrect.
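
A residual is simply an observed target value minus the model's prediction, and the binned averages that the chart may display are means of residuals within intervals of the predicted values. A minimal sketch (hypothetical helper) of how such binned means could be computed:

    // Residual = observed - predicted, grouped by bins of the predicted value
    // so that systematic patterns (nonzero bin means) are easier to spot.
    def binnedResidualMeans(observed: Seq[Double], predicted: Seq[Double],
                            numBins: Int = 20): Seq[(Double, Double)] = {
      val residuals = observed.zip(predicted).map { case (y, yHat) => y - yHat }
      val lo = predicted.min
      val width = (predicted.max - lo) / numBins
      predicted.zip(residuals)
        .groupBy { case (p, _) => math.min(((p - lo) / width).toInt, numBins - 1) }
        .toSeq.sortBy(_._1)
        .map { case (bin, pairs) =>
          (lo + (bin + 0.5) * width, pairs.map(_._2).sum / pairs.size)
        }
    }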

Normal P-P plot of residuals

This chart plots the percentiles of the cumulative distribution of the observed residuals against the percentiles of a cumulative normal distribution, which allows you to assess the common linear model assumption that prediction errors are normally distributed. Systematic deviations from the 45-degree line indicate departures from normality such as skewness and kurtosis. Outliers may also be visible here.
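
The plotted points can be reproduced by comparing each standardized residual's empirical cumulative proportion with the standard normal cumulative probability at that value. A minimal sketch, assuming Apache Commons Math (which ships with Spark) is available on the classpath:

    import org.apache.commons.math3.distribution.NormalDistribution

    // One (x, y) point per residual: theoretical normal cumulative probability
    // versus observed cumulative proportion. Points near the 45-degree line
    // suggest normally distributed residuals.
    def ppPoints(residuals: Seq[Double]): Seq[(Double, Double)] = {
      val n = residuals.length
      val mean = residuals.sum / n
      val sd = math.sqrt(residuals.map(r => math.pow(r - mean, 2)).sum / (n - 1))
      val standardNormal = new NormalDistribution(0.0, 1.0)
      residuals.sorted.zipWithIndex.map { case (r, i) =>
        val theoretical = standardNormal.cumulativeProbability((r - mean) / sd)
        val empirical = (i + 0.5) / n // midpoint plotting position
        (theoretical, empirical)
      }
    }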

Residuals histogram

This chart shows a binned representation of the residual values, with the vertical axis indicating relative frequencies. It allows you to assess the shape of the distribution of residuals, which under the normality assumption should look more or less like a standard “bell curve”: symmetric around a single peak in the center, with frequencies decreasing as you move away from the peak in either direction.