Logistic regression models in notebooks

Logistic regression is among the most popular models for predicting binary targets. It yields a linear prediction function that is transformed to produce predicted probabilities of response for scoring observations, and coefficients that are easily transformed into odds ratios, which are useful measures of predictor effects on response probabilities. The IBM SPSS Spark Machine Learning Library implementation includes options for predictor (feature) selection, and a measure of relative predictor importance can be added to the model output.

The following steps show an example logistic regression model that you might build, visualize, and interpret.

Step 1. Build a model

The following code shows an example of a logistic regression model that you might build.

	import com.ibm.spss.ml.classificationandregression.GeneralizedLinear
	import com.ibm.spss.ml.classificationandregression.params.Effect

	val logisticRegression = GeneralizedLinear()
		.setInputFieldList(Array("income", "gender", "agecat"))
		.setEffects(List(
				Effect(List("gender"), List(0)),
				Effect(List("agecat"), List(0)),
				Effect(List("income"), List(0))))
		// The next two lines are not required if you have set the measurement
		// level of the target field.
		.setUseTrials(true)
		.setTrialsField("approved")

	// Fit the configured estimator to the training data.
	val logisticRegressionModel = logisticRegression.fit(data)

For syntax options, see Algorithm syntax for logistic regression.

Step 2. Create model visualizations

After you build your model, the best way to understand it is to call the toHTML method of the ModelViewer class in the IBM SPSS Spark Machine Learning Library. Pass your model to ModelViewer.toHTML(), then display the output by using kernel.magics.html(html).

	import com.ibm.spss.scala.ModelViewer

	val html = ModelViewer.toHTML(pc, logisticRegressionModel)
	kernel.magics.html(html)

where logisticRegressionModel is the predictive model that you created in the earlier cells in the notebook.

Step 3. Interpret visualizations

The output includes the following charts and tables, which you can review to evaluate the effectiveness of your model:

Preliminary notes on terminology and data input formats

Each individual observation in a binary response model such as a binary logistic regression model involves one of two possible responses, which are encoded as 0 for non-response and 1 for response. The response category is sometimes referred to as the target, modeled, or success category, and the non-response category as the reference or failure category. In machine learning, the response field or variable itself is often referred to as the label.

The information used to predict labels or responses in a binary logistic regression is provided by what are commonly known as features in machine learning. Features are also referred to in various literatures as predictors, predictor variables, fields, independent variables or covariates. Predictors that can take on a fixed set of mutually exclusive and exhaustive values are often referred to as categorical, and are sometimes known as factors. While in some literatures the term covariates is reserved for predictors that are treated as continuous, in logistic regression the term is more widely used to refer to any type of predictor variable.

The vector of values of predictor variables taken on by a particular instance, observation, record or case in a data set is sometimes referred to as a covariate pattern. When there are many truly continuous predictors or predictors that take on a large number of values, each record in the data set may have a distinct covariate pattern. When few or no truly continuous predictors are used in a model, the number of covariate patterns may be fairly small or at least much smaller than the number of observations in the data set. In such cases, it is common practice to reduce the size of the data set by representing each distinct covariate pattern with only one or two physical observations.

One way in which this is done is to aggregate the data over the distinct values of the covariate patterns and the target variable, and to add to the data set a frequency weighting variable containing a count of the number of times each pattern of predictors corresponded with each of the two values of the target. The other way in which data with a limited number of distinct covariate patterns can be input is in what is commonly known as events/trials or successes/trials format. In this format one observation is included in the data set for each distinct covariate pattern, one field is used to indicate the number of events or successes for each pattern and a second field is used to indicate the total number of observations with that covariate pattern. Thus the observed proportion of responses or successes for each covariate pattern is shown by the value of the events field divided by the trials field. In situations such as designed experiments where the number of trials for each covariate pattern is the same, a fixed trials value can be input as a parameter instead of a trials field.
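The aggregation into events/trials format can be sketched with plain Scala collections. This is only an illustration under stated assumptions: the Record and Pattern types, the field names, and the data values are hypothetical and are not part of the IBM SPSS Spark Machine Learning Library.

```scala
// One row per individual observation: predictor values plus a 0/1 response.
// The types, field names, and values are hypothetical.
case class Record(gender: String, agecat: Int, response: Int)

// One row per distinct covariate pattern, in events/trials format.
case class Pattern(gender: String, agecat: Int, events: Int, trials: Int)

def toEventsTrials(records: Seq[Record]): Seq[Pattern] =
  records
    .groupBy(r => (r.gender, r.agecat))               // group by covariate pattern
    .map { case ((gender, agecat), rows) =>
      // events = number of responses coded 1, trials = number of rows with this pattern
      Pattern(gender, agecat, rows.map(_.response).sum, rows.size)
    }
    .toSeq
    .sortBy(p => (p.gender, p.agecat))

val records = Seq(
  Record("f", 1, 1), Record("f", 1, 0), Record("f", 1, 1),
  Record("m", 2, 0), Record("m", 2, 0)
)
val patterns = toEventsTrials(records)

// Observed proportion of responses for each pattern: events divided by trials.
patterns.foreach { p =>
  println(s"${p.gender} ${p.agecat}: ${p.events}/${p.trials} = ${p.events.toDouble / p.trials}")
}
```

Five physical records collapse to two rows here, one per covariate pattern, without losing any information that the model needs.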

All three methods of data input should lead to the same fundamental numerical results, but some differences will be seen in particular parts of the IBM SPSS Spark Machine Learning Library output for binary logistic regression. These differences are mentioned as appropriate for each relevant visualization discussed below.

Model information table

This table contains information on how the model was fitted, so you can make sure that the model you have is what you intended. It contains information on input settings such as the target field (and the trials field or fixed value for events/trials input), the numbers of observations at each level of the target, which category is being modeled and which is the reference category (the denominator category in forming logits), and frequency and scale (regression) weight fields if specified.

Because the IBM SPSS Spark Machine Learning Library fits binary logistic regression models as a special case of generalized linear models, the Model Information table also includes explicit statements of the probability distribution (binomial) and link function (logit) employed, and the resulting type of model (logistic regression). It also indicates how the generalized linear model’s scale (or dispersion) parameter has been handled.

The table indicates whether a model selection or regularization method has been used, the number of features or predictors input and the number in the final model, and provides some summary measures of goodness of fit and classification accuracy:

  • The Classification Accuracy percentage is a basic measure of prediction accuracy for that data set, the simple percentage of records correctly classified by the model. Its use implicitly assumes that mistakes of each kind are of equal cost. It is also specific to a particular critical probability threshold for classifying an observation as a predicted response and can be highly dependent on the relative proportions of successes and failures among the observations.

  • The Area Under the ROC Curve is another popular summary statistic for binary classification. See the section for the ROC Curve chart for more information on this measure.

  • The Log-likelihood is the function maximized in estimating a logistic regression model, but its raw value is not easily interpreted. Also, the actual value shown in software output may differ depending upon whether it includes the binomial constant, which is a function of the data but plays no role in estimation. Including this constant gives the value of the full log-likelihood function, while excluding it gives the value of the kernel of the log-likelihood function. Currently in the IBM SPSS Spark Machine Learning Library, the full log-likelihood is available only when events/trials input is used, except in cases where each observation has a distinct covariate pattern, in which case the binomial constant is 0 and the full log-likelihood value is the same as the kernel value.

  • The raw and scaled Deviance and Pearson χ² measures of goodness of fit are currently shown only with events/trials input. The deviance plays a role similar to that of the error or residual sum of squares in a linear regression model. In some restricted cases (typically when the number of covariate patterns is small and each covariate pattern is represented by enough trials), under the model assumptions with a correctly specified model, the deviance and Pearson χ² statistics approximately follow χ² distributions with their degrees of freedom (df) and can be used as formal goodness-of-fit test statistics. In most situations other than designed experiments, however, they cannot be used in formal tests, but they can still be used to assess an important issue related to the use of a binomial distribution.

    As previously mentioned, a binary logistic regression model is a special case of a generalized linear model that utilizes a binomial distribution and a logit link function. In models using a binomial distribution, the generalized linear model scale or dispersion parameter is theoretically equal to 1. However, the observed dispersion sometimes departs markedly from its theoretical value. When this happens, the dispersion is typically larger than expected. This is known as extra-binomial dispersion or variation, or overdispersion. In this case, statistics computed assuming a scale parameter value of 1 result in standard errors that are too small and Wald significance tests that are too liberal (i.e., that produce p values that are too small and thereby reject true null hypotheses too often). The ratio of the Deviance or Pearson χ² value to its df can be used to estimate the scale parameter and adjust the standard errors and resulting tests. If the scale parameter is estimated in this manner, an adjusted log-likelihood function value that divides the original log-likelihood value by the estimated scale parameter value is also shown.

    Be aware of overdispersion, that is, greater variance in the data than the model predicts. Hilbe distinguishes between real and apparent overdispersion. Apparent overdispersion can be caused by several model specification problems: omitted predictors or interactions among predictors, predictors that need to be transformed, an incorrect link function, or outliers. Assessing the model specification is therefore necessary to distinguish real from apparent overdispersion.

    As noted above, these statistics are currently shown only for events/trials input. For data where all covariate patterns are distinct, overdispersion cannot occur. For cases where covariate patterns are repeated but events/trials input is not used, constructing the data structure necessary for correct computation of the deviance and Pearson χ² statistics and their degrees of freedom would require an extra pass over the data and is currently not offered.

  • Each of the four information criterion measures (Akaike Information Criterion – AIC, Bayesian Information Criterion – BIC, Finite Sample Corrected AIC – AICc and Consistent AIC – CAIC) can be used to compare models with different numbers of parameters when fitted to the same target variable with the same data. They differ in the relative penalties assigned to the number of parameters in the model, and in all cases smaller values are preferred. Like the log-likelihood value on which they are largely based, these measures are functions of the target variable values, so unlike R² measures they cannot be used to compare models for different targets or different sets of data.
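Several of the summary measures in the list above can be sketched in plain Scala for events/trials data. This is a minimal illustration rather than the library's implementation. The Cell type, the fitted probabilities, and the parameter count are hypothetical. The information criteria use their common textbook definitions, and n is taken here as the total number of trials, which is an assumption about how observations are counted.

```scala
// One events/trials record with its fitted probability (hypothetical values).
case class Cell(events: Int, trials: Int, fitted: Double)

// Log of the binomial coefficient C(n, k): the constant that depends only on the data.
def logChoose(n: Int, k: Int): Double =
  (1 to k).map(i => math.log((n - k + i).toDouble / i)).sum

// Kernel of the log-likelihood (binomial constant excluded).
// Assumes every fitted probability lies strictly between 0 and 1.
def kernelLogLik(cells: Seq[Cell]): Double =
  cells.map { c =>
    c.events * math.log(c.fitted) + (c.trials - c.events) * math.log(1 - c.fitted)
  }.sum

// Full log-likelihood (binomial constant included).
def fullLogLik(cells: Seq[Cell]): Double =
  kernelLogLik(cells) + cells.map(c => logChoose(c.trials, c.events)).sum

// Pearson chi-square: squared (observed - expected) events over the binomial variance.
def pearsonChiSquare(cells: Seq[Cell]): Double =
  cells.map { c =>
    val expected = c.trials * c.fitted
    math.pow(c.events - expected, 2) / (expected * (1 - c.fitted))
  }.sum

// Information criteria from a log-likelihood ll, k parameters, and n observations.
def aic(ll: Double, k: Int): Double = -2 * ll + 2 * k
def aicc(ll: Double, k: Int, n: Int): Double = aic(ll, k) + 2.0 * k * (k + 1) / (n - k - 1)
def bic(ll: Double, k: Int, n: Int): Double = -2 * ll + k * math.log(n)
def caic(ll: Double, k: Int, n: Int): Double = -2 * ll + k * (math.log(n) + 1)

val cells = Seq(Cell(8, 20, 0.35), Cell(12, 20, 0.55), Cell(15, 20, 0.80))
val parameters = 2                                  // hypothetical: intercept plus one slope
val df = cells.size - parameters                    // covariate patterns minus parameters
val scaleEstimate = pearsonChiSquare(cells) / df    // values well above 1 suggest overdispersion
val ll = fullLogLik(cells)
val n = cells.map(_.trials).sum
println(f"scale = $scaleEstimate%.3f, AIC = ${aic(ll, parameters)}%.2f, BIC = ${bic(ll, parameters, n)}%.2f")
```

When every record has a single trial, logChoose returns 0 for each record, so the full and kernel log-likelihood values coincide, as described for data with distinct covariate patterns.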

Records summary table

This table shows you how many records were used to fit the model and whether any records were excluded due to missing data. If frequency weighting is in effect, it shows information about both unweighted and weighted numbers of records. If events/trials input format is used, only the number of physical observations input is shown.

Predictor importance chart

This chart displays bars representing the predictors in descending order of relative importance for predicting the target, as determined by a variance-based sensitivity analysis algorithm. The values for each predictor are scaled so that they add to 1. Hovering over the bar for a particular predictor shows a table with its importance value and descriptive statistics about the predictor.

Tests of model effects table

This table gives one or two Wald chi-square tests for each term in the model, including effects representing multiple parameters for categorical predictors. The Sig. column provides the probability of observing a χ² statistic as large as or larger than the one observed in a sample if sampling from a population where the predictor has no effect, and can be used to identify “statistically significant” predictors. In large samples, predictors may be identified as statistically significant even though in practical terms they are not important.

The two types of tests for each effect are known as Type I and Type III tests. Type III tests are the default. In models that include only the main effects of all predictors, Type III tests produce the same p values as are given for the parameter estimates for single degree of freedom effects. In such models, they assess the effect of adding a given predictor after all others in the model. Type III tests are unique for a given model. Type I tests are equivalent to testing each effect added to the model after those entered previously. They thus depend on the order of entry of effects in the model and are not unique for a given model.

If a regularization method (Lasso, ridge regression or Elastic Net) has been used to fit the model, or a special estimation algorithm designed for models with very large numbers of parameters has been used, this table will not appear.

Parameter estimates table

This table displays the parameter estimates (also known as regression coefficients, beta coefficients or beta weights) for the fitted model in the logit or logistic transformation metric, along with measures of sampling variation, tests of statistical significance and confidence intervals. These coefficients combine to form the linear predictor model, which typically consists of a constant or intercept coefficient plus each regression coefficient multiplied by its predictor value, to produce the linear predictor values. These linear predictor values are back-transformed via the inverse of the logit transformation to produce predicted probabilities.
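As a minimal sketch of how the coefficients produce a predicted probability, the following plain Scala code forms the linear predictor and applies the inverse logit. The coefficient and predictor values are hypothetical, and a categorical predictor is assumed to enter as a 0/1 indicator value.

```scala
// Inverse of the logit transformation: maps a linear predictor value to a probability.
def inverseLogit(eta: Double): Double = 1.0 / (1.0 + math.exp(-eta))

// Linear predictor: intercept plus each coefficient multiplied by its predictor value.
def linearPredictor(intercept: Double, coefficients: Seq[Double], values: Seq[Double]): Double = {
  require(coefficients.length == values.length, "one predictor value per coefficient")
  intercept + coefficients.zip(values).map { case (b, v) => b * v }.sum
}

// Hypothetical estimates: an intercept, an indicator predictor, and a continuous predictor.
val intercept = -1.5
val coefficients = Seq(0.8, 0.02)
val values = Seq(1.0, 30.0)

val eta = linearPredictor(intercept, coefficients, values)   // -1.5 + 0.8 * 1 + 0.02 * 30
val probability = inverseLogit(eta)
println(f"linear predictor = $eta%.3f, predicted probability = $probability%.3f")
```

A linear predictor value of 0 corresponds to a predicted probability of 0.5; negative values map below 0.5 and positive values above it.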

Exponentiated values of the coefficients and confidence intervals for these values may also appear. These values, often referred to as (estimated) odds ratios, are typically considered easier to interpret than the coefficients of the linear predictor. Confidence intervals for odds ratios are not symmetric about the estimate, and their bounds can range between 0 and infinity. For a predictor that is statistically significant according to the Wald test shown in the table, at the Type I error level α that corresponds to the specified confidence level (e.g., α=0.05 for 95% confidence intervals), the interval excludes the value 1.
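The relationship between a coefficient, its odds ratio, and the confidence interval can be sketched as follows. The sketch assumes Wald-type intervals, computed on the coefficient scale and then exponentiated. The coefficient and standard error values are hypothetical, and 1.96 is the standard normal quantile for 95% coverage.

```scala
// Odds ratio with a Wald-type confidence interval for one coefficient.
case class OddsRatio(estimate: Double, lower: Double, upper: Double)

// z is the standard normal quantile for the coverage level (1.96 for 95%).
def oddsRatio(coefficient: Double, standardError: Double, z: Double = 1.96): OddsRatio =
  OddsRatio(
    math.exp(coefficient),
    math.exp(coefficient - z * standardError),
    math.exp(coefficient + z * standardError)
  )

// Hypothetical estimates. The first has |coefficient / standardError| > 1.96, the second does not.
val significant = oddsRatio(coefficient = 0.8, standardError = 0.3)
val notSignificant = oddsRatio(coefficient = 0.2, standardError = 0.3)

println(f"OR = ${significant.estimate}%.3f, 95%% CI = (${significant.lower}%.3f, ${significant.upper}%.3f)")
println(f"OR = ${notSignificant.estimate}%.3f, 95%% CI = (${notSignificant.lower}%.3f, ${notSignificant.upper}%.3f)")
```

Because the interval is symmetric on the coefficient scale, exponentiating it stretches the upper side more than the lower side, which is why the odds ratio interval is asymmetric.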

If a regularization method (Lasso, ridge regression or Elastic Net) has been used to fit the model, or a special estimation algorithm designed for models with very large numbers of parameters has been used, only the estimated regression coefficients and possibly exponentiated values will be displayed.

Confusion matrix (Classification table)

The confusion matrix or classification table contains a cross-classification of observed by predicted labels or groups, where predictions are based on predicted probabilities from the model and the specified probability threshold (typically, but not always, 0.50). If events/trials input has been used, the labels on the categories will always indicate Events and Non-Events rather than specific values of a target variable. The numbers of correct predictions are shown in the two cells along the main diagonal. Correct percentages are shown for each row, for each column, and overall:

  • The percent correct for the target category row shows what percentage of the observed target category observations were correctly predicted by the model, which is commonly known as sensitivity, recall or true positive rate (TPR).
  • The percent correct for the reference or non-response category row shows the percentage of the non-response observations correctly predicted by the model, which is known as specificity or true negative rate (TNR).
  • The percent correct for the column with predicted positive responses gives the percentage of observations predicted by the model to be positive responses that are actually positive, known as precision or positive predictive value (PPV).
  • The percent correct for the column with predicted negative or non-responses gives the percentage of observations predicted to be non-responses that are actually negative, known as the negative predictive value (NPV).
  • The percent correct at the bottom right of the table gives the overall percentage of correctly classified observations, known as the overall accuracy.
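The five percentages can be sketched from the four cell counts of the table. The ConfusionMatrix type and the scored data below are hypothetical, and classifying an observation as positive when its predicted probability is greater than or equal to the threshold is an assumption about the tie-handling convention.

```scala
// Cell counts of a 2x2 classification table and the rates derived from them.
case class ConfusionMatrix(tp: Int, fn: Int, fp: Int, tn: Int) {
  def sensitivity: Double = tp.toDouble / (tp + fn)               // target row: recall, TPR
  def specificity: Double = tn.toDouble / (tn + fp)               // reference row: TNR
  def precision: Double = tp.toDouble / (tp + fp)                 // predicted positive column: PPV
  def negativePredictiveValue: Double = tn.toDouble / (tn + fn)   // predicted negative column: NPV
  def accuracy: Double = (tp + tn).toDouble / (tp + fn + fp + tn) // overall percent correct
}

// scored holds (predicted probability, observed 0/1 label) pairs.
def confusionMatrix(scored: Seq[(Double, Int)], threshold: Double = 0.5): ConfusionMatrix =
  ConfusionMatrix(
    tp = scored.count { case (p, y) => p >= threshold && y == 1 },
    fn = scored.count { case (p, y) => p < threshold && y == 1 },
    fp = scored.count { case (p, y) => p >= threshold && y == 0 },
    tn = scored.count { case (p, y) => p < threshold && y == 0 }
  )

// Hypothetical scored observations.
val scored = Seq(
  (0.90, 1), (0.80, 1), (0.60, 1), (0.45, 1), (0.40, 1),
  (0.70, 0), (0.35, 0), (0.30, 0), (0.20, 0), (0.10, 0)
)
val cm = confusionMatrix(scored)
println(f"sensitivity=${cm.sensitivity}%.2f specificity=${cm.specificity}%.2f " +
  f"precision=${cm.precision}%.2f NPV=${cm.negativePredictiveValue}%.2f accuracy=${cm.accuracy}%.2f")
```

Changing the threshold argument moves observations between the predicted columns, so all five values depend on the chosen threshold.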

Residuals by predicted chart

This chart is a scatterplot of standardized deviance residuals vs. predicted linear predictor values. It will only be shown when data are input via the events/trials format, where events is the number of “successes” or observations in the category of interest in the response and trials is the number of possible successes or outcomes. For a model that is appropriate for the data, you expect to see no apparent pattern or relationship between the plotted residuals and predicted values, essentially residual values randomly distributed around 0, and no values that are too far from the 0 line in magnitude.

This chart is not shown for data entered in 0/1 response format because such a plot would always show systematic patterns. Also, if a regularization method (Lasso, ridge regression or Elastic Net) has been used to fit the model, or a special estimation algorithm designed for models with very large numbers of parameters has been used, this chart will not appear.

ROC curve chart

The ROC (Receiver Operating Characteristic) Curve chart plots the true positive rate (TPR) on the vertical axis against the false positive rate (FPR) on the horizontal axis, as the threshold for positive classification is varied across the probability range. The true positive rate is the proportion of positive outcomes that are correctly predicted, also known as sensitivity, recall, or probability of detection. The false positive rate is the proportion of negative outcomes that are falsely predicted to be positive, also known as one minus specificity (1-specificity), fall-out, or probability of false alarm.

The predicted probabilities from a binary classification model such as a logistic regression fall in the open interval between 0 and 1. If the classification threshold is set to 1, no true or false positives occur, so the curve begins at the (0,0) point at the lower left. If the threshold is set to 0, all observations are predicted to be responses, so both the true positive and false positive rates are 1, and the curve ends at the (1,1) point at the upper right. Intermediate threshold values will produce different combinations of true and false positive rates. The diagonal line running from the lower left to the upper right of the chart represents the expected curve if classification is performed randomly, assigning positive or negative response labels to all observations with various fixed probabilities.

While a theoretical ROC curve is a continuous function that varies over the 0 to 1 critical probability threshold range in infinitely small increments, the nonparametric ROC curve plotted in the IBM SPSS Spark Machine Learning Library will be a finite set of points connected by straight line interpolations. The points correspond to critical probability thresholds dividing the 0 to 1 probability range into 400 equally-spaced intervals of 0.0025, with points defining any intervals not containing predicted probabilities for the data removed. The number of plotted points will thus be the minimum of 401 and one more than the number of distinct predicted probabilities found in the data, which is typically the number of distinct covariate patterns.
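The grid-based construction can be sketched in plain Scala. This is an approximation under stated assumptions, not the library's code. It evaluates thresholds from 1 down to 0 in steps of 1/400, classifies an observation as positive when its predicted probability is at least the threshold, and keeps a grid point only when the (FPR, TPR) pair changes, which is one reading of removing intervals that contain no predicted probabilities. The RocPoint type and the data are hypothetical.

```scala
// One plotted point: false positive rate, true positive rate, and the threshold used.
case class RocPoint(fpr: Double, tpr: Double, threshold: Double)

// scored holds (predicted probability, observed 0/1 label) pairs.
def rocPoints(scored: Seq[(Double, Int)], intervals: Int = 400): Seq[RocPoint] = {
  val positives = scored.count(_._2 == 1).toDouble
  val negatives = scored.count(_._2 == 0).toDouble
  val grid = (0 to intervals).map { i =>
    val threshold = 1.0 - i.toDouble / intervals        // from 1.0 down to 0.0
    val tp = scored.count { case (p, y) => p >= threshold && y == 1 }
    val fp = scored.count { case (p, y) => p >= threshold && y == 0 }
    RocPoint(fp / negatives, tp / positives, threshold)
  }
  // Keep a grid point only if it moves the curve relative to the last kept point.
  grid.foldLeft(Vector.empty[RocPoint]) { (kept, point) =>
    if (kept.nonEmpty && kept.last.fpr == point.fpr && kept.last.tpr == point.tpr) kept
    else kept :+ point
  }
}

// Hypothetical scored observations with ten distinct predicted probabilities.
val rocData = Seq(
  (0.90, 1), (0.80, 1), (0.60, 1), (0.45, 1), (0.40, 1),
  (0.70, 0), (0.35, 0), (0.30, 0), (0.20, 0), (0.10, 0)
)
val curve = rocPoints(rocData)
curve.foreach(pt => println(f"threshold=${pt.threshold}%.4f FPR=${pt.fpr}%.2f TPR=${pt.tpr}%.2f"))
```

With ten distinct predicted probabilities that fall in ten different grid intervals, the sketch keeps eleven points, one more than the number of distinct probabilities.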

Hovering over each plotted point will reveal a pop-up tool tip that shows the coordinate point values for false and true positive rates, as well as the probability threshold for classification. These values illustrate the fact that the ROC curve summarizes in a graphical form results from a number of confusion matrices or classification tables, each based on a different probability threshold for classifying observations.

The area under the ROC curve (AUC) is a popular summary measure of classification performance for binary classifiers. The diagonal line representing random classification divides the ROC curve space in half and corresponds to an AUC of 0.50. A model that is able to perfectly classify responses and non-responses would have an AUC of 1.00, though this is seldom seen in practice and results in non-existence of maximum likelihood estimates of one or more parameters in a logistic regression model. Typical ROC curves will have AUC values between 0.50 and 1.00. Any ROC curve with an AUC value less than 0.50 could be transformed into a curve with an AUC value above 0.50 simply by reversing the decision or group assignment rule.

While no single measure can capture all aspects of the performance of a classifier, the area under the ROC curve measure has some attractive properties as a summary measure. One is that it summarizes the performance of the classifier over the whole range of possible thresholds, not requiring the analyst to choose a single classification cut point. Also, the AUC gives the probability that a given classifier will rank a randomly selected positive observation higher than a randomly selected negative observation, which connects it to the Wilcoxon-Mann-Whitney sum of ranks test. It can also be rescaled so that the chance level of 0.50 becomes 0 and the maximum possible value remains 1.00 by calculating G = 2 × AUC - 1, which is also known as the Gini coefficient and is equal to twice the area between the curve and the chance classification reference line.
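The ranking interpretation of the AUC and the Gini rescaling can be sketched directly. The code compares every positive observation with every negative observation and counts ties as one half, which is the usual convention for the Wilcoxon-Mann-Whitney statistic. The data values are hypothetical, and the pairwise loop is meant for illustration rather than for large data sets.

```scala
// AUC as the proportion of (positive, negative) pairs in which the positive
// observation has the higher predicted probability; ties count as one half.
def aucByRanking(scored: Seq[(Double, Int)]): Double = {
  val positives = scored.collect { case (p, 1) => p }
  val negatives = scored.collect { case (p, 0) => p }
  val wins = positives.flatMap { p =>
    negatives.map(n => if (p > n) 1.0 else if (p == n) 0.5 else 0.0)
  }
  wins.sum / (positives.size.toLong * negatives.size)
}

// Gini coefficient: rescales the AUC so that chance level is 0 and the maximum is 1.
def gini(auc: Double): Double = 2 * auc - 1

// Hypothetical (predicted probability, observed 0/1 label) pairs.
val ranked = Seq(
  (0.90, 1), (0.80, 1), (0.60, 1), (0.45, 1), (0.40, 1),
  (0.70, 0), (0.35, 0), (0.30, 0), (0.20, 0), (0.10, 0)
)
val auc = aucByRanking(ranked)
println(f"AUC = $auc%.2f, Gini = ${gini(auc)}%.2f")
```

In this data, the positive observation outranks the negative one in 22 of the 25 possible pairs.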

One important advantage of the AUC measure over some other measures, such as overall classification accuracy, is that the ROC curve is based on two quantities (true and false positive rates) that are calculated from distinct parts of the observed data, actual positive and negative observations. This results in the ROC curve and the AUC measure being insensitive to changes in the relative proportions of positive and negative observations. Overall classification accuracy, on the other hand, in addition to requiring specification of a single cut point, can be highly dependent upon the relative proportions of positive and negative observations.
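A small experiment illustrates this property. The sketch below replicates the negative observations of a hypothetical data set ten times and compares the rates at a single threshold of 0.5. The names and data are illustrative only.

```scala
// Returns (true positive rate, false positive rate, overall accuracy) at one threshold.
def rates(scored: Seq[(Double, Int)], threshold: Double = 0.5): (Double, Double, Double) = {
  val tp = scored.count { case (p, y) => p >= threshold && y == 1 }
  val fn = scored.count { case (p, y) => p < threshold && y == 1 }
  val fp = scored.count { case (p, y) => p >= threshold && y == 0 }
  val tn = scored.count { case (p, y) => p < threshold && y == 0 }
  (tp.toDouble / (tp + fn), fp.toDouble / (fp + tn), (tp + tn).toDouble / scored.size)
}

// Hypothetical balanced data: five positive and five negative observations.
val balanced = Seq(
  (0.90, 1), (0.80, 1), (0.60, 1), (0.45, 1), (0.40, 1),
  (0.70, 0), (0.35, 0), (0.30, 0), (0.20, 0), (0.10, 0)
)
// Same observations, but with each negative observation appearing ten times.
val negativesOnly = balanced.filter(_._2 == 0)
val imbalanced = balanced ++ Seq.fill(9)(negativesOnly).flatten

val (tprBalanced, fprBalanced, accBalanced) = rates(balanced)
val (tprImbalanced, fprImbalanced, accImbalanced) = rates(imbalanced)
println(f"balanced:   TPR=$tprBalanced%.2f FPR=$fprBalanced%.2f accuracy=$accBalanced%.3f")
println(f"imbalanced: TPR=$tprImbalanced%.2f FPR=$fprImbalanced%.2f accuracy=$accImbalanced%.3f")
```

The true and false positive rates are the same for both data sets, so every point on the ROC curve is unchanged, while the overall accuracy shifts toward the specificity as negatives come to dominate.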

There are dangers, however, in using the AUC to compare the performance of different classifiers. For example, if the ROC curves cross, it is possible for one classifier to produce a higher AUC value but to have inferior performance at the critical probability thresholds that would be most useful in practice. Also, although the ROC curve and the area under it are insensitive to the relative proportions of positive and negative observations, they are dependent upon the distribution of the prediction scores. This implies the use of different assumptions about the costs of misclassification for different classifiers, making comparisons potentially akin to comparing measurements in different units.