Logistic Overview

Logistic regression models have several strengths: they are often quite accurate, they handle both symbolic and numeric input fields, and they give predicted probabilities for all target categories, so a second-best guess is easy to identify. Logistic models are most effective when group membership is a truly categorical field; if group membership is based on values of a continuous range field, consider using linear regression to take advantage of the richer information offered by the full range of values. Logistic models can also perform automatic field selection, although other approaches such as tree models or Feature Selection might do this more quickly on large datasets. Finally, because logistic models are well understood by many analysts and data miners, some use them as a baseline against which other modeling techniques can be compared.

The Logistic node offers two procedures for estimating logistic regression models. Each has some distinct options for input settings and, in some situations, produces somewhat different output:

  • Nominal Regression, which fits multi-class nominal response logistic models, and binary response logistic models as special cases
  • Logistic Regression, which fits binary response logistic models

Nominal Regression

The default procedure in the Logistic node fits nominal response logistic models for binary or nominal multi-class outcomes. One category of the outcome is specified as the Base or reference category, and for a target with K categories, K-1 logits are formed, each used to predict the odds of responding in a specified category relative to the reference category. You can control the specified Base or reference category.
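The relationship between the K-1 logits and the K category probabilities can be sketched in a few lines of Python. This is only an illustration of the baseline-category formulation, not the node's implementation; the function name, the category labels, and the logit values are all made up for the example.

```python
import math

def category_probabilities(logits_vs_base):
    """Convert K-1 baseline-category logits into K probabilities.

    logits_vs_base[j] is log(P(category j) / P(base category)).
    The base category has an implicit logit of 0 and is returned last.
    """
    exps = [math.exp(eta) for eta in logits_vs_base] + [1.0]  # exp(0) for base
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 3-category target: A and B versus base category C.
# Odds of 2 and 0.5 against the base give probabilities in ratio 2 : 0.5 : 1.
probs = category_probabilities([math.log(2.0), math.log(0.5)])

# Rank the categories to read off the best and second-best guesses.
ranked = sorted(zip(["A", "B", "C"], probs), key=lambda pair: pair[1], reverse=True)
best, second_best = ranked[0][0], ranked[1][0]
```

Because every category receives a probability, ranking them gives the second-best guess directly.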

Categorical predictors are handled by internally creating a dummy or indicator variable for each level of the predictor. Redundancies among these variables are handled via a generalized inverse, rather than by reparameterizing to full rank with one fewer contrast or indicator coding than the number of categories. This typically results in parameter estimates that compare each level to the last for a main effect of the categorical predictor.

When a binary target is being modeled, the Nominal Regression procedure has options that often make it preferable to the Logistic Regression procedure. These include:

  • Measures of monotone association between predicted probabilities and observed outcomes (Somers’ D, Goodman and Kruskal’s Gamma, Kendall’s Tau-a, and the Concordance Index C)
  • Valid goodness-of-fit Pearson and likelihood-ratio chi-square tests in situations where a limited number of distinct combinations of predictor values, or subpopulations, exist in the data, such that expected frequencies in a table of subpopulations by responses remain sizeable
  • The ability to obtain a covariance matrix estimator for the parameter estimates that adjusts for anomalous dispersion via a scale parameter (typically used to adjust for over-dispersion), and standard errors and test statistics for parameters based on this covariance matrix
  • The ability to use likelihood-ratio chi-square statistics for predictor entry in stepwise methods, in addition to the usual Score statistics
  • The ability to specify minimum and maximum numbers of terms in the model for stepwise methods
  • The ability to constrain the terms in a stepwise model to those that satisfy properties of hierarchy or containment among effects
  • Explicit diagnoses of convergence problems related to complete or quasi-complete separation
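As an illustration of the first item in this list, the following sketch computes the four association measures for a binary target from concordant and discordant pairs. It uses the pairwise definitions commonly given for logistic regression output; how the procedure itself compares or bins predicted probabilities is not described here, so treat the function and its made-up data as assumptions.

```python
def concordance_measures(probs, outcomes):
    """Pairwise association between predicted probabilities and a 0/1 outcome.

    Each pair of one event (outcome 1) and one non-event (outcome 0) is
    concordant if the event has the higher predicted probability,
    discordant if it has the lower one, and tied otherwise.
    """
    events = [p for p, y in zip(probs, outcomes) if y == 1]
    non_events = [p for p, y in zip(probs, outcomes) if y == 0]
    concordant = sum(1 for e in events for n in non_events if e > n)
    discordant = sum(1 for e in events for n in non_events if e < n)
    pairs = len(events) * len(non_events)
    ties = pairs - concordant - discordant
    n_cases = len(probs)
    return {
        "somers_d": (concordant - discordant) / pairs,
        "gamma": (concordant - discordant) / (concordant + discordant),
        "tau_a": (concordant - discordant) / (n_cases * (n_cases - 1) / 2),
        "c": (concordant + 0.5 * ties) / pairs,
    }

# Made-up predictions for six cases, purely for illustration.
measures = concordance_measures(
    probs=[0.9, 0.7, 0.4, 0.6, 0.4, 0.2],
    outcomes=[1, 1, 1, 0, 0, 0],
)
```

With these definitions the Concordance Index C equals (1 + Somers' D) / 2, so the two measures carry the same information on different scales.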

Logistic Regression

For binary response models, the Logistic Regression procedure may be used instead of the default Nominal Regression procedure. Logistic Regression handles categorical predictors by internally coding their categories with indicator or dummy variables for all except one category, which the user can control, or by creating a set of contrast variables. For a categorical predictor with C categories, C-1 indicator or contrast variables are created. This procedure therefore gives the user somewhat more control over the parameterization of models with categorical predictors, and more flexibility in obtaining parameter estimates with specific interpretations. However, if models are fitted via stepwise methods and interaction terms are considered, the inability to constrain which kinds of effects can be entered together into the model can result in nonsensical models being produced.
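The difference between the two coding schemes can be sketched as follows. The helper, the level names, and the data are hypothetical; the sketch only shows the shape of the coded columns, not how either procedure estimates parameters.

```python
def indicator_code(values, levels, reference=None):
    """Code a categorical field as 0/1 indicator columns.

    With reference=None every level gets a column (C columns, redundant
    once an intercept is present). With a reference level, that level is
    dropped, leaving C-1 columns whose coefficients compare each remaining
    level to the reference.
    """
    kept = [lvl for lvl in levels if lvl != reference]
    return kept, [[1 if v == lvl else 0 for lvl in kept] for v in values]

levels = ["low", "medium", "high"]          # hypothetical predictor levels
values = ["low", "high", "medium", "high"]  # hypothetical data

# One column per level, as in the Nominal Regression description.
all_cols, full = indicator_code(values, levels)
# C-1 columns with a user-chosen reference, as in Logistic Regression.
ref_cols, reduced = indicator_code(values, levels, reference="low")

# Every row of the full coding sums to 1, duplicating the intercept column;
# this is the redundancy that a generalized inverse has to resolve.
row_sums = [sum(row) for row in full]
```

Choosing a different reference level changes which comparison each coefficient expresses, not the fitted probabilities.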

Logistic Regression does offer a few things not available in Nominal Regression, including:

  • The Hosmer-Lemeshow goodness-of-fit test for situations where the number of subpopulations or covariate patterns is too large for the standard Pearson or likelihood-ratio tests available in Nominal Regression to be valid
  • Fuller listings of results during the process of fitting stepwise models
  • A listing of casewise outlier results based on cases or observations with studentized deviance residuals greater in magnitude than a specified value
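To make the first item in this list more concrete, here is a minimal sketch of the Hosmer-Lemeshow statistic in its usual textbook form: cases are sorted by predicted probability, split into groups, and observed event counts are compared with expected counts in each group. The grouping rule and the function itself are assumptions for illustration; the procedure's exact grouping and tie handling may differ.

```python
def hosmer_lemeshow_statistic(probs, outcomes, groups=10):
    """Hosmer-Lemeshow chi-square statistic for a binary model.

    Cases are sorted by predicted probability and split into roughly equal
    groups. Assumes each group's mean probability is strictly between 0 and 1.
    """
    ordered = sorted(zip(probs, outcomes))
    n = len(ordered)
    statistic = 0.0
    for g in range(groups):
        chunk = ordered[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue  # fewer cases than groups: skip empty slices
        size = len(chunk)
        observed = sum(y for _, y in chunk)   # observed events in the group
        expected = sum(p for p, _ in chunk)   # expected events in the group
        mean_p = expected / size
        # Group contribution: (O - E)^2 / (n_g * pbar_g * (1 - pbar_g))
        statistic += (observed - expected) ** 2 / (size * mean_p * (1 - mean_p))
    return statistic

# Made-up example with four cases and two groups.
hl = hosmer_lemeshow_statistic([0.2, 0.2, 0.8, 0.8], [0, 1, 1, 1], groups=2)
```

The statistic is conventionally referred to a chi-square distribution with the number of groups minus 2 degrees of freedom, which is why it remains usable when nearly every case has its own covariate pattern.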

Next steps

Like your visualization? Why not deploy it? For more information, see Deploy a model.