In [1]:

```
#@author: Venky Rao raove@us.ibm.com
#@last edited: 14 Oct 2017
#@source: materials, data and examples adapted from R in Action 2nd Edition by Dr. Robert Kabacoff
```

In [2]:

```
#in many ways, regression analysis lives at the heart of statistics
#it is a broad term for a set of methodologies used to predict a response variable
#(also called a dependent, criterion or outcome variable) from one or more
#predictor variables (also called independent or explanatory variables)
```

In [3]:

```
#in OLS regression, a quantitative dependent variable is predicted from a weighted sum of
#predictor variables, where the weights are parameters estimated from the data.
#OLS regression is the most common variety of statistical analysis today.
#the methods that fall under the rubric of OLS regression include simple linear regression,
#polynomial regression and multiple linear regression.
```
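In [ ]:

```
#illustration added for this write-up: the "weighted sum" in OLS can be computed
#directly from the normal equations; lm() does the same thing more robustly
#(via a QR decomposition) under the hood
X <- cbind(1, women$height) #design matrix: a column of 1s (intercept) plus the predictor
y <- women$weight
solve(t(X) %*% X, t(X) %*% y) #OLS weights; should match coef(lm(weight ~ height, data = women))
```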

In [1]:

```
#we use the dataset "women" in the base installation of R which provides the height
#and weight of 15 women of ages 30 to 39. The following code calculates an equation
#for predicting weight from height.
#let's review the dataset first
women
```

In [9]:

```
#fit a simple linear regression model
fit <- lm(weight ~ height, data = women)
#display results of the fitted model
summary(fit)
```

In [11]:

```
#because a height of zero is impossible, you would not try to give a physical interpretation
#to the intercept
#from the Pr(>|t|) column, you see that the regression coefficient (3.45) is significantly
#different from zero (p < 0.001) and indicates that there is an expected increase of 3.45
#pounds of weight for every 1 inch increase in height.
#1.091e-14 = 1.091 * 10^-14, a vanishingly small number
#let us understand this further:
#begin with the null hypothesis: the coefficient for height is zero, i.e. height has
#no linear relationship with weight
#the p-value is the probability of obtaining a coefficient at least this far from zero
#in the sample if the null hypothesis were true in the population
#because the p-value < 0.001, you reject the null hypothesis that height and weight are unrelated
#the multiple R-squared (0.991) indicates that the model accounts for 99.1% of the variance in weights
#the multiple R-squared is the squared correlation between the actual and predicted value
#the residual standard error (1.53 pounds) can be thought of as the average error in predicting
#weight from height using this model
#the F-statistic tests whether the predictor variables, taken together, predict the response variable
#above chance levels. because there is only one predictor variable in simple regression,
#the F-statistic in this example is equivalent to the t-test for the regression coefficient for height
```
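In [ ]:

```
#optional illustration: the quantities discussed above can be extracted from the
#summary object programmatically rather than read off the printed output
s <- summary(fit)
s$coefficients["height", "Pr(>|t|)"] #p-value for the height coefficient
s$r.squared #multiple R-squared
s$sigma #residual standard error
s$fstatistic #F-statistic and its degrees of freedom
```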

In [3]:

```
#list all the weights of women
women$weight
```

In [5]:

```
#list all the predicted values in a fitted model
round(fitted(fit))
```

In [7]:

```
#list the residual values in a fitted model
round(residuals(fit))
```

In [8]:

```
#plot height versus weight and add a line of fit
plot(women$height, women$weight,
xlab = "Height (in inches)", # x-axis label
ylab = "Weight (in pounds)") #y-axis label
abline(fit) #add the line of fit
```

In [9]:

```
#list the coefficients of the fitted model
coefficients(fit)
```

In [12]:

```
#from the above, we can see that the prediction equation is:
#predictedWeight = -87.52 + 3.45 * Height
```
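In [ ]:

```
#optional illustration: rather than plugging heights into the prediction equation
#by hand, the predict() function applies the fitted model to new data
predict(fit, newdata = data.frame(height = c(60, 70))) #predicted weights at 60 and 70 inches
```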

In [3]:

```
#the following code shows the results of fitting a quadratic equation
fit2 <- lm(weight ~ height + I(height ^ 2), data = women)
summary(fit2)
```

In [4]:

```
#the term "I(height ^ 2)" adds a height-squared term to the prediction equation
#the I() function treats the contents within the parentheses as a regular arithmetic
#expression, preventing "^" from being interpreted as a formula operator
#from this analysis, the new prediction equation is:
# predictedWeight = 261.88 - 7.35 * Height + 0.083 * (Height ^ 2)
#both regression coefficients are significant at the p < 0.0001 level
#regression coefficient for height has Pr(>|t|) = 6.58e-07 = 6.58 * 10 ^ -7 = 0.000000658
#regression coefficient for height ^ 2 has Pr(>|t|) = 9.32e-09 = 9.32 * 10 ^ -9 = 0.00000000932
#each p-value is the probability of obtaining a coefficient at least this far from zero
#if that coefficient were truly zero in the population
#because both p-values are < 0.001, you reject the hypothesis that either coefficient is zero
#the multiple R-squared has improved from 0.991 in the simple linear model to
#0.9995 in the quadratic model. This shows that the amount of variance accounted for has increased (this is a good thing!)
#the significance of the squared term (t = 13.89, p < 0.001) suggests that the inclusion of the quadratic term improves model fit
#the plot below confirms this
```
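In [ ]:

```
#optional illustration: a more formal way to confirm that the quadratic term
#improves the fit is to compare the two nested models with an F-test
anova(fit, fit2) #a small p-value indicates fit2 improves significantly on fit
```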

In [16]:

```
#plot height versus weight and add a line of fit
plot(women$height, women$weight,
xlab = "Height (in inches)", # x-axis label
ylab = "Weight (in pounds)") #y-axis label
lines(women$height, fitted(fit2), col = "darkgreen") #add the line of fit including the quadratic term
abline(fit, col = "blue") #adding the original line of fit
```

In [18]:

```
#to fit a cubic polynomial, use the following code:
fit3 <- lm(weight ~ height + I(height ^ 2) + I(height ^ 3), data = women)
summary(fit3)
#although higher polynomials are possible, terms higher than cubic are rarely necessary
```

In [20]:

```
#the scatterplot() function in the "car" package provides a simple and convenient
#method of plotting a bivariate relationship. here's an example:
#install.packages("car") in case you need to install the car package
library(car) #load the "car" package
scatterplot(weight ~ height, data = women, #data that you want to plot
spread = FALSE, #this option suppresses spread and asymmetry information
smoother.args = list(lty = 2), #specifies that the loess fit be rendered as a dashed line
pch = 19, #displays points as filled circles
main = "Women Age 30 - 39", #main title
xlab = "Height (inches)", #x-axis label
ylab = "Weight (lbs.)") #y-axis label
```

In [21]:

```
#you can tell from the chart above that the two variables are roughly
#symmetrical and that a curved line will fit the data points better than a straight line
#loess stands for locally weighted scatterplot smoothing. More information can be found here:
# https://en.wikipedia.org/wiki/Local_regression
```
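In [ ]:

```
#optional illustration: the dashed curve drawn by scatterplot() is a loess fit;
#base R can draw a similar locally weighted smooth with lowess()
plot(women$height, women$weight, xlab = "Height (inches)", ylab = "Weight (lbs.)")
lines(lowess(women$height, women$weight), lty = 2) #dashed locally weighted smooth
```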

In [2]:

```
#when there is more than one predictor variable, simple regression becomes
#multiple linear regression. Technically, polynomial regression is a special
#case of multiple regression
#to understand multiple linear regression, we will use the state.x77 dataset
#in the base R installation and explore the relationship between a state's murder rate
#and other characteristics of the state including population, illiteracy rate,
#average income and frost levels (mean number of days below freezing)

In [56]:

```
#the state.x77 dataset is contained in a matrix. Since the lm() function needs
#a data frame, we first create a data frame using the following code:
states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")])
states #print the data frame
```

In [4]:

```
#a good first step in multiple regression is to examine the relationships among
#the variables two at a time. Let's do this below:
cor(states) #this function provides the bivariate correlations
```

In [5]:

```
#to visually see these correlations, we can generate scatterplots as follows:
# install.packages("car") #in case you don't have the "car" package installed
library(car) #load the "car" package to access the scatterplotMatrix() function
scatterplotMatrix(states, #data to be plotted
spread = FALSE, #this option suppresses spread and asymmetry information
smoother.args = list(lty = 2), #specifies that the loess fit be rendered as a dashed line
main = "Scatter Plot Matrix") #main heading
```

In [6]:

```
#by default, the scatterplotMatrix() function provides scatter plots of the variables
#with each other in the off-diagonals and superimposes smoothed (loess) and linear fit
#lines on these plots. the principal diagonal contains density and rug plots for each variable.
#you can see that the murder rate may be bimodal and that each of the predictor variables
#are skewed to some extent
#murder rates rise with population and illiteracy and fall with income and frost
#colder states have lower illiteracy rates, lower population and higher incomes
```

In [7]:

```
#now we can fit an lm() function as follows:
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
summary(fit)
```

In [8]:

```
#print the coefficients
coefficients(fit)
```

In [9]:

```
#when there is more than one predictor variable, the regression coefficients
#indicate the increase in the dependent variable for a unit change in a
#predictor variable, holding all other predictor variables constant
#for example, the regression coefficient for illiteracy is 4.14,
#suggesting that an increase of 1% in illiteracy is associated with a 4.14% increase
#in the murder rate, controlling for population, income and frost.
#the illiteracy coefficient is significantly different from zero at the p < 0.001 level
#the p-value is the probability of obtaining a coefficient at least this far from zero
#if illiteracy had no linear relationship with murder rate, given the other predictors
#because the p-value < 0.001, you reject the hypothesis that the illiteracy coefficient is zero
#on the other hand, the coefficient for Frost is NOT significantly different from zero (p = 0.954)
#because the p-value is large, you fail to reject the hypothesis that the Frost coefficient is zero
#this suggests that frost and murder rate are not linearly related when controlling for
#the other predictor variables.
#as indicated by the multiple R-squared, taken together, the predictor variables account for
#57% of the variance in murder rates across states.
```
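In [ ]:

```
#optional illustration: because the predictors are on very different scales,
#standardizing all variables first makes the coefficients directly comparable
#(each is then the expected change in standard deviations of Murder per
#standard-deviation change in that predictor)
zstates <- as.data.frame(scale(states)) #standardize every column to mean 0, sd 1
coef(lm(Murder ~ Population + Illiteracy + Income + Frost, data = zstates))
```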

In [4]:

```
#so far we have assumed that predictor variables do not interact with each other.
#however, this is not necessarily true as we can see in the following code:
fit <- lm(mpg ~ hp + wt + hp:wt, data = mtcars) #here mpg is the response variable
#hp, wt and the interaction between hp and wt (hp:wt) are the predictor variables
#the dataset is the "mtcars" dataset that is provided with the base R installation
summary(fit) #summary of the results
coefficients(fit) #prints out the coefficients
```

In [10]:

```
#you can see from the "Pr(>|t|)" column that the interaction between
#hp and wt is significant (p-value = 0.000811)
#the coefficient is significantly different from zero at the p < 0.001 level,
#so you reject the hypothesis that the interaction coefficient is zero
#this means that the relationship between one predictor and the response variable depends
#on the level of the other predictor. here it means that the relationship between
#mpg and hp depends on wt
#the model for predicting mpg is as follows:
# predictedMpg = 49.81 - 0.12 * hp - 8.22 * wt + 0.03 * hp * wt
#you can plug in various values of wt into this equation to examine the impact
#you can visualize the impacts using the effect() function in the effects package
#the code is provided below:
#install.packages("effects") #install the "effects" package
library(effects) #load the effects library
myEffect <- effect("hp:wt", #this is the term we want to plot; i.e. the interaction we want to visualize
fit, #this is the fitted model returned by lm()
xlevels = list(wt = c(2.2, 3.2, 4.2))) #list specifying the variables to be set to constant values and the values to employ
plot(myEffect, multiline = TRUE) #multiline = TRUE superimposes the lines being plotted
```

In [8]:

```
#you can see from this graph that as the weight of the car increases, the relationship
#between horsepower and miles per gallon weakens. For wt = 4.2, the line is almost horizontal,
#indicating that as hp increases, mpg does not change.
```
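In [ ]:

```
#optional illustration: the interaction model implies that the slope of mpg on hp
#is (-0.12 + 0.03 * wt), so we can compute it at the three weights plotted above
b <- coef(fit) #coefficients of the hp + wt + hp:wt model
b["hp"] + b["hp:wt"] * c(2.2, 3.2, 4.2) #slope of mpg on hp at each weight; near zero at wt = 4.2
```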

In [14]:

```
#to properly interpret the coefficients of the OLS model, you must satisfy
#a number of statistical assumptions:
#1. Normality: for fixed values of the independent variables, the dependent variable
#is normally distributed
#2. Independence: the predicted values of the dependent variable are independent of each other
#3. Linearity: the dependent variable is linearly related to the independent variables
#4. Homoscedasticity: the variance of the dependent variable does not vary with the
#levels of the independent variables. Aka "constant variance".
```

In [16]:

```
#let us look at the output from the confint() function applied to the "states"
#multiple regression problem:
states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")]) #create the data frame
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states) #specify the lm() function
summary(fit) #print a summary of the lm() function
coefficients(fit) #print the coefficients from the lm() function
confint(fit) #print the confidence intervals of the lm() function
```

In [3]:

```
#the results above suggest, among other things, that you can be 95% confident:
#1. that the interval [2.38, 5.9] contains the true change in the murder rate for a 1% change in illiteracy rate;
#2. that because the interval for Frost contains 0, you can conclude that a change in temperature is unrelated
#to murder rate, holding the other variables constant.
#However, your faith in these results is only as strong as the evidence you have that your data satisfies the
#statistical assumptions underlying the model, i.e. normality, independence, linearity and homoscedasticity.
#a set of techniques called "Regression Diagnostics" provides the necessary tools for evaluating the appropriateness
#of the regression model and can help you uncover and correct problems.
```

In [4]:

```
#the most common approach is to use the plot() function to the object returned by lm() as shown below:
fit <- lm(weight ~ height, data = women) #store the results of the lm() function in the fit object
par(mfrow = c(2,2)) #create a 2X2 matrix to store the plots generated by the plot() function
plot(fit) #generate the plots
```

In [5]:

```
#let us evaluate these charts against each of the assumptions of OLS regression:
#1. Normality: if a dependent variable is normally distributed for a fixed set of
#predictor values, then the residual values should be normally distributed with
#a mean of 0. The Normal Q-Q plot (upper right) is a probability plot of the
#standardized residuals against the values that would be expected under normality.
#if you have met the normality assumption, the points on this graph should fall on
#the straight 45-degree line. Because they do not, you have clearly violated the
#normality assumption.
#2. Independence: you cannot tell if the dependent variables are independent from these
#plots. You have to use your understanding of how the data was collected. There's
#no apriori reason to believe that one woman's weight influences another woman's
#weight. If you found out that the data were sampled from families, you might have
#to adjust your assumption of independence.
#3. Linearity: if the dependent variable is linearly related to the independent variables,
#there should be no systematic relationship between the residuals and the predicted
#(fitted) values. In other words, the model should capture all the systematic variance
#present in the data, leaving nothing but random noise. In the Residuals vs. Fitted
#graph (upper left), you see clear evidence of a curved relationship, which suggests
#you may want to add a quadratic term to the regression.
#4. Homoscedasticity: if you have met the constant variance assumption, the points in
#the Scale - Location graph (bottom left) should be a random band around a horizontal
#line. You seem to meet this assumption.
#Finally, the Residuals vs. Leverage graph (bottom right) provides information about
#individual observations that you may wish to attend to. The graph identifies outliers,
#high-leverage points and influential observations. Specifically:
#an outlier is an observation that is not predicted well by the fitted regression
#model (i.e. it has a large positive or negative residual)
#an observation with a high leverage value has an unusual combination of predictor
#values. That is, it is an outlier in the predictor space. The dependent variable
#value is not used to calculate an observation's leverage
#an influential observation is an observation that has a disproportionate impact on
#the determination of the model parameters. Influential observations are identified
#using a statistic called Cook's distance or Cook's D.
```
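In [ ]:

```
#optional illustration: the quantities behind the Residuals vs. Leverage plot can
#be computed directly for each observation
rstudent(fit) #studentized residuals; large absolute values flag outliers
hatvalues(fit) #leverage of each observation in the predictor space
cooks.distance(fit) #Cook's D; large values flag influential observations
```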

In [6]:

```
#let us review diagnostic plots for the quadratic fit using the following code:
fit2 <- lm(weight ~ height + I(height ^ 2), data = women) #store the results of the lm() function in the fit2 object
par(mfrow = c(2,2)) #create a 2X2 matrix to store the plots generated by the plot() function
plot(fit2) #generate the plots
```

In [7]:

```
#this set of plots suggests that the polynomial regression provides a better fit with regard
#to the linearity assumption, normality of residuals (except for observation 13), and
#homoscedasticity (constant residual variance). Observation 15 appears to be influential
#(based on a large Cook's D value), and deleting it has an impact on the parameter estimates.
#let's try deleting these values and evaluate model fit (you need to be careful when deleting data;
#your models should fit your data, not the other way around!)
fit3 <- lm(weight ~ height + I(height ^ 2), data = women[-c(13, 15),]) #store the results of the lm() function in the fit3 object
par(mfrow = c(2,2)) #create a 2X2 matrix to store the plots generated by the plot() function
plot(fit3) #generate the plots
```
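In [ ]:

```
#optional illustration: comparing the coefficients before and after deleting
#observations 13 and 15 shows how much those points influenced the estimates
coefficients(fit2) #quadratic model fit to all 15 women
coefficients(fit3) #same model with observations 13 and 15 removed
```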