In [70]:

```
#@author: Venky Rao raove@us.ibm.com
#@last edited: 13 Sep 2017
#@source: materials, data and examples adapted from R in Action 2nd Edition by Dr. Robert Kabacoff
```

In [2]:

```
#in this section we will review measures of central tendency, variability and distribution shape
#for continuous variables
#we will use the mtcars dataset that ships with the base installation of R
myvars <- c("mpg", "hp", "wt")
head(mtcars[myvars])
```

In [3]:

```
#first we will look at descriptive statistics for all 32 cars. Then, we will examine
#descriptive statistics by transmission type (am) and number of cylinders (cyl)
#transmission type is a dichotomous variable coded 0 = automatic, 1 = manual
#number of cylinders can be 4, 6 or 8
```
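A quick base-R check confirms these codings; both variables ship with mtcars, so this runs with no extra packages:

```
#verify the codings of the grouping variables in mtcars
table(mtcars$am)  #0 = automatic, 1 = manual
table(mtcars$cyl) #cylinder counts present in the data: 4, 6 and 8
```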

In [4]:

```
#in the base installation of R, you can use the summary() function to obtain
#descriptive statistics
summary(mtcars[myvars])
```

In [5]:

```
#creating a user-defined function for summary statistics
mystats <- function(x, na.omit = FALSE) {
    if (na.omit) {
        x <- x[!is.na(x)]
    }
    m <- mean(x)
    n <- length(x)
    s <- sd(x)
    skew <- sum((x - m)^3 / s^3) / n
    kurt <- sum((x - m)^4 / s^4) / n - 3
    return(c(n = n, mean = m, stdev = s, skew = skew, kurtosis = kurt))
}
```
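A quick sanity check on a small symmetric vector: the skew should come out as 0 and the mean and count are easy to verify by eye (mystats is redefined here so this cell runs standalone):

```
#sanity check of mystats on a symmetric sample
mystats <- function(x, na.omit = FALSE) {
    if (na.omit) x <- x[!is.na(x)]
    m <- mean(x); n <- length(x); s <- sd(x)
    skew <- sum((x - m)^3 / s^3) / n
    kurt <- sum((x - m)^4 / s^4) / n - 3
    c(n = n, mean = m, stdev = s, skew = skew, kurtosis = kurt)
}
mystats(c(1, 2, 3, 4, 5)) #n = 5, mean = 3, skew = 0
```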

In [6]:

```
#applying our user defined function to our dataset
options(digits = 3) #keep decimal places to 3 to enhance readability
sapply(mtcars[myvars], mystats)
```

In [7]:

```
#descriptive statistics via the describe() function in the Hmisc package
library(Hmisc) #load the Hmisc package
describe(mtcars[myvars])
```

In [8]:

```
#descriptive statistics via the stat.desc() function in the pastecs package
install.packages("pastecs") #install pastecs
library(pastecs) #load the pastecs package
stat.desc(mtcars[myvars])
```

In [9]:

```
#descriptive statistics via the describe() function in the psych package
library(psych) #load the psych package
describe(mtcars[myvars])
```

In [10]:

```
#descriptive statistics by group using aggregate()
#example 1: mean
aggregate(mtcars[myvars], by = list(am = mtcars$am), mean)
```

In [11]:

```
#descriptive statistics by group using aggregate()
#example 2: standard deviation
aggregate(mtcars[myvars], by = list(am = mtcars$am), sd)
```

In [12]:

```
#descriptive statistics by group using by()
dstats <- function(x) { #create a user-defined function
    sapply(x, mystats)
}
by(mtcars[myvars], mtcars$am, dstats)
```

In [13]:

```
#summary statistics by group using the summaryBy() function in the doBy package
install.packages("doBy") #install doBy
library(doBy) #load the doBy package
summaryBy(mpg + hp + wt ~ am, data = mtcars, FUN = mystats)
```

In [14]:

```
#summary statistics by group using the describe.by() function in the psych package
library(psych) #load the psych package
describeBy(mtcars[myvars], list(am = mtcars$am))
```

In [15]:

```
#in this section we will look at frequency and contingency tables from categorical variables,
#along with tests of independence, measures of association, and methods for graphically displaying results
#we will use the Arthritis dataset from the vcd package
library(vcd) #load the vcd package
head(Arthritis) #display the first 6 rows
```

In [16]:

```
mytable <- with(Arthritis, table(Improved))
mytable
```

In [17]:

```
#convert frequencies into proportions using prop.table():
prop.table(mytable)
```

In [18]:

```
#convert frequencies into percentages using prop.table() * 100:
prop.table(mytable) * 100
```

In [19]:

```
mytable <- xtabs(~ Treatment + Improved, data = Arthritis)
mytable
```

In [20]:

```
#you can generate marginal frequencies and proportions using the
#margin.table() and prop.table() functions
```

In [21]:

```
#for row sums and row proportions, you have:
margin.table(mytable, 1)
prop.table(mytable, 1)
```

In [22]:

```
#for column sums and column proportions, you have:
margin.table(mytable, 2)
prop.table(mytable, 2)
```

In [23]:

```
#cell proportions are obtained by:
prop.table(mytable)
```

In [24]:

```
#you can use addmargins() to add marginal sums to these tables
addmargins(mytable)
addmargins(prop.table(mytable))
```

In [25]:

```
#if you only want to add a sum column:
addmargins(prop.table(mytable, 1), 2)
```

In [26]:

```
#if you only want to add a sum row:
addmargins(prop.table(mytable, 2), 1)
```

In [27]:

```
#a third method of creating two-way tables is the CrossTable() function in the "gmodels" package:
install.packages("gmodels")
library(gmodels)
CrossTable(Arthritis$Treatment, Arthritis$Improved)
```

In [28]:

```
#if you have more than two categorical variables, you're dealing with multidimensional tables
#here's an example of a three-way contingency table
mytable <- xtabs(~ Treatment + Sex + Improved, data = Arthritis) #cell frequencies
mytable
```

In [29]:

```
ftable(mytable) #presents multidimensional tables in a compact and attractive manner
```

In [30]:

```
margin.table(mytable, 1) #marginal frequencies for the first dimension (Treatment)
```

In [31]:

```
margin.table(mytable, 2) #marginal frequencies for the second dimension (Sex)
```

In [32]:

```
margin.table(mytable, 3) #marginal frequencies for the third dimension (Improved)
```

In [33]:

```
margin.table(mytable, c(1, 3)) #treatment X improved marginal frequencies
```

In [34]:

```
ftable(prop.table(mytable, c(1, 2))) #improved proportions for treatment X sex
```

In [35]:

```
ftable(addmargins(prop.table(mytable, c(1, 2)), 3)) #improved proportions for treatment X sex with row totals
```

In [36]:

```
ftable(addmargins(prop.table(mytable, c(1, 2)), 3)) * 100 #improved percentages for treatment X sex with row totals
```

In [37]:

```
#you can apply the function chisq.test() to a two-way table to produce a chi-square test of independence
#of the row and column variables. example:
library(vcd) #load the vcd package for the Arthritis dataset
mytable <- xtabs(~ Treatment + Improved, data = Arthritis) #create a two-way table
chisq.test(mytable) #run the chi-square test
```

In [38]:

```
#let us understand the result:
#let us begin with a hypothesis: treatment type and outcome are independent
#the p-value is the probability of obtaining the sampled results, assuming independence of
#the row and column variables in the population.
#because the p-value < 0.01, you reject the hypothesis that treatment type is independent of the outcome!
```
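To see where the statistic comes from, here is a sketch that recomputes the chi-square statistic by hand as sum((O - E)^2 / E). It uses a base-R table built from mtcars (am by cyl) rather than the Arthritis table, so it runs even if vcd is not installed; the warning about small expected counts is suppressed since we only want the statistic:

```
#recompute the chi-square statistic by hand
tab <- table(mtcars$am, mtcars$cyl)                       #observed counts O
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)  #expected counts E under independence
by.hand <- sum((tab - expected)^2 / expected)
built.in <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
c(by.hand = by.hand, built.in = unname(built.in))         #the two values agree
```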

In [39]:

```
#another example:
mytable <- xtabs(~ Improved + Sex, data = Arthritis) #create a two-way table
chisq.test(mytable) #run the chi-square test
```

In [40]:

```
#let us understand the result:
#let us begin with a hypothesis: outcome and sex are independent
#the p-value is the probability of obtaining the sampled results, assuming independence of
#the row and column variables in the population.
#because the p-value > 0.05, you fail to reject the hypothesis that outcome is independent of sex
```

In [41]:

```
#conclusion:
#in a chi-square test of independence:
#the null hypothesis is that the two variables are independent
#if the p-value is below your chosen significance level (e.g. 0.05 or 0.01), you reject the null hypothesis
#if the p-value is above it, you fail to reject the null hypothesis (which is not the same as proving independence)
```

In [42]:

```
#Fisher's exact test evaluates the null hypothesis of independence of rows and columns
#in a contingency table with fixed marginals. The format is fisher.test(mytable) where
#mytable is a two-way table. In contrast to many statistical packages, the fisher.test()
#function can be applied to any two-way table with two or more rows and columns, not
#just a 2X2 table. here's an example of a Fisher's exact test:
library(vcd) #load the vcd package for the Arthritis dataset
mytable <- xtabs(~ Treatment + Improved, data = Arthritis) #create the 2-way table
fisher.test(mytable) #run the test
```

In [46]:

```
#because p-value is < 0.01, we reject the null hypothesis that Treatment and Improved are independent of each other
```

In [44]:

```
#the mantelhaen.test() provides a Cochran-Mantel-Haenszel chi-square test of the null hypothesis
#that two nominal variables are conditionally independent in each stratum of a third variable.
#the following code tests the hypothesis that the Treatment and Improved variables are independent
#within each level for Sex. The test assumes that there is no three-way (Treatment X Improved X Sex) interaction:
library(vcd) #load the vcd package for the Arthritis dataset
mytable <- xtabs(~ Treatment + Improved + Sex, data = Arthritis) #create the 3-way table
mantelhaen.test(mytable) #run the test
```

In [45]:

```
#7e-04 = 7 * 10^-4 = 0.0007
#because p-value is < 0.01, we reject the null hypothesis that Treatment and Improved are independent of each other when controlled for Sex
```

In [48]:

```
#if you reject the null hypothesis of independence using the chi-square, Fisher's exact, or CMH test,
#you can try to gauge the strength of the dependence or association.
#the assocstats() function of the vcd package calculates the phi coefficient, contingency coefficient and Cramer's V for a 2-way table
#in general, LARGER magnitudes indicate STRONGER associations. here's an example:
library(vcd) #load the vcd package for the Arthritis dataset
mytable <- xtabs(~ Treatment + Improved, data = Arthritis) #create the 2-way table
assocstats(mytable) #run the test
```

In [49]:

```
#Correlation coefficients are used to describe relationships among quantitative variables
#the sign +- indicates the direction of the relationship (positive or negative)
#the magnitude indicates strength of the relationship (ranging from 0 for no relationship to 1 for a perfectly predictable relationship)
```

In [51]:

```
#the Pearson product-moment correlation assesses the degree of linear relationship between 2 quantitative variables
#Spearman's rank-order correlation coefficient assesses the degree of relationship between 2 rank-ordered variables
#Kendall's tau is also a nonparametric measure of rank correlation
```
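The distinction matters in practice: for a monotone but nonlinear relationship, Spearman's coefficient is 1 (the ranks agree perfectly) while Pearson's is below 1. A small illustration on made-up data:

```
#a monotone but nonlinear relationship
x <- 1:10
y <- x^3
cor(x, y, method = "pearson")  #less than 1: the relationship is not linear
cor(x, y, method = "spearman") #1: the rank ordering agrees perfectly
```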

In [52]:

```
#the cor() function produces all 3 correlation coefficients whereas the cov() function produces covariances
#the simplified format is: cor(x, use = , method = );
#where x = matrix or data frame; use = specifies the handling of missing data (default = everything); method = method (default = "pearson")
#examples follow
```

In [53]:

```
states <- state.x77[, 1:6] #choose a subset of the state.x77 dataset; pick columns 1 through 6 and store them in the "states" object
```

In [57]:

```
cor(states) #uses defaults for use (everything) and method (pearson)
```

In [58]:

```
cor(states, method = "spearman") #uses defaults for use (everything)
```

In [59]:

```
#you get square matrices by default (i.e. all variables crossed with all other variables)
#you can also produce non-square matrices as follows:
x <- states[, c("Population", "Income", "Illiteracy", "HS Grad")]
y <- states[, c("Life Exp", "Murder")]
cor(x, y)
```

In [62]:

```
#a partial correlation is a correlation between 2 quantitative variables, controlling for one or more quantitative variables
#the pcor() function in the ggm package provides partial correlation coefficients. here's an example:
install.packages("ggm") #install the ggm package
library(ggm) #load the ggm package
colnames(states) #names of the columns of the states.x77 dataset in R
```

In [69]:

```
#the form of the pcor() function is as follows:
#pcor(u, S) where u is a vector of numbers, with the first two numbers being the indices of the numbers to be correlated
#and the remaining numbers being the indices of the conditioning variables
#S is the covariance matrix among the variables
pcor(c(1, 5, 2, 3, 6), stats::cov(states)) #here, 1 = Population and 5 = Murder are the 2 variables to be correlated
#Income, Illiteracy and HS Grad are the conditioning variables
#cov(states) is the covariance matrix among variables
#to get the cov() function to work, we specify the package name by using stats::cov(dataset)
```
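As a cross-check that needs no extra package, the same partial correlation can be recovered in base R by regressing each variable of interest on the conditioning variables and correlating the residuals (note that as.data.frame() turns the column name "HS Grad" into HS.Grad):

```
#partial correlation of Population and Murder, controlling for Income, Illiteracy and HS Grad
df <- as.data.frame(state.x77[, 1:6])
r.pop <- resid(lm(Population ~ Income + Illiteracy + HS.Grad, data = df))
r.mur <- resid(lm(Murder ~ Income + Illiteracy + HS.Grad, data = df))
cor(r.pop, r.mur) #matches the pcor() result of ~0.346
```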

In [71]:

```
#in this case 0.346 is the correlation between population and murder rate controlling for:
#income, illiteracy and hs graduation rate
```

In [65]:

```
#the hetcor() function in the polycor package can compute a heterogeneous correlation matrix containing
#Pearson product-moment correlations between numeric variables,
#polyserial correlations between numeric and ordinal variables,
#polychoric correlations between ordinal variables, and
#tetrachoric correlations between two dichotomous variables.
#Polyserial, polychoric and tetrachoric correlations assume that the ordinal or dichotomous variables
#are derived from underlying normal distributions.
```

In [72]:

```
#you can use the cor.test() function to test an individual correlation coefficient
#the format is: cor.test(x, y, alternative = , method = ), where
#x and y are the variables to be correlated
#alternative could be "two.sided" (default; population correlation isn't equal to 0); "less" (population correlation < 0); or
#"greater" (population correlation > 0)
#method could be either "pearson" (default), "kendall" or "spearman"
#here's an example:
cor.test(states[, 3], states[, 5]) #testing the correlation between illiteracy and murder rate
#the defaults are: alternative = "two.sided" and method = "pearson"
```

In [74]:

```
#the code above tests that the Pearson correlation between illiteracy and murder rate is 0.
#if the population correlation between these variables was in fact 0, then you would expect to see
#a sample correlation as large as 0.703 less than 1 time out of 100 million (i.e. p-value = 1e-08 or 1 / 10 ^ 8)
#since this is unlikely, you reject the null hypothesis that the correlation between illiteracy and murder rate is 0,
#and accept the alternative hypothesis that the correlation between illiteracy and murder rate is NOT 0.
```

In [73]:

```
#using cor.test(), you can only test one correlation at a time
#the corr.test() function in the "psych" package allows you to go further. here's an example:
library(psych) #load the psych library
corr.test(states, use = "complete") #"complete" = listwise deletion of missing values
```

In [75]:

```
#in the above code, you can see that the correlation between population size and HS grad rate (-0.10)
#is not significantly different from 0 (p = 0.50).
```

In [76]:

```
#t-tests help in the comparison of two groups
```

In [78]:

```
#are you more likely to be imprisoned if you commit a crime in the South of the US?
#the comparison of interest is Southern versus non-Southern states,
#and the dependent variable is the probability of incarceration.
#a two-group independent t-test can be used to test the hypothesis that the two population means are equal
#we will use the UScrime dataset from the MASS package to test this hypothesis
library(MASS) #attach the MASS package
t.test(Prob ~ So, data = UScrime) #Prob = probability of incarceration; So = indicator variable for southern states
```

In [79]:

```
#from the results above, you can reject the null hypothesis that Southern states and non-Southern states,
#have equal probabilities of imprisonment (p < 0.001)
```

In [80]:

```
#is the unemployment rate for younger males (14 - 24) greater than that for older males (35 - 39)?
#in this case, the 2 groups are not independent since you would not expect the unemployment
#rate for younger and older males in Alabama to be unrelated
#here's an example:
library(MASS) #attach the MASS package
sapply(UScrime[c("U1", "U2")], function(x)(c(mean = mean(x), sd = sd(x)))) #use sapply() to calculate mean and sd for the 2 groups
```

In [81]:

```
with(UScrime, t.test(U1, U2, paired = TRUE))
```

In [82]:

```
#the mean difference (61.5) is large enough to warrant rejection of the hypothesis that the mean
#unemployment rate for older and younger males is the same. younger males have a higher rate.
#the probability of obtaining a sample difference this large IF the population means are equal is
#less than 0.0000000000000002 (i.e. a p-value of 2e-16 or 2 * 10 ^ -16)
```

In [83]:

```
#if outcome variables are severely skewed or ordinal in nature, you may wish to use nonparametric tests
```

In [84]:

```
#if two groups are independent, you can use the Wilcoxon rank sum test (aka Mann-Whitney U test)
#here's an example:
library(MASS) #load the MASS package
with(UScrime, by(Prob, So, median)) #calculate the median probability of incarceration for Southern v/s non-Southern states
```

In [85]:

```
wilcox.test(Prob ~ So, data = UScrime) #run the test
```

In [86]:

```
#you can reject the null hypothesis that the probability of incarceration is the same in
#Southern and non-Southern states (since the p-value < 0.001; 8e-05 = 8 * 10 ^ -5)
```

In [87]:

```
#if the two groups are paired, you can apply the same test but with the paired = TRUE option
#here's an example:
sapply(UScrime[c("U1", "U2")], median) #calculate the median unemployment rate for young and old male age groups
```

In [88]:

```
with(UScrime, wilcox.test(U1, U2, paired = TRUE)) #run the test with the paired = TRUE option
```

In [89]:

```
#again, we can reject the null hypothesis since the p-value < 0.001
```

In [90]:

```
#if the groups are independent, you can use the Kruskal-Wallis test;
#if the groups are dependent, you can use the Friedman test
```
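The Friedman test gets no example below, so here is a minimal sketch on simulated data (the data are made up purely for illustration); friedman.test() accepts a matrix with the matched blocks in rows and the dependent groups in columns:

```
#Friedman rank-sum test on a blocks-by-groups matrix of simulated scores
set.seed(42)
scores <- matrix(rnorm(30), nrow = 10, ncol = 3, #10 blocks, 3 dependent groups
                 dimnames = list(NULL, c("g1", "g2", "g3")))
friedman.test(scores)
```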

In [91]:

```
#here's an example of a Kruskal-Wallis test
states <- data.frame(state.region, state.x77) #add the region designation to the dataset
kruskal.test(Illiteracy ~ state.region, data = states) #apply the test
```

In [92]:

```
#this significance test suggests that the illiteracy rate isn't the same in each of the four
#regions of the country (p-value < 0.001; p-value = 5e-05 = 5 * 10 ^ -5)
```

In [94]:

```
#nonparametric multiple comparisons using the wmc() function
source("http://www.statmethods.net/RiA/wmc.txt") #access the function
states <- data.frame(state.region, state.x77) #create the data frame
wmc(Illiteracy ~ state.region, data = states, method = "holm") #run the test
```

In [95]:

```
#you can see from the test results that the South differs significantly
#from the other three regions and that the other three regions do not
#differ from each other at a p < 0.05 level.
```