In [70]:
#@author: Venky Rao raove@us.ibm.com
#@last edited: 13 Sep 2017
#@source: materials, data and examples adapted from R in Action 2nd Edition by Dr. Robert Kabacoff


# Basic Statistics

## Descriptive statistics

In [2]:
#in this section we will review measures of central tendency, variability and distribution shape
#for continuous variables
#we will use the mtcars dataset that ships with the base installation of R
myvars <- c("mpg", "hp", "wt")
head(mtcars[myvars]) #display the first 6 rows

                   mpg  hp    wt
Mazda RX4         21.0 110 2.620
Mazda RX4 Wag     21.0 110 2.875
Datsun 710        22.8  93 2.320
Hornet 4 Drive    21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant           18.1 105 3.460
In [3]:
#first we will look at descriptive statistics for all 32 cars.  Then, we will examine
#descriptive statistics by transmission type (am) and number of cylinders (cyl)
#transmission type is a dichotomous variable coded 0 = automatic, 1 = manual
#number of cylinders can be 4, 6 or 8

In [4]:
#in the base installation of R, you can use the summary() function to obtain
#descriptive statistics
summary(mtcars[myvars])

      mpg              hp              wt
Min.   :10.40   Min.   : 52.0   Min.   :1.513
1st Qu.:15.43   1st Qu.: 96.5   1st Qu.:2.581
Median :19.20   Median :123.0   Median :3.325
Mean   :20.09   Mean   :146.7   Mean   :3.217
3rd Qu.:22.80   3rd Qu.:180.0   3rd Qu.:3.610
Max.   :33.90   Max.   :335.0   Max.   :5.424  
In [5]:
#creating a user defined function for summary statistics
mystats <- function(x, na.omit = FALSE) {
    if (na.omit) {
        x <- x[!is.na(x)]
    }
    m <- mean(x)
    n <- length(x)
    s <- sd(x)
    skew <- sum((x - m)^3 / s^3) / n
    kurt <- sum((x - m)^4 / s^4) / n - 3
    return(c(n = n, mean = m, stdev = s, skew = skew, kurtosis = kurt))
}
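As a quick sanity check (a made-up vector, not part of the mtcars example), the function can be exercised with and without the na.omit option:

```r
#sanity check of mystats on a small made-up vector (not part of the mtcars example)
mystats <- function(x, na.omit = FALSE) {
    if (na.omit) {
        x <- x[!is.na(x)]
    }
    m <- mean(x)
    n <- length(x)
    s <- sd(x)
    skew <- sum((x - m)^3 / s^3) / n
    kurt <- sum((x - m)^4 / s^4) / n - 3
    c(n = n, mean = m, stdev = s, skew = skew, kurtosis = kurt)
}
x <- c(1, 2, 3, 4, 100, NA)
mystats(x, na.omit = TRUE) #n = 5, mean = 22; the outlier produces a large positive skew
```

Without na.omit = TRUE, the NA propagates and mean(), sd() and the moments all return NA.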

In [6]:
#applying our user defined function to our dataset
options(digits = 3) #keep decimal places to 3 to enhance readability
sapply(mtcars[myvars], mystats)

            mpg      hp       wt
n        32.000  32.000  32.0000
mean     20.091 146.688   3.2172
stdev     6.027  68.563   0.9785
skew      0.611   0.726   0.4231
kurtosis -0.373  -0.136  -0.0227
In [7]:
#descriptive statistics via the describe() function in the Hmisc package
library(Hmisc) #load the Hmisc package
describe(mtcars[myvars])


mtcars[myvars]

3  Variables      32  Observations
--------------------------------------------------------------------------------
mpg
n  missing distinct     Info     Mean      Gmd      .05      .10
32        0       25    0.999    20.09    6.796    12.00    14.34
.25      .50      .75      .90      .95
15.43    19.20    22.80    30.09    31.30

lowest : 10.4 13.3 14.3 14.7 15.0, highest: 26.0 27.3 30.4 32.4 33.9
--------------------------------------------------------------------------------
hp
n  missing distinct     Info     Mean      Gmd      .05      .10
32        0       22    0.997    146.7    77.04    63.65    66.00
.25      .50      .75      .90      .95
96.50   123.00   180.00   243.50   253.55

lowest :  52  62  65  66  91, highest: 215 230 245 264 335
--------------------------------------------------------------------------------
wt
n  missing distinct     Info     Mean      Gmd      .05      .10
32        0       29    0.999    3.217    1.089    1.736    1.956
.25      .50      .75      .90      .95
2.581    3.325    3.610    4.048    5.293

lowest : 1.51 1.61 1.83 1.94 2.14, highest: 3.85 4.07 5.25 5.34 5.42
--------------------------------------------------------------------------------
In [8]:
#descriptive statistics via the stat.desc() function in the pastecs package
install.packages("pastecs") #install pastecs
library(pastecs) #load the pastecs package
stat.desc(mtcars[myvars])


                mpg       hp      wt
nbr.val       32.00   32.000  32.000
nbr.null       0.00    0.000   0.000
nbr.na         0.00    0.000   0.000
min           10.40   52.000   1.513
max           33.90  335.000   5.424
range         23.50  283.000   3.911
sum          642.90 4694.000 102.952
median        19.20  123.000   3.325
mean          20.09  146.688   3.217
SE.mean        1.07   12.120   0.173
CI.mean.0.95   2.17   24.720   0.353
var           36.32 4700.867   0.957
std.dev        6.03   68.563   0.978
coef.var       0.30    0.467   0.304
In [9]:
#descriptive statistics via the describe() function in the psych package
library(psych) #load the psych package
describe(mtcars[myvars])


    vars  n   mean     sd median trimmed    mad   min    max  range  skew kurtosis     se
mpg    1 32  20.09  6.027  19.20   19.70  5.411 10.40  33.90  23.50 0.611  -0.3728  1.065
hp     2 32 146.69 68.563 123.00  141.19 77.095 52.00 335.00 283.00 0.726  -0.1356 12.120
wt     3 32   3.22  0.978   3.33    3.15  0.767  1.51   5.42   3.91 0.423  -0.0227  0.173

### Descriptive statistics by group

In [10]:
#descriptive statistics by group using aggregate()
#example 1: mean
aggregate(mtcars[myvars], by = list(am = mtcars$am), mean)

  am  mpg  hp   wt
1  0 17.1 160 3.77
2  1 24.4 127 2.41
In [11]:
#descriptive statistics by group using aggregate()
#example 2: standard deviation
aggregate(mtcars[myvars], by = list(am = mtcars$am), sd)

  am  mpg   hp    wt
1  0 3.83 53.9 0.777
2  1 6.17 84.1 0.617
In [12]:
#descriptive statistics by group using by()
dstats <- function(x) { #create a user-defined function
    sapply(x, mystats)
}
by(mtcars[myvars], mtcars$am, dstats)

mtcars$am: 0
mpg       hp     wt
n        19.000  19.0000 19.000
mean     17.147 160.2632  3.769
stdev     3.834  53.9082  0.777
skew      0.014  -0.0142  0.976
kurtosis -0.803  -1.2097  0.142
------------------------------------------------------------
mtcars$am: 1
             mpg      hp     wt
n        13.0000  13.000 13.000
mean     24.3923 126.846  2.411
stdev     6.1665  84.062  0.617
skew      0.0526   1.360  0.210
kurtosis -1.4554   0.563 -1.174
In [13]:
#summary statistics by group using the summaryBy() function in the doBy package
install.packages("doBy") #install doBy
library(doBy) #load the doBy package
summaryBy(mpg + hp + wt ~ am, data = mtcars, FUN = mystats)

  am mpg.n mpg.mean mpg.stdev mpg.skew mpg.kurtosis hp.n hp.mean hp.stdev hp.skew hp.kurtosis wt.n wt.mean wt.stdev wt.skew wt.kurtosis
1  0    19     17.1      3.83   0.0140       -0.803   19     160     53.9 -0.0142      -1.210   19    3.77    0.777   0.976       0.142
2  1    13     24.4      6.17   0.0526       -1.455   13     127     84.1  1.3599       0.563   13    2.41    0.617   0.210      -1.174
In [14]:
#summary statistics by group using the describeBy() function in the psych package
library(psych) #load the psych package
describeBy(mtcars[myvars], list(am = mtcars$am))

$0
    vars  n   mean    sd median trimmed   mad   min    max  range  skew
mpg    1 19  17.15  3.83  17.30   17.12  3.11 10.40  24.40  14.00  0.01
hp     2 19 160.26 53.91 175.00  161.06 77.10 62.00 245.00 183.00 -0.01
wt     3 19   3.77  0.78   3.52    3.75  0.45  2.46   5.42   2.96  0.98
    kurtosis    se
mpg    -0.80  0.88
hp     -1.21 12.37
wt      0.14  0.18

$1
vars  n   mean    sd median trimmed   mad   min    max  range skew kurtosis
mpg    1 13  24.39  6.17  22.80   24.38  6.67 15.00  33.90  18.90 0.05    -1.46
hp     2 13 126.85 84.06 109.00  114.73 63.75 52.00 335.00 283.00 1.36     0.56
wt     3 13   2.41  0.62   2.32    2.39  0.68  1.51   3.57   2.06 0.21    -1.17
se
mpg  1.71
hp  23.31
wt   0.17

attr(,"call")
by.data.frame(data = x, INDICES = group, FUN = describe, type = type)

## Frequency and contingency tables

In [15]:
#in this section we will look at frequency and contingency tables for categorical variables,
#along with tests of independence, measures of association, and methods for graphically displaying results
#we will use the Arthritis dataset from the vcd package
library(vcd) #load the vcd package
head(Arthritis) #display the first 6 rows


  ID Treatment  Sex Age Improved
1 57   Treated Male  27     Some
2 46   Treated Male  29     None
3 77   Treated Male  30     None
4 17   Treated Male  32   Marked
5 36   Treated Male  46   Marked
6 23   Treated Male  58   Marked

### Generating frequency tables

#### One way tables

In [16]:
mytable <- with(Arthritis, table(Improved))
mytable

Improved
None   Some Marked
42     14     28 
In [17]:
#convert frequencies into proportions using prop.table():
prop.table(mytable)

Improved
None   Some Marked
0.500  0.167  0.333 
In [18]:
#convert frequencies into percentages using prop.table() * 100:
prop.table(mytable) * 100

Improved
None   Some Marked
50.0   16.7   33.3 

#### Two way tables

In [19]:
mytable <- xtabs(~ Treatment + Improved, data = Arthritis)
mytable

         Improved
Treatment None Some Marked
Placebo   29    7      7
Treated   13    7     21
In [20]:
#you can generate marginal frequencies and proportions using the
#margin.table() and prop.table() functions

In [21]:
#for row sums and row proportions, you have:
margin.table(mytable, 1)
prop.table(mytable, 1)

Treatment
Placebo Treated
43      41 
         Improved
Treatment  None  Some Marked
Placebo 0.674 0.163  0.163
Treated 0.317 0.171  0.512
In [22]:
#for column sums and column proportions, you have:
margin.table(mytable, 2)
prop.table(mytable, 2)

Improved
None   Some Marked
42     14     28 
         Improved
Treatment None Some Marked
Placebo 0.69 0.50   0.25
Treated 0.31 0.50   0.75
In [23]:
#cell proportions are obtained by:
prop.table(mytable)

         Improved
Treatment   None   Some Marked
Placebo 0.3452 0.0833 0.0833
Treated 0.1548 0.0833 0.2500
In [24]:
#you can use addmargins() to add marginal sums to these tables
addmargins(mytable) #add row and column sums to the frequency table
addmargins(prop.table(mytable)) #add row and column sums to the proportion table

         Improved
Treatment None Some Marked Sum
  Placebo   29    7      7  43
  Treated   13    7     21  41
  Sum       42   14     28  84
         Improved
Treatment  None   Some Marked   Sum
  Placebo 0.345 0.0833 0.0833 0.512
  Treated 0.155 0.0833 0.2500 0.488
  Sum     0.500 0.1667 0.3333 1.000
In [25]:
#if you only want to add a sum column:
addmargins(prop.table(mytable, 1), 2) #row proportions with a Sum column

         Improved
Treatment  None  Some Marked Sum
  Placebo 0.674 0.163  0.163   1
  Treated 0.317 0.171  0.512   1
In [26]:
#if you only want to add a sum row:
addmargins(prop.table(mytable, 2), 1) #column proportions with a Sum row

         Improved
Treatment None Some Marked
  Placebo 0.69  0.5   0.25
  Treated 0.31  0.5   0.75
  Sum     1.00  1.0   1.00
In [27]:
#a third method of creating two-way tables is the CrossTable() function in the "gmodels" package:
install.packages("gmodels")
library(gmodels)
CrossTable(Arthritis$Treatment, Arthritis$Improved)

Installing package into ‘/gpfs/global_fs01/sym_shared/YPProdSpark/user/s17c-9f3318fc11f06c-d37a4b9405b6/R/libs’
(as ‘lib’ is unspecified)


Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  84

                    | Arthritis$Improved
Arthritis$Treatment |      None |      Some |    Marked | Row Total |
--------------------|-----------|-----------|-----------|-----------|
            Placebo |        29 |         7 |         7 |        43 |
                    |     2.616 |     0.004 |     3.752 |           |
                    |     0.674 |     0.163 |     0.163 |     0.512 |
                    |     0.690 |     0.500 |     0.250 |           |
                    |     0.345 |     0.083 |     0.083 |           |
--------------------|-----------|-----------|-----------|-----------|
            Treated |        13 |         7 |        21 |        41 |
                    |     2.744 |     0.004 |     3.935 |           |
                    |     0.317 |     0.171 |     0.512 |     0.488 |
                    |     0.310 |     0.500 |     0.750 |           |
                    |     0.155 |     0.083 |     0.250 |           |
--------------------|-----------|-----------|-----------|-----------|
       Column Total |        42 |        14 |        28 |        84 |
                    |     0.500 |     0.167 |     0.333 |           |
--------------------|-----------|-----------|-----------|-----------|



#### Multidimensional tables

In [28]:
#if you have more than two categorical variables, you're dealing with multidimensional tables
#here's an example of a three-way contingency table
mytable <- xtabs(~ Treatment + Sex + Improved, data = Arthritis) #cell frequencies
mytable

, , Improved = None

Sex
Treatment Female Male
Placebo     19   10
Treated      6    7

, , Improved = Some

Sex
Treatment Female Male
Placebo      7    0
Treated      5    2

, , Improved = Marked

Sex
Treatment Female Male
Placebo      6    1
Treated     16    5

In [29]:
ftable(mytable) #presents multidimensional tables in a compact and attractive manner

                 Improved None Some Marked
Treatment Sex
Placebo   Female            19    7      6
          Male              10    0      1
Treated   Female             6    5     16
          Male               7    2      5
In [30]:
margin.table(mytable, 1) #marginal frequencies by row (Treatment)

Treatment
Placebo Treated
43      41 
In [31]:
margin.table(mytable, 2) #marginal frequencies for the second dimension (Sex)

Sex
Female   Male
59     25 
In [32]:
margin.table(mytable, 3) #marginal frequencies for the third dimension (Improved)

Improved
None   Some Marked
42     14     28 
In [33]:
margin.table(mytable, c(1, 3)) #treatment X improved marginal frequencies

         Improved
Treatment None Some Marked
Placebo   29    7      7
Treated   13    7     21
In [34]:
ftable(prop.table(mytable, c(1, 2))) #improved proportions for treatment X sex

                 Improved   None   Some Marked
Treatment Sex
Placebo   Female          0.5938 0.2188 0.1875
          Male            0.9091 0.0000 0.0909
Treated   Female          0.2222 0.1852 0.5926
          Male            0.5000 0.1429 0.3571
In [35]:
ftable(addmargins(prop.table(mytable, c(1, 2)), 3)) #improved proportions for treatment X sex with row totals

                 Improved   None   Some Marked    Sum
Treatment Sex
Placebo   Female          0.5938 0.2188 0.1875 1.0000
          Male            0.9091 0.0000 0.0909 1.0000
Treated   Female          0.2222 0.1852 0.5926 1.0000
          Male            0.5000 0.1429 0.3571 1.0000
In [36]:
ftable(addmargins(prop.table(mytable, c(1, 2)), 3)) * 100 #improved percentages for treatment X sex with row totals

                 Improved   None   Some Marked    Sum
Treatment Sex
Placebo   Female           59.38  21.88  18.75 100.00
          Male             90.91   0.00   9.09 100.00
Treated   Female           22.22  18.52  59.26 100.00
          Male             50.00  14.29  35.71 100.00

### Tests of independence

#### Chi-Square test of independence

In [37]:
#you can apply the function chisq.test() to a two-way table to produce a chi-square test of independence
#of the row and column variables.  example:
library(vcd) #load the vcd package for the Arthritis dataset
mytable <- xtabs(~ Treatment + Improved, data = Arthritis) #create a two-way table
chisq.test(mytable) #run the chi-square test

	Pearson's Chi-squared test

data:  mytable
X-squared = 10, df = 2, p-value = 0.001

In [38]:
#let us understand the result:
#let us begin with a null hypothesis: treatment type and outcome are independent
#the p-value is the probability of obtaining the sampled results, assuming independence of
#the row and column variables in the population.
#because p-value < 0.01, you reject the hypothesis that the treatment type is independent from the outcome!

In [39]:
#another example:
mytable <- xtabs(~ Improved + Sex, data = Arthritis) #create a two-way table
chisq.test(mytable) #run the chi-square test

Warning message in chisq.test(mytable):
“Chi-squared approximation may be incorrect”
	Pearson's Chi-squared test

data:  mytable
X-squared = 5, df = 2, p-value = 0.09

In [40]:
#let us understand the result:
#let us begin with a null hypothesis: outcome and sex are independent
#the p-value is the probability of obtaining the sampled results, assuming independence of
#the row and column variables in the population.
#because the p-value > 0.05, you fail to reject the hypothesis that outcome is independent of sex
#(note: failing to reject the null hypothesis is not the same as proving it)

In [41]:
#conclusion:
#in a chi-square test of independence:
#the hypothesis is that two variables are independent
#if the p-value < 0.01, you reject the hypothesis
#if the p-value > 0.05, you confirm the hypothesis
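As a cross-check (a sketch using base R's pchisq(); the 13.055 statistic and df = 2 come from the Treatment x Improved test above), the reported p-value can be reproduced directly from the chi-square distribution:

```r
#recover the chi-square p-value by hand from the test statistic
x2 <- 13.055 #Pearson chi-square statistic from the Treatment x Improved table
p <- pchisq(x2, df = 2, lower.tail = FALSE) #upper-tail area of the chi-square distribution
round(p, 4) #about 0.0015, comfortably below 0.01
```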


#### Fisher's exact test

In [42]:
#Fisher's exact test evaluates the null hypothesis of independence of rows and columns
#in a contingency table with fixed marginals.  The format is fisher.test(mytable) where
#mytable is a two-way table.  In contrast to many statistical packages, the fisher.test()
#function can be applied to any two-way table with two or more rows and columns, not
#just a 2X2 table.  here's an example of a Fisher's exact test:
library(vcd) #load the vcd package for the Arthritis dataset
mytable <- xtabs(~ Treatment + Improved, data = Arthritis) #create the 2-way table
fisher.test(mytable) #run the test

	Fisher's Exact Test for Count Data

data:  mytable
p-value = 0.001
alternative hypothesis: two.sided

In [46]:
#because p-value is < 0.01, we reject the null hypothesis that Treatment and Improved are independent of each other


#### Cochran-Mantel-Haenszel test

In [44]:
#the mantelhaen.test() provides a Cochran-Mantel-Haenszel chi-square test of the null hypothesis
#that two nominal variables are conditionally independent in each stratum of a third variable.
#the following code tests the hypothesis that the Treatment and Improved variables are independent
#within each level for Sex.  The test assumes that there is no three-way (Treatment X Improved X Sex) interaction:
library(vcd) #load the vcd package for the Arthritis dataset
mytable <- xtabs(~ Treatment + Improved + Sex, data = Arthritis) #create the 3-way table
mantelhaen.test(mytable) #run the test

	Cochran-Mantel-Haenszel test

data:  mytable
Cochran-Mantel-Haenszel M^2 = 10, df = 2, p-value = 7e-04

In [45]:
#7e-04 = 7 * 10^-4 = 0.0007
#because p-value is < 0.01, we reject the null hypothesis that Treatment and Improved are independent of each other when controlled for Sex


### Measures of association

In [48]:
#if you reject the null hypothesis of independence of variables using either the chi-square or the Fisher's exact of the CMH test,
#you try to guage the strength of the dependence or association.
#the assocstats() function of the vcd package calculates the phi-coefficient, contingency coefficient and Cramer's V for a 2-way table
#in general, LOWER magnitudes indicate STRONGER associations.  here's an example:
library(vcd) #load the vcd package for the Arthritis dataset
mytable <- xtabs(~ Treatment + Improved, data = Arthritis) #create the 2-way table
assocstats(mytable) #run the test

                    X^2 df  P(> X^2)
Likelihood Ratio 13.530  2 0.0011536
Pearson          13.055  2 0.0014626

Phi-Coefficient   : NA
Contingency Coeff.: 0.367
Cramer's V        : 0.394 

## Correlations

In [49]:
#Correlation coefficients are used to describe relationships among quantitative variables
#the sign (+ or -) indicates the direction of the relationship (positive or negative)
#the magnitude indicates strength of the relationship (ranging from 0 for no relationship to 1 for a perfectly predictable relationship)


### Types of correlations

#### Pearson, Spearman and Kendall correlations

In [51]:
#the Pearson product-moment correlation assesses the degree of linear relationship between 2 quantitative variables
#Spearman's rank-order correlation coefficient assesses the degree of relationship between 2 rank-ordered variables
#Kendall's tau is also a nonparametric measure of rank correlation
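A minimal illustration of the difference (made-up data, not from the notebook): for a monotone but nonlinear relationship, the rank-based measures are exactly 1 while Pearson is not:

```r
#Pearson vs rank-based correlations on a monotone but nonlinear relationship
x <- 1:10
y <- x^3 #strictly increasing in x, but far from linear
cor(x, y, method = "pearson")  #less than 1: penalized for the curvature
cor(x, y, method = "spearman") #exactly 1: the ranks agree perfectly
cor(x, y, method = "kendall")  #exactly 1: every pair of points is concordant
```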

In [52]:
#the cor() function produces all 3 correlation coefficients whereas the cov() function produces covariances
#the simplified format is: cor(x, use = , method = );
#where x = matrix or data frame; use = specifies the handling of missing data (default = everything); method = method (default = "pearson")
#examples follow

In [53]:
states <- state.x77[, 1:6] #choose a subset of the state.x77 dataset; pick columns 1 through 6 and store them in the "states" object

In [57]:
cor(states) #uses defaults for use (everything) and method (pearson)

           Population  Income Illiteracy Life Exp  Murder HS Grad
Population     1.0000   0.208      0.108  -0.0681   0.344 -0.0985
Income         0.2082   1.000     -0.437   0.3403  -0.230  0.6199
Illiteracy     0.1076  -0.437      1.000  -0.5885   0.703 -0.6572
Life Exp      -0.0681   0.340     -0.588   1.0000  -0.781  0.5822
Murder         0.3436  -0.230      0.703  -0.7808   1.000 -0.4880
HS Grad       -0.0985   0.620     -0.657   0.5822  -0.488  1.0000
In [58]:
cor(states, method = "spearman") #uses defaults for use (everything)

           Population Income Illiteracy Life Exp Murder HS Grad
Population      1.000  0.125      0.313   -0.104  0.346  -0.383
Income          0.125  1.000     -0.315    0.324 -0.217   0.510
Illiteracy      0.313 -0.315      1.000   -0.555  0.672  -0.655
Life Exp       -0.104  0.324     -0.555    1.000 -0.780   0.524
Murder          0.346 -0.217      0.672   -0.780  1.000  -0.437
HS Grad        -0.383  0.510     -0.655    0.524 -0.437   1.000
In [59]:
#you get square matrices by default (i.e. all variables crossed with all other variables)
#you can also produce non-square matrices as follows:
x <- states[, c("Population", "Income", "Illiteracy", "HS Grad")]
y <- states[, c("Life Exp", "Murder")]
cor(x, y)

           Life Exp Murder
Population  -0.0681  0.344
Income       0.3403 -0.230
Illiteracy  -0.5885  0.703
HS Grad      0.5822 -0.488

#### Partial correlations

In [62]:
#a partial correlation is a correlation between 2 quantitative variables, controlling for one or more quantitative variables
#the pcor() function in the ggm package provides partial correlation coefficients.  here's an example:
install.packages("ggm") #install the ggm package
library(ggm) #load the ggm package
colnames(states) #names of the columns of our subset of the state.x77 dataset


1. 'Population'
2. 'Income'
3. 'Illiteracy'
4. 'Life Exp'
5. 'Murder'
6. 'HS Grad'
In [69]:
#the form of the pcor() function is as follows:
#pcor(u, S) where u is a vector of indices: the first two indicate the variables to be correlated
#and the remaining ones indicate the conditioning variables
#S is the covariance matrix among the variables
pcor(c(1, 5, 2, 3, 6), stats::cov(states)) #here, 1 = Population and 5 = Murder are the 2 variables to be correlated
#2 = Income, 3 = Illiteracy and 6 = HS Grad are the conditioning variables
#cov(states) is the covariance matrix among the variables
#we qualify the call as stats::cov(states) so that the base cov() function is used even if another attached package masks it

0.346272371372947
In [71]:
#in this case 0.346 is the correlation between population and murder rate controlling for:
#income, illiteracy and hs graduation rate


#### Other types of correlations

In [65]:
#the hetcor() function in the polycor package can compute a heterogeneous correlation matrix containing
#Pearson product-moment correlations between numeric variables,
#polyserial correlations between numeric and ordinal variables,
#polychoric correlations between ordinal variables, and
#tetrachoric correlations between two dichotomous variables.
#Polyserial, polychoric and tetrachoric correlations assume that the ordinal or dichotomous variables
#are derived from underlying normal distributions.


### Testing correlations for significance

In [72]:
#you can use the cor.test() function to test an individual correlation coefficient
#the format is: cor.test(x, y, alternative = , method = ), where
#x and y are the variables to be correlated
#alternative could be "two.sided" (default; population correlation isn't equal to 0); "less" (population correlation < 0); or
#"greater" (population correlation > 0)
#method could be either "pearson" (default), "kendall" or "spearman"
#here's an example:
cor.test(states[, 3], states[, 5]) #testing the correlation between illiteracy and murder rate
#the defaults are: alternative = "two.sided" and method = "pearson"

	Pearson's product-moment correlation

data:  states[, 3] and states[, 5]
t = 7, df = 50, p-value = 1e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.528 0.821
sample estimates:
cor
0.703

In [74]:
#the code above tests that the Pearson correlation between illiteracy and murder rate is 0.
#if the population correlation between these variables was in fact 0, then you would expect to see
#a sample correlation as large as 0.703 less than 1 time out of 100 million (i.e. p-value = 1e-08 or 1 / 10^8)
#since this is unlikely, you reject the null hypothesis that the correlation between illiteracy and murder rate is 0,
#and accept the alternative hypothesis that the correlation between illiteracy and murder rate is NOT 0.

In [73]:
#using cor.test(), you can only test one correlation at a time
#the corr.test() function in the "psych" package allows you to go further.  here's an example:
corr.test(states, use = "complete") #"complete" = listwise deletion of missing values

Call:corr.test(x = states, use = "complete")
Correlation matrix
           Population Income Illiteracy Life Exp Murder HS Grad
Population       1.00   0.21       0.11    -0.07   0.34   -0.10
Income           0.21   1.00      -0.44     0.34  -0.23    0.62
Illiteracy       0.11  -0.44       1.00    -0.59   0.70   -0.66
Life Exp        -0.07   0.34      -0.59     1.00  -0.78    0.58
Murder           0.34  -0.23       0.70    -0.78   1.00   -0.49
HS Grad         -0.10   0.62      -0.66     0.58  -0.49    1.00
Sample Size
[1] 50
Probability values (Entries above the diagonal are adjusted for multiple tests.)
           Population Income Illiteracy Life Exp Murder HS Grad
Population       0.00   0.59       1.00      1.0   0.10       1
Income           0.15   0.00       0.01      0.1   0.54       0
Illiteracy       0.46   0.00       0.00      0.0   0.00       0
Life Exp         0.64   0.02       0.00      0.0   0.00       0
Murder           0.01   0.11       0.00      0.0   0.00       0
HS Grad          0.50   0.00       0.00      0.0   0.00       0

To see confidence intervals of the correlations, print with the short=FALSE option
In [75]:
#in the above code, you can see that the correlation between population size and HS grad rate (-0.10)
#is not significantly different from 0 (p = 0.50).


## T-tests

In [76]:
#t-tests help in the comparison of two groups


### Independent t-test

In [78]:
#are you more likely to be imprisoned if you commit a crime in the South of the US?
#the comparison of interest is Southern versus non-Southern states,
#and the dependent variable is the probability of incarceration.
#a two-group independent t-test can be used to test the hypothesis that the two population means are equal
#we will use the UScrime dataset from the MASS package to test this hypothesis
library(MASS) #attach the MASS package
t.test(Prob ~ So, data = UScrime) #Prob = probability of incarceration; So = indicator variable for southern states

	Welch Two Sample t-test

data:  Prob by So
t = -4, df = 20, p-value = 7e-04
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.0385 -0.0119
sample estimates:
mean in group 0 mean in group 1
0.0385          0.0637

In [79]:
#from the results above, you can reject the null hypothesis that Southern and non-Southern states
#have equal probabilities of imprisonment (p < 0.001)


### Dependent t-test

In [80]:
#is the unemployment rate for younger males (14-24) higher than for older males (35-39)?
#in this case, the 2 groups are not independent since you would not expect the unemployment
#rate for younger and older males in Alabama to be unrelated
#here's an example:
library(MASS) #attach the MASS package
sapply(UScrime[c("U1", "U2")], function(x)(c(mean = mean(x), sd = sd(x)))) #use sapply() to calculate mean and sd for the 2 groups

        U1    U2
mean  95.5 33.98
sd    18.0  8.45
In [81]:
with(UScrime, t.test(U1, U2, paired = T))

	Paired t-test

data:  U1 and U2
t = 30, df = 50, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
57.7 65.3
sample estimates:
mean of the differences
61.5

In [82]:
#the mean difference (61.5) is large enough to warrant rejection of the hypothesis that the mean
#unemployment rate for older and younger males is the same.  younger males have a higher rate.
#the probability of obtaining a sample difference this large IF the population means are equal is,
#less than 0.0000000000000002 (i.e. p-value of 2e-16 or 2 * 10 ^ -16)


## Nonparametric tests of group differences

In [83]:
#if outcome variables are severely skewed or ordinal in nature, you may wish to use nonparametric tests


### Comparing two groups

In [84]:
#if two groups are independent, you can use the Wilcoxon rank sum test (aka Mann-Whitney U test)
#here's an example:
with(UScrime, by(Prob, So, median)) #calculate the median probability of incarceration for Southern vs non-Southern states

So: 0
[1] 0.0382
------------------------------------------------------------
So: 1
[1] 0.0556
In [85]:
wilcox.test(Prob ~ So, data = UScrime) #run the test

	Wilcoxon rank sum test

data:  Prob by So
W = 80, p-value = 8e-05
alternative hypothesis: true location shift is not equal to 0

In [86]:
#you can reject the null hypothesis that incarceration rates are the same in
#Southern and non-Southern states (since p-value < 0.001; 8e-05 = 8 * 10 ^ -5)

In [87]:
#if the two groups are paired, you can apply the same test but with the paired = TRUE option
#here's an example:
sapply(UScrime[c("U1", "U2")], median) #calculate the median unemployment rate for young and old male age groups

U1 U2
92 34
In [88]:
with(UScrime, wilcox.test(U1, U2, paired = T)) #run the test with paired = T option

Warning message in wilcox.test.default(U1, U2, paired = T):
“cannot compute exact p-value with ties”
	Wilcoxon signed rank test with continuity correction

data:  U1 and U2
V = 1000, p-value = 2e-09
alternative hypothesis: true location shift is not equal to 0

In [89]:
#again, we can reject the null hypothesis since the p-value < 0.001


### Comparing more than two groups

In [90]:
#if the groups are independent, you can use the Kruskal-Wallis test;
#if the groups are dependent, you can use the Friedman test
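The notebook only demonstrates the Kruskal-Wallis case below; for the dependent-groups case, here is a minimal Friedman sketch on made-up repeated-measures data (five subjects each rating three treatments):

```r
#Friedman test sketch: dependent groups (each subject rates all three treatments)
#the ratings below are made up purely for illustration
ratings <- matrix(c(5, 4, 6, 5, 7,   #treatment A
                    3, 2, 4, 3, 4,   #treatment B
                    1, 2, 1, 2, 1),  #treatment C
                  nrow = 5,
                  dimnames = list(subject = 1:5, treatment = c("A", "B", "C")))
friedman.test(ratings) #rows are blocks (subjects), columns are groups (treatments)
```

With each subject serving as their own block, a small p-value here indicates that the treatments differ in location across subjects.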

In [91]:
#here's an example of a Kruskal-Wallis test
states <- data.frame(state.region, state.x77) #add the region designation to the dataset
kruskal.test(Illiteracy ~ state.region, data = states) #apply the test

	Kruskal-Wallis rank sum test

data:  Illiteracy by state.region
Kruskal-Wallis chi-squared = 20, df = 3, p-value = 5e-05

In [92]:
#this significance test suggests that the illiteracy rate isn't the same in each of the four
#regions of the country (p-value < 0.001; p-value = 5e-05 = 5 * 10 ^ -5)

In [94]:
#nonparametric multiple comparisons using the wmc() function
source("http://www.statmethods.net/RiA/wmc.txt") #access the function
states <- data.frame(state.region, state.x77) #create the data frame
wmc(Illiteracy ~ state.region, data = states, method = "holm") #run the test

Descriptive Statistics

         West North Central Northeast  South
n      13.000        12.000     9.000 16.000
median  0.600         0.700     1.100  1.750

Multiple Comparisons (Wilcoxon Rank Sum Tests)

Group.1       Group.2    W        p
1          West North Central 88.0 8.67e-01
2          West     Northeast 46.5 8.67e-01
3          West         South 39.0 1.79e-02   *
4 North Central     Northeast 20.5 5.36e-02   .
5 North Central         South  2.0 8.05e-05 ***
6     Northeast         South 18.0 1.19e-02   *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In [95]:
#you can see from the test results that the South differs significantly
#from the other three regions and that the other three regions do not
#differ from each other at a p < 0.05 level.