In [70]:
#@author: Venky Rao raove@us.ibm.com
#@last edited: 13 Sep 2017
#@source: materials, data and examples adapted from R in Action 2nd Edition by Dr. Robert Kabacoff

Basic Statistics

Descriptive statistics

In [2]:
#in this section we will review measures of central tendency, variability and distribution shape
#for continuous variables
#we will use the mtcars dataset that ships with the base installation of R
myvars <- c("mpg", "hp", "wt")
head(mtcars[myvars])
                   mpg  hp    wt
Mazda RX4         21.0 110 2.620
Mazda RX4 Wag     21.0 110 2.875
Datsun 710        22.8  93 2.320
Hornet 4 Drive    21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant           18.1 105 3.460
In [3]:
#first we will look at descriptive statistics for all 32 cars.  Then, we will examine
#descriptive statistics by transmission type (am) and number of cylinders (cyl)
#transmission type is a dichotomous variable coded 0 = automatic, 1 = manual
#number of cylinders can be 4, 6 or 8
In [4]:
#in the base installation of R, you can use the summary() function to obtain
#descriptive statistics
summary(mtcars[myvars])
      mpg              hp              wt       
 Min.   :10.40   Min.   : 52.0   Min.   :1.513  
 1st Qu.:15.43   1st Qu.: 96.5   1st Qu.:2.581  
 Median :19.20   Median :123.0   Median :3.325  
 Mean   :20.09   Mean   :146.7   Mean   :3.217  
 3rd Qu.:22.80   3rd Qu.:180.0   3rd Qu.:3.610  
 Max.   :33.90   Max.   :335.0   Max.   :5.424  
In [5]:
#creating a user defined function for summary statistics
mystats <- function(x, na.omit = F) {
    if (na.omit) {
        x <- x[!is.na(x)]
    }
    m <- mean(x)
    n <- length(x)
    s <- sd(x)
    skew <- sum((x - m)^3 / s^3) / n
    kurt <- sum((x - m)^4 / s^4) / n - 3
    return (c(n = n, mean = m, stdev = s, skew = skew, kurtosis = kurt))
}
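A quick sketch of the na.omit argument in action, using a small made-up vector (illustrative values only):

#if na.omit = TRUE, missing values are removed before the statistics are computed
v <- c(1, 2, 3, NA, 5)
mystats(v, na.omit = TRUE) #n = 4 because the NA is dropped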
In [6]:
#applying our user defined function to our dataset
options(digits = 3) #keep decimal places to 3 to enhance readability
sapply(mtcars[myvars], mystats)
             mpg      hp      wt
n        32.000  32.000 32.0000
mean     20.091 146.688  3.2172
stdev     6.027  68.563  0.9785
skew      0.611   0.726  0.4231
kurtosis -0.373  -0.136 -0.0227
In [7]:
#descriptive statistics via the describe() function in the Hmisc package
library(Hmisc) #load the Hmisc package
describe(mtcars[myvars])
Loading required package: lattice

Attaching package: ‘lattice’

The following object is masked from ‘package:SparkR’:

    histogram

Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2

Attaching package: ‘Hmisc’

The following objects are masked from ‘package:SparkR’:

    ceil, describe, summarize, translate

The following objects are masked from ‘package:base’:

    format.pval, round.POSIXt, trunc.POSIXt, units

mtcars[myvars] 

 3  Variables      32  Observations
--------------------------------------------------------------------------------
mpg 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      32        0       25    0.999    20.09    6.796    12.00    14.34 
     .25      .50      .75      .90      .95 
   15.43    19.20    22.80    30.09    31.30 

lowest : 10.4 13.3 14.3 14.7 15.0, highest: 26.0 27.3 30.4 32.4 33.9
--------------------------------------------------------------------------------
hp 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      32        0       22    0.997    146.7    77.04    63.65    66.00 
     .25      .50      .75      .90      .95 
   96.50   123.00   180.00   243.50   253.55 

lowest :  52  62  65  66  91, highest: 215 230 245 264 335
--------------------------------------------------------------------------------
wt 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      32        0       29    0.999    3.217    1.089    1.736    1.956 
     .25      .50      .75      .90      .95 
   2.581    3.325    3.610    4.048    5.293 

lowest : 1.51 1.61 1.83 1.94 2.14, highest: 3.85 4.07 5.25 5.34 5.42
--------------------------------------------------------------------------------
In [8]:
#descriptive statistics via the stat.desc() function in the pastecs package
install.packages("pastecs") #install pastecs
library(pastecs) #load the pastecs package
stat.desc(mtcars[myvars])
Installing package into ‘/gpfs/global_fs01/sym_shared/YPProdSpark/user/s17c-9f3318fc11f06c-d37a4b9405b6/R/libs’
(as ‘lib’ is unspecified)
Loading required package: boot

Attaching package: ‘boot’

The following object is masked from ‘package:survival’:

    aml

The following object is masked from ‘package:lattice’:

    melanoma

The following object is masked from ‘package:SparkR’:

    corr


Attaching package: ‘pastecs’

The following objects are masked from ‘package:SparkR’:

    first, last

                mpg       hp      wt
nbr.val       32.00   32.000  32.000
nbr.null       0.00    0.000   0.000
nbr.na         0.00    0.000   0.000
min           10.40   52.000   1.513
max           33.90  335.000   5.424
range         23.50  283.000   3.911
sum          642.90 4694.000 102.952
median        19.20  123.000   3.325
mean          20.09  146.688   3.217
SE.mean        1.07   12.120   0.173
CI.mean.0.95   2.17   24.720   0.353
var           36.32 4700.867   0.957
std.dev        6.03   68.563   0.978
coef.var       0.30    0.467   0.304
In [9]:
#descriptive statistics via the describe() function in the psych package
library(psych) #load the psych package
describe(mtcars[myvars])
Attaching package: ‘psych’

The following object is masked from ‘package:boot’:

    logit

The following object is masked from ‘package:Hmisc’:

    describe

The following objects are masked from ‘package:ggplot2’:

    %+%, alpha

The following object is masked from ‘package:SparkR’:

    describe

    vars  n   mean     sd median trimmed    mad   min    max  range  skew kurtosis     se
mpg    1 32  20.09  6.027  19.20   19.70  5.411 10.40  33.90  23.50 0.611  -0.3728  1.065
hp     2 32 146.69 68.563 123.00  141.19 77.095 52.00 335.00 283.00 0.726  -0.1356 12.120
wt     3 32   3.22  0.978   3.33    3.15  0.767  1.51   5.42   3.91 0.423  -0.0227  0.173

Descriptive statistics by group

In [10]:
#descriptive statistics by group using aggregate()
#example 1: mean
aggregate(mtcars[myvars], by = list(am = mtcars$am), mean)
  am  mpg  hp   wt
1  0 17.1 160 3.77
2  1 24.4 127 2.41
In [11]:
#descriptive statistics by group using aggregate()
#example 2: standard deviation
aggregate(mtcars[myvars], by = list(am = mtcars$am), sd)
  am  mpg   hp    wt
1  0 3.83 53.9 0.777
2  1 6.17 84.1 0.617
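The introduction to this section also mentioned grouping by number of cylinders (cyl); a hedged sketch of crossing both grouping variables (output not shown):

#descriptive statistics by transmission type AND number of cylinders (sketch)
aggregate(mtcars[myvars], by = list(am = mtcars$am, cyl = mtcars$cyl), mean)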
In [12]:
#descriptive statistics by group using by()
dstats <- function(x) { #create a user-defined function
    sapply(x, mystats)
}
by(mtcars[myvars], mtcars$am, dstats)
mtcars$am: 0
            mpg       hp     wt
n        19.000  19.0000 19.000
mean     17.147 160.2632  3.769
stdev     3.834  53.9082  0.777
skew      0.014  -0.0142  0.976
kurtosis -0.803  -1.2097  0.142
------------------------------------------------------------ 
mtcars$am: 1
             mpg      hp     wt
n        13.0000  13.000 13.000
mean     24.3923 126.846  2.411
stdev     6.1665  84.062  0.617
skew      0.0526   1.360  0.210
kurtosis -1.4554   0.563 -1.174
In [13]:
#summary statistics by group using the summaryBy() function in the doBy package
install.packages("doBy") #install doBy
library(doBy) #load the doBy package
summaryBy(mpg + hp + wt ~ am, data = mtcars, FUN = mystats)
Installing package into ‘/gpfs/global_fs01/sym_shared/YPProdSpark/user/s17c-9f3318fc11f06c-d37a4b9405b6/R/libs’
(as ‘lib’ is unspecified)

Attaching package: ‘doBy’

The following objects are masked from ‘package:SparkR’:

    orderBy, sampleBy

  am mpg.n mpg.mean mpg.stdev mpg.skew mpg.kurtosis hp.n hp.mean hp.stdev hp.skew hp.kurtosis wt.n wt.mean wt.stdev wt.skew wt.kurtosis
1  0    19     17.1      3.83   0.0140       -0.803   19     160     53.9 -0.0142      -1.210   19    3.77    0.777   0.976       0.142
2  1    13     24.4      6.17   0.0526       -1.455   13     127     84.1  1.3599       0.563   13    2.41    0.617   0.210      -1.174
In [14]:
#summary statistics by group using the describeBy() function in the psych package
library(psych) #load the psych package
describeBy(mtcars[myvars], list(am = mtcars$am))
$`0`
    vars  n   mean    sd median trimmed   mad   min    max  range  skew
mpg    1 19  17.15  3.83  17.30   17.12  3.11 10.40  24.40  14.00  0.01
hp     2 19 160.26 53.91 175.00  161.06 77.10 62.00 245.00 183.00 -0.01
wt     3 19   3.77  0.78   3.52    3.75  0.45  2.46   5.42   2.96  0.98
    kurtosis    se
mpg    -0.80  0.88
hp     -1.21 12.37
wt      0.14  0.18

$`1`
    vars  n   mean    sd median trimmed   mad   min    max  range skew kurtosis
mpg    1 13  24.39  6.17  22.80   24.38  6.67 15.00  33.90  18.90 0.05    -1.46
hp     2 13 126.85 84.06 109.00  114.73 63.75 52.00 335.00 283.00 1.36     0.56
wt     3 13   2.41  0.62   2.32    2.39  0.68  1.51   3.57   2.06 0.21    -1.17
       se
mpg  1.71
hp  23.31
wt   0.17

attr(,"call")
by.data.frame(data = x, INDICES = group, FUN = describe, type = type)

Frequency and contingency tables

In [15]:
#in this section we will look at frequency and contingency tables from categorical variables,
#along with tests of independence, measures of association, and methods for graphically displaying results
#we will use the Arthritis dataset from the vcd package
library(vcd) #load the vcd package
head(Arthritis) #display the first 6 rows
Loading required package: grid

Attaching package: ‘grid’

The following object is masked from ‘package:SparkR’:

    explode

  ID Treatment  Sex Age Improved
1 57   Treated Male  27     Some
2 46   Treated Male  29     None
3 77   Treated Male  30     None
4 17   Treated Male  32   Marked
5 36   Treated Male  46   Marked
6 23   Treated Male  58   Marked

Generating frequency tables

One way tables

In [16]:
mytable <- with(Arthritis, table(Improved))
mytable
Improved
  None   Some Marked 
    42     14     28 
In [17]:
#convert frequencies into proportions using prop.table():
prop.table(mytable)
Improved
  None   Some Marked 
 0.500  0.167  0.333 
In [18]:
#convert frequencies into percentages using prop.table() * 100:
prop.table(mytable) * 100
Improved
  None   Some Marked 
  50.0   16.7   33.3 

Two way tables

In [19]:
mytable <- xtabs(~ Treatment + Improved, data = Arthritis)
mytable
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21
In [20]:
#you can generate marginal frequencies and proportions using the
#margin.table() and prop.table()functions
In [21]:
#for row sums and row proportions, you have:
margin.table(mytable, 1)
prop.table(mytable, 1)
Treatment
Placebo Treated 
     43      41 
         Improved
Treatment  None  Some Marked
  Placebo 0.674 0.163  0.163
  Treated 0.317 0.171  0.512
In [22]:
#for column sums and column proportions, you have:
margin.table(mytable, 2)
prop.table(mytable, 2)
Improved
  None   Some Marked 
    42     14     28 
         Improved
Treatment None Some Marked
  Placebo 0.69 0.50   0.25
  Treated 0.31 0.50   0.75
In [23]:
#cell proportions are obtained by:
prop.table(mytable)
         Improved
Treatment   None   Some Marked
  Placebo 0.3452 0.0833 0.0833
  Treated 0.1548 0.0833 0.2500
In [24]:
#you can use addmargins() to add marginal sums to these tables
addmargins(mytable)
addmargins(prop.table(mytable))
         Improved
Treatment None Some Marked Sum
  Placebo   29    7      7  43
  Treated   13    7     21  41
  Sum       42   14     28  84

         Improved
Treatment  None   Some Marked   Sum
  Placebo 0.345 0.0833 0.0833 0.512
  Treated 0.155 0.0833 0.2500 0.488
  Sum     0.500 0.1667 0.3333 1.000
In [25]:
#if you only want to add a sum column:
addmargins(prop.table(mytable, 1), 2)
         Improved
Treatment  None  Some Marked Sum
  Placebo 0.674 0.163  0.163   1
  Treated 0.317 0.171  0.512   1
In [26]:
#if you only want to add a sum row:
addmargins(prop.table(mytable, 2), 1)
         Improved
Treatment None Some Marked
  Placebo 0.69  0.5   0.25
  Treated 0.31  0.5   0.75
  Sum     1.00  1.0   1.00
In [27]:
#a third method of creating two-way tables is the CrossTable() function in the "gmodels" package:
install.packages("gmodels")
library(gmodels)
CrossTable(Arthritis$Treatment, Arthritis$Improved)
Installing package into ‘/gpfs/global_fs01/sym_shared/YPProdSpark/user/s17c-9f3318fc11f06c-d37a4b9405b6/R/libs’
(as ‘lib’ is unspecified)
 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  84 

 
                    | Arthritis$Improved 
Arthritis$Treatment |      None |      Some |    Marked | Row Total | 
--------------------|-----------|-----------|-----------|-----------|
            Placebo |        29 |         7 |         7 |        43 | 
                    |     2.616 |     0.004 |     3.752 |           | 
                    |     0.674 |     0.163 |     0.163 |     0.512 | 
                    |     0.690 |     0.500 |     0.250 |           | 
                    |     0.345 |     0.083 |     0.083 |           | 
--------------------|-----------|-----------|-----------|-----------|
            Treated |        13 |         7 |        21 |        41 | 
                    |     2.744 |     0.004 |     3.935 |           | 
                    |     0.317 |     0.171 |     0.512 |     0.488 | 
                    |     0.310 |     0.500 |     0.750 |           | 
                    |     0.155 |     0.083 |     0.250 |           | 
--------------------|-----------|-----------|-----------|-----------|
       Column Total |        42 |        14 |        28 |        84 | 
                    |     0.500 |     0.167 |     0.333 |           | 
--------------------|-----------|-----------|-----------|-----------|

 

Multidimensional tables

In [28]:
#if you have more than two categorical variables, you're dealing with multidimensional tables
#here's an example of a three-way contingency table
mytable <- xtabs(~ Treatment + Sex + Improved, data = Arthritis) #cell frequencies
mytable
, , Improved = None

         Sex
Treatment Female Male
  Placebo     19   10
  Treated      6    7

, , Improved = Some

         Sex
Treatment Female Male
  Placebo      7    0
  Treated      5    2

, , Improved = Marked

         Sex
Treatment Female Male
  Placebo      6    1
  Treated     16    5
In [29]:
ftable(mytable) #presents multidimensional tables in a compact and attractive manner
                 Improved None Some Marked
Treatment Sex                             
Placebo   Female            19    7      6
          Male              10    0      1
Treated   Female             6    5     16
          Male               7    2      5
In [30]:
margin.table(mytable, 1) #marginal frequencies for the first dimension (Treatment)
Treatment
Placebo Treated 
     43      41 
In [31]:
margin.table(mytable, 2) #marginal frequencies for the second dimension (Sex)
Sex
Female   Male 
    59     25 
In [32]:
margin.table(mytable, 3) #marginal frequencies for the third dimension (Improved)
Improved
  None   Some Marked 
    42     14     28 
In [33]:
margin.table(mytable, c(1, 3)) #treatment X improved marginal frequencies
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21
In [34]:
ftable(prop.table(mytable, c(1, 2))) #improved proportions for treatment X sex
                 Improved   None   Some Marked
Treatment Sex                                 
Placebo   Female          0.5938 0.2188 0.1875
          Male            0.9091 0.0000 0.0909
Treated   Female          0.2222 0.1852 0.5926
          Male            0.5000 0.1429 0.3571
In [35]:
ftable(addmargins(prop.table(mytable, c(1, 2)), 3)) #improved proportions for treatment X sex with row totals
                 Improved   None   Some Marked    Sum
Treatment Sex                                        
Placebo   Female          0.5938 0.2188 0.1875 1.0000
          Male            0.9091 0.0000 0.0909 1.0000
Treated   Female          0.2222 0.1852 0.5926 1.0000
          Male            0.5000 0.1429 0.3571 1.0000
In [36]:
ftable(addmargins(prop.table(mytable, c(1, 2)), 3)) * 100 #improved percentages for treatment X sex with row totals
                 Improved   None   Some Marked    Sum
Treatment Sex                                        
Placebo   Female           59.38  21.88  18.75 100.00
          Male             90.91   0.00   9.09 100.00
Treated   Female           22.22  18.52  59.26 100.00
          Male             50.00  14.29  35.71 100.00

Tests of independence

Chi-Square test of independence

In [37]:
#you can apply the function chisq.test() to a two-way table to produce a chi-square test of independence
#of the row and column variables.  example:
library(vcd) #load the vcd package for the Arthritis dataset
mytable <- xtabs(~ Treatment + Improved, data = Arthritis) #create a two-way table
chisq.test(mytable) #run the chi-square test
	Pearson's Chi-squared test

data:  mytable
X-squared = 10, df = 2, p-value = 0.001
In [38]:
#let us understand the result:
#let us begin with a hypothesis: treatment type and outcome are independent
#the p-value is the probability of obtaining the sampled results, assuming independence of
     #the row and column variables in the population.
#because p-value < 0.01, you reject the hypothesis that the treatment type is independent from the outcome!
In [39]:
#another example:
mytable <- xtabs(~ Improved + Sex, data = Arthritis) #create a two-way table
chisq.test(mytable) #run the chi-square test
Warning message in chisq.test(mytable):
“Chi-squared approximation may be incorrect”
	Pearson's Chi-squared test

data:  mytable
X-squared = 5, df = 2, p-value = 0.09
In [40]:
#let us understand the result:
#let us begin with a hypothesis: outcome and sex are independent
#the p-value is the probability of obtaining the sampled results, assuming independence of
     #the row and column variables in the population.
#because the p-value > 0.05, you fail to reject the hypothesis; the data are consistent with outcome being independent of sex
In [41]:
#conclusion:
#in a chi-square test of independence:
#the null hypothesis is that the two variables are independent
#if the p-value is below your chosen significance level (e.g. 0.05 or 0.01), you reject the null hypothesis
#if the p-value is above it, you fail to reject the null hypothesis (the data are consistent with independence)

Fisher's exact test

In [42]:
#Fisher's exact test evaluates the null hypothesis of independence of rows and columns
#in a contingency table with fixed marginals.  The format is fisher.test(mytable) where
#mytable is a two-way table.  In contrast to many statistical packages, the fisher.test()
#function can be applied to any two-way table with two or more rows and columns, not
#just a 2X2 table.  here's an example of a Fisher's exact test:
library(vcd) #load the vcd package for the Arthritis dataset
mytable <- xtabs(~ Treatment + Improved, data = Arthritis) #create the 2-way table
fisher.test(mytable) #run the test
	Fisher's Exact Test for Count Data

data:  mytable
p-value = 0.001
alternative hypothesis: two.sided
In [46]:
#because p-value is < 0.01, we reject the null hypothesis that Treatment and Improved are independent of each other

Cochran-Mantel-Haenszel test

In [44]:
#the mantelhaen.test() function provides a Cochran-Mantel-Haenszel chi-square test of the null hypothesis
#that two nominal variables are conditionally independent in each stratum of a third variable.
#the following code tests the hypothesis that the Treatment and Improved variables are independent
#within each level for Sex.  The test assumes that there is no three-way (Treatment X Improved X Sex) interaction:
library(vcd) #load the vcd package for the Arthritis dataset
mytable <- xtabs(~ Treatment + Improved + Sex, data = Arthritis) #create the 3-way table
mantelhaen.test(mytable) #run the test
	Cochran-Mantel-Haenszel test

data:  mytable
Cochran-Mantel-Haenszel M^2 = 10, df = 2, p-value = 7e-04
In [45]:
#7e-04 = 7 * 10^-4 = 0.0007
#because p-value is < 0.01, we reject the null hypothesis that Treatment and Improved are independent of each other when controlled for Sex

Measures of association

In [48]:
#if you reject the null hypothesis of independence using the chi-square test, Fisher's exact test or the CMH test,
#you can then gauge the strength of the dependence or association.
#the assocstats() function of the vcd package calculates the phi coefficient, contingency coefficient and Cramer's V for a two-way table
#in general, LARGER magnitudes indicate STRONGER associations.  here's an example:
library(vcd) #load the vcd package for the Arthritis dataset
mytable <- xtabs(~ Treatment + Improved, data = Arthritis) #create the 2-way table
assocstats(mytable) #run the test
                    X^2 df  P(> X^2)
Likelihood Ratio 13.530  2 0.0011536
Pearson          13.055  2 0.0014626

Phi-Coefficient   : NA 
Contingency Coeff.: 0.367 
Cramer's V        : 0.394 

Correlations

In [49]:
#Correlation coefficients are used to describe relationships among quantitative variables
#the sign +- indicates the direction of the relationship (positive or negative)
#the magnitude indicates strength of the relationship (ranging from 0 for no relationship to 1 for a perfectly predictable relationship)

Types of correlations

Pearson, Spearman and Kendall correlations

In [51]:
#the Pearson product-moment correlation assesses the degree of linear relationship between 2 quantitative variables
#Spearman's rank-order correlation coefficient assesses the degree of relationship between 2 rank-ordered variables
#Kendall's tau is also a nonparametric measure of rank correlation
In [52]:
#the cor() function produces all 3 correlation coefficients whereas the cov() function produces covariances
#the simplified format is: cor(x, use = , method = );
#where x = matrix or data frame; use = specifies the handling of missing data (default = everything); method = method (default = "pearson")
#examples follow
In [53]:
states <- state.x77[, 1:6] #choose a subset of the state.x77 dataset; pick columns 1 through 6 and store them in the "states" object
In [57]:
cor(states) #uses defaults for use (everything) and method (pearson)
           Population Income Illiteracy Life Exp Murder HS Grad
Population     1.0000  0.208      0.108  -0.0681  0.344 -0.0985
Income         0.2082  1.000     -0.437   0.3403 -0.230  0.6199
Illiteracy     0.1076 -0.437      1.000  -0.5885  0.703 -0.6572
Life Exp      -0.0681  0.340     -0.588   1.0000 -0.781  0.5822
Murder         0.3436 -0.230      0.703  -0.7808  1.000 -0.4880
HS Grad       -0.0985  0.620     -0.657   0.5822 -0.488  1.0000
In [58]:
cor(states, method = "spearman") #uses defaults for use (everything)
           Population Income Illiteracy Life Exp Murder HS Grad
Population      1.000  0.125      0.313   -0.104  0.346  -0.383
Income          0.125  1.000     -0.315    0.324 -0.217   0.510
Illiteracy      0.313 -0.315      1.000   -0.555  0.672  -0.655
Life Exp       -0.104  0.324     -0.555    1.000 -0.780   0.524
Murder          0.346 -0.217      0.672   -0.780  1.000  -0.437
HS Grad        -0.383  0.510     -0.655    0.524 -0.437   1.000
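Kendall's tau was described above but is not demonstrated elsewhere in this notebook; a minimal sketch on the same matrix (output not shown):

#Kendall's tau rank correlations for the same variables
cor(states, method = "kendall")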
In [59]:
#you get square matrices by default (i.e. all variables crossed with all other variables)
#you can also produce non-square matrices (one set of variables crossed with another) as follows:
x <- states[, c("Population", "Income", "Illiteracy", "HS Grad")]
y <- states[, c("Life Exp", "Murder")]
cor(x, y)
           Life Exp Murder
Population  -0.0681  0.344
Income       0.3403 -0.230
Illiteracy  -0.5885  0.703
HS Grad      0.5822 -0.488
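The cov() function mentioned earlier returns covariances rather than correlations; a minimal sketch (this is the matrix that pcor() consumes in the next subsection; output not shown):

#covariance matrix of the states data; the stats:: qualifier guards against masking by other attached packages
stats::cov(states)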

Partial correlations

In [62]:
#a partial correlation is a correlation between 2 quantitative variables, controlling for one or more quantitative variables
#the pcor() function in the ggm package provides partial correlation coefficients.  here's an example:
install.packages("ggm") #install the ggm package
library(ggm) #load the ggm package
colnames(states) #column names of the states matrix (the first 6 columns of state.x77)
Installing package into ‘/gpfs/global_fs01/sym_shared/YPProdSpark/user/s17c-9f3318fc11f06c-d37a4b9405b6/R/libs’
(as ‘lib’ is unspecified)
also installing the dependencies ‘irlba’, ‘igraph’

Loading required package: igraph

Attaching package: ‘igraph’

The following object is masked from ‘package:SparkR’:

    union

The following objects are masked from ‘package:stats’:

    decompose, spectrum

The following object is masked from ‘package:base’:

    union


Attaching package: ‘ggm’

The following object is masked from ‘package:igraph’:

    pa

The following object is masked from ‘package:Hmisc’:

    rcorr

  1. 'Population'
  2. 'Income'
  3. 'Illiteracy'
  4. 'Life Exp'
  5. 'Murder'
  6. 'HS Grad'
In [69]:
#the form of the pcor() function is as follows:
#pcor(u, S) where u is a vector of numbers, with the first two numbers being the indices of the numbers to be correlated
#and the remaining numbers being the indices of the conditioning variables
#S is the covariance matrix among the variables
pcor(c(1, 5, 2, 3, 6), stats::cov(states)) #here, 1 = Population and 5 = Murder are the 2 variables to be correlated
                                    #Income, Illiteracy and HS Grad are the conditioning variables
                                    #cov(states) is the covariance matrix among variables
                                    #we call stats::cov() explicitly because cov() is masked by another attached package in this session
0.346272371372947
In [71]:
#in this case 0.346 is the correlation between population and murder rate controlling for:
#income, illiteracy and hs graduation rate

Other types of correlations

In [65]:
#the hetcor() function in the polycor package can compute a heterogeneous correlation matrix containing
#Pearson product-moment correlations between numeric variables,
#polyserial correlations between numeric and ordinal variables,
#polychoric correlations between ordinal variables, and
#tetrachoric correlations between two dichotomous variables.
#Polyserial, polychoric and tetrachoric correlations assume that the ordinal or dichotomous variables
#are derived from underlying normal distributions.
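None of these heterogeneous correlations are computed in this notebook; a minimal sketch, assuming the polycor package is available (the toy data frame mixes variable types purely for illustration):

#sketch only: heterogeneous correlations via polycor::hetcor() (package not installed above)
#install.packages("polycor")
library(polycor)
toydf <- data.frame(mpg = mtcars$mpg,          #numeric: Pearson correlations
                    cyl = ordered(mtcars$cyl), #ordinal: polyserial/polychoric correlations
                    am  = factor(mtcars$am))   #dichotomous: tetrachoric correlations
hetcor(toydf) #returns the heterogeneous correlation matrix with standard errors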

Testing correlations for significance

In [72]:
#you can use the cor.test() function to test an individual correlation coefficient
#the format is: cor.test(x, y, alternative = , method = ), where
#x and y are the variables to be correlated
#alternative could be "two.sided" (default; population correlation isn't equal to 0); "less" (population correlation < 0); or
                      #"greater" (population correlation > 0)
#method could be either "pearson" (default), "kendall" or "spearman"
#here's an example:
cor.test(states[, 3], states[, 5]) #testing the correlation between illiteracy and murder rate
                                   #the defaults are: alternative = "two.sided" and method = "pearson"
	Pearson's product-moment correlation

data:  states[, 3] and states[, 5]
t = 7, df = 50, p-value = 1e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.528 0.821
sample estimates:
  cor 
0.703 
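The alternative argument described above defaults to a two-sided test; a minimal sketch of a one-sided variant (output not shown):

#one-sided test: is the population correlation between illiteracy and murder rate greater than 0?
cor.test(states[, 3], states[, 5], alternative = "greater", method = "pearson")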
In [74]:
#the code above tests that the Pearson correlation between illiteracy and murder rate is 0.
#if the population correlation between these variables was in fact 0, then you would expect to see
#a sample correlation as large as 0.703 less than 1 time out of 100 million (i.e. p-value = 1e-08 or 1 / 10 ^ 8)
#since this is unlikely, you reject the null hypothesis that the correlation between illiteracy and murder rate is 0,
#and accept the alternative hypothesis that the correlation between illiteracy and murder rate is NOT 0.
In [73]:
#using cor.test(), you can only test one correlation at a time
#the corr.test() function in the "psych" package allows you to go further.  here's an example:
library(psych) #load the psych library
corr.test(states, use = "complete") #"complete" = listwise deletion of missing values
Call:corr.test(x = states, use = "complete")
Correlation matrix 
           Population Income Illiteracy Life Exp Murder HS Grad
Population       1.00   0.21       0.11    -0.07   0.34   -0.10
Income           0.21   1.00      -0.44     0.34  -0.23    0.62
Illiteracy       0.11  -0.44       1.00    -0.59   0.70   -0.66
Life Exp        -0.07   0.34      -0.59     1.00  -0.78    0.58
Murder           0.34  -0.23       0.70    -0.78   1.00   -0.49
HS Grad         -0.10   0.62      -0.66     0.58  -0.49    1.00
Sample Size 
[1] 50
Probability values (Entries above the diagonal are adjusted for multiple tests.) 
           Population Income Illiteracy Life Exp Murder HS Grad
Population       0.00   0.59       1.00      1.0   0.10       1
Income           0.15   0.00       0.01      0.1   0.54       0
Illiteracy       0.46   0.00       0.00      0.0   0.00       0
Life Exp         0.64   0.02       0.00      0.0   0.00       0
Murder           0.01   0.11       0.00      0.0   0.00       0
HS Grad          0.50   0.00       0.00      0.0   0.00       0

 To see confidence intervals of the correlations, print with the short=FALSE option
In [75]:
#in the above code, you can see that the correlation between population size and HS grad rate (-0.10)
#is not significantly different from 0 (p = 0.50).

T-tests

In [76]:
#t-tests help in the comparison of two groups

Independent t-test

In [78]:
#are you more likely to be imprisoned if you commit a crime in the South of the US?
#the comparison of interest is Southern versus non-Southern states,
#and the dependent variable is the probability of incarceration.
#a two-group independent t-test can be used to test the hypothesis that the two population means are equal
#we will use the UScrime dataset from the MASS package to test this hypothesis
library(MASS) #attach the MASS package
t.test(Prob ~ So, data = UScrime) #Prob = probability of incarceration; So = indicator variable for southern states
	Welch Two Sample t-test

data:  Prob by So
t = -4, df = 20, p-value = 7e-04
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.0385 -0.0119
sample estimates:
mean in group 0 mean in group 1 
         0.0385          0.0637 
In [79]:
#from the results above, you can reject the null hypothesis that Southern states and non-Southern states,
#have equal probabilities of imprisonment (p < 0.001)

Dependent t-test

In [80]:
#is the unemployment rate for younger males (ages 14 to 24) greater than that for older males (ages 35 to 39)?
#in this case, the 2 groups are not independent since you would not expect the unemployment
#rate for younger and older males in Alabama to be unrelated
#here's an example:
library(MASS) #attach the MASS package
sapply(UScrime[c("U1", "U2")], function(x)(c(mean = mean(x), sd = sd(x)))) #use sapply() to calculate mean and sd for the 2 groups
        U1    U2
mean  95.5 33.98
sd    18.0  8.45
In [81]:
with(UScrime, t.test(U1, U2, paired = T))
	Paired t-test

data:  U1 and U2
t = 30, df = 50, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 57.7 65.3
sample estimates:
mean of the differences 
                   61.5 
In [82]:
#the mean difference (61.5) is large enough to warrant rejection of the hypothesis that the mean
#unemployment rate for older and younger males is the same.  younger males have a higher rate.
#the probability of obtaining a sample difference this large IF the population means are equal is,
#less than 0.0000000000000002 (i.e. p-value of 2e-16 or 2 * 10 ^ -16)

Nonparametric tests of group differences

In [83]:
#if outcome variables are severely skewed or ordinal in nature, you may wish to use nonparametric tests

Comparing two groups

In [84]:
#if two groups are independent, you can use the Wilcoxon rank sum test (aka Mann-Whitney U test)
#here's an example:
library(MASS) #load the MASS package
with(UScrime, by(Prob, So, median)) #calculate median probabilities of incarceration rates by Southern v/s non-Southern states
So: 0
[1] 0.0382
------------------------------------------------------------ 
So: 1
[1] 0.0556
In [85]:
wilcox.test(Prob ~ So, data = UScrime) #run the test
	Wilcoxon rank sum test

data:  Prob by So
W = 80, p-value = 8e-05
alternative hypothesis: true location shift is not equal to 0
In [86]:
#you can reject the null hypothesis that incarceration rates are the same in
#Southern and non-Southern states (since p-value < 0.001; 8e-05 = 8 * 10 ^ -5)
In [87]:
#if the two groups are paired, you can apply the same test but with the paired = TRUE option
#here's an example:
sapply(UScrime[c("U1", "U2")], median) #calculate the median unemployment rate for young and old male age groups
U1 U2 
92 34 
In [88]:
with(UScrime, wilcox.test(U1, U2, paired = T)) #run the test with paired = T option
Warning message in wilcox.test.default(U1, U2, paired = T):
“cannot compute exact p-value with ties”
	Wilcoxon signed rank test with continuity correction

data:  U1 and U2
V = 1000, p-value = 2e-09
alternative hypothesis: true location shift is not equal to 0
In [89]:
#again, we can reject the null hypothesis since the p-value < 0.001

Comparing more than two groups

In [90]:
#if the groups are independent, you can use the Kruskal-Wallis test;
#if the groups are dependent, you can use the Friedman test
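The Friedman test is not demonstrated in this notebook; a minimal sketch on made-up data (an unreplicated complete block design, i.e. one observation per group/block combination, as the test requires):

#Friedman test sketch: columns are groups (treatments), rows are blocks (judges); values are invented
scores <- matrix(c(5, 4, 3,
                   6, 5, 4,
                   4, 4, 2,
                   7, 6, 5),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(judge = paste0("j", 1:4), treatment = c("A", "B", "C")))
friedman.test(scores) #null hypothesis: the treatments have identical effects across blocks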
In [91]:
#here's an example of a Kruskal-Wallis test
states <- data.frame(state.region, state.x77) #add the region designation to the dataset
kruskal.test(Illiteracy ~ state.region, data = states) #apply the test
	Kruskal-Wallis rank sum test

data:  Illiteracy by state.region
Kruskal-Wallis chi-squared = 20, df = 3, p-value = 5e-05
In [92]:
#this significance test suggests that the illiteracy rate isn't the same in each of the four
#regions of the country (p-value < 0.001; p-value = 5e-05 = 5 * 10 ^ -5)
In [94]:
#nonparametric multiple comparisons using the wmc() function
source("http://www.statmethods.net/RiA/wmc.txt") #access the function
states <- data.frame(state.region, state.x77) #create the data frame
wmc(Illiteracy ~ state.region, data = states, method = "holm") #run the test
Descriptive Statistics

         West North Central Northeast  South
n      13.000        12.000     9.000 16.000
median  0.600         0.700     1.100  1.750
mad     0.148         0.148     0.297  0.593

Multiple Comparisons (Wilcoxon Rank Sum Tests)
Probability Adjustment = holm

        Group.1       Group.2    W        p    
1          West North Central 88.0 8.67e-01    
2          West     Northeast 46.5 8.67e-01    
3          West         South 39.0 1.79e-02   *
4 North Central     Northeast 20.5 5.36e-02   .
5 North Central         South  2.0 8.05e-05 ***
6     Northeast         South 18.0 1.19e-02   *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In [95]:
#you can see from the test results that the South differs significantly
#from the other three regions and that the other three regions do not
#differ from each other at a p < 0.05 level.