In [56]:
#@author: Venky Rao raove@us.ibm.com
#@last edited: 2 Sep 2017
#@source: materials, data and examples adapted from R in Action 2nd Edition by Dr. Robert Kabacoff

Advanced data preparation in R

A data management challenge

In [39]:
#let us begin this notebook by creating a dataset that we can analyze
#this dataset is about grades various students have received in different subjects
student <- c("John Davis", "Angela Williams", "Bullwinkle Moose", "David Jones", "Janice Markhammer", 
             "Cheryl Cushing", "Reuven Ytzrhak", "Greg Knox", "Joel England", "Mary Rayburn")
mathematics <- c(502, 600, 412, 358, 495, 512, 410, 625, 573, 522)
science <- c(95, 99, 80, 82, 75, 85, 80, 95, 89, 86)
english <- c(25, 22, 18, 15, 20, 28, 15, 30, 27, 18)
#create the data frame
grades <- data.frame("Student" = student, "Math" = mathematics, "Science" = science, "English" = english, stringsAsFactors = F)
grades
StudentMathScienceEnglish
John Davis 502 95 25
Angela Williams 600 99 22
Bullwinkle Moose 412 80 18
David Jones 358 82 15
Janice Markhammer495 75 20
Cheryl Cushing 512 85 28
Reuven Ytzrhak 410 80 15
Greg Knox 625 95 30
Joel England 573 89 27
Mary Rayburn 522 86 18
In [3]:
#our task is to:
# 1. combine the scores of the students into a single performance indicator
# 2. assign A grade to the top 20% of the students, B grade to the next 20% and so on
In [4]:
#considerations:
#several obstacles are immediately evident:
# 1. scores on the 3 exams are not comparable.  They have widely different means and standard deviations 
     # so averaging them does not make sense.  You must transform the scores into comparable units before
     # combining them
# 2. you will need a method of determining a student's percentile rank on this score to assign a grade
# 3. there is a single field for names complicating the task of sorting students.  You will need to split
     # split names into first name and last name in order to sort them properly

# let's review some key functions and then tackle these tasks

Numerical and character functions - some examples and applications

Calculating mean and standard deviation

In [5]:
# calculating the mean and standard deviation of a vector of numbers
mean(mathematics)
sd(mathematics)
500.9
86.6736535645188
In [6]:
#another approach to calculating mean and sd of a vector of numbers
n <- length(mathematics) #storing the length of the vector in a variable
meanMath <- sum(mathematics) / n #mean = sum divided by number of observations
css <- sum((mathematics - meanMath)^2) #css = corrected sum of squares
sdMath <- sqrt(css / (n - 1)) #standard deviation is the square root of css divided by (n - 1)
meanMath
sdMath
500.9
86.6736535645188

Generating pseudo-random numbers with a seed

In [7]:
#generating pseudo-random numbers from a uniform distribution
runif(5)
  1. 0.711534205358475
  2. 0.670486355898902
  3. 0.872814682777971
  4. 0.773019835585728
  5. 0.82747818948701
In [8]:
#you get a different set of numbers if you run this code again
runif(5)
  1. 0.543440421111882
  2. 0.771847827592865
  3. 0.0152768774423748
  4. 0.859203945845366
  5. 0.856453391257674
In [9]:
#to ensure you receive the same numbers, you should set the seed explicitly as follows:
set.seed(1234)
runif(5)
  1. 0.113703411305323
  2. 0.622299404814839
  3. 0.609274732880294
  4. 0.623379441676661
  5. 0.860915383556858
In [10]:
#let's try it again to confirm the results
set.seed(1234)
runif(5)
  1. 0.113703411305323
  2. 0.622299404814839
  3. 0.609274732880294
  4. 0.623379441676661
  5. 0.860915383556858

Generating multivariate normal data

In [11]:
#install the MASS package
install.packages("MASS")
Installing package into ‘/gpfs/global_fs01/sym_shared/YPProdSpark/user/s17c-9f3318fc11f06c-d37a4b9405b6/R/libs’
(as ‘lib’ is unspecified)
In [12]:
#call the MASS library
library(MASS)
Attaching package: ‘MASS’

The following object is masked from ‘package:SparkR’:

    select

In [13]:
#to create a sample of 500 observations for 3 variables with a normal distribution, use the following function:
# mvrnorm(n, mean, sigma) where n = sample size, mean = vector of means and sigma = variance-covariance (or correlation) matrix
size <- 500 #size of the sample
options(digits = 3) # 3 variables
set.seed(1234) # set the seed
mean <- c(230.7, 146.7, 3.6) #vector of specified means
sigma <- matrix(c(15360.8, 6721.2, -47.1, 
                  6721.2, 4700.9, -16.5,
                  -47.1, -16.5, 0.3), nrow = 3, ncol = 3) #covariance matrix
mydata <- mvrnorm(size, mean, sigma) #generate the data
mydata <- as.data.frame(mydata) # convert the data into a data frame
names(mydata) <- c("y", "x1", "x2") #name the columns of the data frame
dim(mydata) #view the dimensions of the data
head(mydata, n = 10) #view the first 10 observations of the dataset
  1. 500
  2. 3
yx1x2
98.8 41.33.43
244.5205.23.80
375.7186.72.51
-59.2 11.24.71
313.0111.03.45
288.8185.12.72
134.8165.04.39
171.7 97.43.64
167.2101.03.50
121.1 94.54.10

Applying functions to matrices and data frames

In [14]:
#the following examples demonstrate how to apply functions to data objects
a <- 5
sqrt(a)
2.23606797749979
In [15]:
b <- c(1.243, 5.654, 2.99)
round(b)
  1. 1
  2. 6
  3. 3
In [16]:
c <- matrix(runif(12), nrow = 3)
c
0.96360.216 0.289 0.913
0.20680.240 0.804 0.353
0.08620.197 0.378 0.931
In [17]:
log(c)
-0.0371-1.53 -1.241 -0.0912
-1.5762-1.43 -0.218 -1.0402
-2.4511-1.62 -0.972 -0.0710
In [18]:
mean(c)
0.464871924913799

Applying functions to the rows (or columns) of a matrix

In [19]:
mydata <- matrix(rnorm(30), nrow = 6) #generates data
mydata #displays data
0.459 1.203 1.234 0.591-0.281
-1.261 0.769-1.891-0.435 0.812
-0.527 0.238-0.223-0.251-0.208
-0.557-1.415 0.768-0.926 1.451
-0.374 2.934 0.388 1.087 0.841
-0.604 0.935 0.609-1.944-0.866
In [20]:
apply(mydata, 1, mean) #applies the mean function to the rows of the matrix
  1. 0.641087154125071
  2. -0.401344766887558
  3. -0.194047803717953
  4. -0.135845379716559
  5. 0.975243660012329
  6. -0.3739876460196
In [21]:
apply(mydata, 2, sd) #applies the standard deviation function to the columns of the matrix
  1. 0.552406205780717
  2. 1.41091655037894
  3. 1.10728458523771
  4. 1.08004486563685
  5. 0.875567958448012
In [22]:
apply(mydata, 2, mean, trim = 0.2) #calculates trimmed column means
# in this case, means based on the middle 60% of the data, top 20% and bottom 20% of the values are discarded
  1. -0.515779523050443
  2. 0.786344271134455
  3. 0.38564073126604
  4. -0.255415356807863
  5. 0.291311652370587
In [23]:
#apply() applies a function to an array; lapply() and sapply() apply to a list

A solution to the data management challenge

In [40]:
options(digits = 2) #limits the number of digits printed after the decimal place and makes the outputs easier to read
z <- scale(grades[, 2:4]) # scales the scores so that the variables (scores) are standardized so that each test is reported
                          # in standard deviation units rather than in their original scales
z #output z to look at the transformed values
MathScienceEnglish
0.013 1.078 0.587
1.143 1.591 0.037
-1.026-0.847-0.697
-1.649-0.590-1.247
-0.068-1.489-0.330
0.128-0.205 1.137
-1.049-0.847-1.247
1.432 1.078 1.504
0.832 0.308 0.954
0.243-0.077-0.697
In [41]:
score <- apply(z, 1, mean) # use the apply function to calculate the score for each student
score #display the score
  1. 0.559202782475718
  2. 0.923825924994712
  3. -0.85654139662578
  4. -1.16204727526912
  5. -0.62897755156843
  6. 0.353248480638513
  7. -1.04762416088104
  8. 1.33789336450671
  9. 0.697836120261835
  10. -0.176816288533107
In [42]:
grades <- cbind(grades, "Score" = score) # add the score as a column to the data frame
grades
StudentMathScienceEnglishScore
John Davis 502 95 25 0.56
Angela Williams 600 99 22 0.92
Bullwinkle Moose 412 80 18 -0.86
David Jones 358 82 15 -1.16
Janice Markhammer495 75 20 -0.63
Cheryl Cushing 512 85 28 0.35
Reuven Ytzrhak 410 80 15 -1.05
Greg Knox 625 95 30 1.34
Joel England 573 89 27 0.70
Mary Rayburn 522 86 18 -0.18
In [43]:
y <- quantile(score, c(0.8, 0.6, 0.4, 0.2)) # gives the percentile rank of each student's performance score
y # displays the quantiles
80%
0.74303408120841
60%
0.435630201373395
40%
-0.357680793747236
20%
-0.894757949476833
In [44]:
#grade the students
grades$Grade[score >= y[1]] <- "A"
grades$Grade[score < y[1] & score >= y[2]] <- "B"
grades$Grade[score < y[2] & score >= y[3]] <- "C"
grades$Grade[score < y[3] & score >= y[4]] <- "D"
grades$Grade[score < y[4]] <- "F"
grades #print the dataset with the new Grade column
StudentMathScienceEnglishScoreGrade
John Davis 502 95 25 0.56 B
Angela Williams 600 99 22 0.92 A
Bullwinkle Moose 412 80 18 -0.86 D
David Jones 358 82 15 -1.16 F
Janice Markhammer495 75 20 -0.63 D
Cheryl Cushing 512 85 28 0.35 C
Reuven Ytzrhak 410 80 15 -1.05 F
Greg Knox 625 95 30 1.34 A
Joel England 573 89 27 0.70 B
Mary Rayburn 522 86 18 -0.18 C
In [45]:
#extract the first and last names
name <- strsplit((grades$Student), " ") #splits each row in the name column into 2 character vectors
name
    1. 'John'
    2. 'Davis'
    1. 'Angela'
    2. 'Williams'
    1. 'Bullwinkle'
    2. 'Moose'
    1. 'David'
    2. 'Jones'
    1. 'Janice'
    2. 'Markhammer'
    1. 'Cheryl'
    2. 'Cushing'
    1. 'Reuven'
    2. 'Ytzrhak'
    1. 'Greg'
    2. 'Knox'
    1. 'Joel'
    2. 'England'
    1. 'Mary'
    2. 'Rayburn'
In [46]:
Lastname <- sapply(name, "[", 2) #extracts the last name
Lastname
  1. 'Davis'
  2. 'Williams'
  3. 'Moose'
  4. 'Jones'
  5. 'Markhammer'
  6. 'Cushing'
  7. 'Ytzrhak'
  8. 'Knox'
  9. 'England'
  10. 'Rayburn'
In [47]:
Firstname <- sapply(name, "[", 1) #extracts the first name
Firstname
  1. 'John'
  2. 'Angela'
  3. 'Bullwinkle'
  4. 'David'
  5. 'Janice'
  6. 'Cheryl'
  7. 'Reuven'
  8. 'Greg'
  9. 'Joel'
  10. 'Mary'
In [48]:
grades
StudentMathScienceEnglishScoreGrade
John Davis 502 95 25 0.56 B
Angela Williams 600 99 22 0.92 A
Bullwinkle Moose 412 80 18 -0.86 D
David Jones 358 82 15 -1.16 F
Janice Markhammer495 75 20 -0.63 D
Cheryl Cushing 512 85 28 0.35 C
Reuven Ytzrhak 410 80 15 -1.05 F
Greg Knox 625 95 30 1.34 A
Joel England 573 89 27 0.70 B
Mary Rayburn 522 86 18 -0.18 C
In [49]:
grades <- cbind(Firstname, Lastname, grades[,-1]) #adds the firstname and lastname columns and drops the original name column
grades
FirstnameLastnameMathScienceEnglishScoreGrade
John Davis 502 95 25 0.56 B
Angela Williams 600 99 22 0.92 A
BullwinkleMoose 412 80 18 -0.86 D
David Jones 358 82 15 -1.16 F
Janice Markhammer495 75 20 -0.63 D
Cheryl Cushing 512 85 28 0.35 C
Reuven Ytzrhak 410 80 15 -1.05 F
Greg Knox 625 95 30 1.34 A
Joel England 573 89 27 0.70 B
Mary Rayburn 522 86 18 -0.18 C
In [50]:
grades <- grades[order(Lastname, Firstname), ] #sorts the data frame by last name and then by first name
grades # prints the final result
FirstnameLastnameMathScienceEnglishScoreGrade
6Cheryl Cushing 512 85 28 0.35 C
1John Davis 502 95 25 0.56 B
9Joel England 573 89 27 0.70 B
4David Jones 358 82 15 -1.16 F
8Greg Knox 625 95 30 1.34 A
5Janice Markhammer495 75 20 -0.63 D
3BullwinkleMoose 412 80 18 -0.86 D
10Mary Rayburn 522 86 18 -0.18 C
2Angela Williams 600 99 22 0.92 A
7Reuven Ytzrhak 410 80 15 -1.05 F

Control flow

The FOR loop

In [35]:
#the for loop executes a statement repetitively until a variable's value is no longer contained in the sequence
#for example:
for (i in 1:10) {
    print ("Hello")
} #hello is printed 10 times
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"

The WHILE loop

In [36]:
#a while loop executes a statement repetitively until the condition is no longer true
i <- 10
while (i > 0) {
    print ("Hello")
    i <- i - 1
} #prints hello 10 times
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"

The IF-ELSE statement

In [52]:
#the if-else control structure executes a statement if a given condition is true.
#optionally, a different statement is executed if the conditon is false
#example
if (is.character(grades$Grade)){
    grades$Grade <- as.factor(grades$Grade)
}
if (!is.factor(grades$Grade)){
    grades$Grade <- as.factor(grades$Grade)
} else {
    print ("Grade already is a factor.")
}
[1] "Grade already is a factor."

The IFELSE statement

In [54]:
#the ifelse statement is a compact and vectorized version of the if-else statement
#example:
# ifelse(grades$score > 0.5, print ("Passed"), print ("Failed"))
# outcome <- ifelse(score > 0.5, "Passed", "Failed")

The SWITCH statement

In [55]:
#switch choose statements based on the value of an expression. example:
feelings <- c("sad", "afraid")
for (i in feelings) {
    print(
    switch(i,
          happy = "I am glad you are happy",
          afraid = "There is nothing to fear",
          sad = "Cheer up",
          angry = "Calm down"
          )
    )
}
[1] "Cheer up"
[1] "There is nothing to fear"

User-written functions

In [57]:
#let's say you want a function that calculates the central tendency and spread of data objects.
#the function should give you a choice between parametric (mean and standard deviation) and
#non-parametric (median and median absolute deviation) statistics.
#the results should be returned as a named list.
#additionally, the user should have a choice of printing the results or not.
#unless otherwise specified, the function's default behavior should be to calculate parametric statistics
#and not print the results.  one way of achieving this is provided below.

mystats() : a user-written function for summary statistics

In [59]:
mystats <- function(x, parametric = T, print = F) { #set parametric as default; set no printing as default
    if (parametric) {
        center <- mean(x)
        spread <- sd(x)
    } else {
        center <- median(x)
        spread <- mad(x)
    }
    if (print & parametric) {
        cat ("Mean = ", center, "\n", "SD = ", spread, "\n")
        } else if (print & !parametric) {
        cat ("Median =", center, "\n", "MAD = ", spread, "\n")
        }
    result <- list(center = center, spread = spread)
    return(result)
}
In [61]:
#to see this function in action, first generate some data (a random sample of size 500 from a normal distribution)
set.seed(1234)
x <- rnorm(500)
In [62]:
y <- mystats(x) #parametric stats are calculated but not printed
In [63]:
y <- mystats(x, parametric = F, print = T) #non-parametric stats are calculated and printed
Median = -0.021 
 MAD =  1 
In [64]:
#another example of a user-written function that uses the Switch statement
#this function gives the user a choice regarding the format of today's date
#values that are assigned to parameters in the function declaration are taken as defaults
#long is the default format for dates in this function if type isn't specified
mydate <- function(type = "long") {
    switch(type,
          long = format(Sys.time(), "%A %B %d %Y"),
          short = format(Sys.time(), "%m-%d-%y"),
          cat(type, "is not a recognized type\n")
          )
}
In [65]:
#here is the function in action:
mydate("long")
'Friday September 01 2017'
In [66]:
mydate("short")
'09-01-17'
In [68]:
mydate() #default type is long
'Friday September 01 2017'
In [70]:
mydate("medium") #should return error message
medium is not a recognized type

Aggregation and reshaping

Transposing a matrix

In [71]:
#transposing a matrix
cars <- mtcars[1:5, 1:4] #store a subset of the mtcars dataset into the cars object
cars #display the cars object
mpgcyldisphp
Mazda RX421 6 160110
Mazda RX4 Wag21 6 160110
Datsun 71023 4 108 93
Hornet 4 Drive21 6 258110
Hornet Sportabout19 8 360175
In [72]:
#transpose the cars object:
t(cars)
Mazda RX4Mazda RX4 WagDatsun 710Hornet 4 DriveHornet Sportabout
mpg 21 21 23 21 19
cyl 6 6 4 6 8
disp160160108258360
hp110110 93110175

Aggregating data

In [73]:
#aggregating data using the aggregate() function
options(digits = 3) #specify the maximum number of digits for easy readability
attach(mtcars) #attach mtcars dataset
aggdata <- aggregate(mtcars, by = list(cyl, gear), FUN = mean, na.rm = T) #aggregate by means of number of cylinder and gears
aggdata #print the results
detach(mtcars) #detach the dataset
Group.1Group.2mpgcyldisphpdratwtqsecvsamgearcarb
4 3 21.54 120 97 3.702.4620.01.0 0.003 1.00
6 3 19.86 242 108 2.923.3419.81.0 0.003 1.00
8 3 15.18 358 194 3.124.1017.10.0 0.003 3.08
4 4 26.94 103 76 4.112.3819.61.0 0.754 1.50
6 4 19.86 164 116 3.913.0917.70.5 0.504 4.00
4 5 28.24 108 102 4.101.8316.80.5 1.005 2.00
6 5 19.76 145 175 3.622.7715.50.0 1.005 6.00
8 5 15.48 326 300 3.883.3714.60.0 1.005 6.00

The reshape2 package

In [74]:
#reshape2 package is a tremendously versatile approach to both restructuring and aggregating datasets
#install the reshape2 package
install.packages("reshape2")
Installing package into ‘/gpfs/global_fs01/sym_shared/YPProdSpark/user/s17c-9f3318fc11f06c-d37a4b9405b6/R/libs’
(as ‘lib’ is unspecified)
In [75]:
#load the reshape2 library
library(reshape2)
In [76]:
#create a dataset
id <- c(1, 1, 2, 2)
time <- c(1, 2, 1, 2)
x1 <- c(5, 3, 6, 2)
x2 <- c(6, 5, 1, 4)
mydata <- data.frame("ID" = id, "Time" = time, "X1" = x1, "X2" = x2)
mydata
IDTimeX1X2
1156
1235
2161
2224
In [78]:
#"melt"ing the dataset using the "melt" function
md <- melt(mydata, id = c("ID", "Time"))
md
IDTimevariablevalue
1 1 X15
1 2 X13
2 1 X16
2 2 X12
1 1 X26
1 2 X25
2 1 X21
2 2 X24
In [80]:
#"cast"ing the dataset using the dcast() function
newdata <- dcast(md, ID + Time ~ variable)
newdata
IDTimeX1X2
1156
1235
2161
2224
In [81]:
#another result using different dcast() parameters
newdata <- dcast(md, ID + variable ~ Time)
newdata
IDvariable12
1 X15 3
1 X26 5
2 X16 2
2 X21 4
In [82]:
#another result using different dcast() parameters
newdata <- dcast(md, ID ~ variable + Time)
newdata
IDX1_1X1_2X2_1X2_2
15365
26214