In [90]:
#@author: Venky Rao raove@us.ibm.com
#@last edited: 22 Aug 2017
#@source: materials, data and examples adapted from R in Action 2nd Edition by Dr. Robert Kabacoff

Data preparation in R

Creating a Data Frame

In [4]:
#let us begin by creating a data frame of data for us to analyze
manager <- c(1, 2, 3, 4, 5)
date <- c("10/24/08", "10/28/08", "10/1/08", "10/12/08", "5/1/09")
country <- c("US", "US", "UK", "UK", "UK")
gender <- c("M", "F", "F", "M", "F")
age <- c(32, 45, 25, 39, 99)
q1 <- c(5, 3, 3, 3, 2)
q2 <- c(4, 5, 5, 3, 2)
q3 <- c(5, 2, 5, 4, 1)
q4 <- c(5, 5, 5, NA, 2)
q5 <- c(5, 5, 2, NA, 1)
leadership <- data.frame(manager, date, country, gender, age, q1, q2, q3, q4, q5, stringsAsFactors = F)
#if stringsAsFactors = T, then dates would be treated as factors
In [5]:
#display the data frame that we created
leadership
managerdatecountrygenderageq1q2q3q4q5
1 10/24/08US M 32 5 4 5 5 5
2 10/28/08US F 45 3 5 2 5 5
3 10/1/08 UK F 25 3 5 5 5 2
4 10/12/08UK M 39 3 3 4 NA NA
5 5/1/09 UK F 99 2 2 1 2 1

Creating new variables

In [6]:
#data preparation often involves creating new variables in a data frame
#let's create a new data frame to explore this further
mydata <- data.frame(x1 = c(2, 2, 6, 4), x2 = c(3, 4, 2, 8))
mydata
x1x2
23
24
62
48
In [7]:
#let's say we want to create two new variables in the data frame:
#sumx which is the sum of x1 and x2; and
#meanx which is the mean of x1 and x2
#here are 3 ways of doing this:
In [8]:
#option 1
mydata$sumx <- mydata$x1 + mydata$x2
mydata$meanx <- (mydata$x1 + mydata$x2) / 2
mydata
x1x2sumxmeanx
2 3 5 2.5
2 4 6 3.0
6 2 8 4.0
4 8 12 6.0
In [9]:
#option 2
attach(mydata)
mydata$sumx <- x1 + x2
mydata$meanx <- (x1 + x2) / 2
detach(mydata)
mydata
x1x2sumxmeanx
2 3 5 2.5
2 4 6 3.0
6 2 8 4.0
4 8 12 6.0
In [10]:
#option 3
mydata <- transform(mydata, sumx = x1 + x2, meanx = (x1 + x2) / 2)
mydata
x1x2sumxmeanx
2 3 5 2.5
2 4 6 3.0
6 2 8 4.0
4 8 12 6.0

Recoding variables

In [11]:
#recoding involves creating new values of a variable conditional on the existing values of
#the same and / or other variables.  for example: you may want to
#1. change a continuous variable into a set of categories
#2. replace miscoded values with correct values
#3. create a pass / fail variable based on a set of cutoff scores
#to recode data, you use one of R's logical operators (i.e. expressions that return TRUE or FALSE)
In [14]:
#example: recode ages of managers in the "leadership" data frame from continuous to categorical
#step 1: recode age 99 to indicate missing value since 99 is obviously incorrect
leadership$age[leadership$age == 99] <- NA
leadership
managerdatecountrygenderageq1q2q3q4q5
1 10/24/08US M 32 5 4 5 5 5
2 10/28/08US F 45 3 5 2 5 5
3 10/1/08 UK F 25 3 5 5 5 2
4 10/12/08UK M 39 3 3 4 NA NA
5 5/1/09 UK F NA 2 2 1 2 1
In [15]:
#step 2: create the categories
leadership <- within(leadership, {
    agecat <- NA
    agecat[age > 75] <- "Elder"
    agecat[age >= 55 & age <= 75] <- "Middle Aged"
    agecat[age < 55] <- "Young"
})
leadership
managerdatecountrygenderageq1q2q3q4q5agecat
1 10/24/08US M 32 5 4 5 5 5 Young
2 10/28/08US F 45 3 5 2 5 5 Young
3 10/1/08 UK F 25 3 5 5 5 2 Young
4 10/12/08UK M 39 3 3 4 NA NA Young
5 5/1/09 UK F NA 2 2 1 2 1 NA

Renaming variables

In [17]:
#here's one way:
names(leadership)[2] <- "testDate"
leadership
managertestDatecountrygenderageq1q2q3q4q5agecat
1 10/24/08US M 32 5 4 5 5 5 Young
2 10/28/08US F 45 3 5 2 5 5 Young
3 10/1/08 UK F 25 3 5 5 5 2 Young
4 10/12/08UK M 39 3 3 4 NA NA Young
5 5/1/09 UK F NA 2 2 1 2 1 NA
In [18]:
#another example:
names(leadership)[6:10] <- c("item1", "item2", "item3", "item4", "item5")
leadership
managertestDatecountrygenderageitem1item2item3item4item5agecat
1 10/24/08US M 32 5 4 5 5 5 Young
2 10/28/08US F 45 3 5 2 5 5 Young
3 10/1/08 UK F 25 3 5 5 5 2 Young
4 10/12/08UK M 39 3 3 4 NA NA Young
5 5/1/09 UK F NA 2 2 1 2 1 NA
In [21]:
#another way:
#using the rename function in the "plyr" package
install.packages("plyr")
Installing package into ‘/gpfs/global_fs01/sym_shared/YPProdSpark/user/s17c-9f3318fc11f06c-d37a4b9405b6/R/libs’
(as ‘lib’ is unspecified)
In [22]:
#load libraries
library(plyr)
Attaching package: ‘plyr’

The following objects are masked from ‘package:SparkR’:

    arrange, count, desc, join, mutate, rename, summarize, take

In [23]:
#use the rename() function
leadership <- rename(leadership, c(manager = "managerID"))
leadership
managerIDtestDatecountrygenderageitem1item2item3item4item5agecat
1 10/24/08US M 32 5 4 5 5 5 Young
2 10/28/08US F 45 3 5 2 5 5 Young
3 10/1/08 UK F 25 3 5 5 5 2 Young
4 10/12/08UK M 39 3 3 4 NA NA Young
5 5/1/09 UK F NA 2 2 1 2 1 NA

Missing values

In [24]:
#identifying missing values using the is.na() function
#example:
y <- c(1, 2, 3, NA)
is.na(y)
  1. FALSE
  2. FALSE
  3. FALSE
  4. TRUE
In [25]:
#another example
is.na(leadership)
managerIDtestDatecountrygenderageitem1item2item3item4item5agecat
FALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSE
FALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSE
FALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSE
FALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSE TRUE TRUEFALSE
FALSEFALSEFALSEFALSE TRUEFALSEFALSEFALSEFALSEFALSE TRUE

Excluding missing values from analyses

In [29]:
#sometimes you need to eliminate missing vaues from your analysis.  for example:
x <- c(1, 2, NA, 3)
y <- x[1] + x[2] + x[3] + x[4] #will result in NA
y
[1] NA
In [30]:
z <- sum(x) #will result in NA
z
[1] NA
In [31]:
#one way to fix this is to use the na.rm = T option, as follows:
z <- sum(x, na.rm = T)
z
6
In [32]:
#another way to remove missing data is by using the na.omit() function
leadership #without removing missing data
managerIDtestDatecountrygenderageitem1item2item3item4item5agecat
1 10/24/08US M 32 5 4 5 5 5 Young
2 10/28/08US F 45 3 5 2 5 5 Young
3 10/1/08 UK F 25 3 5 5 5 2 Young
4 10/12/08UK M 39 3 3 4 NA NA Young
5 5/1/09 UK F NA 2 2 1 2 1 NA
In [33]:
newdata <- na.omit(leadership)
newdata
managerIDtestDatecountrygenderageitem1item2item3item4item5agecat
1 10/24/08US M 32 5 4 5 5 5 Young
2 10/28/08US F 45 3 5 2 5 5 Young
3 10/1/08 UK F 25 3 5 5 5 2 Young

Date values

In [38]:
#convert "testDate" into a date type by using the as.Date() function
#default format in R is yyyy-mm-dd
myformat <- "%m/%d/%y"
leadership$testDate <- as.Date(leadership$testDate, myformat)
leadership
managerIDtestDatecountrygenderageitem1item2item3item4item5agecat
1 2008-10-24US M 32 5 4 5 5 5 Young
2 2008-10-28US F 45 3 5 2 5 5 Young
3 2008-10-01UK F 25 3 5 5 5 2 Young
4 2008-10-12UK M 39 3 3 4 NA NA Young
5 2009-05-01UK F NA 2 2 1 2 1 NA
In [39]:
#when R stores dates internally, they're represented as number of days since 1 Jan 1970 with negative values for earlier dates
#so you can perform arithmetic operations on them
startDate <- as.Date("2004-02-13")
endDate <- as.Date("2011-01-22")
days <- endDate - startDate
days
Time difference of 2535 days
In [40]:
#you can use the difftime() function to calculate a time interval
#for example if you want to calculate how old someone born on 29 May 2015 is, you could do it as follows:
today <- Sys.Date()
dob <- as.Date("2015-05-29")
difftime(today, dob, units = "weeks")
Time difference of 116.4286 weeks
In [41]:
#you can convert date variables into character variables using the as.character() function
#this conversion allows you to apply a range of character functions to the data values
#(subsetting, replacement, concatenation, etc)

Type conversions

In [42]:
#example of converting one data type into another
a <- c(1, 2, 3)
is.numeric(a)
TRUE
In [43]:
is.vector(a)
TRUE
In [47]:
b <- as.character(a)
b
  1. '1'
  2. '2'
  3. '3'
In [48]:
is.numeric(b)
FALSE
In [49]:
is.vector(b)
TRUE
In [50]:
is.character(b)
TRUE

Sorting data

In [51]:
#the order() function lets you sort a data frame in R.  The default sorting order is ascending
#prepend the sorting variable with a minus sign to indicate descending order
#example:
newdata <- leadership[order(leadership$age),] #creates a new dataset sorted from youngest manager to oldest manager
newdata
managerIDtestDatecountrygenderageitem1item2item3item4item5agecat
33 2008-10-01UK F 25 3 5 5 5 2 Young
11 2008-10-24US M 32 5 4 5 5 5 Young
44 2008-10-12UK M 39 3 3 4 NA NA Young
22 2008-10-28US F 45 3 5 2 5 5 Young
55 2009-05-01UK F NA 2 2 1 2 1 NA
In [52]:
#another example
attach(leadership)
newdata <- leadership[order(gender, age),] #creates a new dataset sorting rows into female, followed by male
                                           #and youngest to oldest within each gender
detach(leadership)
newdata
The following objects are masked _by_ .GlobalEnv:

    age, country, gender

managerIDtestDatecountrygenderageitem1item2item3item4item5agecat
33 2008-10-01UK F 25 3 5 5 5 2 Young
22 2008-10-28US F 45 3 5 2 5 5 Young
55 2009-05-01UK F NA 2 2 1 2 1 NA
11 2008-10-24US M 32 5 4 5 5 5 Young
44 2008-10-12UK M 39 3 3 4 NA NA Young
In [53]:
#another example:
attach(leadership)
newdata <- leadership[order(gender, -age),] #creates a new dataset sorting rows into female, followed by male
                                            #and oldest to youngest within each gender
detach(leadership)
newdata
The following objects are masked _by_ .GlobalEnv:

    age, country, gender

managerIDtestDatecountrygenderageitem1item2item3item4item5agecat
55 2009-05-01UK F NA 2 2 1 2 1 NA
22 2008-10-28US F 45 3 5 2 5 5 Young
33 2008-10-01UK F 25 3 5 5 5 2 Young
44 2008-10-12UK M 39 3 3 4 NA NA Young
11 2008-10-24US M 32 5 4 5 5 5 Young

Merging datasets

In [55]:
#if your data exists in different locations, you will need to combine it

Adding columns to a data frame

In [54]:
#adding columns to a data frame:
#to merge two data frames horizontally, you use the merge() function
#syntax: total <- merge(dataFrameA, dataFrameB, by = "ID") #by = the common key variable

Horizontal concatenation

In [56]:
#another way to merge two matrices or data frames where you don't need to specify a common key,
#is by using the cbind() function.  syntax:
#total <- cbind(a, b)
#cbind() concatenates objects a and b.  each object must have the same number of rows and be sorted in the same order

Adding rows to a data frame

In [57]:
#use the rbind() function to add rows to a data frame.  syntax:
#total <- rbind(a, b)
#a and b must have the same number of variables but they don't need to be in the same order
#if a has more variables than b, then either 1) delete the extra variables or 2) create the same variable in b and assign them to NAs

Subsetting datasets

In [58]:
#R has powerful indexing features to select and exclude variables, observations or both

Selecting (keeping) variables

In [60]:
#selecting using [row indices, column indices].  example:
newdata <- leadership[, c(6:10)] #selects all rows and columns 6 through 10 of the leadership data frame
newdata
item1item2item3item4item5
5 4 5 5 5
3 5 2 5 5
3 5 5 5 2
3 3 4 NANA
2 2 1 2 1
In [62]:
#another example
myvars <- c("item1", "item2", "item3", "item4", "item5")
newdata <- leadership[myvars]
newdata
item1item2item3item4item5
5 4 5 5 5
3 5 2 5 5
3 5 5 5 2
3 3 4 NANA
2 2 1 2 1

Excluding (dropping) variables

In [63]:
#there are many reasons to exclude variables.  for example, if a variable has many missing values,
#you may want to drop it from further analysis
In [64]:
#here is an example: lets say you want to drop item4 and item5 from the leadership data frame.
#here is one way you can do this:
myvars <- names(leadership) %in% c("item4", "item5") #names(leadership) produces a character vector with names of all variables
                           #names(leadership) %in% c("item4", "item5") returns a logical vector with TRUE for each element in
                           #names(leadership) that matches either "item4" or "item5"
newdata <- leadership[!myvars] #the ! operator reverses the logical values
newdata
managerIDtestDatecountrygenderageitem1item2item3agecat
1 2008-10-24US M 32 5 4 5 Young
2 2008-10-28US F 45 3 5 2 Young
3 2008-10-01UK F 25 3 5 5 Young
4 2008-10-12UK M 39 3 3 4 Young
5 2009-05-01UK F NA 2 2 1 NA
In [65]:
#another way of dropping variables
#since item4 is column 9 and item 5 is column 10, you can also do:
newdata <- leadership[c(-9, -10)]
newdata
managerIDtestDatecountrygenderageitem1item2item3agecat
1 2008-10-24US M 32 5 4 5 Young
2 2008-10-28US F 45 3 5 2 Young
3 2008-10-01UK F 25 3 5 5 Young
4 2008-10-12UK M 39 3 3 4 Young
5 2009-05-01UK F NA 2 2 1 NA

Selecting observations

In [66]:
#selecting or excluding observations is a key aspect of data preparation
#here are some examples:
newdata <- leadership[1:3,] #selects the first three rows of data
newdata
managerIDtestDatecountrygenderageitem1item2item3item4item5agecat
1 2008-10-24US M 32 5 4 5 5 5 Young
2 2008-10-28US F 45 3 5 2 5 5 Young
3 2008-10-01UK F 25 3 5 5 5 2 Young
In [67]:
newdata <- leadership[leadership$gender == "M" & #selects all men aged > 30
                     leadership$age > 30,]
newdata
managerIDtestDatecountrygenderageitem1item2item3item4item5agecat
11 2008-10-24US M 32 5 4 5 5 5 Young
44 2008-10-12UK M 39 3 3 4 NA NA Young
In [68]:
#another way of selecting men aged > 30 by attaching the leadership data frame
attach(leadership)
newdata <- leadership[gender == "M" & age > 30,]
detach(leadership)
newdata
The following objects are masked _by_ .GlobalEnv:

    age, country, gender

managerIDtestDatecountrygenderageitem1item2item3item4item5agecat
11 2008-10-24US M 32 5 4 5 5 5 Young
44 2008-10-12UK M 39 3 3 4 NA NA Young
In [69]:
#limit analysis to observations collected between 01/01/2009 and 12/31/2009
newdata <- leadership[leadership$testDate >= "2009-01-01" & leadership$testDate <= "2009-12-31",]
newdata
managerIDtestDatecountrygenderageitem1item2item3item4item5agecat
55 2009-05-01UK F NA 2 2 1 2 1 NA

The subset() function

In [71]:
#the subset() function is a really powerful function to select or exclude variables and observations
#examples:
#this statement selects all observations where the age is either 35 or over OR under 24 and returns 4 columns (items 1 through 4)
newdata <- subset(leadership, age >= 35 | age < 24, select = c(item1, item2, item3, item4))
newdata
item1item2item3item4
23 5 2 5
43 3 4 NA
In [73]:
#another example
#this statement selects all men > 25 years old and all columns from gender to item 5 (and everything in between)
newdata <- subset(leadership, gender == "M" & age > 25, select = gender:item5)
newdata
genderageitem1item2item3item4item5
1M 325 4 5 5 5
4M 393 3 4 NANA

Random samples

In [81]:
#sampling from larger datasets is a common practice in data mining and machine learning
#for example, you may want to create a predictive model from one sample and validate its effectiveness on the other
#the sample() function lets you take a random sample (with or without replacement) from size n of a dataset
#R has extensive facilities for sampling (see the "sampling" package)

Using SQL statements to manipulate data frames

In [89]:
#using the "sqldf" package, you can use SQL statements within R
#experienced SQl users will find the sqldf package a useful addition to the R data preparation toolkit