In [90]:
#@author: Venky Rao raove@us.ibm.com
#@last edited: 22 Aug 2017
#@source: materials, data and examples adapted from R in Action 2nd Edition by Dr. Robert Kabacoff


# Data preparation in R¶

## Creating a Data Frame¶

In [4]:
#let us begin by creating a data frame of data for us to analyze
manager <- c(1, 2, 3, 4, 5)
date <- c("10/24/08", "10/28/08", "10/1/08", "10/12/08", "5/1/09")
country <- c("US", "US", "UK", "UK", "UK")
gender <- c("M", "F", "F", "M", "F")
age <- c(32, 45, 25, 39, 99)
q1 <- c(5, 3, 3, 3, 2)
q2 <- c(4, 5, 5, 3, 2)
q3 <- c(5, 2, 5, 4, 1)
q4 <- c(5, 5, 5, NA, 2)
q5 <- c(5, 5, 2, NA, 1)
leadership <- data.frame(manager, date, country, gender, age, q1, q2, q3, q4, q5, stringsAsFactors = F)
#if stringsAsFactors = T, then dates would be treated as factors

In [5]:
#display the data frame that we created

managerdatecountrygenderageq1q2q3q4q5
1 10/24/08US M 32 5 4 5 5 5
2 10/28/08US F 45 3 5 2 5 5
3 10/1/08 UK F 25 3 5 5 5 2
4 10/12/08UK M 39 3 3 4 NA NA
5 5/1/09 UK F 99 2 2 1 2 1

## Creating new variables¶

In [6]:
#data preparation often involves creating new variables in a data frame
#let's create a new data frame to explore this further
mydata <- data.frame(x1 = c(2, 2, 6, 4), x2 = c(3, 4, 2, 8))
mydata

x1x2
23
24
62
48
In [7]:
#let's say we want to create two new variables in the data frame:
#sumx which is the sum of x1 and x2; and
#meanx which is the mean of x1 and x2
#here are 3 ways of doing this:

In [8]:
#option 1
mydata$sumx <- mydata$x1 + mydata$x2 mydata$meanx <- (mydata$x1 + mydata$x2) / 2
mydata

x1x2sumxmeanx
2 3 5 2.5
2 4 6 3.0
6 2 8 4.0
4 8 12 6.0
In [9]:
#option 2
attach(mydata)
mydata$sumx <- x1 + x2 mydata$meanx <- (x1 + x2) / 2
detach(mydata)
mydata

x1x2sumxmeanx
2 3 5 2.5
2 4 6 3.0
6 2 8 4.0
4 8 12 6.0
In [10]:
#option 3
mydata <- transform(mydata, sumx = x1 + x2, meanx = (x1 + x2) / 2)
mydata

x1x2sumxmeanx
2 3 5 2.5
2 4 6 3.0
6 2 8 4.0
4 8 12 6.0

## Recoding variables¶

In [11]:
#recoding involves creating new values of a variable conditional on the existing values of
#the same and / or other variables.  for example: you may want to
#1. change a continuous variable into a set of categories
#2. replace miscoded values with correct values
#3. create a pass / fail variable based on a set of cutoff scores
#to recode data, you use one of R's logical operators (i.e. expressions that return TRUE or FALSE)

In [14]:
#example: recode ages of managers in the "leadership" data frame from continuous to categorical
#step 1: recode age 99 to indicate missing value since 99 is obviously incorrect
leadership$age[leadership$age == 99] <- NA

managerdatecountrygenderageq1q2q3q4q5
1 10/24/08US M 32 5 4 5 5 5
2 10/28/08US F 45 3 5 2 5 5
3 10/1/08 UK F 25 3 5 5 5 2
4 10/12/08UK M 39 3 3 4 NA NA
5 5/1/09 UK F NA 2 2 1 2 1
In [15]:
#step 2: create the categories
agecat <- NA
agecat[age > 75] <- "Elder"
agecat[age >= 55 & age <= 75] <- "Middle Aged"
agecat[age < 55] <- "Young"
})

managerdatecountrygenderageq1q2q3q4q5agecat
1 10/24/08US M 32 5 4 5 5 5 Young
2 10/28/08US F 45 3 5 2 5 5 Young
3 10/1/08 UK F 25 3 5 5 5 2 Young
4 10/12/08UK M 39 3 3 4 NA NA Young
5 5/1/09 UK F NA 2 2 1 2 1 NA

## Renaming variables¶

In [17]:
#here's one way:

managertestDatecountrygenderageq1q2q3q4q5agecat
1 10/24/08US M 32 5 4 5 5 5 Young
2 10/28/08US F 45 3 5 2 5 5 Young
3 10/1/08 UK F 25 3 5 5 5 2 Young
4 10/12/08UK M 39 3 3 4 NA NA Young
5 5/1/09 UK F NA 2 2 1 2 1 NA
In [18]:
#another example:
names(leadership)[6:10] <- c("item1", "item2", "item3", "item4", "item5")

managertestDatecountrygenderageitem1item2item3item4item5agecat
1 10/24/08US M 32 5 4 5 5 5 Young
2 10/28/08US F 45 3 5 2 5 5 Young
3 10/1/08 UK F 25 3 5 5 5 2 Young
4 10/12/08UK M 39 3 3 4 NA NA Young
5 5/1/09 UK F NA 2 2 1 2 1 NA
In [21]:
#another way:
#using the rename function in the "plyr" package
install.packages("plyr")

Installing package into ‘/gpfs/global_fs01/sym_shared/YPProdSpark/user/s17c-9f3318fc11f06c-d37a4b9405b6/R/libs’
(as ‘lib’ is unspecified)

In [22]:
#load libraries
library(plyr)

Attaching package: ‘plyr’

The following objects are masked from ‘package:SparkR’:

arrange, count, desc, join, mutate, rename, summarize, take


In [23]:
#use the rename() function

managerIDtestDatecountrygenderageitem1item2item3item4item5agecat
1 10/24/08US M 32 5 4 5 5 5 Young
2 10/28/08US F 45 3 5 2 5 5 Young
3 10/1/08 UK F 25 3 5 5 5 2 Young
4 10/12/08UK M 39 3 3 4 NA NA Young
5 5/1/09 UK F NA 2 2 1 2 1 NA

## Missing values¶

In [24]:
#identifying missing values using the is.na() function
#example:
y <- c(1, 2, 3, NA)
is.na(y)

1. FALSE
2. FALSE
3. FALSE
4. TRUE
In [25]:
#another example

managerIDtestDatecountrygenderageitem1item2item3item4item5agecat
FALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSE
FALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSE
FALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSE
FALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSE TRUE TRUEFALSE
FALSEFALSEFALSEFALSE TRUEFALSEFALSEFALSEFALSEFALSE TRUE

### Excluding missing values from analyses¶

In [29]:
#sometimes you need to eliminate missing vaues from your analysis.  for example:
x <- c(1, 2, NA, 3)
y <- x[1] + x[2] + x[3] + x[4] #will result in NA
y

[1] NA
In [30]:
z <- sum(x) #will result in NA
z

[1] NA
In [31]:
#one way to fix this is to use the na.rm = T option, as follows:
z <- sum(x, na.rm = T)
z

6
In [32]:
#another way to remove missing data is by using the na.omit() function

managerIDtestDatecountrygenderageitem1item2item3item4item5agecat
1 10/24/08US M 32 5 4 5 5 5 Young
2 10/28/08US F 45 3 5 2 5 5 Young
3 10/1/08 UK F 25 3 5 5 5 2 Young
4 10/12/08UK M 39 3 3 4 NA NA Young
5 5/1/09 UK F NA 2 2 1 2 1 NA
In [33]:
newdata <- na.omit(leadership)
newdata

managerIDtestDatecountrygenderageitem1item2item3item4item5agecat
1 10/24/08US M 32 5 4 5 5 5 Young
2 10/28/08US F 45 3 5 2 5 5 Young
3 10/1/08 UK F 25 3 5 5 5 2 Young

## Date values¶

In [38]:
#convert "testDate" into a date type by using the as.Date() function
#default format in R is yyyy-mm-dd
myformat <- "%m/%d/%y"
leadership$testDate <- as.Date(leadership$testDate, myformat)

managerIDtestDatecountrygenderageitem1item2item3item4item5agecat
1 2008-10-24US M 32 5 4 5 5 5 Young
2 2008-10-28US F 45 3 5 2 5 5 Young
3 2008-10-01UK F 25 3 5 5 5 2 Young
4 2008-10-12UK M 39 3 3 4 NA NA Young
5 2009-05-01UK F NA 2 2 1 2 1 NA
In [39]:
#when R stores dates internally, they're represented as number of days since 1 Jan 1970 with negative values for earlier dates
#so you can perform arithmetic operations on them
startDate <- as.Date("2004-02-13")
endDate <- as.Date("2011-01-22")
days <- endDate - startDate
days

Time difference of 2535 days
In [40]:
#you can use the difftime() function to calculate a time interval
#for example if you want to calculate how old someone born on 29 May 2015 is, you could do it as follows:
today <- Sys.Date()
dob <- as.Date("2015-05-29")
difftime(today, dob, units = "weeks")

Time difference of 116.4286 weeks
In [41]:
#you can convert date variables into character variables using the as.character() function
#this conversion allows you to apply a range of character functions to the data values
#(subsetting, replacement, concatenation, etc)


## Type conversions¶

In [42]:
#example of converting one data type into another
a <- c(1, 2, 3)
is.numeric(a)

TRUE
In [43]:
is.vector(a)

TRUE
In [47]:
b <- as.character(a)
b

1. '1'
2. '2'
3. '3'
In [48]:
is.numeric(b)

FALSE
In [49]:
is.vector(b)

TRUE
In [50]:
is.character(b)

TRUE

## Sorting data¶

In [51]:
#the order() function lets you sort a data frame in R.  The default sorting order is ascending
#prepend the sorting variable with a minus sign to indicate descending order
#example:
newdata <- leadership[order(leadership$age),] #creates a new dataset sorted from youngest manager to oldest manager newdata  managerIDtestDatecountrygenderageitem1item2item3item4item5agecat 33 2008-10-01UK F 25 3 5 5 5 2 Young 11 2008-10-24US M 32 5 4 5 5 5 Young 44 2008-10-12UK M 39 3 3 4 NA NA Young 22 2008-10-28US F 45 3 5 2 5 5 Young 55 2009-05-01UK F NA 2 2 1 2 1 NA In [52]: #another example attach(leadership) newdata <- leadership[order(gender, age),] #creates a new dataset sorting rows into female, followed by male #and youngest to oldest within each gender detach(leadership) newdata  The following objects are masked _by_ .GlobalEnv: age, country, gender  managerIDtestDatecountrygenderageitem1item2item3item4item5agecat 33 2008-10-01UK F 25 3 5 5 5 2 Young 22 2008-10-28US F 45 3 5 2 5 5 Young 55 2009-05-01UK F NA 2 2 1 2 1 NA 11 2008-10-24US M 32 5 4 5 5 5 Young 44 2008-10-12UK M 39 3 3 4 NA NA Young In [53]: #another example: attach(leadership) newdata <- leadership[order(gender, -age),] #creates a new dataset sorting rows into female, followed by male #and oldest to youngest within each gender detach(leadership) newdata  The following objects are masked _by_ .GlobalEnv: age, country, gender  managerIDtestDatecountrygenderageitem1item2item3item4item5agecat 55 2009-05-01UK F NA 2 2 1 2 1 NA 22 2008-10-28US F 45 3 5 2 5 5 Young 33 2008-10-01UK F 25 3 5 5 5 2 Young 44 2008-10-12UK M 39 3 3 4 NA NA Young 11 2008-10-24US M 32 5 4 5 5 5 Young ## Merging datasets¶ In [55]: #if your data exists in different locations, you will need to combine it  ### Adding columns to a data frame¶ In [54]: #adding columns to a data frame: #to merge two data frames horizontally, you use the merge() function #syntax: total <- merge(dataFrameA, dataFrameB, by = "ID") #by = the common key variable  ### Horizontal concatenation¶ In [56]: #another way to merge two matrices or data frames where you don't need to specify a common key, #is by using the cbind() function. syntax: #total <- cbind(a, b) #cbind() concatenates objects a and b. each object must have the same number of rows and be sorted in the same order  ### Adding rows to a data frame¶ In [57]: #use the rbind() function to add rows to a data frame. syntax: #total <- rbind(a, b) #a and b must have the same number of variables but they don't need to be in the same order #if a has more variables than b, then either 1) delete the extra variables or 2) create the same variable in b and assign them to NAs  ## Subsetting datasets¶ In [58]: #R has powerful indexing features to select and exclude variables, observations or both  ### Selecting (keeping) variables¶ In [60]: #selecting using [row indices, column indices]. example: newdata <- leadership[, c(6:10)] #selects all rows and columns 6 through 10 of the leadership data frame newdata  item1item2item3item4item5 5 4 5 5 5 3 5 2 5 5 3 5 5 5 2 3 3 4 NANA 2 2 1 2 1 In [62]: #another example myvars <- c("item1", "item2", "item3", "item4", "item5") newdata <- leadership[myvars] newdata  item1item2item3item4item5 5 4 5 5 5 3 5 2 5 5 3 5 5 5 2 3 3 4 NANA 2 2 1 2 1 ### Excluding (dropping) variables¶ In [63]: #there are many reasons to exclude variables. for example, if a variable has many missing values, #you may want to drop it from further analysis  In [64]: #here is an example: lets say you want to drop item4 and item5 from the leadership data frame. #here is one way you can do this: myvars <- names(leadership) %in% c("item4", "item5") #names(leadership) produces a character vector with names of all variables #names(leadership) %in% c("item4", "item5") returns a logical vector with TRUE for each element in #names(leadership) that matches either "item4" or "item5" newdata <- leadership[!myvars] #the ! operator reverses the logical values newdata  managerIDtestDatecountrygenderageitem1item2item3agecat 1 2008-10-24US M 32 5 4 5 Young 2 2008-10-28US F 45 3 5 2 Young 3 2008-10-01UK F 25 3 5 5 Young 4 2008-10-12UK M 39 3 3 4 Young 5 2009-05-01UK F NA 2 2 1 NA In [65]: #another way of dropping variables #since item4 is column 9 and item 5 is column 10, you can also do: newdata <- leadership[c(-9, -10)] newdata  managerIDtestDatecountrygenderageitem1item2item3agecat 1 2008-10-24US M 32 5 4 5 Young 2 2008-10-28US F 45 3 5 2 Young 3 2008-10-01UK F 25 3 5 5 Young 4 2008-10-12UK M 39 3 3 4 Young 5 2009-05-01UK F NA 2 2 1 NA ### Selecting observations¶ In [66]: #selecting or excluding observations is a key aspect of data preparation #here are some examples: newdata <- leadership[1:3,] #selects the first three rows of data newdata  managerIDtestDatecountrygenderageitem1item2item3item4item5agecat 1 2008-10-24US M 32 5 4 5 5 5 Young 2 2008-10-28US F 45 3 5 2 5 5 Young 3 2008-10-01UK F 25 3 5 5 5 2 Young In [67]: newdata <- leadership[leadership$gender == "M" & #selects all men aged > 30
leadership$age > 30,] newdata  managerIDtestDatecountrygenderageitem1item2item3item4item5agecat 11 2008-10-24US M 32 5 4 5 5 5 Young 44 2008-10-12UK M 39 3 3 4 NA NA Young In [68]: #another way of selecting men aged > 30 by attaching the leadership data frame attach(leadership) newdata <- leadership[gender == "M" & age > 30,] detach(leadership) newdata  The following objects are masked _by_ .GlobalEnv: age, country, gender  managerIDtestDatecountrygenderageitem1item2item3item4item5agecat 11 2008-10-24US M 32 5 4 5 5 5 Young 44 2008-10-12UK M 39 3 3 4 NA NA Young In [69]: #limit analysis to observations collected between 01/01/2009 and 12/31/2009 newdata <- leadership[leadership$testDate >= "2009-01-01" & leadership\$testDate <= "2009-12-31",]
newdata

managerIDtestDatecountrygenderageitem1item2item3item4item5agecat
55 2009-05-01UK F NA 2 2 1 2 1 NA

### The subset() function¶

In [71]:
#the subset() function is a really powerful function to select or exclude variables and observations
#examples:
#this statement selects all observations where the age is either 35 or over OR under 24 and returns 4 columns (items 1 through 4)
newdata <- subset(leadership, age >= 35 | age < 24, select = c(item1, item2, item3, item4))
newdata

item1item2item3item4
23 5 2 5
43 3 4 NA
In [73]:
#another example
#this statement selects all men > 25 years old and all columns from gender to item 5 (and everything in between)
newdata <- subset(leadership, gender == "M" & age > 25, select = gender:item5)
newdata

genderageitem1item2item3item4item5
1M 325 4 5 5 5
4M 393 3 4 NANA

### Random samples¶

In [81]:
#sampling from larger datasets is a common practice in data mining and machine learning
#for example, you may want to create a predictive model from one sample and validate its effectiveness on the other
#the sample() function lets you take a random sample (with or without replacement) from size n of a dataset
#R has extensive facilities for sampling (see the "sampling" package)


## Using SQL statements to manipulate data frames¶

In [89]:
#using the "sqldf" package, you can use SQL statements within R
#experienced SQl users will find the sqldf package a useful addition to the R data preparation toolkit