In [1]:

```
#@author: Venky Rao raove@us.ibm.com
#@last edited: 4 Sep 2017
#@source: materials, data and examples adapted from R in Action 2nd Edition by Dr. Robert Kabacoff
```

In [3]:

```
#a bar plot displays the distribution (frequency) of a categorical variable through vertical or horizontal bars
# we will use the Arthritis data frame distributed with the vcd package
install.packages("vcd")
```

In [4]:

```
#load the vcd library
library(vcd)
```

In [5]:

```
counts <- table(Arthritis$Improved) #the variable Improved in the Arthritis dataset records
#patient outcomes for individuals receiving a placebo or drug
counts #display counts
```

In [6]:

```
#simple vertical bar plot
barplot(counts, #dataset = counts
main = "Simple Bar Plot", #main = title of the plot
xlab = "Improvement", #xlab = x-axis label
ylab = "Frequency") #ylab = y-axis label
```
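
barplot() invisibly returns the x-coordinates of the bar midpoints, which can be used to print each count above its bar. The following is a minimal sketch on the built-in mtcars data rather than Arthritis, so that it runs without vcd; the names cyl_counts and mids are illustrative only.

```r
#label each bar with its count (illustrative example on built-in data)
cyl_counts <- table(mtcars$cyl) #frequency of each cylinder count
mids <- barplot(cyl_counts, #barplot() returns the bar midpoints
main = "Cars by Cylinder Count",
xlab = "Cylinders",
ylab = "Frequency",
ylim = c(0, max(cyl_counts) + 2)) #leave headroom for the labels
text(x = mids, y = as.vector(cyl_counts),
labels = as.vector(cyl_counts), pos = 3) #pos = 3 places the text above each point
```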

In [7]:

```
#simple horizontal bar plot
barplot(counts, #dataset = counts
main = "Horizontal Bar Plot", #main = title of the plot
xlab = "Frequency", #xlab = x-axis label (counts run along the x-axis when horiz = TRUE)
ylab = "Improvement", #ylab = y-axis label
horiz = TRUE) #option for horizontal plot
```

In [8]:

```
#consider the cross tabulation of treatment type and improvement status
counts <- table(Arthritis$Improved, Arthritis$Treatment)
counts
```

In [10]:

```
#stacked bar plot
barplot(counts, #dataset = counts
main = "Stacked Bar Plot", #main = title of the plot
xlab = "Treatment", #xlab = x-axis label
ylab = "Frequency", #ylab = y-axis label
col = c("red", "yellow", "green"), #colors of the stacked portions
legend = rownames(counts)) #rownames are improvement status
```

In [11]:

```
#grouped bar plot
barplot(counts, #dataset = counts
main = "Grouped Bar Plot", #main = title of the plot
xlab = "Treatment", #xlab = x-axis label
ylab = "Frequency", #ylab = y-axis label
col = c("red", "yellow", "green"), #colors of the grouped bars
legend = rownames(counts), #rownames are improvement status
beside = TRUE) #columns of the matrix are drawn side by side rather than stacked
```
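
When the groups differ in size, raw counts can be hard to compare across columns. One option is to plot within-column proportions using prop.table(). This is a sketch on the built-in mtcars data (gears within cylinder groups), not on Arthritis; gear_by_cyl and props are illustrative names.

```r
#grouped bar plot of within-group proportions (illustrative example on built-in data)
gear_by_cyl <- table(mtcars$gear, mtcars$cyl) #rows = gears, columns = cylinders
props <- prop.table(gear_by_cyl, margin = 2) #margin = 2: each column sums to 1
barplot(props,
main = "Gears within Each Cylinder Group",
xlab = "Cylinders",
ylab = "Proportion",
col = c("red", "yellow", "green"),
legend = rownames(props),
beside = TRUE)
```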

In [18]:

```
#bar plot for sorted mean values
states <- data.frame(state.region, state.x77) #create a data frame by adding the state region to the state.x77 dataset
states #display the data frame
```

In [19]:

```
means <- aggregate(states$Illiteracy, by = list(state.region), FUN = mean) #aggregate by mean illiteracy levels of states and group by regions
means #display means
```

In [20]:

```
means <- means[order(means$x),] #sort by means, smallest to largest
means #display the means object
```
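
The same regional means can also be computed with the formula interface of aggregate(), which keeps the original variable names instead of Group.1 and x. A minimal sketch, where means2 is an illustrative name:

```r
#same regional means using the formula interface of aggregate()
states <- data.frame(state.region, state.x77) #same data frame as above
means2 <- aggregate(Illiteracy ~ state.region, data = states, FUN = mean)
means2 <- means2[order(means2$Illiteracy), ] #sort by means, smallest to largest
means2 #columns are named state.region and Illiteracy
```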

In [21]:

```
barplot(means$x, #data for the plot is the column x of the means data frame
names.arg = means$Group.1) #names of the bars
title("Mean Illiteracy Rate") #adds a title to the bar plot
```

In [4]:

```
#fitting labels in a bar plot
par(mar = c(5, 8, 4, 2)) #mar is a numerical vector indicating margin size,
#where c(bottom, left, top, right) is expressed in lines.
#the default is c(5, 4, 4, 2) + 0.1
par(las = 2) #las: a numeric value indicating the orientation of the tick mark labels
#and any other text added to a plot after its initialization. The options are
#as follows: always parallel to the axis (the default, 0), always horizontal (1),
#always perpendicular to the axis (2), and always vertical (3)
library(vcd) #load the vcd package library for the Arthritis dataset
counts <- table(Arthritis$Improved) #store the table results in the counts object
barplot(counts, #counts = data
main = "Treatment Outcome", #title of the graph
horiz = T, #horizontal orientation of the bars
cex.names = 0.8, #size of the plotted text; 0.8x of normal size
names.arg = c("No Improvement", "Some Improvement",
"Marked Improvement")) #names of the plotted text
```
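
Changes made with par() stay in effect for later plots on the same device. A common pattern is to save the current settings first and restore them afterwards. The sketch below uses the built-in mtcars data so that it runs without vcd; opar is an illustrative name.

```r
#save the graphical parameters, change them for one plot, then restore them
opar <- par(no.readonly = TRUE) #copy of the current settable parameters
par(mar = c(5, 8, 4, 2), las = 2) #wider left margin, perpendicular labels
barplot(table(mtcars$cyl), horiz = TRUE, main = "Temporary Settings")
par(opar) #restore the saved settings
```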

In [2]:

```
#a spinogram is a specialized type of bar plot. In a spinogram, a stacked bar plot
#is rescaled so that the height of each bar is 1 and the segment heights represent proportions.
#spinograms are created using the spine() function of the vcd package, as follows:
library(vcd) #load the vcd package
attach(Arthritis) #attach the Arthritis dataset
counts <- table(Treatment, Improved) #create a contingency table and store it in the counts object
spine(counts, main = "Spinogram Example") #create the spinogram
detach(Arthritis) #detach the Arthritis dataset
```

In [4]:

```
#pie charts are created by the function pie(x, labels) where:
#x = non-negative numeric vector indicating the area of each slice; and
#labels = character vector of slice labels. here is an example:
par(mfrow = c(2,2)) #creates a matrix of 2 rows x 2 cols filled in by row
slices <- c(10, 12, 4, 16, 8) #vector of areas
lbls <- c("US", "UK", "Australia", "Germany", "France") #vector of labels
pie(slices, labels = lbls, main = "Simple Pie Chart") #create pie chart
```

In [6]:

```
#adds percentages to the pie chart:
par(mfrow = c(2,2)) #creates a matrix of 2 rows x 2 cols filled in by row
#chart 1
slices <- c(10, 12, 4, 16, 8) #vector of areas
lbls <- c("US", "UK", "Australia", "Germany", "France") #vector of labels
pie(slices, labels = lbls, main = "Simple Pie Chart") #create pie chart
#chart 2
pct <- round(slices/sum(slices) * 100) #calculate percentages
lbls2 <- paste(lbls, " ", pct, "%", sep = "") #add percentages to the labels
pie(slices, labels = lbls2, col = rainbow(length(lbls2)), main = "Pie Chart with Percentages") #create pie chart
```
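
It can help to print the percentage labels before passing them to pie(). The sketch below builds the same labels with paste0(), which joins its arguments with no separator; pct_lbls is an illustrative name. Note that rounded percentages need not add up to exactly 100 in general, although they do for these values.

```r
#build percentage labels and inspect them before plotting
slices <- c(10, 12, 4, 16, 8) #vector of areas
lbls <- c("US", "UK", "Australia", "Germany", "France") #vector of labels
pct <- round(slices / sum(slices) * 100) #20 24 8 32 16
pct_lbls <- paste0(lbls, " ", pct, "%") #paste0() joins with no separator
pct_lbls #display the labels
```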

In [8]:

```
#install the plotrix package for creating 3D pie charts
install.packages("plotrix")
```

In [9]:

```
#add a 3D pie chart:
par(mfrow = c(2,2)) #creates a matrix of 2 rows x 2 cols filled in by row
#chart 1
slices <- c(10, 12, 4, 16, 8) #vector of areas
lbls <- c("US", "UK", "Australia", "Germany", "France") #vector of labels
pie(slices, labels = lbls, main = "Simple Pie Chart") #create pie chart
#chart 2
pct <- round(slices/sum(slices) * 100) #calculate percentages
lbls2 <- paste(lbls, " ", pct, "%", sep = "") #add percentages to the labels
pie(slices, labels = lbls2, col = rainbow(length(lbls2)), main = "Pie Chart with Percentages") #create pie chart
#chart 3
library(plotrix) #load the plotrix package
pie3D(slices, labels = lbls, explode = 0.1, main = "3D Pie Chart")
```

In [10]:

```
#add a pie chart created from a table:
par(mfrow = c(2,2)) #creates a matrix of 2 rows x 2 cols filled in by row
#chart 1
slices <- c(10, 12, 4, 16, 8) #vector of areas
lbls <- c("US", "UK", "Australia", "Germany", "France") #vector of labels
pie(slices, labels = lbls, main = "Simple Pie Chart") #create pie chart
#chart 2
pct <- round(slices/sum(slices) * 100) #calculate percentages
lbls2 <- paste(lbls, " ", pct, "%", sep = "") #add percentages to the labels
pie(slices, labels = lbls2, col = rainbow(length(lbls2)), main = "Pie Chart with Percentages") #create pie chart
#chart 3
library(plotrix) #load the plotrix package
pie3D(slices, labels = lbls, explode = 0.1, main = "3D Pie Chart")
#chart 4
mytable <- table(state.region) #creates a table from state.region, the factor giving the region of each US state
lbls3 <- paste(names(mytable), "\n", mytable, sep = "") #creates the labels vector
pie(mytable, labels = lbls3, main = "Pie Chart from a Table\n (with sample sizes)") #create pie chart
```

In [11]:

```
#fan plots are created using the fan.plot() function of the plotrix package
library(plotrix) #load the plotrix package
slices <- c(10, 12, 4, 16, 8) #vector of sizes
lbls <- c("US", "UK", "Australia", "Germany", "France") #vector of labels
fan.plot(slices, labels = lbls, main = "Fan Plot")
```

In [12]:

```
#histograms display the distribution of a continuous variable (unlike bar plots and pie charts which display categorical variables)
#by dividing the range of scores into a specified number of bins on the x-axis and displaying the frequency of scores in each bin
#on the y-axis. You create histograms with the function hist(x), where x is a numeric vector of values
#the option "freq = FALSE" creates a plot based on probability densities rather than frequencies
#the "breaks" option controls the number of bins. The default produces equally spaced breaks when defining the cells of the histogram
```
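
When breaks is a single number, hist() treats it as a suggestion and chooses "pretty" bin boundaries, so the number of bins drawn may differ from the number requested. The bins can be inspected without plotting by setting plot = FALSE, as in this minimal sketch:

```r
#inspect the bins hist() would draw, without plotting
h <- hist(mtcars$mpg, breaks = 12, plot = FALSE)
h$breaks #bin boundaries actually used
length(h$counts) #number of bins; may differ from the 12 requested
sum(h$counts) #every observation falls in exactly one bin
sum(h$density * diff(h$breaks)) #the densities integrate to 1
```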

In [22]:

```
#here is an example of four variations of a histogram (chart 1 only):
par(mfrow = c(2, 2)) #creates a matrix of 2 rows x 2 cols filled in by row
#chart 1
hist(mtcars$mpg) #simple histogram
```

In [23]:

```
#here is an example of four variations of a histogram (charts 1 and 2):
par(mfrow = c(2, 2)) #creates a matrix of 2 rows x 2 cols filled in by row
#chart 1
hist(mtcars$mpg) #simple histogram
#chart 2
hist(mtcars$mpg, #data
breaks = 12, #number of breaks specified
col = "red", #color specified
xlab = "Miles Per Gallon", #x-axis label
main = "Colored histogram with 12 bins") #title of the graph
```

In [25]:

```
#here is an example of four variations of a histogram (charts 1, 2 and 3):
par(mfrow = c(2, 2)) #creates a matrix of 2 rows x 2 cols filled in by row
#chart 1
hist(mtcars$mpg) #simple histogram
#chart 2
hist(mtcars$mpg, #data
breaks = 12, #number of breaks specified
col = "red", #color specified
xlab = "Miles Per Gallon", #x-axis label
main = "Colored histogram with 12 bins") #title of the graph
#chart 3
hist(mtcars$mpg, #data
freq = FALSE, #"freq = FALSE" creates a plot based on probability densities rather than frequencies
breaks = 12, #number of breaks specified
col = "red", #color specified
xlab = "Miles Per Gallon", #x-axis label
main = "Histogram, rug plot, density curve") #title of the graph
rug(jitter(mtcars$mpg)) #rug() creates a set of tick marks along the base of a plot;
#jitter() adds a small amount of random noise to each value so that tied values do not overplot.
#counterintuitively, adding random noise to a plot can sometimes make it easier to read.
#Jittering is particularly useful for small datasets with at least one discrete position.
lines(density(mtcars$mpg), col = "blue", lwd = 2) #adds a density curve to an existing plot
```

In [27]:

```
#here is an example of four variations of a histogram (charts 1, 2, 3 and 4):
par(mfrow = c(2, 2)) #creates a matrix of 2 rows x 2 cols filled in by row
#chart 1
hist(mtcars$mpg) #simple histogram
#chart 2
hist(mtcars$mpg, #data
breaks = 12, #number of breaks specified
col = "red", #color specified
xlab = "Miles Per Gallon", #x-axis label
main = "Colored histogram with 12 bins") #title of the graph
#chart 3
hist(mtcars$mpg, #data
freq = FALSE, #"freq = FALSE" creates a plot based on probability densities rather than frequencies
breaks = 12, #number of breaks specified
col = "red", #color specified
xlab = "Miles Per Gallon", #x-axis label
main = "Histogram, rug plot, density curve") #title of the graph
rug(jitter(mtcars$mpg)) #rug() creates a set of tick marks along the base of a plot;
#jitter() adds a small amount of random noise to each value so that tied values do not overplot.
#counterintuitively, adding random noise to a plot can sometimes make it easier to read.
#Jittering is particularly useful for small datasets with at least one discrete position.
lines(density(mtcars$mpg), col = "blue", lwd = 2) #adds a kernel density curve to an existing plot
#chart 4
x <- mtcars$mpg
h <- hist(x, #data
breaks = 12, #number of breaks specified
col = "red", #color specified
xlab = "Miles Per Gallon", #x-axis label
main = "Histogram with normal curve and box") #title of the graph
#code for superimposing a normal curve (credit to Peter Dalgaard)
xfit <- seq(min(x), max(x), length = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
yfit <- yfit * diff(h$mids[1:2]) * length(x)
lines(xfit, yfit, col = "blue", lwd = 2)
#adds a box() around the graph
box()
```
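
In chart 4 the normal density is multiplied by the bin width and the number of observations so that it matches the frequency scale. An alternative, sketched here, is to draw the histogram on the density scale with freq = FALSE, in which case the normal curve needs no rescaling. The last lines check the scaling factor numerically; mpg and scale_factor are illustrative names.

```r
#density-scale alternative: no rescaling of the normal curve is needed
mpg <- mtcars$mpg
hist(mpg, freq = FALSE, breaks = 12, col = "red",
xlab = "Miles Per Gallon",
main = "Histogram with normal curve (density scale)")
curve(dnorm(x, mean = mean(mpg), sd = sd(mpg)), #x here is curve()'s plotting variable
add = TRUE, col = "blue", lwd = 2)
#check of the scaling factor used on the frequency scale
h <- hist(mpg, breaks = 12, plot = FALSE)
scale_factor <- diff(h$mids[1:2]) * length(mpg) #bin width * number of observations
all.equal(as.numeric(h$counts), h$density * scale_factor) #counts = density * bin width * n
```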

In [28]:

```
#kernel density estimation is a nonparametric method for estimating the probability density function
#of a random variable. the function is: plot(density(x)) where x = numeric vector.
#here is the code for creating 2 examples of kernel density plots:
par(mfrow = c(2, 1)) #creates a matrix of 2 rows x 1 col filled in by row
#chart 1
d <- density(mtcars$mpg) #stores the density in object d
plot(d) #creates a minimal graph with all defaults in place
#chart 2
d <- density(mtcars$mpg) #stores the density in object d
plot(d, main = "Kernel Density of Miles Per Gallon") #creates a minimal graph with a title
polygon(d, col = "red", border = "blue") #colors the curve blue and fills the area under the curve with red
rug(mtcars$mpg, col = "brown") #adds a brown rug, i.e. creates a set of tick marks along the base of a plot
```
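
The shape of a kernel density estimate depends on the bandwidth. density() picks a default bandwidth, and the adjust argument multiplies it. The following sketch overlays three settings; d1, d2 and d3 are illustrative names.

```r
#effect of the bandwidth on a kernel density estimate
d1 <- density(mtcars$mpg) #default bandwidth
d2 <- density(mtcars$mpg, adjust = 2) #twice the default bandwidth: smoother curve
d3 <- density(mtcars$mpg, adjust = 0.5) #half the default bandwidth: more detail, more noise
plot(d3, main = "Kernel Density with Three Bandwidths", col = "red")
lines(d1, col = "black")
lines(d2, col = "blue")
legend("topright", legend = c("adjust = 0.5", "adjust = 1", "adjust = 2"),
col = c("red", "black", "blue"), lty = 1)
c(d3$bw, d1$bw, d2$bw) #bandwidths actually used
```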

In [29]:

```
#kernel density plots can be used to compare groups. to do this, you need the "sm" package
install.packages("sm")
```

In [35]:

```
#comparative kernel density plots
library(sm) #load the sm package
attach(mtcars) #attach the mtcars dataset
#creates a grouping factor
cyl.f <- factor(cyl, levels = c(4, 6, 8), #data = cyl from the mtcars dataset; levels = vector of the 3 factor levels
labels = c("4 cylinder", "6 cylinder", "8 cylinder")) #labels = factor labels
#creates the comparative kd plots
sm.density.compare(mpg, cyl, xlab = "Miles per Gallon")
#adds a title to the chart
title(main = "MPG Distribution by Car Cylinders")
#adds a legend
colfill <- c(2:(1 + length(levels(cyl.f))))
legend("topright", inset = 0.05, levels(cyl.f), fill = colfill)
#detach the mtcars dataset
detach(mtcars)
```
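
If the sm package is not available, a similar comparison can be sketched with base R by estimating one density per group and overlaying the curves. This is an approximation under stated assumptions: each group gets its own default bandwidth from density(), so the curves may not match sm.density.compare() exactly. The names groups and dens are illustrative.

```r
#comparative density plots with base R only
groups <- split(mtcars$mpg, mtcars$cyl) #list of mpg vectors, one per cylinder count
dens <- lapply(groups, density) #one density estimate per group
plot(NULL, xlim = range(sapply(dens, function(d) range(d$x))),
ylim = c(0, max(sapply(dens, function(d) max(d$y)))),
xlab = "Miles Per Gallon", ylab = "Density",
main = "MPG Distribution by Car Cylinders")
for (i in seq_along(dens)) lines(dens[[i]], col = i + 1) #colors 2, 3 and 4
legend("topright", inset = 0.05, legend = paste(names(dens), "cylinder"),
fill = seq_along(dens) + 1)
```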

In [37]:

```
#a box-and-whiskers plot describes the distribution of a continuous variable by plotting
#its five-number summary: min, lower quartile (25th percentile), median (50th percentile),
#upper quartile (75th percentile) and the max.
#it can also display outliers, i.e. values that lie more than 1.5 * IQR below the lower quartile
#or above the upper quartile, where IQR (interquartile range) = upper quartile - lower quartile. here is an example:
#create the box plot:
boxplot(mtcars$mpg, main = "Box Plot", ylab = "Miles Per Gallon")
#print relevant stats
boxplot.stats(mtcars$mpg)
```
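
The outlier rule can be reproduced by hand. boxplot.stats() uses the hinges returned by fivenum() as the quartiles, which can differ slightly from quantile() for some sample sizes. The sketch below recomputes the fences, whisker ends and outliers; all variable names are illustrative.

```r
#recompute the whiskers and outliers reported by boxplot.stats()
x <- mtcars$mpg
hinges <- fivenum(x) #min, lower hinge, median, upper hinge, max
spread <- hinges[4] - hinges[2] #distance between the hinges (the IQR used by the box plot)
lower_fence <- hinges[2] - 1.5 * spread
upper_fence <- hinges[4] + 1.5 * spread
outliers <- x[x < lower_fence | x > upper_fence] #points plotted individually
whiskers <- range(x[x >= lower_fence & x <= upper_fence]) #most extreme non-outlying values
stats <- boxplot.stats(x)
all.equal(stats$stats[c(1, 5)], whiskers) #whisker ends agree
setequal(stats$out, outliers) #outliers agree
```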

In [38]:

```
#boxplots can be created for individual variables or for variables by group
#the format is boxplot(formula, data = data frame)
#where formula is a formula and data denotes a data frame (or list) providing the data
#example of a formula: y ~ A, where a separate box plot for numeric variable y is generated
#for each value of categorical variable A.
#the formula y ~ A*B would produce a boxplot of numeric variable y for each combination of levels
#in categorical variables A and B
#adding option "varwidth = TRUE" makes the box plot widths proportional to the square root of their sample sizes
#add "horizontal = TRUE" to reverse axis orientation
```
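
The options described above can be tried on mtcars. This is a minimal sketch: the first call combines varwidth and horizontal, and the second uses a two-factor formula with am (transmission type) and cyl. The name b is illustrative.

```r
#box plots by group with widths reflecting sample size
b <- boxplot(mpg ~ cyl, data = mtcars,
varwidth = TRUE, #box widths proportional to the square root of each group's size
horizontal = TRUE, #boxes run left to right
main = "Car Mileage Data",
xlab = "Miles Per Gallon", ylab = "Number of cylinders")
b$n #group sizes behind the box widths
#one box per combination of transmission type (am) and cylinder count (cyl)
boxplot(mpg ~ am * cyl, data = mtcars,
main = "MPG by Transmission and Cylinders",
xlab = "Transmission and cylinders", ylab = "Miles Per Gallon")
```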

In [39]:

```
#the following code revisits the impact of four, six and eight cylinders on auto mpg with parallel box plots
boxplot(mpg ~ cyl, data = mtcars, #formula, data
main = "Car Mileage Data", #title
xlab = "Number of cylinders", ylab = "Miles Per Gallon") #x- and y-axis labels
```