In [1]:
#@author: Venky Rao raove@us.ibm.com
#@last edited: 4 Sep 2017
#@source: materials, data and examples adapted from R in Action 2nd Edition by Dr. Robert Kabacoff

Basic graphs in R

Bar plots

In [3]:
#a bar plot displays the distribution (frequency) of a categorical variable through vertical or horizontal bars
# we will use the Arthritis data frame distributed with the vcd package
install.packages("vcd")
Installing package into ‘/gpfs/global_fs01/sym_shared/YPProdSpark/user/s17c-9f3318fc11f06c-d37a4b9405b6/R/libs’
(as ‘lib’ is unspecified)
In [4]:
#load the vcd library
library(vcd)
Loading required package: grid

Attaching package: ‘grid’

The following object is masked from ‘package:SparkR’:

    explode

Simple bar plots

In [5]:
counts <- table(Arthritis$Improved) #the variable Improved in the Arthritis dataset records
                                    #patient outcomes for individuals receiving a placebo or drug
counts #display counts
  None   Some Marked 
    42     14     28 
In [6]:
#simple vertical bar plot
barplot(counts, #dataset = counts
       main = "Simple Bar Plot", #main = title of the plot
       xlab = "Improvement", #xlab = x-axis label
       ylab = "Frequency") #ylab = y-axis label
In [7]:
#simple horizontal bar plot
barplot(counts, #dataset = counts
       main = "Horizontal Bar Plot", #main = title of the plot
       xlab = "Frequency", #xlab = x-axis label (with horiz = T the counts run along the x-axis)
       ylab = "Improvement", #ylab = y-axis label (the categories)
       horiz = T) #option for horizontal plot

Stacked and grouped bar plots

In [8]:
#consider the cross tabulation of treatment type and improvement status
counts <- table(Arthritis$Improved, Arthritis$Treatment)
counts
        
         Placebo Treated
  None        29      13
  Some         7       7
  Marked       7      21
In [10]:
#stacked bar plot
barplot(counts, #dataset = counts
       main = "Stacked Bar Plot", #main = title of the plot
       xlab = "Treatment", #xlab = x-axis label
       ylab = "Frequency", #ylab = y-axis label
       col = c("red", "yellow", "green"), #colors of the stacked portions
        legend = rownames(counts)) #rownames are improvement status
In [11]:
#grouped bar plot
barplot(counts, #dataset = counts
       main = "Grouped Bar Plot", #main = title of the plot
       xlab = "Treatment", #xlab = x-axis label
       ylab = "Frequency", #ylab = y-axis label
       col = c("red", "yellow", "green"), #colors of the bars within each group
       legend = rownames(counts), #rownames are improvement status
       beside = T) #the values in each column are drawn side by side rather than stacked

Mean bar plots

In [18]:
#bar plot for sorted mean values
states <- data.frame(state.region, state.x77) #create a data frame by adding the state region to the state.x77 dataset
states #display the data frame
                state.region Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
Alabama                South       3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska                  West        365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona                 West       2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas               South       2110   3378        1.9    70.66   10.1    39.9    65  51945
California              West      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado                West       2541   4884        0.7    72.06    6.8    63.9   166 103766
Connecticut        Northeast       3100   5348        1.1    72.48    3.1    56.0   139   4862
Delaware               South        579   4809        0.9    70.06    6.2    54.6   103   1982
Florida                South       8277   4815        1.3    70.66   10.7    52.6    11  54090
Georgia                South       4931   4091        2.0    68.54   13.9    40.6    60  58073
Hawaii                  West        868   4963        1.9    73.60    6.2    61.9     0   6425
Idaho                   West        813   4119        0.6    71.87    5.3    59.5   126  82677
Illinois       North Central      11197   5107        0.9    70.14   10.3    52.6   127  55748
Indiana        North Central       5313   4458        0.7    70.88    7.1    52.9   122  36097
Iowa           North Central       2861   4628        0.5    72.56    2.3    59.0   140  55941
Kansas         North Central       2280   4669        0.6    72.58    4.5    59.9   114  81787
Kentucky               South       3387   3712        1.6    70.10   10.6    38.5    95  39650
Louisiana              South       3806   3545        2.8    68.76   13.2    42.2    12  44930
Maine              Northeast       1058   3694        0.7    70.39    2.7    54.7   161  30920
Maryland               South       4122   5299        0.9    70.22    8.5    52.3   101   9891
Massachusetts      Northeast       5814   4755        1.1    71.83    3.3    58.5   103   7826
Michigan       North Central       9111   4751        0.9    70.63   11.1    52.8   125  56817
Minnesota      North Central       3921   4675        0.6    72.96    2.3    57.6   160  79289
Mississippi            South       2341   3098        2.4    68.09   12.5    41.0    50  47296
Missouri       North Central       4767   4254        0.8    70.69    9.3    48.8   108  68995
Montana                 West        746   4347        0.6    70.56    5.0    59.2   155 145587
Nebraska       North Central       1544   4508        0.6    72.60    2.9    59.3   139  76483
Nevada                  West        590   5149        0.5    69.03   11.5    65.2   188 109889
New Hampshire      Northeast        812   4281        0.7    71.23    3.3    57.6   174   9027
New Jersey         Northeast       7333   5237        1.1    70.93    5.2    52.5   115   7521
New Mexico              West       1144   3601        2.2    70.32    9.7    55.2   120 121412
New York           Northeast      18076   4903        1.4    70.55   10.9    52.7    82  47831
North Carolina         South       5441   3875        1.8    69.21   11.1    38.5    80  48798
North Dakota   North Central        637   5087        0.8    72.78    1.4    50.3   186  69273
Ohio           North Central      10735   4561        0.8    70.82    7.4    53.2   124  40975
Oklahoma               South       2715   3983        1.1    71.42    6.4    51.6    82  68782
Oregon                  West       2284   4660        0.6    72.13    4.2    60.0    44  96184
Pennsylvania       Northeast      11860   4449        1.0    70.43    6.1    50.2   126  44966
Rhode Island       Northeast        931   4558        1.3    71.90    2.4    46.4   127   1049
South Carolina         South       2816   3635        2.3    67.96   11.6    37.8    65  30225
South Dakota   North Central        681   4167        0.5    72.08    1.7    53.3   172  75955
Tennessee              South       4173   3821        1.7    70.11   11.0    41.8    70  41328
Texas                  South      12237   4188        2.2    70.90   12.2    47.4    35 262134
Utah                    West       1203   4022        0.6    72.90    4.5    67.3   137  82096
Vermont            Northeast        472   3907        0.6    71.64    5.5    57.1   168   9267
Virginia               South       4981   4701        1.4    70.08    9.5    47.8    85  39780
Washington              West       3559   4864        0.6    71.72    4.3    63.5    32  66570
West Virginia          South       1799   3617        1.4    69.48    6.7    41.6   100  24070
Wisconsin      North Central       4589   4468        0.7    72.48    3.0    54.5   149  54464
Wyoming                 West        376   4566        0.6    70.29    6.9    62.9   173  97203
In [19]:
means <- aggregate(states$Illiteracy, by = list(state.region), FUN = mean) #aggregate by mean illiteracy levels of states and group by regions
means #display means
      Group.1        x
    Northeast 1.000000
        South 1.737500
North Central 0.700000
         West 1.023077
In [20]:
means <- means[order(means$x),] #sort by means, smallest to largest
means #display the means object
        Group.1        x
3 North Central 0.700000
1     Northeast 1.000000
4          West 1.023077
2         South 1.737500
In [21]:
barplot(means$x, #data for the plot is the column x of the means data frame
       names.arg = means$Group.1) #names of the bars
title("Mean Illiteracy Rate") #adds a title to the bar plot
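The aggregate-then-sort steps above can also be sketched with split() and sapply(), which produce a named numeric vector that barplot() labels on its own. This is an alternative shown for illustration, not the approach taken in the source material; the names illiteracy_by_region and region_means are introduced here and are not part of the original notebook.

```r
# Alternative sketch: mean illiteracy by region as a named vector.
# state.x77 and state.region ship with base R (datasets package).
illiteracy_by_region <- split(state.x77[, "Illiteracy"], state.region)
region_means <- sort(sapply(illiteracy_by_region, mean))  # smallest to largest

print(region_means)

# barplot() uses the vector's names as bar labels, so names.arg is not needed
barplot(region_means, main = "Mean Illiteracy Rate")
```

The ordering should match the sorted means data frame printed above.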

Tweaking bar plots

In [4]:
#fitting labels in a bar plot
par(mar = c(5, 8, 4, 2)) #mar is a numerical vector indicating margin size,
                         #where c(bottom, left, top, right) is expressed in lines.
                         #the default is c(5, 4, 4, 2) + 0.1
par(las = 2) #las – A numeric value indicating the orientation of the tick mark labels
             #and any other text added to a plot after its initialization. The options are
             #as follows: always parallel to the axis (the default, 0), always horizontal (1),
             #always perpendicular to the axis (2), and always vertical (3)
library(vcd) #load the vcd package library for the Arthritis dataset
counts <- table(Arthritis$Improved) #store the table results in the counts object
barplot(counts, #counts = data
       main = "Treatment Outcome", #title of the graph
       horiz = T, #horizontal orientation of the bars
       cex.names = 0.8, #size of the plotted text; 0.8x of normal size
       names.arg = c("No Improvement", "Some Improvement",
                    "Marked Improvement")) #names of the plotted text

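Settings made with par() stay in effect for later plots on the same device. A minimal sketch of one way to limit the change to a single plot is to save the value that par() returns and restore it afterwards. The counts are copied from the table output shown earlier so that the sketch runs without vcd; opar is a name chosen here for illustration.

```r
# counts copied from the table(Arthritis$Improved) output shown earlier,
# so this sketch runs without the vcd package
counts <- c(None = 42, Some = 14, Marked = 28)

# par() invisibly returns the previous values of the settings it changes
opar <- par(mar = c(5, 8, 4, 2), las = 2)

barplot(counts,
        main = "Treatment Outcome",
        horiz = TRUE,
        cex.names = 0.8,
        names.arg = c("No Improvement", "Some Improvement",
                      "Marked Improvement"))

par(opar)  # restore the earlier margins and label orientation
```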
Spinograms

In [2]:
#a spinogram is a specialized type of bar plot.  In a spinogram, a stacked bar plot
#is rescaled so that the height of each bar is 1 and the segment heights represent proportions.
#spinograms are created using the spine() function of the vcd package, as follows:
library(vcd) #load the vcd package
attach(Arthritis) #attach the arthritis dataset
counts <- table(Treatment, Improved) #create a contingency table and store it in the counts object
spine(counts, main = "Spinogram Example") #create the spinogram
detach(Arthritis) #detach the Arthritis dataset
Loading required package: grid

Attaching package: ‘grid’

The following object is masked from ‘package:SparkR’:

    explode
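The rescaling that a spinogram performs can be approximated in base R by converting the contingency table to row proportions and drawing an ordinary stacked bar plot. The sketch below hard-codes the counts from the cross-tabulation in the "Stacked and grouped bar plots" section so that it runs without vcd. It is an approximation only: as far as I know, spine() also varies the bar widths with group size, which this sketch does not attempt.

```r
# Cross-tabulation copied from the table() output in the
# "Stacked and grouped bar plots" section (rows = treatment, cols = outcome)
counts <- matrix(c(29, 7,  7,
                   13, 7, 21),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Treatment = c("Placebo", "Treated"),
                                 Improved  = c("None", "Some", "Marked")))

# Divide each row by its total so every treatment group sums to 1
props <- prop.table(counts, margin = 1)
print(round(props, 3))

# barplot() stacks the rows of its matrix argument, so transpose first
barplot(t(props),
        main = "Proportion of Outcomes by Treatment",
        xlab = "Treatment", ylab = "Proportion",
        col = c("red", "yellow", "green"),
        legend.text = colnames(props))
```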

Pie charts

In [4]:
#pie charts are created by the function pie(x, labels) where:
#x = non negative numeric vector indicating the area of each slice; and
#labels = character vector of slice labels.  here is an example:
par(mfrow = c(2,2)) #creates a matrix of 2 rows x 2 cols filled in by row
slices <- c(10, 12, 4, 16, 8) #vector of areas
lbls <- c("US", "UK", "Australia", "Germany", "France") #vector of labels
pie(slices, labels = lbls, main = "Simple Pie Chart") #create pie chart
In [6]:
#adds percentages to the pie chart:
par(mfrow = c(2,2)) #creates a matrix of 2 rows x 2 cols filled in by row
#chart 1
slices <- c(10, 12, 4, 16, 8) #vector of areas
lbls <- c("US", "UK", "Australia", "Germany", "France") #vector of labels
pie(slices, labels = lbls, main = "Simple Pie Chart") #create pie chart
#chart 2
pct <- round(slices/sum(slices) * 100) #calculate percentages
lbls2 <- paste(lbls, " ", pct, "%", sep = "") #add percentages to the labels
pie(slices, labels = lbls2, col = rainbow(length(lbls2)), main = "Pie Chart with Percentages") #create pie chart
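To see exactly what the label-building step produces, the percentages and labels can be printed before they are passed to pie(). This is a small sketch that uses paste0(), which joins its arguments with no separator; it is offered as an equivalent way to build the labels, not as the notebook's own code.

```r
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")

pct <- round(slices / sum(slices) * 100)  # share of the total, in percent
lbls2 <- paste0(lbls, " ", pct, "%")      # paste0() joins with no separator

print(lbls2)
```

Because each percentage is rounded separately, the rounded values are not guaranteed to sum to exactly 100 for other inputs.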
In [8]:
#install the plotrix package for creating 3D pie charts
install.packages("plotrix")
Installing package into ‘/gpfs/global_fs01/sym_shared/YPProdSpark/user/s17c-9f3318fc11f06c-d37a4b9405b6/R/libs’
(as ‘lib’ is unspecified)
In [9]:
#add a 3D pie chart:
par(mfrow = c(2,2)) #creates a matrix of 2 rows x 2 cols filled in by row
#chart 1
slices <- c(10, 12, 4, 16, 8) #vector of areas
lbls <- c("US", "UK", "Australia", "Germany", "France") #vector of labels
pie(slices, labels = lbls, main = "Simple Pie Chart") #create pie chart
#chart 2
pct <- round(slices/sum(slices) * 100) #calculate percentages
lbls2 <- paste(lbls, " ", pct, "%", sep = "") #add percentages to the labels
pie(slices, labels = lbls2, col = rainbow(length(lbls2)), main = "Pie Chart with Percentages") #create pie chart
#chart 3
library(plotrix) #load the plotrix package
pie3D(slices, labels = lbls, explode = 0.1, main = "3D Pie Chart")
In [10]:
#add a pie chart created from a table:
par(mfrow = c(2,2)) #creates a matrix of 2 rows x 2 cols filled in by row
#chart 1
slices <- c(10, 12, 4, 16, 8) #vector of areas
lbls <- c("US", "UK", "Australia", "Germany", "France") #vector of labels
pie(slices, labels = lbls, main = "Simple Pie Chart") #create pie chart
#chart 2
pct <- round(slices/sum(slices) * 100) #calculate percentages
lbls2 <- paste(lbls, " ", pct, "%", sep = "") #add percentages to the labels
pie(slices, labels = lbls2, col = rainbow(length(lbls2)), main = "Pie Chart with Percentages") #create pie chart
#chart 3
library(plotrix) #load the plotrix package
pie3D(slices, labels = lbls, explode = 0.1, main = "3D Pie Chart")
#chart 4
mytable <- table(state.region) #creates a table from the region column of the state dataset
lbls3 <- paste(names(mytable), "\n", mytable, sep = "") #creates the labels vector
pie(mytable, labels = lbls3, main = "Pie Chart from a Table\n (with sample sizes)") #create pie chart

Fan plots

In [11]:
#fan plots are created using the fan.plot() function of the plotrix package
library(plotrix) #load the plotrix package
slices <- c(10, 12, 4, 16, 8) #vector of sizes
lbls <- c("US", "UK", "Australia", "Germany", "France") #vector of labels
fan.plot(slices, labels = lbls, main = "Fan Plot")

Histograms

In [12]:
#histograms display the distribution of a continuous variable (unlike bar plots and pie charts, which display categorical variables)
#by dividing the range of scores into a specified number of bins on the x-axis and displaying the frequency of scores in each bin
#on the y-axis.  You create histograms with the function hist(x), where x is a numeric vector of values.
#the option "freq = FALSE" creates a plot based on probability densities rather than frequencies.
#the "breaks" option controls the number of bins.  The default produces equally spaced breaks when defining the cells of the histogram
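The effect of breaks and freq can be inspected without drawing anything, because hist() returns the computed bins as an object. The sketch below, which assumes only base R and the built-in mtcars data, prints the boundaries, counts and densities for breaks = 12. Note that a single number passed to breaks is treated by hist() as a suggestion, so the number of bins actually used may differ from it.

```r
# plot = FALSE returns the histogram object without drawing it
h <- hist(mtcars$mpg, breaks = 12, plot = FALSE)

print(h$breaks)   # bin boundaries actually used
print(h$counts)   # frequencies plotted when freq = TRUE (the default)
print(h$density)  # heights plotted when freq = FALSE

bin_width <- diff(h$breaks)
# densities are scaled so that the bar areas sum to 1
total_area <- sum(h$density * bin_width)
print(total_area)
```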
In [22]:
#here is an example of four variations of a histogram (chart 1 only):
par(mfrow = c(2, 2)) #creates a matrix of 2 rows x 2 cols filled in by row
#chart 1
hist(mtcars$mpg) #simple histogram
In [23]:
#here is an example of four variations of a histogram (charts 1 and 2):
par(mfrow = c(2, 2)) #creates a matrix of 2 rows x 2 cols filled in by row
#chart 1
hist(mtcars$mpg) #simple histogram
#chart 2
hist(mtcars$mpg, #data
    breaks = 12, #number of breaks specified
    col = "red", #color specified
    xlab = "Miles Per Gallon", #x-axis label
    main = "Colored histogram with 12 bins") #title of the graph
In [25]:
#here is an example of four variations of a histogram (charts 1, 2 and 3):
par(mfrow = c(2, 2)) #creates a matrix of 2 rows x 2 cols filled in by row
#chart 1
hist(mtcars$mpg) #simple histogram
#chart 2
hist(mtcars$mpg, #data
    breaks = 12, #number of breaks specified
    col = "red", #color specified
    xlab = "Miles Per Gallon", #x-axis label
    main = "Colored histogram with 12 bins") #title of the graph
#chart 3
hist(mtcars$mpg, #data
    freq = F, #"freq = FALSE" creates a plot based on probability densities rather than frequencies
    breaks = 12, #number of breaks specified
    col = "red", #color specified
    xlab = "Miles Per Gallon", #x-axis label
    main = "Histogram, rug plot, density curve") #title of the graph
rug(jitter(mtcars$mpg)) #rug() draws a set of tick marks along the base of the plot, one per observation.
                        #jitter() adds a small amount of random noise to each value so that
                        #tied observations do not overprint each other; counterintuitively, this
                        #added noise can make the rug easier to read when the data contain many ties.
lines(density(mtcars$mpg), col = "blue", lwd = 2) #adds a density curve to an existing plot
In [27]:
#here is an example of four variations of a histogram (charts 1, 2, 3 and 4):
par(mfrow = c(2, 2)) #creates a matrix of 2 rows x 2 cols filled in by row
#chart 1
hist(mtcars$mpg) #simple histogram
#chart 2
hist(mtcars$mpg, #data
    breaks = 12, #number of breaks specified
    col = "red", #color specified
    xlab = "Miles Per Gallon", #x-axis label
    main = "Colored histogram with 12 bins") #title of the graph
#chart 3
hist(mtcars$mpg, #data
    freq = F, #"freq = FALSE" creates a plot based on probability densities rather than frequencies
    breaks = 12, #number of breaks specified
    col = "red", #color specified
    xlab = "Miles Per Gallon", #x-axis label
    main = "Histogram, rug plot, density curve") #title of the graph
rug(jitter(mtcars$mpg)) #rug() draws a set of tick marks along the base of the plot, one per observation.
                        #jitter() adds a small amount of random noise to each value so that
                        #tied observations do not overprint each other; counterintuitively, this
                        #added noise can make the rug easier to read when the data contain many ties.
lines(density(mtcars$mpg), col = "blue", lwd = 2) #adds a kernel density curve to an existing plot
#chart 4
x <- mtcars$mpg
h <- hist(x, #data
         breaks = 12, #number of breaks specified
         col = "red", #color specified
         xlab = "Miles Per Gallon", #x-axis label
         main = "Histogram with normal curve and box") #title of the graph
#code for superimposing a normal curve (credit to Peter Dalgaard)
xfit <- seq(min(x), max(x), length = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
yfit <- yfit * diff(h$mids[1:2]) * length(x)
lines(xfit, yfit, col = "blue", lwd = 2)
#adds a box() around the graph
box()
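The rescaling line in chart 4 converts a density into expected counts: a normal density integrates to 1, while the histogram bars enclose an area of n times the bin width, so the curve is multiplied by bin width times n. The sketch below spells out that step with intermediate names (bin_width, density_fit, count_fit) that are introduced here for illustration. It assumes equal-width bins, which is what makes the gap between the first two midpoints equal to the bin width.

```r
x <- mtcars$mpg
h <- hist(x, breaks = 12, plot = FALSE)

# with equal-width bins, the gap between midpoints equals the bin width
bin_width <- diff(h$mids[1:2])

xfit <- seq(min(x), max(x), length.out = 40)
density_fit <- dnorm(xfit, mean = mean(x), sd = sd(x))  # integrates to 1
count_fit <- density_fit * bin_width * length(x)        # same scale as h$counts

# total area enclosed by the histogram bars: n * bin width
print(sum(h$counts) * bin_width)
```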

Kernel density plots

In [28]:
#kernel density estimation is a nonparametric method for estimating the probability density function
#of a random variable.  the function is: plot(density(x)) where x = numeric vector.
#here is the code for creating 2 examples of kernel density plots:
par(mfrow = c(2, 1)) #creates a matrix of 2 rows x 1 col filled in by row
#chart 1
d <- density(mtcars$mpg) #stores the density in object d
plot(d) #creates a minimal graph with all defaults in place
#chart 2
d <- density(mtcars$mpg) #stores the density in object d
plot(d, main = "Kernel Density of Miles Per Gallon") #creates a minimal graph with a title
polygon(d, col = "red", border = "blue") #colors the curve blue and fills the area under the curve with red
rug(mtcars$mpg, col = "brown") #adds a brown rug, i.e. creates a set of tick marks along the base of a plot
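The object returned by density() can be examined directly, which helps show what the plotted curve represents. The following sketch, using only base R, prints the bandwidth that was chosen, checks numerically that the area under the curve is close to 1, and overlays a smoother estimate obtained with adjust = 2. The names area and d_smooth are illustrative.

```r
d <- density(mtcars$mpg)  # Gaussian kernel and automatic bandwidth by default

print(d$bw)          # bandwidth chosen by the default rule
print(length(d$x))   # number of grid points where the density was evaluated

# trapezoidal rule over the evaluation grid; should be close to 1
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
print(area)

# a larger bandwidth gives a smoother curve (adjust multiplies the default)
d_smooth <- density(mtcars$mpg, adjust = 2)
plot(d, main = "Default vs. doubled bandwidth")
lines(d_smooth, lty = 2)
```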
In [29]:
#kernel density plots can be used to compare groups.  to do this, you need the "sm" package
install.packages("sm")
Installing package into ‘/gpfs/global_fs01/sym_shared/YPProdSpark/user/s17c-9f3318fc11f06c-d37a4b9405b6/R/libs’
(as ‘lib’ is unspecified)
In [35]:
#comparative kernel density plots
library(sm) #load the sm package
attach(mtcars) #attach the mtcars dataset
#creates a grouping factor
cyl.f <- factor(cyl, levels = c(4, 6, 8), #data = cyl from the mtcars dataset; levels = vector of the 3 cylinder values
               labels = c("4 cylinder", "6 cylinder", "8 cylinder")) #labels = factor labels
#creates the comparative kd plots
sm.density.compare(mpg, cyl, xlab = "Miles per Gallon")
#adds a title to the chart
title(main = "MPG Distribution by Car Cylinders")
#adds a legend
colfill <- c(2:(1 + length(levels(cyl.f))))
legend("topright", inset = 0.05, levels(cyl.f), fill = colfill)
#detach the mtcars dataset
detach(mtcars)
The following objects are masked from mtcars (pos = 3):

    am, carb, cyl, disp, drat, gear, hp, mpg, qsec, vs, wt

The following objects are masked from mtcars (pos = 4):

    am, carb, cyl, disp, drat, gear, hp, mpg, qsec, vs, wt

The following objects are masked from mtcars (pos = 5):

    am, carb, cyl, disp, drat, gear, hp, mpg, qsec, vs, wt
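The "masked" messages above appear when attach(mtcars) is run while earlier copies are still attached. A comparable plot can be sketched in base R without attach() and without the sm package by estimating one density per group and overlaying the curves. This is an illustrative alternative, not a reproduction of sm.density.compare(): each group here gets its own default bandwidth, which may differ from the smoothing that sm applies.

```r
groups <- split(mtcars$mpg, mtcars$cyl)  # list of mpg vectors, one per cyl value
dens <- lapply(groups, density)

# common axis limits so that no curve is clipped
xlim <- range(unlist(lapply(dens, function(d) d$x)))
ylim <- c(0, max(unlist(lapply(dens, function(d) d$y))))

plot(NULL, xlim = xlim, ylim = ylim,
     xlab = "Miles Per Gallon", ylab = "Density",
     main = "MPG Distribution by Car Cylinders")

cols <- seq_along(dens) + 1  # colors 2, 3, 4
for (i in seq_along(dens)) {
  lines(dens[[i]], col = cols[i], lwd = 2)
}
legend("topright", inset = 0.05,
       legend = paste(names(dens), "cylinder"), fill = cols)
```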

Box plots

In [37]:
#a box-and-whiskers plot describes the distribution of a continuous variable by plotting
#its five-number summary: min, lower quartile (25th percentile), median (50th percentile),
#upper quartile (75th percentile) and the max.
#it can also display outliers, i.e. values that lie more than 1.5 * IQR below the lower quartile or above the upper quartile,
#where IQR (inter-quartile range) = upper quartile - lower quartile.  here is an example:
#create the box plot:
boxplot(mtcars$mpg, main = "Box Plot", ylab = "Miles Per Gallon")
#print relevant stats
boxplot.stats(mtcars$mpg)
$stats
  1. 10.4
  2. 15.35
  3. 19.2
  4. 22.8
  5. 33.9
$n
32
$conf
  1. 17.1191615196633
  2. 21.2808384803367
$out
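The numbers reported by boxplot.stats() can be reproduced by hand, which makes the 1.5 * IQR rule concrete. The sketch below uses fivenum(), whose hinges are what the box is drawn from and which can differ slightly from the quartiles returned by quantile(). The variable names (spread, lower_fence, upper_fence) are chosen here for illustration.

```r
x <- mtcars$mpg

five <- fivenum(x)              # min, lower hinge, median, upper hinge, max
spread <- five[4] - five[2]     # distance between the hinges
lower_fence <- five[2] - 1.5 * spread
upper_fence <- five[4] + 1.5 * spread

outliers <- x[x < lower_fence | x > upper_fence]

print(five)
print(c(lower_fence, upper_fence))
print(outliers)   # empty for mpg, matching the empty $out above
```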

Using parallel box plots to compare groups

In [38]:
#boxplots can be created for individual variables or for variables by group
#the format is boxplot(formula, data = data frame)
#where formula is a formula and data denotes a data frame (or list) providing the data
#example of a formula: y ~ A, where a separate box plot for numeric variable y is generated
#for each value of categorical variable A.
#the formula y ~ A*B would produce a boxplot of numeric variable y for each combination of levels
#in categorical variables A and B
#adding option "varwidth = TRUE" makes the box plot widths proportional to the square root of their sample sizes
#add "horizontal = TRUE" to reverse axis orientation
In [39]:
#the following code revisits the impact of four, six and eight cylinders on auto mpg with parallel box plots
boxplot(mpg ~ cyl, data = mtcars, #formula, data
       main = "Car Mileage Data", #title
       xlab = "Number of cylinders", ylab = "Miles Per Gallon") #x- and y-axis labels
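The options described earlier (varwidth and the y ~ A*B formula) can be tried on the same data. The sketch below is illustrative rather than part of the source material: it stores the value that boxplot() returns so the group sizes and labels can be printed, and it uses the am (transmission) column of mtcars as a second grouping variable.

```r
# varwidth = TRUE: box widths proportional to the square root of group size
bp <- boxplot(mpg ~ cyl, data = mtcars, varwidth = TRUE,
              main = "Car Mileage Data",
              xlab = "Number of cylinders", ylab = "Miles Per Gallon")
print(bp$n)      # number of cars in each cylinder group
print(bp$names)  # group labels in plotting order

# y ~ A * B: one box per combination of transmission type (am) and cylinders
bp2 <- boxplot(mpg ~ am * cyl, data = mtcars,
               main = "MPG by Transmission and Cylinders",
               xlab = "Transmission (am) and cylinders", ylab = "Miles Per Gallon")
print(bp2$names)
```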