Use Spark for R to load data and run SQL queries

This notebook introduces basic Spark concepts and helps you get started with SparkR, the R interface to Spark.

Some familiarity with R is recommended. This notebook runs on R with Spark.

In this notebook, you'll use the publicly available mtcars data set from Motor Trend magazine to work through some SparkR basics. You'll learn how to load data, create a Spark DataFrame, aggregate data, run mathematical formulas, and run SQL queries against the data.

1. Load a DataFrame

A DataFrame is a distributed collection of data that is organized into named columns. The built-in R DataFrame called mtcars includes observations on the following 11 variables:

[, 1] mpg Miles / (US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu. in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time (seconds)
[, 8] vs 0 = V-engine, 1 = straight engine
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors
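Before bringing the data into Spark, you can take a quick look at the local data frame with base R. This is an optional, illustrative step that is not part of the original notebook cells:

str(mtcars)   # column names and types of the built-in data frame
dim(mtcars)   # 32 rows, 11 columns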

Preview the first 3 rows of the DataFrame by using the head() function:

In [1]:
head(mtcars, 3)
                mg  —
                mpg   cyl  disp  hp   drat  wt     qsec   vs  am  gear  carb
Mazda RX4       21.0  6    160   110  3.90  2.620  16.46  0   1   4     4
Mazda RX4 Wag   21.0  6    160   110  3.90  2.875  17.02  0   1   4     4
Datsun 710      22.8  4    108    93  3.85  2.320  18.61  1   1   4     1

Convert the car name data, which appears in the row names, into an actual column so that Spark can read it as a column:

In [2]:
mtcars$car <- rownames(mtcars)
mtcars <- mtcars[,c(12,1:11)]
rownames(mtcars) <- 1:nrow(mtcars)
head(mtcars)
car                mpg   cyl  disp  hp   drat  wt     qsec   vs  am  gear  carb
Mazda RX4          21.0  6    160   110  3.90  2.620  16.46  0   1   4     4
Mazda RX4 Wag      21.0  6    160   110  3.90  2.875  17.02  0   1   4     4
Datsun 710         22.8  4    108    93  3.85  2.320  18.61  1   1   4     1
Hornet 4 Drive     21.4  6    258   110  3.08  3.215  19.44  1   0   3     1
Hornet Sportabout  18.7  8    360   175  3.15  3.440  17.02  0   0   3     2
Valiant            18.1  6    225   105  2.76  3.460  20.22  1   0   3     1

2. Initialize an SQLContext

To work with a DataFrame, you need an SQL context. In current versions of SparkR, the Spark session returned by sparkR.session() takes the place of the older sparkRSQL.init(sc) call. A SparkContext named sc, which has been created for you, is used to initialize the session:

In [3]:
sqlContext <- sparkR.session(sc)
Obtaining Spark session....
Spark session obtained.

3. Create a Spark DataFrame

Using the Spark session and the local DataFrame that you loaded, create a Spark DataFrame and print its schema, or structure:

In [4]:
sdf <- createDataFrame(mtcars, schema = NULL) 
printSchema(sdf)
root
 |-- car: string (nullable = true)
 |-- mpg: double (nullable = true)
 |-- cyl: double (nullable = true)
 |-- disp: double (nullable = true)
 |-- hp: double (nullable = true)
 |-- drat: double (nullable = true)
 |-- wt: double (nullable = true)
 |-- qsec: double (nullable = true)
 |-- vs: double (nullable = true)
 |-- am: double (nullable = true)
 |-- gear: double (nullable = true)
 |-- carb: double (nullable = true)

Display the content of the Spark DataFrame:

In [5]:
SparkR::head(sdf, 32)
car                  mpg   cyl  disp   hp   drat  wt     qsec   vs  am  gear  carb
Mazda RX4            21.0  6    160.0  110  3.90  2.620  16.46  0   1   4     4
Mazda RX4 Wag        21.0  6    160.0  110  3.90  2.875  17.02  0   1   4     4
Datsun 710           22.8  4    108.0   93  3.85  2.320  18.61  1   1   4     1
Hornet 4 Drive       21.4  6    258.0  110  3.08  3.215  19.44  1   0   3     1
Hornet Sportabout    18.7  8    360.0  175  3.15  3.440  17.02  0   0   3     2
Valiant              18.1  6    225.0  105  2.76  3.460  20.22  1   0   3     1
Duster 360           14.3  8    360.0  245  3.21  3.570  15.84  0   0   3     4
Merc 240D            24.4  4    146.7   62  3.69  3.190  20.00  1   0   4     2
Merc 230             22.8  4    140.8   95  3.92  3.150  22.90  1   0   4     2
Merc 280             19.2  6    167.6  123  3.92  3.440  18.30  1   0   4     4
Merc 280C            17.8  6    167.6  123  3.92  3.440  18.90  1   0   4     4
Merc 450SE           16.4  8    275.8  180  3.07  4.070  17.40  0   0   3     3
Merc 450SL           17.3  8    275.8  180  3.07  3.730  17.60  0   0   3     3
Merc 450SLC          15.2  8    275.8  180  3.07  3.780  18.00  0   0   3     3
Cadillac Fleetwood   10.4  8    472.0  205  2.93  5.250  17.98  0   0   3     4
Lincoln Continental  10.4  8    460.0  215  3.00  5.424  17.82  0   0   3     4
Chrysler Imperial    14.7  8    440.0  230  3.23  5.345  17.42  0   0   3     4
Fiat 128             32.4  4     78.7   66  4.08  2.200  19.47  1   1   4     1
Honda Civic          30.4  4     75.7   52  4.93  1.615  18.52  1   1   4     2
Toyota Corolla       33.9  4     71.1   65  4.22  1.835  19.90  1   1   4     1
Toyota Corona        21.5  4    120.1   97  3.70  2.465  20.01  1   0   3     1
Dodge Challenger     15.5  8    318.0  150  2.76  3.520  16.87  0   0   3     2
AMC Javelin          15.2  8    304.0  150  3.15  3.435  17.30  0   0   3     2
Camaro Z28           13.3  8    350.0  245  3.73  3.840  15.41  0   0   3     4
Pontiac Firebird     19.2  8    400.0  175  3.08  3.845  17.05  0   0   3     2
Fiat X1-9            27.3  4     79.0   66  4.08  1.935  18.90  1   1   4     1
Porsche 914-2        26.0  4    120.3   91  4.43  2.140  16.70  0   1   5     2
Lotus Europa         30.4  4     95.1  113  3.77  1.513  16.90  1   1   5     2
Ford Pantera L       15.8  8    351.0  264  4.22  3.170  14.50  0   1   5     4
Ferrari Dino         19.7  6    145.0  175  3.62  2.770  15.50  0   1   5     6
Maserati Bora        15.0  8    301.0  335  3.54  3.570  14.60  0   1   5     8
Volvo 142E           21.4  4    121.0  109  4.11  2.780  18.60  1   1   4     2
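If you ever need the data back as a local R data frame, for example to plot it with base R graphics, collect() copies all rows from the cluster to the driver. This is an illustrative aside and is only appropriate for small data sets:

local_df <- collect(sdf)   # materialize the Spark DataFrame on the driver
class(local_df)            # "data.frame"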

Try different ways of retrieving subsets of the data. For example, get the first 5 values in the mpg column:

In [6]:
SparkR::head(select(sdf, sdf$mpg),5)
mpg
21.0
21.0
22.8
21.4
18.7
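You can also select several columns at once. As another illustrative subset (not one of the original cells):

SparkR::head(select(sdf, sdf$car, sdf$mpg, sdf$hp), 5)   # three columns, first five rows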

Filter the DataFrame to retain only rows with mpg values that are less than 18:

In [7]:
SparkR::head(SparkR::filter(sdf, sdf$mpg < 18))
car                 mpg   cyl  disp   hp   drat  wt    qsec   vs  am  gear  carb
Duster 360          14.3  8    360.0  245  3.21  3.57  15.84  0   0   3     4
Merc 280C           17.8  6    167.6  123  3.92  3.44  18.90  1   0   4     4
Merc 450SE          16.4  8    275.8  180  3.07  4.07  17.40  0   0   3     3
Merc 450SL          17.3  8    275.8  180  3.07  3.73  17.60  0   0   3     3
Merc 450SLC         15.2  8    275.8  180  3.07  3.78  18.00  0   0   3     3
Cadillac Fleetwood  10.4  8    472.0  205  2.93  5.25  17.98  0   0   3     4
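Filter conditions can be combined with the usual logical operators on columns. For example, a small sketch that keeps only 8-cylinder cars with mpg below 18:

SparkR::head(SparkR::filter(sdf, sdf$mpg < 18 & sdf$cyl == 8))   # combine conditions with &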

4. Aggregate data after grouping by columns

Spark DataFrames support a number of common functions to aggregate data after grouping. For example, you can compute the average weight of cars as a function of the number of cylinders:

In [8]:
SparkR::head(summarize(groupBy(sdf, sdf$cyl), wtavg = avg(sdf$wt)))
cyl  wtavg
8    3.999214
4    2.285727
6    3.117143
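summarize() accepts several aggregate expressions in a single call. As an illustrative extension of the cell above (the maxhp column name is just an example), you could compute the average weight and the maximum horsepower per cylinder count:

SparkR::head(summarize(groupBy(sdf, sdf$cyl),
                       wtavg = avg(sdf$wt),    # average weight per group
                       maxhp = max(sdf$hp)))   # maximum horsepower per group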

You can also sort the output from the aggregation to determine the most popular cylinder configuration in the DataFrame:

In [9]:
car_counts <- summarize(groupBy(sdf, sdf$cyl), count = n(sdf$wt))
SparkR::head(arrange(car_counts, desc(car_counts$count)))
cyl  count
8    14
4    11
6    7

5. Operate on columns

SparkR provides a number of functions that you can apply directly to columns for data processing. In the following example, a basic arithmetic operation converts the weight column, which is recorded in thousands of pounds, to metric tons (1,000 lb ≈ 0.45 t):

In [10]:
sdf$wtTon <- sdf$wt * 0.45
SparkR::head(select(sdf, sdf$car, sdf$wt, sdf$wtTon),6)
car                wt     wtTon
Mazda RX4          2.620  1.17900
Mazda RX4 Wag      2.875  1.29375
Datsun 710         2.320  1.04400
Hornet 4 Drive     3.215  1.44675
Hornet Sportabout  3.440  1.54800
Valiant            3.460  1.55700
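The same column arithmetic works for any unit conversion. As another hedged sketch (the kmpl column is illustrative, not part of the original cells), you could convert miles per gallon to kilometers per liter using 1 mpg ≈ 0.425 km/L:

sdf$kmpl <- sdf$mpg * 0.425                              # derive a kilometers-per-liter column
SparkR::head(select(sdf, sdf$car, sdf$mpg, sdf$kmpl), 3)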

6. Run SQL queries from the Spark DataFrame

You can register a Spark DataFrame as a temporary table and then run SQL queries over the data. The sql function enables an application to run SQL queries programmatically and returns the result as a DataFrame:

In [11]:
createOrReplaceTempView(sdf, "cars")

highgearcars <- sql("SELECT car, gear FROM cars WHERE gear >= 5")
SparkR::head(highgearcars)
car             gear
Porsche 914-2   5
Lotus Europa    5
Ford Pantera L  5
Ferrari Dino    5
Maserati Bora   5
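Any Spark SQL statement can be run against the registered view. As one more illustrative query (not part of the original cells), you could compute the average mpg per cylinder count directly in SQL:

avg_mpg <- sql("SELECT cyl, AVG(mpg) AS avg_mpg FROM cars GROUP BY cyl ORDER BY cyl")
SparkR::head(avg_mpg)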

That's it!

You successfully completed this notebook! You learned how to load a DataFrame, view and filter the data, aggregate the data, perform operations on the data in specific columns, and run SQL queries against the data. For more information about Spark, see the Spark Quick Start Guide.

Want to learn more?

Free courses are available on Big Data University.

Authors

Saeed Aghabozorgi, PhD, is a Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients' ability to turn data into actionable knowledge. He is a researcher in the data mining field and an expert in developing advanced analytic methods like machine learning and statistical modelling on large data sets.

Polong Lin is a Data Scientist at IBM in Canada. Within the Emerging Technologies division, Polong is responsible for educating the next generation of data scientists through Big Data University. Polong is a regular speaker at conferences and meetups, and holds an M.Sc. in Cognitive Psychology.

Copyright © 2016, 2018 Big Data University. This notebook and its source code are released under the terms of the MIT License.
