Using Spark in RStudio

Although the RStudio IDE cannot be started in a Spark with R environment runtime, you can use Spark in your R scripts and Shiny apps by accessing Spark kernels programmatically. RStudio uses the sparklyr package to connect to Spark from R. The sparklyr package includes a dplyr interface to Spark data frames as well as an R interface to Spark’s distributed machine learning pipelines.
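For example, once a connection is established (the steps below show how to create one, named sc here), dplyr verbs run directly against Spark data frames. A minimal sketch of both interfaces:

library(sparklyr)
library(dplyr)

# Copy a local data frame to Spark; the result is a remote tbl_spark reference
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed by Spark
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

# Spark's distributed machine learning pipelines are exposed through ml_* functions
model <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)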

You can connect to Spark from RStudio:

  • By connecting to a Spark kernel that runs locally in the RStudio container in IBM Watson Studio
  • By connecting to a remote Spark kernel

RStudio includes sample code snippets that show you how to connect to a Spark kernel in your applications for both methods.

To use Spark in RStudio after you have launched the IDE:

  1. Locate the ibm_sparkaas_demos directory under your home directory and open it. The directory contains the following R scripts:

    • A readme file with details about the included R sample scripts
    • spark_kernel_basic_local.R, which includes sample code that shows how to connect to a local Spark kernel
    • spark_kernel_basic_remote.R, which includes sample code that shows how to connect to a remote Spark kernel
    • sparkaas_flights.R and sparkaas_mtcars.R, two examples of how to use Spark in a small sample application
  2. Use the sample code snippets in your R scripts or applications to help you get started using Spark.

Connecting to Spark from RStudio

To connect to Spark from RStudio by using the sparklyr R package, you need a Spark with R environment. You can either use the default Spark with R environment that is provided or create a custom Spark with R environment. To create a custom environment, see Creating environment templates.

After you launch RStudio in an RStudio environment, use the following sample code to get a list of the available Spark kernels and to connect to a kernel from your RStudio session:

# Load the Spark R packages
library(ibmwsrspark)
library(sparklyr)

# Load the available Spark kernels
kernels <- load_spark_kernels()

# Display the kernels
display_spark_kernels()

# Get the configuration of the first Spark kernel
conf <- get_spark_config(kernels[1])

# Adjust the Spark configuration; here, raise the maximum size of
# results returned to the driver to 1 GB
conf$spark.driver.maxResultSize <- "1G"

# Connect to the Spark kernel
sc <- spark_connect(config = conf)
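
Once you are connected, you can confirm that the session is live before you run any workloads. A minimal sketch, using standard sparklyr functions:

# Check that the connection is open and which Spark version the kernel runs
spark_connection_is_open(sc)
spark_version(sc)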

To disconnect from Spark when you are done, use:

# disconnect
spark_disconnect(sc)
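
In a longer script or a Shiny app, you might want to guarantee that the disconnect happens even if an error occurs. A minimal sketch, assuming the same kernel functions shown above (run_spark_job is a hypothetical wrapper name), that uses base R's on.exit() to release the kernel:

run_spark_job <- function() {
  # Connect to the first available Spark kernel
  kernels <- load_spark_kernels()
  conf <- get_spark_config(kernels[1])
  sc <- spark_connect(config = conf)

  # Release the kernel when the function exits, even if an error is raised
  on.exit(spark_disconnect(sc), add = TRUE)

  # ... your Spark workload goes here ...
  spark_version(sc)
}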

Examples of these commands are provided in the readme under /home/wsuser/ibm_sparkaas_demos.

Parent topic: RStudio
