Using Spark in RStudio
Last updated: Oct 09, 2024

Although the RStudio IDE cannot be started in a Spark with R environment runtime, you can use Spark in your R scripts and Shiny apps by accessing Spark kernels programmatically. RStudio uses the sparklyr package to connect to Spark from R. The sparklyr package includes a dplyr interface to Spark data frames as well as an R interface to Spark’s distributed machine learning pipelines.
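As a rough illustration of the dplyr interface, the sketch below defines a small helper that summarizes the built-in mtcars data set with ordinary dplyr verbs. The helper name mean_mpg_by_cyl and the table name mtcars_spark are hypothetical and chosen only for this example. The same verbs should apply to a Spark data frame created with copy_to() once a sparklyr connection (called sc here, and not created in this snippet) is open.

```r
library(dplyr)

# Hypothetical helper: average mpg and number of cars per cylinder count.
# `tbl` may be a Spark table or a local data frame, because the same
# dplyr verbs are used for both.
mean_mpg_by_cyl <- function(tbl) {
  tbl %>%
    group_by(cyl) %>%
    summarise(mean_mpg = mean(mpg, na.rm = TRUE), cars = n()) %>%
    arrange(cyl) %>%
    collect()  # bring the small summarized result back into the R session
}

# With sparklyr loaded and an open connection `sc` (an assumption, not
# created here), usage might look like this:
#   mtcars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)
#   mean_mpg_by_cyl(mtcars_tbl)

# The helper also runs on a local data frame, which allows a quick check
# without a Spark kernel:
local_result <- mean_mpg_by_cyl(mtcars)
```

Summarizing before calling collect() keeps the heavy work in Spark and transfers only the aggregated rows to R.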

You can connect to Spark from RStudio:

  • By connecting to a Spark kernel that runs locally in the RStudio container in IBM Watson Studio

RStudio includes sample code snippets that show you how to connect to a Spark kernel from your applications.

To use Spark in RStudio after you have launched the IDE:

  1. Locate the ibm_sparkaas_demos directory under your home directory and open it. The directory contains the following files:

    • A readme with details on the included R sample scripts
    • spark_kernel_basic_local.R includes sample code that shows how to connect to a local Spark kernel
    • spark_kernel_basic_remote.R includes sample code that shows how to connect to a remote Spark kernel
    • The files sparkaas_flights.R and sparkaas_mtcars.R are two examples of how to use Spark in a small sample application
  2. Use the sample code snippets in your R scripts or applications to help you get started using Spark.

Connecting to Spark from RStudio

To connect to Spark from RStudio using the sparklyr R package, you need a Spark with R environment. You can either use the default Spark with R environment that is provided or create a custom Spark with R environment. To create a custom environment, see Creating environment templates.

After you launch RStudio in an RStudio environment, use the following sample code to list the Spark environment details and to connect to a Spark kernel from your RStudio session:

# load spark R packages
library(ibmwsrspark)
library(sparklyr)

# load kernels
kernels <- load_spark_kernels()

# display kernels
display_spark_kernels()

# get Spark kernel configuration
conf <- get_spark_config(kernels[1])

# set Spark configuration
conf$spark.driver.maxResultSize <- "1G"

# connect to Spark kernel
sc <- spark_connect(config = conf)

Then to disconnect from Spark, use:

# disconnect
spark_disconnect(sc)

Examples of these commands are provided in the readme under /home/wsuser/ibm_sparkaas_demos.
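If a script can fail between connecting and disconnecting, it may help to tie the disconnect to the end of a function so that the Spark kernel is released even when an error occurs. The following is a minimal sketch under that assumption. The helper name with_spark is hypothetical and is not part of sparklyr or ibmwsrspark. It expects a configuration object such as the conf value produced by get_spark_config() in the sample code.

```r
# Hypothetical helper: open a connection with `conf`, run `f(sc)`,
# and always disconnect afterwards, even if `f` raises an error.
# The sparklyr:: prefix is used so that the package is only needed
# when the helper is actually called.
with_spark <- function(conf, f) {
  sc <- sparklyr::spark_connect(config = conf)
  # on.exit() runs when the function returns or fails
  on.exit(sparklyr::spark_disconnect(sc), add = TRUE)
  f(sc)
}

# Possible usage, assuming `conf` was created as in the sample code:
#   versions <- with_spark(conf, function(sc) sparklyr::spark_version(sc))
```

With this pattern, the calling code does not need a separate spark_disconnect(sc) call.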

