Although the RStudio IDE cannot be started in a Spark with R environment runtime, you can use Spark in your R scripts and Shiny apps by accessing Spark kernels programmatically. RStudio uses the sparklyr package to connect to Spark from R. The sparklyr package includes a dplyr interface to Spark data frames as well as an R interface to Spark’s distributed machine learning pipelines.
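For example, once a connection sc has been established (see the connection steps later in this topic), a minimal sketch of the dplyr and machine learning interfaces might look like the following; the mtcars data set and the model formula are illustrative choices, not part of the IBM samples:

library(sparklyr)
library(dplyr)

# Copy a local data frame to Spark (assumes an existing connection sc)
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and run on the cluster
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl)

# Fit a Spark MLlib linear regression through sparklyr's ML interface
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)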
You can connect to Spark from RStudio:
- By connecting to a Spark kernel that runs locally in the RStudio container in IBM Watson Studio
- By connecting to a remote Spark kernel
RStudio includes sample code snippets that show you how to connect to a Spark kernel in your applications for both methods.
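The bundled spark_kernel_basic_remote.R script shows the supported way to reach a remote kernel. As a general sparklyr illustration only, a remote connection often goes through Apache Livy; the endpoint and credentials below are placeholders, not IBM-provided values:

library(sparklyr)

# Generic Livy-based remote connection; endpoint and credentials are placeholders
config <- livy_config(username = "<user>", password = "<password>")
sc <- spark_connect(
  master = "http://<livy-host>:8998",
  method = "livy",
  config = config
)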
To use Spark in RStudio after you have launched the IDE:
- Locate the ibm_sparkaas_demos directory under your home directory and open it. The directory contains the following R scripts:
  - A readme with details on the included R sample scripts
  - spark_kernel_basic_local.R includes sample code of how to connect to a local Spark kernel
  - spark_kernel_basic_remote.R includes sample code of how to connect to a remote Spark kernel
  - sparkaas_flights.R and sparkaas_mtcars.R are two examples of how to use Spark in a small sample application
- Use the sample code snippets in your R scripts or applications to help you get started using Spark.
Connecting to Spark from RStudio
To connect to Spark from RStudio using the sparklyr R package, you need a Spark with R environment. You can either use the default Spark with R environment that is provided or create a custom Spark with R environment. To create a custom environment, see Creating environment templates.
Follow these steps after you launch RStudio in an RStudio environment:
Use the following sample code to get a listing of the Spark environment details and to connect to a Spark kernel from your RStudio session:
# Load the Spark R packages
library(ibmwsrspark)
library(sparklyr)
# Load the available Spark kernels
kernels <- load_spark_kernels()
# Display the kernels
display_spark_kernels()
# Get the configuration of the first Spark kernel
conf <- get_spark_config(kernels[1])
# Set additional Spark configuration options
conf$spark.driver.maxResultSize <- "1G"
# Connect to the Spark kernel
sc <- spark_connect(config = conf)
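Once connected, you can confirm that the session is live. The calls below are standard sparklyr and dplyr functions, shown as an illustrative check rather than part of the IBM samples:

library(dplyr)
# Check the Spark version of the connected kernel
spark_version(sc)
# List the tables currently registered in the session
src_tbls(sc)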
Then to disconnect from Spark, use:
# disconnect
spark_disconnect(sc)
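If several connections are open in the same R session, sparklyr also provides spark_disconnect_all() to close them in one call:

# Close every open Spark connection in this R session
spark_disconnect_all()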
Examples of these commands are provided in the readme under /home/wsuser/ibm_sparkaas_demos.
Parent topic: RStudio