Add an Amazon EMR Spark service

If you have Watson Studio Enterprise, you can continue working in IBM Watson Studio with the data that you have stored in Amazon Simple Storage Service (Amazon S3) by running your notebook in an Amazon Elastic MapReduce (Amazon EMR) cluster. Before you can run a notebook on Amazon EMR and access your data, you must set up the Amazon EMR service and associate it with your project.

You can add one or more Amazon EMR Spark services to a project, but you can't add the same service to another project. Each notebook runs on a dedicated kernel in an Amazon EMR service. You can stop the kernel from the Kernel menu on the notebook action bar.

If you have more than one Amazon EMR Spark service, you can change the Amazon EMR service that is associated with a project. Click Settings on the project page and in the Associate Services section, remove the existing Amazon EMR service and add a new one.

You can learn how to use Amazon EMR in Watson Studio by opening the sample notebook: Analyze accident reports on Amazon EMR Spark.

Known limitations and unsupported features

The following list details the limitations for users and includes the notebook features that are not enabled for Amazon EMR:

  • Only Python notebook kernels are supported.
  • Initializing the Python Spark kernel can take some time, depending on how long the PySpark shell takes to start on the cluster.
  • Notebooks that run on Amazon EMR can't be scheduled.
  • You can't add files and data source connections to the notebook (the Find and Add Data icon is disabled).
  • You can't monitor the execution of the Spark jobs for code cells in a notebook. To monitor your running Spark jobs, use the Spark history server provided by Amazon EMR.
  • You can't stop the notebook kernel from the notebook's Actions menu on the project page. You can only stop the kernel from the notebook action bar.

Add an Amazon EMR Spark service to a project

To enable notebooks to run on your Amazon EMR cluster:

  1. Create an EMR cluster and set up a Jupyter Kernel Gateway. The Jupyter Kernel Gateway is a web server that supports communication between Watson Studio and the Jupyter notebook kernels on Amazon EMR. See Step 1.
  2. Ensure that the connection to the Kernel Gateway is secure. See Step 2.
  3. Associate the Kernel Gateway web server on Amazon EMR with the Watson Studio project that you add your notebook to. See Step 3.

Step 1: Create an EMR cluster and set up the Kernel Gateway

Before you can add an Amazon EMR Spark service to your project, you must create a cluster on Amazon EMR and set up a Jupyter Kernel Gateway:

  1. Open the Amazon EMR console.
  2. Click Services and select EMR in the analytics section.
  3. Click Create Cluster and follow the steps to create a cluster.
  4. When the cluster is running, log in to the master node of the cluster as the default Hadoop user.
  5. In the shell of the master node, enter the following commands:

    a. wget https://raw.githubusercontent.com/IBMDataScience/kernelgateway-setup/master/install_kg_emr_bootstrap_script.sh

    This command downloads the Kernel Gateway setup script.

    b. chmod +x install_kg_emr_bootstrap_script.sh && ./install_kg_emr_bootstrap_script.sh --port <kernelgateway-port> --token <personal-access-token>

    This command runs the Kernel Gateway setup script.

    <kernelgateway-port> is the port for the Kernel Gateway web server to listen on.
    Ensure that the security group assigned to the EMR master node allows inbound connections from everywhere on the port you choose.

    <personal-access-token> is your secure password to the Kernel Gateway. You can select your own password. Remember this access token for when you add the service to your project in Watson Studio.

    The script installs and starts the Kernel Gateway web server. The script returns the URL to the Kernel Gateway, which you need when you add this service to your project in Watson Studio.
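Before you continue, you might want to confirm that the Kernel Gateway responds on the port you chose. The following Python sketch is one way to do that. It assumes that the gateway accepts the access token in an `Authorization: token <value>` header and exposes the `/api/kernelspecs` endpoint, as Jupyter Kernel Gateway does by default. The function names are hypothetical and are not part of the setup script.

```python
import json
import urllib.request


def build_gateway_request(base_url, token, path="/api/kernelspecs"):
    """Build an authenticated GET request for a Kernel Gateway endpoint.

    Assumption: the gateway expects the token in an Authorization header
    of the form "token <personal-access-token>".
    """
    url = base_url.rstrip("/") + path
    headers = {"Authorization": "token " + token}
    return urllib.request.Request(url, headers=headers, method="GET")


def list_kernelspecs(base_url, token, timeout=10):
    """Return the sorted names of the kernels that the gateway reports."""
    request = build_gateway_request(base_url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return sorted(payload.get("kernelspecs", {}))


# Example call (replace the placeholders with the URL that the setup
# script returned and with your own access token):
# print(list_kernelspecs("<kernelgateway-url>", "<personal-access-token>"))
```

If the call raises a connection error, check that the security group of the EMR master node allows inbound traffic on the Kernel Gateway port.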

Step 2: Create a secure connection

Perform the following steps to ensure that the connection to the Kernel Gateway is secure. These steps describe one way to secure your connection.

  1. Register a new domain if you don't already have an existing one. For example, emr.example.com.
  2. Request a certificate for the registered domain if you don't have a certificate:

    a. Associate a contact email address with the domain. This address can be in an MX record or can be one of the five common system admin addresses, for example, admin@emr.example.com or hostmaster@emr.example.com. AWS Certificate Manager (ACM) needs this address to verify that you control the domain for which the ACM certificate is issued.

    b. Request an SSL certificate for the registered domain by using AWS Certificate Manager (ACM).

  3. Create a Classic Load Balancer with an HTTPS Listener. Note: This might incur additional costs.
  4. Assign the certificate to your load balancer and bind your registered Amazon EMR instance to it. For details, see SSL/TLS Certificates for Classic Load Balancers.
  5. Configure the load balancer:

    a. Because the Kernel Gateway forwards websocket connections, define port forwarding to route connection requests. For example: secure TCP port 443 -> TCP port <kernelgateway-port>

    Note: Ensure that this setting is in accordance with your company security guidelines.

    b. Configure the load balancer's health check by entering the following field values:

    Ping protocol: HTTP

    Ping port: <kernelgateway-port>

    Ping path: /api/swagger.json

    The load balancer sends a request to the registered Amazon EMR instance at the Kernel Gateway port and the /api/swagger.json path every 30 seconds (default interval) to check whether the Kernel Gateway web server is still functional.

  6. Add a CNAME record with the load balancer's DNS name to your domain.

    The domain record will be updated within 24 hours. Thereafter, you can associate the Amazon EMR Spark service with a project in Watson Studio.

    When the state of the load balancer is marked as InService on the Amazon EMR console, you can run notebooks in your Amazon EMR Spark service.

    Using a static endpoint has the added advantage that the Amazon EMR service continues to be valid as long as the load balancer is running. If you stop a cluster and create a new one, you can assign the load balancer to the new cluster, and your associated service will run again.
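If you prefer to script the load balancer configuration instead of entering it in the console, the listener and health check settings from step 5 can be expressed as the dictionaries that the boto3 `elb` client accepts. This is a minimal sketch, not a definitive implementation: the function names are hypothetical, boto3 must be installed separately, and the timeout and threshold values are assumed defaults that the steps above do not specify. Only the 30-second interval, the port forwarding rule, and the ping path come from the steps above.

```python
def build_listener(gateway_port, certificate_arn):
    """Listener that forwards secure TCP port 443 to the Kernel Gateway port.

    TCP-level forwarding is used so that websocket connections pass through.
    """
    return {
        "Protocol": "SSL",  # secure TCP on the load balancer side
        "LoadBalancerPort": 443,
        "InstanceProtocol": "TCP",
        "InstancePort": int(gateway_port),
        "SSLCertificateId": certificate_arn,
    }


def build_health_check(gateway_port, interval_seconds=30):
    """Health check that pings /api/swagger.json on the Kernel Gateway port."""
    return {
        "Target": "HTTP:{}/api/swagger.json".format(int(gateway_port)),
        "Interval": interval_seconds,
        # Assumed values; adjust them to your own requirements.
        "Timeout": 5,
        "UnhealthyThreshold": 2,
        "HealthyThreshold": 10,
    }


def apply_health_check(load_balancer_name, gateway_port):
    """Apply the health check to an existing Classic Load Balancer."""
    import boto3  # third-party package, imported only when this is called

    elb = boto3.client("elb")
    return elb.configure_health_check(
        LoadBalancerName=load_balancer_name,
        HealthCheck=build_health_check(gateway_port),
    )
```

The two builder functions only assemble settings, so you can review the result before anything is sent to AWS. Only `apply_health_check` contacts AWS, and it requires valid credentials.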

Step 3: Associate the Amazon EMR Spark service with a project

Now add this newly created service to your project:

  1. Open your project in Watson Studio.
  2. From the Project page, go to Settings and Associated Services (to view a list of all associated services connected to the project).
  3. Click add associated service, then select Amazon EMR from the list.
  4. On the Amazon EMR page, add a new service: enter your personal access token and the Kernel Gateway URL (the load balancer URL).
  5. Click save to add the service to the project. The Amazon EMR Spark Service is added to the list of Associated Services.