Loading and accessing data in a notebook
You can integrate data into notebooks by loading the data into a data structure or container, for example, a pandas.DataFrame, numpy.array, Spark RDD, or Spark DataFrame. If you created a notebook from one of the sample notebooks, the instructions in that notebook will guide you through loading data. To load data into your own notebooks, you can choose one of these options:
- Add a file from your local system
- Use a free data set from the Gallery
- Load data from a data source connection
- Use the project-lib library to interact with project assets
- Write your own code with Python functions to work with data and IBM Cloud Object Storage in notebooks
- Use an API function or operating system command to access the data
Load data from local files
To access data from a local file, you can load the file from within a notebook, or first load the file into your project. From your notebook, you add automatically generated code to access the data by using the Insert to code function. The inserted code serves as a quick start so that you can begin working with data sets right away.
The Insert to code function supports file types such as CSV, JSON, and XLSX. To learn which data structures are generated for which notebook language, see Data load support. For file types that are not supported, you can only insert the file credentials. With the credentials, you can write your own code to load the file data into a DataFrame or other data structure in a notebook cell.
To add a file from your local system to your notebook:
- Click the Find and Add Data icon, and then browse to a data file or drag it into your notebook sidebar.
- Click in an empty code cell in your notebook, click the Insert to code link below the file, and choose how to load the data.
Code is generated and added to your notebook for you. The generated code imports any required packages, accesses the data file with the object storage credentials, and loads the data into a DataFrame or RDD. Note that, for production systems, you should carefully review the generated code to determine if you should write your own code that better meets your needs.
To manually add file credentials and write code for the file access method and the DataFrame yourself:
- Add the file to your object storage by clicking the Find and Add Data icon, and then browsing to the data file or dragging it into your notebook sidebar.
- Click in an empty code cell in your notebook and then click the Insert to code > Insert Credentials function from the Files notebook sidebar.
- Pass your credentials to the appropriate method for your notebook language to access the data in your notebook. For example, see this code in a blog for Python and in a notebook for R.
- Reference the data access method in the appropriate read method for your language to load the data into a DataFrame or other data structure.
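The last two steps can be sketched as follows. This is a minimal illustration that uses only the Python standard library, not the code that Insert to code generates. The helper name `rows_from_stream` is hypothetical, and the in-memory `io.BytesIO` object is an assumption that stands in for the binary stream your object storage client would return after it receives the inserted credentials.

```python
import csv
import io


def rows_from_stream(stream, encoding="utf-8"):
    """Parse CSV data from a binary file-like object into a list of dicts."""
    # Wrap the byte stream so that the csv module can read it as text.
    # newline="" lets the csv module handle line endings itself.
    text = io.TextIOWrapper(stream, encoding=encoding, newline="")
    return list(csv.DictReader(text))


# Stand-in for the stream returned by an object storage client (assumption).
sample = io.BytesIO(b"city,temp\nOslo,4\nCairo,29\n")
rows = rows_from_stream(sample)
print(rows[0]["city"], len(rows))
```

In a notebook you would typically pass the same stream to a reader such as `pandas.read_csv` to get a DataFrame instead of a list of dictionaries.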
Load data sets from the Gallery
The data sets on the Gallery contain open data. Watch this short video to see how to work with public data sets in the Gallery.
To add a data set from the Gallery in your notebook, you copy the data set into a project:
- Find the card for the data set that you want to add.
- Click the Add to Project icon from the action bar, select the project, and click Add. Clicking View project takes you to the project Overview page. The data asset is added to the list of data assets on the project’s Assets page.
- Open your notebook in edit mode, and then click the Data icon to see your data set.
- To start working with the data in your data set, click Insert to code under the file name and choose how to load the data to your notebook. Note that the inserted code serves as a quick start to begin working with a data set or connection. For production systems, you should carefully review the inserted code to determine if you should write your own code that better meets your needs.
Load data from data source connections
You must create a connection to an IBM data service or an external data source before you can add data from that data source to your notebook. See Adding connections to projects.
The Insert to code function supports some database connections. To learn which database connections are supported, see Data load support. For database connections that are not supported, you can only insert the database connection credentials. With the credentials, you can write your own code to load the data into a DataFrame or other data structure in a notebook cell.
To load data from an existing data source connection into a data structure in your notebook:
- Open the notebook in edit mode.
- Click in an empty code cell, click Find and Add Data, and then click the Connections tab to see your connections.
- Click Insert to code under the connection name.
- If necessary, enter your personal credentials for locked data connections that are marked with a key icon. This is a one-time step that permanently unlocks the connection for you. After you have unlocked the connection, the key icon is no longer displayed. See Adding connections to projects.
For IBM Db2 Warehouse on Cloud (previously named IBM dashDB) and Databases for PostgreSQL:
- Choose how to load the data to your notebook.
- Select the schema and choose a table. Code is generated and added to your notebook for you.
- Run the cell.
For all other connections:
- Run the cell to load your credentials.
- Open the database connection that references your credentials. Note that when connecting to Db2 on z/OS, the Db2 driver needs to access the license JAR file, which must be added to the classpath. See Locating the license JAR file.
- Load the data into a DataFrame or other data structure.
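For connections without generated load code, the open-then-load pattern might look like the sketch below. It assumes a driver that follows the Python DB-API. The in-memory `sqlite3` database is only a stand-in so that the example runs anywhere, and `load_table` and the `sales` table are hypothetical names. Your own driver, connection arguments, and credentials will differ.

```python
import sqlite3


def load_table(conn, table):
    """Return (column_names, rows) for every row in `table`."""
    # Table names cannot be bound as SQL parameters, so accept only
    # plain identifiers before building the statement.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    cur = conn.cursor()
    cur.execute(f"SELECT * FROM {table}")
    columns = [desc[0] for desc in cur.description]
    rows = cur.fetchall()
    cur.close()
    return columns, rows


# Stand-in connection (assumption): replace with the driver and the
# credentials that the inserted cell loaded for your data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("north", 10), ("south", 7)])

columns, rows = load_table(conn, "sales")
print(columns, rows)
conn.close()
```

If pandas is available, `pandas.DataFrame(rows, columns=columns)` turns the result into a DataFrame.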
Use an API function or operating system command to access the data
You can use API functions or operating system commands in your notebook to access data, for example, the wget command to access data by using the HTTP, HTTPS, or FTP protocols. When you use these types of API functions and commands, you need to include code that sets the project access token. See Manually add the project access token.
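As a Python alternative to calling wget, a download can be sketched with the standard library's `urllib.request`. The helper name `fetch` and the file names are hypothetical, and the example uses a local `file://` URL only so that it runs without network access. The sketch does not set the project access token.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def fetch(url, dest):
    """Download `url` to the local path `dest` and return the path."""
    dest = Path(dest)
    # Stream the response to disk instead of holding it all in memory.
    with urllib.request.urlopen(url) as response, dest.open("wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Demonstration with a local file:// URL so that the sketch runs offline.
# In a notebook, the URL would point at your HTTP, HTTPS, or FTP source.
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.csv"
source.write_text("id,value\n1,42\n")

copy = fetch(source.as_uri(), workdir / "copy.csv")
print(copy.read_text())
```

The downloaded file can then be read into a data structure in the same way as any other local file.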