Loading and accessing data in a notebook
You can integrate data into notebooks by loading the data into a data structure or container, for example, a pandas.DataFrame, numpy.array, Spark RDD, or Spark DataFrame. If you created a notebook from one of the sample notebooks, the instructions in that notebook guide you through loading data. To load data into your own notebooks, choose one of these options:
- Add a file from your local system
- Use a free data set from the Gallery
- Load data from a data source connection
- Use the project-lib library to interact with project assets
- Write your own code with Python functions to work with data and IBM Cloud Object Storage in notebooks
- Use an API function or operating system command to access the data
Important: Make sure that the environment in which the notebook is started has enough memory to store the data that you load into the notebook. Often, this means that the environment must have significantly more memory than the total size of the data that you load. The reason is that some data frameworks, like pandas, can hold multiple copies of the data in memory.
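As a rough illustration of the memory note above, the following sketch compares the size of some raw CSV text with the memory that pandas reports for the resulting DataFrame. The sample data and variable names are hypothetical, and the numbers depend on your pandas version and data types, so treat this as a way to measure your own data rather than as a rule of thumb.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a file that you would load.
csv_text = "city,population\nBoston,650000\nAustin,960000\nDenver,710000\n"

raw_bytes = len(csv_text.encode("utf-8"))
df = pd.read_csv(io.StringIO(csv_text))

# deep=True asks pandas to also count the contents of string columns,
# not only the references to them.
in_memory_bytes = int(df.memory_usage(deep=True).sum())

print(f"raw CSV size:   {raw_bytes} bytes")
print(f"DataFrame size: {in_memory_bytes} bytes")
```

Running the same measurement on a sample of your real data can help you decide how much memory the notebook environment needs.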
Load data from local files
To access data from a local file, you can load the file from within a notebook, or first load the file into your project. From your notebook, you add automatically generated code to access the data by using the Insert to code function. The inserted code serves as a quick start so that you can begin working with data sets.
The Insert to code function supports file types such as CSV, JSON, and XLSX. To learn which data structures are generated for which notebook language, see Data load support. For file types that are not supported, you can only insert the file credentials. With the credentials, you can write your own code to load the file data into a DataFrame or other data structure in a notebook cell.
To add a file from your local system to your notebook:
- Click the Find and Add Data icon, and then browse to a data file or drag it into your notebook sidebar.
- Click in an empty code cell in your notebook and then click the Insert to code link below the file.
To manually add file credentials and write code for the file access method and the DataFrame yourself:
- Add the file to your object storage by clicking the Find and Add Data icon, and then browsing to the data file or dragging it into your notebook sidebar.
- Click in an empty code cell in your notebook and then click the Insert to code > Insert Credentials function from the Files notebook sidebar.
- Insert your credentials into the appropriate method for your notebook language to access the data in your notebook. For example, see this code in a blog for Python.
- Reference the data access method in the appropriate read method for your language to load the data into a DataFrame or other data structure.
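The last two steps can be sketched in Python as follows. This is a minimal sketch, not the code that Insert to code generates: it assumes that you use the inserted credentials to build an S3-compatible storage client (for example, with the ibm_boto3 package) that exposes `get_object(Bucket=..., Key=...)`. The `read_csv_object` helper, the `FakeClient` stand-in, and the bucket and file names are all hypothetical; the stand-in exists only so that the example runs without real credentials.

```python
import io

import pandas as pd


def read_csv_object(client, bucket, key):
    """Fetch one object through an S3-style client and parse it as CSV."""
    response = client.get_object(Bucket=bucket, Key=key)
    # The "Body" entry is a stream; read it fully and hand pandas a buffer.
    payload = response["Body"].read()
    return pd.read_csv(io.BytesIO(payload))


class FakeClient:
    """Hypothetical stand-in for a real storage client (runs offline)."""

    def __init__(self, objects):
        self._objects = objects

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}


# Hypothetical bucket and file names; replace with your own values.
client = FakeClient({("my-bucket", "sales.csv"): b"region,units\neast,10\nwest,7\n"})
df = read_csv_object(client, "my-bucket", "sales.csv")
print(df)
```

With a real client, only the construction of `client` changes; the read step stays the same as long as the object is a CSV file.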
Load data sets from the Gallery
The data sets on the Gallery contain open data. Watch this short video to see how to work with public data sets in the Gallery.
To add a data set from the Gallery in your notebook, you copy the data set into a project:
- Find the card for the data set that you want to add.
- Click the Add to Project icon from the action bar, select the project, and click Add. Clicking View project takes you to the project Overview page. The data asset is added to the list of data assets on the project’s Assets page.
- Open your notebook in edit mode, and then click the Data icon to see your data set.
- To start working with the data in your data set, click Insert to code under the file name and choose how to load the data to your notebook. Note that the inserted code serves as a quick start to begin working with a data set or connection. For production systems, you should carefully review the inserted code to determine if you should write your own code that better meets your needs.
Load data from data source connections
You must create a connection to an IBM data service or an external data source before you can add data from that data source to your notebook. See Adding connections to projects.
The Insert to code function supports some database connections. To learn which database connections are supported, see Data load support. For database connections that are not supported, you can only insert the database connection credentials. With the credentials, you can write your own code to load the data into a DataFrame or other data structure in a notebook cell.
To load data from an existing data source connection into a data structure in your notebook:
- Open the notebook in edit mode.
- Click in an empty code cell, click Find and Add Data, and then click the Connections tab to see your connections.
- Click Insert to code under the connection name.
If necessary, enter your personal credentials for locked data connections, which are marked with a key icon. This is a one-time step that permanently unlocks the connection for you. After you unlock the connection, the key icon is no longer displayed. See Adding connections to projects.
- If the connection is supported, choose how to load the data to your notebook. Select the schema and choose a table.
- If the connection is not supported, load the credentials, open a database connection that references your credentials, and then write code to load the data. Note that when you connect to Db2 on z/OS, the Db2 driver needs to access the license JAR file, which must be added to the classpath. See Locating the license JAR file.
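For the unsupported-connection case, the "write code to load the data" step might look like the following sketch. It assumes that your database driver provides a Python DB-API 2.0 connection object; an in-memory SQLite database is used here purely as a stand-in so that the example runs anywhere. The `query_to_dataframe` helper and the `orders` table are hypothetical, and the code that opens a connection to your actual data source depends on its driver.

```python
import sqlite3

import pandas as pd


def query_to_dataframe(connection, sql):
    """Run a query on a DB-API 2.0 connection and return the rows as a DataFrame."""
    cursor = connection.cursor()
    try:
        cursor.execute(sql)
        # cursor.description holds one entry per result column; item 0 is the name.
        columns = [col[0] for col in cursor.description]
        return pd.DataFrame(cursor.fetchall(), columns=columns)
    finally:
        cursor.close()


# Stand-in database: an in-memory SQLite instance with a hypothetical table.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
connection.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

orders = query_to_dataframe(connection, "SELECT id, amount FROM orders ORDER BY id")
connection.close()
print(orders)
```

Going through a cursor keeps the helper independent of any one driver, at the cost of fetching the full result set into memory at once.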
Use an API function or operating system command to access the data
You can use API functions or operating system commands in your notebook to access data, for example, the wget command to access data by using the HTTP, HTTPS, or FTP protocols. When you use these types of API functions and commands, you need to include code that sets the project access token. See Manually add the project access token.
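As an API-function alternative to an operating system command, the following sketch downloads a file with the Python standard library. The `download` helper is a hypothetical name, and the sketch covers only the download itself; it does not set the project access token.

```python
import shutil
import urllib.request


def download(url, destination):
    """Stream the resource at `url` into the local file `destination`."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        # Copy in chunks so that large files are not held in memory at once.
        shutil.copyfileobj(response, out)
    return destination
```

For example, `download("https://example.com/data.csv", "data.csv")` would save the file next to the notebook, where the URL is a placeholder for your own data location.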