Buckets, file paths, and partitions in Cloud Object Storage

   

Table of contents

Introduction

Key components of Cloud Object Storage

Preferred practices for partition design

   

Introduction

Where and how your data is stored in IBM Cloud Object Storage (COS) can directly influence query performance and resource utilization. Buckets and partitions play important roles in data organization.

Let’s look at COS components to understand how to best store your data for optimum query performance. We’ll use the Clickstream sample data to demonstrate some partition strategies. The following code snippet is an example of the source data schema in JSON format:

    { 
        "customer_id": "12649",
        "time_stamp": "2018-01-02 12:33:43",
        "click_event_type": "add_to_cart",
        "product_name": "Office Supplies",
        "product_category": "Home Products", 
        "product_price": "12.99", 
        "total_price_of_basket": "41.54", 
        "total_number_of_items_in_basket": "6", 
        "total_number_of_distinct_items_in_basket": "2"
    }

   

Key components of Cloud Object Storage

The following components are key elements of Cloud Object Storage.

Bucket

A bucket is a logical abstraction that provides a “container” for data. Buckets are created only in COS. For example, you might create a bucket called “blackfriday” to hold all streaming data from Black Friday store sales, and another bucket called “postxmas” for store sales on 26-Dec. You might create a third bucket called “clickstream” to hold streaming data about all online sales activity.

You select which bucket to use in the Properties pane of the Cloud Object Storage operator in the streams flow canvas. When you set up the operator for your streams flow, click the Select Data Asset icon that is next to the File path field. The Select Data Asset window opens.

Important

In the Select Data Asset window, you can see all buckets in the COS instance, but you can access only buckets that have the same location as the COS Connector URL that you selected.

To see what buckets you can access, do the following steps. Tip: Use two browser windows for side-by-side work.

  1. In the first browser window, get the COS endpoint by doing the following steps:

    a. Go to the Projects page of the project, and then click the Assets tab.

    b. In the Data assets section, click Cloud Object Storage.

    c. In the Edit Connection window, go to the Login URL field, and then note the string immediately following https://s3. For example, if the Login URL value is https://s3.eu-geo.objectstorage.softlayer.net, then the string to note is eu-geo.

  2. In the second browser window, find which buckets are accessible by doing the following steps:

    a. Go to your IBM Cloud Dashboard.

    b. Click the name of your Cloud Object Storage service. The Buckets window opens and lists all buckets and their locations.

    [Screen capture: Partition buckets]

    c. In the LOCATION column, note locations that match the string in the COS URL from step 1c. Only those buckets are accessible. In our example, only buckets clickstream and crossregion0eu0geo are accessible because their location is eu-geo.

Connecting to any other bucket, for example crossregion0ap0geo, gives the following message:


Not_found: this error occurred while accessing the connectors service: The assets request failed: 
CDICO2015E: The crossregion0ap0geo container does not exist or you do not have sufficient permissions.
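The location check in steps 1c and 2c can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the bucket list is typed in by hand from the LOCATION column rather than fetched from COS, and the location "ap-geo" for crossregion0ap0geo is an assumption.

```python
from urllib.parse import urlparse


def location_from_login_url(login_url: str) -> str:
    """Return the label that follows 's3.' in the host name of a Login URL."""
    host = urlparse(login_url).hostname or ""
    labels = host.split(".")
    if len(labels) < 2 or labels[0] != "s3":
        raise ValueError(f"unexpected COS endpoint host: {host!r}")
    return labels[1]


def accessible_buckets(login_url: str, bucket_locations: dict) -> list:
    """Keep only the buckets whose location matches the endpoint location."""
    location = location_from_login_url(login_url)
    return sorted(name for name, loc in bucket_locations.items() if loc == location)


# Hypothetical bucket list, copied by hand from the LOCATION column.
buckets = {
    "clickstream": "eu-geo",
    "crossregion0eu0geo": "eu-geo",
    "crossregion0ap0geo": "ap-geo",  # assumed location
}

print(accessible_buckets("https://s3.eu-geo.objectstorage.softlayer.net", buckets))
# ['clickstream', 'crossregion0eu0geo']
```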

 

File path

The file path is the complete path to the file where you want to store data. The file path includes the bucket name, one or more optional folder names, and the file name.

The following file paths are examples of valid file paths:

  • /my_bucket/my_folder/my_file.csv
  • /my_bucket/my_new_folder/my_new_file.csv

In the File path field, you can create folders and files within existing buckets. For this example, we name the file “%TIME.parquet” and put the data into a bucket called “clickstream”. We’ll talk about the system variable %TIME in the next section.
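To make the structure of a file path concrete, here is a minimal Python sketch that splits a path into its bucket, folders, and file name. The `FilePath` type and `split_file_path` function are hypothetical helpers for illustration, not part of streams flows or COS.

```python
from typing import NamedTuple, Tuple


class FilePath(NamedTuple):
    bucket: str
    folders: Tuple[str, ...]
    file_name: str


def split_file_path(path: str) -> FilePath:
    """Split '/bucket/folder/.../file' into its three components."""
    parts = [p for p in path.split("/") if p]
    if len(parts) < 2:
        raise ValueError("a file path needs at least a bucket and a file name")
    return FilePath(bucket=parts[0], folders=tuple(parts[1:-1]), file_name=parts[-1])


print(split_file_path("/my_bucket/my_folder/my_file.csv"))
# FilePath(bucket='my_bucket', folders=('my_folder',), file_name='my_file.csv')
print(split_file_path("/clickstream/%TIME.parquet"))
# FilePath(bucket='clickstream', folders=(), file_name='%TIME.parquet')
```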

[Screen capture: File path for clickstream bucket]

   

System variables

You can add system variables (%TIME, %OBJECTNUM, and %PARTITIONS) to the file path to make it unique and to improve query performance.

Let’s look at each system variable.

%TIME is the time when the COS object is created. The default time format is yyyyMMdd_HHmmss.

The system variable %TIME can be added anywhere in the path after the bucket name. It can even be the file name. The variable is typically used to make dynamic file names when you expect the application to create multiple files.

The following file paths are examples of valid file paths with %TIME.

  • /my_bucket/event_%TIME.parquet
  • /my_bucket/%TIME.parquet
  • /my_bucket/my_new_folder/my_new_file_%TIME.csv
  • /my_bucket/geo/uk/geo_%TIME.parquet
  • /my_bucket/geo/uk/my_new_folder/%TIME_event.parquet
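The substitution of %TIME can be imitated in Python to preview the resulting object names. The sketch below assumes that %TIME is replaced by a plain string substitution using the default format yyyyMMdd_HHmmss; `expand_time` is a hypothetical helper, not a COS function.

```python
from datetime import datetime


def expand_time(file_path: str, created: datetime) -> str:
    """Replace %TIME with the creation time in the default yyyyMMdd_HHmmss format."""
    return file_path.replace("%TIME", created.strftime("%Y%m%d_%H%M%S"))


created = datetime(2017, 10, 22, 12, 49, 48)
print(expand_time("/my_bucket/geo/uk/geo_%TIME.parquet", created))
# /my_bucket/geo/uk/geo_20171022_124948.parquet
```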

   

%OBJECTNUM is an object number that starts at 0 and increases each time a new object is created for writing. Objects with the same name are overwritten. Typically, %OBJECTNUM is added after the file name.

The following file paths are examples of valid file paths with %OBJECTNUM.

  • /my_bucket/event_%OBJECTNUM.parquet
  • /my_bucket/geo/uk/geo_%OBJECTNUM.parquet
  • /my_bucket/%OBJECTNUM_event.csv
  • /my_bucket/%OBJECTNUM_%TIME.csv

Note: If partitioning is used, %OBJECTNUM is managed globally for all partitions in the COS object, rather than independently for each partition.
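The following Python sketch illustrates what a globally managed counter means in practice: the number keeps increasing across partitions instead of restarting at 0 in each one. The `ObjectNamer` class is hypothetical, and the partition folders (described in the Partition section) are passed in as ready-made strings for simplicity.

```python
from itertools import count


class ObjectNamer:
    """Expands %OBJECTNUM with one counter that is shared by all partitions."""

    def __init__(self, file_path: str):
        self.file_path = file_path
        self._next_num = count(0)  # object numbers start at 0

    def next_name(self, partition: str = "") -> str:
        # partition is a hypothetical, already formatted folder such as "customer_id=1"
        name = self.file_path.replace("%OBJECTNUM", str(next(self._next_num)))
        if not partition:
            return name
        head, _, tail = name.rpartition("/")
        return f"{head}/{partition}/{tail}"


namer = ObjectNamer("/my_bucket/event_%OBJECTNUM.parquet")
print(namer.next_name("customer_id=1"))  # /my_bucket/customer_id=1/event_0.parquet
print(namer.next_name("customer_id=2"))  # /my_bucket/customer_id=2/event_1.parquet
print(namer.next_name("customer_id=1"))  # /my_bucket/customer_id=1/event_2.parquet
```

With a counter per partition, the third object would have been named event_1.parquet instead.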

 

Partition

A partition is data that is grouped by a common attribute in the incoming schema. When you configure the Cloud Object Storage operator in the streams flow canvas, you create and add partitions to buckets.

Use partitions when you need to reduce the amount of data that queries must process. Streaming gives you access to massive amounts of data. Querying the entire data set might not be possible or even necessary. To improve query performance, break the data into chunks, or partitions, and then query the partition that you need.

For example, you might want to get information about online shopping users who put an item into their cart. Your first partition is “click_event_type” so that you can query on the clickstream “add_to_cart” event. Next, you add the partition “customer_id” because you want to analyze each customer’s online shopping behavior.

[Screen capture: Two partitions]

Recall that we defined the file path by using an existing bucket and a new file name that is based on time.

When the streams flow runs, COS automatically places the partitions “click_event_type” and “customer_id” immediately before the last part of the object name, “%TIME.parquet”. As a result, the object name in COS follows the pattern clickstream/click_event_type=value/customer_id=value/%TIME.parquet, where each value comes from the incoming data, as seen in the following screen capture.

[Screen capture: Objects in clickstream bucket]

Invalid partition values

If a value in a partition is not valid or is missing, it is replaced by the string __HIVE_DEFAULT_PARTITION__ in the OBJECT NAME list in COS.

For example, if the value of “customer_id” is not valid or is missing, the object name might be clickstream/click_event_type=add_to_cart/customer_id=__HIVE_DEFAULT_PARTITION__/20171022_124948.parquet.
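The naming rule can be sketched in Python as follows. The helper names are hypothetical, and the sketch assumes that a value that is "not valid" means a missing, None, or blank value; the actual validity rules that COS applies are not described here.

```python
HIVE_DEFAULT = "__HIVE_DEFAULT_PARTITION__"


def partition_segment(record: dict, attribute: str) -> str:
    """Build one 'attribute=value' folder, falling back to the default string."""
    value = record.get(attribute)
    # Assumption: only None and blank strings are treated as not valid.
    if value is None or str(value).strip() == "":
        value = HIVE_DEFAULT
    return f"{attribute}={value}"


def object_name(bucket: str, partitions: list, record: dict, file_name: str) -> str:
    """Place the partition folders immediately before the file name."""
    segments = [partition_segment(record, attribute) for attribute in partitions]
    return "/".join([bucket, *segments, file_name])


record = {"click_event_type": "add_to_cart"}  # customer_id is missing
print(object_name("clickstream", ["click_event_type", "customer_id"],
                  record, "20171022_124948.parquet"))
# clickstream/click_event_type=add_to_cart/customer_id=__HIVE_DEFAULT_PARTITION__/20171022_124948.parquet
```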

   

   

%PARTITIONS places the partitions that you defined anywhere in the object name in COS.

By default, partitions are placed immediately before the last part of the object name and in the order that they were created. You can change the location of these partitions (but not their order) in the COS object name by using the %PARTITIONS system variable.

In our example, the partition click_event_type is first, followed by the partition customer_id. By default, they are both placed before %TIME.parquet.

[Screen capture: Example of the default position of partitions in an object name]

Suppose that the file path is clickstream/store_location/%TIME.parquet. You added the folder store_location, and you defined the partitions click_event_type and customer_id.

The OBJECT NAME in COS would be something like clickstream/store_location/click_event_type=add_to_cart/customer_id=13579/20171022_124948.parquet.

Let’s see how the partition placement changes by using %PARTITIONS. Suppose that the file path now is clickstream/%PARTITIONS/store_location/%TIME.parquet.

The OBJECT NAME in COS would be clickstream/click_event_type=add_to_cart/customer_id=13579/store_location/20171022_124948.parquet.
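The two placement rules can be compared side by side with a small Python sketch. `place_partitions` is a hypothetical helper that assumes at least one partition is defined and that the partition folders are already formatted as attribute=value; %TIME is left unexpanded to keep the focus on placement.

```python
def place_partitions(file_path: str, partition_segments: list) -> str:
    """Insert the partition folders at %PARTITIONS, or before the last part by default."""
    joined = "/".join(partition_segments)  # the order of the partitions is preserved
    parts = file_path.split("/")
    if "%PARTITIONS" in parts:
        parts = [joined if part == "%PARTITIONS" else part for part in parts]
    else:
        parts.insert(len(parts) - 1, joined)  # default: immediately before the last part
    return "/".join(parts)


segments = ["click_event_type=add_to_cart", "customer_id=13579"]

print(place_partitions("clickstream/store_location/%TIME.parquet", segments))
# clickstream/store_location/click_event_type=add_to_cart/customer_id=13579/%TIME.parquet

print(place_partitions("clickstream/%PARTITIONS/store_location/%TIME.parquet", segments))
# clickstream/click_event_type=add_to_cart/customer_id=13579/store_location/%TIME.parquet
```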

   


   

Preferred practices for partition design

Now that we understand the basic components of COS, let’s see how we can design partitions to improve query performance and optimize resource utilization.

General principles for partition design

  1. Think about what kinds of queries you need. For example, you might need to build monthly reports or reports of sales by product line.

  2. Do not partition on an attribute that produces too many partitions. Reducing the number of partitions can greatly improve performance and reduce resource consumption.

  3. Do not partition on an attribute that produces many small files.

The following questions can help you to implement these principles:

  1. Which attribute is queried most often?

    Add the attribute that is queried most often as the first partition so that it is immediately after the bucket name. In the previous example, we wanted to report on online users who add items to their shopping cart, so “click_event_type” was the first partition.

  2. How is the data ingested?

    If your data is organized by attributes, then an attribute might be a good partition.

  3. Can the date attribute be split into year, month, and day?

    If so, use a Code operator to create new attributes for year, month, and day. Those new attributes must be in the Code operator’s output schema that is input to COS. You can then partition on one of the new attributes. The new attributes do not exist in the original streaming data, but they do exist in the COS partition.

    For example, in the Code operator, split the “time_stamp” attribute into three new attributes called “YYYY”, “MM”, and “DD”. Instead of partitioning on “time_stamp”, partition on “YYYY”. This partitioning results in fewer, but larger, files.

  4. Does the data have low cardinality, meaning only a few possible values?

    Data with low cardinality is suitable for partitioning. In Clickstream sample data, “time_stamp”, “session_duration”, and “product_price” typically have high cardinality, so they are unsuitable for partitioning. “product_category” has low cardinality, so it is a suitable candidate.

  5. How many partitions are you planning?

    Partitioning by too many attributes results in many small files. As a result, queries become slow and resource consumption is high.

    Typically, a partition size of about 1 GB is appropriate.
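Questions 3 and 4 can be explored on a sample of your data before you configure the flow. The Python sketch below splits “time_stamp” into “YYYY”, “MM”, and “DD” and counts the distinct values of a few attributes. It is plain Python with hypothetical function names and made-up sample events, not the Code operator’s actual interface.

```python
from datetime import datetime


def add_date_attributes(event: dict) -> dict:
    """Return a copy of the event with YYYY, MM, and DD derived from time_stamp."""
    ts = datetime.strptime(event["time_stamp"], "%Y-%m-%d %H:%M:%S")
    return {**event, "YYYY": f"{ts.year:04d}", "MM": f"{ts.month:02d}", "DD": f"{ts.day:02d}"}


def cardinality(events: list, attribute: str) -> int:
    """Count the distinct values of an attribute in a sample of events."""
    return len({event.get(attribute) for event in events})


# Made-up sample events that follow the Clickstream schema.
events = [
    {"time_stamp": "2018-01-02 12:33:43", "product_category": "Home Products"},
    {"time_stamp": "2018-01-02 12:35:10", "product_category": "Home Products"},
    {"time_stamp": "2018-02-14 09:01:27", "product_category": "Electronics"},
]
enriched = [add_date_attributes(event) for event in events]

for attribute in ("time_stamp", "product_category", "YYYY"):
    print(attribute, cardinality(enriched, attribute))
# time_stamp 3
# product_category 2
# YYYY 1
```

Even in this tiny sample, “time_stamp” yields one partition per event, whereas “YYYY” and “product_category” group events into far fewer partitions.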