Clickstream Example Pipeline
Clickstream - What’s it all about?
Clickstream is the recording of the areas of the screen that a user clicks while browsing the web. Wherever the user clicks in the web page, the action is logged. The log contains information such as time, URL, the user’s machine, type of browser, type of event (for example, browsing, checking out, logging in, logging out with purchase, removing from cart), product information (for example, ID, category, and price), total purchase in basket, number of items in basket, and session duration. This information can give valuable clues about what visitors are doing on your web site, and about the visitors themselves.
Clickstream analysis is useful for web activity analysis and market research. The navigation path can indicate purchase interests and price range. You can identify browsing patterns to determine the probability that the user will place an order.
Business use cases for Clickstream events
Let’s say that your online retail store wants to find out what shoppers are doing on your web site. What pages are they visiting? Do they buy online after visiting those pages, or do they leave without purchasing anything? Do the same shoppers return for more purchases, or do they come once and never return? How many times does a visitor browse a page before making a purchase?
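To make these questions concrete, here is a minimal Python sketch that answers two of them from a list of click events. The event labels ("browse", "checkout", "logout"), the customer IDs, and the function name summarize_shoppers are assumptions made for illustration; they are not taken from the sample data.

```python
from collections import defaultdict

# Hypothetical click events in time order: (customer_id, click_event_type).
# The event labels are illustrative assumptions, not the real data values.
events = [
    ("c1", "browse"),
    ("c1", "browse"),
    ("c1", "checkout"),
    ("c2", "browse"),
    ("c2", "logout"),
    ("c1", "browse"),
    ("c1", "checkout"),
]


def summarize_shoppers(events, purchase_event="checkout"):
    """Count browse events and purchases for each customer."""
    summary = defaultdict(lambda: {"browses": 0, "purchases": 0})
    for customer_id, event_type in events:
        if event_type == "browse":
            summary[customer_id]["browses"] += 1
        elif event_type == purchase_event:
            summary[customer_id]["purchases"] += 1
    return dict(summary)


summary = summarize_shoppers(events)

# Shoppers who came back and purchased more than once.
returning = [c for c, s in summary.items() if s["purchases"] > 1]

# Shoppers who visited but never purchased anything.
left_without_buying = [c for c, s in summary.items() if s["purchases"] == 0]
```

A real analysis would run over far more data and more event types, but the shape of the question stays the same: group the events by customer, then count.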
A data scientist can combine this clickstream data with your retail store’s ERP data to identify each shopper’s preferences and price range. The data scientist can also combine the clickstream data with social media data about the shopper to make targeted offers.
Example Clickstream data
The sample data that is used in the Clickstream streams flow contains formatted data from user actions in a web page. The data includes: customer ID, time stamp, type of click event, name of the product, category of the product, price, total price of all products in the basket, total number of all products in the basket, number of distinct items in the basket, and how long the user was on the site.
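One way to picture a single record of this data is as a small Python data class. Only click_event_type is an attribute name that this example confirms; the other field names, the types, and the sample values are assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass


@dataclass
class ClickEvent:
    """One clickstream record. Field names other than click_event_type
    are assumed for illustration."""
    customer_id: str
    time_stamp: str
    click_event_type: str
    product_name: str
    product_category: str
    product_price: float
    total_price_of_basket: float
    total_number_of_items_in_basket: int
    total_number_of_distinct_items_in_basket: int
    session_duration: int  # assumed to be in seconds


# Illustrative values only; not taken from the real sample data.
example = ClickEvent(
    customer_id="10001",
    time_stamp="2024-01-01T12:00:00Z",
    click_event_type="add_to_cart",
    product_name="running shoes",
    product_category="footwear",
    product_price=59.99,
    total_price_of_basket=119.98,
    total_number_of_items_in_basket=2,
    total_number_of_distinct_items_in_basket=1,
    session_duration=340,
)
```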
Our goal is to store data in IBM Cloud Object Storage when the online user has added something to the shopping cart. The data will be used for offline analysis.
Description of operators
The following screen capture shows how the clickstream example streams flow looks in the canvas:
Let’s look more closely at these three operators.
Sample Data operator
Sample Data is the source of clickstream data for the streams flow. We supply this data. The following screen capture shows the properties and the schema of the sample data:
The streams flow ingests the sample data. The schema attributes include customer ID, time stamp, type of click event, total price of items in the user’s shopping cart, and so on. In this example flow, we’re interested in the click_event_type attribute.
Filter operator
Next, we want to pull out only the data when a user puts something in the shopping cart. We use the Filter operator to select data where the click event type is add_to_cart. All other tuples are ignored.
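The filtering logic can be sketched in a few lines of Python. This is an illustration of the condition only, not how the Filter operator is implemented; the tuples are modeled as dictionaries, and every value other than add_to_cart is made up.

```python
def filter_add_to_cart(tuples):
    """Yield only the tuples whose click_event_type is add_to_cart."""
    for t in tuples:
        if t.get("click_event_type") == "add_to_cart":
            yield t


# Hypothetical incoming tuples; only add_to_cart comes from the example.
incoming = [
    {"customer_id": "c1", "click_event_type": "browse"},
    {"customer_id": "c1", "click_event_type": "add_to_cart"},
    {"customer_id": "c2", "click_event_type": "checkout"},
    {"customer_id": "c2", "click_event_type": "add_to_cart"},
]

# Tuples that do not match are simply dropped.
kept = list(filter_add_to_cart(incoming))
```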
Cloud Object Storage (COS)
COS provides cloud storage for massive amounts of unstructured data. If you do not yet have a Cloud Object Storage instance, you must provision one when you select the Clickstream example.
The Cloud Object Storage operator is the target of the streams flow.
In our example, the data is stored in a COS bucket called crossregion0eu0geo. In the bucket is a file object called add_to_cart. The tuples whose click_event_type == “add_to_cart” are stored in that file object.
The following screen capture shows the COS properties.
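To get a feel for what ends up in the file object, the sketch below serializes filtered tuples into text that could be written as the body of an object. The CSV format, the column names, and the helper tuples_to_csv are assumptions; this example does not state which format the Cloud Object Storage operator writes, and the upload itself is not shown.

```python
import csv
import io


def tuples_to_csv(tuples, fieldnames):
    """Serialize dictionaries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for t in tuples:
        writer.writerow(t)
    return buffer.getvalue()


# Hypothetical tuples that already passed the add_to_cart filter.
filtered = [
    {"customer_id": "c1", "click_event_type": "add_to_cart", "product_name": "shoes"},
    {"customer_id": "c2", "click_event_type": "add_to_cart", "product_name": "socks"},
]

# Text that could be stored as the content of the file object.
object_body = tuples_to_csv(
    filtered, ["customer_id", "click_event_type", "product_name"]
)
```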
Running the Clickstream example pipeline
When you click in the canvas, the Clickstream example pipeline is automatically created for you.
Flow shows all operators and the flow of data between them in the streams flow. Hover your mouse pointer over a data flow to show its throughput speed and event size.
Ingest Rate shows the number of events that are submitted to the pipeline per second for each streams flow source.
Throughput shows the throughput of input and output flows, if they exist. It also shows events that have errors.
By default, the Sample Data operator in the Flow graph is selected. Click Filter or Cloud Object Storage to see its Throughput.
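As a rough illustration of what these per-second figures mean, the sketch below computes an ingest rate and the output rate of a filter over a fixed time window. The window length, the event counts, and the function rate_per_second are assumptions for illustration; this example does not describe how the metrics are actually calculated.

```python
def rate_per_second(event_count, window_seconds):
    """Average number of events per second over a time window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return event_count / window_seconds


# Hypothetical 10-second window with 20 ingested events.
window_seconds = 10
ingested = ["browse"] * 15 + ["add_to_cart"] * 5

# Only add_to_cart events leave the filter.
passed = [e for e in ingested if e == "add_to_cart"]

ingest_rate = rate_per_second(len(ingested), window_seconds)
filter_output_rate = rate_per_second(len(passed), window_seconds)
```

Because the filter drops every tuple that is not add_to_cart, its output rate can never be higher than the rate at which the source submits events.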
Note that in the