This notebook demonstrates the use of visual analytics tools that help you measure and understand user journeys within the dialog logic of your Watson Assistant system, discover where user abandonments take place, and reason about possible sources of issues. The visual interface can also help to better understand how users interact with different elements of the dialog flow, and gain more confidence in making changes to it. The source of information is your Watson Assistant skill definitions and conversation logs.
As described in Watson Assistant Continuous Improvement Best Practices, you can use this notebook to measure and understand in detail the behavior of users in areas that are not performing well, e.g. having low Task Completion Rates.
Task Completion Rate is the percentage of user journeys within key tasks/flows of your virtual assistant that reach a successful resolution. It is one of the metrics you can use to measure the Effectiveness of your assistant.
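As a rough illustration of the definition above, the metric can be computed from a list of journey outcomes. The outcome labels and the `task_completion_rate` helper are hypothetical and are not part of the toolkit used later in this notebook.

```python
def task_completion_rate(outcomes):
    """Return the percentage of journeys whose outcome is 'completed'."""
    if not outcomes:
        return 0.0
    completed = sum(1 for outcome in outcomes if outcome == "completed")
    return 100.0 * completed / len(outcomes)

# Hypothetical outcomes for five user journeys through one task
journeys = ["completed", "abandoned", "completed", "rerouted", "abandoned"]
rate = task_completion_rate(journeys)
print("Task Completion Rate: {:.1f}%".format(rate))
```

Two of the five toy journeys complete, so the rate here is 40%.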
This notebook assumes some familiarity with the Watson Assistant dialog programming model, such as skills (formerly workspaces), and dialog nodes. Some familiarity with Python is recommended. This notebook runs on Python.
Load Assistant Skills and Logs
2.1 Load option one: from a Watson Assistant instance
2.2 Load option two: from JSON files
2.3 Load option three: from IBM Cloud Object Storage (using Watson Studio)
2.4 Load option four: from custom location
Visualizing user journeys and abandonments
4.1 Visualize dialog flow (turn-based)
4.2 Visualize dialog flow (milestone-based)
4.3 Select conversations at point of abandonment
Analyzing abandoned conversations
5.1 Explore conversation transcripts for qualitative analysis
5.2 Identify key words and phrases at point of abandonment
5.2.1 Summarize frequent keywords and phrases
Advanced Topics
7.1 Locating important dialog nodes in your assistant
7.1.1 Searching programmatically
7.1.2 Interactive Search and Exploration
7.2 Filtering
7.3 Advanced keyword analysis: Comparing abandoned vs. successful conversations
# Note, on Watson Studio the pip magic command `%pip` is not supported from within the notebook. Use !pip instead.
!pip install --user conversation_analytics_toolkit
import nltk
nltk.download('words')
nltk.download('punkt')
nltk.download('stopwords')
import conversation_analytics_toolkit
from conversation_analytics_toolkit import wa_assistant_skills
from conversation_analytics_toolkit import transformation
from conversation_analytics_toolkit import filtering2 as filtering
from conversation_analytics_toolkit import analysis
from conversation_analytics_toolkit import visualization
from conversation_analytics_toolkit import selection as vis_selection
from conversation_analytics_toolkit import wa_adaptor
from conversation_analytics_toolkit import transcript
from conversation_analytics_toolkit import flows
from conversation_analytics_toolkit import keyword_analysis
from conversation_analytics_toolkit import sentiment_analysis
import json
import pandas as pd
from pandas import json_normalize
from IPython.display import display, HTML
# set pandas to show more rows and columns
#pd.set_option('display.max_rows', 200)
pd.set_option('display.max_colwidth', None)
A project token is used to access data assets from notebooks.
To set a project token:
Open another browser window, go to your Watson Studio Project page, and click the Settings tab. Scroll down to Access tokens and click New token. Give the token a name and the Editor access role.
Click in an empty line in the cell below. Use the menu item with the three vertical dots, and choose Insert project token.
Copy the inserted string and replace the line project = Project(project_id='', project_access_token='')
with the equivalent string inserted by the menu item.
Note The Insert project token menu action sometimes creates a new cell in your notebook. You may need to find that cell, then cut and paste the token information into the cell below.
For more information about project tokens, see Manually add the project token.
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
# from project_lib import Project
# project = Project(project_id='', project_access_token='')
These artifacts can be loaded from multiple sources, such as directly from Watson Assistant log or message APIs, or from other locations such as local/remote file system, Cloud Object Storage, or a Database.
Note: below is a set of options for loading workspace(s) and log data from different sources into the notebook. Use only one of these methods, and then skip the remaining load options.
This notebook uses the Watson Assistant v1 API to access your skill definition and your logs. Provide your Watson Assistant credentials and the workspace id that you want to fetch data from.
You can access the values you need for this configuration from the Watson Assistant user interface. Go to the Skills page and select View API Details from the menu of a skill tile.
IAMAuthenticator: your API Key under Service Credentials.
service.set_service_url: the portion of the Legacy v1 Workspace URL that ends with /instances. For example, https://api.us-south.assistant.watson.cloud.ibm.com. This value will be different depending on the location of your service instance. Do not pass in the entire Workspace URL.
For Section 2.1.2, the value of workspace_id can be found on the same View API Details page. The value of the Skill ID can be used for the workspace_id variable. If you are using versioning in Watson Assistant, this ID represents the Development version of your skill definition.
For more information about authentication and finding credentials in the Watson Assistant UI, please see Watson Assistant v1 API in the offering documentation.
import ibm_watson
from ibm_watson import AssistantV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
authenticator = IAMAuthenticator("YOUR API KEY") # Add API key
# service = AssistantV1(version='2019-02-28',authenticator = authenticator)
service = AssistantV1(version='2021-06-14',authenticator = authenticator)
service.set_service_url("SERVICE URL") # Add service URL, for example "https://api.us-south.assistant.watson.cloud.ibm.com"
Fetch the workspace for the workspace id given in the workspace_id variable.
#select a workspace by specific id
#workspace_id = '' # Add workspace ID
# or fetch one via the APIs
workspaces=service.list_workspaces().get_result()
workspace_id = workspaces['workspaces'][0]['workspace_id']
#fetch the workspace
workspace=service.get_workspace(
workspace_id=workspace_id,
export=True
).get_result()
# set query parameters
limit_number_of_records=5000
# example of time range query
query_filter = "response_timestamp>=2019-10-30,response_timestamp<2019-10-31"
#query_filter = None
# Fetch the logs for the workspace
df_logs = wa_adaptor.read_logs(service, workspace_id, limit_number_of_records, query_filter)
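The time range filter above is a plain string, so it can be built from dates instead of being typed by hand. The sketch below is only an illustration: the `day_range_filter` helper is hypothetical, and it simply reproduces the `response_timestamp>=...,response_timestamp<...` pattern used in the cell above for a single calendar day.

```python
from datetime import date, timedelta

def day_range_filter(day):
    """Build a filter string covering one calendar day: [day, day + 1)."""
    next_day = day + timedelta(days=1)
    return "response_timestamp>={},response_timestamp<{}".format(
        day.isoformat(), next_day.isoformat())

# Reproduces the example filter used above
query_filter_example = day_range_filter(date(2019, 10, 30))
print(query_filter_example)
```

Using a half-open range (inclusive start, exclusive end) avoids counting a record twice when consecutive days are fetched separately.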
Note: Run this cell to load the sample data that this notebook uses as an example. To load your own logs, replace the requests.get argument with the location of your JSON files and adjust the rest of the code accordingly. In this example, we will analyze a banking sample dataset.
import requests
# this example uses Watson Assistant data sample on github
# pull sample workspace from watson developer cloud
response = requests.get("https://raw.githubusercontent.com/watson-developer-cloud/assistant-dialog-flow-analysis/master/data/banking-sample/wa-workspace.json").text
workspace = json.loads(response)
# NOTE: the workspace_id is typically available inside the workspace object.
# If you've used the `export skill` feature in Watson Assistant UI, you can find the skill id
# by clicking the `skill`-->`View API details` and copying the value of skill_id
workspace_id = workspace["workspace_id"]
#workspace_id = ''
# pull logs sample from watson developer cloud
response = requests.get("https://raw.githubusercontent.com/watson-developer-cloud/assistant-dialog-flow-analysis/master/data/banking-sample/wa-logs.json").text
df_logs = pd.DataFrame.from_records(json.loads(response))
print("loaded {} log records".format(str(len(df_logs))))
# @hidden_cell
# The project token is an authorization token that is used by Watson Studio to access project resources.
# For more details on project tokens, refer to https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/token.html
# ======
# from project_lib import Project
# project = Project(project_id='', project_access_token='')
# workspace_file = "wa-workspace.json"
# log_files = "wa-logs.json"
# workspace = json.loads(project.get_file(workspace_file))
# df_logs = pd.DataFrame.from_records(json.loads(project.get_file(log_files)))
# Depending on your production environment, your logs and workspace files might be stored in different locations,
# such as NoSQL databases, Cloud Object Storage files, etc.
# use custom code here, and make sure you load the workspace as a python dictionary, and the df_logs as a pandas DataFrame.
#workspace =
#df_logs =
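As one possible shape for such custom code, the sketch below reads a workspace and a log export from local JSON files. The file names, the stand-in file contents, and the `load_json_file` helper are assumptions made for illustration; only the final step (turning the records into a DataFrame with `pd.DataFrame.from_records`) mirrors what this notebook does in the JSON load option.

```python
import json
import os
import tempfile

def load_json_file(path):
    """Read one JSON document from disk."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)

# Write two tiny stand-in files so the sketch is self-contained (hypothetical content)
tmp_dir = tempfile.mkdtemp()
workspace_path = os.path.join(tmp_dir, "wa-workspace.json")
logs_path = os.path.join(tmp_dir, "wa-logs.json")
with open(workspace_path, "w", encoding="utf-8") as handle:
    json.dump({"workspace_id": "demo-skill", "dialog_nodes": []}, handle)
with open(logs_path, "w", encoding="utf-8") as handle:
    json.dump([{"workspace_id": "demo-skill", "request": {"input": {"text": "hi"}}}], handle)

workspace_example = load_json_file(workspace_path)   # a Python dictionary
log_records_example = load_json_file(logs_path)      # a list of log records
# In the notebook, the records would then become a DataFrame:
# df_logs = pd.DataFrame.from_records(log_records_example)
print(type(workspace_example).__name__, len(log_records_example))
```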
Note: If your logs are in a custom format or you wish to extract additional fields, you may need to customize the extraction process. You can learn more about this topic here.
Create the WA_Assistant_Skills class, to organize relevant workspace objects for the analysis. The Extract and Transform phase will use this class to match the workspace_id column in your logs, and fetch additional relevant information from the skill.
Call the add_skill() function to add relevant workspaces for the analysis. You can add multiple workspaces that correspond to different versions of a skill, or multiple skills of a multi-skill assistant (e.g. if you have your own custom code that routes messages to different dialog skills and want to analyze the collection of skills together).
#if you have more than one skill, you can add multiple skill definitions
skill_id = workspace_id
assistant_skills = wa_assistant_skills.WA_Assistant_Skills()
assistant_skills.add_skill(skill_id, workspace)
#validate the number of workspace_ids
print("workspace_ids in skills:", list(pd.DataFrame(assistant_skills.list_skills())["skill_id"].unique()))
print("workspace_ids in logs:", list(df_logs.workspace_id.unique()))
Call the to_canonical_WA_v2() function to perform the extract and transform steps.
If your assistant is multi-skill, set skill_id_field="workspace_id" to link information in the logs with the corresponding workspace object, based on the value of the workspace_id attribute in the logs.
df_logs_canonical = transformation.to_canonical_WA_v2(df_logs, assistant_skills, skill_id_field=None, include_nodes_visited_str_types=True)
#df_logs_canonical = transformation.to_canonical_WA_v2(df_logs, assistant_skills, skill_id_field="workspace_id", include_nodes_visited_str_types=True)
# the rest of the notebook runs on the df_logs_to_analyze object.
df_logs_to_analyze = df_logs_canonical.copy(deep=False)
df_logs_to_analyze.head(2)
The dialog flow visualization is an interactive tool for investigating user journeys, visits and abandonments within the steps of the dialog system.
The visualization aggregates the temporal sequences of steps from Watson Assistant logs. Its interactions let you explore the distribution of visits across steps, see where abandonment takes place, and select the conversations that visit a certain step for further exploration and analysis.
You can use the visualization to understand entire end-to-end journeys with the complete log dataset, or you can use filters to focus your exploration on particular journeys or sub-journeys.
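To make the idea of aggregating step sequences more concrete, here is a minimal sketch that counts, for every path prefix, how many conversations passed through it and how many ended there. It is not the toolkit's implementation: the `aggregate_paths` helper and the toy step labels are hypothetical, and this sketch treats every conversation end as a drop-off, which a real analysis may refine.

```python
from collections import defaultdict

def aggregate_paths(conversations):
    """Count, for every path prefix, how many conversations visited it
    and how many ended (dropped off) exactly there."""
    stats = defaultdict(lambda: {"flows": 0, "dropped_off": 0})
    for steps in conversations:
        for depth in range(1, len(steps) + 1):
            stats[tuple(steps[:depth])]["flows"] += 1
        if steps:
            stats[tuple(steps)]["dropped_off"] += 1
    return dict(stats)

# Toy conversations, each a sequence of step labels
toy_conversations = [
    ["Welcome", "Schedule appointment", "Enter zip code"],
    ["Welcome", "Schedule appointment"],
    ["Welcome", "Card payment"],
]
path_stats = aggregate_paths(toy_conversations)
for path, counts in sorted(path_stats.items()):
    print(" > ".join(path), counts)
```

Each printed row corresponds to one node of a flow chart: the path leading to it, the number of visits, and the number of conversations that stopped at that point.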
This notebook demonstrates how to construct the visualization at two levels of abstraction: turn-based and milestone-based.
The following mouse interactions are supported:
When you hover or click on a node the following information is displayed:
The examples below demonstrate how to create a turn-based flow analysis: (1) for all conversations in the dataset; and (2) for a subset of conversations that were escalated (denoted by visiting the node Transfer to Live Agent)
Visualizing all conversations on a turn-by-turn basis can help you to discover all existing conversation flows in your assistant.
title = "All Conversations"
turn_based_path_flows = analysis.aggregate_flows(df_logs_to_analyze, mode="turn-based", on_column="turn_label", max_depth=400, trim_reroutes=False)
# increase the width of the Jupyter output cell
display(HTML("<style>.container { width:95% !important; }</style>"))
config = {
'commonRootPathName': title, # label for the first root node
'height': 800, # control the visualization height. Default 600
'nodeWidth': 250,
'maxChildrenInNode': 6, # control the number of immediate children to show (and collapse rest into *others* node). Default 5
'linkWidth' : 400, # control the width between pathflow layers. Default 360
'sortByAttribute': 'flowRatio', # control the sorting of the chart. (Options: flowRatio, dropped_offRatio, flows, dropped_off, rerouted)
'title': title,
'mode': "turn-based"
}
jsondata = json.loads(turn_based_path_flows.to_json(orient='records'))
visualization.draw_flowchart(config, jsondata, python_selection_var="selection")
Sometimes you might want to explore a subset of conversations that meet certain criteria, or look at the conversations only from a specific point onwards. The example below shows how to filter conversations that pass through two specific dialog nodes.
For more details about using filters, please refer to section 7.2
# filter the conversations that include escalation
title2="Banking Card Escalated"
filters = filtering.ChainFilter(df_logs_to_analyze).setDescription(title2)
# node with condition on the #Banking-Card_Selection (node_1_1510880732839) and visit the node "Transfer To Live Agent" (node_25_1516679473977)
filters.by_dialog_node_id('node_1_1510880732839')\
.by_dialog_node_id('node_25_1516679473977')
filters.printConversationFilters()
# get a reference to the dataframe. Note: you can get access to intermediate dataframes by calling getDataFrame(index)
filtered_df = filters.getDataFrame()
turn_based_path_flows = analysis.aggregate_flows(filtered_df, mode="turn-based", on_column="turn_label", max_depth=400, trim_reroutes=False)
config = {
'commonRootPathName': title2, 'title': title2,
'height': 800, 'nodeWidth': 250, 'maxChildrenInNode': 6, 'linkWidth' : 400, 'sortByAttribute': 'flowRatio',
'mode': "turn-based"
}
jsondata = json.loads(turn_based_path_flows.to_json(orient='records'))
visualization.draw_flowchart(config, jsondata, python_selection_var="selection")
Use the milestone-based dialog visualization to measure the flow of visits and abandonment between key points of interest (aka "milestones") in your assistant.
The milestone-based visualization requires two extra steps:
Use mode="milestone-based" to configure the flow aggregation and visualization steps. The visualization uses a special Other node to model conversations that flow to other parts of the dialog that were not defined to be of interest in the milestone definitions.
In this notebook we demonstrate how to produce a milestone dialog flow for key points of interest that are part of the Schedule Appointment task of the assistant.
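Conceptually, a milestone view replaces each visited dialog node with the milestone it belongs to, drops nodes that have no milestone, and can optionally collapse consecutive repeats. The sketch below illustrates that idea only; the node ids, the mapping, and the `to_milestone_sequence` helper are hypothetical and are not the toolkit's API.

```python
from itertools import groupby

# Hypothetical mapping from dialog node ids to milestone names
node_to_milestone = {
    "node_a": "Appointment scheduling start",
    "node_b": "Schedule time",
    "node_c": "Schedule time",
}

def to_milestone_sequence(visited_nodes, mapping):
    """Keep only nodes that map to a milestone, then collapse consecutive repeats."""
    labelled = [mapping[node] for node in visited_nodes if node in mapping]
    return [label for label, _ in groupby(labelled)]

visited = ["node_a", "node_x", "node_b", "node_c", "node_y"]
milestone_sequence = to_milestone_sequence(visited, node_to_milestone)
print(milestone_sequence)
```

Here node_b and node_c both map to the same milestone, so they collapse into a single step, while node_x and node_y are ignored because they are not of interest.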
#define the milestones and corresponding node ids for the `Schedule Appointment` task
milestone_analysis = analysis.MilestoneFlowGraph(assistant_skills.get_skill_by_id(skill_id))
milestone_analysis.add_milestones(["Appointment scheduling start", "Schedule time", "Enter zip code", "Branch selection",
"Enter purpose of appointment", "Scheduling completion"])
milestone_analysis.add_node_to_milestone("node_21_1513047983871", "Appointment scheduling start")
milestone_analysis.add_node_to_milestone("handler_28_1513048122602", "Schedule time")
milestone_analysis.add_node_to_milestone("handler_31_1513048234102", "Enter zip code")
milestone_analysis.add_node_to_milestone("node_3_1517200453933", "Branch selection")
milestone_analysis.add_node_to_milestone("node_41_1513049128006", "Enter purpose of appointment")
milestone_analysis.add_node_to_milestone("node_43_1513049260736", "Scheduling completion")
#enrich with milestone information - will add a column called 'milestone'
milestone_analysis.enrich_milestones(df_logs_to_analyze)
#remove all log records without a milestone
df_milestones = df_logs_to_analyze[df_logs_to_analyze["milestone"].notna()]
#optionally, remove consecutive milestones for a more simplified flow visualization representation
df_milestones = analysis.simplify_flow_consecutive_milestones(df_milestones)
# compute the aggregate flows of milestones
computed_flows= analysis.aggregate_flows(df_milestones, mode="milestone-based", on_column="milestone", max_depth=30, trim_reroutes=False)
config = {
'commonRootPathName': 'All Conversations', # label for the first root node
'height': 800, # control the visualization height. Default 600
'maxChildrenInNode': 6, # control the number of immediate children to show (and collapse the rest into *other* node). Default 5
# 'linkWidth' : 400, # control the width between pathflow layers. Default 360
'sortByAttribute': 'flowRatio', # control the sorting of the chart. (Options: flowRatio, dropped_offRatio, flows, dropped_off, rerouted)
'title': "Abandoned Conversations in Appointment Schedule Flow",
'showVisitRatio' : 'fromTotal', # default: 'fromTotal'. 'fromPrevious' will compute percentages from previous step,
'mode': 'milestone-based'
}
jsondata = json.loads(computed_flows.to_json(orient='records'))
visualization.draw_flowchart(config, jsondata, python_selection_var="milestone_selection")
Note: The rest of this notebook demonstrates selection and analysis on selections made in the milestone-based dialog flow (designated by setting the python_selection_var variable to milestone_selection). To select and analyze conversations from the turn-based dialog flow, set the variable to selection instead.
Note: Selecting a node in the visualization also copies the selection into the variable designated by python_selection_var, thus making the selection available to other cells of this notebook.
Before you run the next cell, interact with the milestone dialog visualization above to select a portion of the dialog to analyze. First, observe visit frequencies and abandonments within the milestones of Schedule Appointment. Click on nodes to drill down and expand the next step in sequence. Navigate along this path: Appointment scheduling start --> Schedule time --> Enter zip code --> Branch selection to observe a relatively high proportion of abandonments that occur in the middle of the flow. Select the Branch selection node. Note the large volume and ratio of abandoned conversations. Now run the following cell to process conversations that were abandoned at your point of selection.
#the selection variable contains details about the selected node, and conversations that were abandoned at that point
print("Selected Path: ",milestone_selection["path"])
#fetch the dropped off conversations from the selection
dropped_off_conversations = vis_selection.to_dataframe(milestone_selection)["dropped_off"]
print("The selection contains {} records, with a reference back to the conversation logs".format(str(len(dropped_off_conversations))))
dropped_off_conversations.head()
After selecting a large group of abandoned conversations and their corresponding log records, you can apply additional analyses to better understand why these conversations were lost.
Some possible reasons could be:
This section of the notebook demonstrates two visual components that can help you investigate the user utterances in the abandoned conversations:
Try to navigate to the 3rd conversation (conversation_id == 0Aw68rNq6kSGxaDurGG1NSf3c9LtLK3kurWm) using the toggle buttons, and scroll down to view the user's last utterance before abandoning the conversation (where the user utterance is wrong map). In this conversation, the assistant's response in the previous step wasn't satisfactory, and when the user communicated that to the assistant, the assistant didn't understand the utterance. This may indicate that the dialog logic needs some modification to respond better in this situation, and that the service itself might need to be fixed.
You might want to check if this situation occurs in other conversations too. A complementary approach is to try to find frequent terms in user utterances and see how prevalent this is across all abandoned conversations (see next section for details).
Adding sentiment will allow you to observe negative utterances more quickly in the transcripts. You can generate other types of analysis insights by enriching the insights_tags column.
df_logs_to_analyze = sentiment_analysis.add_sentiment_columns(df_logs_to_analyze)
#create insights, and highlights annotation for the transcript visualization
NEGATIVE_SENTIMENT_THRESHOLD=-0.15
df_logs_to_analyze["insights_tags"] = df_logs_to_analyze.apply(lambda x: ["Negative Sentiment"] if x.sentiment < NEGATIVE_SENTIMENT_THRESHOLD else [], axis=1)
df_logs_to_analyze["highlight"] = df_logs_to_analyze.apply(lambda x: True if x.sentiment < NEGATIVE_SENTIMENT_THRESHOLD else False, axis=1)
# fetch the conversation records
dropped_off_conversations = vis_selection.fetch_logs_by_selection(df_logs_to_analyze, dropped_off_conversations)
# visualize using the transcript visualization
dfc = transcript.to_transcript(dropped_off_conversations)
config = {'debugger': True}
visualization.draw_transcript(config, dfc)
The analysis performs some basic linguistic processing from a group of utterances, such as removal of stop words, or extraction of the base form of words, and then computes their frequencies. Frequencies for words that appear together in sequence (bi-grams, tri-grams) are also computed.
Finally, the visualization displays the most frequent words and phrases.
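The kind of processing described above can be sketched with the standard library alone. This is a simplified illustration rather than the toolkit's method: the `frequent_terms` helper, the toy utterances, and the stop word list are hypothetical, and the sketch does not reduce words to their base form.

```python
import re
from collections import Counter

def frequent_terms(utterances, stop_words, top_n=3):
    """Count single words and adjacent word pairs after removing stop words."""
    unigrams, bigrams = Counter(), Counter()
    for utterance in utterances:
        tokens = [token for token in re.findall(r"[a-z']+", utterance.lower())
                  if token not in stop_words]
        unigrams.update(tokens)
        # pairs are formed after stop word removal, so they may skip removed words
        bigrams.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return unigrams.most_common(top_n), bigrams.most_common(top_n)

toy_utterances = ["wrong map", "the map is wrong", "wrong map again", "ok thanks"]
toy_stop_words = {"the", "is", "ok", "thanks"}
top_unigrams, top_bigrams = frequent_terms(toy_utterances, toy_stop_words)
print(top_unigrams)
print(top_bigrams)
```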
# gather user utterances from the dropped off conversations - last utterances and all utterances
last_utterances_abandoned=vis_selection.get_last_utterances_from_selection(milestone_selection, df_logs_to_analyze)
all_utterances_abandoned=vis_selection.get_all_utterances_from_selection(milestone_selection, df_logs_to_analyze)
Analyze the last user utterances prior to abandonment to potentially identify common issues at that point.
# analyze the last user input before abandonment
num_unigrams=10
num_bigrams=15
custom_stop_words=["would","pm","ok","yes","no","thank","thanks","hi","i","you"]
data = keyword_analysis.get_frequent_words_bigrams(last_utterances_abandoned, num_unigrams,num_bigrams,custom_stop_words)
config = {'flattened': True, 'width' : 800, 'height' : 500}
visualization.draw_wordpackchart(config, data)
Note: in the visual above, the term wrong map appears quite often. Other relevant keywords and phrases such as error, map error, wrong location, and wrong branches are also observed.
A conversation can be viewed and measured as being composed of one or more logical tasks (aka "high level flows"). This section of the notebook demonstrates measuring transactional tasks (aka "flows") by defining their corresponding starting (parent) and successful ending dialog nodes.
You can define and measure a task by providing a mapping to dialog nodes that correspond to the start and successful end of a task.
You can use the programmatic and interactive search options shown in section 7.1 to locate and copy the corresponding dialog node ids, and use them in the flows definition as shown below.
The example below shows how to define the Credit card payment and Schedule appointments tasks. In this example, the starting point was mapped to the node that has a condition on the corresponding intent, and the completion nodes to the nodes that generate the confirmation response.
Note:
Parent and completion nodes can be defined using one or more nodes. Defining multiple parent nodes is useful if a single logical flow is implemented across different branches of the dialog tree. Defining multiple completion nodes can be relevant if you have more than one location in the dialog that can determine successful ending of the flow.
# a flow is defined by a name, one or more "starting/parent_nodes" and one or more "success/completion nodes".
# All the nodes which are descendants to the parent nodes are considered to be part of the flow
# A flow is considered successful if it reaches a completion node
flow_defs_initial = {
'flows': [{
'name': 'Credit card payment',
'parent_nodes': ['node_3_1511326054650'], #condition on #Banking-Billing_Payment_Enquiry || #Banking-Billing_Making_Payments
'completion_nodes': ['node_8_1512531332315'] # Display of confirmation "Thank you for your payment..."
},
{
'name': 'Schedule appointment',
'parent_nodes': ['node_21_1513047983871'], #condition on '#Business_Information-Make_Appointment'
'completion_nodes': ['node_43_1513049260736'] #Display Appointment Confirmation
}]
}
#create a list of all the nodes that map to a flow including descendant nodes
flow_defs = flows.enrich_flows_by_workspace(flow_defs_initial, workspace)
Using the task's flow definition and enrichment of the logs, we can now measure the visits in each flow, and the completion percentages.
The Abandoned state refers to conversations that terminated in the middle of the flow. Rerouted refers to conversations that left the scope of the flow and didn't return. Completed refers to conversations that successfully reached the completion point.
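The three states can be read as a simple decision rule over the nodes a conversation visited. The sketch below is one possible interpretation under stated assumptions, not the toolkit's logic: the `flow_state` helper and the toy node names are hypothetical, and a conversation that never enters the flow gets no state.

```python
def flow_state(visited_nodes, flow_nodes, completion_nodes):
    """Classify one conversation with respect to one flow."""
    if not any(node in flow_nodes for node in visited_nodes):
        return None  # the conversation never entered the flow
    if any(node in completion_nodes for node in visited_nodes):
        return "Completed"
    if visited_nodes[-1] in flow_nodes:
        return "Abandoned"  # ended while still inside the flow
    return "Rerouted"  # left the flow and did not come back

# Hypothetical flow: all nodes of the flow, and its completion node
toy_flow_nodes = {"start", "time", "zip", "done"}
toy_completion_nodes = {"done"}

states = [
    flow_state(["welcome", "start", "time", "done"], toy_flow_nodes, toy_completion_nodes),
    flow_state(["welcome", "start", "time"], toy_flow_nodes, toy_completion_nodes),
    flow_state(["welcome", "start", "faq"], toy_flow_nodes, toy_completion_nodes),
    flow_state(["welcome", "faq"], toy_flow_nodes, toy_completion_nodes),
]
print(states)
```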
#enrich the logs dataframe with additional columns ["flow", "flow_state"] that represent the state of the flow
df_logs_to_analyze = flows.enrich_canonical_by_flows(df_logs_to_analyze, flow_defs)
flow_outcome_summary = flows.count_flows(df_logs_to_analyze, flow_defs)
print(flow_outcome_summary)
flows.plot_flow_outcomes(flow_outcome_summary)
Note: as shown in above figure, the Schedule Appointment task has a relatively large volume of conversations with poor effectiveness (as 65% of conversations are abandoned). As a next step you might want to drill down to understand in more detail where exactly in the dialog logic the conversations were abandoned and why.
In order to measure the performance of specific tasks, or understand user journeys between specific points of the dialog, you will need to be able to reference dialog nodes by their unique node_id. This section demonstrates how to find the node_id of nodes of interest in your dialog using two complementary techniques: a programmatic API, and an interactive visual component.
The WA_Assistant_Skills class provides utility functions for searching dialog nodes in your assistant or in a specific skill.
The re_search_in_dialog_nodes() function supports a case-insensitive, regular expression-based search. You can search for strings that appear in the node's title, condition, or id.
Sample usage of the API:
re_search_in_dialog_nodes(search_term) - search in all fields, in all skills
re_search_in_dialog_nodes(search_term, keys=['condition'], in_skill=skill_id) - search only in condition fields of nodes in a specific skill
Examples of search terms:
"card" - search for a word
"@CC_Types" - search for an entity
"#General_Conversation-VA_State" - search for an intent
'#.*banking.*card' - search for an intent that includes banking and card
# example of searching for all occurrences of the word 'Card'
search_term='Card'
results = assistant_skills.re_search_in_dialog_nodes(search_term)
results.head(5)
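To show what a case-insensitive, regular expression-based search over dialog nodes amounts to, here is a small stand-alone sketch. It does not use the toolkit: the `search_dialog_nodes` helper, the toy nodes, and the assumption that each node is a dictionary with `dialog_node`, `title`, and `conditions` keys are illustrative only.

```python
import re

def search_dialog_nodes(nodes, search_term, keys=("title", "conditions", "dialog_node")):
    """Return the nodes where any of the given fields matches the pattern."""
    pattern = re.compile(search_term, re.IGNORECASE)
    return [node for node in nodes
            if any(pattern.search(str(node.get(key) or "")) for key in keys)]

# Toy dialog nodes (hypothetical content)
toy_nodes = [
    {"dialog_node": "node_1", "title": "Card selection", "conditions": "#Banking-Card_Selection"},
    {"dialog_node": "node_2", "title": "Greeting", "conditions": "welcome"},
    {"dialog_node": "node_3", "title": "Lost card", "conditions": "@CC_Types"},
]
intent_matches = search_dialog_nodes(toy_nodes, "#.*banking.*card")
word_matches = search_dialog_nodes(toy_nodes, "card")
print([node["dialog_node"] for node in intent_matches])
print([node["dialog_node"] for node in word_matches])
```

Restricting `keys` to a single field narrows the search in the same spirit as the keys argument shown in the sample usage above.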
You can use the draw_wa_dialog_chart() function to visualize the dialog nodes of a specific skill in the same tree layout as in the Watson Assistant Dialog Editor. You can interact with the visualization to navigate to, or search for, a particular node, from which you can copy its node_id.
workspace = assistant_skills.get_skill_by_id(skill_id)
data = {
'workspace': workspace
}
config = {}
visualization.draw_wa_dialog_chart(config, data)
You can use a built-in filter to narrow down your log records and focus on specific conversations or journey steps. There are two types of filters:
by_dialog_node_id, by_turn_label, by_date_range, by_dialog_node_str
trim_from_node_id, trim_from_turn_label
You can create a chain of filters to work in sequence to narrow down the log records for specific exploration activities.
Below is an example of a chained filter that finds conversations that pass through the 'Collect Appointment Data' node during Jan 2020.
import datetime
import pytz
filters = filtering.ChainFilter(df_logs_to_analyze).setDescription("Filter: collect Appointment Data during Jan 2020")
filters.by_dialog_node_id('node_22_1513048049461') # corresponding to 'Collect Appointment Data' node.
# You can use the search utilities described earlier in the notebook to find this node
# You can also use filters.by_turn_label('Collect Appointment Data') to filter on information in the turn label
start_date = datetime.datetime(2020, 1, 1, 0, 0, 0, 0, pytz.UTC)
end_date = datetime.datetime(2020, 1, 31, 0, 0, 0, 0, pytz.UTC)
filters.by_date_range(start_date,end_date)
filters.printConversationFilters()
# get a reference to the dataframe. Note: you can get access to intermediate dataframes by calling getDataFrame(index)
filtered_df = filters.getDataFrame()
print("number of unique conversations in filtered dataframe: {}".format(len(filtered_df["conversation_id"].unique())))
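The chaining pattern itself is simple: each filter narrows the result of the previous one and returns the chain, and intermediate results stay available by index. The sketch below illustrates the pattern on plain dictionaries; the `SimpleChainFilter` class and its method names are hypothetical and much simpler than the toolkit's ChainFilter.

```python
class SimpleChainFilter:
    """Keep every intermediate result so each stage can be inspected later."""

    def __init__(self, conversations):
        self._stages = [("all conversations", dict(conversations))]

    def by_node(self, node_id):
        """Keep only conversations that visited the given node."""
        current = self._stages[-1][1]
        kept = {cid: nodes for cid, nodes in current.items() if node_id in nodes}
        self._stages.append(("by_node " + node_id, kept))
        return self  # returning self is what allows chaining

    def result(self, index=-1):
        return self._stages[index][1]

    def describe(self):
        return [(name, len(kept)) for name, kept in self._stages]

# Toy data: conversation id -> visited node ids
toy_logs = {"c1": ["n1", "n2", "n3"], "c2": ["n1", "n3"], "c3": ["n2"]}
chain = SimpleChainFilter(toy_logs).by_node("n1").by_node("n2")
print(chain.describe())
print(sorted(chain.result()))
```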
Sometimes looking at the last utterances of the abandoned conversations is not enough to find the root cause of a problem. A more advanced approach is to also look at the conversations that successfully completed the flow, and compare which keywords and phrases are statistically more associated with the abandoned group. The comparison covers all the utterances of each conversation, not only the last utterance before the drop-off point.
#get the logs of conversations that continue to successful completion
scheduling_completed_filter = filtering.ChainFilter(df_logs_to_analyze).setDescription("Appointment Scheduling flow - Completed")
scheduling_completed_filter.by_dialog_node_id('node_21_1513047983871') # started the Appointment Scheduling flow
scheduling_completed_filter.by_dialog_node_id('node_3_1517200453933') # passed through the "Branch selection" node
scheduling_completed_filter.by_dialog_node_id('node_43_1513049260736') # reached the completion node of Scheduling Appointment flow
scheduling_completed_filter.printConversationFilters()
#get the user utterances
scheduling_completed_df = scheduling_completed_filter.getDataFrame()
all_utterances_completed=scheduling_completed_df[scheduling_completed_df.request_text!=""].request_text.tolist()
print("Gathered {} utterances from {} successful journeys".format(str(len(all_utterances_completed)),
str(len(scheduling_completed_df["conversation_id"].unique()))))
Which keywords/phrases are statistically more associated with all the utterances in abandoned conversations than with those in completed ones?
num_keywords=25
custom_stop_words=["would","pm","ok","yes","no","thank","thanks","hi","i","you"]
data = keyword_analysis.get_data_for_comparison_visual(all_utterances_abandoned, all_utterances_completed, num_keywords,custom_stop_words)
config = {'debugger': True, 'flattened': True, 'width' : 800, 'height' : 600}
visualization.draw_wordpackchart(config, data)
Note: as shown above, when doing an outcome-driven analysis, only terms that are statistically associated with the dropped-off conversations are highlighted, for example next, day, and day tomorrow.
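One simple way to score how strongly a word is associated with one group of utterances rather than another is the log ratio of its smoothed relative frequencies. The sketch below uses that swapped-in technique purely for illustration; it is not a significance test and is not necessarily what the toolkit computes. The `distinctive_terms` helper and the toy utterances are hypothetical.

```python
import math
from collections import Counter

def distinctive_terms(group_a, group_b, top_n=3):
    """Rank words by log(rate in group_a / rate in group_b) with add-one smoothing."""
    counts_a = Counter(word for text in group_a for word in text.lower().split())
    counts_b = Counter(word for text in group_b for word in text.lower().split())
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    vocabulary = set(counts_a) | set(counts_b)
    scores = {}
    for word in vocabulary:
        # add-one smoothing keeps words unseen in one group from dividing by zero
        rate_a = (counts_a[word] + 1) / (total_a + len(vocabulary))
        rate_b = (counts_b[word] + 1) / (total_b + len(vocabulary))
        scores[word] = math.log(rate_a / rate_b)
    # highest score first; ties broken alphabetically for a stable order
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]

toy_abandoned = ["next day please", "tomorrow next day", "next week"]
toy_completed = ["monday at noon", "friday morning please"]
top_terms = distinctive_terms(toy_abandoned, toy_completed)
print(top_terms)
```

Positive scores mark words that are relatively more frequent in the first group; words that are common in both groups score close to zero and drop out of the top of the ranking.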
The analysis presented in this notebook can help you measure the effectiveness of specific tasks/flows within the dialog flows of your assistant skills. The visual components can be used to find large groups of conversations with potentially common issues to improve. The flow analysis can help you discover existing journeys, and focus on specific journey points where many conversations are lost. The transcript and visual keywords/phrases analysis helps you explore those conversations to a greater depth and detect potential issues.
We suggest the following possible next steps:
For more information, please check Watson Assistant Continuous Improvement Best Practices.
#to export all user utterances in the dropoff point of flow visualization selection
project.save_data("abandoned-user-utterances.csv",
dropped_off_conversations[dropped_off_conversations["request_text"] != ""].to_csv(columns=["request_text"],
index=False, header=False))
#to export the conversation ids of conversations abandoned at the dropoff point of the flow visualization selection
project.save_data("abandoned-conversation-ids.csv",
dropped_off_conversations.to_csv(columns=["conversation_id"], index=False,header=False))
#to export all columns of the canonical model for abandoned conversations
project.save_data("abandoned-logs.csv", dropped_off_conversations.to_csv(index=False))
#to export specific conversation, e.g. 00KjvlWcGozRTcSYTrlGqj4JYtYH5gjbvw3j
conversation_id_to_export = '00KjvlWcGozRTcSYTrlGqj4JYtYH5gjbvw3j'
project.save_data(conversation_id_to_export + ".csv",
df_logs_to_analyze[df_logs_to_analyze["conversation_id"] == conversation_id_to_export].to_csv(index=False))
#to export user utterances for intent training with Watson Recommends
from conversation_analytics_toolkit import export
sentences = dropped_off_conversations[dropped_off_conversations["request_text"] != ""].reset_index()
sentences = sentences[["request_text"]]
sentences.columns = ['example']
filtered_sentences = export.filter_sentences(sentences, min_complexity = 3, max_complexity = 20)
df_sentences = pd.DataFrame(data={"training_examples": filtered_sentences})
project.save_data("utterances-for-Watson-Intent-Recommendations.csv",
df_sentences.to_csv(sep=',',index=False, header=False))
Avi Yaeli is a Research Staff Member at IBM Research AI organization, where he develops algorithms and tools for discovering insights in complex data. His research interests include data mining, information visualization, natural language processing, and cloud computing. Avi has published more than 30 papers and patents.
Sergey Zeltyn, Ph.D. in Statistics, is a Data Scientist in IBM Research - Haifa. His research agenda includes text analytics, machine learning, forecasting and operations research. Sergey has broad experience in data analysis, model development and implementation. His research has been published at top operations research and statistical journals.
The authors would like to thank the following members of IBM Watson, Research and Services for their contributions and feedback of the underlying technology and notebook: Eric Wayne, Zhe Zhang, Kyle Croutwater, David Boaz, Kalyan Dutia, Erika Agostinelli, and Tony Hickman.
Copyright © 2020 IBM. This notebook and its source code are released under the terms of the MIT License.