This notebook demonstrates how to analyze financial customer complaints using Watson NLP.
The data that is used in this notebook is taken from the Consumer Complaint Database that is published by the Consumer Financial Protection Bureau (CFPB), a U.S. government agency. The Consumer Complaint Database is a collection of complaints about consumer financial products and services that the CFPB sent to companies for response. A complaint contains a consumer’s narrative description of their experience if the consumer opts to share this information publicly and after the CFPB has taken steps to remove all personal information. In this notebook, you will focus on complaints that contain narrative descriptions to show how to use Watson NLP.
The data is publicly available at https://www.consumerfinance.gov/data-research/consumer-complaints/.
Watson NLP offers so-called blocks for various NLP tasks. This notebook shows:
- Tone classification with the tone workflow model (ensemble_classification-workflow_en_tone-stock). This workflow model classifies the tone of a document as excited, frustrated, sad, polite, impolite, satisfied, or sympathetic.
- Emotion classification with the emotion workflow model (ensemble_classification-workflow_en_emotion-stock). This workflow model classifies the emotion of a document as anger, disgust, fear, joy, or sadness.

You can step through the notebook cell by cell by pressing Shift-Enter, or you can run the entire notebook by selecting Cell -> Run All from the menu.
Note: If other notebooks are currently running in the same NLP runtime environment, stop their kernels before running this notebook. All of these notebooks share the same runtime environment, and running them in parallel can cause memory issues. To stop the kernel of another notebook, open that notebook and select File > Stop Kernel.
Begin by importing and initializing some helper libraries that are used throughout the notebook.
import os
import pandas as pd
# we want to show large text snippets to be able to explore the relevant text
pd.options.display.max_colwidth = 400
import watson_nlp
The data can be downloaded via an API from https://www.consumerfinance.gov/data-research/consumer-complaints/. For this notebook, you download the complaints received during one month and keep only those that contain a consumer narrative. The data is exported in CSV format. The URL to retrieve this data is:
url = "https://www.consumerfinance.gov/data-research/consumer-complaints/search/api/v1/?date_received_max=2021-03-30&date_received_min=2021-02-28&field=all&format=csv&has_narrative=true&no_aggs=true&size=18102"
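The query string above can also be assembled from a dictionary, which makes the date range and the other filters easier to change. The following is a minimal sketch that uses only Python's standard library and the parameter names that already appear in the URL; the variable names `base_url` and `params` are illustrative and not part of the original notebook.

```python
from urllib.parse import urlencode

# Endpoint and parameters copied from the URL above
base_url = "https://www.consumerfinance.gov/data-research/consumer-complaints/search/api/v1/"
params = {
    "date_received_max": "2021-03-30",
    "date_received_min": "2021-02-28",
    "field": "all",
    "format": "csv",
    "has_narrative": "true",  # only complaints that include narrative text
    "no_aggs": "true",
    "size": 18102,
}

# urlencode joins the key/value pairs with '&' in insertion order
url = base_url + "?" + urlencode(params)
print(url)
```

Changing the two date values is then enough to request a different month.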
Read the data into a dataframe.
You can find a detailed explanation of the available columns here: https://www.consumerfinance.gov/complaint/data-use/#:~:text=Types%20of%20complaint%20data%20we%20publish .
In the analysis, you will focus on the Product column and on the Consumer complaint narrative column, which holds the complaint text.
df_all = pd.read_csv(url)
text_col = 'Consumer complaint narrative'
# In this example, we take only the first 1000 complaints in the dataset for further analysis.
# Set df to df_all to run on the complete dataset.
df_small = df_all.head(1000)
df = df_small
df.head(3)
| | Date received | Product | Sub-product | Issue | Sub-issue | Consumer complaint narrative | Company public response | Company | State | ZIP code | Tags | Consumer consent provided? | Submitted via | Date sent to company | Company response to consumer | Timely response? | Consumer disputed? | Complaint ID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 03/10/21 | Credit reporting, credit repair services, or other personal consumer reports | Credit reporting | Incorrect information on your report | Account status incorrect | In XX/XX/XXXX I moved with a current XXXX account and service ( so, I thought ) I transferred my account to the new apartment, in fact I went into the store and had no problem not at any time was I told I had a past due balance from my old apartment and all of my mail was forwarded. My account and service set up was seamless.. Skip to late XXXX of XXXX and I received a phone call from Sequium ... | Company believes it acted appropriately as authorized by contract or law | Sequium Asset Solutions, LLC | CO | 80112 | None | Consent provided | Web | 03/10/21 | Closed with explanation | Yes | NaN | 4201725 |
1 | 03/10/21 | Credit reporting, credit repair services, or other personal consumer reports | Credit reporting | Improper use of your report | Credit inquiries on your report that you don't recognize | Upon reviewing my Equifax Credit Report I noticed a hard inquiry for XXXX XXXX XXXX XXXX XXXX which I did not authorize or was aware of. \n\nInquiry Date : XX/XX/2020 Company XXXX XXXX XXXX, | None | EQUIFAX, INC. | CA | 90064 | None | Consent provided | Web | 03/10/21 | Closed with explanation | Yes | NaN | 4201710 |
2 | 03/10/21 | Debt collection | Credit card debt | Written notification about debt | Didn't receive enough information to verify debt | XXXX XXXX XXXX XXXX XXXX XXXX, NY XXXX Social Security # XXXX DOB : XX/XX/1955 XXXX XXXX XXXXXXXX XXXX XXXX XXXX XXXX, XXXX, Texas XXXX XXXX XXXX XXXX XXXX, XXXX XXXXXXXX XXXX XXXX, XXXX, GA XXXX XXXX XXXX XXXX, XXXX XXXX. XXXX XXXX, XXXX, PA XXXX DISCLOSURE : THIS IS NOT AN IDENTITY THEFT DISPUTE, PLEASE REFRAIN FROM TAKING ANY POSITION OF IDENTITY THEFT EITHER WITH ANY CREDIT REPORTING A... | None | PORTFOLIO RECOVERY ASSOCIATES INC | NY | 11550 | None | Consent provided | Web | 03/10/21 | Closed with explanation | Yes | NaN | 4200781 |
You can look at all products that are available in the data set to do further analysis around these product groups.
df['Product'].value_counts().sort_values().plot(kind='barh')
The tone classification model predicts the most prevalent tones of a document text. Available tones are excited, frustrated, sad, polite, impolite, satisfied, and sympathetic. Each tone is assigned a confidence score, so you can either use the highest-rated tone or assign several tones to a document, for example by taking all tones whose confidence exceeds a certain threshold.
In customer complaints, you would expect the tone to be sad or frustrated. Let's see if the analysis confirms this assumption.
Start by loading the tone workflow model for English:
tone_model = watson_nlp.load('ensemble_classification-workflow_en_tone-stock')
Create a helper function to run the tone analysis on a single complaint. It returns all tones whose confidence is higher than 0.14, which is approximately 1/7 (one divided by the number of available tones).
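def classify_tone(complaint_text):
    # run the tone model
    tone_result = tone_model.run(complaint_text)
    # convert each predicted class to a dict with 'class_name' and 'confidence'
    tone_classes = [c.to_dict() for c in tone_result.classes]
    # keep only the tones whose confidence exceeds the threshold
    tone_conf = [c['class_name'] for c in tone_classes if c['confidence'] > 0.14]
    return tone_conf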
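To see the effect of the threshold without loading the model, you can apply the same filter to a hand-written list. The confidences below are made up for illustration, and `tones_above` is a hypothetical helper; only the dictionary keys `class_name` and `confidence` are taken from the function above.

```python
# Made-up confidences in the same shape as the dicts produced by to_dict() above
tone_classes = [
    {"class_name": "frustrated", "confidence": 0.61},
    {"class_name": "sad", "confidence": 0.22},
    {"class_name": "polite", "confidence": 0.15},
    {"class_name": "impolite", "confidence": 0.02},
]

def tones_above(classes, threshold=0.14):
    """Return the names of all classes whose confidence exceeds the threshold."""
    return [c["class_name"] for c in classes if c["confidence"] > threshold]

print(tones_above(tone_classes))       # threshold used in this notebook
print(tones_above(tone_classes, 0.5))  # a stricter, hypothetical threshold
```

A lower threshold assigns more tones per complaint, while a higher one approaches the single highest-rated tone.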
Run the tone classification on the dataframe and show the tones with the product and the complaint text. Note: This cell will run for several minutes.
For better progress feedback, the cell uses progress_apply from the tqdm library. You can also use apply directly, i.e. df[text_col].apply(..).
from tqdm.notebook import tqdm
tqdm.pandas(colour='green')
# run tone classification and create a dataframe holding the tones
tone = df[text_col].progress_apply(lambda text: classify_tone(text))
tone_df = pd.DataFrame(tone)
tone_df.rename(inplace=True, columns={text_col:'Tones'})
# combine with our complaint dataframe
text_tone_df = df[["Product", text_col]].merge(tone_df, how='left', left_index=True, right_index=True)
text_tone_df.head()
| | Product | Consumer complaint narrative | Tones |
---|---|---|---|
0 | Credit reporting, credit repair services, or other personal consumer reports | In XX/XX/XXXX I moved with a current XXXX account and service ( so, I thought ) I transferred my account to the new apartment, in fact I went into the store and had no problem not at any time was I told I had a past due balance from my old apartment and all of my mail was forwarded. My account and service set up was seamless.. Skip to late XXXX of XXXX and I received a phone call from Sequium ... | [sad, polite, frustrated] |
1 | Credit reporting, credit repair services, or other personal consumer reports | Upon reviewing my Equifax Credit Report I noticed a hard inquiry for XXXX XXXX XXXX XXXX XXXX which I did not authorize or was aware of. \n\nInquiry Date : XX/XX/2020 Company XXXX XXXX XXXX, | [polite] |
2 | Debt collection | XXXX XXXX XXXX XXXX XXXX XXXX, NY XXXX Social Security # XXXX DOB : XX/XX/1955 XXXX XXXX XXXXXXXX XXXX XXXX XXXX XXXX, XXXX, Texas XXXX XXXX XXXX XXXX XXXX, XXXX XXXXXXXX XXXX XXXX, XXXX, GA XXXX XXXX XXXX XXXX, XXXX XXXX. XXXX XXXX, XXXX, PA XXXX DISCLOSURE : THIS IS NOT AN IDENTITY THEFT DISPUTE, PLEASE REFRAIN FROM TAKING ANY POSITION OF IDENTITY THEFT EITHER WITH ANY CREDIT REPORTING A... | [sad, polite] |
3 | Debt collection | XXXX XXXX XXXX XXXX XXXX XXXX, NY XXXX Social Security # XXXX DOB : XX/XX/XXXX XXXX XXXX XXXX, P. O. Box XXXX, XXXX, Texas XXXX XXXX XXXX XXXX XXXX, XXXX XXXX. Box XXXX, XXXX, GA XXXX XXXX XXXX XXXX, P. O. Box XXXX, XXXX, PA XXXX DISCLOSURE : THIS IS NOT AN IDENTITY THEFT DISPUTE, PLEASE REFRAIN FROM TAKING ANY POSITION OF IDENTITY THEFT EITHER WITH ANY CREDIT REPORTING AGENCY OR ANY SUBSCRIBE... | [frustrated] |
4 | Debt collection | failed to validate and delete inaccurate information on my credit report after months of me disputing the inaccurate Information | [sad, frustrated] |
Use the explode function to transform the tones list into a separate row for each tone. That way, you can count the occurrences of each tone in a subsequent step.
exp_tones = text_tone_df.explode('Tones')
# Count tone occurrences and use the relative frequency. unstack() creates a column for each tone.
unstacked = exp_tones.groupby('Product')['Tones'].value_counts(normalize=True).unstack()
# Plot a horizontal bar chart
unstacked.plot.barh(stacked=True).legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
As expected, most complaints are classified as sad or frustrated, but still using a polite tone. There is no strong indicator that some products have a higher frustration rate than others.
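The explode and counting steps used for the tone chart can be tried on a few rows of toy data to see what each call produces. The following is a small sketch with made-up products and tones; the names `toy`, `exploded`, and `shares` are illustrative.

```python
import pandas as pd

# Toy stand-in for text_tone_df: one list of tones per complaint
toy = pd.DataFrame({
    "Product": ["Debt collection", "Debt collection", "Mortgage"],
    "Tones": [["sad", "polite"], ["frustrated"], ["sad", "frustrated"]],
})

# explode() repeats each row once per element of its list
exploded = toy.explode("Tones")

# Relative frequency of each tone within a product, one column per tone
shares = (
    exploded.groupby("Product")["Tones"]
    .value_counts(normalize=True)
    .unstack()
)
print(shares)
```

A tone that never occurs for a product shows up as NaN in the unstacked frame. Because the frequencies are normalized per product, each row sums to 1 regardless of how many complaints the product has.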
The emotion classification model classifies the emotion of a document text. Available emotions are anger, disgust, fear, joy and sadness. As for tones, each emotion is assigned a confidence score. This time you will concentrate on the emotion with the highest confidence score.
You would expect anger and sadness to be the most prevalent emotions in the complaint data set.
Start by loading the emotion workflow model for English:
emotion_model = watson_nlp.load('ensemble_classification-workflow_en_emotion-stock')
Again, use a helper function to run the model on a single complaint. The classes are ordered by confidence score, so you can take the first class as the prevalent emotion.
def classify_emotion(complaint_text):
    # run the emotion model
    emotion_result = emotion_model.run(complaint_text)
    # get the first emotion as the one with the highest confidence
    top_emotion = emotion_result.classes[0].to_dict()['class_name']
    return top_emotion
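The helper above relies on the classes being sorted by confidence. If you prefer not to depend on that ordering, you can pick the maximum explicitly. The following is a minimal sketch on made-up dictionaries shaped like the output of to_dict(); `top_class` is a hypothetical helper, not part of Watson NLP.

```python
# Made-up confidences; deliberately not sorted
emotion_classes = [
    {"class_name": "anger", "confidence": 0.31},
    {"class_name": "sadness", "confidence": 0.52},
    {"class_name": "fear", "confidence": 0.09},
]

def top_class(classes):
    """Return the class name with the highest confidence, regardless of order."""
    return max(classes, key=lambda c: c["confidence"])["class_name"]

print(top_class(emotion_classes))
```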
Run the emotion classification on the dataframe and show the highest ranked emotion with the product and the complaint text. Note: This cell will run for several minutes.
For better progress feedback, the cell uses progress_apply from the tqdm library. You can also use apply directly, i.e. df[text_col].apply(..).
# run emotion classification and create a dataframe holding the results
emotion = df[text_col].progress_apply(lambda text: classify_emotion(text))
emotion_df = pd.DataFrame(emotion)
emotion_df.rename(inplace=True, columns={text_col:'Emotion'})
# combine with our complaint dataframe
text_emotion_df = df[["Product", text_col]].merge(emotion_df, how='left', left_index=True, right_index=True)
text_emotion_df.head(3)
| | Product | Consumer complaint narrative | Emotion |
---|---|---|---|
0 | Credit reporting, credit repair services, or other personal consumer reports | In XX/XX/XXXX I moved with a current XXXX account and service ( so, I thought ) I transferred my account to the new apartment, in fact I went into the store and had no problem not at any time was I told I had a past due balance from my old apartment and all of my mail was forwarded. My account and service set up was seamless.. Skip to late XXXX of XXXX and I received a phone call from Sequium ... | sadness |
1 | Credit reporting, credit repair services, or other personal consumer reports | Upon reviewing my Equifax Credit Report I noticed a hard inquiry for XXXX XXXX XXXX XXXX XXXX which I did not authorize or was aware of. \n\nInquiry Date : XX/XX/2020 Company XXXX XXXX XXXX, | sadness |
2 | Debt collection | XXXX XXXX XXXX XXXX XXXX XXXX, NY XXXX Social Security # XXXX DOB : XX/XX/1955 XXXX XXXX XXXXXXXX XXXX XXXX XXXX XXXX, XXXX, Texas XXXX XXXX XXXX XXXX XXXX, XXXX XXXXXXXX XXXX XXXX, XXXX, GA XXXX XXXX XXXX XXXX, XXXX XXXX. XXXX XXXX, XXXX, PA XXXX DISCLOSURE : THIS IS NOT AN IDENTITY THEFT DISPUTE, PLEASE REFRAIN FROM TAKING ANY POSITION OF IDENTITY THEFT EITHER WITH ANY CREDIT REPORTING A... | sadness |
unstacked = text_emotion_df.groupby('Product')['Emotion'].value_counts(normalize=True).unstack()
unstacked.plot.barh(stacked=True).legend(loc='center left',bbox_to_anchor=(1.0, 0.5))
As expected, the most prevalent emotions in the complaints are sadness and anger. In contrast to the tone classification, only the emotion with the highest confidence score was kept here, rather than every emotion whose score exceeds a certain threshold. Sadness seems to be the 'stronger' emotion overall, with higher confidences than anger. Companies might want to look more closely at products or complaints classified as anger, because the customers who filed those complaints are likely to be the most upset.
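The cells in this notebook keep only the name of the top emotion, so the comparison of confidences between sadness and anger is not computed explicitly. One way to check it would be to keep the top confidence as well and average it per emotion. The following sketch does this on made-up pairs; with the real model, the values would come from the first entry of the classes list, and the names `top_emotions` and `mean_confidence` are illustrative.

```python
from collections import defaultdict
from statistics import mean

# Made-up (top emotion, confidence) pairs, one per complaint
top_emotions = [
    ("sadness", 0.8),
    ("sadness", 0.6),
    ("anger", 0.5),
    ("anger", 0.4),
    ("sadness", 0.7),
]

# Collect the confidences of each emotion
by_emotion = defaultdict(list)
for emotion, confidence in top_emotions:
    by_emotion[emotion].append(confidence)

# Average confidence per emotion
mean_confidence = {e: mean(v) for e, v in by_emotion.items()}
print(mean_confidence)
```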
This notebook showed you how to use the Watson NLP library and how quickly and easily you can get started by running the pretrained workflow models for tone and emotion classification.
Copyright © 2021 IBM. This notebook and its source code are released under the terms of the MIT License.