When dealing with large collections of text representing people's opinions, such as product reviews, survey responses, customer feedback, or social media posts, understanding the key issues within the data can be challenging. Manually reviewing thousands of comments is time-consuming and cost-prohibitive, while existing automated approaches are typically limited to identifying recurring phrases or concepts and gauging overall sentiment, and often fail to provide fine-grained, actionable insights. Key Point Summarization maps the input texts to a set of automatically extracted short sentences and phrases, termed Key Points, which provide a concise plain-text summary of the data. The prevalence of each key point is quantified as the number of its matching sentences.
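To make this concrete, here is a purely illustrative sketch of the kind of output KPS produces. The key points and counts below are made up for illustration, not real tool output:

# Purely illustrative: hypothetical key points with their number of matching sentences.
key_point_summary = {
    "Improve public transportation": 68,
    "Housing is becoming unaffordable": 54,
    "Fix the roads and reduce traffic": 41,
}
# Key points are ranked by prevalence, i.e., how many sentences matched each one.
for kp, n_matches in sorted(key_point_summary.items(), key=lambda kv: -kv[1]):
    print(f"{kp} ({n_matches} matching sentences)")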
In this tutorial, you will gain hands-on experience using Key Point Summarization (KPS) to analyze and derive insights from free-text feedback. The data we will use is a community survey conducted in the city of Austin. In this survey, the citizens of Austin were asked "If there was ONE thing you could share with the Mayor regarding the City of Austin (any comment, suggestion, etc.), what would it be?".
KPS utilizes fine-tuned language models for data analysis. Therefore, an environment based on Runtime 24.1 with a GPU is required to run this tutorial effectively.
The goal of this notebook is to demonstrate how Key Point Summarization can be used to extract meaningful insights from reviews, surveys, and customer feedback.
This notebook contains the following parts:
1. Starting the KPS backend and connecting a client
2. Loading the survey data (or your own data)
3. Running the analysis
4. Viewing and exporting the results
5. Stopping the backend
The first thing we need to do is start the KPS backend. This service runs in the background and performs the analysis.
from keypoint_matching.BackendRunnerWatsonStudio import BackendRunnerWatsonStudio
runner = BackendRunnerWatsonStudio()
runner.start_backend()
Now we can create a client that connects to the backend and uses it.
from key_points_summarization_api.api.clients.keypoints_client import KpsClientWatsonStudio
client = KpsClientWatsonStudio()
Let's run run_self_check to make sure everything is configured correctly and working. It outputs {'status': 'UP'} when all is well, or {'status': 'DOWN'} if a problem is detected. The output also indicates whether GPUs are being used.
client.run_self_check()
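If you want the notebook to fail fast when the service is not healthy, you can check the status explicitly. This is a minimal sketch, assuming run_self_check returns the status dictionary described above:

# Assumption: run_self_check returns the {'status': ...} dictionary described above.
status = client.run_self_check()
assert status.get('status') == 'UP', f"KPS backend is not ready: {status}"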
Next, we read the Austin survey dataset from the dataset_austin.csv file.
import pandas as pd
def get_comments_texts():
    url = "https://raw.githubusercontent.com/IBMDataScience/sample-notebooks/master/Files/Data/dataset_austin.csv"
    df = pd.read_csv(url)
    # Convert the 'text' column to a list of strings
    texts = [str(text) for text in df['text'].tolist()]
    # Skip overly long comments
    texts = [text for text in texts if len(text) < 3000]
    return texts
We load the data into a list of strings and limit the number of comments for quick analysis:
texts = get_comments_texts()
print(f'There are {len(texts)} comments in the dataset')
limit_comments = 500
texts = texts[:limit_comments]
print(f'Analysing {len(texts)} comments')
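Taking the first 500 comments is the simplest choice, but if the file happens to be ordered (for example, by submission date), a random sample may be more representative. An optional alternative using only the standard library:

import random

# Optional: sample comments at random instead of taking the first N,
# so the subset is not biased by the file's ordering.
all_texts = get_comments_texts()
random.seed(42)  # fixed seed so the run is reproducible
texts = random.sample(all_texts, k=min(limit_comments, len(all_texts)))
print(f'Analysing {len(texts)} randomly sampled comments')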
To run an analysis on your own data, upload the data file (e.g., a CSV file) to the project's assets. Create a project token (in the project's UI, under Manage -> Access control -> Access tokens) and set it in the project_token variable. We can now read the file:
file_name = '<file_name>'  # The name of the data file uploaded to the project's assets
project_token = '<project_token>'  # The project's token.
from ibm_watson_studio_lib import access_project_or_space
wslib = access_project_or_space({"token": project_token})
file_data = wslib.load_data(file_name)
import pandas as pd
df = pd.read_csv(file_data)
print(df.head(10))
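To feed your own data into the analysis below, extract the free-text column into a list of strings, mirroring what get_comments_texts does for the Austin dataset. The column name 'text' here is an assumption; adjust it to match your file:

# Assumption: the free-text responses are in a column named 'text'; adjust as needed.
texts = [str(t) for t in df['text'].tolist()]
texts = [t for t in texts if len(t) < 3000]  # skip overly long comments, as above
print(f'Loaded {len(texts)} comments from {file_name}')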
We will now analyze the comments using the client.run_full_kps_flow method. This may take a little while; run time depends on the input size.
Each dataset is temporarily stored in a domain, to which the analysis is applied.
domain = 'austin_test'  # describes the dataset
kps_result = client.run_full_kps_flow(domain, texts)
Results are now available. Let’s print a summary of the analysis. For example, we can print the top 40 key points in the dataset, along with the top 3 matching sentences for each key point. The total number of matches for each key point is also indicated.
kps_result.print_result(n_sentences_per_kp=3, title="Austin sample", n_top_kps=40)
We can also export the results to files, including summary and full-analysis CSV files, as well as a user-friendly Word document report.
import os
output_dir = f'./kps_results/{domain}/'
os.makedirs(output_dir, exist_ok=True)
kps_result.export_to_all_outputs(output_dir=output_dir, result_name=domain)
!ls -al {output_dir}
To store the result files persistently, we can upload them to the project's assets.
project_token = '<project_token>' # The project's token.
def upload_files_from_directory_to_project_assets(output_dir, project_token):
    import os
    from ibm_watson_studio_lib import access_project_or_space
    wslib = access_project_or_space({"token": project_token})
    for file_name in os.listdir(output_dir):
        file_path = os.path.join(output_dir, file_name)
        # Check if it's a file (and not a subdirectory)
        if os.path.isfile(file_path):
            # Read file content
            with open(file_path, 'rb') as file:
                file_content = file.read()
            wslib.save_data(file_name, file_content)
            print(f"Uploaded {file_name}")
upload_files_from_directory_to_project_assets(output_dir, project_token)
When we're done, we can stop the backend for a clean termination.
runner.stop_backend()
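When adapting this notebook into a script, it can help to guarantee that the backend is stopped even if the analysis fails midway. A minimal sketch using only the calls shown above:

# Ensure the backend is stopped even if the analysis raises an exception.
runner.start_backend()
try:
    kps_result = client.run_full_kps_flow(domain, texts)
finally:
    runner.stop_backend()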
Authors: Yoav Katz, Roy Bar-Haim, Yoav Kantor, Lilach Edelstein
Copyright © 2024 IBM. This notebook and its source code are released under the terms of the MIT License.