from IPython.display import Markdown, display, HTML
import warnings
warnings.filterwarnings('ignore')
Dialog Skill Analysis for classic Watson Assistant (WA) is intended for chatbot designers, developers, and data scientists who would like to experiment with and improve their existing dialog skill design in the classic experience.
We assume familiarity with the Watson Assistant product as well as concepts involved in dialog skill design, such as intents, entities, and utterances.
Python 3.9 or greater is required. For dependency requirements, please refer to requirements.txt
https://github.com/watson-developer-cloud/assistant-dialog-skill-analysis
!pip install --index-url https://pypi.python.org/simple -U "pip"
!git clone https://github.com/watson-developer-cloud/assistant-skill-analysis.git
!pip install ./assistant-skill-analysis
Looking in indexes: https://pypi.python.org/simple
Successfully installed pip-23.3.2
Cloning into 'assistant-skill-analysis'... done.
Successfully built assistant-skill-analysis ibm-watson
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. (Version conflicts reported for autoai-libs, autoai-ts-libs, lale, numba, and tensorflow against the newly installed numpy 1.26.3 and scikit-learn 1.2.2.)
Successfully installed assistant-skill-analysis-2.0.1 ibm-watson-7.0.1 nltk-3.8.1 numpy-1.26.3 pandas-1.4.4 scikit-learn-1.2.2 spacy-2.3.9 (full dependency download log trimmed)
# Standard python libraries
import sys, os
import json
import importlib
from collections import Counter
# External python libraries
import pandas as pd
import numpy as np
import nltk
nltk.download('stopwords')
nltk.download('punkt')
import ibm_watson
# Internal python libraries
from assistant_skill_analysis.utils import skills_util, lang_utils
from assistant_skill_analysis.highlighting import highlighter
from assistant_skill_analysis.data_analysis import summary_generator
from assistant_skill_analysis.data_analysis import divergence_analyzer
from assistant_skill_analysis.data_analysis import similarity_analyzer
from assistant_skill_analysis.term_analysis import chi2_analyzer
from assistant_skill_analysis.term_analysis import keyword_analyzer
from assistant_skill_analysis.term_analysis import entity_analyzer
from assistant_skill_analysis.confidence_analysis import confidence_analyzer
from assistant_skill_analysis.inferencing import inferencer
from assistant_skill_analysis.experimentation import data_manipulator
[nltk_data] Downloading package stopwords to /home/wsuser/nltk_data... [nltk_data] Package stopwords is already up-to-date! [nltk_data] Downloading package punkt to /home/wsuser/nltk_data... [nltk_data] Package punkt is already up-to-date!
Please provide access credentials for an existing dialog skill that you would like to analyze.
Have your API key and workspace ID handy.
importlib.reload(skills_util)
# Change Assistant API version if needed
# Find Latest --> https://cloud.ibm.com/docs/services/assistant?topic=assistant-release-notes
API_VERSION = '2020-04-01'
# choose a datacenter to use
datacenters = {
'dallas': ('https://api.us-south.assistant.watson.cloud.ibm.com', 'https://iam.cloud.ibm.com/identity/token'),
'washington': ('https://api.us-east.assistant.watson.cloud.ibm.com', 'https://iam.cloud.ibm.com/identity/token'),
'frankfurt' : ('https://api.eu-de.assistant.watson.cloud.ibm.com', 'https://iam.cloud.ibm.com/identity/token'),
'sydney' : ('https://api.au-syd.assistant.watson.cloud.ibm.com', 'https://iam.cloud.ibm.com/identity/token'),
'tokyo' : ('https://api.jp-tok.assistant.watson.cloud.ibm.com', 'https://iam.cloud.ibm.com/identity/token'),
'london' : ('https://api.eu-gb.assistant.watson.cloud.ibm.com', 'https://iam.cloud.ibm.com/identity/token'),
}
URL, authenticator_url = datacenters['dallas']
# For ICP(IBM Cloud Private), you can disable SSL verification by changing this to True
DISABLE_SSL_VERTIFICATION = False
# By default we only need the IAM API key & the workspace ID
# If you run the notebook regularly, you can uncomment the two lines below
# & comment out the input_credentials() call that follows
#iam_apikey = '###'
#skill_id = '###'
iam_apikey, skill_id, _ = skills_util.input_credentials()
conversation = skills_util.retrieve_conversation(iam_apikey=iam_apikey,
url=URL,
api_version=API_VERSION,
authenticator_url=authenticator_url)
#If you do not have IAM based API Keys
#but have access to a Username, Password & Workspace ID
#You can comment out the two lines above & uncomment the lines below to authenticate
# username = 'apikey'
# password = '###'
# skill_id = '###'
# conversation = skills_util.retrieve_conversation(username=username,
# password=password,
# url=URL,
# api_version=API_VERSION)
conversation.set_disable_ssl_verification(DISABLE_SSL_VERTIFICATION)
workspace = skills_util.retrieve_workspace(skill_id=skill_id,
conversation=conversation)
Please enter apikey: ········ Please enter skill-id (workspace_id): ········
Pick the language code corresponding to your workspace data:
Supported Language codes: en, fr, de, es, cs, it, pt, nl
LANGUAGE_CODE="en" # change the language code to work with other languages
lang_util = lang_utils.LanguageUtility(LANGUAGE_CODE)
# Extract user workspace
workspace_pd, workspace_vocabulary, entities, _ = skills_util.extract_workspace_data(workspace, language_util=lang_util)
entities_list = [item['entity'] for item in entities]
display(Markdown("### Sample of Utterances & Intents"))
display(HTML(workspace_pd.sample(n = len(workspace_pd) if len(workspace_pd)<10 else 10)
.to_html(index=False)))
if entities_list:
display(Markdown("### Sample of Entities"))
display(HTML(pd.DataFrame({"Entity":entities_list})
.sample(n = len(entities_list) if len(entities_list)<10 else 10)
.to_html(index=False)))
| utterance | intent | tokens |
|---|---|---|
| i would like to speak to someone | General_Connect_to_Agent | [i, would, like, to, speak, to, someon] |
| send me to an agent | General_Connect_to_Agent | [send, me, to, an, agent] |
| are stores open on sunday | Customer_Care_Store_Hours | [are, store, open, on, sunday] |
| i d like to go to a store | Customer_Care_Store_Location | [i, d, like, to, go, to, a, store] |
| how are you today | General_Greetings | [how, are, you, today] |
| what time does the central manchester store shut on a saturday | Customer_Care_Store_Hours | [what, time, doe, the, central, manchest, store, shut, on, a, saturday] |
| want to change my visit | Customer_Care_Appointments | [want, to, chang, my, visit] |
| see ya | Goodbye | [see, ya] |
| what is your location | Customer_Care_Store_Location | [what, is, your, locat] |
| can i connect to an agent | General_Connect_to_Agent | [can, i, connect, to, an, agent] |
| Entity |
|---|
| phone |
| reply |
| sys-number |
| sys-date |
| zip_code |
| sys-time |
| holiday |
| specialist |
| landmark |
We generate summary statistics for the given skill and workspace.
summary_generator.generate_summary_statistics(workspace_pd, entities_list)
| | Data Characteristic | Value |
|---|---|---|
| 1 | Total User Examples | 199 |
| 2 | Unique Intents | 9 |
| 3 | Average User Examples per Intent | 22 |
| 4 | Standard Deviation from Average | 16 |
| 5 | Total Number of Entities | 9 |
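The statistics above can be reproduced with plain pandas. This is a minimal sketch, not the toolkit's implementation; the `intent` column name mirrors `workspace_pd`, and the toy data is illustrative:

```python
import pandas as pd

# Toy stand-in for workspace_pd: one row per user example
df = pd.DataFrame({
    "utterance": ["hi", "hello", "bye", "thanks", "thank you"],
    "intent": ["greet", "greet", "goodbye", "thanks", "thanks"],
})

counts = df["intent"].value_counts()
print("Total User Examples:", len(df))
print("Unique Intents:", counts.size)
print("Average User Examples per Intent:", round(counts.mean()))
print("Standard Deviation from Average:", round(counts.std(), 1))
```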
We analyze whether the dataset contains class imbalance by checking whether the largest intent contains less than double the number of user examples in the smallest intent. The presence of imbalance does not necessarily indicate an issue; please review the actions section below.
class_imb_flag = summary_generator.class_imbalance_analysis(workspace_pd)
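The check described above reduces to a one-line comparison of intent counts. This sketch assumes a DataFrame with an `intent` column, as in `workspace_pd`; the data is hypothetical:

```python
import pandas as pd

# Hypothetical skill data with one row per user example, as in workspace_pd
df = pd.DataFrame({"intent": ["Goodbye"] * 6 + ["Customer_Care_Store_Hours"] * 48})

counts = df["intent"].value_counts()
# Flag imbalance when the largest intent has more than double the
# number of user examples of the smallest intent
is_imbalanced = counts.max() > 2 * counts.min()
print(bool(is_imbalanced))
```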
Below we display the distribution of intents versus the number of examples per intent, sorted by the number of examples. Ideally, the number of user examples should not vary widely across intents.
summary_generator.scatter_plot_intent_dist(workspace_pd)
Sorted Distribution of User Examples per Intent
summary_generator.show_user_examples_per_intent(workspace_pd)
| | Intent | Number of User Examples |
|---|---|---|
| 1 | Goodbye | 6 |
| 2 | Cancel | 7 |
| 3 | Thanks | 8 |
| 4 | Help | 8 |
| 5 | Customer_Care_Appointments | 20 |
| 6 | Customer_Care_Store_Location | 25 |
| 7 | General_Greetings | 30 |
| 8 | General_Connect_to_Agent | 47 |
| 9 | Customer_Care_Store_Hours | 48 |
Class imbalance will not always lead to lower accuracy, so all intents (classes) need not have the same number of examples.

- For intents like `updateBankAccount` and `addNewAccountHolder`, where the semantic difference between them is subtler, the number of examples per intent needs to be somewhat balanced, or else the classifier might favor the intent with the higher number of examples.
- For an intent like `greetings` that is semantically distinct from other intents like `updateBankAccount`, it may be okay to have fewer examples and still be easy for the intent detector to classify.

If during testing it seems like intent classification accuracy is lower than expected, we advise you to re-examine this distribution analysis.
With regard to the sorted distribution of examples per intent: if the number of user examples varies a lot across different intents, it can be a potential source of bias for intent detection and can lead to lower accuracy. Large imbalances should generally be avoided; if your graph displays this characteristic, it might be a source of error.
For further guidance on adding more examples to help balance out your distribution, please refer to Intent-Example-Recommendation
We perform a chi-square significance test using count features to determine the terms that are most correlated with each intent in the dataset.

A *unigram* is a single word, while a *bigram* is two consecutive words from within the training data. For example, in the sentence "Thank you for your service", each word is a unigram, while terms like "Thank you" and "your service" are bigrams.
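Extracting unigrams and bigrams can be sketched in a few lines of plain Python (a toy illustration, independent of the toolkit's tokenizer):

```python
sentence = "Thank you for your service"
tokens = sentence.lower().split()

unigrams = tokens
# A bigram pairs each token with the token that follows it
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
print(bigrams)
```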
If you see terms like `hi` and `hello` correlated with a `greeting` intent, that is reasonable. But if you see terms like `table` and `chair` correlated with the `greeting` intent, that is anomalous. A scan of the most correlated unigrams and bigrams for each intent can help you spot potential anomalies in your training data.
Note: We ignore the following common words from consideration: an, a, in, on, be, or, of, and, can, is, to, the, i
unigram_intent_dict, bigram_intent_dict = chi2_analyzer.get_chi2_analysis(workspace_pd, lang_util=lang_util)
| | Intent | Correlated Unigrams | Correlated Bigrams |
|---|---|---|---|
| 1 | Customer_Care_Store_Hours | store, close, hour, are, open | what time, what are, you close, store open, you open |
| 2 | General_Connect_to_Agent | pleas, want, talk, speak, agent | do not, speak human, want speak, connect me, want talk |
| 3 | General_Greetings | hi, been, hello, good, hey | hey you, have you, you been, hey there, how are |
| 4 | Customer_Care_Store_Location | give, direct, find, where, locat | do get, find store, get your, where are, how do |
| 5 | Customer_Care_Appointments | visit, meet, face, make, appoint | like discuss, d like, like make, face face, make appoint |
| 6 | Help | me, assist, decid, say, help | need assist, what do, what say, you help, help me |
| 7 | Thanks | mani, nice, much, appreci, thank | you veri, much appreci, mani thank, appreci it, thank you |
| 8 | Cancel | request, tabl, anymor, cancel, mind | forget it, cancel that, cancel request, tabl anymor, anymor anymor |
| 9 | Goodbye | see, arrivederci, ciao, ya, bye | good bye, see ya, so long |
If you identify unusual or anomalous correlated terms, such as numbers or names, which should not be correlated with an intent, please read the following:
A heatmap of terms lets us visualize which terms or words occur frequently within each intent. Rows are the terms and columns are the intents.
By default we show only the top 30 intents with the highest number of user examples. This number can be changed if needed.
INTENTS_TO_DISPLAY = 30 # Total number of intents for display
MAX_TERMS_DISPLAY = 30 # Total number of terms to display
intent_list = []
keyword_analyzer.seaborn_heatmap(workspace_pd, lang_util, INTENTS_TO_DISPLAY, MAX_TERMS_DISPLAY, intent_list)
Token Frequency per Intent
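The frequency matrix behind a heatmap like this can be built with `pandas.crosstab` over exploded token lists. This is a minimal sketch with toy data; passing the resulting matrix to `seaborn.heatmap` would render a plot like the one above:

```python
import pandas as pd

# Toy tokenized workspace: one row per user example
df = pd.DataFrame({
    "intent": ["greet", "greet", "hours", "hours"],
    "tokens": [["hi", "there"], ["hello"], ["store", "hours"], ["store", "open"]],
})

# One row per (intent, token) pair, then a token-by-intent frequency matrix
pairs = df.explode("tokens")
matrix = pd.crosstab(pairs["tokens"], pairs["intent"])
print(matrix)
```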
If you wish to see term analysis for specific intents, add those intents to the intent list; this will generate a custom term heatmap. By default we show the top 30 tokens, but this can be changed if needed.
# intent_list = ['intent1','intent2','intent3']
intent_list = ['Customer_Care_Appointments']
MAX_TERMS_DISPLAY = 20 # Total number of terms to display
if intent_list:
keyword_analyzer.seaborn_heatmap(workspace_pd, lang_util, INTENTS_TO_DISPLAY, MAX_TERMS_DISPLAY, intent_list)
Token Frequency per Intent
If you notice any terms or words that should not be frequently present within an intent, consider modifying the examples in that intent.
Based on the chi-square analysis above, we generate intent pairs whose correlated unigrams and bigrams overlap. This allows us to get a glimpse of which unigrams or bigrams might cause potential confusion in intent detection.
ambiguous_unigram_df = chi2_analyzer.get_confusing_key_terms(unigram_intent_dict)
There is no ambiguity based on top 5 key terms in chi2 analysis
ambiguous_bigram_df = chi2_analyzer.get_confusing_key_terms(bigram_intent_dict)
There is no ambiguity based on top 5 key terms in chi2 analysis
# Add specific intent or intent pairs for which you would like to see overlap
intent1 = 'General_Connect_to_Agent'
intent2 = 'General_Greetings'
chi2_analyzer.chi2_overlap_check(ambiguous_unigram_df,ambiguous_bigram_df,intent1,intent2)
The following analysis shows user examples that are similar but fall under different intents.
similar_utterance_diff_intent_pd = similarity_analyzer.ambiguous_examples_analysis(workspace_pd, lang_util)
Ambiguous Intent Pairs
If you see terms that are correlated with more than one intent, please review whether this seems anomalous based on the use case for each intent. If it seems reasonable, it may not be an issue.
Ambiguous Utterances across intents
Reference for more information on entity: Entity Documentation
For more in-depth analysis related to possible conflicts in your training data across intents, try the conflict detection feature in Watson Assistant:
Conflict Resolution Documentation
Analyze your existing Watson Assistant Dialog Skill with the help of a test set.
Please upload a test set in CSV format. Each line in the file should contain only User_Input<tab>Intent
An example would be
hello how are you<tab>Greeting
I would like to talk to a human<tab>AgentHandoff
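The expected file format can be sketched with a few lines of pandas. This is an illustration only (with `sample_df` as a made-up name); the notebook itself reads the file from Cloud Object Storage via `skills_util.process_test_set` below.

```python
import pandas as pd
from io import StringIO

# Minimal sketch of the expected test-set format:
# one "User_Input<TAB>Intent" pair per line, no header row.
sample = ("hello how are you\tGreeting\n"
          "I would like to talk to a human\tAgentHandoff\n")
sample_df = pd.read_csv(StringIO(sample), sep="\t", header=None,
                        names=["utterance", "intent"])
print(sample_df.shape)  # (2, 2)
```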
import types
from botocore.client import Config
import ibm_boto3
# The following code accesses a csv file in your IBM Cloud Object Storage.
ENDPOINT_URL = 'https://s3.us-east.cloud-object-storage.appdomain.cloud' # change this based on the region of your cos bucket
# please fill in the details here:
COS_API_KEY_ID = 'YOUR_COS_API_KEY'
RESOURCE_INSTANCE_ID = 'YOUR_COS_RESOURCE_INSTANCE_ID'
IBM_COS_BUCKET = 'YOUR_COS_BUCKET_NAME'
IBM_COS_FILE_KEY = 'YOUR_COS_FILE_NAME'
cos_client = ibm_boto3.client(service_name='s3',
ibm_api_key_id=COS_API_KEY_ID,
ibm_service_instance_id = RESOURCE_INSTANCE_ID,
config=Config(signature_version='oauth'),
endpoint_url=ENDPOINT_URL)
body = cos_client.get_object(Bucket=IBM_COS_BUCKET,Key=IBM_COS_FILE_KEY)['Body']
separator = "\t" # separator used in csv.
test_df = skills_util.process_test_set(body, lang_util, delim=separator, cos=True)
display(Markdown("### Random Test Sample"))
display(HTML(test_df.sample(n=min(10, len(test_df))).to_html(index=False)))
utterance | intent | tokens |
---|---|---|
hi advisor | General_Greetings | [hi, advisor] |
can i connect to an agent | General_Connect_to_Agent | [can, i, connect, to, an, agent] |
thank you | Thanks | [thank, you] |
can you arrange for me to meet at your closest store | Customer_Care_Appointments | [can, you, arrang, for, me, to, meet, at, your, closest, store] |
please connect me to a live agent | General_Connect_to_Agent | [pleas, connect, me, to, a, live, agent] |
hey there | General_Greetings | [hey, there] |
are you open on sundays and if so what are the hours | Customer_Care_Store_Hours | [are, you, open, on, sunday, and, if, so, what, are, the, hour] |
i want to know about a store | Customer_Care_Store_Location | [i, want, to, know, about, a, store] |
i want to speak to a human | General_Connect_to_Agent | [i, want, to, speak, to, a, human] |
hey twin | General_Greetings | [hey, twin] |
These steps can take time if you have a large test set
Note: You will be charged for calls made from this notebook based on your WA plan
THREAD_NUM = min(4, os.cpu_count() if os.cpu_count() else 1)
# increase timeout if you experience `TimeoutError`.
# Increasing the `TIMEOUT` allows the process more breathing room to complete
TIMEOUT = 1 # `TIMEOUT` is set to 1 second
full_results = inferencer.inference(conversation=conversation,
test_data=test_df,
max_thread=THREAD_NUM,
skill_id=skill_id,
timeout=TIMEOUT
)
100%|██████████| 53/53 [00:03<00:00, 16.62it/s]
summary_generator.generate_summary_statistics(test_df)
summary_generator.show_user_examples_per_intent(test_df)
Data Characteristic | Value | |
---|---|---|
1 | Total User Examples | 53 |
2 | Unique Intents | 10 |
3 | Average User Examples per Intent | 5 |
4 | Standard Deviation from Average | 3 |
5 | Total Number of Entities | 0 |
Intent | Number of User Examples | |
---|---|---|
1 | Help | 2 |
2 | Thanks | 2 |
3 | Cancel | 3 |
4 | Goodbye | 3 |
5 | Customer_Care_Store_Location | 5 |
6 | Customer_Care_Appointments | 5 |
7 | SYSTEM_OUT_OF_DOMAIN | 7 |
8 | General_Greetings | 7 |
9 | Customer_Care_Store_Hours | 9 |
10 | General_Connect_to_Agent | 10 |
Ideally the Test and Training Data distributions should be similar. The following metrics can help identify gaps between Test Set and Training Set:
1. The distribution of User Examples per Intent for the Test Data should be comparable to the Training Data
2. Average length of User Examples for Test and Training Data should be comparable
3. The vocabulary and phrasing of utterances in the Test Data should be comparable to the Training Data
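Point 3 can be checked with a quick vocabulary-overlap calculation. The sketch below uses made-up token sets for illustration; it is not the notebook's internal code.

```python
# Illustrative: percentage of unique test-set tokens unseen in training.
train_vocab = {"hello", "agent", "store", "hours", "open"}
test_tokens = ["hello", "there", "agent", "please"]

unseen = {tok for tok in set(test_tokens) if tok not in train_vocab}
pct_unseen = len(unseen) / len(set(test_tokens)) * 100
print(pct_unseen)  # 50.0
```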
If your test data comprises examples labeled from your logs, and the training data comprises examples created by human subject matter experts, there may be discrepancies between what the virtual assistant designers thought the end users would type and the way they actually type in production. Thus, if you find discrepancies in this section, you might want to consider changing your design to more closely resemble the way end users use your system.
Note: You will be charged for calls made from this notebook based on your WA plan
divergence_analyzer.analyze_train_test_diff(workspace_pd, test_df, full_results)
Intent | % of Train | % of Test | Absolute Difference % | Train Examples | Test Examples | Test Precision % | Test Recall % | Test F1 % | |
---|---|---|---|---|---|---|---|---|---|
0 | Customer_Care_Store_Hours | 24.120000 | 16.980000 | 7.140000 | 48 | 9 | 100.000000 | 100.000000 | 100.000000 |
1 | General_Connect_to_Agent | 23.620000 | 18.870000 | 4.750000 | 47 | 10 | 90.910000 | 100.000000 | 95.240000 |
3 | Customer_Care_Store_Location | 12.560000 | 9.430000 | 3.130000 | 25 | 5 | 62.500000 | 100.000000 | 76.920000 |
8 | Goodbye | 3.020000 | 5.660000 | 2.650000 | 6 | 3 | 100.000000 | 100.000000 | 100.000000 |
7 | Cancel | 3.520000 | 5.660000 | 2.140000 | 7 | 3 | 100.000000 | 100.000000 | 100.000000 |
2 | General_Greetings | 15.080000 | 13.210000 | 1.870000 | 30 | 7 | 87.500000 | 100.000000 | 93.330000 |
4 | Customer_Care_Appointments | 10.050000 | 9.430000 | 0.620000 | 20 | 5 | 100.000000 | 100.000000 | 100.000000 |
5 | Help | 4.020000 | 3.770000 | 0.250000 | 8 | 2 | 50.000000 | 100.000000 | 66.670000 |
6 | Thanks | 4.020000 | 3.770000 | 0.250000 | 8 | 2 | 100.000000 | 100.000000 | 100.000000 |
Distribution Mismatch Color Code
Red - Severe
Blue - Caution
Green - Good
Note: the metric used is Jensen-Shannon distance
Average length of user examples is comparable
Train Vocabulary Size | Test Vocabulary Size | % Test Set Vocabulary not found in Train | |
---|---|---|---|
1 | 217 | 125 | 16.0 |
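The Jensen-Shannon distance behind the color coding above can be computed directly. This is our own minimal NumPy illustration, not the `divergence_analyzer` implementation:

```python
import numpy as np

def js_distance(p, q):
    """Jensen-Shannon distance (base-2) between two count vectors."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # 0 * log(0) is taken as 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

print(js_distance([1, 1], [1, 1]))  # 0.0 (identical distributions)
print(js_distance([1, 0], [0, 1]))  # 1.0 (disjoint distributions)
```

Identical train/test intent distributions yield a distance of 0; completely disjoint ones yield 1.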
results = full_results[['correct_intent', 'top_confidence','top_intent','utterance']]
accuracy = inferencer.calculate_accuracy(results)
display(Markdown("### Accuracy on Test Data: {} %".format(accuracy)))
This section gives the user an overview of the errors made by the intent classifier on the test set
Note: SYSTEM_OUT_OF_DOMAIN labels are assigned to user examples classified with confidence scores below 0.2, as Watson Assistant would consider them to be irrelevant
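As a hedged sketch of that relabeling rule (the label name matches the `SYSTEM_OUT_OF_DOMAIN` label used elsewhere in this notebook; the frame below is toy data, not the real results):

```python
import pandas as pd

OFFTOPIC_LABEL = "SYSTEM_OUT_OF_DOMAIN"

# Toy predictions: anything under the 0.2 confidence floor is
# treated as out of domain rather than kept as its top intent.
preds = pd.DataFrame({"top_intent": ["Help", "Thanks", "Goodbye"],
                      "top_confidence": [0.15, 0.90, 0.05]})
preds.loc[preds["top_confidence"] < 0.2, "top_intent"] = OFFTOPIC_LABEL
print(preds["top_intent"].tolist())
# ['SYSTEM_OUT_OF_DOMAIN', 'Thanks', 'SYSTEM_OUT_OF_DOMAIN']
```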
wrongs_df = inferencer.calculate_mistakes(results)
display(Markdown("### Intent Detection Mistakes"))
display(Markdown("Number of Test Errors: {}".format(len(wrongs_df))))
with pd.option_context('max_colwidth', 250):
if not wrongs_df.empty:
display(wrongs_df)
Number of Test Errors: 7
correct_intent | top_confidence | top_intent | utterance | |
---|---|---|---|---|
Test Example Index | ||||
46 | SYSTEM_OUT_OF_DOMAIN | 0.079831 | Help | can you tell me a good joke |
47 | SYSTEM_OUT_OF_DOMAIN | 0.140559 | Customer_Care_Store_Location | what is your iq |
48 | SYSTEM_OUT_OF_DOMAIN | 0.145882 | General_Greetings | luke i am your father |
49 | SYSTEM_OUT_OF_DOMAIN | 0.120711 | Customer_Care_Store_Location | where did betty buy her butter |
50 | SYSTEM_OUT_OF_DOMAIN | 0.007102 | General_Connect_to_Agent | how many engineers does it take to change a lightbulb |
51 | SYSTEM_OUT_OF_DOMAIN | 0.039089 | Help | can you help me change my account password |
52 | SYSTEM_OUT_OF_DOMAIN | 0.106361 | Customer_Care_Store_Location | what is a way to change my account address |
In this phase of the analysis, we illustrate how a confidence threshold, which determines what is considered irrelevant or out of domain, can be used for analysis
analysis_df = confidence_analyzer.analysis(results, None)
We calculate metrics for responses where the top intent has a confidence above the threshold specified on the x-axis.
We consider examples within the scope of the chatbot's problem formulation to be on topic (in domain), and examples outside that scope to be out of domain (irrelevant)
x-axis: Confidence threshold used || y-axis: Intent Detection Accuracy for On Topic utterances
x-axis: Confidence threshold used || y-axis: Fraction of All utterances above the threshold
x-axis: Confidence threshold used || y-axis: Fraction of Out of Domain utterances falsely considered on topic
If a certain confidence threshold T is selected, then:
analysis_df.index = np.arange(1, len(analysis_df)+1)
display(analysis_df)
Threshold (T) | Ontopic Accuracy (TOA) | Bot Coverage % | Bot Coverage Counts | False Acceptance Rate (FAR) | |
---|---|---|---|---|---|
1 | 0.0 | 100.0 | 100.000000 | 53 / 53 | 100.000000 |
2 | 0.1 | 100.0 | 94.339623 | 50 / 53 | 57.142857 |
3 | 0.2 | 100.0 | 88.679245 | 47 / 53 | 14.285714 |
4 | 0.3 | 100.0 | 86.792453 | 46 / 53 | 0.000000 |
5 | 0.4 | 100.0 | 86.792453 | 46 / 53 | 0.000000 |
6 | 0.5 | 100.0 | 86.792453 | 46 / 53 | 0.000000 |
7 | 0.6 | 100.0 | 84.905660 | 45 / 53 | 0.000000 |
8 | 0.7 | 100.0 | 84.905660 | 45 / 53 | 0.000000 |
9 | 0.8 | 100.0 | 83.018868 | 44 / 53 | 0.000000 |
10 | 0.9 | 100.0 | 81.132075 | 43 / 53 | 0.000000 |
By selecting a higher threshold, we can potentially bias our system towards greater accuracy in determining whether an utterance is on topic or out of domain. The default confidence threshold for Watson Assistant is 0.2
Effect on Accuracy: When we select a higher threshold T, this can result in higher accuracy (TOA) because only examples with confidences greater than the threshold T are included.
Effect on Bot Coverage %: However, when we select a higher threshold T, this can also result in fewer examples being responded to by the virtual assistant.
Deflection to Human Agent: In scenarios where the virtual assistant is set up to hand off to a human agent when it is less confident, having a higher threshold T can:
This section allows the examination of thresholds on specific intents.
False Acceptance Rate (FAR) for specific intents
When we calculate FAR across all intents (as in the previous section), we calculate the fraction of out-of-domain examples falsely considered on topic. When we calculate FAR for specific intents, we calculate the fraction of examples falsely predicted to be that specific intent.
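A per-intent FAR can be sketched as follows. The column names (`correct_intent`, `top_intent`, `top_confidence`) follow the `results` frame used earlier, but the helper itself and the `demo` frame are illustrative, not the analyzer's actual code:

```python
import pandas as pd

def intent_far(results, intent, threshold):
    """Fraction of examples whose true label is NOT `intent` but which
    the classifier predicted as `intent` with confidence >= `threshold`."""
    negatives = results[results["correct_intent"] != intent]
    if negatives.empty:
        return 0.0
    accepted = negatives[(negatives["top_intent"] == intent) &
                         (negatives["top_confidence"] >= threshold)]
    return len(accepted) / len(negatives)

demo = pd.DataFrame({
    "correct_intent": ["A", "B", "B", "B"],
    "top_intent":     ["A", "A", "B", "A"],
    "top_confidence": [0.9, 0.8, 0.9, 0.1],
})
print(intent_far(demo, "A", 0.5))  # 1 of 3 non-A examples accepted as A
```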
# Calculate intent with most test examples
for label in list(test_df['intent'].value_counts().index):
if label != skills_util.OFFTOPIC_LABEL:
MOST_FREQUENT_INTENT = label
break
# Specify intents of interest for analysis
INTENT_LIST = [MOST_FREQUENT_INTENT]
analysis_df_list = confidence_analyzer.analysis(results, INTENT_LIST)
Out of Domain examples fewer than 5 thus no False Acceptance Rate (FAR) calculated
Threshold (T) | Ontopic Accuracy (TOA) | Bot Coverage % | Bot Coverage Counts | |
---|---|---|---|---|
1 | 0.0 | 100.0 | 100.000000 | 11 / 11 |
2 | 0.1 | 100.0 | 100.000000 | 11 / 11 |
3 | 0.2 | 100.0 | 100.000000 | 11 / 11 |
4 | 0.3 | 100.0 | 100.000000 | 11 / 11 |
5 | 0.4 | 100.0 | 100.000000 | 11 / 11 |
6 | 0.5 | 100.0 | 90.909091 | 10 / 11 |
7 | 0.6 | 100.0 | 90.909091 | 10 / 11 |
8 | 0.7 | 100.0 | 90.909091 | 10 / 11 |
9 | 0.8 | 100.0 | 90.909091 | 10 / 11 |
10 | 0.9 | 100.0 | 90.909091 | 10 / 11 |
This intent can be the ground-truth intent or an incorrectly predicted intent. The analysis provides term-level insight into which terms the classifier thought were important in relation to that specific intent.
Even if the system predicts an intent correctly, the terms the intent classifier thought were important may not match human insight. Human insight might suggest that the classifier is focusing on the wrong terms.
The score of each term in the following highlighted images can be viewed as the importance of that term for that specific intent. The larger the score, the more important the term.
We can get the highlighted images for either wrongly-predicted utterances or utterances where the classifier returned a low confidence.
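One common way to estimate such term importances is leave-one-token-out scoring. This is a hypothetical sketch of the idea, not the highlighter's actual method: the notebook's highlighter queries Watson Assistant itself, while `toy_score` below is a stand-in scorer so the example runs locally.

```python
def token_importance(tokens, score_fn):
    """Importance of each token = drop in score when that token is removed."""
    base = score_fn(" ".join(tokens))
    return {
        tok: base - score_fn(" ".join(t for j, t in enumerate(tokens) if j != i))
        for i, tok in enumerate(tokens)
    }

def toy_score(text):
    # Pretend confidence: contribution of the keyword "agent" to the score.
    return text.split().count("agent") / 5

scores = token_importance(["connect", "me", "to", "an", "agent"], toy_score)
print(max(scores, key=scores.get))  # 'agent' has the largest score drop
```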
Note: You will be charged for calls made from this notebook based on your WA plan
# Pick an example from section 1 which was misclassified
# Add the example and correct intent for the example
utterance = "where is the closest agent" # input example
intent = "General_Connect_to_Agent" # input an intent in your workspace which you are interested in.
# increase timeout if you experience `TimeoutError`.
# Increasing the `TIMEOUT` allows the process more breathing room to complete
TIMEOUT = 1 # `TIMEOUT` is set to 1 second
inference_results = inferencer.inference(conversation=conversation,
skill_id=skill_id,
test_data=pd.DataFrame({'utterance':[utterance],
'intent':[intent]}),
max_thread = 1,
timeout=TIMEOUT
)
highlighter.get_highlights_in_batch_multi_thread(conversation=conversation,
full_results=inference_results,
output_folder=None,
confidence_threshold=1,
show_worst_k=1,
lang_util=lang_util,
skill_id=skill_id,
)
100%|██████████| 5/5 [00:00<00:00, 13.99it/s]
1 examples are shown below:
Characteristic | Value | |
---|---|---|
1 | Test Set Index | 0 |
2 | Utterance | where is the closest agent |
3 | Actual Intent | General_Connect_to_Agent |
4 | Predicted Intent | General_Connect_to_Agent |
5 | Confidence | 1 |
In the section below we analyze your test results and produce highlighting for the top 25 problematic utterances which were either mistakes or had confidences below the threshold that was set.
Note: You will be charged for calls made from this notebook based on your WA plan
# The output folder for generated images
# Note modify this if you want the generated images to be stored in a different directory
highlighting_output_folder = './highlighting_images/'
if not os.path.exists(highlighting_output_folder):
os.mkdir(highlighting_output_folder)
# Predictions with confidence below this threshold
# will be considered `out of domain` or `offtopic` utterances.
threshold = 0.2
# Maximum number of test set examples whose highlighting analysis will be conducted
K=25
highlighter.get_highlights_in_batch_multi_thread(conversation=conversation,
full_results=full_results,
output_folder=highlighting_output_folder,
confidence_threshold=threshold,
show_worst_k=K,
lang_util=lang_util,
skill_id=skill_id,
)
Every test utterance is classified as a specific intent with a specific confidence by the WA intent classifier. We expect the model to be confident when it predicts correctly and not highly confident when it predicts incorrectly.
Often this is not the case, which may suggest anomalies in the design. Examples that are predicted correctly with low confidence, and examples that are predicted incorrectly with high confidence, are therefore the cases that need review.
importlib.reload(confidence_analyzer)
correct_thresh, wrong_thresh = 0.3, 0.7
correct_with_low_conf_list, incorrect_with_high_conf_list = confidence_analyzer.abnormal_conf(
full_results, correct_thresh, wrong_thresh)
if len(correct_with_low_conf_list) > 0:
display(Markdown("#### Examples correctly predicted with low confidence"))
with pd.option_context('max_colwidth', 250):
display(HTML(correct_with_low_conf_list.to_html(index=False)))
if len(incorrect_with_high_conf_list) > 0:
display(Markdown("#### Examples incorrectly predicted with high confidence"))
with pd.option_context('max_colwidth', 250):
display(HTML(incorrect_with_high_conf_list.to_html(index=False)))
If there are examples which are getting classified incorrectly with high confidence for specific intents, it may indicate an issue in the design of those specific intents as the user examples provided for that intent may be overlapping with the design of other intents.
If intent A seems to always get misclassified as intent B with high confidence or gets correctly predicted with low confidence, please consider using intent conflict detection https://cloud.ibm.com/docs/services/assistant?topic=assistant-intents#intents-resolve-conflicts
Also consider whether those two intents need to be two separate intents or whether they need to be merged. If they can't be merged, then consider adding more user examples which distinguish intent A specifically from intent B.
We perform a chi-square significance test for entities, as we did for unigrams and bigrams in the previous section. This analysis calls the message API for entity detection on each utterance in the training data and finds the most correlated entities for each intent
Note: You will be charged for calls made from this notebook based on your WA plan.
if entities_list:
THREAD_NUM = min(4, os.cpu_count() if os.cpu_count() else 1)
# increase timeout if you experience `TimeoutError`.
# Increasing the `TIMEOUT` allows the process more breathing room to complete
TIMEOUT = 1 # `TIMEOUT` is set to 1 second
train_full_results = inferencer.inference(conversation=conversation,
test_data=workspace_pd,
max_thread=THREAD_NUM,
skill_id=skill_id,
timeout=TIMEOUT
)
entity_label_correlation_df = entity_analyzer.entity_label_correlation_analysis(
train_full_results, entities_list)
with pd.option_context('display.max_colwidth', 200):
entity_label_correlation_df.index = np.arange(1, len(entity_label_correlation_df) + 1)
display(entity_label_correlation_df)
else:
display(Markdown("### Target workspace has no entities."))
100%|██████████| 199/199 [00:13<00:00, 15.23it/s]
Intent | Correlated Entities | |
---|---|---|
1 | Customer_Care_Store_Location | landmark |
2 | General_Connect_to_Agent | sys-date, reply |
3 | Customer_Care_Store_Hours | sys-date, holiday, reply |
4 | Customer_Care_Appointments | sys-number |
Congratulations! You have successfully completed the Dialog Skill Analysis. This notebook is designed to support iterative improvement: you can tackle one aspect of your dialog skill at a time and return for another aspect later as part of continuous improvement.
True Positives (TP): True Positives measure the number of correctly predicted positive values, meaning the predicted class is the same as the actual class, which is the target intent.
True Negatives (TN): True Negatives measure the number of correctly predicted negative values, meaning the predicted class is the same as the actual class, which is not the target intent.
False Positives (FP): False Positives measure the number of incorrectly predicted positive values, meaning the predicted class is the target intent but the actual class is not.
False Negatives (FN): False Negatives measure the number of incorrectly predicted negative values, meaning the predicted class is not the target intent but the actual class is.
Accuracy: Accuracy measures the ratio of correctly predicted user examples out of all user examples.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision: Precision measures the ratio of correctly predicted positive observations out of total predicted positive observations.
Precision = TP / (TP + FP)
Recall: Recall measures the ratio of correctly predicted positive observations out of all observations of the target intent.
Recall = TP / (TP + FN)
F1 Score: F1 Score is the harmonic average of Precision and Recall.
F1 = 2 * (Precision * Recall)/ (Precision + Recall)
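The formulas above can be illustrated with a small runnable example; `intent_metrics` and the toy labels below are our own, not part of the notebook's library:

```python
# Compute TP/TN/FP/FN-based metrics for one target intent
# from parallel lists of gold and predicted labels.
def intent_metrics(gold, pred, target):
    tp = sum(g == target and p == target for g, p in zip(gold, pred))
    fp = sum(g != target and p == target for g, p in zip(gold, pred))
    fn = sum(g == target and p != target for g, p in zip(gold, pred))
    tn = len(gold) - tp - fp - fn
    accuracy = (tp + tn) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

gold = ["A", "A", "B", "B"]
pred = ["A", "B", "B", "B"]
print(intent_metrics(gold, pred, "A"))
# accuracy 0.75, precision 1.0, recall 0.5, F1 ~0.667
```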
For more information related to Watson Assistant: Watson Assistant Documentation
Haode Qi is a data scientist at IBM Watson. He conducts research on NLP technologies and delivers machine learning algorithms into IBM Watson's market-leading conversational AI service. He is involved in several IBM open-source projects, such as the Auto-AI framework Lale, and works with a dozen clients to improve their AI chatbots.
Navneet Rao is an engineering lead at IBM Watson. He believes in building unique AI-powered experiences which augment human capabilities. He currently works on AI innovation & research for IBM's award-winning conversational computing platform, the IBM Watson Assistant.
Ming Tan, PhD, is a data scientist at IBM Watson. He works mainly on prototyping and productizing various algorithmic features related to the Watson Assistant service. He has broad research interests in deep learning approaches for conversational services and related NLP tasks, such as low-resource intent classification, out-of-domain detection, multi-user chat channels, passage-level semantic matching, and entity detection. He has published multiple research works at top-tier NLP conferences.
Yang Yu, PhD, is a data scientist at IBM Watson. His research focuses mainly on language understanding, question answering, deep learning, and representation learning methods for different NLP tasks. At IBM, he has won awards in several internal machine learning competitions with global researchers. Several novel machine learning solutions he designed and developed have solved critical question answering and human-computer dialog problems for popular Watson services on the market.