Detecting entities with a custom dictionary

Last updated: Jul 25, 2024

If you have a fixed set of terms that you want to detect, like a list of product names or organizations, you can create a dictionary. Dictionary matching is very fast and resource-efficient.

Watson Natural Language Processing dictionaries offer matching capabilities that go beyond a simple string match:

Dictionary terms can consist of a single token, for example wheel, or of multiple tokens, for example steering wheel.
Dictionary term matching can be case-sensitive or case-insensitive. With a case-sensitive match, you can ensure that acronyms like ABS don't match ordinary words like abs, which have a different meaning.
You can specify how to consolidate matches when multiple dictionary entries match the same text. Given the two dictionary entries Watson and Watson Natural Language Processing, you can configure which entry should match in "I like Watson Natural Language Processing": either only Watson Natural Language Processing, because it contains Watson, or both entries.
You can match on the lemma instead of enumerating all inflections. This way, the single dictionary entry mouse detects both mouse and mice in the text.
You can attach a label to each dictionary entry, for example Organization, to include additional metadata in the match.

All of these capabilities can be configured, so you can pick the right option for your use case.
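
To make the case and consolidation options above more concrete, here is a minimal plain-Python sketch. It is not the Watson Natural Language Processing implementation: `find_matches` and `keep_longest` are hypothetical helpers, and a regular-expression word boundary stands in for real tokenization.

```python
import re

def find_matches(text, terms, case="exact"):
    """Return sorted (begin, end, term) tuples for whole-word occurrences."""
    flags = re.IGNORECASE if case == "insensitive" else 0
    matches = []
    for term in terms:
        # \b keeps "abs" from matching inside "cabs"; a term with spaces,
        # such as "steering wheel", is matched as one multi-token unit.
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags):
            matches.append((m.start(), m.end(), term))
    return sorted(matches)

def keep_longest(matches):
    """Drop every match whose span lies inside a strictly longer match."""
    kept = []
    for begin, end, term in matches:
        contained = any(
            o_begin <= begin and end <= o_end and (o_end - o_begin) > (end - begin)
            for o_begin, o_end, _ in matches
        )
        if not contained:
            kept.append((begin, end, term))
    return kept

text = "I like Watson Natural Language Processing"
terms = ["Watson", "Watson Natural Language Processing"]

all_matches = find_matches(text, terms)    # both entries match
longest_only = keep_longest(all_matches)   # only the longer entry survives

exact = find_matches("ABS and abs", ["ABS"])                      # case-sensitive
relaxed = find_matches("ABS and abs", ["ABS"], case="insensitive")
```

In this sketch, `all_matches` corresponds to keeping both entries and `longest_only` to keeping only the longer one, while `exact` finds the acronym alone and `relaxed` also finds the lowercase word.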

Types of dictionary files

Watson Natural Language Processing supports two types of dictionary files:

Term list (ending in .dict)

Example of a term list:
```
Arthur
Allen
Albert
Alexa
```

Table (ending in .csv)

Example of a table:
```
"label", "entry"
"ORGANIZATION", "NASA"
"COUNTRY", "USA"
"ACTOR", "Christian Bale"
```

You can use multiple dictionaries during the same extraction. You can also use both types at the same time. For example, you can run a single extraction with three dictionaries: one term list and two tables.
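
The difference between the two file types can be inspected with Python's standard library alone. The sketch below only illustrates what each format carries; it makes no claim about how Watson Natural Language Processing parses the files. `read_term_list` and `read_table` are hypothetical helper names, and the sample contents are the two examples above.

```python
import csv
import io

term_list_text = "Arthur\nAllen\nAlbert\nAlexa\n"
table_text = (
    '"label", "entry"\n'
    '"ORGANIZATION", "NASA"\n'
    '"COUNTRY", "USA"\n'
    '"ACTOR", "Christian Bale"\n'
)

def read_term_list(handle):
    """A term list holds one entry per line and no further metadata."""
    return [line.strip() for line in handle if line.strip()]

def read_table(handle):
    """A table has a header row; each entry comes with its own label."""
    # skipinitialspace=True copes with the space after each comma.
    return list(csv.DictReader(handle, skipinitialspace=True))

terms = read_term_list(io.StringIO(term_list_text))
rows = read_table(io.StringIO(table_text))

for row in rows:
    print(row["label"], "->", row["entry"])
```

A term list yields bare strings, whereas every table row pairs the text to match with a label.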

Creating dictionary files

Begin by creating a module directory inside your notebook. This is a directory inside the notebook file system that will be used temporarily to store your dictionary files.

To create dictionary files in your notebook:

Create a module directory. The name of the module folder must not contain dashes (-), because they cause errors.
```
import os
import watson_nlp
module_folder = "NLP_Dict_Module_1"
os.makedirs(module_folder, exist_ok=True)
```

Create dictionary files, and store them in the module directory. You can either read in an external list or CSV file, or you can create dictionary files like so:

```
# Create a term list dictionary
term_file = "names.dict"
with open(os.path.join(module_folder, term_file), 'w') as dictionary:
    dictionary.write('Bruce')
    dictionary.write('\n')
    dictionary.write('Peter')
    dictionary.write('\n')

# Create a table dictionary
table_file = 'Places.csv'
with open(os.path.join(module_folder, table_file), 'w') as places:
    places.write("\"label\", \"entry\"")
    places.write("\n")
    places.write("\"SIGHT\", \"Times Square\"")
    places.write("\n")
    places.write("\"PLACE\", \"5th Avenue\"")
    places.write("\n")
```
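
If you generate dictionaries from Python lists, the repeated `write()` calls can be wrapped in small helpers. The following is a sketch under assumptions: `make_module_folder`, `write_term_list`, and `write_table` are hypothetical names rather than part of `watson_nlp`, the output reproduces the file layout shown above, and a temporary directory stands in for the notebook file system.

```python
import os
import tempfile

def make_module_folder(parent, name):
    """Create the module folder, rejecting names that contain dashes."""
    if "-" in name:
        raise ValueError("module folder name must not contain dashes: " + name)
    path = os.path.join(parent, name)
    os.makedirs(path, exist_ok=True)
    return path

def write_term_list(path, terms):
    """Write one term per line."""
    with open(path, "w", encoding="utf-8") as f:
        for term in terms:
            f.write(term + "\n")

def write_table(path, columns, rows):
    """Write a header row and data rows as quoted, comma-separated values."""
    def fmt(values):
        # Assumes values contain no double quotes; no escaping is done.
        return ", ".join('"' + value + '"' for value in values)
    with open(path, "w", encoding="utf-8") as f:
        f.write(fmt(columns) + "\n")
        for row in rows:
            f.write(fmt(row) + "\n")

demo_parent = tempfile.mkdtemp()  # stand-in for the notebook file system
demo_folder = make_module_folder(demo_parent, "NLP_Dict_Module_1")
demo_terms = os.path.join(demo_folder, "names.dict")
demo_table = os.path.join(demo_folder, "Places.csv")

write_term_list(demo_terms, ["Bruce", "Peter"])
write_table(demo_table, ["label", "entry"],
            [("SIGHT", "Times Square"), ("PLACE", "5th Avenue")])
```

The helpers write the same content as the explicit `write()` calls, and checking the folder name up front surfaces the dash restriction before any files are created.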

Loading the dictionaries and configuring matching options

You can load the dictionaries with the following helper methods:

To load a single dictionary, use watson_nlp.toolkit.rule_utils.DictionaryConfig(<dictionary configuration>)
To load multiple dictionaries, use watson_nlp.toolkit.rule_utils.DictionaryConfig.load_all([<dictionary configuration>])

For each dictionary, you need to specify a dictionary configuration. The dictionary configuration is a Python dictionary, with the following attributes:

Attribute	Value	Description	Required
`name`	string	The name of the dictionary	Yes
`source`	string	The path to the dictionary, relative to `module_folder`	Yes
`dict_type`	file or table	Whether the dictionary artifact is a term list (file) or a table of mappings (table)	No. The default is file
`consolidate`	ContainedWithin (Keep the longest match and deduplicate) / NotContainedWithin (Keep the shortest match and deduplicate) / ContainsButNotEqual (Keep longest match but keep duplicate matches) / ExactMatch (Deduplicate) / LeftToRight (Keep the leftmost longest non-overlapping span)	What to do with dictionary matches that overlap.	No. The default is to not consolidate matches.
`case`	exact / insensitive	Either match exact case or be case insensitive.	No. The default is exact match.
`lemma`	True / False	Match the terms in the dictionary with the lemmas from the text. The dictionary should contain only lemma forms. For example, add `mouse` in the dictionary to match both `mouse` and `mice` in text. Do not add `mice` in the dictionary. To match terms that consist of multiple tokens in text, separate the lemmas of those terms in the dictionary by a space character.	No. The default is False.
`mappings.columns` (`columns` as an attribute of `mappings: {}`)	list [ string ]	List of column headers, in the same order as they appear in the table CSV file	Yes if `dict_type: table`
`mappings.entry` (`entry` as an attribute of `mappings: {}`)	string	The name of the column header that contains the string to match against the document.	Yes if `dict_type: table`
`label`	string	The label to attach to matches.	No
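
The attribute table above can be turned into a small pre-flight check that runs before the dictionaries are loaded. This is a sketch, not part of `watson_nlp`: `check_dictionary_config` is a hypothetical helper, and it assumes that the listed option values are passed as plain strings and that `mappings.entry` must name one of `mappings.columns`.

```python
ALLOWED_VALUES = {
    "dict_type": {"file", "table"},
    "case": {"exact", "insensitive"},
    "consolidate": {
        "ContainedWithin", "NotContainedWithin", "ContainsButNotEqual",
        "ExactMatch", "LeftToRight",
    },
}

def check_dictionary_config(config):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    for key in ("name", "source"):
        value = config.get(key)
        if not isinstance(value, str) or not value:
            problems.append("'" + key + "' is required and must be a string")
    for key, allowed in ALLOWED_VALUES.items():
        if key in config and config[key] not in allowed:
            problems.append("'" + key + "' must be one of " + str(sorted(allowed)))
    if "lemma" in config and not isinstance(config["lemma"], bool):
        problems.append("'lemma' must be True or False")
    # Tables additionally need the column list and the column to match on.
    if config.get("dict_type", "file") == "table":
        mappings = config.get("mappings", {})
        columns = mappings.get("columns")
        entry = mappings.get("entry")
        if not isinstance(columns, list) or not columns:
            problems.append("'mappings.columns' is required for tables")
        if not isinstance(entry, str):
            problems.append("'mappings.entry' is required for tables")
        elif isinstance(columns, list) and entry not in columns:
            problems.append("'mappings.entry' must be one of 'mappings.columns'")
    return problems

names_config = {"name": "Names", "source": "names.dict", "case": "insensitive"}
places_config = {
    "name": "places_and_sights_mappings",
    "source": "Places.csv",
    "dict_type": "table",
    "mappings": {"columns": ["label", "entry"], "entry": "entry"},
}
broken_config = {"name": "Broken", "source": "Places.csv", "dict_type": "table"}

for cfg in (names_config, places_config, broken_config):
    print(cfg["name"], check_dictionary_config(cfg))
```

A check like this reports a missing `mappings` block or a misspelled option value before the configuration reaches the loader.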

Code sample

```
# Load the dictionaries
dictionaries = watson_nlp.toolkit.rule_utils.DictionaryConfig.load_all([{
    'name': 'Names',
    'source': term_file,
    'case':'insensitive'
}, {
    'name': 'places_and_sights_mappings',
    'source': table_file,
    'dict_type': 'table',
    'mappings': {
        'columns': ['label', 'entry'],
        'entry': 'entry'
    }
}])
```

Training a model that contains dictionaries

After you have loaded the dictionaries, create a dictionary model and train the model using the RBR.train() method. In the method, specify:

The module directory
The language of the dictionary entries
The dictionaries to use

Code sample

```
custom_dict_block = watson_nlp.resources.feature_extractor.RBR.train(
    module_folder, language='en', dictionaries=dictionaries
)
```

Applying the model on new data

After you have trained the model, apply it to new data by using the run() method, as you would with any of the existing pre-trained blocks.

Code sample

```
custom_dict_block.run('Bruce is at Times Square')
```

Output of the code sample:

```
{(0, 5): ['Names'], (12, 24): ['SIGHT']}
```
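
The keys in this output read as (begin, end) character offsets into the input text, with the end offset exclusive. As a minimal sketch, assuming the result can be treated as a plain Python dictionary shaped like the printed output (the actual return type of run() may differ), you can recover the matched strings by slicing:

```python
text = "Bruce is at Times Square"

# Literal copy of the printed output above, used as stand-in data.
result = {(0, 5): ["Names"], (12, 24): ["SIGHT"]}

# Slice the text with each span to see what was matched and how it was labeled.
matched = [(text[begin:end], labels) for (begin, end), labels in sorted(result.items())]

for span_text, labels in matched:
    print(span_text, labels)
```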

To show the labels or the name of the dictionary:

```
RBR_result = custom_dict_block.executor.get_raw_response('Bruce is at Times Square', language='en')
print(RBR_result)
```

Output showing the labels:

```
{'annotations': {'View_Names': [{'label': 'Names', 'match': {'location': {'begin': 0, 'end': 5}, 'text': 'Bruce'}}], 'View_places_and_sights_mappings': [{'label': 'SIGHT', 'match': {'location': {'begin': 12, 'end': 24}, 'text': 'Times Square'}}]}, 'instrumentationInfo': {'annotator': {'version': '1.0', 'key': 'Text match extractor for NLP_Dict_Module_1'}, 'runningTimeMS': 3, 'documentSizeChars': 32, 'numAnnotationsTotal': 2, 'numAnnotationsPerType': [{'annotationType': 'View_Names', 'numAnnotations': 1}, {'annotationType': 'View_places_and_sights_mappings', 'numAnnotations': 1}], 'interrupted': False, 'success': True}}
```
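
The raw response is deeply nested, so it can help to flatten it into one row per match. The sketch below works on a trimmed copy of the output above and assumes the response is an ordinary nested dictionary with exactly that shape. `flatten_annotations` is a hypothetical helper, not a `watson_nlp` function.

```python
# Trimmed copy of the raw response shown above (annotations only).
raw_response = {
    "annotations": {
        "View_Names": [
            {"label": "Names",
             "match": {"location": {"begin": 0, "end": 5}, "text": "Bruce"}},
        ],
        "View_places_and_sights_mappings": [
            {"label": "SIGHT",
             "match": {"location": {"begin": 12, "end": 24}, "text": "Times Square"}},
        ],
    },
}

def flatten_annotations(response):
    """Return (view, label, text, begin, end) tuples ordered by begin offset."""
    rows = []
    for view, annotations in response.get("annotations", {}).items():
        for annotation in annotations:
            match = annotation["match"]
            location = match["location"]
            rows.append((view, annotation["label"], match["text"],
                         location["begin"], location["end"]))
    return sorted(rows, key=lambda row: row[3])

rows = flatten_annotations(raw_response)
for row in rows:
    print(row)
```

Each view name is the dictionary name prefixed with `View_`, so the flattened rows show both which dictionary produced a match and which label it carries.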
