Similar to detecting entities with dictionaries, you can detect entities with regular expression pattern matches.
Unlike dictionaries, regular expressions are not provided in files; you define them in memory as part of a regex configuration. You can use multiple regex configurations during the same extraction.
Regexes that you define with Watson Natural Language Processing can use token boundaries. This way, you can ensure that your regular expression matches within one or more tokens. This is a clear advantage over simpler regular expression engines, especially when you work with a language that is not separated by whitespace, such as Chinese.
Regular expressions are processed by a dedicated component called Rule-Based Runtime, or RBR for short.
Creating regex configurations
Begin by creating a module directory inside your notebook. This is a directory inside the notebook file system that is used temporarily to store the files created by the RBR training. This module directory can be the same directory that you created and used for dictionary-based entity extraction. Dictionaries and regular expressions can be used in the same training run.
To create the module directory in your notebook, enter the following in a code cell. Note that the module directory can't contain a dash (-).
import os
import watson_nlp
module_folder = "NLP_RBR_Module_2"
os.makedirs(module_folder, exist_ok=True)
A regex configuration is a Python dictionary, with the following attributes:
Attribute | Value | Description | Required |
---|---|---|---|
name | string | The name of the regular expression. Matches of the regular expression in the input text are tagged with this name in the output. | Yes |
regexes | list (string of Perl-based regex patterns) | Should be non-empty. Multiple regexes can be provided. | Yes |
flags | Delimited string of valid flags | Flags such as UNICODE or CASE_INSENSITIVE control the matching. Can also be a combination of flags. For the supported flags, see Pattern (Java Platform SE 8). | No (defaults to DOTALL) |
token_boundary.min | int | token_boundary indicates whether to match the regular expression only on token boundaries. Specified as a dict object with min and max attributes. | No (returns the longest non-overlapping match at each character position in the input text) |
token_boundary.max | int | max is an optional attribute for token_boundary and is needed when the boundary must extend for a range (between min and max tokens). token_boundary.max needs to be >= token_boundary.min. | No (if token_boundary is specified, the min attribute can be specified alone) |
groups | list (string labels for matching groups) | The string index in the list corresponds to the matched group in the pattern, starting with 1, where index 0 corresponds to the entire match. For example, the regex (a)(b) on ab with groups ['full', 'first', 'second'] yields full: ab, first: a, second: b. | No (defaults to labeling the full match) |
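For illustration, a single regex configuration that uses all of these attributes could look like the following sketch. The pattern, the labels, and the flag value are made up for this example; only the attribute names and their meaning come from the table above.
# Illustrative regex configuration (the pattern and labels are examples only)
phone_config = {
    'name': 'phone numbers',
    # one or more Perl-based regex patterns
    'regexes': [r'(\d{3})-(\d{3}-\d{4})'],
    # a single flag; combinations of flags are also possible
    'flags': 'CASE_INSENSITIVE',
    # match only on spans of one to three tokens
    'token_boundary': {'min': 1, 'max': 3},
    # index 0 labels the full match, indexes 1 and 2 label the capture groups
    'groups': ['full number', 'area code', 'line number']
}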
The regex configurations can be loaded using the following helper methods:
- To load a single regex configuration, use watson_nlp.toolkit.RegexConfig.load(<regex configuration>)
- To load multiple regex configurations, use watson_nlp.toolkit.RegexConfig.load_all([<regex configuration>, ...])
Code sample
This sample shows you how to load two different regex configurations. The first configuration detects person names. It uses the groups attribute to allow easy access to the full, first and last name at a later stage.
The second configuration detects acronyms as a sequence of all-uppercase characters. By using the token_boundary attribute, it prevents matches in words that contain both uppercase and lowercase characters.
from watson_nlp.toolkit.rule_utils import RegexConfig
# Load some regex configs, for instance to match First names or acronyms
regexes = RegexConfig.load_all([
{
'name': 'full names',
'regexes': ['([A-Z][a-z]*) ([A-Z][a-z]*)'],
'groups': ['full name', 'first name', 'last name']
},
{
'name': 'acronyms',
'regexes': ['([A-Z]+)'],
'groups': ['acronym'],
'token_boundary': {
'min': 1,
'max': 1
}
}
])
Training a model that contains regular expressions
After you have loaded the regex configurations, create an RBR model using the RBR.train() method. In the method, specify:
- The module directory
- The language of the text
- The regex configurations to use
This is the same method that is used to train RBR with dictionary-based extraction. You can pass the dictionary configuration in the same method call.
Code sample
# Train the RBR model
custom_regex_block = watson_nlp.resources.feature_extractor.RBR.train(module_path=module_folder, language='en', regexes=regexes)
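Dictionaries and regular expressions can also be combined in one training run. The following line is a sketch that assumes a dictionaries list loaded as described in the dictionary-based entity extraction topic, passed through a dictionaries keyword as in that example:
# Sketch: train one RBR model from both dictionaries and regexes
# (assumes `dictionaries` was loaded as in the dictionary-based extraction topic)
custom_rbr_block = watson_nlp.resources.feature_extractor.RBR.train(module_path=module_folder, language='en', dictionaries=dictionaries, regexes=regexes)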
Applying the model on new data
After you have trained the model, apply it to new data using the run() method, as you would with any of the existing pre-trained blocks.
Code sample
custom_regex_block.run('Bruce Wayne works for NASA')
Output of the code sample:
{(0, 11): ['regex::full names'], (0, 5): ['regex::full names'], (6, 11): ['regex::full names'], (22, 26): ['regex::acronyms']}
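Each key in the output is a (begin, end) pair of character offsets into the input text, so the matched strings can be recovered by slicing. A short illustration that uses the output shown above:
text = 'Bruce Wayne works for NASA'
# Spans copied from the output above: (begin, end) character offsets mapped to labels
spans = {(0, 11): ['regex::full names'], (0, 5): ['regex::full names'], (6, 11): ['regex::full names'], (22, 26): ['regex::acronyms']}
for (begin, end), labels in spans.items():
    print(text[begin:end], labels)
# Prints, for example: Bruce Wayne ['regex::full names']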
To show the matching subgroups or the matched text:
import json
# Get the raw response including matching groups
full_regex_result = custom_regex_block.executor.get_raw_response('Bruce Wayne works for NASA', language='en')
print(json.dumps(full_regex_result, indent=2))
Output of the code sample:
{
"annotations": {
"View_full names": [
{
"label": "regex::full names",
"fullname": {
"location": {
"begin": 0,
"end": 11
},
"text": "Bruce Wayne"
},
"firstname": {
"location": {
"begin": 0,
"end": 5
},
"text": "Bruce"
},
"lastname": {
"location": {
"begin": 6,
"end": 11
},
"text": "Wayne"
}
}
],
"View_acronyms": [
{
"label": "regex::acronyms",
"acronym": {
"location": {
"begin": 22,
"end": 26
},
"text": "NASA"
}
}
]
},
...
}
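Because the raw response shown above is a plain Python dictionary, individual groups can be read directly from its annotations section. A small sketch, assuming the structure printed above:
# Read the named groups from the raw response
# (the view name and group keys are taken from the output above)
for match in full_regex_result['annotations']['View_full names']:
    print(match['firstname']['text'], match['lastname']['text'])
# Prints: Bruce Wayne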
Parent topic: Creating your own models