Similar to detecting entities with dictionaries, you can use regex pattern matches to detect entities.
Unlike dictionaries, regular expressions are not provided in files. Instead, they are defined in memory within a regex configuration. You can use multiple regex configurations during the same extraction.
Regexes that you define with Watson Natural Language Processing can use token boundaries. This way, you can ensure that your regular expression matches within one or more tokens. This is a clear advantage over simpler regular expression engines, especially when you work with a language in which words are not separated by whitespace, such as Chinese.
Regular expressions are processed by a dedicated component called Rule-Based Runtime, or RBR for short.
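To illustrate why token boundaries matter, the following sketch uses only Python's standard re module, not Watson NLP. It assumes that naive whitespace splitting stands in for a real tokenizer, and the helper name match_on_tokens is hypothetical. A character-level search for uppercase runs also picks up the initial letters of capitalized words, whereas requiring the pattern to cover a whole token does not.

```python
import re

text = "Bruce Wayne works for NASA"
pattern = re.compile(r"[A-Z]+")

# Character-level matching: the pattern may match anywhere inside a word.
char_level = [m.group() for m in pattern.finditer(text)]


def match_on_tokens(pattern, text):
    """Return the tokens that the pattern matches in full.

    Assumption: whitespace splitting stands in for a real tokenizer.
    """
    return [tok for tok in text.split() if pattern.fullmatch(tok)]


token_level = match_on_tokens(pattern, text)

print(char_level)   # ['B', 'W', 'NASA']
print(token_level)  # ['NASA']
```

Whitespace splitting would not work for a language such as Chinese, which is why a tokenizer-aware regex engine is useful there.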
Creating regex configurations
Begin by creating a module directory inside your notebook. This is a directory inside the notebook file system that is used temporarily to store the files created by the RBR training. This module directory can be the same directory that you
created and used for dictionary-based entity extraction. Dictionaries and regular expressions can be used in the same training run.
To create the module directory in your notebook, enter the following in a code cell. Note that the name of the module directory can't contain a dash (-).
import os
import watson_nlp
module_folder = "NLP_RBR_Module_2"
os.makedirs(module_folder, exist_ok=True)
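If directory names come from user input or configuration, a small guard can catch a dash before training starts. This is only a sketch: the helper name make_module_dir is hypothetical and not part of Watson NLP, and the demonstration uses a temporary directory rather than the notebook file system.

```python
import os
import tempfile


def make_module_dir(parent, name):
    """Create the module directory, rejecting names that contain a dash."""
    if "-" in name:
        raise ValueError(f"module directory name must not contain '-': {name!r}")
    path = os.path.join(parent, name)
    os.makedirs(path, exist_ok=True)
    return path


# Demonstration in a temporary location so that nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    created = make_module_dir(tmp, "NLP_RBR_Module_2")
    created_ok = os.path.isdir(created)
    try:
        make_module_dir(tmp, "NLP-RBR-Module")
        rejected = False
    except ValueError:
        rejected = True
```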
A regex configuration is a Python dictionary, with the following attributes:
Available attributes in regex configurations, with their values, descriptions of use, and an indication of whether they are required

Attribute | Value | Description | Required
name | string | The name of the regular expression. Matches of the regular expression in the input text are tagged with this name in the output. | Yes
regexes | list of strings (Perl-based regex patterns) | The patterns to match. The list must not be empty. Multiple regexes can be provided. | Yes
flags | delimited string of valid flags | Flags such as UNICODE or CASE_INSENSITIVE control the matching. Flags can be combined. For the supported flags, see Pattern (Java Platform SE 8). | No (defaults to DOTALL)
token_boundary.min | int | token_boundary indicates whether to match the regular expression only on token boundaries. It is specified as a dict object with min and max attributes. | No (returns the longest non-overlapping match at each character position in the input text)
token_boundary.max | int | max is an optional attribute of token_boundary. It is needed when the boundary must extend over a range of tokens (between min and max tokens). token_boundary.max must be >= token_boundary.min. | No (if token_boundary is specified, the min attribute can be specified alone)
groups | list of strings (labels for matching groups) | The index of a label in the list corresponds to the matched group in the pattern: index 0 corresponds to the entire match, and groups are numbered starting with 1. For example, the regex (a)(b) applied to ab with groups ['full', 'first', 'second'] yields full: ab, first: a, second: b. | No (defaults to labeling the full match)
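The way group labels line up with capturing groups can be sketched with Python's standard re module. This is an illustration only: Watson NLP evaluates the patterns with its own engine, and the helper name label_groups is hypothetical.

```python
import re


def label_groups(pattern, text, labels):
    """Map labels to match groups.

    labels[0] names the entire match and labels[i] names capturing group i.
    Returns an empty dict when the pattern does not match.
    """
    m = re.search(pattern, text)
    if m is None:
        return {}
    return {label: m.group(i) for i, label in enumerate(labels)}


# The example from the groups description.
simple = label_groups(r"(a)(b)", "ab", ["full", "first", "second"])

# The same idea applied to a two-word name pattern.
names = label_groups(
    r"([A-Z][a-z]*) ([A-Z][a-z]*)",
    "Bruce Wayne works for NASA",
    ["full name", "first name", "last name"],
)

print(simple)  # {'full': 'ab', 'first': 'a', 'second': 'b'}
print(names)   # {'full name': 'Bruce Wayne', 'first name': 'Bruce', 'last name': 'Wayne'}
```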
The regex configurations can be loaded using the following helper methods:
To load a single regex configuration, use watson_nlp.toolkit.RegexConfig.load(<regex configuration>)
To load multiple regex configurations, use watson_nlp.toolkit.RegexConfig.load_all([<regex configuration>, ...])
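Before passing configurations to the load helpers, it can be useful to check them against the attributes described above. The following sketch is plain Python and does not call Watson NLP. The function name validate_regex_config is hypothetical, and the checks reflect one reading of the attribute descriptions (for example, that min is expected whenever token_boundary is present), not the library's own validation.

```python
def validate_regex_config(config):
    """Return a list of problems found in a regex configuration dict."""
    problems = []

    name = config.get("name")
    if not isinstance(name, str) or not name:
        problems.append("'name' is required and must be a non-empty string")

    regexes = config.get("regexes")
    if not isinstance(regexes, list) or not regexes:
        problems.append("'regexes' is required and must be a non-empty list")
    elif not all(isinstance(r, str) for r in regexes):
        problems.append("every entry in 'regexes' must be a string")

    boundary = config.get("token_boundary")
    if boundary is not None:
        if not isinstance(boundary, dict):
            problems.append("'token_boundary' must be a dict")
        else:
            low = boundary.get("min")
            high = boundary.get("max")
            # Assumption: min is expected whenever token_boundary is given.
            if not isinstance(low, int):
                problems.append("'token_boundary.min' must be an int")
            elif high is not None and (not isinstance(high, int) or high < low):
                problems.append(
                    "'token_boundary.max' must be an int >= 'token_boundary.min'"
                )
    return problems


good = {
    "name": "acronyms",
    "regexes": ["([A-Z]+)"],
    "groups": ["acronym"],
    "token_boundary": {"min": 1, "max": 1},
}
bad = {"name": "broken", "regexes": [], "token_boundary": {"min": 2, "max": 1}}

print(validate_regex_config(good))  # []
print(len(validate_regex_config(bad)))  # 2
```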
Code sample
This sample shows you how to load two different regex configurations. The first configuration detects person names. It uses the groups attribute to allow easy access to the full, first and last name at a later stage.
The second configuration detects acronyms as a sequence of all-uppercase characters. By using the token_boundary attribute, it prevents matches in words that contain both uppercase and lowercase characters.
from watson_nlp.toolkit.rule_utils import RegexConfig
# Load some regex configs, for instance to match full names or acronyms
regexes = RegexConfig.load_all([
    {
        'name': 'full names',
        'regexes': ['([A-Z][a-z]*) ([A-Z][a-z]*)'],
        # Index 0 labels the entire match, 1 and 2 label the capturing groups
        'groups': ['full name', 'first name', 'last name']
    },
    {
        'name': 'acronyms',
        'regexes': ['([A-Z]+)'],
        'groups': ['acronym'],
        # Match exactly one whole token, so mixed-case words are skipped
        'token_boundary': {
            'min': 1,
            'max': 1
        }
    }
])
Training a model that contains regular expressions
After you have loaded the regex configurations, create an RBR model using the RBR.train() method. In the method, specify:
The module directory
The language of the text
The regex configurations to use
This is the same method that is used to train RBR with dictionary-based extraction. You can pass the dictionary configuration in the same method call.
Code sample
# Train the RBR model
custom_regex_block = watson_nlp.resources.feature_extractor.RBR.train(module_path=module_folder, language='en', regexes=regexes)
Applying the model on new data
After you have trained the model with the regex configurations, apply it to new data using the run() method, as you would with any of the existing pre-trained blocks.
Code sample
custom_regex_block.run('Bruce Wayne works for NASA')
To show the matching subgroups or the matched text:
import json
# Get the raw response including matching groups
full_regex_result = custom_regex_block.executor.get_raw_response('Bruce Wayne works for NASA', language='en')
print(json.dumps(full_regex_result, indent=2))