Detecting entities with a custom dictionary | IBM Cloud Pak for Data as a Service

Detecting entities with a custom dictionary

Last updated: July 29, 2024

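If you have a fixed set of terms that you want to detect, such as a list of product names or organizations, you can create a dictionary. Dictionary matching is very fast and resource-efficient.

Watson Natural Language Processing dictionaries offer advanced matching capabilities that go beyond simple string matching, including:

Dictionary terms can consist of a single token (for example, wheel) or multiple tokens (for example, steering wheel).
Dictionary term matching can be either case-sensitive or case-insensitive. With a case-sensitive match, you can ensure that acronyms (for example, ABS) do not match terms in regular language (for example, abs) that have a different meaning.
You can specify how to consolidate matches when multiple dictionary entries match the same text. Given the two dictionary entries Watson and Watson Natural Language Processing, you can configure which entry should match in "I like Watson Natural Language Processing": either only Watson Natural Language Processing, because it contains Watson, or both.
You can specify to match the lemma instead of enumerating all inflections. This way, the single dictionary entry mouse detects both mouse and mice in the text.
You can attach a label to each dictionary entry, for example an Organization category, to include additional metadata in the match.

All of these capabilities can be configured, so you can pick the right options for your use case.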
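
The matching options listed above can be illustrated without the library. The following is a minimal pure-Python sketch, not how Watson Natural Language Processing implements matching: the helpers `find_matches` and `keep_longest` are hypothetical names, and whole-word regular-expression matching stands in for real tokenization. It shows case-sensitive versus case-insensitive matching and one possible consolidation policy (keep only the longest of nested matches).

```python
import re


def find_matches(text, terms, case_sensitive=True):
    """Return sorted (begin, end, term) tuples for whole-word occurrences.

    Hypothetical helper for illustration only; not part of watson_nlp.
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    matches = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags):
            matches.append((m.start(), m.end(), term))
    return sorted(matches)


def keep_longest(matches):
    """Drop every match whose span lies inside a strictly longer match."""
    kept = []
    for begin, end, term in matches:
        contained = any(
            ob <= begin and end <= oe and (oe - ob) > (end - begin)
            for ob, oe, _ in matches
        )
        if not contained:
            kept.append((begin, end, term))
    return kept


text = "I like Watson Natural Language Processing"
entries = ["Watson", "Watson Natural Language Processing"]

both = find_matches(text, entries)   # both entries match
longest_only = keep_longest(both)    # only the longer entry survives

# Case sensitivity: the acronym ABS versus the ordinary word abs.
sample = "ABS brakes help, but abs exercises do not."
abs_exact = find_matches(sample, ["ABS"])
abs_insensitive = find_matches(sample, ["ABS"], case_sensitive=False)

print(both)
print(longest_only)
print(len(abs_exact), len(abs_insensitive))
```

With these inputs, the unconsolidated result contains both entries, while the consolidated result keeps only the longer one. The case-sensitive search finds one occurrence of ABS and the case-insensitive search finds two.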

Types of dictionary files

Watson Natural Language Processing supports two types of dictionary files:

Term list (ending in .dict)

Example of a term list:
```
Arthur
Allen
Albert
Alexa
```

Table (ending in .csv)

Example of a table:
```
"label", "entry"
"ORGANIZATION", "NASA"
"COUNTRY", "USA"
"ACTOR", "Christian Bale"
```

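You can use multiple dictionaries during the same extraction. You can also use both types at the same time, for example, by running a single extraction with three dictionaries: one term list and two tables.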
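
To inspect the two file formats before handing them to the library, the Python standard library is enough. The sketch below only parses the example contents shown above; it makes no claim about how Watson Natural Language Processing itself reads the files. Because the table example puts a space after each comma, `skipinitialspace=True` is assumed to be needed so that the `csv` module treats the quotes as field delimiters.

```python
import csv
import io

# The example contents from this page, held in memory instead of on disk.
term_list_text = "Arthur\nAllen\nAlbert\nAlexa\n"
table_text = (
    '"label", "entry"\n'
    '"ORGANIZATION", "NASA"\n'
    '"COUNTRY", "USA"\n'
    '"ACTOR", "Christian Bale"\n'
)

# A term list is one term per line.
terms = [line.strip() for line in term_list_text.splitlines() if line.strip()]

# A table has a header row; skipinitialspace handles the space after each comma.
reader = csv.DictReader(io.StringIO(table_text), skipinitialspace=True)
rows = list(reader)
label_by_entry = {row["entry"]: row["label"] for row in rows}

print(terms)
print(label_by_entry)
```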

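Creating dictionary files

Begin by creating a module directory inside your notebook. This is a directory in the notebook file system that is used temporarily to store your dictionary files.

To create dictionary files in your notebook:

Create a module directory. Note that the name of the module folder cannot contain any dashes, because this causes errors.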
```
import os
import watson_nlp
module_folder = "NLP_Dict_Module_1"
os.makedirs(module_folder, exist_ok=True)
```

Create dictionary files and store them in the module directory. You can either read in an external list or CSV file, or create dictionary files like this:

```
# Create a term list dictionary
term_file = "names.dict"
with open(os.path.join(module_folder, term_file), 'w') as dictionary:
    dictionary.write('Bruce')
    dictionary.write('\n')
    dictionary.write('Peter')
    dictionary.write('\n')

# Create a table dictionary
table_file = 'Places.csv'
with open(os.path.join(module_folder, table_file), 'w') as places:
    places.write("\"label\", \"entry\"")
    places.write("\n")
    places.write("\"SIGHT\", \"Times Square\"")
    places.write("\n")
    places.write("\"PLACE\", \"5th Avenue\"")
    places.write("\n")
```
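
When the terms already live in Python lists (for example, after reading an external file), the same files can be generated by small helpers. This is a sketch under stated assumptions: `make_module_folder`, `write_term_list`, and `write_table` are hypothetical names, not part of `watson_nlp`. The table layout copies the quoted, comma-and-space style used in the example above, and values that themselves contain double quotes are not handled.

```python
import os
import tempfile


def make_module_folder(parent, name):
    """Create the module folder, rejecting names that contain dashes."""
    if "-" in name:
        raise ValueError("module folder name must not contain dashes: " + name)
    path = os.path.join(parent, name)
    os.makedirs(path, exist_ok=True)
    return path


def write_term_list(folder, filename, terms):
    """Write one term per line (.dict term list)."""
    path = os.path.join(folder, filename)
    with open(path, "w", encoding="utf-8") as f:
        for term in terms:
            f.write(term + "\n")
    return path


def write_table(folder, filename, columns, rows):
    """Write a quoted header row followed by quoted data rows (.csv table)."""
    path = os.path.join(folder, filename)
    with open(path, "w", encoding="utf-8") as f:
        f.write(", ".join('"' + c + '"' for c in columns) + "\n")
        for row in rows:
            f.write(", ".join('"' + v + '"' for v in row) + "\n")
    return path


# A temporary parent directory keeps the sketch from touching real files.
base = tempfile.mkdtemp()
folder = make_module_folder(base, "NLP_Dict_Module_1")
names_path = write_term_list(folder, "names.dict", ["Bruce", "Peter"])
places_path = write_table(
    folder,
    "Places.csv",
    ["label", "entry"],
    [("SIGHT", "Times Square"), ("PLACE", "5th Avenue")],
)
```

Checking for dashes up front turns the error mentioned earlier into an immediate, readable `ValueError` instead of a failure later in the workflow.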

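Loading the dictionaries and configuring matching options

You can load the dictionaries with the following helper methods.

To load a single dictionary, use watson_nlp.toolkit.rule_utils.DictionaryConfig(<dictionary configuration>)
To load multiple dictionaries, use watson_nlp.toolkit.rule_utils.DictionaryConfig.load_all([<dictionary configuration>])

For each dictionary, you need to specify a dictionary configuration. The dictionary configuration is a Python dictionary with the following attributes: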

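Attribute	Value	Description	Required
`name`	string	The name of the dictionary	Yes
`source`	string	The path to the dictionary, relative to `module_folder`	Yes
`dict_type`	file or table	Whether the dictionary artifact is a term list (file) or a table of mappings (table)	No. The default is file.
`consolidate`	ContainedWithin (keep the longest match and deduplicate) / NotContainedWithin (keep the shortest match and deduplicate) / ContainsButNotEqual (keep the longest match but keep duplicate matches) / ExactMatch (deduplicate) / LeftToRight (keep the leftmost longest non-overlapping span)	What to do with dictionary matches that overlap.	No. The default is to not consolidate matches.
`case`	exact / insensitive	Match with exact case or case-insensitively.	No. The default is exact match.
`lemma`	True / False	Match the terms in the dictionary against the lemmas of the text. The dictionary should contain only lemma forms. For example, add `mouse` to the dictionary to match both `mouse` and `mice` in the text. Do not add `mice` to the dictionary. To match terms that consist of multiple tokens in the text, separate the lemmas of those terms in the dictionary with a space character.	No. The default is False.
`mappings.columns` (`columns` as attribute of `mappings: {}`)	list [string]	List of column headers, in the same order as they appear in the table CSV	Yes, if `dict_type: table`
`mappings.entry` (`entry` as attribute of `mappings: {}`)	string	The name of the column header that contains the string to match against the document.	Yes, if `dict_type: table`
`label`	string	The label to attach to matches.	No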
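
The rules in the table can be checked before a configuration is passed to the library. The function below is a hypothetical pre-flight check written only from the table above; `check_dictionary_config` is not part of `watson_nlp`, and the exact spelling and casing of the accepted values (for example `exact` and `insensitive`) is an assumption based on this page.

```python
# Allowed values, copied from the attribute table (assumed spelling).
VALID_DICT_TYPES = {"file", "table"}
VALID_CASE = {"exact", "insensitive"}
VALID_CONSOLIDATE = {
    "ContainedWithin",
    "NotContainedWithin",
    "ContainsButNotEqual",
    "ExactMatch",
    "LeftToRight",
}


def check_dictionary_config(config):
    """Return a list of problems; an empty list means no rule was violated."""
    problems = []
    for key in ("name", "source"):
        if not isinstance(config.get(key), str):
            problems.append("'%s' is required and must be a string" % key)

    dict_type = config.get("dict_type", "file")  # documented default: file
    if dict_type not in VALID_DICT_TYPES:
        problems.append("unknown dict_type: %r" % (dict_type,))

    if config.get("case", "exact") not in VALID_CASE:  # documented default: exact
        problems.append("unknown case option: %r" % (config.get("case"),))

    if "consolidate" in config and config["consolidate"] not in VALID_CONSOLIDATE:
        problems.append("unknown consolidate option: %r" % (config["consolidate"],))

    if dict_type == "table":
        mappings = config.get("mappings", {})
        columns = mappings.get("columns")
        entry = mappings.get("entry")
        if not isinstance(columns, list) or not columns:
            problems.append("'mappings.columns' is required for table dictionaries")
        if not isinstance(entry, str):
            problems.append("'mappings.entry' is required for table dictionaries")
        elif isinstance(columns, list) and entry not in columns:
            problems.append("'mappings.entry' must be one of 'mappings.columns'")
    return problems


good = {
    "name": "places_and_sights_mappings",
    "source": "Places.csv",
    "dict_type": "table",
    "mappings": {"columns": ["label", "entry"], "entry": "entry"},
}
missing_mappings = {"name": "places", "source": "Places.csv", "dict_type": "table"}

print(check_dictionary_config(good))
print(check_dictionary_config(missing_mappings))
```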

Code sample

```
# Load the dictionaries
dictionaries = watson_nlp.toolkit.rule_utils.DictionaryConfig.load_all([{
    'name': 'Names',
    'source': term_file,
    'case': 'insensitive'
}, {
    'name': 'places_and_sights_mappings',
    'source': table_file,
    'dict_type': 'table',
    'mappings': {
        'columns': ['label', 'entry'],
        'entry': 'entry'
    }
}])
```

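Training a model that contains dictionaries

After you have loaded the dictionaries, create a dictionary model and train it with the RBR.train() method. In the method, specify:

The module directory
The language of the dictionary entries
The dictionaries to use

Code sample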

```
custom_dict_block = watson_nlp.resources.feature_extractor.RBR.train(
    module_folder, language='en', dictionaries=dictionaries)
```

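Applying the model to new data

After you have trained the dictionary model, apply it to new data with the run() method, as you would with any of the existing pre-trained blocks.

Code sample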

```
custom_dict_block.run('Bruce is at Times Square')
```

Output of the code sample:

```
{(0, 5): ['Names'], (12, 24): ['SIGHT']}
```
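
Each key in this output is a (begin, end) character offset into the input text, with an exclusive end. The short sketch below recovers the matched strings by slicing. It assumes the result can be treated as a plain dictionary shaped like the printed output; the object actually returned by `run()` may differ, so the literal is copied here rather than produced by the library.

```python
text = 'Bruce is at Times Square'

# Copied from the printed output above; assumed to behave like a plain dict.
result = {(0, 5): ['Names'], (12, 24): ['SIGHT']}

# Slice the input text with each span to recover the matched string.
matches = [
    (text[begin:end], labels)
    for (begin, end), labels in sorted(result.items())
]

for matched_text, labels in matches:
    print(matched_text, labels)
```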

To show the labels or the name of the dictionary:

```
RBR_result = custom_dict_block.executor.get_raw_response('Bruce is at Times Square', language='en')
print(RBR_result)
```

Output showing the labels:

{'annotations': {'View_Names': [{'label': 'Names', 'match': {'location': {'begin': 0, 'end': 5}, 'text': 'Bruce'}}], 'View_places_and_sights_mappings': [{'label': 'SIGHT', 'match': {'location': {'begin': 12, 'end': 24}, 'text': 'Times Square'}}]}, 'instrumentationInfo': {'annotator': {'version': '1.0', 'key': 'Text match extractor for NLP_Dict_Module_1'}, 'runningTimeMS': 3, 'documentSizeChars': 32, 'numAnnotationsTotal': 2, 'numAnnotationsPerType': [{'annotationType': 'View_Names', 'numAnnotations': 1}, {'annotationType': 'View_places_and_sights_mappings', 'numAnnotations': 1}], 'interrupted': False, 'success': True}}
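
The raw response nests matches by view, which is awkward to scan. As a sketch, the hypothetical helper `flatten_annotations` below turns the `annotations` part of the response into a flat, position-sorted list. It relies only on the structure visible in the output above (copied here with `instrumentationInfo` omitted) and makes no claim about other fields the library may return.

```python
# The 'annotations' part of the printed response above, copied verbatim.
raw_response = {
    'annotations': {
        'View_Names': [
            {'label': 'Names',
             'match': {'location': {'begin': 0, 'end': 5}, 'text': 'Bruce'}},
        ],
        'View_places_and_sights_mappings': [
            {'label': 'SIGHT',
             'match': {'location': {'begin': 12, 'end': 24}, 'text': 'Times Square'}},
        ],
    },
}


def flatten_annotations(response):
    """Return (begin, end, label, text, view) tuples sorted by position."""
    rows = []
    for view, annotations in response.get('annotations', {}).items():
        for annotation in annotations:
            location = annotation['match']['location']
            rows.append((
                location['begin'],
                location['end'],
                annotation['label'],
                annotation['match']['text'],
                view,
            ))
    return sorted(rows)


flat = flatten_annotations(raw_response)
for row in flat:
    print(row)
```

In this output, the term list match carries the dictionary name (`Names`) as its label, while the table match carries the value from the `label` column (`SIGHT`).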
