Last updated: Feb 11, 2025
With the Text Link Analysis (TLA) node, the extraction of text link analysis pattern results is automatically enabled. In the node's properties, the expert options include certain additional parameters that impact how text is extracted and handled. The expert parameters control the basic behavior, as well as a few advanced behaviors, of the extraction process. There are also a number of linguistic resources and options that impact the extraction results; these are controlled by the resource template you select.
Limit extraction to concepts with a global frequency of at least [n]. This option specifies the minimum number of times a word or phrase must occur in the text in order for it to be extracted. In this way, a value of 5 limits the extraction to those words or phrases that occur at least five times in the entire set of records or documents.
In some cases, changing this limit can make a big difference in the resulting extraction results, and consequently, your categories. Let's say that you're working with some restaurant data and you don't increase the limit beyond 1 for this option. In this case, you might find pizza (1), thin pizza (2), spinach pizza (2), and favorite pizza (2) in your extraction results. However, if you were to limit the extraction to a global frequency of 5 or more and re-extract, you would no longer get three of these concepts. Instead you would get pizza (7), since pizza is the simplest form and this word already existed as a possible candidate. And depending on the rest of your text, you might actually have a frequency of more than seven, depending on whether there are still other phrases with pizza in the text. Additionally, if spinach pizza was already a category descriptor, you might need to add pizza as a descriptor instead to capture all of the records. For this reason, change this limit with care whenever categories have already been created.
Note that this is an extraction-only feature; if your template contains terms (they usually do), and a term for the template is found in the text, then the term will be indexed regardless of its frequency.
For example, suppose you use a Basic Resources template that includes "los angeles" under the <Location> type in the Core library; if your document contains Los Angeles only once, then Los Angeles will be part of the list of concepts. To prevent this, you'll need to set a filter to display concepts occurring at least the same number of times as the value entered in the Limit extraction to concepts with a global frequency of at least [n] field.
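To make the interplay between the frequency limit and template terms concrete, here's a minimal Python sketch of the filtering idea. The concept names, the TEMPLATE_TERMS set, and the filter_concepts helper are illustrative only; the actual engine groups inflected forms and phrases before it counts anything.

```python
from collections import Counter

MIN_GLOBAL_FREQUENCY = 5          # value of the "global frequency of at least [n]" option
TEMPLATE_TERMS = {"los angeles"}  # terms defined in the resource template are always indexed

def filter_concepts(candidates, min_freq=MIN_GLOBAL_FREQUENCY):
    """Keep candidates that meet the global frequency limit or come from the template."""
    counts = Counter(candidates)
    return {concept: n for concept, n in counts.items()
            if n >= min_freq or concept in TEMPLATE_TERMS}

candidates = ["pizza"] * 7 + ["spinach pizza"] * 2 + ["los angeles"]
print(filter_concepts(candidates))  # {'pizza': 7, 'los angeles': 1}
```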
Accommodate punctuation errors. This option temporarily normalizes text containing punctuation errors (for example, improper usage) during extraction to improve the extractability of concepts. This option is extremely useful when text is short and of poor quality (as, for example, in open-ended survey responses, e-mail, and CRM data), or when the text contains many abbreviations.
Accommodate spelling for a minimum word character length of [n]. This
option applies a fuzzy grouping technique that helps group commonly misspelled words or closely
spelled words under one concept. The fuzzy grouping algorithm temporarily strips all vowels (except
the first one) and strips double/triple consonants from extracted words and then compares them to
see if they're the same so that modelling and modeling would be grouped together. However, if each term is assigned to a different type, excluding the <Unknown> type, the fuzzy grouping technique won't be applied.
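The vowel-stripping and consonant-collapsing step can be pictured with a short Python sketch. This is only an approximation of the idea described above (the fuzzy_key helper is hypothetical), not the product's actual algorithm.

```python
import re

def fuzzy_key(word: str) -> str:
    """Approximate the fuzzy-grouping key: keep the first vowel, drop the rest,
    and collapse doubled or tripled consonants."""
    word = word.lower()
    chars, seen_vowel = [], False
    for ch in word:
        if ch in "aeiou":
            if not seen_vowel:          # keep only the first vowel
                chars.append(ch)
                seen_vowel = True
        else:
            chars.append(ch)
    key = "".join(chars)
    return re.sub(r"([^aeiou])\1+", r"\1", key)   # "ll" -> "l", "nnn" -> "n", ...

print(fuzzy_key("modelling"), fuzzy_key("modeling"))  # both reduce to "modlng", so they group
```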
You can also define the minimum number of root characters required before fuzzy
grouping is used. The number of root characters in a term is calculated by totaling all of the
characters and subtracting any characters that form inflection suffixes and, in the case of
compound-word terms, determiners and prepositions. For example, the term exercises is counted as 8 root characters in the form "exercise," since the letter s at the end of the word is an inflection (plural form). Similarly, apple sauce counts as 10 root characters ("apple sauce") and manufacturing of cars counts as 16 root characters ("manufacturing car"). This method of counting is only used to check whether the fuzzy grouping should be applied but doesn't influence how the words are matched.
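As a rough illustration of how root characters might be counted, the sketch below strips a naive plural "s" and skips a small, made-up list of determiners and prepositions; the real engine relies on its linguistic resources for both.

```python
FUNCTION_WORDS = {"of", "the", "a", "an", "in", "on", "for", "to"}  # illustrative subset only

def root_char_count(term: str) -> int:
    """Count characters after removing inflection suffixes (here: a naive plural 's')
    and, in compound terms, determiners and prepositions."""
    total = 0
    for word in term.lower().split():
        if word in FUNCTION_WORDS:
            continue                       # ignored in compound-word terms
        if word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]               # stand-in for real inflection handling
        total += len(word)
    return total

print(root_char_count("exercises"))              # 8  -> "exercise"
print(root_char_count("apple sauce"))            # 10 -> "apple" + "sauce"
print(root_char_count("manufacturing of cars"))  # 16 -> "manufacturing" + "car"
```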
Note: If you find that certain words are later grouped incorrectly, you can exclude word pairs from
this technique by explicitly declaring them in the Fuzzy Grouping: Exceptions
section under the Advanced Resources properties.
Extract uniterms. This option extracts single words (uniterms) as long as the word isn't already part of a compound word and if it's either a noun or an unrecognized part of speech.
Extract nonlinguistic entities. This option extracts nonlinguistic entities, such as phone numbers, social security numbers, times, dates, currencies, digits, percentages, e-mail addresses, and HTTP addresses. You can include or exclude certain types of nonlinguistic entities in the Nonlinguistic Entities: Configuration section under the Advanced Resources properties. Disabling any unnecessary entities saves processing time for the extraction engine.
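As a rough picture of what nonlinguistic entity extraction does, the sketch below pulls a few entity kinds out of text with simple regular expressions. The patterns are illustrative and are not the engine's own definitions, which live in the Nonlinguistic Entities: Configuration section.

```python
import re

# Illustrative patterns only, not the product's configuration
PATTERNS = {
    "email":   re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "url":     re.compile(r"\bhttps?://\S+\b"),
    "percent": re.compile(r"\b\d+(?:\.\d+)?%"),
}

def extract_nonlinguistic(text):
    """Return the matches found for each (illustrative) entity kind."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}

print(extract_nonlinguistic("Contact sales@example.com or visit https://www.ibm.com (save 15%)"))
```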
Uppercase algorithm. This option extracts simple and compound terms that aren't in the built-in dictionaries as long as the first letter of the term is in uppercase. This option offers a good way to extract most proper nouns.
Group partial and full person names together when possible. This option
groups names that appear differently in the text together. This feature is helpful since names are
often referred to in their full form at the beginning of the text and then only by a shorter
version. This option attempts to match any uniterm with the <Unknown> type to the last word of any of the compound terms that is typed as <Person>. For example, if doe is found and initially typed as <Unknown>, the extraction engine checks to see if any compound terms in the <Person> type include doe as the last word, such as john doe. This option doesn't apply to first names since most are never extracted as uniterms.
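A minimal sketch of that matching step, assuming a hypothetical list of compound terms already typed as <Person>, might look like this:

```python
def match_partial_name(uniterm: str, person_terms: list[str]) -> str | None:
    """Return the full <Person> term whose last word matches an <Unknown> uniterm, if any."""
    for full_name in person_terms:
        if full_name.split()[-1] == uniterm:
            return full_name
    return None

print(match_partial_name("doe", ["john doe", "jane smith"]))  # -> "john doe"
```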
Maximum nonfunction word permutation. This option specifies the maximum
number of nonfunction words that can be present when applying the permutation technique. This
permutation technique groups similar phrases that differ from each other only by the nonfunction
words (for example, of and the) contained, regardless of inflection. For example, let's say that you set this value to at most two words, and both company officials and officials of the company were extracted. In this case, both extracted terms would be grouped together in the final concept list since both terms are deemed to be the same when of the is ignored.
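A compact way to picture the permutation grouping is to key each phrase on its nonfunction (content) words, as in the sketch below. The function-word list and the permutation_key helper are illustrative, and inflection handling is left out for brevity.

```python
MAX_NONFUNCTION_WORDS = 2   # value of "Maximum nonfunction word permutation"
FUNCTION_WORDS = {"of", "the", "a", "an", "and", "in", "on", "for", "to"}  # illustrative subset

def permutation_key(phrase: str):
    """Phrases that differ only by function words get the same key and are grouped."""
    content = [w for w in phrase.lower().split() if w not in FUNCTION_WORDS]
    if len(content) > MAX_NONFUNCTION_WORDS:
        return phrase.lower()            # too many content words: permutation not applied
    return tuple(sorted(content))        # order-insensitive comparison of content words

print(permutation_key("company officials") ==
      permutation_key("officials of the company"))  # True -> grouped into one concept
```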
Use derivation when grouping multiterms. When processing Big Data, select this option to group multiterms by using derivation rules.