Text Analytics relies on linguistics-based text analysis to extract key concepts and ideas
from your responses. This approach offers the speed and cost effectiveness of statistics-based
systems, but with a far higher degree of accuracy and far less human intervention.
Linguistics-based text analysis is grounded in the field of study known as natural language
processing, also called computational linguistics.
Understanding how the extraction process works can help you make key decisions
when fine-tuning your linguistic resources (libraries, types, synonyms, and more). Steps in the
extraction process include:
Converting source data to a standard format
Identifying candidate terms
Identifying equivalence classes and integrating synonyms
Assigning types
Indexing
Matching patterns and extracting events
Step 1. Converting source data to a standard format
In this first step, the data you import is converted to a uniform format that
can be used for further analysis. This conversion is performed internally and does not change your
original data.
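The conversion itself is internal to the product, but the general idea can be illustrated with a minimal sketch. The function below is hypothetical, not the product's implementation: it decodes a raw record to Unicode, normalizes composed characters, and collapses whitespace, leaving the original bytes untouched.

```python
import unicodedata

def to_standard_format(raw: bytes, encoding: str = "utf-8") -> str:
    """Convert one source record to a uniform internal representation.

    Illustrative sketch only: decode to Unicode, normalize composed
    characters (NFC), and collapse runs of whitespace. The original
    data is not modified; only this internal copy is analyzed further.
    """
    text = raw.decode(encoding, errors="replace")
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())

# "e" + combining acute accent normalizes to a single "é" character.
print(to_standard_format("Cafe\u0301  latte".encode("utf-8")))
```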
Step 2. Identifying candidate terms
It is important to understand the role of linguistic resources in the
identification of candidate terms during linguistic extraction. Linguistic resources are used every
time an extraction is run. They exist in the form of templates, libraries, and compiled resources.
Libraries include lists of words, relationships, and other information used to specify or tune the
extraction. The compiled resources cannot be viewed or edited. However, the remaining resources
(templates) can be edited in the Template Editor or, if you're in a Text Analytics Workbench
session, in the Resource editor.
Compiled resources are core, internal components of the extraction engine.
These resources include a general dictionary containing a list of base forms with a part-of-speech
code (noun, verb, adjective, adverb, participle, coordinator, determiner, or preposition). The
resources also include reserved, built-in types used to assign many extracted terms to the following
types: <Location>, <Organization>, or
<Person>.
In addition to those compiled resources, several libraries are delivered with
the product and can be used to complement the types and concept definitions in the compiled
resources, as well as to offer other types and synonyms. These libraries—and any custom ones you
create—are made up of several dictionaries. These include type dictionaries, substitution
dictionaries (synonyms and optional elements), and exclude dictionaries.
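To make the relationship between these dictionaries concrete, here is one possible in-memory shape for a custom library. The structure and all entries are invented for illustration; the product's actual storage format is not documented here.

```python
# Hypothetical in-memory shape of a custom library's dictionaries.
# All names and entries are illustrative, not the product's format.
library = {
    # Type dictionaries: type name -> terms assigned to that type.
    "type_dictionaries": {
        "<Positive>": ["excellent", "reliable", "fast"],
        "<Product>": ["sports car", "sedan"],
    },
    # Substitution dictionary: synonym -> lead term it resolves to.
    "substitution_dictionary": {
        "automobile": "car",
        "vehicle": "car",
    },
    # Exclude dictionary: terms never extracted as concepts.
    "exclude_dictionary": ["misc", "n/a"],
}

print(library["substitution_dictionary"]["automobile"])
```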
After the data is imported and converted, the extraction engine begins
identifying candidate terms for extraction. Candidate terms are words or groups of words that are
used to identify concepts in the text. During processing, single words
(uni-terms) that are not in the compiled resources are treated as candidate
extractions. Candidate compound words (multi-terms) are identified using part-of-speech
pattern extractors. For example, the multi-term sports car, which follows the
adjective noun part-of-speech pattern, has two components. The multi-term fast
sports car, which follows the adjective adjective noun part-of-speech pattern,
has three components.
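A part-of-speech pattern extractor of this kind can be sketched in a few lines. This hand-rolled matcher is not the product's extractor; it simply scans a tagged sentence for windows whose tag sequence matches the two patterns described above.

```python
# Illustrative sketch (not the product's extractor): scan a
# part-of-speech-tagged sentence for multi-term candidates matching
# the "adjective noun" and "adjective adjective noun" patterns.
def find_multiterms(tagged):
    patterns = [("adj", "noun"), ("adj", "adj", "noun")]
    found = []
    for i in range(len(tagged)):
        for pat in patterns:
            window = tagged[i:i + len(pat)]
            if [tag for _, tag in window] == list(pat):
                found.append(" ".join(word for word, _ in window))
    return found

sentence = [("fast", "adj"), ("sports", "adj"), ("car", "noun")]
print(find_multiterms(sentence))  # ['fast sports car', 'sports car']
```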
Note: The terms in the aforementioned compiled general dictionary represent a list of all of the
words that are likely to be uninteresting or linguistically ambiguous as uni-terms. These words are
excluded from extraction when you are identifying the uni-terms. However, they are reevaluated when
you are determining parts of speech or looking at longer candidate compound words
(multi-terms).
Finally, a special algorithm is used to handle uppercase letter strings, such
as job titles, so that these special patterns can be extracted.
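The product's algorithm for uppercase strings is not published, but the kind of pattern it targets can be shown with a simple regular expression. This sketch is illustrative only: it pulls runs of all-uppercase words out of text so they can be kept as candidate terms rather than discarded.

```python
import re

# Illustrative only, not the product's algorithm: match runs of
# two-or-more-letter uppercase words such as "CEO" or "VP OF SALES".
UPPER_RUN = re.compile(r"\b[A-Z]{2,}(?:\s+[A-Z]{2,})*\b")

text = "She was promoted to VP OF SALES after the CEO retired."
print(UPPER_RUN.findall(text))  # ['VP OF SALES', 'CEO']
```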
Step 3. Identifying equivalence classes and integrating synonyms
After candidate uni-terms and multi-terms are identified, the software uses a
set of algorithms to compare them and identify equivalence classes. An equivalence class is a base
form of a phrase, or a single form chosen to represent two variants of the same phrase. The purpose of assigning
phrases to equivalence classes is to ensure that, for example, president of the
company and company president are not treated as separate concepts. To
determine which concept to use for the equivalence class (that is, whether president of the
company or company president is used as the lead term), the extraction
engine applies the following rules in the order listed:
The user-specified form in a library.
The most frequent form in the full body of text.
The shortest form in the full body of text (which usually corresponds to the
base form).
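These three rules can be expressed compactly in code. The following is a minimal sketch under the assumptions stated in its comments, not the engine's implementation: a user-specified form wins outright; otherwise the most frequent form is chosen, with ties broken by the shortest form.

```python
# Sketch of the lead-term selection rules above, applied in order:
# 1. a user-specified form from a library wins outright;
# 2. otherwise the most frequent form in the body of text;
# 3. ties broken by the shortest form (usually the base form).
# Function and argument names are illustrative.
def choose_lead_term(variants, frequencies, user_form=None):
    if user_form in variants:
        return user_form
    # A tuple key realizes "most frequent, tie-broken by shortest":
    # max frequency first, then the largest negative length.
    return max(variants, key=lambda v: (frequencies.get(v, 0), -len(v)))

variants = ["president of the company", "company president"]
freqs = {"president of the company": 4, "company president": 4}
print(choose_lead_term(variants, freqs))  # company president
```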
Step 4. Assigning types
Next, types are assigned to extracted concepts. A type is a semantic grouping
of concepts. Both compiled resources and the libraries are used in this step. Types include such
things as higher-level concepts, positive and negative words, first names, places, organizations,
and more. Additional types can be defined by the user.
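In essence, type assignment is a lookup of each extracted concept against the type dictionaries. The sketch below is illustrative only; the dictionary contents and the fallback type name are invented, and the real engine consults both compiled resources and libraries.

```python
# Minimal sketch of type assignment: look each extracted concept up
# in the type dictionaries. All entries and the "<Unknown>" fallback
# are invented for illustration.
TYPE_DICTIONARIES = {
    "<Person>": {"alice", "bob"},
    "<Location>": {"paris", "berlin"},
    "<Positive>": {"excellent", "reliable"},
}

def assign_type(concept, default="<Unknown>"):
    for type_name, members in TYPE_DICTIONARIES.items():
        if concept.lower() in members:
            return type_name
    return default

print(assign_type("Paris"))     # <Location>
print(assign_type("reliable"))  # <Positive>
```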
Step 5. Indexing
The entire set of records or documents is indexed by
establishing a pointer between a text position and the representative term for each equivalence
class. In this way, all inflected-form instances of a candidate concept are indexed under the
candidate base form, and the global frequency is calculated for each base form.
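The indexing step can be pictured as building two structures: a map from each lead term to the positions of its variants, and a frequency tally. This sketch is illustrative; the variant-to-lead-term map and the position scheme are simplified assumptions.

```python
from collections import defaultdict

# Sketch of Step 5: record, for each equivalence class's lead term,
# the positions where any of its variants occurs, and tally a global
# frequency. The variant -> lead-term map is invented for illustration.
lead_term = {
    "company president": "company president",
    "president of the company": "company president",
}

def build_index(documents):
    index = defaultdict(list)    # lead term -> [(doc_id, position)]
    frequency = defaultdict(int)  # lead term -> global frequency
    for doc_id, text in enumerate(documents):
        for variant, lead in lead_term.items():
            pos = text.find(variant)
            if pos != -1:
                index[lead].append((doc_id, pos))
                frequency[lead] += 1
    return index, frequency

docs = ["the company president spoke",
        "the president of the company resigned"]
index, freq = build_index(docs)
print(freq["company president"])  # 2
```

Both variants count toward the same base form, so the global frequency reflects the equivalence class, not any single surface form.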
Step 6. Matching patterns and extracting events
Text Analytics can discover not only types and concepts but also relationships
among them. Several algorithms and libraries are available with this tool and provide the ability to
extract relationship patterns between types and concepts. They are particularly useful when
attempting to discover specific opinions (for example, product reactions) or the relational links
between people or objects (for example, links between political groups or genomes).
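A toy illustration of what relationship-pattern extraction produces: a simple "X likes/dislikes Y" pattern linking a person concept to an opinion target. The pattern and names are invented; the product's pattern libraries are far richer.

```python
import re

# Toy illustration of relationship-pattern extraction: link a subject
# concept to an opinion target via an invented "X likes/dislikes Y"
# pattern. Not the product's pattern library.
PATTERN = re.compile(r"(\w+) (likes|dislikes) (the )?(\w+)")

def extract_relations(text):
    return [(m.group(1), m.group(2), m.group(4))
            for m in PATTERN.finditer(text)]

text = "Alice likes the interface. Bob dislikes the battery."
print(extract_relations(text))
# [('Alice', 'likes', 'interface'), ('Bob', 'dislikes', 'battery')]
```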