Setting options for the Text Analytics Workbench (SPSS Modeler)
You can customize different parts of the extraction process while in the Text Analytics
Workbench. On the Concepts, Text links, and
Categories tabs, you can access several workbench settings to change how
terms are extracted from the text data.
Settings for extraction results
When you run the Text Mining node, the extraction engine reads through the text data, identifies
the relevant concepts, and assigns a type to each. You can change the settings for the extraction
process to tune how extraction results are created.
From the Concepts or Text links tab, click the
Settings icon to change the settings for extracting concepts, patterns, and text
links.
Enable Text Link Analysis pattern extraction
If you have text link analysis (TLA) rules in one of your libraries, select the checkbox to
extract TLA patterns from your text data. This option can significantly lengthen the extraction
time.
Limit extraction to concepts with a global frequency of at least
You can use this option to extract a term as a concept only if the term appears a set number of
times in the text data.
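The frequency threshold can be pictured as a simple count-and-filter step. The following is a minimal sketch, not the extraction engine's implementation; the function name filter_by_global_frequency and the sample terms are hypothetical.

```python
from collections import Counter

def filter_by_global_frequency(terms, min_frequency):
    """Keep only terms whose total count across all records meets the threshold."""
    counts = Counter(terms)
    return {term: n for term, n in counts.items() if n >= min_frequency}

# Hypothetical terms already pulled from every record in the text data.
extracted = ["price", "battery", "price", "screen", "battery", "price"]
concepts = filter_by_global_frequency(extracted, min_frequency=2)
print(concepts)
```

With a threshold of 2, screen is dropped because it appears only once in the whole data set.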
Accommodate punctuation errors
This option temporarily normalizes text that has punctuation errors to improve the
extractability of concepts during the extraction process. This option is useful when a text is short
and of poor quality. For example, text data from open-ended survey responses, email, and CRM data can
have improper punctuation. It is also useful when the text contains many abbreviations.
Accommodate spelling for a minimum root character limit
This option applies a fuzzy grouping technique that helps group commonly misspelled words or
closely-spelled words under one concept. The fuzzy grouping algorithm temporarily strips all vowels
(except the first one) and strips double/triple consonants from extracted words. It then compares
the extracted words to see whether they are the same. For example,
modeling and modelling are grouped
together. However, if each term is assigned to a different type, excluding the
<Unknown> type, the fuzzy grouping technique is not applied.
Note: This technique does not work effectively with text data that is written in Japanese. Written
Japanese relies on context for grammatical functions like number and gender, so words often have the
same form despite different uses.
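The fuzzy grouping described above can be approximated in a few lines. This is a sketch under stated assumptions: "the first one" is read as the first vowel in the word, repeated consonants are collapsed before vowels are stripped, and the names fuzzy_key and group_spellings are hypothetical. The engine's actual rules, including the minimum root character limit, are not modeled.

```python
import re
from collections import defaultdict

VOWELS = set("aeiou")

def fuzzy_key(word):
    """Build a comparison key: collapse repeated consonants, keep only the first vowel."""
    word = word.lower()
    # Collapse a run of the same consonant (double or triple) into one letter.
    word = re.sub(r"([b-df-hj-np-tv-z])\1+", r"\1", word)
    key = []
    seen_vowel = False
    for ch in word:
        if ch in VOWELS:
            if seen_vowel:
                continue  # drop every vowel after the first one
            seen_vowel = True
        key.append(ch)
    return "".join(key)

def group_spellings(words):
    """Group words that share the same fuzzy key."""
    groups = defaultdict(list)
    for word in words:
        groups[fuzzy_key(word)].append(word)
    return dict(groups)

groups = group_spellings(["modeling", "modelling", "model"])
print(groups)
```

Both modeling and modelling reduce to the key modlng, so they land in the same group, while model keeps its own key.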
Extract uniterms
You can use this option to extract single words (uniterms) as concepts when they meet the
following criteria:
The word is not already part of a compound word
The word is either a noun or an unrecognized part of speech
Extract non-linguistic entities
This option extracts non-linguistic entities, such as the following entities:
Phone numbers
Social security numbers
Times
Dates
Currencies
Percentages
Email addresses
HTTP addresses
You can include or exclude certain types of non-linguistic entities. Disabling any
unnecessary entities saves processing time for the extraction engine.
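To illustrate why disabling entity types saves work, the sketch below runs one pattern per enabled entity type. The regular expressions are deliberately simple, hypothetical stand-ins; the extraction engine's own rules are not documented here, and the names PATTERNS and extract_entities are assumptions.

```python
import re

# Hypothetical, simplified patterns for three of the entity types listed above.
PATTERNS = {
    "Email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "Percentage": re.compile(r"\d+(?:\.\d+)?%"),
    "HTTP": re.compile(r"https?://\S+"),
}

def extract_entities(text, enabled):
    """Scan the text only for the entity types that are enabled."""
    found = {}
    for name in enabled:
        matches = PATTERNS[name].findall(text)
        if matches:
            found[name] = matches
    return found

text = "Contact jane@example.com for the 15% discount."
print(extract_entities(text, ["Email", "Percentage"]))
print(extract_entities(text, ["Percentage"]))
```

Each disabled type is one fewer scan over the text, which is the saving the option describes.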
Uppercase algorithm
This option extracts simple and compound terms that aren't in the built-in dictionaries as long
as the first letter of the term is in uppercase. This option can be useful if you want to extract
most proper nouns.
Group partial and full person names together when possible
This option groups together names that appear differently in the text. This feature is helpful
since names are often referred to in their full form at the beginning of the text and then only by a
shorter version. This option attempts to match any uniterm with the <Unknown>
type to the last word of any of the compound terms that is typed as <Person>.
For example, if doe is found and initially typed as <Unknown>, the
extraction engine checks to see if any compound terms in the <Person> type
include doe as the last word, such as john doe. This option doesn't apply
to first names since most are never extracted as uniterms.
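The last-word matching described above can be sketched as follows. This is an illustration only; the function name group_person_names and the sample names are hypothetical, and type assignment is assumed to have happened already.

```python
def group_person_names(unknown_uniterms, person_compounds):
    """Map each <Unknown> uniterm to the <Person> compound terms that end with it."""
    groups = {}
    for uniterm in unknown_uniterms:
        matches = [name for name in person_compounds if name.split()[-1] == uniterm]
        if matches:
            groups[uniterm] = matches
    return groups

# "doe" matches the last word of "john doe"; "john" is a first word, so it is not grouped.
result = group_person_names(["doe", "john"], ["john doe", "mary smith"])
print(result)
```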
Maximum nonfunction word permutation
This option specifies the maximum number of nonfunction words that can be present when applying
the permutation technique. This permutation technique groups similar phrases that differ from each
other only by the nonfunction words (for example, of and the) contained,
regardless of inflection. For example, let's say you set this value to at most two words and both
company officials and officials of the company were extracted. In this
case, both extracted terms would be grouped together in the final concept list since both terms are
deemed to be the same when of the is ignored.
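The grouping in this example can be sketched by ignoring of and the, then comparing the remaining words regardless of order. The sketch makes two assumptions that the text does not confirm: the limit is applied to the number of words that remain after of and the are ignored, and inflection is not handled. The names permutation_key and group_by_permutation are hypothetical.

```python
IGNORED = {"of", "the"}  # the example words that the text says are ignored

def permutation_key(phrase, max_words):
    """Return an order-independent key, or None when the phrase exceeds max_words."""
    words = [w for w in phrase.lower().split() if w not in IGNORED]
    if len(words) > max_words:
        return None
    return tuple(sorted(words))

def group_by_permutation(phrases, max_words=2):
    """Group phrases whose remaining words are the same in any order."""
    groups = {}
    for phrase in phrases:
        key = permutation_key(phrase, max_words)
        if key is None:
            key = (phrase,)  # phrases over the limit stay on their own
        groups.setdefault(key, []).append(phrase)
    return list(groups.values())

grouped = group_by_permutation(
    ["company officials", "officials of the company", "senior company officials"]
)
print(grouped)
```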
Use derivation when grouping multiterms
When processing big data, select this option to group multiterms by using derivation rules.
Settings for categories
Categories are built from descriptors that are derived from either types or type patterns. In the
table, you can select the individual types or type patterns to include in the category building
process.
From the Categories tab, go to
Build > Change settings to
change the following settings.
Build categories from
If you select Types, the categories are built from the concepts that
belong to the selected types. So, if you select the <Budget> type in
the table, categories such as cost or
price might be produced since cost and
price are concepts that are assigned to the
<Budget> type.
By default, only the types that capture the most
records or documents are selected. This pre-selection allows you to quickly focus on the most
interesting types and avoid building uninteresting categories. The table displays the types in
descending order starting with the one with the greatest number of records or documents (Doc.
count).
The input that you choose affects the categories that you obtain. When you choose to
use Types as input, you can see the clearly related concepts more easily. For example, if you build
categories by using Types as input, you might obtain a category Fruit
with concepts such as apple, pear,
citrus fruits, and orange. If you choose
Type Patterns as input instead and select the pattern <Unknown> +
<Positive>, for example, then you might get a category fruit +
<Positive> with one or two kinds of fruit such as fruit +
tasty and apple + good. This second result shows only two
concept patterns because the other occurrences of fruit are not necessarily positively qualified.
While this might work for your current text data, in longitudinal studies where you use different
document sets, you might want to manually add in other descriptors such as citrus
fruit + positive or use types. Using types alone as an input helps you to find all
possible fruit.
If you select Type Patterns, categories are built from
patterns rather than types and concepts on their own. Any records or documents containing a concept
pattern that belongs to the selected type pattern are categorized. So, if you select the
<Budget> and <Positive> type pattern
in the table, categories such as cost & <Positive> or
rates & excellent might be produced.
When using type patterns
as input for automated category building, sometimes the techniques identify multiple ways to form
the category structure. Technically, there is no single right way to produce the categories; however,
you might find one structure more suited to your analysis than another. To help customize the output
in this case, you can designate a type as the preferred focus. All the top-level categories produced
will come from a concept of the type you select here (and no other type). Every subcategory will
contain a text link pattern from this type. Choose this type in the Structure categories by pattern
type: field and the table will be updated to show only the applicable patterns containing the
selected type. More often than not, <Unknown> is preselected for
you. When <Unknown> is selected, it results in all of the patterns
containing the type <Unknown> being selected. The table displays the
types in descending order, starting with the one with the greatest number of records or documents
(Doc. count).
Techniques
Because every dataset is unique, the number of methods and the order in which you apply them
might change over time. Your goals for text mining might be different from one set of data to the
next, so you might need to experiment with different techniques to see which one produces the best
results with your text data.
You do not need to be an expert in these settings to use them. By
default, the most commonly used settings are already selected. Therefore, you can bypass the
advanced setting dialogs and go straight to building your categories. Likewise, if you make changes
here, you do not have to come back to the settings dialog each time since the latest settings are
always retained.
Select one of the following techniques and then click Advanced
settings. None of the automatic techniques can perfectly categorize your data. You might
need to find and apply one or more automatic techniques that work well with your data. You can't
build by using linguistic and frequency techniques simultaneously.
Select Unused extraction results if you want categories to be built from
extraction results that aren't used in any existing categories. This option minimizes the tendency
for records to match multiple categories and limits the number of categories produced. Or select
All extraction results if you want categories to be built using any of the
extraction results. This option is most useful when you have no or few categories already.
Each of
the grouping techniques fits certain types of data and situations best. It's often helpful to
combine techniques in the same analysis to capture the full range of documents or records. You might
see a concept in multiple categories or find redundant categories.
The concept
inclusion technique builds categories by grouping multiterm concepts (compound words)
based on whether the words in one concept are a subset or superset of the words in another. For
example, the concept seat is grouped with safety seat, seat belt, and seat belt buckle.
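The seat example can be reproduced with a plain word-set comparison. This is a minimal sketch of the subset idea only, not the product's algorithm; the function name concept_inclusion_groups is hypothetical.

```python
def concept_inclusion_groups(concepts):
    """Group each concept with the concepts whose words strictly contain its words."""
    word_sets = {c: set(c.split()) for c in concepts}
    groups = {}
    for base, base_words in word_sets.items():
        # "<" on sets is a strict-subset test.
        members = [c for c, words in word_sets.items() if base_words < words]
        if members:
            groups[base] = members
    return groups

concepts = ["seat", "safety seat", "seat belt", "seat belt buckle", "engine"]
inclusion = concept_inclusion_groups(concepts)
print(inclusion)
```

In this sketch seat collects all three compound terms, and seat belt also collects seat belt buckle, which shows how nested groupings can arise.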
The semantic
network technique begins by identifying the possible senses of each concept from its
extensive index of word relationships and then creates categories by grouping related concepts. For
example, the concepts scuba diving, sailing, snorkeling,
kayaking, and white water kayaking might all be grouped in the category
sports/sports by type/water sports. Or the concept animal might be
grouped with cat and kangaroo since they're hyponyms of animal. The
semantic network technique works best when the concepts are known to the
semantic network and are not too ambiguous. It is less useful when the text contains specialized
terminology or jargon unknown to the network. This technique is available for English text only.
The Maximum search
distance option is only available if you select the semantic network technique. Select
how far you want the techniques to search before they produce categories. The lower the value, the
fewer results you might get. However, these results are less noisy and are more likely to be
significantly linked or associated with each other. The higher the value, the more results you might
get. However, these results might be less reliable or relevant. While this option is globally
applied to all techniques, its effect is greatest on co-occurrences and semantic
networks.
Select Prevent pairing of specific concepts if you want to
stop the process from grouping or pairing two concepts together in the output. To create or manage
concept pairs, click Manage pairs.
Where possible
Choose whether to extend or generalize the descriptors by using wildcards, or both.
Extend and generalize
This option extends the selected categories and then generalizes the descriptors. When you
choose to generalize, the category building process creates generic category rules that use the
asterisk wildcard. For example, instead of multiple descriptors such as [apple tart +
.] and [apple sauce + .], a generic category rule might use wildcards to
produce [apple * + .]. If you generalize with wildcards, you often get the same
number of records or documents as you did before. However, this option has the advantage of reducing
the number of category descriptors and simplifying them. Also, this option increases the ability to
categorize more records or documents by using these categories on new text data (for example, in
longitudinal or wave studies).
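The apple example can be sketched as collapsing descriptors that share a first word into one wildcard rule. The criteria the product uses to decide when to generalize are not described here, so the shared-first-word rule and the function name generalize are assumptions made for illustration.

```python
from collections import defaultdict

def generalize(concepts):
    """Replace concepts that share a first word with one wildcard rule."""
    by_head = defaultdict(list)
    for concept in concepts:
        by_head[concept.split()[0]].append(concept)
    rules = []
    for head, members in by_head.items():
        if len(members) > 1:
            rules.append(f"[{head} * + .]")  # one generic rule covers all members
        else:
            rules.append(f"[{members[0]} + .]")  # nothing to generalize
    return rules

rules = generalize(["apple tart", "apple sauce", "pear"])
print(rules)
```

The wildcard rule would also match a concept such as apple pie in new text data, which is why generalized rules can categorize more records in later waves.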
Extend only
This option extends your categories without generalizing. It can be helpful to first choose the
Extend only option for manually-created categories and then extend the same
categories again using the Extend and generalize option.
Generalize only
This option generalizes the descriptors without extending your categories in any other way.
Maximum number of items to extend a descriptor by
When you extend a descriptor with items (concepts, types, and other expressions), define the
maximum number of items that can be added to a single descriptor. If you set this limit to 10, then
no more than 10 extra items can be added to an existing descriptor. If there are more than 10 items
to be added, the techniques stop adding new items after the tenth is added. Doing so can make a
descriptor list shorter but doesn't guarantee that the most interesting items were used first.
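The cap behaves like a simple cutoff on the list of candidate items. In this sketch, the function name extend_descriptor and the item names are hypothetical, and candidates are taken in the order given, with no ranking by interest.

```python
def extend_descriptor(existing, candidates, max_items=10):
    """Add at most max_items new items to a descriptor, in candidate order."""
    new_items = [c for c in candidates if c not in existing]
    return existing + new_items[:max_items]

existing = ["apple"]
candidates = [f"item{i}" for i in range(1, 13)]  # 12 hypothetical candidates
extended = extend_descriptor(existing, candidates, max_items=10)
print(len(extended), extended[-1])
```

Because the cutoff is positional, item11 and item12 are left out even if they would have been more useful than earlier items.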
Also extend subcategories
This option extends any subcategories that are included in the selected categories.
Extend empty categories with descriptors generated from the category name
This method applies only to empty categories, which have 0 descriptors. If a category already
contains descriptors, it is not extended in this way. This option attempts to automatically create
descriptors for each category based on the words that make up the name of the category. The category
name is scanned to see whether words in the name match any extracted concepts. If a concept is
recognized, it's used to find matching concept patterns, and both are used to form descriptors
for the category. This option produces the best results when the category names are both long and
descriptive. It is a quick method for generating category descriptors, which in turn enable the
category to capture records that contain those descriptors. This option is most useful when you
import categories from somewhere else or when you create categories manually with long descriptive
names.
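The name-scanning step can be sketched as matching the words of a category name against the extracted concepts. This simplified version covers only single-word concepts and skips the pattern lookup; the function name descriptors_from_name and the sample data are hypothetical.

```python
def descriptors_from_name(category_name, existing_descriptors, extracted_concepts):
    """Return descriptors for an empty category from the words in its name."""
    if existing_descriptors:
        # Categories that already have descriptors are not extended this way.
        return list(existing_descriptors)
    words = category_name.lower().split()
    return [w for w in words if w in extracted_concepts]

extracted = {"battery", "screen", "price"}
print(descriptors_from_name("Battery and screen problems", [], extracted))
print(descriptors_from_name("Battery and screen problems", ["charger"], extracted))
```

A longer, more descriptive name gives the scan more words to match, which is consistent with the advice above about category names.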
Generate descriptors as
This option applies only if the preceding option is selected. Choose the
Concepts option to produce the resulting descriptors in the form of concepts,
regardless of whether they have been extracted from the source text. Or choose the
Patterns option to produce the resulting descriptors in the form of patterns,
regardless of whether the resulting patterns or any patterns have been extracted.