Setting options
Last updated: Nov 22, 2024
Setting options for the Text Analytics Workbench (SPSS Modeler)

You can customize different parts of the extraction process while in the Text Analytics Workbench. On the Concepts, Text links, and Categories tabs, you can access several workbench settings to change how terms are extracted from the text data.

Settings for extraction results

When you run the Text Mining node, the extraction engine reads through the text data, identifies the relevant concepts, and assigns a type to each. You can change the settings for the extraction process to tune how extraction results are created.

From the Concepts or Text links tab, click the Settings icon to change the settings for extracting concepts, patterns, and text links.

Enable Text Link Analysis pattern extraction
If you have text link analysis (TLA) rules in one of your libraries, select the checkbox to extract TLA patterns from your text data. This option can significantly lengthen the extraction time.
Limit extraction to concepts with a global frequency of at least
You can use this option to extract a term as a concept only if the term appears a set number of times in the text data.
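
The effect of a global frequency limit can be sketched in a few lines. This is only an illustration under stated assumptions, not the extraction engine's implementation: the function name `filter_by_global_frequency` and the sample terms are hypothetical, and candidate terms are assumed to be already collected from all records.

```python
from collections import Counter

def filter_by_global_frequency(terms, min_frequency):
    """Keep only terms whose total count across all text data meets the threshold."""
    counts = Counter(terms)
    return {term: n for term, n in counts.items() if n >= min_frequency}

# Hypothetical candidate terms gathered from every record in the data.
candidates = ["price", "service", "price", "delivery", "price", "service"]
kept = filter_by_global_frequency(candidates, min_frequency=2)
print(kept)
```

With a limit of 2, the term delivery is dropped because it appears only once.
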
Accommodate punctuation errors
This option temporarily normalizes text that has punctuation errors to improve the extractability of concepts during the extraction process. This option is useful when the text is short and of poor quality. For example, text data from open-ended survey responses, email, and CRM data can have improper punctuation. It is also useful when the text contains many abbreviations.
Accommodate spelling for a minimum root character limit
This option applies a fuzzy grouping technique that helps group commonly misspelled or closely spelled words under one concept. The fuzzy grouping algorithm temporarily strips all vowels (except the first one) and strips double/triple consonants from extracted words. It then compares the stripped forms to see whether they are the same. For example, modeling and modelling are grouped together. However, if each term is assigned to a different type, excluding the <Unknown> type, the fuzzy grouping technique is not applied.
Note: This technique does not work effectively with text data that is written in Japanese. Written Japanese relies on context for grammatical functions like number and gender, so words often have the same form despite different uses.
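
The vowel and consonant stripping that fuzzy grouping relies on can be approximated as follows. This is a rough sketch, not the product's algorithm: `fuzzy_key` and `group_fuzzy` are hypothetical names, "the first one" is assumed to mean the first vowel in the word, only a, e, i, o, u count as vowels, and the type check and minimum root character limit are left out.

```python
import re
from collections import defaultdict

VOWELS = set("aeiou")  # assumption: y is not treated as a vowel

def fuzzy_key(word):
    """Drop every vowel after the first, then collapse repeated consonants."""
    kept = []
    seen_vowel = False
    for ch in word.lower():
        if ch in VOWELS:
            if seen_vowel:
                continue  # strip all vowels except the first one
            seen_vowel = True
        kept.append(ch)
    # Collapse double/triple consonants into a single letter.
    return re.sub(r"([^aeiou])\1+", r"\1", "".join(kept))

def group_fuzzy(words):
    """Group words that share the same stripped form."""
    groups = defaultdict(list)
    for w in words:
        groups[fuzzy_key(w)].append(w)
    return dict(groups)

print(group_fuzzy(["modeling", "modelling", "model"]))
```

Both modeling and modelling reduce to the same key, so they land in one group, while model stays separate.
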
Extract uniterms
You can use this option to extract single words (uniterms) as concepts when they meet the following criteria:
  • The word is not already part of a compound word
  • The word is either a noun or an unrecognized part of speech
Extract non-linguistic entities
This option extracts non-linguistic entities, such as the following entities:
  • Phone numbers
  • Social security numbers
  • Times
  • Dates
  • Currencies
  • Percentages
  • Email addresses
  • HTTP addresses

You can include or exclude certain types of non-linguistic entities. Disabling any unnecessary entities saves processing time during extraction.

Uppercase algorithm
This option extracts simple and compound terms that aren't in the built-in dictionaries as long as the first letter of the term is in uppercase. This option can be useful if you want to extract most proper nouns.
Group partial and full person names together when possible
This option groups together names that appear differently in the text. This feature is helpful because names are often given in their full form at the beginning of the text and then referred to only by a shorter version. This option attempts to match any uniterm with the <Unknown> type to the last word of any compound term that is typed as <Person>. For example, if doe is found and initially typed as <Unknown>, the extraction engine checks whether any compound terms in the <Person> type include doe as the last word, such as john doe. This option doesn't apply to first names because most are never extracted as uniterms.
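
The last-word matching described for person names can be sketched like this. It is a simplified illustration only: the function `regroup_unknown_surnames` and the dictionary layout (concept text mapped to a type name) are assumptions made for the example, not part of the product.

```python
def regroup_unknown_surnames(concepts):
    """Map each <Unknown> uniterm to a <Person> compound term that ends with it.

    `concepts` is a hypothetical dict of concept text -> type name.
    """
    person_by_last_word = {}
    for text, type_name in concepts.items():
        words = text.split()
        if type_name == "Person" and len(words) > 1:
            # Keep the first full name seen for each last word.
            person_by_last_word.setdefault(words[-1], text)

    grouped = {}
    for text, type_name in concepts.items():
        is_uniterm = " " not in text
        if type_name == "Unknown" and is_uniterm and text in person_by_last_word:
            grouped[text] = person_by_last_word[text]
    return grouped

concepts = {
    "john doe": "Person",
    "doe": "Unknown",
    "john": "Unknown",
    "invoice": "Unknown",
}
print(regroup_unknown_surnames(concepts))
```

Here doe is grouped with john doe, while john is left alone because it is not the last word of the full name.
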
Maximum nonfunction word permutation
This option specifies the maximum number of nonfunction words that can be present when applying the permutation technique. This permutation technique groups similar phrases that differ from each other only by the nonfunction words (for example, of and the) contained, regardless of inflection. For example, let's say you set this value to at most two words and both company officials and officials of the company were extracted. In this case, both extracted terms would be grouped together in the final concept list since both terms are deemed to be the same when of the is ignored.
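
One possible reading of the permutation technique is sketched below. Several points are assumptions: the ignored-word list contains only the two words from the example, the limit is applied to the number of ignored words in a phrase, inflection handling is omitted, and the names `permutation_key` and `group_permutations` are hypothetical.

```python
from collections import defaultdict

IGNORED_WORDS = {"of", "the"}  # illustrative list taken from the example only

def permutation_key(phrase, max_ignored):
    """Return an order-independent key, or None if too many ignored words are present."""
    words = phrase.lower().split()
    ignored = [w for w in words if w in IGNORED_WORDS]
    if len(ignored) > max_ignored:
        return None
    return tuple(sorted(w for w in words if w not in IGNORED_WORDS))

def group_permutations(phrases, max_ignored=2):
    """Group phrases that contain the same remaining words in any order."""
    groups = defaultdict(list)
    for phrase in phrases:
        key = permutation_key(phrase, max_ignored)
        if key is None:
            key = (phrase.lower(),)  # not eligible: keep the phrase on its own
        groups[key].append(phrase)
    return list(groups.values())

phrases = ["company officials", "officials of the company", "company policy"]
print(group_permutations(phrases, max_ignored=2))
```

With a limit of two, company officials and officials of the company fall into the same group. With a limit of one, the second phrase is no longer eligible and stays separate.
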
Use derivation when grouping multiterms
When processing big data, select this option to group multiterms by using derivation rules.

Settings for categories

Categories are built from descriptors that are derived from either types or type patterns. In the table, you can select the individual types or type patterns to include in the category building process.

From the Categories tab, go to Build > Change settings to change the following settings.

Build categories from
If you select Types, the categories are built from the concepts that belong to the selected types. So, if you select the <Budget> type in the table, categories such as cost or price might be produced since cost and price are concepts that are assigned to the <Budget> type.

By default, only the types that capture the most records or documents are selected. This pre-selection allows you to quickly focus on the most interesting types and avoid building uninteresting categories. The table displays the types in descending order, starting with the one with the greatest number of records or documents (Doc. count).

The input that you choose affects the categories that you obtain. When you choose to use Types as input, you can see the clearly related concepts more easily. For example, if you build categories by using Types as input, you might obtain a category Fruit with concepts such as apple, pear, citrus fruits, and orange. If you choose Type Patterns as input instead and select the pattern <Unknown> + <Positive>, for example, then you might get a category fruit + <Positive> with one or two kinds of fruit such as fruit + tasty and apple + good. This second result shows only two concept patterns because the other occurrences of fruit are not necessarily positively qualified. While this might work for your current text data, in longitudinal studies where you use different document sets, you might want to manually add other descriptors such as citrus fruit + positive or use types. Using types alone as an input helps you to find all possible fruit.

If you select Type Patterns, categories are built from patterns rather than types and concepts on their own. Any records or documents containing a concept pattern that belongs to the selected type pattern are categorized. So, if you select the <Budget> and <Positive> type pattern in the table, categories such as cost & <Positive> or rates & excellent might be produced.

When using type patterns as input for automated category building, sometimes the techniques identify multiple ways to form the category structure. Technically, there is no single right way to produce the categories; however, you might find one structure more suited to your analysis than another. To help customize the output in this case, you can designate a type as the preferred focus. All the top-level categories produced come from a concept of the type you select here (and no other type). Every subcategory contains a text link pattern from this type. Choose this type in the Structure categories by pattern type field, and the table is updated to show only the applicable patterns containing the selected type. More often than not, <Unknown> is preselected for you. When <Unknown> is selected, all of the patterns containing the type <Unknown> are selected. The table displays the types in descending order, starting with the one with the greatest number of records or documents (Doc. count).

Techniques
Because every dataset is unique, the number of methods and the order in which you apply them might change over time. Your goals for text mining might be different from one set of data to the next, so you might need to experiment with different techniques to see which one produces the best results with your text data.

You do not need to be an expert in these settings to use them. By default, the most commonly used settings are already selected. Therefore, you can bypass the advanced settings dialogs and go straight to building your categories. Likewise, if you make changes here, you do not have to come back to the settings dialog each time because the latest settings are always retained.

Select one of the following techniques and then click Advanced settings. None of the automatic techniques can perfectly categorize your data. You might need to find and apply one or more automatic techniques that work well with your data. You can't build by using linguistic and frequency techniques simultaneously.

The following Extend settings are available:

Category input
Select Unused extraction results if you want categories to be built from extraction results that aren't used in any existing categories. This option minimizes the tendency for records to match multiple categories and limits the number of categories produced. Or select All extraction results if you want categories to be built using any of the extraction results. This option is most useful when you have no or few categories already.

Each of the grouping techniques fits certain types of data and situations best. It's often helpful to combine techniques in the same analysis to capture the full range of documents or records. You might see a concept in multiple categories or find redundant categories.

The concept inclusion technique builds categories by grouping multiterm concepts (compound words) based on whether they contain words that are subsets or supersets of a word in the other. For example, the concept seat is grouped with safety seat, seat belt, and seat belt buckle.
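
The subset and superset relationship behind concept inclusion can be illustrated with plain word sets. This is a minimal sketch under simple assumptions (whitespace tokenization, exact word matches, no inflection handling), and `concept_inclusion_groups` is a hypothetical name rather than a product function.

```python
def concept_inclusion_groups(concepts):
    """Group each concept with the longer concepts that contain all of its words."""
    word_sets = {c: set(c.split()) for c in concepts}
    groups = {}
    for base, base_words in word_sets.items():
        members = [
            c for c, words in word_sets.items()
            if c != base and base_words < words  # strict subset of the other concept
        ]
        if members:
            groups[base] = members
    return groups

concepts = ["seat", "safety seat", "seat belt", "seat belt buckle", "engine"]
print(concept_inclusion_groups(concepts))
```

In this sketch, seat collects safety seat, seat belt, and seat belt buckle, and seat belt in turn collects seat belt buckle, which hints at how a hierarchy of subcategories could arise.
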

The semantic network technique begins by identifying the possible senses of each concept from its extensive index of word relationships and then creates categories by grouping related concepts. For example, the concepts scuba diving, sailing, snorkeling, kayaking, and white water kayaking might all be grouped in the category sports/sports by type/water sports. Or the concept animal might be grouped with cat and kangaroo since they're hyponyms of animal. The semantic network technique works best when the concepts are known to the semantic network and are not too ambiguous. It is less useful when the text contains specialized terminology or jargon unknown to the network. This technique is available for English text only.

The Maximum search distance option is only available if you select the semantic network technique. Select how far you want the techniques to search before they produce categories. The lower the value, the fewer results you might get. However, these results are less noisy and are more likely to be significantly linked or associated with each other. The higher the value, the more results you might get. However, these results might be less reliable or relevant. While this option is globally applied to all techniques, its effect is greatest on co-occurrences and semantic networks.

Select Prevent pairing of specific concepts if you want to stop the process from grouping or pairing two concepts together in the output. To create or manage concept pairs, click Manage pairs.

Where possible
Choose whether to extend or generalize the descriptors by using wildcards, or both.
Extend and generalize
This option extends the selected categories and then generalizes the descriptors. When you choose to generalize, the category building process creates generic category rules that use the asterisk wildcard. For example, instead of multiple descriptors such as [apple tart + .] and [apple sauce + .], a generic category rule might use wildcards to produce [apple * + .]. If you generalize with wildcards, you often get the same number of records or documents as you did before. However, this option has the advantage of reducing the number of category descriptors and simplifying them. Also, this option increases the ability to categorize more records or documents by using these categories on new text data (for example, in longitudinal or wave studies).
Extend only
This option extends your categories without generalizing. It can be helpful to first choose the Extend only option for manually created categories and then extend the same categories again using the Extend and generalize option.
Generalize only
This option generalizes the descriptors without extending your categories in any other way.
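
The wildcard generalization mentioned for the Extend and generalize and Generalize only options can be sketched as follows. This is an illustration only: descriptors are assumed to be plain strings of the form "concept + rest", descriptors are merged when they share the first word and the same remainder, and `generalize` is a hypothetical helper, not the product's rule builder.

```python
from collections import defaultdict

def generalize(descriptors):
    """Merge descriptors that share a head word and tail into one wildcard rule."""
    buckets = defaultdict(list)
    for d in descriptors:
        concept, sep, rest = d.partition(" + ")
        head = concept.split()[0]  # first word of the concept part
        buckets[(head, sep + rest)].append(d)

    result = []
    for (head, tail), members in buckets.items():
        if len(members) > 1:
            result.append(f"{head} *{tail}")  # e.g. "apple * + ."
        else:
            result.extend(members)  # nothing to merge, keep as is
    return result

print(generalize(["apple tart + .", "apple sauce + .", "pear + ."]))
```

The two apple descriptors collapse into a single rule, so the descriptor list gets shorter while still covering the same records.
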
Maximum number of items to extend a descriptor by
When you extend a descriptor with items (concepts, types, and other expressions), define the maximum number of items that can be added to a single descriptor. If you set this limit to 10, then no more than 10 extra items can be added to an existing descriptor. If there are more than 10 items to be added, the techniques stop adding new items after the tenth is added. Doing so can make a descriptor list shorter but doesn't guarantee that the most interesting items were used first.
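
The cutoff behavior can be pictured with a short sketch. It assumes candidate items arrive in whatever order the technique finds them, which is why the most interesting items are not guaranteed to be added first. The function `extend_descriptor` and its arguments are hypothetical.

```python
def extend_descriptor(existing, candidates, max_items=10):
    """Add at most `max_items` new items to a descriptor, in the order found."""
    added = []
    for item in candidates:
        if len(added) >= max_items:
            break  # stop after the limit is reached
        if item not in existing and item not in added:
            added.append(item)
    return existing + added

found = [f"item{i}" for i in range(15)]
extended = extend_descriptor(["apple"], found, max_items=10)
print(len(extended))
```
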
Also extend subcategories
This option extends any subcategories that are included in the selected categories.
Extend empty categories with descriptors generated from the category name
This method applies only to empty categories, which have 0 descriptors. If a category already contains descriptors, it is not extended in this way. This option attempts to automatically create descriptors for each category based on the words that make up the name of the category. The category name is scanned to see whether words in the name match any extracted concepts. If a concept is recognized, it's used to find matching concept patterns, and both are used to form descriptors for the category. This option produces the best results when the category names are both long and descriptive. It is a quick method for generating category descriptors, which in turn enable the category to capture records that contain those descriptors. This option is most useful when you import categories from somewhere else or when you create categories manually with long descriptive names.
Generate descriptors as
This option applies only if the preceding option is selected. Choose the Concepts option to produce the resulting descriptors in the form of concepts, regardless of whether they have been extracted from the source text. Or choose the Patterns option to produce the resulting descriptors in the form of patterns, regardless of whether the resulting patterns or any patterns have been extracted.
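
The name scanning behind the Extend empty categories with descriptors generated from the category name option might look roughly like this. The sketch covers only the concept matching step, assumes whitespace tokenization and case-insensitive exact matches, and uses a hypothetical function name, `descriptors_from_name`.

```python
def descriptors_from_name(category_name, extracted_concepts):
    """Return the extracted concepts that appear as whole words in the category name."""
    name_words = category_name.lower().split()
    matches = []
    for concept in extracted_concepts:
        concept_words = concept.lower().split()
        n = len(concept_words)
        # Look for the concept as a contiguous run of words in the name.
        found = n > 0 and any(
            name_words[i:i + n] == concept_words
            for i in range(len(name_words) - n + 1)
        )
        if found:
            matches.append(concept)
    return matches

name = "Delivery speed and customer service"
print(descriptors_from_name(name, ["customer service", "delivery", "price"]))
```

A longer, more descriptive name gives the scan more words to match, which is consistent with the advice to use long descriptive category names.
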