You can access settings in various panes of the Text Analytics Workbench, such as extraction settings for concepts.
On the Concepts, Text links, and Categories tabs, categories are built from descriptors derived from either types or type patterns. In the table, you can select the individual types or patterns to include in the category building process. A description of all settings on each tab follows.
Settings for extraction results (Concepts data)
From the Concepts or Text links tab, click the Settings icon to change the following settings:
- Enable Text Link Analysis pattern extraction. Specifies that you want to extract TLA patterns from your text data. It also assumes you have TLA pattern rules in one of your libraries in the Resource editor. This option may significantly lengthen the extraction time.
- Accommodate punctuation errors. This option temporarily normalizes text containing punctuation errors (for example, improper usage) during extraction to improve the extractability of concepts. This option is extremely useful when text is short and of poor quality (as, for example, in open-ended survey responses, e-mail, and CRM data), or when the text contains many abbreviations.
- Accommodate spelling for a minimum root character limit. This option applies a fuzzy grouping technique that helps group commonly misspelled words or closely spelled words under one concept. The fuzzy grouping algorithm temporarily strips all vowels (except the first one) and strips double/triple consonants from extracted words and then compares them to see if they are the same so, for example, modeling and modelling would be grouped together. However, if each term is assigned to a different type, excluding the <Unknown> type, the fuzzy grouping technique won't be applied.
- Extract uniterms. This option extracts single words (uniterms) as long as the word isn't already part of a compound word and if it's either a noun or an unrecognized part of speech.
- Extract non-linguistic entities. This option extracts non-linguistic entities, such as phone numbers, social security numbers, times, dates, currencies, digits, percentages, e-mail addresses, and HTTP addresses. You can include or exclude certain types of nonlinguistic entities. By disabling any unnecessary entities, the extraction engine won't waste processing time.
- Uppercase algorithm. This option extracts simple and compound terms that aren't in the built-in dictionaries as long as the first letter of the term is in uppercase. This option offers a good way to extract most proper nouns.
- Group partial and full person names together when possible. This option groups names that appear differently in the text together. This feature is helpful since names are often referred to in their full form at the beginning of the text and then only by a shorter version. This option attempts to match any uniterm with the <Unknown> type to the last word of any of the compound terms that is typed as <Person>. For example, if doe is found and initially typed as <Unknown>, the extraction engine checks to see if any compound terms in the <Person> type include doe as the last word, such as john doe. This option doesn't apply to first names since most are never extracted as uniterms.
- Maximum nonfunction word permutation. This option specifies the maximum number of nonfunction words that can be present when applying the permutation technique. This permutation technique groups similar phrases that differ from each other only by the nonfunction words (for example, of and the) contained, regardless of inflection. For example, let's say you set this value to at most two words and both company officials and officials of the company were extracted. In this case, both extracted terms would be grouped together in the final concept list since both terms are deemed to be the same when of the is ignored.
- Use derivation when grouping multiterms. When processing big data, select this option to group multiterms by using derivation rules.
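The fuzzy grouping idea behind the spelling option above can be illustrated with a minimal Python sketch. It is not the product's implementation: the function names are hypothetical, "the first vowel" is read as the first vowel in the word, and repeated consonants are collapsed before vowels are stripped (the text does not specify the order). The rule about terms assigned to different types is not modeled.

```python
import re
from collections import defaultdict

VOWELS = set("aeiou")


def fuzzy_key(word: str) -> str:
    """Build a comparison key (hypothetical helper, not a product API)."""
    # Assumption: collapse double/triple consonants first, e.g. "ll" -> "l".
    collapsed = re.sub(r"([^aeiou])\1+", r"\1", word.lower())
    kept = []
    seen_vowel = False
    for ch in collapsed:
        if ch in VOWELS:
            if seen_vowel:
                continue  # drop every vowel after the first one
            seen_vowel = True
        kept.append(ch)
    return "".join(kept)


def group_by_fuzzy_key(words):
    """Group words whose keys are identical."""
    groups = defaultdict(list)
    for w in words:
        groups[fuzzy_key(w)].append(w)
    return dict(groups)


groups = group_by_fuzzy_key(["modeling", "modelling", "model"])
```

Under these assumptions, modeling and modelling share one key and land in the same group, while model keeps its own key.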
Settings for categories (Category data)
From the Categories tab, you can change the following settings:
- Build categories from. If you select Types, the categories will be built from the concepts belonging to the selected types. So if you select the <Budget> type in the table, categories such as cost or price could be produced since cost and price are concepts assigned to the <Budget> type.
By default, only the types that capture the most records or documents are selected. This pre-selection allows you to quickly focus in on the most interesting types and avoid building uninteresting categories. The table displays the types in descending order starting with the one with the greatest number of records or documents (Doc. count). Types from the Opinions library are deselected by default in the types table.
The input you choose affects the categories you obtain. When you choose to use Types as input, you can see the clearly related concepts more easily. For example, if you build categories using Types as input, you could obtain a category Fruit with concepts such as apple, pear, citrus fruits, orange and so on. If you choose Type Patterns as input instead and select the pattern <Unknown> + <Positive>, for example, then you might get a category fruit + <Positive> with one or two kinds of fruit such as fruit + tasty and apple + good. This second result only shows 2 concept patterns because the other occurrences of fruit are not necessarily positively qualified. And while this might be good enough for your current text data, in longitudinal studies where you use different document sets, you may want to manually add in other descriptors such as citrus fruit + positive or use types. Using types alone as input will help you to find all possible fruit.
If you select Type Patterns, categories are built from patterns rather than types and concepts on their own. In that way, any records or documents containing a concept pattern belonging to the selected type pattern are categorized. So, if you select the <Budget> and <Positive> type pattern in the table, categories such as cost & <Positive> or rates & excellent could be produced.
When using type patterns as input for automated category building, there are times when the techniques identify multiple ways to form the category structure. Technically, there is no single right way to produce the categories; however you might find one structure more suited to your analysis than another. To help customize the output in this case, you can designate a type as the preferred focus. All the top-level categories produced will come from a concept of the type you select here (and no other type). Every subcategory will contain a text link pattern from this type. Choose this type in the Structure categories by pattern type: field and the table will be updated to show only the applicable patterns containing the selected type. More often than not, <Unknown> will be preselected for you. This results in all of the patterns containing the type <Unknown> being selected. The table displays the types in descending order starting with the one with the greatest number of records or documents (Doc. count).
- Techniques. Because every dataset is unique, the number of methods and the order in which you apply them may change over time. Since your text mining goals may be different from one set of data to the next, you may need to experiment with the different techniques to see which one produces the best results for the given text data.
You do not need to be an expert in these settings to use them. By default, the most common and average settings are already selected. Therefore, you can bypass the advanced setting dialogs and go straight to building your categories. Likewise, if you make changes here, you do not have to come back to the settings dialog each time since the latest settings are always retained.
Select one of the following techniques and then click Advanced settings. None of the automatic techniques will perfectly categorize your data; therefore we recommend finding and applying one or more automatic techniques that work well with your data. You can't build using linguistic and frequency techniques simultaneously.
- Use linguistic techniques to build categories. See Advanced linguistic settings.
- Use frequencies to build categories. See Advanced frequency settings.
The following Extend settings are available:
- Category input. Select the option Unused extraction results if you want categories to be built from extraction results that aren't used in any existing categories. This minimizes the tendency for records to match multiple categories and limits the number of categories produced. Or select the option All extraction results if you want categories to be built using any of the extraction results. This is most useful when no or few categories already exist.
Each of the grouping techniques available is well suited to certain types of data and situations, but often it's helpful to combine techniques in the same analysis to capture the full range of documents or records. You may see a concept in multiple categories or find redundant categories.
The concept inclusion technique builds categories by grouping multiterm concepts (compound words) based on whether they contain words that are subsets or supersets of a word in the other. For example, the concept seat would be grouped with safety seat, seat belt, and seat belt buckle.
The semantic network technique begins by identifying the possible senses of each concept from its extensive index of word relationships and then creates categories by grouping related concepts. This technique is best when the concepts are known to the semantic network and are not too ambiguous. It's less helpful when text contains specialized terminology or jargon unknown to the network. In one example, the concept granny smith apple could be grouped with gala apple and winesap apple since they're siblings of the granny smith. In another example, the concept animal might be grouped with cat and kangaroo since they're hyponyms of animal. This technique is available for English text only.
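As a rough illustration of the concept inclusion technique described above, the following Python sketch groups a shorter concept with every longer concept that contains all of its words. The function names and the plain word-set comparison are assumptions made for this example; the product's actual matching logic is not documented here and may differ (for instance, in how it handles inflection).

```python
def includes(shorter: str, longer: str) -> bool:
    """True if the words of `shorter` are a proper subset of the words of `longer`."""
    return set(shorter.split()) < set(longer.split())


def group_by_inclusion(concepts):
    """Map each concept to the longer concepts that include it (hypothetical helper)."""
    groups = {}
    for base in concepts:
        members = [c for c in concepts if includes(base, c)]
        if members:
            groups[base] = members
    return groups


concepts = ["seat", "safety seat", "seat belt", "seat belt buckle", "engine"]
groups = group_by_inclusion(concepts)
```

With this input, seat is grouped with safety seat, seat belt, and seat belt buckle, matching the example in the text, and engine stays ungrouped.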
The Maximum search distance option is only available if you select the semantic network technique. Select how far you want the techniques to search before producing categories. The lower the value, the fewer results you will get—however, these results will be less noisy and are more likely to be significantly linked or associated with each other. The higher the value, the more results you might get—however, these results may be less reliable or relevant. While this option is globally applied to all techniques, its effect is greatest on co-occurrences and semantic networks.
Select Prevent pairing of specific concepts if you want to stop the process from grouping or pairing two concepts together in the output. To create or manage concept pairs, click Manage pairs.
- Where possible. Choose whether to simply extend, generalize the descriptors using wildcards, or both.
- Extend and generalize. This option will extend the selected categories and then generalize the descriptors. When you choose to generalize, the product will create generic category rules in categories using the asterisk wildcard. For example, instead of producing multiple descriptors such as [apple tart + .] and [apple sauce + .], using wildcards might produce [apple * + .]. If you generalize with wildcards, you'll often get exactly the same number of records or documents as you did before. However, this option has the advantage of reducing the number and simplifying category descriptors. Additionally, this option increases the ability to categorize more records or documents using these categories on new text data (for example, in longitudinal/wave studies).
- Extend only. This option will extend your categories without generalizing. It can be helpful to first choose the Extend only option for manually-created categories and then extend the same categories again using the Extend and generalize option.
- Generalize only. This option will generalize the descriptors without extending your categories in any other way.
- Maximum number of items to extend a descriptor by. When extending a descriptor with items (concepts, types, and other expressions), define the maximum number of items that can be added to a single descriptor. If you set this limit to 10, then no more than 10 additional items can be added to an existing descriptor. If there are more than 10 items to be added, the techniques stop adding new items after the tenth is added. Doing so can make a descriptor list shorter but doesn't guarantee that the most interesting items were used first.
- Also extend subcategories. This option will also extend any subcategories below the selected categories.
- Extend empty categories with descriptors generated from the category name. This method applies only to empty categories, which have 0 descriptors. If a category already contains descriptors, it won't be extended in this way. This option attempts to automatically create descriptors for each category based on the words that make up the name of the category. The category name is scanned to see if words in the name match any extracted concepts. If a concept is recognized, it's used to find matching concept patterns and these both are used to form descriptors for the category. This option produces the best results when the category names are both long and descriptive. This is a quick method for generating category descriptors, which in turn enable the category to capture records that contain those descriptors. This option is most useful when you import categories from somewhere else or when you create categories manually with long descriptive names.
- Generate descriptors as. This option only applies if the preceding option is selected. Choose the Concepts option to produce the resulting descriptors in the form of concepts, regardless of whether they have been extracted from the source text. Or choose the Patterns option to produce the resulting descriptors in the form of patterns, regardless of whether the resulting patterns or any patterns have been extracted.
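The wildcard generalization described under Extend and generalize can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the function name is hypothetical, descriptors are treated as plain strings in the bracketed form shown in the text, and the merge rule (same first word, same remainder after the first " + ") is a guess at the idea rather than the product's actual rule.

```python
from collections import defaultdict


def generalize(descriptors):
    """Merge descriptors that share a head word and tail into one wildcard rule."""
    buckets = defaultdict(list)
    for d in descriptors:
        inner = d.strip("[]")                      # "[apple tart + .]" -> "apple tart + ."
        first, sep, rest = inner.partition(" + ")  # split off the first slot
        head = first.split()[0]                    # assumption: the first word is the head
        buckets[(head, sep + rest)].append(d)

    result = []
    for (head, tail), members in buckets.items():
        if len(set(members)) > 1:
            result.append(f"[{head} *{tail}]")     # several variants -> one wildcard rule
        else:
            result.extend(members)                 # nothing to merge, keep as-is
    return result


rules = generalize(["[apple tart + .]", "[apple sauce + .]", "[pear + .]"])
```

Here the two apple descriptors collapse into [apple * + .], while [pear + .] is left unchanged, which mirrors how generalizing reduces the number of descriptors without changing which of these records match.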