When you build categories, you can select from a number of advanced linguistic category
building techniques such as concept inclusion and semantic networks
(English text only). These techniques can be used individually or in combination with each other to
create categories.
Keep in mind that because every dataset is unique, the number of methods and the order in which
you apply them may change over time. Since your text mining goals may be different from one set of
data to the next, you may need to experiment with the different techniques to see which one produces
the best results for the given text data. None of the automatic techniques will perfectly categorize
your data; therefore we recommend finding and applying one or more automatic techniques that work
well with your data.
The following advanced settings are available for the Use linguistic techniques to
build categories option in the category settings.
Category input
Select what the categories will be built from:
Unused extraction results. This option enables categories to be built from
extraction results that aren't used in any existing categories. This minimizes the tendency for
records to match multiple categories and limits the number of categories produced.
All extraction results. This option enables categories to be built using any
of the extraction results. This is most useful when no or few categories already exist.
Category output
Select the general structure for the categories that will be built:
Hierarchical with subcategories. This option creates
subcategories and sub-subcategories. You can set the depth of your categories by choosing the
maximum number of levels that can be created. For example, if you choose 3, categories could contain
subcategories and those subcategories could also have subcategories.
Flat categories (single level only). This option
builds only one level of categories, meaning that no subcategories will be generated.
Grouping techniques
Each of the techniques available is well suited to certain types of data and situations, but
often it's helpful to combine techniques in the same analysis to capture the full range of documents
or records. You may see a concept in multiple categories or find redundant categories.
Group by concept inclusion. This technique builds categories by grouping
multiterm concepts (compound words) based on whether the words in one concept are a subset or
superset of the words in another. For example, the concept seat would be grouped
with safety seat, seat belt, and seat belt
buckle.
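The inclusion idea can be illustrated with a minimal Python sketch. It is not Modeler's implementation: the function name `group_by_inclusion` is hypothetical, and the sketch assumes that "inclusion" simply means the words of one concept form a strict subset of the words of another.

```python
from collections import defaultdict


def group_by_inclusion(concepts):
    """Group each concept with the concepts that contain all of its words.

    Hypothetical helper: treats a concept as a set of whitespace-separated
    words and uses strict subset comparison as the inclusion test.
    """
    word_sets = {concept: set(concept.split()) for concept in concepts}
    groups = defaultdict(list)
    for root, root_words in word_sets.items():
        for other, other_words in word_sets.items():
            # Strict subset: every word of `root` occurs in `other`,
            # and `other` has at least one additional word.
            if root_words < other_words:
                groups[root].append(other)
    return {root: sorted(members) for root, members in groups.items()}


concepts = ["seat", "safety seat", "seat belt", "seat belt buckle", "engine"]
groups = group_by_inclusion(concepts)
print(groups["seat"])
```

In this toy input, seat collects the three longer concepts that contain it, seat belt collects seat belt buckle, and engine forms no group because no other concept contains it.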
Group by semantic network. This technique begins by identifying the
possible senses of each concept from its extensive index of word relationships and then creates
categories by grouping related concepts. This technique is best when the concepts are known to the
semantic network and are not too ambiguous. It is less helpful when text contains specialized
terminology or jargon unknown to the network. In one example, the concept granny smith
apple could be grouped with gala apple and winesap apple
since they are siblings of the granny smith. In another example, the concept animal
might be grouped with cat and kangaroo since they are hyponyms of
animal. This technique is available for English text only.
Maximum search distance. This setting is only available if you select the
Group by semantic network option. Select how far you want the techniques to
search before producing categories. The lower the value, the fewer results you will get—however,
these results will be less noisy and are more likely to be significantly linked or associated with
each other. The higher the value, the more results you might get—however, these results may be less
reliable or relevant. While this option is globally applied to all techniques, its effect is
greatest on co-occurrences and semantic networks.
Prevent pairing of specific concepts. Select this option to stop the
process from grouping or pairing two concepts together in the output. To create or manage concept
pairs, click Manage pairs.
Generalize with wildcards where possible. Select this option to allow
Modeler to generate generic rules in categories using the asterisk wildcard. For example, instead of
producing multiple descriptors such as [apple tart + .] and [apple sauce +
.], using wildcards might produce [apple * + .]. If you generalize with
wildcards, you'll often get exactly the same number of records or documents as you did before.
However, this option has the advantage of reducing the number of category descriptors and simplifying them.
Additionally, this option increases the ability to categorize more records or documents using these
categories on new text data (for example, in longitudinal/wave studies).
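The effect of wildcard generalization can be sketched in Python. The sketch is an illustration under stated assumptions rather than Modeler's algorithm: it works on plain concept strings instead of the bracketed rule syntax, the helper names `generalize` and `matches` are hypothetical, and it assumes descriptors are generalized when at least two multiword descriptors share the same first word.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def generalize(descriptors, min_shared=2):
    """Collapse multiword descriptors that share a first word into '<word> *'.

    Hypothetical helper; `min_shared` is an assumed threshold.
    """
    by_head = defaultdict(list)
    singles = []
    for descriptor in descriptors:
        words = descriptor.split()
        if len(words) >= 2:
            by_head[words[0]].append(descriptor)
        else:
            # Single-word descriptors are kept as they are.
            singles.append(descriptor)
    patterns = []
    for head, members in by_head.items():
        if len(members) >= min_shared:
            patterns.append(f"{head} *")
        else:
            patterns.extend(members)
    return patterns + singles


def matches(concept, patterns):
    """Return True if the concept matches any descriptor pattern."""
    return any(fnmatchcase(concept, pattern) for pattern in patterns)


patterns = generalize(["apple tart", "apple sauce", "pear"])
print(patterns)
print(matches("apple pie", patterns))
```

The two apple descriptors collapse into one pattern, which still matches the original concepts and also matches apple pie, a concept that might only appear in later data.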
Other options for building categories
Maximum number of top level categories created. Use
this option to limit the number of categories that can be generated the next time you click
Build in the categories pane. In some cases, you might get better results if
you set this value high and then delete any of the uninteresting categories.
Minimum number of descriptors and/or subcategories per
category. Use this option to define the minimum number of descriptors and
subcategories a category must contain in order to be created. This option helps limit the creation
of categories that don't capture a significant number of records or documents.
Allow descriptors to appear in more than one category.
When selected, this option allows descriptors to be used in more than one of the categories that
will be built next. This option is generally selected since items commonly or "naturally" fall into
two or more categories, and allowing them to do so usually leads to higher quality categories. If
you don't select this option, you reduce the overlap of records in multiple categories and—depending
on the type of data you have—this might be desirable. However, with most types of data, restricting
descriptors to a single category usually results in a loss of quality or category coverage. For
example, let's say you have the concept car seat manufacturer. With this option selected,
this concept could appear in one category based on the text car seat and in another
one based on manufacturer. But if this option is not selected, although you may
still get both categories, the concept car seat manufacturer will only appear as a
descriptor in the category it best matches based on several factors including the number of records
in which car seat and manufacturer each occur.
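The difference between the two settings can be sketched as follows. This is a simplified illustration, not the product's logic: the function `assign_descriptors` is hypothetical, the record counts are made up, and the sketch assumes the record count of each candidate category is the only factor used to pick the best match, whereas the text above says several factors are involved.

```python
def assign_descriptors(candidates, record_counts, allow_overlap=True):
    """Place each descriptor in all candidate categories or only the best one.

    Hypothetical helper: `candidates` maps a descriptor to the categories it
    could belong to; `record_counts` maps a category to its record count.
    """
    categories = {}
    for descriptor, candidate_categories in candidates.items():
        if allow_overlap:
            chosen = candidate_categories
        else:
            # Assumption: the category with the most records is the best match.
            chosen = [max(candidate_categories, key=lambda c: record_counts[c])]
        for category in chosen:
            categories.setdefault(category, []).append(descriptor)
    return categories


candidates = {
    "car seat manufacturer": ["car seat", "manufacturer"],
    "car seat": ["car seat"],
    "manufacturer": ["manufacturer"],
}
record_counts = {"car seat": 120, "manufacturer": 45}  # made-up counts

with_overlap = assign_descriptors(candidates, record_counts, allow_overlap=True)
without_overlap = assign_descriptors(candidates, record_counts, allow_overlap=False)
print(with_overlap)
print(without_overlap)
```

With overlap allowed, car seat manufacturer is a descriptor in both categories. Without it, both categories still exist, but the descriptor appears only under car seat, the category with more records in this made-up data.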
Resolve duplicate category names by. Select how to
handle any new categories or subcategories whose names would be the same as existing categories. You
can either merge the new ones (and their descriptors) with the existing categories with the same
name, or you can choose to skip the creation of any categories if a duplicate name is found in the
existing categories.
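The two ways of resolving duplicate names can be sketched in Python. The function `resolve_duplicates` and the strategy names "merge" and "skip" are hypothetical labels chosen for this illustration, and categories are modeled simply as a mapping from a name to a list of descriptors.

```python
def resolve_duplicates(existing, new, strategy="merge"):
    """Combine newly built categories with existing ones.

    Hypothetical helper. "merge" adds the new descriptors to the existing
    category of the same name; "skip" leaves the existing category untouched
    and does not create the new one.
    """
    if strategy not in ("merge", "skip"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    result = {name: list(descriptors) for name, descriptors in existing.items()}
    for name, descriptors in new.items():
        if name not in result:
            # No name clash: the new category is created as is.
            result[name] = list(descriptors)
        elif strategy == "merge":
            for descriptor in descriptors:
                if descriptor not in result[name]:
                    result[name].append(descriptor)
        # strategy == "skip": the duplicate new category is dropped.
    return result


existing = {"seat": ["seat", "seat belt"]}
new = {"seat": ["safety seat", "seat belt"], "engine": ["engine"]}

merged = resolve_duplicates(existing, new, strategy="merge")
skipped = resolve_duplicates(existing, new, strategy="skip")
print(merged)
print(skipped)
```

In both cases the non-clashing engine category is created. Only the merge strategy brings safety seat into the existing seat category.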