The Text Mining node uses linguistic and frequency techniques to extract key concepts
from the text and create categories with these concepts and other data. Use the node to explore the
contents of your text data or to produce a concept model nugget or a category model nugget.
When you run this node, an internal linguistic extraction engine extracts and
organizes the concepts, patterns, and categories by using natural language processing methods. Two
build modes are available in the Text Mining node's properties:
The Generate directly (concept model nugget) mode automatically produces
a concept model nugget when you run the node.
The Build interactively (category model nugget) mode is a more hands-on,
exploratory approach. You can use this mode not only to extract concepts, create categories, and
refine your linguistic resources, but also to run text link analysis and explore clusters. This build
mode launches the Text Analytics Workbench.
You can use the Text Mining node to generate one of two text mining model nuggets:
Concept model nuggets uncover and extract important concepts
from your structured or unstructured text data.
Category model nuggets score and assign documents and records
to categories, which are made up of the extracted concepts (and patterns).
The extracted concepts and patterns and the categories from your model nuggets
can all be combined with existing structured data, such as demographics, to yield better and
more-focused decisions. For example, if customers frequently list login issues as the primary
impediment to completing online account management tasks, you might want to incorporate "login
issues" into your models.
Data sources and linguistic resources
Text Mining modeling nodes accept text data from
Import nodes.
You can also upload custom templates and text analysis packages directly in the Text Mining node
to use in the extraction process.
Concepts and concept model nuggets
During the extraction process, text data is scanned and analyzed to identify
important single words, such as election or peace, and word
phrases such as presidential election, election of the president,
or peace treaties. These words and phrases are collectively referred to as
terms. Using the linguistic resources, the relevant terms are extracted, and similar
terms are grouped under a lead term that is called a concept.
This grouping means that a concept might represent multiple underlying terms. For example, suppose
the concept salary is extracted from an employee satisfaction survey. When you look
at the records associated with salary, you notice that salary
isn't always present in the text; instead, certain records contain something similar, such as
the terms wage, wages, and salaries. These terms
are grouped under salary because the extraction engine deemed them similar or
determined that they were synonyms based on processing rules or linguistic resources. In this case, any
documents or records that contain any of those terms are treated as if they contained the word
salary.
If you want to see what terms are grouped under a concept, you can explore the
concept in the Text Analytics Workbench or look at which synonyms are shown in the concept model.
A concept model nugget contains a set of concepts, which you can
use to identify records or documents that also contain the concept (including any of its synonyms or
grouped terms). A concept model can be used in two ways:
To explore and analyze the concepts that were discovered in the original source text or to
quickly identify documents of interest.
To apply this model to new text records or documents to quickly identify the same key concepts
in the new documents or records. For example, you can apply the model to the real-time discovery of
key concepts in scratch-pad data from a call center.
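The grouping and scoring behavior described above can be sketched in plain Python. The synonym map, the sample record, and the matching logic below are illustrative assumptions for demonstration only, not the actual extraction engine or its output format:

```python
# Illustrative sketch of how a concept model groups terms under a lead
# term (the concept) and flags records that contain any grouped term.
# The concept/term map and the sample record are assumptions, not SPSS
# Modeler internals.

CONCEPTS = {
    "salary": {"salary", "salaries", "wage", "wages"},
    "login issues": {"login issues", "login problems", "cannot log in"},
}

def score_record(text: str) -> dict:
    """Return one flag per concept: True if any grouped term appears in the text."""
    lowered = text.lower()
    return {concept: any(term in lowered for term in terms)
            for concept, terms in CONCEPTS.items()}

record = "My wages are fair, but I have login problems every week."
print(score_record(record))
# {'salary': True, 'login issues': True}
```

Note that the record never contains the literal word salary; it is still flagged for that concept because wages is grouped under it, which mirrors how documents containing any grouped term are treated as if they contained the lead term.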
Categories and category model nuggets
You can create categories that represent higher-level concepts or
topics to capture the key ideas, knowledge, and attitudes expressed in the text. Categories are made
up of a set of descriptors, such as concepts, types, and
rules. Together, these descriptors are used to identify whether a record or
document belongs in a category. A document or record can be scanned to see whether any of its text
matches a descriptor. If a match is found, the document is assigned to that category. This process
is called categorization.
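The categorization process just described can be sketched as a simple matching loop. The category names and descriptors below are hypothetical examples, and the string-containment match is a deliberate simplification of the real descriptor matching:

```python
# Illustrative sketch of categorization: each category is defined by a set
# of descriptors; a document is assigned to every category for which at
# least one descriptor matches its text. Categories and descriptors here
# are assumptions for demonstration only.

CATEGORIES = {
    "Compensation": {"salary", "wages", "bonus"},
    "IT Problems": {"login", "password", "outage"},
}

def categorize(document: str) -> list:
    """Return all categories whose descriptors match the document text."""
    lowered = document.lower()
    return [name for name, descriptors in CATEGORIES.items()
            if any(d in lowered for d in descriptors)]

doc = "After the outage I could not reset my password, and my bonus was late."
print(categorize(doc))
# ['Compensation', 'IT Problems']
```

The sample document matches descriptors in both categories, which illustrates how a single document or record can be assigned to more than one category.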
Categories can be built automatically by using SPSS Modeler's robust set of
automated techniques, manually by using any additional insight that you might have about the data,
or by a combination of both approaches. You can also load a set of prebuilt categories
from a text analysis package through the node's Model settings. Creating or refining categories
manually can be done only in the Text Analytics Workbench.
A category model nugget contains a set of categories along with
their descriptors. The model can be used to categorize a set of documents or records based on the text
in each document or record. Every document or record is read and then assigned to each category for
which a descriptor match was found. In this way, a document or record could be assigned to more than
one category. For example, you can use category model nuggets to see the essential ideas in
open-ended survey responses or in a set of blog entries.