Mining for concepts and categories

The Text Mining node uses linguistic and frequency techniques to extract key concepts from the text and create categories with these concepts and other data. Use the node to explore the contents of the text data or to produce either a concept model nugget or a category model nugget.

When you run this modeling node, an internal linguistic extraction engine extracts and organizes the concepts, patterns, and/or categories using natural language processing methods.

You can run the Text Mining node and automatically produce a concept or category model nugget using the Generate directly option. Alternatively, you can take a more hands-on, exploratory approach using the Build interactively mode, in which you can extract concepts, create categories, and refine your linguistic resources, and also perform text link analysis and explore clusters.

Requirements. Text Mining modeling nodes accept text data from Import nodes.

Use the Text Mining node to generate one of two text mining model nuggets:

  • Concept model nuggets uncover and extract salient concepts from your structured or unstructured text data.
  • Category model nuggets score and assign documents and records to categories, which are made up of the extracted concepts (and patterns).

The extracted concepts and patterns and the categories from your model nuggets can all be combined with existing structured data, such as demographics, to yield better and more focused decisions. For example, if customers frequently list login issues as the primary impediment to completing online account management tasks, you might want to incorporate "login issues" into your models.
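As an illustration only, the following Python sketch shows one way a concept such as "login issues" could be attached to structured records as a true/false field. The customer IDs, field names, and the combine function are hypothetical and are not part of the Text Mining node; the node and its model nuggets perform the actual extraction and scoring.

```python
# Hypothetical structured data keyed by customer ID.
demographics = {
    101: {"age_band": "25-34", "region": "West"},
    102: {"age_band": "45-54", "region": "East"},
}

# Hypothetical scoring output: concepts found in each customer's comment.
extracted = {
    101: {"login issues", "password reset"},
    102: {"billing"},
}


def combine(demographics, extracted, concept):
    """Add a True/False flag for one concept to every structured record."""
    rows = []
    for customer_id, fields in demographics.items():
        row = {"customer_id": customer_id, **fields}
        # A customer with no extracted concepts gets False.
        row[concept] = concept in extracted.get(customer_id, set())
        rows.append(row)
    return rows


rows = combine(demographics, extracted, "login issues")
```

Each resulting row keeps the original demographic fields and gains one flag field, which a downstream model could then use alongside the structured data.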

In Text Analytics, we often refer to extracted concepts and categories. It is important to understand the meaning of concepts and categories since they can help you make more informed decisions during your exploratory work and model building.

Concepts and concept model nuggets

During the extraction process, text data is scanned and analyzed to identify interesting or relevant single words, such as election or peace, and word phrases such as presidential election, election of the president, or peace treaties. These words and phrases are collectively referred to as terms. Using the linguistic resources, the relevant terms are extracted, and similar terms are grouped together under a lead term called a concept.

In this way, a concept could represent multiple underlying terms depending on your text and the set of linguistic resources you are using. For example, suppose you have an employee satisfaction survey and the concept salary was extracted. Suppose also that when you look at the records associated with salary, you notice that salary isn't always present in the text; instead, certain records contain similar terms, such as wage, wages, and salaries. These terms are grouped under salary because the extraction engine deemed them similar or determined that they are synonyms based on processing rules or linguistic resources. In this case, any documents or records containing any of those terms would be treated as if they contained the word salary.
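The grouping described above can be pictured with a small Python sketch. It is not the extraction engine's algorithm: the synonym table is written by hand here, whereas the engine derives groupings from its processing rules and linguistic resources. The names CONCEPT_TERMS and concepts_in are hypothetical.

```python
import re

# Hypothetical synonym table: each lead term (concept) maps to the terms
# grouped under it.
CONCEPT_TERMS = {
    "salary": {"salary", "salaries", "wage", "wages"},
}

# Invert the table so that each term points to its lead concept.
TERM_TO_CONCEPT = {
    term: concept
    for concept, terms in CONCEPT_TERMS.items()
    for term in terms
}


def concepts_in(text):
    """Return the set of concepts whose grouped terms appear in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {TERM_TO_CONCEPT[w] for w in words if w in TERM_TO_CONCEPT}


records = [
    "My wages have not changed in three years.",
    "The office is too noisy.",
]
matches = [concepts_in(r) for r in records]
```

The first record never mentions the word salary, yet it is matched to the salary concept because wages is one of the grouped terms. The second record matches no concept.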

If you want to see what terms are grouped under a concept, you can explore the concept within an interactive workbench or look at which synonyms are shown in the concept model.

A concept model nugget contains a set of concepts that you can use to identify records or documents that contain a given concept (including any of its synonyms or grouped terms). A concept model can be used in two ways. The first is to explore and analyze the concepts that were discovered in the original source text or to quickly identify documents of interest. The second is to apply the model to new text records or documents to quickly identify the same key concepts in them, such as the real-time discovery of key concepts in scratch-pad data from a call center.

Categories and category model nuggets

You can create categories that represent, in essence, higher-level concepts or topics to capture the key ideas, knowledge, and attitudes expressed in the text. Categories are made up of a set of descriptors, such as concepts, types, and rules. Together, these descriptors are used to identify whether or not a record or document belongs in a given category. A document or record can be scanned to see whether any of its text matches a descriptor. If a match is found, the document/record is assigned to that category. This process is called categorization.

Categories can be built automatically using Watson Studio's robust set of automated techniques, manually using additional insight you may have regarding the data, or a combination of both. You can also load a set of prebuilt categories from a text analysis package through the Model settings of this node. Creating categories manually or refining existing categories can be done only through the interactive workbench.

A category model nugget contains a set of categories along with their descriptors. The model can be used to categorize a set of documents or records based on the text in each document or record. Every document or record is read and then assigned to each category for which a descriptor match was found. In this way, a document or record could be assigned to more than one category. You can use category model nuggets to see the essential ideas in open-ended survey responses or in a set of blog entries, for example.
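The following Python sketch illustrates the categorization idea under simplifying assumptions: descriptors are plain single-word concepts, and a match means the word appears in the text. Real descriptors can also be types and rules, which this sketch does not model. The category names and the categorize function are hypothetical.

```python
import re

# Hypothetical categories, each defined by a set of single-word descriptors.
CATEGORIES = {
    "compensation": {"salary", "bonus"},
    "work environment": {"noise", "office"},
}


def categorize(text, categories=CATEGORIES):
    """Return every category with at least one descriptor found in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        name
        for name, descriptors in categories.items()
        if words & descriptors  # non-empty intersection means a match
    )


assigned = categorize("The office is loud and my salary is too low.")
unassigned = categorize("I like my manager.")
```

Because the first response matches a descriptor in both categories, it is assigned to both. The second response matches no descriptor and is assigned to none.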