How categorization works (SPSS Modeler)

When you create category models in Text Analytics, you can choose from several techniques to create categories. Because every dataset is unique, the number of techniques you use and the order in which you apply them may vary from one dataset to the next.

Since your interpretation of the results may be different from someone else's, you may need to experiment with the different techniques to see which one produces the best results for your text data. In Text Analytics, you can create category models in a workbench session in which you can explore and fine-tune your categories further.

In this documentation, category building refers to generating category definitions and classifications by using one or more built-in techniques. Categorization refers to the scoring, or labeling, process in which unique identifiers (name/ID/value) are assigned to the category definitions for each record or document.

During category building, the concepts and types that were extracted are used as the building blocks for your categories. When you build categories, the records or documents are automatically assigned to categories if they contain text that matches an element of a category's definition.
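The assignment step described above can be pictured with a small sketch. This is not how Text Analytics is implemented: the function name assign_categories, the sample categories, and the sample records are all hypothetical, and plain substring matching stands in for matching against extracted concepts and types.

```python
def assign_categories(records, categories):
    """Map each record ID to the sorted names of the categories it matches.

    A record matches a category when its text contains at least one of the
    category's descriptors. Substring matching is a simplifying assumption;
    it would, for example, also match "seat" inside "seated".
    """
    assignments = {}
    for record_id, text in records.items():
        lowered = text.lower()
        matched = [
            name
            for name, descriptors in categories.items()
            if any(descriptor.lower() in lowered for descriptor in descriptors)
        ]
        assignments[record_id] = sorted(matched)
    return assignments


# Hypothetical category definitions: category name -> descriptors.
categories = {
    "Seating": ["seat", "seat belt"],
    "Price": ["price", "expensive"],
}

# Hypothetical records: record ID -> text.
records = {
    1: "The seat belt buckle broke.",
    2: "Too expensive for what you get.",
    3: "Expensive, and the seat is uncomfortable.",
    4: "Delivery was fast.",
}

assignments = assign_categories(records, categories)
for record_id, names in assignments.items():
    print(record_id, names)
```

In this sketch, a record can land in more than one category (record 3) or in none (record 4), which mirrors the idea that assignment depends only on whether the text matches an element of a category's definition.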

Text Analytics offers you several automated category building techniques to help you categorize your documents or records quickly.

Grouping techniques

Each of the available techniques is well suited to certain types of data and situations, but often it is helpful to combine techniques in the same analysis to capture the full range of documents or records. You may see a concept in multiple categories or find redundant categories.

Semantic Network. This technique begins by identifying the possible senses of each concept from its extensive index of word relationships and then creates categories by grouping related concepts. This technique is best when the concepts are known to the semantic network and are not too ambiguous. It is less helpful when text contains specialized terminology or jargon unknown to the network. In one example, the concept granny smith apple could be grouped with gala apple and winesap apple because they are siblings of granny smith apple. In another example, the concept animal might be grouped with cat and kangaroo because they are hyponyms of animal. This technique is available for English text only.
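As a rough illustration of grouping by sibling and hyponym relationships, the sketch below uses a tiny hand-written table in place of the semantic network. The HYPERNYMS table, the function group_by_parent, and the concept list are hypothetical and exist only for this example; the real network is far larger and also resolves word senses, which this sketch does not attempt.

```python
from collections import defaultdict

# Hypothetical stand-in for the semantic network: concept -> broader concept.
HYPERNYMS = {
    "granny smith apple": "apple",
    "gala apple": "apple",
    "winesap apple": "apple",
    "cat": "animal",
    "kangaroo": "animal",
}


def group_by_parent(concepts, hypernyms):
    """Group concepts that share a broader concept in the table.

    Returns (groups, ungrouped). Concepts missing from the table, such as
    specialized jargon, cannot be placed and are returned as ungrouped.
    """
    parents = set(hypernyms.values())
    groups = defaultdict(list)
    ungrouped = []
    for concept in concepts:
        if concept in hypernyms:
            # Siblings end up together under their shared parent.
            groups[hypernyms[concept]].append(concept)
        elif concept in parents:
            # A broader concept joins the group of its own hyponyms.
            groups[concept].append(concept)
        else:
            ungrouped.append(concept)
    return dict(groups), ungrouped


concepts = [
    "granny smith apple", "gala apple", "winesap apple",
    "animal", "cat", "kangaroo",
    "torque converter",  # jargon that the toy table does not know
]
groups, ungrouped = group_by_parent(concepts, HYPERNYMS)
print(groups)
print(ungrouped)
```

The ungrouped list shows why a lookup-based technique is less helpful for terminology outside the network: a concept with no entry cannot be related to anything.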

Concept Inclusion. This technique builds categories by grouping multiterm concepts (compound words) based on whether the words in one concept are a subset or superset of the words in another. For example, the concept seat would be grouped with safety seat, seat belt, and seat belt buckle.
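The subset and superset idea can be sketched as a comparison of word sets. This is a minimal illustration under simplifying assumptions: the function concept_inclusion_groups is hypothetical, concepts are split on whitespace, and no stemming or other linguistic normalization is applied.

```python
def concept_inclusion_groups(concepts):
    """Group each concept with the concepts whose words include all of its words.

    Returns a dict that maps a base concept to the sorted list of concepts
    whose word set is a strict superset of the base concept's word set.
    Concepts with no such supersets are left out of the result.
    """
    word_sets = {concept: set(concept.split()) for concept in concepts}
    groups = {}
    for base, base_words in word_sets.items():
        members = sorted(
            other
            for other, other_words in word_sets.items()
            if other != base and base_words < other_words  # strict subset test
        )
        if members:
            groups[base] = members
    return groups


concepts = ["seat", "safety seat", "seat belt", "seat belt buckle", "engine"]
groups = concept_inclusion_groups(concepts)
for base, members in groups.items():
    print(base, "->", members)
```

Here seat is grouped with safety seat, seat belt, and seat belt buckle, as in the example above. The sketch also groups seat belt with seat belt buckle, which shows how a concept can appear in more than one group.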