Classifying text with a custom classification model
Classifying text with a custom classification model
Last updated: Nov 27, 2024
Classifying text with a custom classification model
You can train your own models for text classification using strong classification algorithms from three different families:
Classic machine learning using SVM (Support Vector Machines)
Deep learning using CNN (Convolutional Neural Networks)
A transformer-based algorithm using a pre-trained transformer model: Slate IBM Foundation model
The Watson Natural Language Processing library also offers an easy to use Ensemble classifier that combines different classification algorithms and majority voting.
The algorithms support multi-label and multi-class tasks and special cases, like if the document belongs to one class only (single-label task), or binary classification tasks.
Note:
Training classification models is CPU and memory intensive. Depending on the size of your training data, the environment might not be large enough to complete the training. If you run into issues with the notebook kernel during training, create
a custom notebook environment with a larger amount of CPU and memory, and use that to run your notebook. Especially for transformer-based algorithms, you should use a GPU-based environment, if it is available to you. See Creating your own environment template.
Classification blocks accept training data in CSV and JSON formats.
The CSV Format
The CSV file should contain no header. Each row in the CSV file represents an example record. Each record has one or more columns, where the first column represents the text and the subsequent columns represent the labels associated with
that text.
Note:
The SVM and CNN algorithms do not support training data where an instance has no labels. So, if you are using the SVM algorithm, or the CNN algorithm, or an Ensemble including one of these algorithms, each CSV row must have at least one
label, i.e., 2 columns.
The BERT-based and Slate-based Transformer algorithms support training data where each instance has 0, 1 or more than one label.
Example 1,label 1
Example 2,label 1,label 2
The JSON Format
The training data is represented as an array with multiple JSON objects. Each JSON object represents one training instance, and must have a text and a labels field. The text represents the training example, and labels stores the labels associated
with the example (0, 1, or more than one label).
"labels": [] denotes an example with no labels. The SVM and CNN algorithms do not support training data where an instance has no labels. So, if you are using the SVM algorithm, or the CNN algorithm, or an Ensemble
including one of these algorithms, each JSON object must have at least one label.
The BERT-based and Slate-based Transformer algorithms support training data where each instance has 0, 1 or more than one label.
Input data requirements
Copy link to section
For SVM and CNN algorithms:
Minimum number of unique labels required: 2
Minimum number of text examples required per label: 5
For the BERT-based and Slate-based Transformer algorithms:
Minimum number of unique labels required: 1
Minimum number of text examples required per label: 5
Training data in CSV or JSON format is converted to a DataStream before training. Instead of training data files, you can also pass data streams directly to the training functions of classification blocks.
Stopwords
Copy link to section
You can provide your own stopwords that will be removed during preprocessing. Stopwords file inputs are expected in a standard format: a single text file with one phrase per line. Stopwords can be provided as a list or as a file in a standard
format.
Stopwords can be used only with the Ensemble classifier.
Training SVM algorithms
Copy link to section
SVM is a support vector machine classifier that can be trained using predictions on any kind of input provided by the embedding or vectorization blocks as feature vectors, for example, by USE (Universal Sentence Encoder) embeddings
and TF-IDF vectorizers. It supports multi-class and multi-label text classification and produces confidence scores by using Platt Scaling.
For all options that are available for configuring SVM training, enter:
import watson_nlp
from watson_nlp.toolkit.classification_utils.train_util import prepare_data_from_json
from watson_nlp.blocks.classification.svm import SVM
training_data_file = "train_data.json"# Load a Syntax model
syntax_model = watson_nlp.load('syntax_izumo_en_stock')
# Create datastream from training data
train_stream = prepare_data_from_json(training_data_file, syntax_model)
syntax_stream, labels_stream = train_stream[0], train_stream[1]
# download embedding
use_embedding_model = watson_nlp.load('embedding_use_en_stock')
use_train_stream = use_embedding_model.stream(syntax_stream, doc_embed_style='raw_text')
# NOTE: doc_embed_style can be changed to `avg_sent` as well. For more information check the documentation for Embeddings# Or the USE run function API docs
use_svm_train_stream = watson_nlp.data_model.DataStream.zip(use_train_stream, labels_stream)
# Train SVM using Universal Sentence Encoder (USE) training stream
classification_model = SVM.train(use_svm_train_stream)
Copy to clipboardCopied to clipboard
Training the CNN algorithm
Copy link to section
CNN is a simple convolutional network architecture, built for multi-class and multi-label text classification on short texts. It utilizes GloVe embeddings. GloVe embeddings encode word-level semantics into a vector space. The GloVe embeddings
for each language are trained on the Wikipedia corpus in that language. For information on using GloVe embeddings, see GloVe Embeddings.
For all the options that are available for configuring CNN training, enter:
import watson_nlp
from watson_nlp.toolkit.classification_utils.train_util import prepare_data_from_json
from watson_nlp.blocks.classification.cnn import CNN
training_data_file = "train_data.json"# Load a Syntax model
syntax_model = watson_nlp.load('syntax_izumo_en_stock')
# Create datastream from training data
train_stream = prepare_data_from_json(training_data_file, syntax_model)
syntax_stream, labels_stream = train_stream[0], train_stream[1]
# Download GloVe embeddings
glove_embedding_model = watson_nlp.load('embedding_glove_en_stock')
# Train CNN
classification_model = CNN.train(watson_nlp.data_model.DataStream.zip(syntax_stream, labels_stream), embedding=glove_embedding_model.embedding)
Copy to clipboardCopied to clipboardShow more
Training the transformer algorithm by using the IBM Slate model
Copy link to section
The transformer algorithm using a pretrained Slate IBM Foundation model can be used for multi-class and multi-label text classification on short texts.
For all the options available for configuring Transformer training, enter:
import watson_nlp
from watson_nlp.blocks.classification.transformer import Transformer
from watson_nlp.toolkit.classification_utils.train_util import prepare_stream_of_train_records
from watson_nlp import data_model as dm
training_data_file = "train_data.json"# Create datastream from training data
train_data_stream = dm.DataStream.from_file(training_data_file)
train_record_data_stream = prepare_stream_of_train_records(train_data_stream)
# Load pre-trained slate model
pretrained_model_resource = watson_nlp.load('<pretrained Slate model>')
# Train model - note that Transformer uses Slatea by default
classification_model = Transformer.train(train_data_stream=train_record_data_stream,
pretrained_model_resource=pretrained_model_resource)
Copy to clipboardCopied to clipboardShow more
Training a custom transformer model by using a model provided by Hugging Face
Copy link to section
You can train your custom transformer-based model by using a pretrained model from Hugging Face.
To use a Hugging Face model, specify the model name as the pretrained_model_resource parameter in the train method of watson_nlp.blocks.classification.transformer.Transformer. Go to https://huggingface.co/models to copy the model name.
To get a list of all the options available for configuring a transformer training, type this code:
For information on how to train transformer algorithms, refer to this code example:
import watson_nlp
from watson_nlp.blocks.classification.transformer import Transformer
from watson_nlp.toolkit.classification_utils.train_util import prepare_stream_of_train_records
from watson_nlp import data_model as dm
training_data_file = "train_data.json"# Create datastream from training data
train_data_stream = dm.DataStream.from_file(training_data_file)
train_record_data_stream = prepare_stream_of_train_records(train_data_stream)
# Specify the name of the Hugging Face model
huggingface_model_name = 'xml-roberta-base'# Train model
classification_model = Transformer.train(train_data_stream=train_record_data_stream,
pretrained_model_resource=huggingface_model_name)
Copy to clipboardCopied to clipboardShow more
Training an ensemble model
Copy link to section
The Ensemble model is a weighted ensemble of these three algorithms: CNN, SVM with TF-IDF and SVM with USE. It computes the weighted mean of a set of classification predictions using confidence scores. The ensemble model is very easy to use.
The GenericEnsemble classifier allows more flexibility for the user to choose from the three base classifiers TFIDF-SVM, USE-SVM and CNN. For texts ranging from 50 to 1000 characters, using the combination of TFIDF-SVM and USE-SVM classifiers
often yields a good balance of quality and performance. On some medium or long documents (500-1000+ characters), adding the CNN to the Ensemble could help increase quality, but it usually comes with a significant runtime performance impact
(lower throughput and increased model loading time).
For all of the options available for configuring Ensemble training, enter:
import watson_nlp
from watson_nlp.workflows.classification.generic_ensemble import GenericEnsemble
from watson_nlp.workflows.classification.base_classifier import GloveCNN
from watson_nlp.workflows.classification.base_classifier import TFidfSvm
training_data_file = "train_data.json"# Syntax model
syntax_model = watson_nlp.load('syntax_izumo_en_stock')
# GloVE Embedding model
glove_embedding_model = watson_nlp.load('embedding_glove_en_stock')
# Train classifier using the manually defined base classifier and manually defined weights¶
classification_model = GenericEnsemble.train(training_data_file,
syntax_model,
base_classifiers_params=[
TFidfSvm.TrainParams(syntax_model=syntax_model),
GloveCNN.TrainParams(syntax_model=syntax_model,
glove_embedding_model=glove_embedding_model,
)],
weights=[2,1])
Copy to clipboardCopied to clipboardShow more
Pretrained stopword models available out-of-the-box
Copy link to section
The text model for identifying stopwords is used in training the document classification ensemble model.
The following table lists the pretrained stopword models and the language codes that are supported (xx stands for the language code). For a list of the language codes and the corresponding language, see Language codes.
List of pretrained stopword models with their supported languages
Resource class
Model name
Supported languages
text
text_stopwords_classification_ensemble_xx_stock
ar, de, es, en, fr, it, ja, ko
Training best practices
Copy link to section
There are certain constraints on the quality and quantity of data to ensure that the classifications model training can complete in a reasonable amount of time and also meets various performance criteria. These are listed below. Note that none
are hard restrictions. However, the further one deviates from these guidelines, the greater the chance that the model fails to train or the model will not be satisfactory.
Data quantity
The highest number of classes classification model has been tested on is ~1200.
The best suited text size for training and testing data for classification is around 3000 code points. However, larger texts can also be processed, but the runtime performance might be slower.
Training time will increase based on the number of examples and number of labels.
Inference time will increased based on the number of labels.
Data quality
Size of each sample (for example, number of phrases in each training sample) can affect quality.
Class separation is important. In other words, classes among the training (and test) data should be semantically distinguishable from each another in order to avoid misclassifications. Since the classifier algorithms in Watson Natural
Language Processing rely on word embeddings, training classes that contain text examples with too much semantic overlap may make high-quality classification computationally intractable. While more sophisticated heuristics may exist for
assessing the semantic similarity between classes, you should start with a simple "eye test" of a few examples from each class to discern whether or not they seem adequately separated.
It is recommended to use balanced data for training. Ideally there should be roughly equal numbers of examples from each class in the training data, otherwise the classifiers may be biased towards classes with larger representation in
the training data.
It is best to avoid circumstances where some classes in the training data are highly under-represented as compared to other classes.
Limitations and caveats:
The CNN classification block has a predefined sequence length of 1000 code points. This limit can be configured at train time by changing the parameter max_phrase_len. There is no maximum limit for this parameter, but increasing
the maximum phrase length will affect CPU and memory consumption.
SVM blocks do not have such limit on sequence length and can be used with longer texts.
Applying the model on new data
Copy link to section
After you have trained the model on a data set, apply the model on new data using the run() method, as you would use on any of the existing pre-trained blocks.
Sample code
For the Ensemble models:
# run Ensemble model on new text
ensemble_prediction = ensemble_classification_model.run("new input text")
Copy to clipboardCopied to clipboard
For SVM and CNN models, for example for CNN:
# run Syntax model first
syntax_result = syntax_model.run("new input text")
# run CNN model on top of syntax result
cnn_prediction = cnn_classification_model.run(syntax_result)
Copy to clipboardCopied to clipboard
Choosing the right algorithm for your use case
Copy link to section
You need to choose the model algorithm that best suits your use case.
When choosing between SVM, CNN, and Transformers, consider the following:
Transformer-based Slate
Choose when high quality is required and higher computing resources are available.
CNN
Choose when decent size data is available
Choose if GloVe embedding is available for the required language
Choose if you have the option between single label versus multi-label
CNN fine tunes embeddings, so it could give better performance for unknown terms or newer domains.
SVM
Choose if an easier and simpler model is required
SVM has the fastest training and inference time
Choose if your data set size is small
If you select SVM, you need to consider the following when choosing between the various implementations of SVM:
SVMs train multi-label classifiers.
The larger the number of classes, the longer the training time.
TF-IDF:
Choose TF-IDF vectorization with SVM if the data set is small, i.e. has a small number of classes, a small number of examples and shorter text size, for example, sentences containing fewer phrases.
TF-IDF with SVM can be faster than other algorithms in the classification block.
Choose TF-IDF if embeddings for the required language are not available.
USE:
Choose Universal Sentence Encoder (USE) with SVM if the data set has one or more sentences in input text.
USE can perform better on data sets where understanding the context of words or sentences is important.
The Ensemble model combines multiple individual (diverse) models together to deliver superior prediction power. Consider the following key data for this model type:
The ensemble model combines CNN, SVM with TF-IDF and SVM with USE.
It is the easiest model to use.
It can give better performance than the individual algorithms.
It works for all kinds of data sets. However, training time for large datasets (more than 20000 examples) can be high.
An ensemble model allows you to set weights. These weights decides how the ensemble model combines the results of individual classifiers. Currently, the selection of weights is a heuristics and needs to be set by trial and error. The default
weights that are provided in the function itself are a good starting point for the exploration.
About cookies on this siteOur websites require some cookies to function properly (required). In addition, other cookies may be used with your consent to analyze site usage, improve the user experience and for advertising.For more information, please review your cookie preferences options. By visiting our website, you agree to our processing of information as described in IBM’sprivacy statement. To provide a smooth navigation, your cookie preferences will be shared across the IBM web domains listed here.