Classifying text with a custom classification model

You can train your own models for text classification using strong classification algorithms from three different families:

  • Classic machine learning using SVM (Support Vector Machines)
  • Deep learning using CNN (Convolutional Neural Networks)
  • A transformer-based algorithm using the Google BERT multilingual model.

The Watson Natural Language Processing library also offers an easy-to-use Ensemble classifier that combines different classification algorithms and majority voting.

The algorithms support multi-label and multi-class tasks, as well as special cases such as single-label tasks (where each document belongs to exactly one class) and binary classification tasks.

Note: Training classification models is CPU and memory intensive. Depending on the size of your training data, the DO + NLP Runtime 22.1 on Python 3.9 environment might not be large enough to complete the training. If you run into issues with the notebook kernel during training, create a custom notebook environment with a larger amount of CPU and memory, and use that to run your notebook. See Creating your own environment template.


Input data format for training

Classification blocks accept training data in CSV and JSON formats.

  • The CSV Format

    The CSV file should contain no header. Each row in the CSV file represents an example record. Each record has one or more columns, where the first column represents the text and the subsequent columns represent the labels associated with that text.

    Note:

    • The SVM and CNN algorithms do not support training data where an instance has no labels. So, if you are using the SVM algorithm, or the CNN algorithm, or an Ensemble including one of these algorithms, each CSV row must have at least one label, i.e., 2 columns.
    • The BERT algorithm supports training data where each instance has 0, 1 or more than one label.

      Example 1,label 1
      Example 2,label 1,label 2
      
  • The JSON Format

    The training data is represented as an array with multiple JSON objects. Each JSON object represents one training instance, and must have a text and a labels field. The text represents the training example, and labels stores the labels associated with the example (0, 1, or more than one label).

      [        
          {
          "text": "Example 1",
          "labels": ["label 1"]
          },
          {
          "text": "Example 2",
          "labels": ["label 1", "label 2"]
          },
          {
          "text": "Example 3",
          "labels": []
          }
      ]
    

    Note:

    • "labels": [] denotes an example with no labels. The SVM and CNN algorithms do not support training data where an instance has no labels. So, if you are using the SVM algorithm, or the CNN algorithm, or an Ensemble including one of these algorithms, each JSON object must have at least one label.
    • The BERT algorithm supports training data where each instance has 0, 1 or more than one label.
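The two formats carry the same information, so one can be derived from the other. The following is a minimal sketch, using only the Python standard library, that converts headerless CSV rows into the JSON record layout described above and reports records without labels (which matter if you plan to train SVM, CNN, or an Ensemble). The helper names csv_rows_to_records and unlabeled_indices are hypothetical and are not part of the Watson NLP library.

```python
import csv
import io
import json


def csv_rows_to_records(csv_text):
    """Convert headerless CSV training data into the JSON record layout."""
    records = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        # First column is the text, remaining non-empty columns are labels.
        text, labels = row[0], [label for label in row[1:] if label]
        records.append({"text": text, "labels": labels})
    return records


def unlabeled_indices(records):
    """Return the positions of records that carry no labels."""
    return [i for i, rec in enumerate(records) if not rec["labels"]]


sample_csv = "Example 1,label 1\nExample 2,label 1,label 2\nExample 3\n"
records = csv_rows_to_records(sample_csv)
print(json.dumps(records, indent=2))
print("Records without labels:", unlabeled_indices(records))
```

Any record reported by unlabeled_indices would need a label (or would need to be removed) before training SVM or CNN on the data.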

Input data requirements

For SVM and CNN algorithms:

  • Minimum number of unique labels required: 2
  • Minimum number of text examples required per label: 5

For the BERT algorithm:

  • Minimum number of unique labels required: 1
  • Minimum number of text examples required per label: 5

Note that the training data in CSV or JSON format is converted to a DataStream before training. Instead of training data files, you can also pass data streams directly to the training functions of classification blocks.
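Before starting a long training run, it can help to check the minimums listed above. The following sketch assumes the training data has already been loaded as a list of records in the JSON layout; check_requirements is a hypothetical helper, not a Watson NLP function.

```python
from collections import Counter


def check_requirements(records, min_labels, min_examples_per_label=5):
    """Return a list of human-readable problems; an empty list means the checks pass."""
    counts = Counter(label for rec in records for label in rec["labels"])
    problems = []
    if len(counts) < min_labels:
        problems.append(
            f"found {len(counts)} unique label(s), need at least {min_labels}"
        )
    for label, n in sorted(counts.items()):
        if n < min_examples_per_label:
            problems.append(
                f"label '{label}' has {n} example(s), need at least {min_examples_per_label}"
            )
    return problems


# Illustrative data: five examples of label "a", three of label "b".
records = [{"text": f"text {i}", "labels": ["a"]} for i in range(5)]
records += [{"text": f"other {i}", "labels": ["b"]} for i in range(3)]

# Use min_labels=2 for SVM and CNN, min_labels=1 for BERT.
for problem in check_requirements(records, min_labels=2):
    print(problem)
```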

Stopwords

You can provide your own stopwords, which are removed during preprocessing. Stopwords can be provided either as a list or as a file. A stopwords file is expected in a standard format: a single text file with one phrase per line.

Stopwords can be used only with the Ensemble classifier.
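As a minimal sketch of the file format, the following reads a stopwords file with one phrase per line into a Python list. The helper load_stopwords and the file name are assumptions made for illustration; blank lines and surrounding whitespace are dropped so that each list entry is a clean phrase.

```python
import tempfile
from pathlib import Path


def load_stopwords(path):
    """Read one stopword phrase per line, dropping blank lines and surrounding whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Demonstration with a throwaway file; the file name is a placeholder.
with tempfile.TemporaryDirectory() as tmp_dir:
    stopwords_file = Path(tmp_dir) / "my_stopwords_en.txt"
    stopwords_file.write_text("the\nas well as\n\n  of  \n", encoding="utf-8")
    stopwords = load_stopwords(stopwords_file)

print(stopwords)
```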

Training SVM algorithms

SVM is a support vector machine classifier that can be trained on any kind of feature vectors produced by the embedding or vectorization blocks, for example, USE (Universal Sentence Encoder) embeddings or TF-IDF vectors. It supports multi-class and multi-label text classification and produces confidence scores via Platt Scaling.
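For readers unfamiliar with Platt Scaling: it maps a raw SVM decision value f to a probability with a fitted sigmoid, P(y=1 | f) = 1 / (1 + exp(A*f + B)). The sketch below only illustrates the formula. The parameter values are made up, and how the Watson NLP SVM block fits A and B internally is not described here.

```python
import math


def platt_probability(decision_value, a, b):
    """Platt scaling: P(y=1 | f) = 1 / (1 + exp(a * f + b))."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


# Illustrative parameters only; in practice a and b are fitted on held-out data.
# a is negative so that larger decision values map to higher confidence.
A, B = -1.0, 0.0

for f in (-2.0, 0.0, 2.0):
    print(f"decision value {f:+.1f} -> confidence {platt_probability(f, A, B):.3f}")
```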

For all options that are available for configuring SVM training, enter:

help(watson_nlp.blocks.classification.svm.SVM.train)

To train SVM algorithms:

  1. Begin with these preprocessing steps:

     import watson_nlp
     from watson_core.data_model.streams.resolver import DataStreamResolver
     from watson_nlp.blocks.classification.svm import SVM
    
     training_data_file = "<ADD TRAINING DATA FILE PATH>"
    
     # Create datastream from training data
     data_stream_resolver = DataStreamResolver(target_stream_type=list, expected_keys={'text': str, 'labels': list})
     training_data = data_stream_resolver.as_data_stream(training_data_file)
    
     # Load a Syntax model
     syntax_model = watson_nlp.load(watson_nlp.download('syntax_izumo_en_stock'))
    
     # Create Syntax stream
     text_stream, labels_stream = training_data[0], training_data[1]
     syntax_stream = syntax_model.stream(text_stream)
    
  2. Train the classification model using USE embeddings. See Pretrained USE embeddings available out-of-the-box for a list of the pretrained blocks that are available.

     # download embedding
     use_embedding_model = watson_nlp.load(watson_nlp.download('embedding_use_en_stock')) 
    
     use_train_stream = use_embedding_model.stream(syntax_stream, doc_embed_style='raw_text')
     # NOTE: doc_embed_style can be changed to `avg_sent` as well. For more information check the documentation for Embeddings
     # Or the USE run function API docs
     use_svm_train_stream = watson_nlp.data_model.DataStream.zip(use_train_stream, labels_stream)
    
     # Train SVM using Universal Sentence Encoder (USE) training stream
     classification_model = SVM.train(use_svm_train_stream)
    

Pretrained USE embeddings available out-of-the-box

USE embeddings are wrappers around Google Universal Sentence Encoder embeddings available in TFHub. These embeddings are used in the document classification SVM algorithm.

The following table lists the pretrained blocks for USE embeddings that are available and the languages that are supported. For a list of the language codes and the corresponding language, see Language codes.

List of pretrained USE embeddings with their supported languages
Block name    Model name                   Supported languages
use           embedding_use_en_stock       English only
use           embedding_use_multi_small    ar, de, en, es, fr, it, ja, ko, nl, pl, pt, ru, tr, zh-cn, zh-tw
use           embedding_use_multi_large    ar, de, en, es, fr, it, ja, ko, nl, pl, pt, ru, tr, zh-cn, zh-tw

When using USE embeddings, consider the following:

  • Choose embedding_use_en_stock if your task involves English text.
  • Choose one of the multilingual USE embeddings if your task involves text in a non-English language, or you want to train multilingual models.
  • The USE embeddings exhibit different trade-offs between quality of the trained model and throughput at inference time, as described below. Try different embeddings to decide the trade-off between quality of result and inference throughput that is appropriate for your use case.

    • embedding_use_multi_small has reasonable quality and is fast at inference time
    • embedding_use_en_stock is an English-only version of embedding_use_multi_small, hence it is smaller and exhibits higher inference throughput
    • embedding_use_multi_large is based on the Transformer architecture, and therefore provides higher quality of result, with lower throughput at inference time

Training the CNN algorithm

CNN is a simple convolutional network architecture, built for multi-class and multi-label text classification on short texts. It utilizes GloVe embeddings. GloVe embeddings encode word-level semantics into a vector space. The GloVe embeddings for each language are trained on the Wikipedia corpus in that language. For information on using GloVe embeddings, see the open source GloVe embeddings documentation.

For all the options that are available for configuring CNN training, enter:

help(watson_nlp.blocks.classification.cnn.CNN.train)


To train CNN algorithms:

import watson_nlp
from watson_core.data_model.streams.resolver import DataStreamResolver
from watson_nlp.blocks.classification.cnn import CNN

training_data_file = "<ADD TRAINING DATA FILE PATH>"

# Create datastream from training data
data_stream_resolver = DataStreamResolver(target_stream_type=list, expected_keys={'text': str, 'labels': list})
training_data = data_stream_resolver.as_data_stream(training_data_file)

# Load a Syntax model
syntax_model = watson_nlp.load(watson_nlp.download('syntax_izumo_en_stock'))

# Create Syntax stream
text_stream, labels_stream = training_data[0], training_data[1]
syntax_stream = syntax_model.stream(text_stream)

# Download GloVe embeddings
glove_embedding_model = watson_nlp.load(watson_nlp.download('embedding_glove_en_stock'))

# Train CNN
classification_model = CNN.train(watson_nlp.data_model.DataStream.zip(syntax_stream, labels_stream), embedding=glove_embedding_model.embedding)

Training the multilingual BERT algorithm

BERT is a transformer-based architecture, built for multi-class and multi-label text classification on short texts. It utilizes Multilingual BERT pretrained models.

For all the options available for configuring BERT training, enter:

help(watson_nlp.blocks.classification.bert.BERT.train)


To train BERT algorithms:

import watson_nlp
from watson_nlp.blocks.classification.bert import BERT
from watson_core.data_model.streams.resolver import DataStreamResolver
training_data_file = "<ADD TRAINING DATA FILE PATH>"

# create datastream from training data
data_stream_resolver = DataStreamResolver(target_stream_type=list, expected_keys={'text': str, 'labels': list})
train_stream = data_stream_resolver.as_data_stream(training_data_file)

# Load pre-trained BERT model
pretrained_model_resource = watson_nlp.download_and_load('pretrained-model_bert_multi_bert_multi_uncased')

# Train model
classification_model = BERT.train(train_stream, pretrained_model_resource)

Training an Ensemble model

The Ensemble model is a weighted ensemble of these three algorithms: CNN, SVM with TF-IDF and SVM with USE. It computes the weighted mean of a set of classification predictions using confidence scores. The ensemble model is very easy to use.
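To make the "weighted mean of confidence scores" concrete, the following sketch combines per-label confidence scores from three classifiers. It illustrates the arithmetic only and is not the library's implementation; the function name, the scores, and the weights are all made up.

```python
def weighted_mean_scores(predictions, weights):
    """Combine per-label confidence scores from several models by weighted mean."""
    total = sum(weights)
    labels = sorted(set().union(*predictions))  # every label seen by any model
    return {
        label: sum(w * p.get(label, 0.0) for p, w in zip(predictions, weights)) / total
        for label in labels
    }


# Illustrative scores from the three ensemble members.
cnn = {"sports": 0.6, "politics": 0.4}
svm_tfidf = {"sports": 0.2, "politics": 0.8}
svm_use = {"sports": 0.7, "politics": 0.3}

combined = weighted_mean_scores([cnn, svm_tfidf, svm_use], weights=[1, 1, 2])
print(combined)
```

With these weights, the SVM with USE counts twice as much as each of the other two members, so its preference for "sports" outweighs the TF-IDF model's preference for "politics".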

For all of the options available for configuring Ensemble training, enter:

help(watson_nlp.workflows.classification.Ensemble.train)

See Pretrained stopword model available out-of-the-box for the available pretrained model and its supported languages.

To train Ensemble algorithms:

import watson_nlp
from watson_nlp.workflows.classification import Ensemble

training_data_file = "<ADD TRAINING DATA FILE PATH>"

# Load predefined stopwords
stopwords = watson_nlp.load(watson_nlp.download('text_stopwords_classification_ensemble_en_stock'))

# Note: If stopwords are available as a single text file with one phrase per line, they can instead be read in using:
# with open('my_stopwords_en.txt') as f:
#    stopwords = f.read().splitlines()  # splitlines() drops the trailing newline that readlines() would keep

# Train the classifier
# Set the `use_ewl` flag to True to learn the weights automatically.
ensemble_classifier_model = Ensemble.train(training_data_file, 'syntax_izumo_en_stock', 'embedding_glove_en_stock', 'embedding_use_en_stock', stopwords=stopwords, use_ewl=True)

Pretrained stopword model available out-of-the-box

The text model for identifying stopwords is used in training the document classification ensemble model.

The following table lists the pretrained stopword model and the language codes that are supported (xx stands for the language code). For a list of the language codes and the corresponding language, see Language codes.

List of pretrained stopword model with its supported languages
Resource class    Model name                                         Supported languages
text              text_stopwords_classification_ensemble_xx_stock    ar, de, es, en, fr, it, ja, ko

Training best practices

There are certain constraints on the quality and quantity of data to ensure that classification model training completes in a reasonable amount of time and meets various performance criteria. These are listed below. Note that none are hard restrictions. However, the further you deviate from these guidelines, the greater the chance that the model fails to train or that the resulting model is not satisfactory.

  • Data quantity

    • The highest number of classes that the classification model has been tested on is ~1200.
    • The best suited text size for training and testing data for classification is around 3000 code points. Larger texts can also be processed, but the runtime performance might be slower.
    • Training time increases with the number of examples and the number of labels.
    • Inference time increases with the number of labels.
  • Data quality

    • Size of each sample (for example, number of phrases in each training sample) can affect quality.
    • Class separation is important. In other words, classes among the training (and test) data should be semantically distinguishable from one another in order to avoid misclassifications. Since the classifier algorithms in Watson Natural Language Processing rely on word embeddings, training classes that contain text examples with too much semantic overlap may make high-quality classification computationally intractable. While more sophisticated heuristics may exist for assessing the semantic similarity between classes, you should start with a simple "eye test" of a few examples from each class to discern whether or not they seem adequately separated.
    • It is recommended to use balanced data for training. Ideally there should be roughly equal numbers of examples from each class in the training data, otherwise the classifiers may be biased towards classes with larger representation in the training data.
    • It is best to avoid circumstances where some classes in the training data are highly under-represented as compared to other classes.
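A quick way to check the balance recommendation is to count examples per label before training. The following sketch assumes records in the JSON layout; the helper names and the data are illustrative only, and what counts as an acceptable ratio is a judgment call for your use case.

```python
from collections import Counter


def label_distribution(records):
    """Count how many training examples carry each label."""
    return Counter(label for rec in records for label in rec["labels"])


def imbalance_ratio(counts):
    """Ratio between the most and least frequent label (1.0 means perfectly balanced)."""
    return max(counts.values()) / min(counts.values())


# Illustrative data: "billing" is heavily over-represented.
records = [{"text": f"invoice question {i}", "labels": ["billing"]} for i in range(8)]
records += [{"text": f"parcel question {i}", "labels": ["shipping"]} for i in range(2)]

counts = label_distribution(records)
print(counts.most_common())
print("imbalance ratio:", imbalance_ratio(counts))
```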

Limitations and caveats:

  • The BERT classification block has a predefined sequence length of 128 code points. This can be configured at train time by changing the parameter max_seq_length, up to a maximum value of 512. This means that the BERT classification block can only be used to classify short text. Text longer than max_seq_length is truncated, and the excess is discarded, during classification training and inference.
  • The CNN classification block has a predefined sequence length of 1000 code points. This limit can be configured at train time by changing the parameter max_phrase_len. There is no maximum limit for this parameter, but increasing the maximum phrase length will affect CPU and memory consumption.
  • SVM blocks do not have such a limit on sequence length and can be used with longer texts.
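To see how much of a data set would be affected by these limits, you can count the texts that exceed a given length. This is a rough sketch: it relies on Python's len() counting Unicode code points for str values, and it takes the limits exactly as stated above, which may not match how each block measures length internally. The helper name texts_over_limit is hypothetical.

```python
def texts_over_limit(texts, max_code_points):
    """Return (index, length) for every text longer than the given code point limit."""
    return [(i, len(t)) for i, t in enumerate(texts) if len(t) > max_code_points]


# Illustrative texts of 5, 130 and 1200 code points.
texts = ["short", "x" * 130, "y" * 1200]

print("over BERT default (128):", texts_over_limit(texts, 128))
print("over CNN default (1000):", texts_over_limit(texts, 1000))
```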

Applying the model on new data

After you have trained the model on a data set, apply the model on new data using the run() method, as you would use on any of the existing pre-trained blocks.

Sample code

  • For the Ensemble and BERT models, for example for Ensemble:

      # run Ensemble model on new text 
      ensemble_prediction = ensemble_classifier_model.run("new input text")
    
  • For SVM and CNN models, for example for CNN:

      # run Syntax model first 
      syntax_result = syntax_model.run("new input text") 
      # run CNN model on top of syntax result 
      cnn_prediction = cnn_classification_model.run(syntax_result)
    

Choosing the right algorithm for your use case

You need to choose the model algorithm that best suits your use case.

When choosing between SVM, CNN and BERT consider the following:

  • BERT

    • Choose when high quality is required and higher computing resources are available.
  • CNN

    • Choose when a reasonably large data set is available
    • Choose if GloVe embedding is available for the required language
    • Choose if you have the option between single label versus multi-label
    • CNN fine tunes embeddings, so it could give better performance for unknown terms or newer domains.
  • SVM

    • Choose if an easier and simpler model is required
    • SVM has the fastest training and inference time
    • Choose if your data set size is small

If you select SVM, you need to consider the following when choosing between the various implementations of SVM:

  • SVMs train multi-label classifiers.
  • The larger the number of classes, the longer the training time.
  • TF-IDF:
    • Choose TF-IDF vectorization with SVM if the data set is small, i.e. has a small number of classes, a small number of examples and shorter text size, for example, sentences containing fewer phrases.
    • TF-IDF with SVM can be faster than other algorithms in the classification block.
    • Choose TF-IDF if embeddings for the required language are not available.
  • USE:
    • Choose Universal Sentence Encoder (USE) with SVM if the data set has one or more sentences in input text.
    • USE can perform better on data sets where understanding the context of words or sentences is important.
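For intuition about what TF-IDF vectorization captures, the following is a minimal pure-Python sketch of one basic, unsmoothed variant: term frequency multiplied by log(N / document frequency). It is not the Watson NLP vectorizer, whose exact weighting is not described here, and whitespace splitting stands in for real tokenization.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute basic TF-IDF weights: (term count / doc length) * log(N / doc frequency)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return vectors


docs = ["refund my order", "track my order", "reset my password"]
vectors = tf_idf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, "->", {term: round(weight, 3) for term, weight in vec.items()})
```

Terms that occur in every document (here "my") get a weight of zero, while terms specific to one document get the highest weights. Unlike USE embeddings, this representation ignores word order and context.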

The Ensemble model combines multiple individual (diverse) models together to deliver superior prediction power. Consider the following key data for this model type:

  • The ensemble model combines CNN, SVM with TF-IDF and SVM with USE.
  • It is the easiest model to use.
  • It can give better performance than the individual algorithms.
  • It works for all kinds of data sets. However, training time for large datasets (more than 20000 examples) can be high.
  • An ensemble model allows you to set weights. These weights decide how the ensemble model combines the results of the individual classifiers. Currently, the selection of weights is heuristic and needs to be done by trial and error. The default weights that are provided in the function itself are a good starting point for the exploration.
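The trial-and-error selection of weights can be organized as a small search over candidate weights, scored on held-out examples. The sketch below works on made-up confidence scores and assumes a single-label task where the top-scoring label is the prediction. It does not call the Ensemble API, and all names in it are hypothetical.

```python
from itertools import product


def combine(predictions, weights):
    """Weighted mean of per-label confidence scores from several models."""
    total = sum(weights)
    labels = sorted(set().union(*predictions))
    return {
        label: sum(w * p.get(label, 0.0) for p, w in zip(predictions, weights)) / total
        for label in labels
    }


def accuracy(examples, weights):
    """Fraction of held-out examples whose top-scoring combined label matches the gold label."""
    correct = 0
    for predictions, gold in examples:
        scores = combine(predictions, weights)
        if max(scores, key=scores.get) == gold:
            correct += 1
    return correct / len(examples)


def search_weights(examples, candidates=(1, 2, 3)):
    """Try every combination of candidate weights for three models and keep the best."""
    best = max(product(candidates, repeat=3), key=lambda w: accuracy(examples, w))
    return best, accuracy(examples, best)


# Made-up held-out data: ([scores from model 1, model 2, model 3], gold label).
examples = [
    ([{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}, {"a": 0.3, "b": 0.7}], "a"),
    ([{"a": 0.1, "b": 0.9}, {"a": 0.8, "b": 0.2}, {"a": 0.7, "b": 0.3}], "b"),
]

best_weights, best_accuracy = search_weights(examples)
print("equal weights accuracy:", accuracy(examples, (1, 1, 1)))
print("best weights:", best_weights, "accuracy:", best_accuracy)
```

In this toy data the first model is the only reliable one, so the search favors weight combinations that emphasize it. On real data, evaluate on a held-out set that is large enough for the differences between weightings to be meaningful.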

Parent topic: Creating your own models