Preparing text for training a Natural Language Classifier model

You can train an IBM Watson Natural Language Classifier model to classify text according to classes you define. You need to define a minimum of two classes, with at least three text examples in each class, and then upload your training data to Watson Studio in a .csv file.

 

Procedure

  1. Collect sample text for your classes in a .csv file
  2. Upload your .csv file to your project

 

1. Collect sample text for your classes in a .csv file

For each class you want your model to recognize, collect at least three text examples.

.csv file requirements

  • The .csv file must contain a minimum of six lines (three text samples for each of two classes).
  • The .csv file can contain a maximum of 20,000 lines.
  • The file must be saved with UTF-8 encoding.
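
The requirements above can be checked locally before you upload the file. The following is a minimal Python sketch, not part of Watson Studio: the function name check_training_csv and the wording of the messages are hypothetical, and it assumes the layout described under ".csv file structure" (sample text in the first column, class names in the remaining columns).

```python
import csv
import io
from collections import Counter

MIN_LINES = 6                # three samples for each of two classes
MAX_LINES = 20000
MIN_CLASSES = 2
MIN_EXAMPLES_PER_CLASS = 3


def check_training_csv(data: bytes) -> list:
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper for illustration only.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]

    # Skip completely blank lines; csv.reader yields [] for them.
    rows = [row for row in csv.reader(io.StringIO(text, newline="")) if row]

    problems = []
    if len(rows) < MIN_LINES:
        problems.append(f"only {len(rows)} lines; at least {MIN_LINES} required")
    if len(rows) > MAX_LINES:
        problems.append(f"{len(rows)} lines; at most {MAX_LINES} allowed")

    for number, row in enumerate(rows, start=1):
        if len(row) < 2:
            problems.append(f"line {number} has no class name")

    # Count how many samples mention each class (columns after the first).
    counts = Counter(name for row in rows for name in row[1:])
    if len(counts) < MIN_CLASSES:
        problems.append(f"only {len(counts)} classes; at least {MIN_CLASSES} required")
    for name, count in counts.items():
        if count < MIN_EXAMPLES_PER_CLASS:
            problems.append(
                f"class {name!r} has {count} samples; "
                f"at least {MIN_EXAMPLES_PER_CLASS} required"
            )
    return problems
```

To check a file on disk, read it in binary mode and pass the bytes to the function, for example check_training_csv(open("training.csv", "rb").read()), where training.csv is a placeholder file name.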

.csv file structure

  • Each line in the file contains: the sample text, a comma, at least one class name, and an end-of-line character.
  • In those rare cases where more than one class is specified for one sample text, separate the class names with commas.
  • The maximum length of one text sample is 1024 characters.
  • Tabs and new lines in sample text must be backslash escaped: a tab as \t, and a new line as \r, \n or \r\n.
  • If the sample text contains a comma, the sample text must be surrounded by double quotation marks.
  • If the sample text contains both a comma and double quotation marks, surround the sample text with double quotation marks and escape each double quotation mark inside it by doubling it ("").
  • Class names cannot include tabs or end-of-line characters.
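
The structure rules above can be applied programmatically when you generate the file. The following is a minimal Python sketch that uses the standard csv module; the helper names escape_whitespace and format_line are hypothetical. It assumes that the 1024-character limit applies to the text as it is stored in the file (after escaping), which the rules above do not state. Note that the csv module also quotes a sample that contains double quotation marks but no comma, which is standard CSV behavior.

```python
import csv
import io

MAX_TEXT_LENGTH = 1024


def escape_whitespace(text: str) -> str:
    """Backslash-escape tabs and new lines so each sample stays on one line."""
    return text.replace("\t", "\\t").replace("\r", "\\r").replace("\n", "\\n")


def format_line(text: str, classes: list) -> str:
    """Build one line: sample text, a comma, one or more class names, end of line.

    Hypothetical helper for illustration only.
    """
    if not classes:
        raise ValueError("at least one class name is required")
    for name in classes:
        if any(ch in name for ch in "\t\r\n"):
            raise ValueError(f"class name {name!r} contains a tab or end-of-line character")

    escaped = escape_whitespace(text)
    # Assumption: the limit is checked against the escaped text.
    if len(escaped) > MAX_TEXT_LENGTH:
        raise ValueError(f"sample text exceeds {MAX_TEXT_LENGTH} characters")

    buffer = io.StringIO()
    # QUOTE_MINIMAL adds surrounding quotation marks only when needed and
    # doubles any quotation marks inside the field.
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow([escaped, *classes])
    return buffer.getvalue()
```

For example, format_line("Hi, how are you?", ["hi"]) returns the line "Hi, how are you?",hi with the sample text enclosed in double quotation marks because it contains a comma. When you write the lines to disk, open the file with encoding="utf-8".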

Example

.csv file

This example image demonstrates how to format the .csv file:

  • Lines 1-5: Sample text for the class "hi".
  • Line 2: The sample text contains a comma, so the sample text is surrounded by double quotation marks.
  • Line 6: The sample text contains a comma and double quotation marks, so the entire sample text is enclosed in double quotation marks, and the double quotation marks are escaped.
  • Lines 6-8: Sample text for the class "problem".
  • Line 9: The sample text belongs in both the "problem" class and the "question" class.
  • Lines 10-12: Sample text for the class "question".

 

2. Upload your .csv file to your project

From your project Assets page or from within the Natural Language Classifier model builder in Watson Studio, upload your .csv file using the data panel. If the data panel isn't open, open it by clicking the Find and add data icon.

 

Next steps