Text training data format for building models in NeuNetS

The NeuNetS tool in IBM Watson Studio synthesizes a neural network and trains it on your training data without you having to design or build anything by hand. This topic describes how to organize and format text training data.

 

To train a text classifier in the NeuNetS tool, you must provide a file named train.tsv that contains your labeled sample text.

 

Training file format

  • Each row of the file must contain: sample text, a tab, and then a class name
  • The file must be saved with UTF-8 encoding
  • Tabs and new lines in sample text must be backslash escaped: a tab as \t, and a new line as \r, \n or \r\n
  • Do not include a header row in the file, as you might in other tab-delimited files
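
The formatting rules above can be sketched in Python. This is a minimal illustration, not part of the NeuNetS tool: the helper names escape_text, format_row, and write_training_file are hypothetical, rows are assumed to end with a single \n, and backslashes that already appear in the sample text are left unchanged because this topic does not say how they should be handled.

```python
def escape_text(text):
    """Replace literal tabs and line breaks in sample text with backslash escapes."""
    # Handle \r\n first so that it becomes the four characters \r\n
    # rather than being escaped as two separate line breaks.
    return (
        text.replace("\r\n", "\\r\\n")
        .replace("\r", "\\r")
        .replace("\n", "\\n")
        .replace("\t", "\\t")
    )


def format_row(text, label):
    """Build one row: escaped sample text, a tab, then the class name."""
    if any(ch in label for ch in "\t\r\n"):
        raise ValueError("class name must not contain tabs or line breaks")
    return escape_text(text) + "\t" + label


def write_training_file(samples, path="train.tsv"):
    """Write (text, label) pairs as UTF-8 rows with no header row."""
    # newline="\n" stops Python from translating row endings on Windows
    # (assumption: a plain \n row terminator is acceptable).
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for text, label in samples:
            f.write(format_row(text, label) + "\n")
```

For example, write_training_file([("Finished class where are you.", "ham")]) would produce a one-row train.tsv in the current directory.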

 

Requirements and limits

Table 1. Requirements for training file, train.tsv

Requirement | Minimum | Maximum | Notes
File size | -- | 5 GB | --
Sample text length | -- | Each text sample can contain up to 500 words | Any text sample that contains more than 500 words is truncated at the limit
Number of classes | 2 | Limited only by the 5 GB total training set restriction | --
Number of text samples in each class | 100 | Limited only by the 5 GB total training set restriction | A minimum of 250 text samples in each class is recommended for best performance
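
Before uploading, it can help to check a file against the limits in Table 1. The following is a rough sketch under stated assumptions, not a NeuNetS utility: check_rows and check_file are hypothetical names, 5 GB is taken to mean 5 * 1024**3 bytes, and words are counted by splitting on whitespace, which might differ from how NeuNetS counts them.

```python
import os
from collections import Counter

MAX_FILE_BYTES = 5 * 1024**3   # assumption: "5 GB" interpreted as 5 GiB
MAX_WORDS = 500                # longer samples are truncated
MIN_CLASSES = 2
MIN_SAMPLES_PER_CLASS = 100
RECOMMENDED_SAMPLES_PER_CLASS = 250


def check_rows(rows):
    """Return a list of messages for rows that miss the stated requirements."""
    messages = []
    counts = Counter()
    for number, line in enumerate(rows, start=1):
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) != 2 or not parts[0] or not parts[1]:
            messages.append(f"line {number}: expected sample text, one tab, class name")
            continue
        text, label = parts
        # Assumption: a "word" is a whitespace-separated token.
        if len(text.split()) > MAX_WORDS:
            messages.append(f"line {number}: more than {MAX_WORDS} words, will be truncated")
        counts[label] += 1
    if len(counts) < MIN_CLASSES:
        messages.append(f"found {len(counts)} classes, need at least {MIN_CLASSES}")
    for label, count in sorted(counts.items()):
        if count < MIN_SAMPLES_PER_CLASS:
            messages.append(f"class {label!r}: {count} samples, minimum is {MIN_SAMPLES_PER_CLASS}")
        elif count < RECOMMENDED_SAMPLES_PER_CLASS:
            messages.append(
                f"class {label!r}: {count} samples, "
                f"{RECOMMENDED_SAMPLES_PER_CLASS} recommended"
            )
    return messages


def check_file(path="train.tsv"):
    """Check the file size, then check every row of the file."""
    messages = []
    if os.path.getsize(path) > MAX_FILE_BYTES:
        messages.append("file is larger than 5 GB")
    with open(path, encoding="utf-8") as f:
        messages.extend(check_rows(f))
    return messages
```

An empty list from check_file("train.tsv") means that none of these checks found a problem; it does not guarantee that the tool will accept the file.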

 

Example

The sample training data set, UCI: SMS Spam Collection - NeuNetS compatible format, includes sample text messages in two classes:

  • "ham" - real messages from one person to another
  • "spam" - inappropriate or unsolicited advertising

Here is a subset of that sample data:

K. Did you call me just now ah? ham
Finished class where are you. ham
Haha awesome, be there in a minute  ham
WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! spam
Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free!  spam
"XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here spam