fastText Tutorial: learn how to train and test a supervised text classifier using fastText, and check the precision and recall values of the generated model.
This tutorial uses the fastText command line tool for a small text classification example. You will prepare labelled training data, build a supervised classifier, test it with a separate file, read the P@1 and R@1 values, and then try predictions on new text.
Train and test a supervised text classifier using fastText
Text classification is an important NLP (Natural Language Processing) task with applications such as document classification, sentiment analysis, email spam classification, and tweet classification.
fastText provides a “supervised” mode to build a text classification model with supervised learning. In supervised mode, each training line contains one or more labels and the text that belongs to those labels. The model learns from the labelled examples and predicts the most likely label for unseen text.
To use the fastText command line tool, build it from source first. For the build steps, follow the fastText Tutorial – How to build FastText library from github source. Once fastText is built, run the commands in this tutorial from the directory that contains the fasttext executable.
For the command-line interface and supported supervised training options, you may also refer to the official fastText supervised tutorial at fasttext.cc/docs/en/supervised-tutorial.html.
fastText labelled training file format for supervised classification
Prepare a text file in which each line is one example. Put the labels at the start of the line, and prefix each label name with “__label__” (two underscores, the word label, two underscores).
An example entry is shown below.
__label__wish Good Morning
where
wish is the label, and
Good Morning is the text of the example.
An entry can carry more than one label, as shown below.
__label__wish __label__question Good Morning. Did you have break-fast ?
For reliable training, keep these formatting rules consistent in both the training file and the test file.
- Write one training example on one line.
- Keep the label prefix exactly as __label__, unless you deliberately use a custom label prefix in fastText options.
- Do not put spaces inside a label name. Use __label__spam_email instead of __label__spam email.
- Use the same text cleaning rules for training, testing, and prediction. For example, do not lowercase only the test file if the training file keeps mixed case.
- Avoid blank lines because they add no useful training example.
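The formatting rules above can be checked before training. The following is a minimal Python sketch, not part of fastText: the function names parse_line and find_problems are hypothetical, and the sketch assumes the default __label__ prefix with labels only at the start of a line, as this tutorial recommends.

```python
LABEL_PREFIX = "__label__"  # fastText's default label prefix


def parse_line(line):
    """Split one labelled line into (labels, text)."""
    tokens = line.split()
    labels = []
    # Labels are the leading tokens that start with the prefix.
    while tokens and tokens[0].startswith(LABEL_PREFIX):
        labels.append(tokens.pop(0)[len(LABEL_PREFIX):])
    return labels, " ".join(tokens)


def find_problems(lines):
    """Return (line_number, message) pairs for lines that break the rules."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        labels, text = parse_line(line)
        if not labels:
            problems.append((number, "no label at start of line"))
        elif "" in labels:
            problems.append((number, "empty label name"))
        if labels and not text:
            problems.append((number, "label without text"))
    return problems


sample = [
    "__label__wish Good Morning",
    "__label__wish __label__question Good Morning. Did you have break-fast ?",
    "",
    "Good Evening",
]
print(parse_line(sample[1]))
print(find_problems(sample))
```

A check like this cannot detect a space inside a label name: __label__spam email parses as the label spam followed by the text email, so that rule still needs a manual review.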
Sample fastText trainingData.txt for greet, wish and question labels
Prepare a text file containing many such entries to train a supervised text classifier with fastText.
__label__greet Good Morning
__label__greet Good Evening
__label__greet Good Day
__label__greet Good Afternoon
__label__greet All the best
__label__greet Good luck
__label__greet Happy Birthday
__label__greet Happy Journey
__label__wish __label__question Good Morning. Did you have break-fast ?
__label__question When did you come ?
__label__question When did you reach office ?
__label__question Where did you go in the morning ?
__label__question What did you bring for lunch ?
This example is intentionally small so that the command and output are easy to understand. For a useful classifier, prepare many more examples per label and keep a separate test file that was not used for training.
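To see how many examples each label has, the sample file can be summarised with a few lines of Python. This is only an illustrative sketch: count_labels is a hypothetical helper, and the training lines are pasted into the script so that it runs without any files.

```python
from collections import Counter

# The sample trainingData.txt from this tutorial, pasted inline.
TRAINING_LINES = """\
__label__greet Good Morning
__label__greet Good Evening
__label__greet Good Day
__label__greet Good Afternoon
__label__greet All the best
__label__greet Good luck
__label__greet Happy Birthday
__label__greet Happy Journey
__label__wish __label__question Good Morning. Did you have break-fast ?
__label__question When did you come ?
__label__question When did you reach office ?
__label__question Where did you go in the morning ?
__label__question What did you bring for lunch ?
""".splitlines()


def count_labels(lines, prefix="__label__"):
    """Count how many lines carry each label."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            if token.startswith(prefix):
                counts[token[len(prefix):]] += 1
    return counts


counts = count_labels(TRAINING_LINES)
for label, count in counts.most_common():
    print(label, count)
```

For this sample the counts are 8 for greet, 5 for question, and 1 for wish, so the wish label has far too few examples to be learnt reliably.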
Train the fastText supervised classifier from the command line
Run the following command to train a supervised classifier with trainingData.txt as the input and supervised_classifier_model as the name of the output model.
$ ./fasttext supervised -input trainingData.txt -output supervised_classifier_model
Read 0M words
Number of words: 32
Number of labels: 3
Progress: 100.0% words/sec/thread: 204861 lr: 0.000000 loss: 0.917794 eta: 0h0m
- Number of words represents the number of unique words read from the training data after fastText tokenization.
- Number of labels represents the number of unique labels found in the training data.
- words/sec/thread is the number of words processed per second per thread during training.
- loss is the training loss shown at the end of this run. A lower loss can be useful during training, but model quality should be judged on separate test data.
- supervised_classifier_model.bin is the model file generated by training the supervised classifier. fastText also writes supervised_classifier_model.vec, which holds the learned word vectors in text form.
In the command above, the $ sign is only the shell prompt. If your terminal does not use it, copy the command after $.
After the basic command works, common options used while tuning a fastText supervised classifier are -epoch, -lr, and -wordNgrams. The following command is an example of how such options are passed to fastText.
./fasttext supervised -input trainingData.txt -output supervised_classifier_model -epoch 25 -lr 0.5 -wordNgrams 2
Do not tune on the final test file. Keep a validation file for trying different options, and use the test file only for the final evaluation.
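One way to get separate training, validation and test files is to shuffle the labelled lines once and cut them into three parts. The sketch below is an assumption about how you might do this, not a fastText feature: split_lines, the 10% fractions, and the seed are all hypothetical choices, and the input lines are stand-in data.

```python
import random


def split_lines(lines, valid_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle labelled lines and cut them into train, validation and test lists."""
    shuffled = list(lines)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_valid = int(len(shuffled) * valid_fraction)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test


# Stand-in data: 100 made-up labelled lines.
lines = [f"__label__greet example {i}" for i in range(100)]
train, valid, test = split_lines(lines)
print(len(train), len(valid), len(test))
```

Write each list to its own file, tune options such as -epoch and -lr against the validation file, and run the test file only once at the end.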
Use separate fastText testData.txt to measure classifier quality
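Test the generated model using test data. The test data has the same format as the training data. Note that the line __label__greet Good luck below also appears in trainingData.txt; it is kept here only to demonstrate the command, and a real test file should contain only examples that were not used for training.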
__label__greet Good Night
__label__greet Good luck
__label__question What is your name ?
Run the following command in the terminal.
$ ./fasttext test supervised_classifier_model.bin testData.txt
N 3
P@1 0.667
R@1 0.667
Number of examples: 3
Precision is at 0.667 (66.7%) and Recall is at 0.667 (66.7%).
Read the fastText test output: N, P@1 and R@1
The test command prints compact evaluation metrics. In the output above, N is the number of test examples. Here, N = 3.
P@1 means precision at one prediction. It checks whether the top predicted label is correct. A value of 0.667 means that 2 out of 3 test examples were classified correctly as the first prediction.
Working:
- Total test examples = 3
- Correct top predictions = 2
- P@1 = 2 / 3 = 0.666..., which is shown as 0.667
- Percentage precision = 0.667 × 100 = 66.7%
R@1 means recall at one prediction. In this test file, every line has only one true label, so R@1 is also 2 / 3 = 0.667. If a test line has multiple labels, recall depends on how many of those true labels are recovered in the top predictions.
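The arithmetic above can be reproduced with a short Python sketch. It assumes precision is the number of correct predictions divided by the number of predictions made, and recall is the number of correct predictions divided by the number of true labels. The function precision_recall_at_k is a hypothetical helper, and the predicted labels are made up to match the 2-out-of-3 result, since the tutorial output does not show which test line was misclassified.

```python
def precision_recall_at_k(true_labels, predicted_labels, k=1):
    """true_labels: list of sets; predicted_labels: list of ranked label lists."""
    correct = 0
    n_predicted = 0
    n_true = 0
    for gold, ranked in zip(true_labels, predicted_labels):
        top_k = ranked[:k]
        correct += sum(1 for label in top_k if label in gold)
        n_predicted += len(top_k)
        n_true += len(gold)
    return correct / n_predicted, correct / n_true


# True labels of the three lines in testData.txt.
gold = [{"greet"}, {"greet"}, {"question"}]
# Hypothetical top predictions: the third one is wrong.
predicted = [["greet"], ["greet"], ["greet"]]
p, r = precision_recall_at_k(gold, predicted, k=1)
print(f"P@1 {p:.3f}  R@1 {r:.3f}")

# A multi-label line: one top prediction is correct, but only one of two
# true labels is recovered, so precision and recall differ.
p_multi, r_multi = precision_recall_at_k([{"wish", "question"}], [["question"]], k=1)
print(f"P@1 {p_multi:.3f}  R@1 {r_multi:.3f}")
```

With single-label test lines the two denominators are equal, which is why P@1 and R@1 match in this tutorial.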
Predict labels for new text using the trained fastText model
After training and testing, use the model to predict labels for new text. The following command sends two text lines to the trained model. The exact prediction depends on the training data and the options used while training.
printf "Good Morning\nWhat is your name ?\n" | ./fasttext predict supervised_classifier_model.bin -
To see confidence scores along with predicted labels, use predict-prob. The final number in the command below asks fastText to return up to two labels for each input line.
printf "Good Morning\nWhat is your name ?\n" | ./fasttext predict-prob supervised_classifier_model.bin - 2
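If you want to post-process the predict-prob output in a script, each output line can be split into label and probability pairs. The sketch below assumes that a line consists of alternating label and probability tokens separated by spaces. The helper name parse_predict_prob_line is hypothetical, and the example line uses made-up values, not real model output.

```python
def parse_predict_prob_line(line, prefix="__label__"):
    """Turn 'label prob label prob ...' into a list of (label, probability) pairs."""
    tokens = line.split()
    pairs = []
    # Even positions hold labels, odd positions hold probabilities.
    for label, prob in zip(tokens[0::2], tokens[1::2]):
        name = label[len(prefix):] if label.startswith(prefix) else label
        pairs.append((name, float(prob)))
    return pairs


# Made-up line in the assumed format, for illustration only.
example = "__label__greet 0.75 __label__question 0.25"
print(parse_predict_prob_line(example))
```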
Improve a fastText supervised text classifier before using it
The sample dataset in this tutorial is too small for a dependable NLP model. Use it to understand the workflow, not to judge fastText performance. For real classification work, improve the dataset and evaluation process first.
- Add enough labelled examples for every class. A classifier cannot learn a label well from one or two examples.
- Keep train, validation and test files separate. Do not evaluate on the same examples used for training.
- Balance labels where possible. If one label has thousands of examples and another has very few, the model may prefer the larger label.
- Use phrase information when needed. Options such as -wordNgrams 2 can help when short phrases carry meaning.
- Check mistakes manually. Look at misclassified examples and add clearer training data instead of only changing parameters.
fastText supervised classifier troubleshooting for training and testing
If the model does not train or the test output looks wrong, check these common causes.
| Problem | Likely cause | Fix |
|---|---|---|
| Number of labels: 0 | Labels are missing or do not use the expected prefix. | Write labels as __label__name at the start of each line. |
| Very high accuracy on a tiny test file | Test data may be too small or copied from training data. | Create a larger test file with examples not used in training. |
| Predictions are always the same label | Training examples may be imbalanced or too few. | Add more examples for smaller labels and review text cleaning. |
| Model file is not found during testing | The test command is running from a different directory. | Use the correct path to supervised_classifier_model.bin. |
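The second row of the table, test data copied from training data, can be checked mechanically. The following Python sketch lists test lines that also occur in the training lines. The function shared_examples is a hypothetical helper, and the training lines are an excerpt of the sample file from this tutorial.

```python
def shared_examples(train_lines, test_lines):
    """Return test lines that also occur in the training lines."""
    def normalise(line):
        # Collapse runs of whitespace so spacing differences do not hide a match.
        return " ".join(line.split())

    seen = {normalise(line) for line in train_lines if line.strip()}
    return [line for line in test_lines if normalise(line) in seen]


# Excerpt of trainingData.txt and the full testData.txt from this tutorial.
train_excerpt = [
    "__label__greet Good Morning",
    "__label__greet Good luck",
    "__label__question When did you come ?",
]
test_lines = [
    "__label__greet Good Night",
    "__label__greet Good luck",
    "__label__question What is your name ?",
]
overlap = shared_examples(train_excerpt, test_lines)
print(overlap)
```

For the sample files this reports the Good luck line, which appears in both files and would need to be replaced in a real evaluation.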
FAQ about fastText supervised text classification
What does __label__ mean in fastText training data?
__label__ is the default prefix fastText uses to identify class labels in supervised training data. For example, __label__question tells fastText that the text on that line belongs to the question label.
Can one fastText training line have more than one label?
Yes. A line can contain multiple labels, such as __label__wish __label__question Good Morning. Did you have break-fast ?. When you use multi-label data, evaluate with a suitable number of predictions because one top prediction may not cover all true labels.
Why are P@1 and R@1 equal in this fastText example?
They are equal here because each test example has one true label. With 3 test examples and 2 correct top predictions, both values are 2 / 3 = 0.667. In multi-label testing, precision and recall can differ.
Should I test a fastText classifier on the training file?
No. Testing on the training file usually gives an unreliable view of model quality. Keep a separate test file containing examples that were not used to train the model.
Can fastText supervised classification be used from Python?
Yes. fastText also provides a Python module for training, loading, testing, and predicting with supervised models. See the official Python module page at fasttext.cc/docs/en/python-module.html if you prefer a Python workflow.
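As a rough sketch of the same workflow in Python, assuming the fasttext package is installed and that trainingData.txt and testData.txt exist as described in this tutorial: the wrapper function train_and_evaluate is hypothetical, and the option values simply mirror the command line example with -epoch 25 -lr 0.5 -wordNgrams 2.

```python
def train_and_evaluate(train_path, test_path, model_path):
    """Train a supervised model, print test metrics, and save the model."""
    # Imported here so the function can be defined without fastText installed.
    import fasttext

    # Train with the same options as the tuned command line example.
    model = fasttext.train_supervised(
        input=train_path, epoch=25, lr=0.5, wordNgrams=2
    )

    # test() returns the number of examples, precision and recall (at k=1).
    n, precision, recall = model.test(test_path)
    print(f"N {n}  P@1 {precision:.3f}  R@1 {recall:.3f}")

    # predict() returns the top-k labels and their probabilities.
    labels, probabilities = model.predict("Good Morning", k=2)
    print(labels, probabilities)

    model.save_model(model_path)
    return model
```

A call such as train_and_evaluate("trainingData.txt", "testData.txt", "supervised_classifier_model.bin") would run the whole workflow. The numbers it prints depend on the data and options, so they may differ from the command line output shown earlier.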
Editorial QA checklist for this fastText classifier workflow
- Confirm that every training and test line starts with at least one __label__ value.
- Confirm that trainingData.txt and testData.txt are different files.
- Confirm that command examples are run from the directory where the fasttext executable is available, or use the correct executable path.
- Confirm that the reported P@1 and R@1 values are explained with the number of test examples.
- Confirm that the tutorial does not imply a production-quality model from the tiny sample dataset.
FastText supervised classifier tutorial summary
In this fastText tutorial, we learnt to train a supervised text classifier from labelled training examples and generate a model. We then tested the model on a separate file to evaluate its precision and recall.
The key steps are: prepare labelled lines using the __label__ prefix, train the model with ./fasttext supervised, test the generated .bin model using a separate test file, and read P@1 and R@1 from the test output.
TutorialKart.com