How to Fine Tune BERT for Text Classification using Transformers in Python

Learn how to use the Hugging Face Transformers library to fine-tune BERT and other transformer models for text classification in Python.
10 min read · Updated Dec 2020 · Machine Learning · Natural Language Processing


Transformer models have shown remarkable results on most natural language processing tasks. Transfer learning combined with large-scale transformer language models has become the standard in state-of-the-art NLP.

One of the biggest milestones in the evolution of NLP was the release of Google's BERT model in late 2018, which is widely regarded as the beginning of a new era in NLP.

In this tutorial, we will take you through an example of fine-tuning BERT (as well as other transformer models) for text classification using the Hugging Face Transformers library on the dataset of your choice.

We'll be using the 20 newsgroups dataset as a demo for this tutorial. It contains about 18,000 news posts on 20 different topics.

To get started, let's install the Hugging Face Transformers library along with the other dependencies:

pip3 install transformers numpy torch scikit-learn

Open up a new notebook/Python file and import the necessary modules:

import torch
from transformers import is_tf_available, is_torch_available
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import numpy as np
import random
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

Next, let's make a function that sets a seed so we get the same results across different runs:

def set_seed(seed: int):
    """
    Helper function for reproducible behavior to set the seed in ``random``, ``numpy``, ``torch`` and/or ``tf`` (if
    installed).

    Args:
        seed (:obj:`int`): The seed to set.
    """
    random.seed(seed)
    np.random.seed(seed)
    if is_torch_available():
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # ^^ safe to call this function even if cuda is not available
    if is_tf_available():
        import tensorflow as tf

        tf.random.set_seed(seed)

set_seed(1)
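As a quick sanity check of what seeding buys us, here is a minimal sketch that uses only Python's random module. The helper name draw_numbers is made up for this illustration and is not part of any library. It shows that re-seeding with the same value reproduces the same sequence of pseudo-random numbers:

```python
import random

def draw_numbers(seed: int, count: int = 3):
    """Seed Python's RNG, then draw `count` pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(count)]

first_run = draw_numbers(1)
second_run = draw_numbers(1)
other_seed = draw_numbers(2)

# same seed gives the same sequence, a different seed gives a different one
print(first_run == second_run)
print(first_run == other_seed)
```

Keep in mind that seeding makes runs repeatable on the same setup, but some GPU operations are nondeterministic, so small differences between runs can still appear.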

As mentioned earlier, we'll be using the BERT model. More specifically, we'll use the bert-base-uncased weights from the library.

We'll also use a max_length of 512:

# the model we are going to train: base uncased BERT
# check text classification models here: https://huggingface.co/models?filter=text-classification
model_name = "bert-base-uncased"
# max sequence length for each document/sentence sample
max_length = 512

max_length is the maximum length of our sequences. In other words, we keep only the first 512 tokens of each document or post. You can change this value, but note that 512 is the longest sequence bert-base-uncased accepts, so for this model you can only lower it. Longer sequences also need more memory, so make sure the value you pick fits in memory during training, lowering the batch size if necessary.

Loading the Dataset

Next, let's download and load the tokenizer responsible for converting our text to sequences of tokens:

# load the tokenizer
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)

We also set do_lower_case to True to make sure all the text is lowercased (remember, we're using the uncased model).

The code below downloads and loads the dataset:

def read_20newsgroups(test_size=0.2):
  # download & load 20newsgroups dataset from sklearn's repos
  dataset = fetch_20newsgroups(subset="all", shuffle=True, remove=("headers", "footers", "quotes"))
  documents = dataset.data
  labels = dataset.target
  # split into training & testing and return the data as well as the label names
  return train_test_split(documents, labels, test_size=test_size), dataset.target_names
  
# call the function
(train_texts, valid_texts, train_labels, valid_labels), target_names = read_20newsgroups()

train_texts and valid_texts are lists of documents (lists of strings) for the training and validation sets respectively. Likewise, train_labels and valid_labels are arrays of integer labels ranging from 0 to 19. target_names is a list of the names of our 20 labels.

Now let's use our tokenizer to encode our corpus:

# tokenize the dataset, truncating sequences longer than `max_length`
# and padding shorter ones with 0's
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)

We set truncation to True to drop the tokens that go beyond max_length. We also set padding to True so that shorter documents are padded with padding tokens up to the length of the longest sequence, which is at most max_length.

The code below wraps our tokenized text data into a torch Dataset:
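To make truncation and padding concrete, here is a toy sketch in plain Python. It is not the tokenizer's actual implementation: the function name pad_and_truncate and the token ids are made up for illustration, and unlike the real tokenizer it does not keep the special [SEP] token at the end of a truncated sequence.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Toy imitation of truncation=True, padding=True on lists of token ids."""
    # truncation: keep at most `max_length` ids per sequence
    truncated = [ids[:max_length] for ids in batch]
    # padding=True pads up to the longest sequence left in the batch
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # the attention mask marks real tokens with 1 and padding with 0
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# two made-up sequences of token ids, one short and one too long
toy_batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 2936, 6251, 102]]
encoded = pad_and_truncate(toy_batch, max_length=5)
print(encoded["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 2023, 2003, 1037, 2936]]
print(encoded["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The attention mask is what lets the model ignore the padded positions, which is why the tokenizer returns it alongside the input ids.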

class NewsGroupsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.labels)

# convert our tokenized data into a torch Dataset
train_dataset = NewsGroupsDataset(train_encodings, train_labels)
valid_dataset = NewsGroupsDataset(valid_encodings, valid_labels)

We're going to use the Trainer from the Transformers library, which expects our dataset to be a torch.utils.data.Dataset. That's why we made a simple class that implements the __len__() method, which returns the number of samples, and the __getitem__() method, which returns the data sample at a given index.

Training the Model

Now that we have our data prepared, let's download and load our BERT model and its pre-trained weights:

# load the model and pass to CUDA
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=len(target_names)).to("cuda")

We're using the BertForSequenceClassification class from the Transformers library, and we set num_labels to the number of available labels, in this case 20.

We also move our model to the CUDA GPU. If you're on a CPU (not recommended), just remove the to() call.

Before we start fine-tuning our model, let's make a simple function to compute the metrics we want. In this case, accuracy:

from sklearn.metrics import accuracy_score

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  # calculate accuracy using sklearn's function
  acc = accuracy_score(labels, preds)
  return {
      'accuracy': acc,
  }

You're free to include any metric you want. I've included accuracy, but you can add precision, recall, and so on.

The code below uses the TrainingArguments class to specify our training arguments, such as the number of epochs, the batch size, and some other parameters:
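As a sketch of what adding more metrics could look like, the variant below also reports precision, recall, and F1 using scikit-learn's precision_recall_fscore_support. The function name compute_metrics_with_prf, the choice of macro averaging, and zero_division=0 are assumptions made for this illustration. The SimpleNamespace object only stands in for the prediction object that the Trainer passes to the callback.

```python
from types import SimpleNamespace

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics_with_prf(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # macro averaging gives every class the same weight (an assumption here);
    # zero_division=0 scores classes that were never predicted as 0
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# stand-in for the Trainer's prediction object: 4 samples, 3 classes
fake_pred = SimpleNamespace(
    label_ids=np.array([0, 1, 2, 1]),
    predictions=np.array([
        [0.9, 0.05, 0.05],
        [0.1, 0.8, 0.1],
        [0.2, 0.2, 0.6],
        [0.7, 0.2, 0.1],
    ]),
)
metrics = compute_metrics_with_prf(fake_pred)
print(metrics)
```

With 20 fairly balanced classes, macro averaging is a reasonable default. For heavily imbalanced data, average="weighted" may be a better fit.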

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    load_best_model_at_end=True,     # load the best model when finished training (default metric is loss)
    # but you can specify `metric_for_best_model` argument to change to accuracy or other metric
    logging_steps=200,               # log & save weights each logging_steps
    evaluation_strategy="steps",     # evaluate each `logging_steps`
)

Each argument is explained in the code comments. I've set the training batch size to 16 because that's the largest I could fit in the memory of a Google Colab environment.

You can also tweak other parameters, such as increasing the number of epochs to train for longer.

We then pass our training arguments, datasets, and the compute_metrics callback to our Trainer:

trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=valid_dataset,          # evaluation dataset
    compute_metrics=compute_metrics,     # the callback that computes metrics of interest
)

Training the model:

# train the model
trainer.train()

This will take anywhere from several minutes to a few hours depending on your environment. Here's my output on Google Colab:

[######################] [2829/2829 58:39, Epoch 3/3]
Step	Training Loss	Validation Loss	Accuracy
200 	2.799619	    2.147746	    0.475066
400 	1.660876	    1.215588	    0.648011
600 	1.204610	    1.035250	    0.706101
800 	1.053862	    0.946825	    0.717507
1000	0.963572	    0.894024	    0.729973
1200	0.765880	    0.860701	    0.746419
1400	0.743791	    0.831061	    0.751989
1600	0.710643	    0.808310	    0.756233
1800	0.675188	    0.814872	    0.760477
2000	0.542912	    0.819911	    0.768700
2200	0.425509	    0.801369	    0.768435
2400	0.401201	    0.834266	    0.771883
2600	0.402379	    0.811787	    0.773210
2800	0.393575	    0.800710	    0.775862
TrainOutput(global_step=2829, training_loss=0.9052972534007089)

As you can see, the validation loss drops quickly at first and then levels off at around 0.80 from step 1600 onward, while the accuracy climbs to over 77.5%.

Remember that we set load_best_model_at_end to True, so the best-performing model is loaded automatically when training finishes. Let's confirm this with the evaluate() method:
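The 2829 total steps shown in the progress bar can be reproduced with a little arithmetic. The sketch below assumes that the "all" subset holds 18,846 documents (the figure given in the scikit-learn documentation; the text above only says "about 18,000") and that training runs on a single device.

```python
import math

total_documents = 18846  # assumption: size of the "all" subset per scikit-learn docs
test_size = 0.2          # default of read_20newsgroups()
batch_size = 16          # per_device_train_batch_size, single device assumed
num_epochs = 3           # num_train_epochs

# train_test_split rounds the validation share up
valid_examples = math.ceil(total_documents * test_size)
train_examples = total_documents - valid_examples
# the last, partially filled batch still counts as one step
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(train_examples, valid_examples)  # 15076 3770
print(steps_per_epoch, total_steps)    # 943 2829
```

The same numbers show that warmup_steps=500 covers a bit more than half of the first epoch.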

# evaluate the current model after training
trainer.evaluate()

This will take several seconds to output something like this:

{'epoch': 3.0,
 'eval_accuracy': 0.7758620689655172,
 'eval_loss': 0.80070960521698}
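This eval_loss matches the lowest validation loss in the training log, which is what we would expect if the best checkpoint was selected by loss. The sketch below is not how the Trainer does it internally. It simply picks the evaluation step with the lowest loss from the values copied out of the log:

```python
# (step, validation loss) pairs copied from the training log
validation_log = [
    (200, 2.147746), (400, 1.215588), (600, 1.035250), (800, 0.946825),
    (1000, 0.894024), (1200, 0.860701), (1400, 0.831061), (1600, 0.808310),
    (1800, 0.814872), (2000, 0.819911), (2200, 0.801369), (2400, 0.834266),
    (2600, 0.811787), (2800, 0.800710),
]

# the default criterion is the evaluation loss, so lower is better
best_step, best_loss = min(validation_log, key=lambda entry: entry[1])
print(best_step, best_loss)  # 2800 0.80071
```

If you would rather select by accuracy, the metric_for_best_model argument mentioned in the training arguments is the place to change it.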

Now that we've trained our model, let's save it:

# saving the fine tuned model & tokenizer
model_path = "20newsgroups-bert-base-uncased"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

Performing Inference

Now that we have a model trained on our dataset, let's have some fun with it!

The function below takes a text string, tokenizes it with our tokenizer, computes the output probabilities using the softmax function, and returns the predicted label:

def get_prediction(text):
    # prepare our text into tokenized sequence
    inputs = tokenizer(text, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to("cuda")
    # perform inference to our model
    outputs = model(**inputs)
    # get output probabilities by doing softmax
    probs = outputs[0].softmax(1)
    # executing argmax function to get the candidate label
    return target_names[probs.argmax()]
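To see what the softmax and argmax steps do, here is a small pure-Python sketch on made-up logits. The three label names are real 20 newsgroups categories, but the logit values and the variable names are invented for this illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    # subtracting the maximum keeps exp() numerically stable
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_labels = ["comp.sys.mac.hardware", "sci.med", "sci.space"]
toy_logits = [0.5, 1.0, 3.0]  # made-up model outputs, one score per label

probs = softmax(toy_logits)
# argmax: index of the highest probability
best_index = max(range(len(probs)), key=probs.__getitem__)
predicted = toy_labels[best_index]
print(predicted)  # sci.space
```

Softmax does not change which score is the largest, so it is only needed when you want the probabilities themselves, for example to report how confident the model is.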

Here's an example:

# Example #1
text = """
The first thing is first. 
If you purchase a Macbook, you should not encounter performance issues that will prevent you from learning to code efficiently.
However, in the off chance that you have to deal with a slow computer, you will need to make some adjustments. 
Having too many background apps running in the background is one of the most common causes. 
The same can be said about a lack of drive storage. 
For that, it helps if you uninstall xcode and other unnecessary applications, as well as temporary system junk like caches and old backups.
"""
print(get_prediction(text))

Output:

comp.sys.mac.hardware

As expected, we're talking about Macbooks. Here's a second example:

# Example #2
text = """
A black hole is a place in space where gravity pulls so much that even light can not get out. 
The gravity is so strong because matter has been squeezed into a tiny space. This can happen when a star is dying.
Because no light can get out, people can't see black holes. 
They are invisible. Space telescopes with special tools can help find black holes. 
The special tools can see how stars that are very close to black holes act differently than other stars.
"""
print(get_prediction(text))

Output:

sci.space

This is a label of science -> space, as expected!

Yet another example:

# Example #3
text = """
Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus.
Most people infected with the COVID-19 virus will experience mild to moderate respiratory illness and recover without requiring special treatment.  
Older people, and those with underlying medical problems like cardiovascular disease, diabetes, chronic respiratory disease, and cancer are more likely to develop serious illness.
"""
print(get_prediction(text))

Output:

sci.med

Conclusion

In this tutorial, you've learned how to fine-tune a BERT model using the Hugging Face Transformers library on your own dataset.

Note that you can also use other transformer models, such as GPT-2 with GPT2ForSequenceClassification, RoBERTa with RobertaForSequenceClassification, DistilBERT with DistilBertForSequenceClassification, and many more. Please head to the official documentation for the list of available models.

Also, if your dataset is in a language other than English, make sure you pick weights pretrained on your language, as this will help a lot during training. Browse the Hugging Face model hub (https://huggingface.co/models) and use the filter to find the model weights you need.

Learn also: How to Perform Text Summarization using Transformers in Python.
