How to Perform Text Classification in Python using Tensorflow 2 and Keras

Building deep learning models (using embedding and recurrent layers) for different text classification problems such as sentiment analysis or 20 news group classification using Tensorflow and Keras in Python
  · 16 min read · Updated sep 2020 · Machine Learning · Natural Language Processing


Text classification is one of the important and common tasks in supervised machine learning. It is about assigning a category (a class) to documents, articles, books, reviews, tweets or anything that involves text. It is a core task in natural language processing.

Many applications appeared to use text classification as the main task, examples include spam filtering, sentiment analysis, speech tagging, language detection, and many more.

In this tutorial, we will build a text classifier model using Tensorflow in Python, we will be using IMDB reviews dataset which has 50K real world movie reviews along with their sentiment (positive or negative). In the end of this tutorial, I will show you how you can integrate your own dataset so you can train the model on it.

Although we're using sentiment analysis dataset, this tutorial is intended to perform text classification on any task, if you wish to perform sentiment analysis out of the box, check this tutorial.

To get started, you need to install the following libraries:

pip3 install tqdm numpy tensorflow==2.0.0 sklearn

Now open up a new Python notebook or file and follow along, let's import our necessary modules:

from tqdm import tqdm
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Dropout, LSTM, Embedding, Bidirectional
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import TensorBoard
from sklearn.model_selection import train_test_split
import numpy as np

from glob import glob
import random
import os

Data Preparation

Now before we load our dataset into Python, you need to download the dataset here, you'll see two files there, reviews.txt which contains a movie review in each line, and labels.txt which holds its corresponding label.

The below function loads and preprocesses the dataset:

def load_imdb_data(num_words, sequence_length, test_size=0.25, oov_token=None):
    # read reviews
    reviews = []
    with open("data/reviews.txt") as f:
        for review in f:
            review = review.strip()
            reviews.append(review)

    labels = []
    with open("data/labels.txt") as f:
        for label in f:
            label = label.strip()
            labels.append(label)
    # tokenize the dataset corpus, delete uncommon words such as names, etc.
    tokenizer = Tokenizer(num_words=num_words, oov_token=oov_token)
    tokenizer.fit_on_texts(reviews)
    X = tokenizer.texts_to_sequences(reviews)
    X, y = np.array(X), np.array(labels)
    # pad sequences with 0's
    X = pad_sequences(X, maxlen=sequence_length)
    # convert labels to one-hot encoded
    y = to_categorical(y)
    # split data to training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=1)
    data = {}
    data["X_train"] = X_train
    data["X_test"]= X_test
    data["y_train"] = y_train
    data["y_test"] = y_test
    data["tokenizer"] = tokenizer
    data["int2label"] =  {0: "negative", 1: "positive"}
    data["label2int"] = {"negative": 0, "positive": 1}
    return data

A lot to cover here, this function does the following:

  • It loads the dataset from the files mentioned earlier.
  • After that, it uses Keras' utility Tokenizer class, which help us remove all punctuations automatically, tokenize the corpus, remove rare words such as names, convert text sentences into sequence of numbers (each word corresponds to a number).
  • We already know that neural networks expects a fixed length input, and since the reviews doesn't have the same length of words, we need a way to make the length of sequences as a fixed size. pad_sequences() function comes in to the rescue, we tell it we want only say 300 words in each review (maxlen parameter), it will remove the words that exceeds that number, and it'll pad with 0's to the reviews that are below 300.
  • We use Keras' to_categorical() function to one-hot encode the labels, this is a binary classification, so it'll convert the label 0 to [1, 0] vector, and 1 to [0, 1]. But in general, it converts categorical labels to a fixed length vector.
  • After that, we split our dataset into training set and testing set using sklearn's train_test_split() function and use the data dictionary to add all the things we need in the training process, which are the dataset, the tokenizer and the label encoding dictionary.

Building the Model

Now that we know how to load the dataset, let's build our model.

We are going to use an embedding layer as the first layer of the model. Embedding proved to be useful in mapping categorical variables (words, in this case) to a vector of continuous numbers, it is widely used in natural language processing tasks.

More preciselly, we are going to use pre-trained GloVe word vectors, which are pre-trained vectors that map each word to a vector of a specific size. This size parameter is often called embedding size, although GloVe uses 50, 100, 200 or 300 embedding size vectors. In this tutorial, we will try all of them and see which performs best. Also, two words with the same meaning tend to have very close vectors.

The second layer will be a recurrent layer, you'll have the choice to choose any recurrent cell you want, including LSTM, GRU or even just SimpleRNN and again, we'll see which one outperforms the others.

The last layer should be a dense layer with N neurons, N should be the same number of categories of your dataset. In the case of positive/negative sentiment analysis, it should be 2.

The general architecture of the model is shown in the following figure (grabbed from spam classifier tutorial):

The general architecture of the text classification model

Now you need to download the pre-trained GloVe (download here), after you done that, extract all of them in the data folder (you'll find different vectors for different embedding size), the below function loads these vectors:

def get_embedding_vectors(word_index, embedding_size=100):
    embedding_matrix = np.zeros((len(word_index) + 1, embedding_size))
    with open(f"data/glove.6B.{embedding_size}d.txt", encoding="utf8") as f:
        for line in tqdm(f, "Reading GloVe"):
            values = line.split()
            # get the word as the first word in the line
            word = values[0]
            if word in word_index:
                idx = word_index[word]
                # get the vectors as the remaining values in the line
                embedding_matrix[idx] = np.array(values[1:], dtype="float32")
    return embedding_matrix

Now we gonna need a function that creates the model from scratch, given the hyper parameters:

def create_model(word_index, units=128, n_layers=1, cell=LSTM, bidirectional=False,
                embedding_size=100, sequence_length=100, dropout=0.3, 
                loss="categorical_crossentropy", optimizer="adam", 
                output_length=2):
    """Constructs a RNN model given its parameters"""
    embedding_matrix = get_embedding_vectors(word_index, embedding_size)
    model = Sequential()
    # add the embedding layer
    model.add(Embedding(len(word_index) + 1,
              embedding_size,
              weights=[embedding_matrix],
              trainable=False,
              input_length=sequence_length))
    for i in range(n_layers):
        if i == n_layers - 1:
            # last layer
            if bidirectional:
                model.add(Bidirectional(cell(units, return_sequences=False)))
            else:
                model.add(cell(units, return_sequences=False))
        else:
            # first layer or hidden layers
            if bidirectional:
                model.add(Bidirectional(cell(units, return_sequences=True)))
            else:
                model.add(cell(units, return_sequences=True))
        model.add(Dropout(dropout))
    model.add(Dense(output_length, activation="softmax"))
    # compile the model
    model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
    return model

I know, there are a lot of parameters in this function. Well, in order to test various parameters, this function will be flexible to all parameters provided. Let's explain them:

  • word_index: This is a dictionary that maps each word to its corresponding index number, this is produced by the previously mentioned Tokenizer object.
  • units: This is the number of neurons in each recurrent layer, it defaults to 128, but use any number you want, just be aware that the more units, the more weights to adjust, and therefore, the more slow it'll be in the training process.
  • n_layers: This is the number of recurrent layers we want to use, 1 is a good one to start with.
  • cell: The recurrent cell you want to use, LSTM is a good choice.
  • bidirectional: This is a boolean variable that indicates whether we use bidirectional recurrent layers
  • embedding_size: The size of our embedding vector we mentioned earlier, we will experiment with various sizes.
  • sequence_length: The number of tokenized words on each text sample to feed into the neural networks, we will experiment with this parameter too.
  • dropout: it is the probability of training a given node on the layer, it is useful for reducing overfitting. 40% is pretty good for this, but try to tweak it and see if it performs better.
  • loss: It's the loss function to use for the training, by default, we're using the categorical cross entropy function.
  • optimizer: The optimizer function to use, we're using ADAM here.
  • output_length: This is the number of neurons to use in the last layer, since we're using only positive and negative sentiment classification, it must be 2.

When you look closely, you'll notice that I'm using the Embedding class with weights parameter, it is specifying the pre-trained weights we just downloaded, we're also setting trainable to False, so these vectors won't change at all during the training process.

If your dataset is in different language than english, make sure you find embedding vectors for the language you're using, if not, you shouldn't set weights parameter at all and you need to set trainable to True, so you'll train the vectors parameters from scratch, check this page for word vectors of your language.

Training the Model

Now to start training, we need to define all of the previously mentioned hyper parameters and more:

# max number of words in each sentence
SEQUENCE_LENGTH = 300
# N-Dimensional GloVe embedding vectors
EMBEDDING_SIZE = 300
# number of words to use, discarding the rest
N_WORDS = 10000
# out of vocabulary token
OOV_TOKEN = None
# 30% testing set, 70% training set
TEST_SIZE = 0.3
# number of CELL layers
N_LAYERS = 1
# the RNN cell to use, LSTM in this case
RNN_CELL = LSTM
# whether it's a bidirectional RNN
IS_BIDIRECTIONAL = False
# number of units (RNN_CELL ,nodes) in each layer
UNITS = 128
# dropout rate
DROPOUT = 0.4
### Training parameters
LOSS = "categorical_crossentropy"
OPTIMIZER = "adam"
BATCH_SIZE = 64
EPOCHS = 6

def get_model_name(dataset_name):
    # construct the unique model name
    model_name = f"{dataset_name}-{RNN_CELL.__name__}-seq-{SEQUENCE_LENGTH}-em-{EMBEDDING_SIZE}-w-{N_WORDS}-layers-{N_LAYERS}-units-{UNITS}-opt-{OPTIMIZER}-BS-{BATCH_SIZE}-d-{DROPOUT}"
    if IS_BIDIRECTIONAL:
        # add 'bid' str if bidirectional
        model_name = "bid-" + model_name
    if OOV_TOKEN:
        # add 'oov' str if OOV token is specified
        model_name += "-oov"
    return model_name

I've set the optimal parameters so far that I've found, get_model_name() function is producing a unique model name based on parameters, this is useful when it comes to comparing various parameters on TensorBoard.

Let's bring everything together and start training our model:

# create these folders if they does not exist
if not os.path.isdir("results"):
    os.mkdir("results")
if not os.path.isdir("logs"):
    os.mkdir("logs")
if not os.path.isdir("data"):
    os.mkdir("data")
# dataset name, IMDB movie reviews dataset
dataset_name = "imdb"
# get the unique model name based on hyper parameters on parameters.py
model_name = get_model_name(dataset_name)
# load the data
data = load_imdb_data(N_WORDS, SEQUENCE_LENGTH, TEST_SIZE, oov_token=OOV_TOKEN)
# construct the model
model = create_model(data["tokenizer"].word_index, units=UNITS, n_layers=N_LAYERS, 
                    cell=RNN_CELL, bidirectional=IS_BIDIRECTIONAL, embedding_size=EMBEDDING_SIZE, 
                    sequence_length=SEQUENCE_LENGTH, dropout=DROPOUT, 
                    loss=LOSS, optimizer=OPTIMIZER, output_length=data["y_train"][0].shape[0])
model.summary()
# using tensorboard on 'logs' folder
tensorboard = TensorBoard(log_dir=os.path.join("logs", model_name))
# start training
history = model.fit(data["X_train"], data["y_train"],
                    batch_size=BATCH_SIZE,
                    epochs=EPOCHS,
                    validation_data=(data["X_test"], data["y_test"]),
                    callbacks=[tensorboard],
                    verbose=1)
# save the resulting model into 'results' folder
model.save(os.path.join("results", model_name) + ".h5")

This will take several minutes to train, here is my execution output after the training is finished:

Reading GloVe: 400000it [00:17, 23047.55it/s]
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 300, 300)          37267200  
_________________________________________________________________
lstm (LSTM)                  (None, 128)               219648    
_________________________________________________________________
dropout (Dropout)            (None, 128)               0         
_________________________________________________________________
dense (Dense)                (None, 2)                 258       
=================================================================
Total params: 37,487,106
Trainable params: 219,906
Non-trainable params: 37,267,200
_________________________________________________________________
Train on 35000 samples, validate on 15000 samples
Epoch 1/6
35000/35000 [==============================] - 186s 5ms/sample - loss: 0.4359 - accuracy: 0.7919 - val_loss: 0.2912 - val_accuracy: 0.8788
Epoch 2/6
35000/35000 [==============================] - 179s 5ms/sample - loss: 0.2857 - accuracy: 0.8820 - val_loss: 0.2608 - val_accuracy: 0.8919
Epoch 3/6
35000/35000 [==============================] - 175s 5ms/sample - loss: 0.2501 - accuracy: 0.8985 - val_loss: 0.2472 - val_accuracy: 0.8977
Epoch 4/6
35000/35000 [==============================] - 174s 5ms/sample - loss: 0.2184 - accuracy: 0.9129 - val_loss: 0.2525 - val_accuracy: 0.8997
Epoch 5/6
35000/35000 [==============================] - 185s 5ms/sample - loss: 0.1918 - accuracy: 0.9246 - val_loss: 0.2576 - val_accuracy: 0.9035
Epoch 6/6
35000/35000 [==============================] - 188s 5ms/sample - loss: 0.1598 - accuracy: 0.9391 - val_loss: 0.2494 - val_accuracy: 0.9004

Awesome, it reached about 90% accuracy after 6 epochs of training.

Testing the Model

Using the model is pretty straightforward, the below function uses model.predict() method to produce the output:

def get_predictions(text):
    sequence = data["tokenizer"].texts_to_sequences([text])
    # pad the sequences
    sequence = pad_sequences(sequence, maxlen=SEQUENCE_LENGTH)
    # get the prediction
    prediction = model.predict(sequence)[0]
    return prediction, data["int2label"][np.argmax(prediction)]

So as you can see, in order to properly produce predictions, we need to use our previously used tokenizer to convert the text into sequences, after that, we pad sequences so it's fixed length sequence, and then we produce the output using model.predict() method, let's play around with this model:

text = "The movie is awesome!"
output_vector, prediction = get_predictions(text)
print("Output vector:", output_vector)
print("Prediction:", prediction)

Output:

Output vector: [0.3001343  0.69986564]
Prediction: positive

Let's use another text:

text = "The movie is bad."
output_vector, prediction = get_predictions(text)
print("Output vector:", output_vector)
print("Prediction:", prediction)

Output:

Output vector: [0.92491007 0.07508987]
Prediction: negative

It is pretty sure that it's a negative sentiment with about 92% confidence. Let's be more challenging:

text = "Not very good, but pretty good try."
output_vector, prediction = get_predictions(text)
print("Output vector:", output_vector)
print("Prediction:", prediction)

Output:

Output vector: [0.38528103 0.61471903]
Prediction: positive

It is pretty 61% sure that's a good sentiment, as you can see, it's giving interesting results, spend some time tricking the model !

Hyperparameter Tuning

Before I came up with 90% accuracy, I have experimented with various hyper parameters, here are some of the interesting ones:

Loss comparison between different embedding sizes using Tensorboard

These are 4 models, each has different embedding size, as you can see, the one that has 300 length vector (each word got a 300 length vector) reached the lowest validation loss value.

Here is another one when I used the sequence length as the varying parameter:

Loss comparisons with different sequence lengths using tensorboard

The model which has 300 sequence length (the green one) tends to perform better.

Using tensorboard, you can see that after reaching epochs 4-5-6, the validation loss will try to increase again, that's clearly overfitting. That's why I set epochs to 6. try to tweak other parameters such as dropout rate and see if you can decrease it further more.

Integrating Custom Datasets

Since this is a text classificatation tutorial, it would be useful if you can use your own datasets without changing much of this tutorial's code. In fact, all you have to change is the loading data function, previously we used load_imdb_data() function, that returns data dictionary which has:

  • X_train: A numpy array that is of the shape (number of training samples, sequence length) which contain all the sequences of each data sample.
  • X_test: Same as above, but for testing samples.
  • y_train: These are the labels of the training set, it's a numpy array of the shape (number of testing samples, number of total categories), in the case of sentiment analysis, this should be something like (15000, 2)
  • y_test: Same as above, but for testing samples.
  • tokenizer: This is a Tokenizer instance from tensorflow.keras.preprocessing.text module, the object that used to tokenize the corpus.
  • label2int: A Python dictionary that converts a label to its corresponding encoded integer, in the sentiment analysis example, we used 1 for positive and 0 for negative.
  • int2label: Vice-versa of the above.

Here is an example function that loads the 20 news group dataset (contains around 18000 newsgroups posts on 20 topics), it uses sklearn's built-in function fetch_20newsgroups():

from sklearn.datasets import fetch_20newsgroups

def load_20_newsgroup_data(num_words, sequence_length, test_size=0.25, oov_token=None):
    # load the 20 news groups dataset
    # shuffling the data & removing each document's header, signature blocks and quotation blocks
    dataset = fetch_20newsgroups(subset="all", shuffle=True, remove=("headers", "footers", "quotes"))
    documents = dataset.data
    labels = dataset.target
    tokenizer = Tokenizer(num_words=num_words, oov_token=oov_token)
    tokenizer.fit_on_texts(documents)
    X = tokenizer.texts_to_sequences(documents)
    X, y = np.array(X), np.array(labels)
    # pad sequences with 0's
    X = pad_sequences(X, maxlen=sequence_length)
    # convert labels to one-hot encoded
    y = to_categorical(y)
    # split data to training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=1)
    data = {}
    data["X_train"] = X_train
    data["X_test"]= X_test
    data["y_train"] = y_train
    data["y_test"] = y_test
    data["tokenizer"] = tokenizer
    data["int2label"] = { i: label for i, label in enumerate(dataset.target_names) }
    data["label2int"] = { label: i for i, label in enumerate(dataset.target_names) }
    return data

Alright, good luck implementing your own text classifier, if you have any problems integrating one, post your comment below and i'll try to reach you as soon as possible.

As I have mentioned earlier, try to experiment with all the hyper parameters provided, I tried to write the code as flexible as possible so you can change only the parameters without doing anything else. If you outperformed my parameters, share them with us in the comments below !

Want to Learn More ?

Finally, if you want to expand your skills on machine learning, deep learning and natural language processing (or even if you're a beginner), I would suggest you take the following courses:

Related: How to Perform Text Summarization using Transformers in Python.

Happy Learning ♥

View Full Code
Sharing is caring!



Read Also





Comment panel