How to Build a Spam Classifier using Keras in Python

Abdou Rockikz · 17 aug 2019

Abdou Rockikz · 10 min read · Updated nov 2019 · Machine Learning · Natural Language Processing

Email spam or junk email is unsolicited, unavoidable and repetitive messages sent in email. Email spam has grown since the early 1990s , and by 2014, it was estimated that it made up around 90% of email messages sent.

Since we all have the problem of spam emails filling our inboxes, in this tutorial, we gonna build a model in Keras that can distinguish between spam and legitimate emails.

Table of content:

  1. Installing and Importing Dependencies
  2. Loading the Dataset
  3. Preparing the Dataset
  4. Building the Model
  5. Training the Model
  6. Evaluating the Model

1. Installing and Importing Dependencies

We first need to install some dependencies:

pip3 install keras sklearn tqdm numpy keras_metrics tensorflow==1.14.0

Now open up an interactive shell or a jupyter notebook and import:

import tqdm
import numpy as np
import keras_metrics # for recall and precision metrics
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.layers import Embedding, LSTM, Dropout, Dense
from keras.models import Sequential
from keras.utils import to_categorical
from keras.callbacks import ModelCheckpoint, TensorBoard
from sklearn.model_selection import train_test_split
import time
import numpy as np
import pickle

Let's define some hyper-parameters:

SEQUENCE_LENGTH = 100 # the length of all sequences (number of words per sample)
EMBEDDING_SIZE = 100  # Using 100-Dimensional GloVe embedding vectors
TEST_SIZE = 0.25 # ratio of testing set

BATCH_SIZE = 64
EPOCHS = 20 # number of epochs

# to convert labels to integers and vice-versa
label2int = {"ham": 0, "spam": 1}
int2label = {0: "ham", 1: "spam"}

Don't worry if you are not sure what these parameters mean, we'll talk about them later we construct our model.

2. Loading the Dataset

The dataset we gonna use is SMS Spam Collection Dataset, download, extract and put it in a folder called "data", let's define the function that loads it:

def load_data():
    """
    Loads SMS Spam Collection dataset
    """
    texts, labels = [], []
    with open("data/SMSSpamCollection") as f:
        for line in f:
            split = line.split()
            labels.append(split[0].strip())
            texts.append(' '.join(split[1:]).strip())
    return texts, labels

The dataset is in a single file, each line corresponds to a data sample, the first word is the label and the rest is the actual email content, that's why we are grabbing labels as split[0] and the content as split[1:].

Calling the function:

# load the data
X, y = load_data()

3. Preparing the Dataset

Now, we need a way to vectorize the text corpus by turning each text into a sequence of integers, you're now may be wondering why we need to turn the text into sequence of integers, well, remember we are going to feed the text into a neural network, a neural network only understands numbers. More precisely, a fixed length sequence of integers.

But before we do all of that, we need to clean this corpus by removing punctuations, lowercase all characters, etc. Luckily for us, Keras has a builtin class keras.preprocessing.text.Tokenizer() that does all that in few lines of code:

# Text tokenization
# vectorizing text, turning each text into sequence of integers
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
# convert to sequence of integers
X = tokenizer.texts_to_sequences(X)

Let's try to print the first sample:

In [4]: print(X[0])
[49, 472, 4436, 843, 756, 659, 64, 8, 1328, 87, 123, 352, 1329, 148, 2996, 1330, 67, 58, 4437, 144]

A bunch of numbers, each integer correspond to a word in the vocabulary, that's what the neural network needs anyway. However, the samples doesn't have the same length, we need a way to have a fixed length sequence.

As a result, we're using keras.preprocessing.sequence.pad_sequences() function that pad sequences at the beginning of each sequence with zeros:

# convert to numpy arrays
X = np.array(X)
y = np.array(y)
# pad sequences at the beginning of each sequence with 0's
# for example if SEQUENCE_LENGTH=4:
# [[5, 3, 2], [5, 1, 2, 3], [3, 4]]
# will be transformed to:
# [[0, 5, 3, 2], [5, 1, 2, 3], [0, 0, 3, 4]]
X = pad_sequences(X, maxlen=SEQUENCE_LENGTH)

As you may remember, we set SEQUENCE_LENGTH to 100, in this way, all sequences have a length of 100.

Now our labels are text also, but we gonna make a different approach here, since the labels are only "spam" and "ham", we need to one-hot encode them:

# One Hot encoding labels
# [spam, ham, spam, ham, ham] will be converted to:
# [1, 0, 1, 0, 1] and then to:
# [[0, 1], [1, 0], [0, 1], [1, 0], [0, 1]]

y = [ label2int[label] for label in y ]
y = to_categorical(y)

We used keras.utils.to_categorial() here, which does what its name suggests, let's try to print the first sample of the labels:

In [7]: print(y[0])
[1.0, 0.0]

That means the first sample is ham.

Next, let's shuffle and split training and testing data:

# split and shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=7)

4. Building the Model

Now we ready to build our model, the general architecture is as shown in the following image:

Network ArchitectureThe first layer is a pre-trained embedding layer that maps each word to a N-dimensional vector of real numbers ( the EMBEDDING_SIZE corresponds to the size of this vector, in this case 100). Two words that have similar meaning tend to have very close vectors.

The second layer is a recurrent neural network with LSTM units. Finally, the output layer is 2 neurons each corresponds to "spam" or "ham" with a softmax activation function.

Let's start by writing a function to load the pre-trained embedding vectors:

def get_embedding_vectors(tokenizer, dim=100):
    embedding_index = {}
    with open(f"data/glove.6B.{dim}d.txt", encoding='utf8') as f:
        for line in tqdm.tqdm(f, "Reading GloVe"):
            values = line.split()
            word = values[0]
            vectors = np.asarray(values[1:], dtype='float32')
            embedding_index[word] = vectors

    word_index = tokenizer.word_index
    embedding_matrix = np.zeros((len(word_index)+1, dim))
    for word, i in word_index.items():
        embedding_vector = embedding_index.get(word)
        if embedding_vector is not None:
            # words not found will be 0s
            embedding_matrix[i] = embedding_vector
            
    return embedding_matrix

Note: In order to run this function properly, you need to download GloVe, extract and put in "data" folder, we will use the 100-dimensional vectors here.

Let's define the function that builds the model:

def get_model(tokenizer, lstm_units):
    """
    Constructs the model,
    Embedding vectors => LSTM => 2 output Fully-Connected neurons with softmax activation
    """
    # get the GloVe embedding vectors
    embedding_matrix = get_embedding_vectors(tokenizer)
    model = Sequential()
    model.add(Embedding(len(tokenizer.word_index)+1,
              EMBEDDING_SIZE,
              weights=[embedding_matrix],
              trainable=False,
              input_length=SEQUENCE_LENGTH))

    model.add(LSTM(lstm_units, recurrent_dropout=0.2))
    model.add(Dropout(0.3))
    model.add(Dense(2, activation="softmax"))
    # compile as rmsprop optimizer
    # aswell as with recall metric
    model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
                  metrics=["accuracy", keras_metrics.precision(), keras_metrics.recall()])
    model.summary()
    return model

The above function constructs the whole model, we loaded the pre-trained embedding vectors to the Embedding layer, and set trainable=False, this will freeze the embedding weights during the training process.

After we add the RNN layer, we added a 30% dropout chance, this will freeze 30% of neurons in the previous layer in each iteration which will help us reduce overfitting.

Note that accuracy isn't enough for determining whether the model is doing great, that is because this dataset is unbalanced, only few samples are spam. As a result, we will use precision and recall metrics.

Let's call the function:

# constructs the model with 128 LSTM units
model = get_model(tokenizer=tokenizer, lstm_units=128)

5. Training the Model

We are almost there, we gonna need to train this model with the data we just loaded:

# initialize our ModelCheckpoint and TensorBoard callbacks
# model checkpoint for saving best weights
model_checkpoint = ModelCheckpoint("results/spam_classifier_{val_loss:.2f}", save_best_only=True,
                                    verbose=1)
# for better visualization
tensorboard = TensorBoard(f"logs/spam_classifier_{time.time()}")
# print our data shapes
print("X_train.shape:", X_train.shape)
print("X_test.shape:", X_test.shape)
print("y_train.shape:", y_train.shape)
print("y_test.shape:", y_test.shape)
# train the model
model.fit(X_train, y_train, validation_data=(X_test, y_test),
          batch_size=BATCH_SIZE, epochs=EPOCHS,
          callbacks=[tensorboard, model_checkpoint],
          verbose=1)

The training has started:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 100, 100)          901300
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               117248
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 258
=================================================================
Total params: 1,018,806
Trainable params: 117,506
Non-trainable params: 901,300
_________________________________________________________________
X_train.shape: (4180, 100)
X_test.shape: (1394, 100)
y_train.shape: (4180, 2)
y_test.shape: (1394, 2)
Train on 4180 samples, validate on 1394 samples
Epoch 1/20
4180/4180 [==============================] - 9s 2ms/step - loss: 0.1712 - acc: 0.9325 - precision: 0.9524 - recall: 0.9708 - val_loss: 0.1023 - val_acc: 0.9656 - val_precision: 0.9840 - val_recall: 0.9758

Epoch 00001: val_loss improved from inf to 0.10233, saving model to results/spam_classifier_0.10
Epoch 2/20
4180/4180 [==============================] - 8s 2ms/step - loss: 0.0976 - acc: 0.9675 - precision: 0.9765 - recall: 0.9862 - val_loss: 0.0809 - val_acc: 0.9720 - val_precision: 0.9793 - val_recall: 0.9883

The training is finished:

Epoch 20/20
4180/4180 [==============================] - 8s 2ms/step - loss: 0.0130 - acc: 0.9971 - precision: 0.9973 - recall: 0.9994 - val_loss: 0.0629 - val_acc: 0.9821 - val_precision: 0.9916 - val_recall: 0.9875

6. Evaluating the Model

Let's evaluate our model:

# get the loss and metrics
result = model.evaluate(X_test, y_test)
# extract those
loss = result[0]
accuracy = result[1]
precision = result[2]
recall = result[3]

print(f"[+] Accuracy: {accuracy*100:.2f}%")
print(f"[+] Precision:   {precision*100:.2f}%")
print(f"[+] Recall:   {recall*100:.2f}%")

Output:

1394/1394 [==============================] - 1s 569us/step
[+] Accuracy: 98.21%
[+] Precision:   99.16%
[+] Recall:   98.75%

Here are what each metric means:

  • Accuracy: Percentage of predictions that were correct.
  • Recall: Percentage of spam emails that were predicted correctly.
  • Precision: Percentage of emails classified as spam that were actually spam.

Great! let's test this out:

def get_predictions(text):
    sequence = tokenizer.texts_to_sequences([text])
    # pad the sequence
    sequence = pad_sequences(sequence, maxlen=SEQUENCE_LENGTH)
    # get the prediction
    prediction = model.predict(sequence)[0]
    # one-hot encoded vector, revert using np.argmax
    return int2label[np.argmax(prediction)]

Let's fake a spam email:

text = "Congratulations! you have won 100,000$ this week, click here to claim fast"
print(get_predictions(text))

Output:

spam

Okey, let's try to be legitimate:

text = "Hi man, I was wondering if we can meet tomorrow."
print(get_predictions(text))

Output:

ham

Awesome! This approach is the current state-of-the-art, try to tune the training and model parameters and see if you can improve it.

To see various metrics during training, we need to go to tensorboard by typing in cmd or terminal:

tensorboard --logdir="logs"

Go to the browser and type "localhost:6006" and go to various metrics, here is my result:

AccuracyPrecisionRecall

Here are some further readings:

Finally, I encourage you to check the full code.

Happy Coding ♥

View Full Code
Sharing is caring!


Read Also





Comment panel

   
Comment system is still in Beta, if you find any bug, please consider contacting us here.