How to Build a Text Generator using Keras in Python

Abdou Rockikz · 9 min read · Updated Oct 2019 · Machine Learning · Natural Language Processing

Recurrent Neural Networks (RNNs) are very powerful sequence models, commonly used for problems such as sequence classification. However, in this tutorial, we will use RNNs as generative models, which means they can learn the sequences of a problem and then generate entirely new sequences for the problem domain.

After reading this tutorial, you will know how to build an LSTM model that can generate text (character by character) using Keras in Python.

In text generation, we show the model many training examples so it can learn a pattern between the input and output. Each input is a sequence of characters and the output is the next single character. For instance, if we train on the sentence "python is great", the input is "python is grea" and the output is "t". We need to show the model as many examples as our memory can handle in order to make reasonable predictions.
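
Here is a minimal sketch of that sliding-window idea applied to the example sentence (the variable names are just for illustration):

sentence = "python is great"
# everything except the last character is the input
input_seq = sentence[:-1]   # "python is grea"
# the last character is the output the model should predict
next_char = sentence[-1]    # "t"
print(input_seq, "->", next_char)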

Getting Started

Let's install the required dependencies for this tutorial:

pip3 install tensorflow==1.13.1 keras numpy requests

Importing everything:

import numpy as np
import os
import pickle
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.callbacks import ModelCheckpoint
from string import punctuation

Preparing the Dataset

We are going to use a free downloadable book as the dataset: Alice’s Adventures in Wonderland by Lewis Carroll.

These lines of code will download it and save it in a text file:

import requests
content = requests.get("http://www.gutenberg.org/cache/epub/11/pg11.txt").text
open("data/wonderland.txt", "w", encoding="utf-8").write(content)

Just make sure you have a folder called "data" in your current directory.
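
If it doesn't exist yet, you can create it from Python (a small convenience snippet, not part of the original script):

import os
# create the "data" folder if it is missing
os.makedirs("data", exist_ok=True)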

Now let's try to clean this dataset:

# read the text of the book
text = open("data/wonderland.txt", encoding="utf-8").read()
# remove caps and replace two new lines with one new line
text = text.lower().replace("\n\n", "\n")
# remove all punctuations
text = text.translate(str.maketrans("", "", punctuation))

The above code reduces our vocabulary for better and faster training by removing uppercase characters and punctuation, as well as replacing two consecutive new lines with a single one.

Let's print some statistics about the dataset:

n_chars = len(text)
unique_chars = ''.join(sorted(set(text)))
print("unique_chars:", unique_chars)
n_unique_chars = len(unique_chars)
print("Number of characters:", n_chars)
print("Number of unique characters:", n_unique_chars)

Output:

unique_chars:
 0123456789abcdefghijklmnopqrstuvwxyz
Number of characters: 154207
Number of unique characters: 39

Now that we have loaded and cleaned the dataset successfully, we need a way to convert these characters into integers. There are a lot of Keras and Scikit-Learn utilities out there for that, but we are going to do it manually in Python.

Since we have unique_chars as our vocabulary, containing all the unique characters of our dataset, we can make two dictionaries that map each character to an integer and vice versa:

# dictionary that converts characters to integers
char2int = {c: i for i, c in enumerate(unique_chars)}
# dictionary that converts integers to characters
int2char = {i: c for i, c in enumerate(unique_chars)}
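
As a quick sanity check (assuming the vocabulary printed earlier, where "a" ends up at index 12), the two dictionaries should be inverses of each other:

# the exact indices depend on your vocabulary
print(char2int["a"])  # 12
print(int2char[12])   # a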

Let's save them to a file (to retrieve them later in text generation):

# save these dictionaries for later generation
pickle.dump(char2int, open("char2int.pickle", "wb"))
pickle.dump(int2char, open("int2char.pickle", "wb"))

Now we need to split the text up into subsequences with a fixed size of 100 characters. As discussed earlier, the input is a sequence of 100 characters (converted to integers) and the output is the next character (one-hot encoded). Let's do it:

# hyperparameters
sequence_length = 100
step = 1
batch_size = 128
epochs = 40
sentences = []
y_train = []
for i in range(0, len(text) - sequence_length, step):
    sentences.append(text[i: i + sequence_length])
    y_train.append(text[i+sequence_length])
print("Number of sentences:", len(sentences))

Output:

Number of sentences: 154107

I've chosen 40 epochs for this problem; training will take a few hours, and you can add more epochs to gain better performance.

The above code creates two new lists that contain all the sentences (fixed-length sequences of 100 characters) and their corresponding outputs (the next character).

Now we need to transform the list of input sequences into the form (number_of_sentences, sequence_length, n_unique_chars).

n_unique_chars is the total vocabulary size, in this case 39 unique characters.

# vectorization
X = np.zeros((len(sentences), sequence_length, n_unique_chars))
y = np.zeros((len(sentences), n_unique_chars))

for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        # one-hot encode each character of the input sequence
        X[i, t, char2int[char]] = 1
    # one-hot encode the output character (once per sentence)
    y[i, char2int[y_train[i]]] = 1
print("X.shape:", X.shape)
print("y.shape:", y.shape)

Output:

X.shape: (154107, 100, 39)
y.shape: (154107, 39)

As expected, each character (in the input sequences or as the output character) is represented as a vector of 39 numbers, all zeros except for a 1 at the character's index. For example, "a" (index value of 12) is one-hot encoded like this:

[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]
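
You can reproduce that vector yourself from char2int; this is just an illustrative check, not part of the training script:

# one-hot encode the character "a" manually
a_vector = np.zeros(n_unique_chars)
a_vector[char2int["a"]] = 1
print(a_vector)  # 1.0 at index 12, zeros everywhere else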

Building the Model

Now let's build the model. It basically has one LSTM layer (more layers can improve results) with 128 LSTM units (an arbitrary choice).

The output layer is a fully connected layer with 39 units, where each neuron corresponds to a character (the probability of the occurrence of each character).

# building the model
model = Sequential([
    LSTM(128, input_shape=(sequence_length, n_unique_chars)),
    Dense(n_unique_chars, activation="softmax"),
])

Training the Model

Let's train the model now:

model.summary()
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
# make the results folder if it does not exist yet
if not os.path.isdir("results"):
    os.mkdir("results")
# save the model in each epoch
checkpoint = ModelCheckpoint("results/wonderland-v1-{loss:.2f}.h5", verbose=1)
model.fit(X, y, batch_size=batch_size, epochs=epochs, callbacks=[checkpoint])

This will start training, and the output will look something like this:

Epoch 00026: saving model to results/wonderland-v1-1.10.h5
Epoch 27/40
154107/154107 [==============================] - 314s 2ms/step - loss: 1.0901 - acc: 0.6632

Epoch 00027: saving model to results/wonderland-v1-1.09.h5
Epoch 28/40
 80384/154107 [==============>...............] - ETA: 2:24 - loss: 1.0770 - acc: 0.6694

This will take a few hours, depending on your hardware. Try increasing batch_size to 256 for faster training.

After each epoch, the checkpoint callback will save the model weights to the results folder.

Generating New Text

Now we have trained the model, how can we generate new text?

Open up a new file (I'll call it generate.py) and import the following:

import numpy as np
import pickle
import tqdm
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.callbacks import ModelCheckpoint

We need a sample text to start generating with. You can take sentences from the training data, which will perform better, but I'll try to produce a new chapter:

seed = "chapter xiii"

Let's load the dictionaries that map each integer to a character and vice versa, which we saved earlier during training:

char2int = pickle.load(open("char2int.pickle", "rb"))
int2char = pickle.load(open("int2char.pickle", "rb"))

Building the model again:

sequence_length = 100
n_unique_chars = len(char2int)

# building the model
model = Sequential([
    LSTM(128, input_shape=(sequence_length, n_unique_chars)),
    Dense(n_unique_chars, activation="softmax"),
])

Now we need to load the optimal set of model weights; choose the file with the lowest loss in the results folder:

model.load_weights("results/wonderland-v1-1.10.h5")
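
Since the loss is embedded in the file name, you can also pick the best checkpoint programmatically; here is a small sketch that assumes the naming pattern used during training:

import glob
# parse the loss out of each file name and keep the checkpoint with the smallest one
files = glob.glob("results/wonderland-v1-*.h5")
best_weights = min(files, key=lambda f: float(f.split("-")[-1][:-3]))
model.load_weights(best_weights)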

Let's start generating:

# generate 400 characters
generated = ""
for i in tqdm.tqdm(range(400), "Generating text"):
    # make the input sequence
    X = np.zeros((1, sequence_length, n_unique_chars))
    for t, char in enumerate(seed):
        X[0, (sequence_length - len(seed)) + t, char2int[char]] = 1
    # predict the next character
    predicted = model.predict(X, verbose=0)[0]
    # converting the vector to an integer
    next_index = np.argmax(predicted)
    # converting the integer to a character
    next_char = int2char[next_index]
    # add the character to results
    generated += next_char
    # shift seed and the predicted character
    seed = seed[1:] + next_char
print("Generated text:")
print(generated)

All we are doing here is starting with a seed text, constructing the input sequence, and then predicting the next character. After that, we shift the input sequence by removing the first character and appending the character we just predicted. This gives us a slightly changed input sequence that still has a length equal to our sequence length.

We then feed this updated input sequence into the model to predict another character; repeating this process N times generates a text of N characters.
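
Here is the shifting step in isolation, with a toy seed and a made-up predicted character just for illustration:

seed = "chapter xiii"
next_char = "a"  # pretend the model predicted "a"
# drop the first character and append the prediction
seed = seed[1:] + next_char
print(seed)  # hapter xiiia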

Here is an interesting text generated:

Generated Text:
ded of and alice as it go on and the court
well you wont you wouldncopy thing
there was not a long to growing anxiously any only a low every cant
go on a litter which was proves of any only here and the things and the mort meding and the mort and alice was the things said to herself i cant remeran as if i can repeat eften to alice any of great offf its archive of and alice and a cancur as the mo

That is clearly English! But most of the sentences don't make sense; that is because this is a character-level model.

Note, though, that this is not limited to English text; you can use whatever type of text you want. In fact, you can even generate Python code once you have enough lines of code to train on.

Conclusion

Great, we are done. Now you know how to use RNNs in Keras as generative models: training an LSTM network on text sequences, cleaning text, and tuning the model's performance.

In order to further improve the model, you can:

  • Reduce the vocabulary size by removing rarely occurring characters.
  • Train the model on padded sequences.
  • Add more LSTM and Dropout layers with more LSTM units (see the sketch after this list).
  • Tweak the batch size and see which works best.
  • Train on more epochs.
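
For instance, a deeper variant of the model above could stack two LSTM layers with Dropout in between; this is only a sketch of the idea, and the unit counts and dropout rate are arbitrary:

from keras.layers import Dropout

# a deeper variant: stacked LSTMs with dropout in between (untested sketch)
model = Sequential([
    LSTM(256, input_shape=(sequence_length, n_unique_chars), return_sequences=True),
    Dropout(0.3),
    LSTM(256),
    Dense(n_unique_chars, activation="softmax"),
])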

I suggest you grab your own text; just make sure it is long enough (more than 100K characters), and train on it!

Check the full code here (modified a little bit).

Happy Training ♥
