Speech Recognition using Transformers in Python

Learn how to perform automatic speech recognition (ASR) using the Wav2Vec2 transformer with the help of the Huggingface transformers library in Python.



Automatic Speech Recognition (ASR) is the technology that allows us to convert human speech into digital text. In this tutorial, we will dive into the current state-of-the-art model called Wav2vec2 using the Huggingface transformers library in Python.

Wav2Vec2 is a pre-trained model that was trained on speech audio alone (self-supervised) and then followed by fine-tuning on transcribed speech data (LibriSpeech dataset). It has outperformed previous semi-supervised models.

As in masked language modeling, Wav2Vec2 encodes speech audio via a multi-layer convolutional neural network and then masks spans of the resulting latent speech representations. These representations are then fed to a Transformer network to build contextualized representations; check the Wav2Vec2 paper for more information.

To get started, let's install the required libraries:

$ pip3 install transformers==4.11.2 soundfile sentencepiece torchaudio pydub pyaudio

We'll be using torchaudio for loading audio files. Note that you need to install PyAudio if you're going to run the code in your local environment, and PyDub if you're in a Colab environment. We'll use them for recording from the microphone in Python.

Let's import our libraries:

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
import soundfile as sf
# import librosa
import os
import torchaudio

Next, let's load the processor and the model weights of Wav2Vec2:

# model_name = "facebook/wav2vec2-base-960h" # 360MB
model_name = "facebook/wav2vec2-large-960h-lv60-self" # 1.18GB

processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

There are two commonly used model architectures and weights for Wav2Vec2. wav2vec2-base-960h is the base architecture, about 360MB in size; it achieved a 3.4% Word Error Rate (WER) on the clean test set and was trained on 960 hours of the LibriSpeech dataset with 16kHz sampled speech audio.

On the other hand, wav2vec2-large-960h-lv60-self is a larger model, about 1.18GB in size (which may be too heavy for machines with limited RAM), but it achieved a 1.9% WER (the lower, the better) on the clean test set. So this one is better for recognition, but heavier and slower at inference. Feel free to choose whichever suits you best.

Wav2Vec2 was trained using Connectionist Temporal Classification (CTC), which is why we use the Wav2Vec2ForCTC class to load the model.
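
As a quick optional check (not part of the original tutorial), you can inspect the CTC vocabulary and the size of the loaded checkpoint; the 32 tokens you'll see here are exactly the last dimension of the logits we get later:

# optional sanity check (not in the original tutorial): inspect the CTC vocabulary
# and the number of parameters of the loaded checkpoint
vocab = processor.tokenizer.get_vocab()
print(len(vocab))  # 32 tokens: characters plus special tokens such as <pad>
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")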

Next, I've gathered some audio samples from the web:

# audio_url = "http://www.fit.vutbr.cz/~motlicek/sympatex/f2bjrop1.0.wav"
# audio_url = "http://www.fit.vutbr.cz/~motlicek/sympatex/f2bjrop1.1.wav"
# audio_url = "http://www.fit.vutbr.cz/~motlicek/sympatex/f2btrop6.0.wav"
# audio_url = "https://github.com/x4nth055/pythoncode-tutorials/raw/master/machine-learning/speech-recognition/16-122828-0002.wav"
audio_url = "https://github.com/x4nth055/pythoncode-tutorials/raw/master/machine-learning/speech-recognition/30-4447-0004.wav"
# audio_url = "https://github.com/x4nth055/pythoncode-tutorials/raw/master/machine-learning/speech-recognition/7601-291468-0006.wav"
# audio_url = "https://file-examples-com.github.io/uploads/2017/11/file_example_WAV_1MG.wav"
# audio_url = "http://www0.cs.ucl.ac.uk/teaching/GZ05/samples/lathe.wav"

Feel free to choose any. Loading the audio file:

# load our wav file
speech, sr = torchaudio.load(audio_url)
speech = speech.squeeze()
# or using librosa
# speech, sr = librosa.load(audio_file, sr=16000)
sr, speech.shape
(16000, torch.Size([274000]))

The torchaudio.load() function loads the audio file and returns both the audio as a tensor and its sample rate. It also automatically downloads the file if you pass a URL; if you pass a path on disk, it will load it as well.

Note that we also use the squeeze() method to remove the dimensions of size 1, i.e., converting the tensor from (1, 274000) to (274000,).
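
If the file happens to be stereo, squeeze() alone won't reduce it to a single channel. A more general approach (an assumption, not part of the original code) is to average the channels down to mono:

# average multi-channel audio down to mono
# (an assumption; the tutorial's sample files are already mono)
if speech.ndim > 1:
    speech = speech.mean(dim=0)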

Next, we need to make sure the audio we feed to the model has a sample rate of 16,000Hz (16kHz), because Wav2Vec2 was trained on that:

# resample from whatever the audio sampling rate to 16000
resampler = torchaudio.transforms.Resample(sr, 16000)
speech = resampler(speech)
speech.shape
torch.Size([274000])

We used Resample from torchaudio.transforms, which helps us convert the loaded audio on the fly from one sampling rate to another.
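
A small variation you may prefer (an assumption, not in the original code) is to resample only when the source rate actually differs from 16kHz:

# resample only if the source sampling rate differs from the target
target_sr = 16000
if sr != target_sr:
    speech = torchaudio.transforms.Resample(sr, target_sr)(speech)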

Before we do the inference, we pass the audio vector to the wav2vec2 processor:

# tokenize our wav
input_values = processor(speech, return_tensors="pt", sampling_rate=16000)["input_values"]
input_values.shape
torch.Size([1, 274000])

We specify the sampling_rate and pass "pt" to the return_tensors argument to get PyTorch tensors in the results.
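
The processor returns a dictionary-like object, so attribute access works too; this is just a stylistic alternative to indexing with "input_values":

# equivalent attribute-style access (same tensor as indexing with "input_values")
inputs = processor(speech, return_tensors="pt", sampling_rate=16000)
input_values = inputs.input_values  # shape: (1, 274000)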

Performing inference:

# perform inference
logits = model(input_values)["logits"]
logits.shape
torch.Size([1, 856, 32])

Each of the 856 time frames gets a score for each of the 32 tokens in the CTC vocabulary. Passing the logits to torch.argmax() over the last dimension gives the most likely token ID for each frame:

# use argmax to get the predicted IDs
predicted_ids = torch.argmax(logits, dim=-1)
predicted_ids.shape
torch.Size([1, 856])

Let's decode the IDs back to text; we also lowercase it, as the output is in all caps:

# decode the IDs to text
transcription = processor.decode(predicted_ids[0])
transcription.lower()
and missus goddard three ladies almost always at the service of an invitation from hartfield and who were fetched and carried home so often that mister woodhouse thought it no hardship for either james or the horses had it taken place only once a year it would have been a grievance
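
If you ever transcribe several files in one batch, processor.batch_decode() decodes all the predicted sequences at once; for our single-item batch it is equivalent:

# decode a whole batch of predicted IDs at once (batch size is 1 here)
transcriptions = processor.batch_decode(predicted_ids)
print(transcriptions[0].lower())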

Now let's put all our previous code into a single function that accepts the audio path (or URL) and returns the transcription:

def get_transcription(audio_path):
  # load our wav file
  speech, sr = torchaudio.load(audio_path)
  speech = speech.squeeze()
  # or using librosa
  # speech, sr = librosa.load(audio_file, sr=16000)
  # resample from whatever the audio sampling rate to 16000
  resampler = torchaudio.transforms.Resample(sr, 16000)
  speech = resampler(speech)
  # tokenize our wav
  input_values = processor(speech, return_tensors="pt", sampling_rate=16000)["input_values"]
  # perform inference
  logits = model(input_values)["logits"]
  # use argmax to get the predicted IDs
  predicted_ids = torch.argmax(logits, dim=-1)
  # decode the IDs to text
  transcription = processor.decode(predicted_ids[0])
  return transcription.lower()
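
One optional tweak (not part of the original code): running the forward pass under torch.no_grad() keeps PyTorch from tracking gradients, which saves memory during inference. The inference line inside the function would then look like this:

# optional: run inference without tracking gradients to save memory
with torch.no_grad():
    logits = model(input_values)["logits"]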

Awesome! You can now pass the path or URL of any speech audio file:

get_transcription("http://www0.cs.ucl.ac.uk/teaching/GZ05/samples/lathe.wav")
a late is a big tool grab every dish of sugar

Conclusion

Awesome! Now, if you want to use your own voice, I have prepared code snippets in the notebooks for recording from your microphone, depending on your environment.
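
Here is a minimal sketch along those lines (an assumption: it uses PyAudio in a local environment, and it is not the exact notebook snippet). It records a few seconds of 16kHz mono audio from the default microphone, saves it as a WAV file, and passes it to get_transcription():

import pyaudio
import wave

def record(filename="recorded.wav", seconds=5, rate=16000, chunk=1024):
    # record `seconds` of 16kHz mono audio from the default microphone
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16, channels=1, rate=rate,
                    input=True, frames_per_buffer=chunk)
    frames = [stream.read(chunk) for _ in range(int(rate / chunk * seconds))]
    stream.stop_stream()
    stream.close()
    sample_width = p.get_sample_size(pyaudio.paInt16)
    p.terminate()
    # save the recorded frames as a WAV file
    with wave.open(filename, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(b"".join(frames))
    return filename

# print(get_transcription(record()))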

Note that there are other Wav2Vec2 weights trained by other people in languages other than English. Check the models page and filter by your desired language to find the model you want.
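
Loading such a checkpoint works exactly like before; just swap the model name (the name below is a hypothetical placeholder, pick a real one from the Hub):

# the model name below is a hypothetical placeholder; pick a real checkpoint from the Hub
# model_name = "some-user/wav2vec2-large-xlsr-53-your-language"
# processor = Wav2Vec2Processor.from_pretrained(model_name)
# model = Wav2Vec2ForCTC.from_pretrained(model_name)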

Learn also: Machine Translation using Transformers in Python.


Happy learning ♥
