Automatic Speech Recognition (ASR) is the technology that converts human speech into written text. In this tutorial, we will dive into the state-of-the-art Wav2Vec2 model using the Hugging Face Transformers library in Python.
Wav2Vec2 is a pre-trained model that was first trained on speech audio alone (self-supervised) and then fine-tuned on transcribed speech data (the LibriSpeech dataset). It has outperformed previous semi-supervised models.
As in Masked Language Modeling, Wav2Vec2 encodes speech audio via a multi-layer convolutional neural network and then masks spans of the resulting latent speech representations. These representations are then fed to a Transformer network to build contextualized representations; check the Wav2Vec2 paper for more information.
To get started, let's install the required libraries:
$ pip3 install transformers==4.11.2 soundfile sentencepiece torchaudio pydub pyaudio
We'll be using torchaudio for loading audio files. Note that you need PyAudio if you're going to run the code locally and PyDub if you're in a Colab environment; we will use them for recording from the microphone in Python.
Let's import our libraries:
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
import soundfile as sf
# import librosa
import os
import torchaudio
Next, let's load the processor and the model weights of Wav2Vec2:
# model_name = "facebook/wav2vec2-base-960h" # 360MB
model_name = "facebook/wav2vec2-large-960h-lv60-self" # 1.18GB
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
There are two commonly used architectures and sets of weights for Wav2Vec2. wav2vec2-base-960h is the base architecture, about 360MB in size; it achieved a 3.4% Word Error Rate (WER) on the clean test set and was trained on 960 hours of LibriSpeech 16kHz sampled speech audio. On the other hand, wav2vec2-large-960h-lv60-self is a larger model of about 1.18GB (it needs noticeably more memory) that achieved 1.9% WER (the lower, the better) on the clean test set. It recognizes speech better, but it is heavier and takes more time for inference. Feel free to choose whichever suits you best.
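If you're curious about the size difference, one quick optional check (my own addition, not part of the tutorial's code) is to count the parameters of the model you just loaded:
# rough size check: count the parameters of the loaded model
num_params = sum(p.numel() for p in model.parameters())
print(f"{model_name}: {num_params / 1e6:.0f}M parameters")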
Wav2Vec2 was trained using Connectionist Temporal Classification (CTC), which is why we use the Wav2Vec2ForCTC class to load the model.
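To build some intuition for what CTC decoding does, think of it as collapsing repeated predictions and dropping the blank/padding token. The snippet below is a toy illustration only, not the library's actual implementation:
import itertools

def toy_ctc_collapse(ids, blank_id=0):
    # collapse consecutive duplicates, then drop the blank token
    collapsed = [token for token, _ in itertools.groupby(ids)]
    return [token for token in collapsed if token != blank_id]

print(toy_ctc_collapse([0, 7, 7, 0, 0, 12, 12, 12, 0, 5]))  # -> [7, 12, 5]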
Next, here are some audio samples:
# audio_url = "https://github.com/x4nth055/pythoncode-tutorials/raw/master/machine-learning/speech-recognition/16-122828-0002.wav"
audio_url = "https://github.com/x4nth055/pythoncode-tutorials/raw/master/machine-learning/speech-recognition/30-4447-0004.wav"
# audio_url = "https://github.com/x4nth055/pythoncode-tutorials/raw/master/machine-learning/speech-recognition/7601-291468-0006.wav"
Feel free to choose any of the above audio files. The cell below loads the audio file:
# load our wav file
speech, sr = torchaudio.load(audio_url)
speech = speech.squeeze()
# or using librosa
# speech, sr = librosa.load(audio_file, sr=16000)
sr, speech.shape
(16000, torch.Size([274000]))
The torchaudio.load() function loads the audio file and returns the audio as a tensor along with its sample rate. It also automatically downloads the file if you pass a URL; if you pass a path on disk, it loads it directly.
Note that we also use the squeeze() method to remove any dimensions of size 1, i.e., converting the tensor from shape (1, 274000) to (274000,).
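One thing to keep in mind (my own note, not something the tutorial covers): if your file is stereo, the loaded tensor has shape (2, n_samples) and squeeze() won't make it mono; a common fix is to average the channels:
# if the audio has more than one channel, average them into a mono signal
if speech.ndim > 1:
    speech = speech.mean(dim=0)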
Next, we need to make sure the audio we feed to the model has a sampling rate of 16,000Hz (16kHz), because Wav2Vec2 was trained on 16kHz speech:
# resample from the original sampling rate to 16000Hz
resampler = torchaudio.transforms.Resample(sr, 16000)
speech = resampler(speech)
speech.shape
torch.Size([274000])
We used Resample from torchaudio.transforms, which converts the loaded audio on the fly from one sampling rate to another.
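If you prefer, you can skip the resampling step entirely when the file is already at 16kHz. This small check is an optional addition of mine, not part of the original code:
# only resample when the original sampling rate differs from 16kHz
if sr != 16000:
    speech = torchaudio.transforms.Resample(sr, 16000)(speech)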
Before we make the inference, we pass the audio vector to the wav2vec2 processor:
# preprocess the audio for the model
input_values = processor(speech, return_tensors="pt", sampling_rate=16000)["input_values"]
input_values.shape
torch.Size([1, 274000])
We specify the sampling_rate and pass "pt" to the return_tensors argument to get PyTorch tensors in the result.
Let's pass the vector into our model now:
# perform inference
logits = model(input_values)["logits"]
logits.shape
torch.Size([1, 856, 32])
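As a small optional tweak (not in the original code), you can wrap the forward pass in torch.no_grad() since we don't need gradients for inference; this saves memory and speeds things up a bit:
# inference only: disable gradient tracking to save memory
with torch.no_grad():
    logits = model(input_values).logits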
We pass the logits to torch.argmax() to get the most likely prediction at each time step:
# use argmax to get the predicted IDs
predicted_ids = torch.argmax(logits, dim=-1)
predicted_ids.shape
torch.Size([1, 856])
Let's decode the predicted IDs back to text; we also lowercase the output, as the model produces all caps:
# decode the IDs to text
transcription = processor.decode(predicted_ids[0])
transcription.lower()
and missus goddard three ladies almost always at the service of an invitation from hartfield and who were fetched and carried home so often that mister woodhouse thought it no hardship for either james or the horses had it taken place only once a year it would have been a grievance
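As a side note, if you ever decode more than one prediction at a time, processor.batch_decode() handles a whole batch at once; for our single clip it simply returns a one-element list:
# decode every item in the batch (here the batch contains a single clip)
transcriptions = processor.batch_decode(predicted_ids)
print(transcriptions[0].lower())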
Now let's collect all our previous code into a single function, which accepts the audio path and returns the transcription:
def get_transcription(audio_path):
    # load our wav file
    speech, sr = torchaudio.load(audio_path)
    speech = speech.squeeze()
    # or using librosa
    # speech, sr = librosa.load(audio_path, sr=16000)
    # resample from the original sampling rate to 16000Hz
    resampler = torchaudio.transforms.Resample(sr, 16000)
    speech = resampler(speech)
    # preprocess the audio for the model
    input_values = processor(speech, return_tensors="pt", sampling_rate=16000)["input_values"]
    # perform inference
    logits = model(input_values)["logits"]
    # use argmax to get the predicted IDs
    predicted_ids = torch.argmax(logits, dim=-1)
    # decode the IDs to text
    transcription = processor.decode(predicted_ids[0])
    return transcription.lower()
Awesome, you can now pass any speech audio file path or URL:
get_transcription("http://www0.cs.ucl.ac.uk/teaching/GZ05/samples/lathe.wav")
a late is a big tool grab every dish of sugar
Awesome! Now if you want to use your own voice, I have prepared code snippets in the notebooks for recording from your microphone; feel free to grab the one that matches your environment.
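If you're running locally, a minimal recording sketch with PyAudio could look like the following. This is my own illustrative version, not the notebook's exact snippet; the function name, duration, and output filename are arbitrary:
import pyaudio
import wave

def record_audio(filename="recorded.wav", seconds=5, rate=16000):
    # record `seconds` of mono 16kHz audio from the default microphone
    chunk = 1024
    audio_format = pyaudio.paInt16
    p = pyaudio.PyAudio()
    stream = p.open(format=audio_format, channels=1, rate=rate,
                    input=True, frames_per_buffer=chunk)
    frames = [stream.read(chunk) for _ in range(int(rate / chunk * seconds))]
    stream.stop_stream()
    stream.close()
    sample_width = p.get_sample_size(audio_format)
    p.terminate()
    # write the recorded frames to a 16-bit WAV file
    with wave.open(filename, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(b"".join(frames))
    return filename

# record for 5 seconds, then transcribe the saved file
print(get_transcription(record_audio()))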
Note that there are other Wav2Vec2 weights fine-tuned by the community in languages other than English. Check the models page on the Hugging Face Hub and filter by the language you want to find a suitable checkpoint.
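For instance, a non-English checkpoint can be loaded the same way and used with the same get_transcription() function. The model id below is only an example of the naming pattern, so verify it on the Hub before relying on it:
# example: load a German checkpoint (verify the exact model id on the Hub)
model_name = "facebook/wav2vec2-large-xlsr-53-german"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)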
Happy learning ♥