Automatic Speech Recognition (ASR) is the technology that allows us to convert human speech into digital text. This tutorial will dive into one of the current state-of-the-art models, Wav2Vec2, using the Hugging Face Transformers library in Python.
Wav2Vec2 is a pre-trained model that was trained on speech audio alone (self-supervised) and then fine-tuned on transcribed speech data (the LibriSpeech dataset). It has outperformed previous semi-supervised models.
As in Masked Language Modeling, Wav2Vec2 encodes speech audio via a multi-layer convolutional neural network and then masks spans of the resulting latent speech representations. These representations are then fed to a Transformer network to build contextualized representations; check the Wav2Vec2 paper for more information.
To get started, let's install the required libraries:
$ pip3 install transformers==4.11.2 soundfile sentencepiece torchaudio pydub pyaudio
We'll be using torchaudio for loading audio files. Note that you need PyAudio if you're running the code in your local environment and PyDub if you're in a Colab environment; we'll use them for recording from the microphone in Python.
Let's import our libraries:
from transformers import *
import torch
import soundfile as sf
# import librosa
import os
import torchaudio
Next, we load the processor and the model weights of Wav2Vec2:
# model_name = "facebook/wav2vec2-base-960h" # 360MB
model_name = "facebook/wav2vec2-large-960h-lv60-self" # 1.18GB

processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
There are two commonly used model architectures and sets of weights for Wav2Vec2.
wav2vec2-base-960h is the base architecture, about 360MB in size. It achieves a 3.4% Word Error Rate (WER) on the clean test set and was trained on 960 hours of LibriSpeech audio sampled at 16kHz.
On the other hand,
wav2vec2-large-960h-lv60-self is a larger model of about 1.18GB (which may be heavy for low-memory machines), but it achieves a 1.9% WER (the lower, the better) on the clean test set. It is more accurate, but also heavier and slower at inference. Feel free to choose whichever suits you best.
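If you're unsure which one your machine can handle, a quick way to compare them is to count the parameters of whichever model you loaded above (a small sketch; the parameter counts in the comment are approximate):

# rough size check of the loaded model
num_params = sum(p.numel() for p in model.parameters())
print(f"{model_name}: {num_params / 1e6:.1f}M parameters")
# the base checkpoint has roughly 95M parameters, the large one roughly 315M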
Wav2Vec2 was trained using Connectionist Temporal Classification (CTC), which is why we use the Wav2Vec2ForCTC class to load the model.
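With CTC, the model emits a probability distribution over a small character vocabulary (letters, a word delimiter, and a blank/pad token) for every audio frame. You can peek at that vocabulary through the processor (a quick sanity check, assuming the processor and model loaded above):

# the CTC head predicts one of these tokens per audio frame
vocab = processor.tokenizer.get_vocab()
print(len(vocab))                     # 32 tokens for the English 960h checkpoints
print(sorted(vocab, key=vocab.get))   # characters, the "|" word delimiter, and special tokens
print(model.config.vocab_size)        # matches the last dimension of the logits later on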
Next, here are some audio samples:
# audio_url = "https://github.com/x4nth055/pythoncode-tutorials/raw/master/machine-learning/speech-recognition/16-122828-0002.wav"
audio_url = "https://github.com/x4nth055/pythoncode-tutorials/raw/master/machine-learning/speech-recognition/30-4447-0004.wav"
# audio_url = "https://github.com/x4nth055/pythoncode-tutorials/raw/master/machine-learning/speech-recognition/7601-291468-0006.wav"
Feel free to choose any of the above audio files. The cell below loads the audio file:
# load our wav file
speech, sr = torchaudio.load(audio_url)
speech = speech.squeeze()
# or using librosa
# speech, sr = librosa.load(audio_file, sr=16000)
sr, speech.shape
The torchaudio.load() function loads the audio file and returns the waveform as a tensor along with its sample rate. If you pass a URL, it downloads the file automatically; if you pass a path on disk, it loads it directly.
Note that we also use the squeeze() method to remove dimensions of size 1, i.e., converting the tensor from shape (1, 274000) to (274000,).
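As a tiny illustration of what squeeze() does here (274000 is just the number of samples in this particular file):

# squeeze() drops dimensions of size 1: (1, 274000) -> (274000,)
x = torch.ones(1, 274000)
print(x.shape)            # torch.Size([1, 274000])
print(x.squeeze().shape)  # torch.Size([274000])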
Next, we need to make sure the input audio has a sample rate of 16,000Hz, because Wav2Vec2 was trained on 16kHz speech:
# resample from whatever the audio sampling rate to 16000
resampler = torchaudio.transforms.Resample(sr, 16000)
speech = resampler(speech)
speech.shape
Before we run inference, we pass the audio tensor to the Wav2Vec2 processor:
# tokenize our wav
input_values = processor(speech, return_tensors="pt", sampling_rate=16000)["input_values"]
input_values.shape
We specify the sampling_rate and pass return_tensors="pt" to get PyTorch tensors in the results.
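As a side note, the same processor call can handle several clips at once by padding them to a common length; a small sketch (speech_list here is a hypothetical list of waveforms already resampled to 16kHz, not part of this tutorial's single-file flow):

# sketch: process several resampled waveforms at once (speech_list is hypothetical)
batch = processor(speech_list, return_tensors="pt", padding=True, sampling_rate=16000)
print(batch["input_values"].shape)  # (batch_size, length_of_longest_clip)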
Let's pass the vector into our model now:
# perform inference
logits = model(input_values)["logits"]
logits.shape
torch.Size([1, 856, 32])
We pass the logits to torch.argmax() to get the most likely prediction for each time step:
# use argmax to get the predicted IDs
predicted_ids = torch.argmax(logits, dim=-1)
predicted_ids.shape
torch.Size([1, 856])
We decode the IDs back to text and lowercase it, since the model outputs all caps:
# decode the IDs to text (take the first item of the batch)
transcription = processor.decode(predicted_ids[0])
transcription.lower()
and missus goddard three ladies almost always at the service of an invitation from hartfield and who were fetched and carried home so often that mister woodhouse thought it no hardship for either james or the horses had it taken place only once a year it would have been a grievance
Now let's collect all our previous code into a single function, which accepts the audio path and returns the transcription:
def get_transcription(audio_path):
    # load our wav file
    speech, sr = torchaudio.load(audio_path)
    speech = speech.squeeze()
    # or using librosa
    # speech, sr = librosa.load(audio_file, sr=16000)
    # resample from whatever the audio sampling rate to 16000
    resampler = torchaudio.transforms.Resample(sr, 16000)
    speech = resampler(speech)
    # tokenize our wav
    input_values = processor(speech, return_tensors="pt", sampling_rate=16000)["input_values"]
    # perform inference
    logits = model(input_values)["logits"]
    # use argmax to get the predicted IDs
    predicted_ids = torch.argmax(logits, dim=-1)
    # decode the IDs to text (take the first item of the batch)
    transcription = processor.decode(predicted_ids[0])
    return transcription.lower()
Awesome, you can pass any audio speech file path:
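For example, a sketch of the call (the path below is just a placeholder; substitute any local file or URL of a speech recording):

# placeholder path: replace with your own audio file or URL
print(get_transcription("path/to/your/audio.wav"))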
a late is a big tool grab every dish of sugar
Awesome! Now, if you want to use your own voice, I have prepared code snippets in the notebooks for recording with your microphone; pick the one that matches your environment.
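If you just want a quick local version, here is a minimal sketch of recording a few seconds with PyAudio and saving a 16kHz WAV file that can go straight into get_transcription() (the function name, duration, and filename are our own choices, not from the notebooks):

import wave
import pyaudio

def record_to_wav(filename="recorded.wav", seconds=5, rate=16000):
    """Record mono 16-bit audio from the default microphone and save it as a WAV file."""
    chunk = 1024
    audio = pyaudio.PyAudio()
    stream = audio.open(format=pyaudio.paInt16, channels=1, rate=rate,
                        input=True, frames_per_buffer=chunk)
    # read enough chunks to cover the requested duration
    frames = [stream.read(chunk) for _ in range(int(rate / chunk * seconds))]
    stream.stop_stream()
    stream.close()
    sample_width = audio.get_sample_size(pyaudio.paInt16)
    audio.terminate()
    # save at 16kHz so the file matches what the model expects
    with wave.open(filename, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(b"".join(frames))
    return filename

print(get_transcription(record_to_wav(seconds=5)))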
Note that there are other Wav2Vec2 weights trained by the community for languages other than English. Check the models page on the Hugging Face Hub and filter by the language you want to find a suitable model.
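Loading such a model works exactly like before; only the checkpoint name changes. A sketch (the French model id below is an assumption on our part, so verify the exact id on the Hub before using it):

# same loading pattern with a non-English checkpoint (verify the model id on the Hub)
model_name = "facebook/wav2vec2-large-xlsr-53-french"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)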
Happy learning ♥