How to Convert Speech to Text in Python

Abdeladim Fadheli · 9 min read · Updated May 2026 · Machine Learning · Application Programming Interfaces

Speech recognition, also called automatic speech recognition (ASR), is the process of converting spoken audio into human-readable text. In this updated tutorial, you will learn how to convert speech to text in Python using modern, reliable tools that are suitable for real applications in 2026.

The older approach in this article used the SpeechRecognition package with recognize_google(). That is still fine for quick experiments, but it is not the best default anymore: it depends on a free/unofficial endpoint, has practical limits, and does not give you modern features such as strong multilingual accuracy, robust long-audio handling, voice activity detection, or high-quality timestamps.

As of May 14, 2026, a better Python speech-to-text stack is:

OpenAI gpt-4o-transcribe: a high-accuracy hosted API option for production transcription.
OpenAI gpt-4o-mini-transcribe: a cheaper hosted option when cost matters more than maximum accuracy.
Faster-Whisper: a fast local/offline implementation of Whisper using CTranslate2, great when you want privacy or no per-minute API cost.
Groq Whisper: a very fast hosted Whisper API, useful when you want low latency and OpenAI-compatible transcription calls.
WhisperX: useful when you need word-level alignment or speaker diarization for podcasts, interviews, and meetings.

In this tutorial, we will focus on three practical Python solutions: OpenAI for best hosted accuracy, Faster-Whisper for local/offline transcription, and Groq for fast hosted Whisper transcription. We will also handle microphone recording, long audio files, and SRT subtitle output.

Learn also: How to Convert Text to Speech in Python.

Which Speech-to-Text Tool Should You Use?

Tool	Best for	Pros	Trade-offs
`gpt-4o-transcribe`	Production accuracy	Excellent accuracy, multilingual, simple API	Requires an API key and uploads audio to OpenAI
`gpt-4o-mini-transcribe`	Lower-cost API transcription	Cheaper and still strong for many use cases	May be less accurate than the full model
Faster-Whisper	Offline/local transcription	Private, fast, no per-minute API cost, supports VAD	Needs local CPU/GPU resources
Groq Whisper	Fast hosted Whisper transcription	Very low latency, OpenAI-compatible API style	Hosted API; available models depend on the provider
WhisperX	Diarization and word timestamps	Speaker labels and better alignment	Heavier setup, often needs GPU/Hugging Face token for diarization

If you only want the easiest reliable solution, use gpt-4o-transcribe. If you need offline transcription or you cannot upload audio to a third party, use Faster-Whisper. If you want a fast hosted Whisper API, Groq is a good option.

Installing the Dependencies

Create a virtual environment first:

python -m venv .venv
source .venv/bin/activate

On Windows PowerShell, activate it with:

.\.venv\Scripts\Activate.ps1

Install the Python packages:

pip install -U openai faster-whisper groq sounddevice scipy

You should also install FFmpeg, because it lets us convert MP3, MP4, M4A, WebM, and other formats to clean mono WAV audio when needed.

On Ubuntu/Debian:

sudo apt update
sudo apt install ffmpeg

On macOS:

brew install ffmpeg

On Windows, you can install FFmpeg with Chocolatey:

choco install ffmpeg

Method 1: Convert Speech to Text with OpenAI

This is the simplest production-ready option. Set your API key first:

export OPENAI_API_KEY="your-api-key-here"

On Windows PowerShell:

$env:OPENAI_API_KEY="your-api-key-here"

Now create a Python file called openai_transcribe.py:

from pathlib import Path
from openai import OpenAI

client = OpenAI()


def transcribe_with_openai(
    audio_path: str,
    model: str = "gpt-4o-transcribe",
    language: str | None = None,
    prompt: str | None = None,
) -> str:
    """Transcribe an audio file with OpenAI's speech-to-text API."""
    kwargs = {"model": model}
    if language:
        kwargs["language"] = language
    if prompt:
        kwargs["prompt"] = prompt

    with Path(audio_path).open("rb") as audio_file:
        transcript = client.audio.transcriptions.create(file=audio_file, **kwargs)

    return transcript.text


if __name__ == "__main__":
    text = transcribe_with_openai(
        "meeting.mp3",
        language="en",
        prompt="This is a technical meeting about Python, APIs, and machine learning.",
    )
    print(text)

Run it:

python openai_transcribe.py

The optional language parameter is useful when you already know the language. For example, use "en" for English, "fr" for French, "es" for Spanish, and so on. The optional prompt helps the model with names, acronyms, product names, or domain-specific vocabulary.

If you want a cheaper model, change:

model="gpt-4o-transcribe"

to:

model="gpt-4o-mini-transcribe"

Method 2: Convert Speech to Text Locally with Faster-Whisper

Faster-Whisper is a fast Whisper implementation powered by CTranslate2. It is a great choice when you want offline transcription, more control, or better privacy.

Create a file called local_transcribe.py:

from faster_whisper import WhisperModel


def transcribe_locally(audio_path: str, language: str | None = None) -> str:
    """Transcribe audio locally using Faster-Whisper."""
    model = WhisperModel(
        "large-v3",
        device="cpu",       # use "cuda" if you have an NVIDIA GPU
        compute_type="int8" # use "float16" on CUDA for better speed
    )

    segments, info = model.transcribe(
        audio_path,
        beam_size=5,
        language=language,
        vad_filter=True,
        vad_parameters={"min_silence_duration_ms": 500},
    )

    print(f"Detected language: {info.language} ({info.language_probability:.2f})")
    return "".join(segment.text for segment in segments).strip()


if __name__ == "__main__":
    print(transcribe_locally("meeting.mp3", language="en"))

Run it:

python local_transcribe.py

If your machine is slow, start with a smaller model:

model = WhisperModel("small", device="cpu", compute_type="int8")

If you have a decent NVIDIA GPU, use:

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

For faster, near-real-time transcription, you can also try large-v3-turbo, either by model name (if your installed Faster-Whisper version supports it) or through a Faster-Whisper-compatible checkpoint from Hugging Face.
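
The model and device choices above can be collected in one small helper. This is only a sketch: the function name pick_whisper_settings() and its two boolean flags are assumptions for illustration, and the returned keys match the WhisperModel constructor arguments (model_size_or_path, device, compute_type).

```python
def pick_whisper_settings(has_cuda: bool, slow_machine: bool = False) -> dict[str, str]:
    """Return WhisperModel keyword arguments following the guidance above (hypothetical helper)."""
    if has_cuda:
        # A decent NVIDIA GPU can run the large model in half precision.
        return {"model_size_or_path": "large-v3", "device": "cuda", "compute_type": "float16"}
    if slow_machine:
        # On a slow CPU, start with a smaller model.
        return {"model_size_or_path": "small", "device": "cpu", "compute_type": "int8"}
    # Default from the example above: large model on CPU with int8 quantization.
    return {"model_size_or_path": "large-v3", "device": "cpu", "compute_type": "int8"}


# Usage (requires faster-whisper to be installed):
# from faster_whisper import WhisperModel
# model = WhisperModel(**pick_whisper_settings(has_cuda=False, slow_machine=True))
```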

Method 3: Fast Hosted Whisper Transcription with Groq

Groq provides fast hosted Whisper models such as whisper-large-v3 and whisper-large-v3-turbo. First, set your Groq API key:

export GROQ_API_KEY="your-groq-api-key"

On Windows PowerShell:

$env:GROQ_API_KEY="your-groq-api-key"

Then use the Groq SDK:

import os
from pathlib import Path
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])


def transcribe_with_groq(audio_path: str, language: str | None = None) -> str:
    """Transcribe an audio file with Groq's hosted Whisper API."""
    kwargs = {
        "model": "whisper-large-v3-turbo",
        "temperature": 0.0,  # keep decoding as deterministic as possible
    }
    if language:
        kwargs["language"] = language

    with Path(audio_path).open("rb") as audio_file:
        transcript = client.audio.transcriptions.create(file=audio_file, **kwargs)

    return transcript.text


if __name__ == "__main__":
    print(transcribe_with_groq("meeting.mp3", language="en"))

Transcribing Long Audio Files

Hosted APIs usually have file-size limits, and long recordings can also be easier to retry if they are split into smaller chunks. A reliable approach is:

Convert the input file to mono 16 kHz WAV with FFmpeg.
Split the WAV into chunks, for example 10 minutes each.
Transcribe each chunk.
Join the partial transcripts.
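
Before running the steps above, it can be worth checking whether a file needs chunking at all. The sketch below assumes a 25 MB upload limit as its default; that number and the helper name needs_chunking() are assumptions for illustration, so check your provider's documentation for the actual limit.

```python
from pathlib import Path

# Assumed default limit (25 MB); the real limit depends on the hosted API you use.
ASSUMED_UPLOAD_LIMIT_BYTES = 25 * 1024 * 1024


def needs_chunking(audio_path: str, limit_bytes: int = ASSUMED_UPLOAD_LIMIT_BYTES) -> bool:
    """Return True when the file is larger than the upload limit (hypothetical helper)."""
    return Path(audio_path).stat().st_size > limit_bytes
```

Note that converting a compressed file to WAV usually makes it larger, so the check is most useful on the file you actually intend to upload.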

Here is the core chunking logic:

import wave
from pathlib import Path


def chunk_wav(input_wav: str, chunk_seconds: int = 600) -> list[Path]:
    """Split a WAV file into fixed-size chunks without loading it all into memory."""
    input_wav = Path(input_wav)
    output_dir = input_wav.parent / f"{input_wav.stem}_chunks"
    output_dir.mkdir(parents=True, exist_ok=True)

    chunks = []
    with wave.open(str(input_wav), "rb") as reader:
        params = reader.getparams()
        frames_per_chunk = int(params.framerate * chunk_seconds)
        index = 1

        while True:
            frames = reader.readframes(frames_per_chunk)
            if not frames:
                break

            chunk_path = output_dir / f"chunk_{index:04d}.wav"
            with wave.open(str(chunk_path), "wb") as writer:
                writer.setparams(params)
                writer.writeframes(frames)

            chunks.append(chunk_path)
            index += 1

    return chunks

The complete script at the end of this tutorial includes transcribe_large_file_with_openai(), which converts, chunks, transcribes, and joins the results automatically.

Generating SRT Subtitles

Faster-Whisper returns timestamped segments, so we can easily write an SRT file:

from pathlib import Path


def seconds_to_srt_time(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    milliseconds = round(seconds * 1000)
    hours, remainder = divmod(milliseconds, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{millis:03}"


def write_srt(segments, output_path: str) -> None:
    lines = []
    for i, segment in enumerate(segments, start=1):
        lines.extend([
            str(i),
            f"{seconds_to_srt_time(segment.start)} --> {seconds_to_srt_time(segment.end)}",
            segment.text.strip(),
            "",
        ])
    Path(output_path).write_text("\n".join(lines), encoding="utf-8")

Using the full script below, you can generate subtitles like this:

python speech_to_text_2026.py video.mp4 --engine faster-whisper --model large-v3 --srt captions.srt
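
One detail to watch for: Faster-Whisper returns segments as a generator, so they can only be iterated once. If you need both the plain transcript and the subtitles, materialize the segments into a list first. The sketch below shows the idea with a stand-in FakeSegment type so it runs without a model; FakeSegment, srt_timestamp(), and text_and_srt() are hypothetical names used only for this illustration.

```python
from typing import Iterable, NamedTuple


class FakeSegment(NamedTuple):
    """Stand-in with the same start, end, and text fields as a Faster-Whisper segment."""
    start: float
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm."""
    milliseconds = round(seconds * 1000)
    hours, remainder = divmod(milliseconds, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{millis:03}"


def text_and_srt(segments: Iterable) -> tuple[str, str]:
    """Consume the segments once and return (plain text, SRT content)."""
    materialized = list(segments)  # a generator would be exhausted after one pass
    text = "".join(segment.text for segment in materialized).strip()
    cues = [
        f"{i}\n{srt_timestamp(s.start)} --> {srt_timestamp(s.end)}\n{s.text.strip()}\n"
        for i, s in enumerate(materialized, start=1)
    ]
    return text, "\n".join(cues)


if __name__ == "__main__":
    demo = (s for s in [FakeSegment(0.0, 1.5, " Hello"), FakeSegment(1.5, 3.0, " world")])
    transcript, srt = text_and_srt(demo)
    print(transcript)
    print(srt)
```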

Recording from the Microphone

If you want to record from your microphone and then transcribe the recording, use sounddevice and scipy:

from pathlib import Path
import sounddevice as sd
from scipy.io.wavfile import write


def record_microphone(output_path: str = "microphone.wav", seconds: int = 8, sample_rate: int = 16_000) -> Path:
    print(f"Recording for {seconds} seconds...")
    audio = sd.rec(int(seconds * sample_rate), samplerate=sample_rate, channels=1, dtype="int16")
    sd.wait()
    write(output_path, sample_rate, audio)
    return Path(output_path)

With the complete script, record and transcribe 8 seconds of microphone audio using OpenAI:

python speech_to_text_2026.py --record 8 --engine openai --language en

Or record and transcribe locally:

python speech_to_text_2026.py --record 8 --engine faster-whisper --model small --language en
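
Before sending a recording to a transcription engine, it can help to confirm that the microphone actually captured something. The following sketch uses only the standard library to read a 16-bit WAV file and report its duration and peak amplitude; the helper name wav_summary() and the silence threshold of 100 are assumptions for illustration, not values from the original script.

```python
import sys
import wave
from array import array


def wav_summary(path: str, silence_threshold: int = 100) -> dict:
    """Return duration, peak amplitude, and a rough silence flag for a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as reader:
        if reader.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        frame_count = reader.getnframes()
        sample_rate = reader.getframerate()
        samples = array("h", reader.readframes(frame_count))

    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    peak = max((abs(sample) for sample in samples), default=0)
    return {
        "duration_seconds": frame_count / sample_rate,
        "peak": peak,
        # Assumed threshold: a peak this low usually means a muted or wrong input device.
        "is_silent": peak < silence_threshold,
    }
```

If is_silent comes back True, check which input device sounddevice is using before spending time or API credits on transcription.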

Complete CLI Usage

The full code section contains a complete script named speech_to_text_2026.py. Here are some examples:

# Best hosted accuracy
python speech_to_text_2026.py meeting.mp3 --engine openai --language en

# Cheaper OpenAI transcription
python speech_to_text_2026.py meeting.mp3 --engine openai --model gpt-4o-mini-transcribe --language en

# Long file with OpenAI chunking
python speech_to_text_2026.py long_meeting.mp3 --engine openai --long --chunk-seconds 600 --language en

# Local/offline transcription
python speech_to_text_2026.py meeting.mp3 --engine faster-whisper --model large-v3 --language en

# Local transcription with SRT subtitles
python speech_to_text_2026.py video.mp4 --engine faster-whisper --model large-v3 --srt captions.srt

# Fast hosted Whisper transcription
python speech_to_text_2026.py meeting.mp3 --engine groq --language en
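
To make the flags in these examples concrete, here is a sketch of how such a command-line interface could be declared with argparse. It is reconstructed from the commands above, so the defaults, help texts, and the build_parser() name are assumptions, and the actual speech_to_text_2026.py may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the CLI options used in the examples above (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(description="Convert speech to text.")
    # The input file is optional because --record can supply the audio instead.
    parser.add_argument("audio", nargs="?", help="path to an audio or video file")
    parser.add_argument("--engine", choices=["openai", "faster-whisper", "groq"], default="openai")
    parser.add_argument("--model", default=None, help="model name for the chosen engine")
    parser.add_argument("--language", default=None, help="language hint such as 'en'")
    parser.add_argument("--long", action="store_true", help="convert, chunk, and join a long file")
    parser.add_argument("--chunk-seconds", type=int, default=600, help="chunk length for --long")
    parser.add_argument("--srt", metavar="PATH", help="write SRT subtitles to this file")
    parser.add_argument("--record", type=int, metavar="SECONDS", help="record from the microphone")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    if args.audio is None and args.record is None:
        raise SystemExit("Provide an audio file or use --record SECONDS")
    print(args)
```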

Improving Transcription Accuracy

Use a language hint when possible, such as language="en".
Use a context prompt for product names, acronyms, people's names, and technical vocabulary.
Convert noisy audio to mono 16 kHz WAV before transcription.
Use VAD when transcribing locally to reduce silence-related hallucinations.
Use a better model for difficult audio. For Faster-Whisper, large-v3 is usually better than small.
Use diarization when you need speaker labels. For that, look at WhisperX or a diarization-capable hosted API.

Conclusion

For modern Python speech-to-text applications, you no longer need to rely on the old SpeechRecognition demo-style workflow. If you want a simple hosted API, use OpenAI's gpt-4o-transcribe or gpt-4o-mini-transcribe. If you want local and private transcription, use Faster-Whisper. If you want a very fast hosted Whisper endpoint, Groq is also a strong option.

The complete script below gives you a practical CLI that supports hosted transcription, local transcription, microphone recording, long-audio chunking, and SRT subtitle generation.

Happy Coding ♥
