Overview
This guide shows how to auto-generate video subtitles for free using Python, entirely offline. We’ll self-host an open-source Whisper model (via faster-whisper) and use FFmpeg to handle the media. You’ll get SRT (and optionally WebVTT) files you can embed or burn into videos.
- Category: AI Engineering
- Collection: Self-Hosting AI Models & Tools
- Tools: Python, faster-whisper (CTranslate2), FFmpeg
- Cost: Free (models downloaded once; runs locally)
 
Quickstart
1. Install FFmpeg (available via your OS package manager) and ensure ffmpeg is on PATH.
2. Create a virtual environment (optional) and install the Python packages: pip install faster-whisper soundfile (soundfile enables WAV I/O; faster-whisper downloads Whisper models as needed).
3. Save the Minimal Working Example below as autosub.py.
4. Run: python autosub.py input.mp4 --model small --out subtitles.srt
5. Embed or burn subtitles:
   - Soft-sub (MP4): ffmpeg -i input.mp4 -i subtitles.srt -c copy -c:s mov_text output.mp4
   - Hard-sub (burn-in): ffmpeg -i input.mp4 -vf subtitles=subtitles.srt -c:a copy output.mp4

Minimal Working Example (Python)
#!/usr/bin/env python3
import argparse
import os
import subprocess
import tempfile
from faster_whisper import WhisperModel

def to_srt_time(t_seconds: float) -> str:
    ms_total = int(round(t_seconds * 1000))
    h, rem = divmod(ms_total, 3600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(segments, path: str):
    with open(path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            start = to_srt_time(seg.start)
            end = to_srt_time(seg.end)
            text = seg.text.strip()
            f.write(f"{i}\n{start} --> {end}\n{text}\n\n")

def write_vtt(segments, path: str):
    with open(path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for seg in segments:
            start = to_srt_time(seg.start).replace(",", ".")
            end = to_srt_time(seg.end).replace(",", ".")
            text = seg.text.strip()
            f.write(f"{start} --> {end}\n{text}\n\n")

def extract_audio(input_video: str, out_wav: str, sr: int = 16000):
    # Mono, 16 kHz WAV for speed and compatibility
    cmd = [
        "ffmpeg", "-y", "-i", input_video,
        "-ac", "1", "-ar", str(sr), "-f", "wav", out_wav
    ]
    subprocess.run(cmd, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

def transcribe(
    input_video: str,
    model_name: str = "small",
    device: str = "cpu",
    compute_type: str = "int8",
    language: str | None = None,
    beam_size: int = 5,
    vad_filter: bool = True,
):
    model = WhisperModel(model_name, device=device, compute_type=compute_type)
    with tempfile.TemporaryDirectory() as tmp:
        wav_path = os.path.join(tmp, "audio.wav")
        extract_audio(input_video, wav_path)
        segments, info = model.transcribe(
            wav_path,
            language=language,
            beam_size=beam_size,
            vad_filter=vad_filter,
        )
        segs = list(segments)  # materialize iterator
    return segs

def main():
    p = argparse.ArgumentParser(description="Auto-generate subtitles locally with Whisper.")
    p.add_argument("input", help="Path to video/audio file (e.g., .mp4, .mkv, .mp3)")
    p.add_argument("--model", default="small", help="Whisper model: tiny/base/small/medium/large-v3")
    p.add_argument("--device", default="cpu", choices=["cpu", "cuda"], help="Use CPU or CUDA GPU")
    p.add_argument("--compute", default="int8", help="Compute type: int8/int8_float16/float16/float32")
    p.add_argument("--lang", default=None, help="Force language code (e.g., en, es). Default: auto-detect")
    p.add_argument("--out", default="subtitles.srt", help="Output subtitle file (.srt or .vtt)")
    p.add_argument("--beam", type=int, default=5, help="Beam size (accuracy/speed trade-off)")
    p.add_argument("--no-vad", action="store_true", help="Disable VAD filtering")
    args = p.parse_args()
    segments = transcribe(
        args.input,
        model_name=args.model,
        device=args.device,
        compute_type=args.compute,
        language=args.lang,
        beam_size=args.beam,
        vad_filter=not args.no_vad,
    )
    out = args.out
    if out.lower().endswith(".vtt"):
        write_vtt(segments, out)
    else:
        write_srt(segments, out)
    print(f"Wrote {out}")

if __name__ == "__main__":
    main()
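For reference, write_srt emits standard SRT cues: a numeric index, a start --> end timestamp line (milliseconds separated by a comma), the text, and a blank line. The timestamps and text below are purely illustrative:

1
00:00:00,000 --> 00:00:03,240
First transcribed sentence goes here.

2
00:00:03,240 --> 00:00:06,800
Second transcribed sentence goes here.
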
Step-by-step
1. Install dependencies
   - FFmpeg: install from your OS package manager; ensure ffmpeg is in PATH.
   - Python packages: pip install faster-whisper soundfile.
2. Choose a model size
   - tiny/base: fastest, lower accuracy
   - small/medium: good balance
   - large-v3: best accuracy, slowest/heaviest
3. Run transcription
   - CPU example: python autosub.py input.mp4 --model small --device cpu --compute int8
   - GPU example: python autosub.py input.mp4 --model medium --device cuda --compute float16
4. Choose the subtitle format
   - Default is SRT (.srt); pass --out subtitles.vtt for WebVTT.
5. Embed or burn the subtitles with FFmpeg as needed (see Quickstart step 5).

Accuracy tips
- Pick the smallest model that meets your accuracy needs; upgrade model size if key content is mis-transcribed.
- Provide the language with --lang if you already know it (skips language detection).
- Use --beam 5 to --beam 8 for a small accuracy boost; higher beam sizes slow down decoding.
- Keep audio clean: reduce background noise; prefer 16 kHz mono.
- Leave VAD filtering on (the default); it helps avoid hallucinations during silences. Only pass --no-vad if the filter is clipping quiet speech. Several of these knobs can also be set directly in code, as sketched below.
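A minimal sketch of such a call, assuming your installed faster-whisper release supports the initial_prompt and vad_parameters arguments to transcribe (both exist in current releases); the file name and prompt terms are placeholders:

# Accuracy-oriented transcribe call (sketch; not part of autosub.py).
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe(
    "audio.wav",                                       # placeholder input
    language="en",                                     # skip auto-detection when known
    beam_size=8,                                       # small accuracy boost, slower decoding
    initial_prompt="FFmpeg, WebVTT, faster-whisper",   # bias spelling of domain terms
    vad_filter=True,
    vad_parameters={"min_silence_duration_ms": 500},   # tune how silences are handled
)
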
Performance notes
- Device and precision
  - CPU: --compute int8 is typically fastest with minimal accuracy loss.
  - GPU: --compute float16 on CUDA balances speed and accuracy well.
- Throughput knobs
  - Smaller models are much faster; start with small.
  - Reduce --beam (e.g., 1–3) for speed when accuracy is “good enough.”
  - Use 16 kHz mono input to minimize decode cost.
- Memory
  - Large models require significant RAM/VRAM. If you hit out-of-memory errors, use a smaller model or lower precision.

Example guidance (approximate; a quick timing check follows the list):
- tiny/base: low VRAM/RAM; real-time or faster on modern CPUs
- small: still CPU-friendly; fast on entry-level GPUs
- medium: GPU recommended; slower on CPU
- large-v3: high VRAM; best accuracy, slowest
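To see where your own hardware lands, the short harness below (a sketch; it assumes autosub.py sits in the same directory so its transcribe function can be imported) reports an approximate real-time factor:

# Rough throughput check (sketch). Reuses transcribe() from autosub.py.
import time
from autosub import transcribe

t0 = time.perf_counter()
segments = transcribe("input.mp4", model_name="small", device="cpu", compute_type="int8")
elapsed = time.perf_counter() - t0

if segments and elapsed > 0:
    audio_seconds = segments[-1].end  # end of last segment approximates audio length
    print(f"~{audio_seconds:.0f}s of audio in {elapsed:.1f}s "
          f"(real-time factor {audio_seconds / elapsed:.1f}x)")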
 
Pitfalls and how to avoid them
- Long videos: Process the full file; faster-whisper streams segments as it decodes. If you are RAM-constrained, split the video into chunks with FFmpeg and merge the SRTs (see the sketch after this list).
- Misaligned timestamps: Extract the audio consistently (the script forces 16 kHz mono WAV). Don’t resample multiple times.
- Language/punctuation issues: Force --lang to avoid wrong-language detection; bigger models handle punctuation better.
- Speaker labels: Whisper doesn’t do diarization. For speaker tags, post-process with a diarization tool and map its speaker turns onto the segments.
- Embedded audio quirks: Some containers use variable frame rates or unusual codecs. Always re-extract the audio as WAV (as shown) to standardize.
- Licensing/privacy: Running locally avoids data egress. Ensure you have the rights to transcribe the content.
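If you do need to chunk, the sketch below splits the audio into fixed-length 16 kHz WAV chunks with FFmpeg, transcribes each one, and shifts the timestamps back to absolute video time. The chunk length, helper names, and the use of ffprobe are illustrative assumptions, not part of autosub.py; pass the merged list to write_srt from the Minimal Working Example to get a single SRT.

# chunked_transcribe.py -- illustrative sketch, not part of autosub.py.
# Assumes ffmpeg/ffprobe are on PATH and that 15-minute chunks fit in RAM.
import os
import subprocess
import tempfile
from types import SimpleNamespace

from faster_whisper import WhisperModel

CHUNK_SECONDS = 15 * 60  # adjust to your memory budget

def probe_duration(path: str) -> float:
    # ffprobe (ships with FFmpeg) reports the media duration in seconds.
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        check=True, capture_output=True, text=True,
    )
    return float(out.stdout.strip())

def extract_chunk(src: str, start: float, length: float, out_wav: str):
    # Same 16 kHz mono WAV normalization as the main script, one slice at a time.
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start), "-t", str(length), "-i", src,
         "-ac", "1", "-ar", "16000", "-f", "wav", out_wav],
        check=True, capture_output=True,
    )

def transcribe_chunked(input_video: str, model_name: str = "small"):
    model = WhisperModel(model_name, device="cpu", compute_type="int8")
    duration = probe_duration(input_video)
    merged = []
    with tempfile.TemporaryDirectory() as tmp:
        start = 0.0
        while start < duration:
            wav = os.path.join(tmp, f"chunk_{int(start)}.wav")
            extract_chunk(input_video, start, CHUNK_SECONDS, wav)
            segments, _ = model.transcribe(wav, vad_filter=True)
            for seg in segments:
                # Shift chunk-local timestamps to absolute video time.
                merged.append(SimpleNamespace(
                    start=seg.start + start, end=seg.end + start, text=seg.text))
            start += CHUNK_SECONDS
    return merged

Fixed-interval cuts can split a word at a boundary; cutting at silences (for example, at positions found with FFmpeg’s silencedetect filter) gives cleaner joins.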
 
Customization
- Word timestamps: faster-whisper can return word-level timings; enable them and iterate segment.words, then format the output accordingly (see the sketch after this list).
- Chunked workflows: For batch jobs, run multiple processes over a folder and join the resulting subtitle files.
- Post-editing: Open the SRT in a subtitle editor to quickly correct remaining errors.
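A minimal word-level sketch, assuming your installed faster-whisper version accepts word_timestamps=True and that autosub.py (for the to_srt_time helper) is importable from the current directory; it writes one SRT cue per word:

# One-cue-per-word SRT (sketch). Assumes word_timestamps=True is supported
# and that autosub.py sits alongside this file.
from faster_whisper import WhisperModel
from autosub import to_srt_time

model = WhisperModel("small", device="cpu", compute_type="int8")
segments, _ = model.transcribe("audio.wav", word_timestamps=True, vad_filter=True)

with open("words.srt", "w", encoding="utf-8") as f:
    index = 1
    for seg in segments:
        for word in seg.words or []:      # words is empty if timestamps are off
            start = to_srt_time(word.start)
            end = to_srt_time(word.end)
            f.write(f"{index}\n{start} --> {end}\n{word.word.strip()}\n\n")
            index += 1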
 
Tiny FAQ
Can I run fully offline? Yes. After the first model download, everything runs locally.
Do I need a GPU? No. CPU with --compute int8 works. A GPU speeds up the medium/large models.
Which model should I start with? Try small. If accuracy is insufficient, move to medium or large-v3.
How do I get WebVTT instead of SRT? Use --out subtitles.vtt. The script writes VTT when the filename ends with .vtt.
How do I burn subtitles into the video? For hard subs, use FFmpeg’s subtitles filter (-vf subtitles=subtitles.srt); for soft subs in an MP4, mux the SRT with -c:s mov_text instead.
What you built
A self-hosted, free Python toolchain that extracts audio with FFmpeg, transcribes with an open Whisper model (faster-whisper), and outputs clean SRT/WebVTT subtitles ready for editing or embedding.