Auto-Generate Video Subtitles Locally with Python and Whisper

Last updated: October 06, 2025

Overview

This guide shows how to auto-generate subtitles for videos for free using Python, entirely offline. We’ll self-host an open-source Whisper model (via faster-whisper) and use FFmpeg to handle media. You’ll get SRT (and optional WebVTT) files you can embed or burn into videos.

  • Category: AI Engineering
  • Collection: Self-Hosting AI Models & Tools
  • Tools: Python, faster-whisper (CTranslate2), FFmpeg
  • Cost: Free (models downloaded once; runs locally)

Quickstart

  1. Install FFmpeg (available via your OS package manager). Ensure ffmpeg is on PATH.
  2. Create a virtual environment (optional) and install Python packages:
    • pip install faster-whisper soundfile
      (soundfile is optional for this script, which passes file paths straight to faster-whisper; faster-whisper downloads Whisper models on first use.)
  3. Save the Minimal Working Example below as autosub.py.
  4. Run: python autosub.py input.mp4 --model small --out subtitles.srt
  5. Embed or burn subtitles (a Python wrapper for these commands is sketched after this list):
    • Soft-sub (MP4): ffmpeg -i input.mp4 -i subtitles.srt -c copy -c:s mov_text output.mp4
    • Hard-sub (burn-in): ffmpeg -i input.mp4 -vf subtitles=subtitles.srt -c:a copy output.mp4
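
If you'd rather drive these FFmpeg commands from Python (the script below already does this for audio extraction), here is a minimal sketch of the soft-sub command from step 5; the mux_soft_subs helper and the file names are placeholders, not part of autosub.py.

import subprocess


def mux_soft_subs(video: str, srt: str, out: str) -> None:
    # Copy the video/audio streams unchanged and add the SRT as a mov_text track (MP4).
    cmd = [
        "ffmpeg", "-y",
        "-i", video, "-i", srt,
        "-c", "copy", "-c:s", "mov_text",
        out,
    ]
    subprocess.run(cmd, check=True)


# Example: mux_soft_subs("input.mp4", "subtitles.srt", "output.mp4")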

Minimal Working Example (Python)

#!/usr/bin/env python3
import argparse
import os
import subprocess
import tempfile
from faster_whisper import WhisperModel


def to_srt_time(t_seconds: float) -> str:
    ms_total = int(round(t_seconds * 1000))
    h, rem = divmod(ms_total, 3600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def write_srt(segments, path: str):
    with open(path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            start = to_srt_time(seg.start)
            end = to_srt_time(seg.end)
            text = seg.text.strip()
            f.write(f"{i}\n{start} --> {end}\n{text}\n\n")


def write_vtt(segments, path: str):
    with open(path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for seg in segments:
            start = to_srt_time(seg.start).replace(",", ".")
            end = to_srt_time(seg.end).replace(",", ".")
            text = seg.text.strip()
            f.write(f"{start} --> {end}\n{text}\n\n")


def extract_audio(input_video: str, out_wav: str, sr: int = 16000):
    # Mono, 16 kHz WAV for speed and compatibility
    cmd = [
        "ffmpeg", "-y", "-i", input_video,
        "-ac", "1", "-ar", str(sr), "-f", "wav", out_wav
    ]
    subprocess.run(cmd, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)


def transcribe(
    input_video: str,
    model_name: str = "small",
    device: str = "cpu",
    compute_type: str = "int8",
    language: str | None = None,
    beam_size: int = 5,
    vad_filter: bool = True,
):
    model = WhisperModel(model_name, device=device, compute_type=compute_type)

    with tempfile.TemporaryDirectory() as tmp:
        wav_path = os.path.join(tmp, "audio.wav")
        extract_audio(input_video, wav_path)
        segments, info = model.transcribe(
            wav_path,
            language=language,
            beam_size=beam_size,
            vad_filter=vad_filter,
        )
        segs = list(segments)  # consume the lazy iterator before the temp WAV is deleted
    return segs


def main():
    p = argparse.ArgumentParser(description="Auto-generate subtitles locally with Whisper.")
    p.add_argument("input", help="Path to video/audio file (e.g., .mp4, .mkv, .mp3)")
    p.add_argument("--model", default="small", help="Whisper model: tiny/base/small/medium/large-v3")
    p.add_argument("--device", default="cpu", choices=["cpu", "cuda"], help="Use CPU or CUDA GPU")
    p.add_argument("--compute", default="int8", help="Compute type: int8/int8_float16/float16/float32")
    p.add_argument("--lang", default=None, help="Force language code (e.g., en, es). Default: auto-detect")
    p.add_argument("--out", default="subtitles.srt", help="Output subtitle file (.srt or .vtt)")
    p.add_argument("--beam", type=int, default=5, help="Beam size (accuracy/speed trade-off)")
    p.add_argument("--no-vad", action="store_true", help="Disable VAD filtering")
    args = p.parse_args()

    segments = transcribe(
        args.input,
        model_name=args.model,
        device=args.device,
        compute_type=args.compute,
        language=args.lang,
        beam_size=args.beam,
        vad_filter=not args.no_vad,
    )

    out = args.out
    if out.lower().endswith(".vtt"):
        write_vtt(segments, out)
    else:
        write_srt(segments, out)

    print(f"Wrote {out}")


if __name__ == "__main__":
    main()

Step-by-step

  1. Install dependencies
    • FFmpeg: install from your OS package manager; ensure ffmpeg is in PATH.
    • Python packages: pip install faster-whisper soundfile.
  2. Choose a model size
    • tiny/base: fastest, lower accuracy
    • small/medium: good balance
    • large-v3: best accuracy, slowest/heaviest
  3. Run transcription
    • CPU example: python autosub.py input.mp4 --model small --device cpu --compute int8
    • GPU example: python autosub.py input.mp4 --model medium --device cuda --compute float16
  4. Choose subtitle format
    • Default SRT (.srt) or pass --out subtitles.vtt for WebVTT.
  5. Embed or burn subtitles with FFmpeg as needed (see Quickstart step 5).

Accuracy tips

  • Pick the smallest model that meets your accuracy needs; upgrade model size if key content is mis-transcribed.
  • Provide language with --lang if you already know it (skips language detection).
  • Use --beam 5 to --beam 8 for a small accuracy boost; higher beams slow down decoding.
  • Keep audio clean: reduce background noise; prefer 16 kHz mono.
  • Leave VAD filtering on (the default; don't pass --no-vad); it helps avoid hallucinated text during silences.

Performance notes

  • Device and precision
    • CPU: --compute int8 is typically fastest with minimal accuracy loss.
    • GPU: --compute float16 on CUDA balances speed/accuracy well (a model-loading sketch follows this section).
  • Throughput knobs
    • Smaller models are much faster; start with small.
    • Reduce --beam for speed (e.g., 1–3) when accuracy is “good enough.”
    • Use 16 kHz mono input to minimize decode cost.
  • Memory
    • Large models require significant RAM/VRAM. If you hit OOM, use a smaller model or lower precision.

Example guidance (approximate):

  • tiny/base: low VRAM/RAM; realtime or faster on modern CPUs
  • small: still CPU-friendly; fast on entry GPUs
  • medium: GPU recommended; slower on CPU
  • large-v3: high VRAM; best accuracy, slowest
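
To make the device/precision guidance above concrete, here is a minimal sketch of model loading with a CPU fallback. The load_model helper and the broad exception handling are illustrative only; autosub.py simply passes --device and --compute through to WhisperModel.

from faster_whisper import WhisperModel


def load_model(name: str = "small", prefer_gpu: bool = True) -> WhisperModel:
    if prefer_gpu:
        try:
            # GPU: float16 usually balances speed and accuracy well
            return WhisperModel(name, device="cuda", compute_type="float16")
        except Exception:
            # No usable CUDA device (or out of memory): fall back to CPU.
            # Note: some failures may only surface once transcription starts.
            pass
    # CPU: int8 is typically the fastest option with minimal accuracy loss
    return WhisperModel(name, device="cpu", compute_type="int8")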

Pitfalls and how to avoid them

  • Long videos: Process the full file; faster-whisper streams internally. If RAM-constrained, split the video into chunks with FFmpeg and merge the SRTs (see the sketch after this list).
  • Misaligned timestamps: Extract audio at a single fixed sample rate (the script forces 16 kHz mono WAV) and avoid resampling more than once.
  • Language/punctuation issues: Force --lang to avoid wrong-language detection; bigger models handle punctuation better.
  • Speaker labels: Whisper doesn’t do diarization. For speaker tags, post-process with a diarization tool and map segments.
  • Embedded audio quirks: Some containers use VFR or odd codecs. Always re-extract audio as WAV (as shown) to standardize.
  • Licensing/privacy: Running locally avoids data egress. Ensure you have rights to transcribe the content.
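
For the long-video pitfall above, here is a minimal sketch of a chunked workflow: cut fixed-length chunks with FFmpeg, transcribe each, shift the timestamps by the chunk's start offset, and write one merged SRT. It reuses write_srt from autosub.py; the chunk length, file names, and total_s are placeholders, and stream-copied cuts land on keyframes, so chunk boundaries are only approximate.

import subprocess
from types import SimpleNamespace

from faster_whisper import WhisperModel
from autosub import write_srt


def cut_chunk(video: str, start_s: float, length_s: float, out_path: str) -> None:
    # Stream-copy a chunk starting at start_s (seeks to the nearest keyframe).
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start_s), "-t", str(length_s),
         "-i", video, "-c", "copy", out_path],
        check=True,
    )


def transcribe_in_chunks(video: str, total_s: float, chunk_s: float = 600.0):
    model = WhisperModel("small", device="cpu", compute_type="int8")
    merged, start, i = [], 0.0, 0
    while start < total_s:
        chunk_path = f"chunk_{i}.mp4"
        cut_chunk(video, start, chunk_s, chunk_path)
        segments, _ = model.transcribe(chunk_path, vad_filter=True)
        for seg in segments:
            # Shift each segment by the chunk's offset within the full video.
            merged.append(SimpleNamespace(
                start=seg.start + start, end=seg.end + start, text=seg.text))
        start += chunk_s
        i += 1
    return merged


# Example: write_srt(transcribe_in_chunks("input.mp4", total_s=3600.0), "subtitles.srt")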

Customization

  • Word timestamps: faster-whisper can return word-level timings; if your installed version supports them, iterate segment.words and format accordingly (see the sketch after this list).
  • Chunked workflows: For batch jobs, run multiple processes over a folder and join subtitle files.
  • Post-editing: Open the SRT in a subtitle editor to quickly correct errors.
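
For the word-timestamp customization above, here is a minimal sketch, assuming your installed faster-whisper version supports the word_timestamps option; the audio path is a placeholder.

from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.wav", word_timestamps=True)
for seg in segments:
    for word in seg.words or []:
        # Each word carries its own start/end time in seconds.
        print(f"{word.start:.2f}-{word.end:.2f}\t{word.word}")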

Tiny FAQ

  • Can I run fully offline? Yes. After the first model download, everything runs locally.

  • Do I need a GPU? No. CPU with --compute int8 works. A GPU speeds up medium/large models.

  • Which model should I start with? Try small. If accuracy is insufficient, move to medium or large-v3.

  • How do I get WebVTT instead of SRT? Use --out subtitles.vtt. The script writes VTT when the filename ends with .vtt.

  • How do I burn subtitles into the video? Use FFmpeg’s subtitles filter: -vf subtitles=subtitles.srt for hard subs, or -c:s mov_text for soft subs in MP4.

What you built

A self-hosted, free Python toolchain that extracts audio with FFmpeg, transcribes with an open Whisper model (faster-whisper), and outputs clean SRT/WebVTT subtitles ready for editing or embedding.

Series: Self-Hosting AI Models & Tools
