Overview
This guide shows how to auto-generate video subtitles for free using Python, entirely offline. We’ll self-host an open-source Whisper model (via faster-whisper) and use FFmpeg to handle the media. You’ll get SRT (and optionally WebVTT) files you can embed or burn into videos.
- Category: AI Engineering
- Collection: Self-Hosting AI Models & Tools
- Tools: Python, faster-whisper (CTranslate2), FFmpeg
- Cost: Free (models downloaded once; runs locally)
 
Quickstart
1. Install FFmpeg (available via your OS package manager) and ensure ffmpeg is on PATH.
2. Create a virtual environment (optional) and install the Python packages: pip install faster-whisper soundfile (soundfile enables WAV I/O; faster-whisper downloads Whisper models as needed).
3. Save the Minimal Working Example below as autosub.py.
4. Run: python autosub.py input.mp4 --model small --out subtitles.srt
5. Embed or burn subtitles:
   - Soft-sub (MP4): ffmpeg -i input.mp4 -i subtitles.srt -c copy -c:s mov_text output.mp4
   - Hard-sub (burn-in): ffmpeg -i input.mp4 -vf subtitles=subtitles.srt -c:a copy output.mp4

Minimal Working Example (Python)
#!/usr/bin/env python3
import argparse
import os
import subprocess
import tempfile
from faster_whisper import WhisperModel

def to_srt_time(t_seconds: float) -> str:
    ms_total = int(round(t_seconds * 1000))
    h, rem = divmod(ms_total, 3600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(segments, path: str):
    with open(path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            start = to_srt_time(seg.start)
            end = to_srt_time(seg.end)
            text = seg.text.strip()
            f.write(f"{i}\n{start} --> {end}\n{text}\n\n")

def write_vtt(segments, path: str):
    with open(path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for seg in segments:
            start = to_srt_time(seg.start).replace(",", ".")
            end = to_srt_time(seg.end).replace(",", ".")
            text = seg.text.strip()
            f.write(f"{start} --> {end}\n{text}\n\n")

def extract_audio(input_video: str, out_wav: str, sr: int = 16000):
    # Mono, 16 kHz WAV for speed and compatibility
    cmd = [
        "ffmpeg", "-y", "-i", input_video,
        "-ac", "1", "-ar", str(sr), "-f", "wav", out_wav
    ]
    subprocess.run(cmd, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

def transcribe(
    input_video: str,
    model_name: str = "small",
    device: str = "cpu",
    compute_type: str = "int8",
    language: str | None = None,
    beam_size: int = 5,
    vad_filter: bool = True,
):
    model = WhisperModel(model_name, device=device, compute_type=compute_type)
    with tempfile.TemporaryDirectory() as tmp:
        wav_path = os.path.join(tmp, "audio.wav")
        extract_audio(input_video, wav_path)
        segments, info = model.transcribe(
            wav_path,
            language=language,
            beam_size=beam_size,
            vad_filter=vad_filter,
        )
        segs = list(segments)  # materialize iterator
    return segs

def main():
    p = argparse.ArgumentParser(description="Auto-generate subtitles locally with Whisper.")
    p.add_argument("input", help="Path to video/audio file (e.g., .mp4, .mkv, .mp3)")
    p.add_argument("--model", default="small", help="Whisper model: tiny/base/small/medium/large-v3")
    p.add_argument("--device", default="cpu", choices=["cpu", "cuda"], help="Use CPU or CUDA GPU")
    p.add_argument("--compute", default="int8", help="Compute type: int8/int8_float16/float16/float32")
    p.add_argument("--lang", default=None, help="Force language code (e.g., en, es). Default: auto-detect")
    p.add_argument("--out", default="subtitles.srt", help="Output subtitle file (.srt or .vtt)")
    p.add_argument("--beam", type=int, default=5, help="Beam size (accuracy/speed trade-off)")
    p.add_argument("--no-vad", action="store_true", help="Disable VAD filtering")
    args = p.parse_args()
    segments = transcribe(
        args.input,
        model_name=args.model,
        device=args.device,
        compute_type=args.compute,
        language=args.lang,
        beam_size=args.beam,
        vad_filter=not args.no_vad,
    )
    out = args.out
    if out.lower().endswith(".vtt"):
        write_vtt(segments, out)
    else:
        write_srt(segments, out)
    print(f"Wrote {out}")

if __name__ == "__main__":
    main()
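For reference, write_srt emits standard SRT cues: a numeric index, a start --> end timestamp line (milliseconds separated by a comma), the text, and a blank line. The timestamps and text below are purely illustrative:

1
00:00:00,000 --> 00:00:03,240
First transcribed sentence goes here.

2
00:00:03,240 --> 00:00:06,800
Second transcribed sentence goes here.
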
Step-by-step
1. Install dependencies
   - FFmpeg: install from your OS package manager; ensure ffmpeg is in PATH.
   - Python packages: pip install faster-whisper soundfile.
2. Choose a model size
   - tiny/base: fastest, lower accuracy
   - small/medium: good balance
   - large-v3: best accuracy, slowest/heaviest
3. Run transcription
   - CPU example: python autosub.py input.mp4 --model small --device cpu --compute int8
   - GPU example: python autosub.py input.mp4 --model medium --device cuda --compute float16
4. Choose the subtitle format
   - Default is SRT (.srt); pass --out subtitles.vtt for WebVTT.
5. Embed or burn the subtitles with FFmpeg as needed (see Quickstart step 5).

Accuracy tips
- Pick the smallest model that meets your accuracy needs; upgrade model size if key content is mis-transcribed.
- Provide the language with --lang if you already know it (skips language detection).
- Use --beam 5 to --beam 8 for a small accuracy boost; higher beam sizes slow down decoding.
- Keep audio clean: reduce background noise; prefer 16 kHz mono.
- Leave VAD filtering on (the default); it helps avoid hallucinations during silences. Only pass --no-vad if the filter is clipping quiet speech. Several of these knobs can also be set directly in code, as sketched below.
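A minimal sketch of such a call, assuming your installed faster-whisper release supports the initial_prompt and vad_parameters arguments to transcribe (both exist in current releases); the file name and prompt terms are placeholders:

# Accuracy-oriented transcribe call (sketch; not part of autosub.py).
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe(
    "audio.wav",                                       # placeholder input
    language="en",                                     # skip auto-detection when known
    beam_size=8,                                       # small accuracy boost, slower decoding
    initial_prompt="FFmpeg, WebVTT, faster-whisper",   # bias spelling of domain terms
    vad_filter=True,
    vad_parameters={"min_silence_duration_ms": 500},   # tune how silences are handled
)
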
Performance notes
- Device and precision
  - CPU: --compute int8 is typically fastest with minimal accuracy loss.
  - GPU: --compute float16 on CUDA balances speed and accuracy well.
- Throughput knobs
  - Smaller models are much faster; start with small.
  - Reduce --beam (e.g., 1–3) for speed when accuracy is “good enough.”
  - Use 16 kHz mono input to minimize decode cost.
- Memory
  - Large models require significant RAM/VRAM. If you hit out-of-memory errors, use a smaller model or lower precision.

Example guidance (approximate; a quick timing check follows the list):
- tiny/base: low VRAM/RAM; real-time or faster on modern CPUs
- small: still CPU-friendly; fast on entry-level GPUs
- medium: GPU recommended; slower on CPU
- large-v3: high VRAM; best accuracy, slowest
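To see where your own hardware lands, the short harness below (a sketch; it assumes autosub.py sits in the same directory so its transcribe function can be imported) reports an approximate real-time factor:

# Rough throughput check (sketch). Reuses transcribe() from autosub.py.
import time
from autosub import transcribe

t0 = time.perf_counter()
segments = transcribe("input.mp4", model_name="small", device="cpu", compute_type="int8")
elapsed = time.perf_counter() - t0

if segments and elapsed > 0:
    audio_seconds = segments[-1].end  # end of last segment approximates audio length
    print(f"~{audio_seconds:.0f}s of audio in {elapsed:.1f}s "
          f"(real-time factor {audio_seconds / elapsed:.1f}x)")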
 
Pitfalls and how to avoid them
- Long videos: Process the full file; faster-whisper streams segments as it decodes. If you are RAM-constrained, split the video into chunks with FFmpeg and merge the SRTs (see the sketch after this list).
- Misaligned timestamps: Extract the audio consistently (the script forces 16 kHz mono WAV). Don’t resample multiple times.
- Language/punctuation issues: Force --lang to avoid wrong-language detection; bigger models handle punctuation better.
- Speaker labels: Whisper doesn’t do diarization. For speaker tags, post-process with a diarization tool and map its speaker turns onto the segments.
- Embedded audio quirks: Some containers use variable frame rates or unusual codecs. Always re-extract the audio as WAV (as shown) to standardize.
- Licensing/privacy: Running locally avoids data egress. Ensure you have the rights to transcribe the content.
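If you do need to chunk, the sketch below splits the audio into fixed-length 16 kHz WAV chunks with FFmpeg, transcribes each one, and shifts the timestamps back to absolute video time. The chunk length, helper names, and the use of ffprobe are illustrative assumptions, not part of autosub.py; pass the merged list to write_srt from the Minimal Working Example to get a single SRT.

# chunked_transcribe.py -- illustrative sketch, not part of autosub.py.
# Assumes ffmpeg/ffprobe are on PATH and that 15-minute chunks fit in RAM.
import os
import subprocess
import tempfile
from types import SimpleNamespace

from faster_whisper import WhisperModel

CHUNK_SECONDS = 15 * 60  # adjust to your memory budget

def probe_duration(path: str) -> float:
    # ffprobe (ships with FFmpeg) reports the media duration in seconds.
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        check=True, capture_output=True, text=True,
    )
    return float(out.stdout.strip())

def extract_chunk(src: str, start: float, length: float, out_wav: str):
    # Same 16 kHz mono WAV normalization as the main script, one slice at a time.
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start), "-t", str(length), "-i", src,
         "-ac", "1", "-ar", "16000", "-f", "wav", out_wav],
        check=True, capture_output=True,
    )

def transcribe_chunked(input_video: str, model_name: str = "small"):
    model = WhisperModel(model_name, device="cpu", compute_type="int8")
    duration = probe_duration(input_video)
    merged = []
    with tempfile.TemporaryDirectory() as tmp:
        start = 0.0
        while start < duration:
            wav = os.path.join(tmp, f"chunk_{int(start)}.wav")
            extract_chunk(input_video, start, CHUNK_SECONDS, wav)
            segments, _ = model.transcribe(wav, vad_filter=True)
            for seg in segments:
                # Shift chunk-local timestamps to absolute video time.
                merged.append(SimpleNamespace(
                    start=seg.start + start, end=seg.end + start, text=seg.text))
            start += CHUNK_SECONDS
    return merged

Fixed-interval cuts can split a word at a boundary; cutting at silences (for example, at positions found with FFmpeg’s silencedetect filter) gives cleaner joins.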
 
Customization
- Word timestamps: faster-whisper can return word-level timings; enable them and iterate segment.words, then format the output accordingly (see the sketch after this list).
- Chunked workflows: For batch jobs, run multiple processes over a folder and join the resulting subtitle files.
- Post-editing: Open the SRT in a subtitle editor to quickly correct remaining errors.
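A minimal word-level sketch, assuming your installed faster-whisper version accepts word_timestamps=True and that autosub.py (for the to_srt_time helper) is importable from the current directory; it writes one SRT cue per word:

# One-cue-per-word SRT (sketch). Assumes word_timestamps=True is supported
# and that autosub.py sits alongside this file.
from faster_whisper import WhisperModel
from autosub import to_srt_time

model = WhisperModel("small", device="cpu", compute_type="int8")
segments, _ = model.transcribe("audio.wav", word_timestamps=True, vad_filter=True)

with open("words.srt", "w", encoding="utf-8") as f:
    index = 1
    for seg in segments:
        for word in seg.words or []:      # words is empty if timestamps are off
            start = to_srt_time(word.start)
            end = to_srt_time(word.end)
            f.write(f"{index}\n{start} --> {end}\n{word.word.strip()}\n\n")
            index += 1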
 
Tiny FAQ
Can I run fully offline? Yes. After the first model download, everything runs locally.
Do I need a GPU? No. CPU with --compute int8 works. A GPU speeds up the medium/large models.
Which model should I start with? Try small. If accuracy is insufficient, move to medium or large-v3.
How do I get WebVTT instead of SRT? Use --out subtitles.vtt. The script writes VTT when the filename ends with .vtt.
How do I burn subtitles into the video? For hard subs, use FFmpeg’s subtitles filter (-vf subtitles=subtitles.srt); for soft subs in an MP4, mux the SRT with -c:s mov_text instead.
What you built
A self-hosted, free Python toolchain that extracts audio with FFmpeg, transcribes with an open Whisper model (faster-whisper), and outputs clean SRT/WebVTT subtitles ready for editing or embedding.