Remove Duplicate Files with Python: Size and Content Hashing

Last updated: October 07, 2025

Overview

Automate boring tasks with Python: scan a directory, find duplicate files, and optionally remove them safely. This guide shows a practical, fast approach using file size and cryptographic hashes to avoid false positives.

Quickstart

  • You’ll run a single Python script that:
    • Groups files by size (cheap prefilter)
    • Hashes only the first N bytes (fast pre-check)
    • Confirms duplicates with a full content hash
    • Prints a dry-run report by default; deletes only with --delete

Minimal working example

Save as dedupe.py and run with Python 3.8+.

#!/usr/bin/env python3
import argparse
import hashlib
import os
from pathlib import Path
from typing import Dict, List, Iterable

CHUNK_SIZE = 1 << 20  # 1 MiB

def iter_files(root: Path) -> Iterable[Path]:
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            p = Path(dirpath) / name
            # Skip symlinks, broken links, and other non-regular files
            try:
                if p.is_symlink() or not p.is_file():
                    continue
            except OSError:
                continue
            yield p

def sha256_first_n(path: Path, n: int) -> str:
    h = hashlib.sha256()
    try:
        with path.open('rb') as f:
            h.update(f.read(n))
    except OSError:
        return ""
    return h.hexdigest()

def sha256_full(path: Path, chunk_size: int = CHUNK_SIZE) -> str:
    h = hashlib.sha256()
    try:
        with path.open('rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
    except OSError:
        return ""
    return h.hexdigest()

def find_duplicates(root: Path, min_size: int, first_bytes: int) -> List[List[Path]]:
    by_size: Dict[int, List[Path]] = {}
    for p in iter_files(root):
        try:
            size = p.stat().st_size
        except OSError:
            continue
        if size < min_size:
            continue
        by_size.setdefault(size, []).append(p)

    dup_groups: List[List[Path]] = []

    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        # Phase 1: partial hash (skipped entirely when first_bytes == 0)
        if first_bytes > 0:
            by_partial: Dict[str, List[Path]] = {}
            for p in paths:
                h = sha256_first_n(p, first_bytes)
                if not h:
                    continue
                by_partial.setdefault(h, []).append(p)
            candidate_groups = list(by_partial.values())
        else:
            candidate_groups = [paths]

        for plist in candidate_groups:
            if len(plist) < 2:
                continue
            # Phase 2: full hash
            by_full: Dict[str, List[Path]] = {}
            for p in plist:
                h = sha256_full(p)
                if not h:
                    continue
                by_full.setdefault(h, []).append(p)

            for glist in by_full.values():
                if len(glist) > 1:
                    dup_groups.append(glist)

    return dup_groups

def main():
    ap = argparse.ArgumentParser(description="Find and remove duplicate files by size and content hash.")
    ap.add_argument("path", type=Path, help="Root directory to scan")
    ap.add_argument("--delete", action="store_true", help="Actually delete duplicates (default is dry-run)")
    ap.add_argument("--min-size", type=int, default=1, help="Ignore files smaller than this many bytes (default: 1)")
    ap.add_argument("--first-bytes", type=int, default=1_048_576, help="Bytes to hash for the fast pre-check (default: 1MiB; 0=skip)")
    args = ap.parse_args()

    root = args.path.resolve()
    if not root.exists() or not root.is_dir():
        ap.error(f"Not a directory: {root}")

    groups = find_duplicates(root, min_size=args.min_size, first_bytes=args.first_bytes)

    total_would_delete = 0
    bytes_reclaimed = 0

    for group in groups:
        group = sorted(group)  # deterministic: keep the lexicographically first
        keeper = group[0]
        dups = group[1:]
        print(f"KEEP: {keeper}")
        for p in dups:
            try:
                size = p.stat().st_size
            except OSError:
                size = 0
            if args.delete:
                try:
                    p.unlink()
                    print(f"DEL : {p}")
                    bytes_reclaimed += size
                    total_would_delete += 1
                except OSError as e:
                    print(f"ERR : {p} ({e})")
            else:
                print(f"DUPE: {p}")
                bytes_reclaimed += size
                total_would_delete += 1
        print()

    mode = "Deleted" if args.delete else "Would delete"
    print(f"{mode} {total_would_delete} files; bytes reclaimed: {bytes_reclaimed}")
    if not args.delete:
        print("Dry-run only. Re-run with --delete to remove duplicates.")

if __name__ == "__main__":
    main()

Example usage:

python dedupe.py /path/to/scan            # dry-run
python dedupe.py /path/to/scan --delete   # actually delete duplicates
python dedupe.py /path/to/scan --min-size 4096

How it works (numbered steps)

  1. Walk the directory tree and collect regular files.
  2. Group files by size. Unique sizes cannot be duplicates.
  3. For size groups with 2+ files, compute a fast partial hash (first N bytes).
  4. For partial-hash groups with 2+ files, compute a full content hash.
  5. Files in the same size and full-hash group are byte-identical. Keep one; delete the rest (only if --delete is set).
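If you want to sanity-check this pipeline before pointing it at real data, here is a minimal sketch that assumes dedupe.py sits in the current directory so find_duplicates can be imported. It creates two identical files plus one unique file in a temporary directory and prints the detected group:

# Minimal sanity check, assuming dedupe.py is importable from the current directory.
import tempfile
from pathlib import Path

from dedupe import find_duplicates

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_bytes(b"same content")
    (root / "b.txt").write_bytes(b"same content")    # duplicate of a.txt
    (root / "c.txt").write_bytes(b"different data")  # unique file
    for group in find_duplicates(root, min_size=1, first_bytes=1_048_576):
        print(sorted(p.name for p in group))         # expect ['a.txt', 'b.txt']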

CLI options

Option               Default    Meaning
--delete             off        If set, delete duplicates; otherwise the dry-run prints intended actions.
--min-size BYTES     1          Skip files smaller than this (commonly 1 or 4096).
--first-bytes BYTES  1048576    Bytes hashed in the fast pre-check (0 to skip it and go straight to the full hash).

Pitfalls and safety

  • Dry-run first. Confirm groups before deleting.
  • Symlinks and special files are skipped; only regular files are processed.
  • Hard links: two paths can point to the same inode. Deleting one path keeps the data reachable via the other, but the script still treats them as duplicates. To avoid touching hard-linked files, filter on os.stat().st_nlink (see the sketch after this list).
  • Permissions: some files may be unreadable or undeletable; the script logs errors and continues.
  • Case sensitivity: Windows and macOS filesystems are case-insensitive by default, so two paths that differ only by case usually refer to the same file; treat such "duplicates" with extra care.
  • Backups: deletions are permanent. Use version control or snapshots if available.
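One possible hard-link filter is sketched below. It assumes a POSIX-style filesystem where the pair (st_dev, st_ino) identifies the underlying file; the helper name iter_unique_inodes is illustrative and not part of the script above. It yields at most one path per inode, so hard-linked copies never enter the duplicate scan.

# Sketch: collapse hard links so at most one path per inode is scanned.
# Assumes (st_dev, st_ino) identifies the underlying file (POSIX-style).
from pathlib import Path
from typing import Iterable, Set, Tuple

def iter_unique_inodes(paths: Iterable[Path]) -> Iterable[Path]:
    seen: Set[Tuple[int, int]] = set()
    for p in paths:
        try:
            st = p.stat()
        except OSError:
            continue
        if st.st_nlink > 1:
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # another hard link to this inode was already yielded
            seen.add(key)
        yield p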

Performance notes

  • Size prefiltering removes most non-duplicates cheaply.
  • Partial-hash pre-check reduces full hashing on large files. Increase --first-bytes for fewer full hashes; decrease for faster scans. Setting 0 skips the pre-check.
  • Streaming I/O: full hashing reads in 1 MiB chunks to keep memory bounded. Tune CHUNK_SIZE for your storage (e.g., 4–8 MiB on fast SSDs).
  • Parallelism: to go faster on SSDs or disk arrays, you can parallelize the hashing phase with ThreadPoolExecutor (see the sketch below). On a single spinning disk, excessive parallel reads may hurt.
  • Skipping small files: raising --min-size to 4096 or 16384 can greatly speed up scans on directories with many tiny files.

Example change to hash chunk size:

CHUNK_SIZE = 4 << 20  # 4 MiB for faster sequential reads on SSD
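A sketch of the parallel hashing idea, assuming sha256_full can be imported from dedupe.py; the helper name hash_many and the worker count are illustrative choices. Threads work reasonably well here because the job is dominated by disk I/O, and hashlib releases the GIL while digesting large buffers.

# Sketch: hash many files concurrently with a thread pool.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, Iterable, List

from dedupe import sha256_full  # assumes dedupe.py is importable

def hash_many(paths: Iterable[Path], workers: int = 4) -> Dict[str, List[Path]]:
    paths = list(paths)
    by_full: Dict[str, List[Path]] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so digests pair up with their paths
        for p, digest in zip(paths, pool.map(sha256_full, paths)):
            if digest:  # empty string means the file could not be read
                by_full.setdefault(digest, []).append(p)
    return by_full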

Customizations

  • Keep the oldest or newest file instead of the lexicographically first:
# replace the keeper selection in main()
keeper = min(group, key=lambda p: p.stat().st_mtime)  # keep the oldest
# or
keeper = max(group, key=lambda p: p.stat().st_mtime)  # keep the newest
dups = [p for p in group if p != keeper]
  • Exclude certain extensions:
EXCLUDE = {".iso", ".zip"}
# in iter_files():
if p.suffix.lower() in EXCLUDE:
    continue
  • Move to a quarantine directory instead of deleting:
quarantine = Path("./.duplicates")
quarantine.mkdir(exist_ok=True)
# replace p.unlink() with
p.replace(quarantine / p.name)
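One caveat with the snippet above: same-named files from different folders would overwrite each other in the quarantine directory, and Path.replace only works within a single filesystem. A small sketch that instead mirrors each duplicate's path relative to the scan root (the helper name quarantine_move is illustrative):

from pathlib import Path

def quarantine_move(p: Path, root: Path, quarantine: Path) -> None:
    # Mirror p's location relative to the scan root inside the quarantine
    # directory so equally named files from different folders do not collide.
    target = quarantine / p.relative_to(root)
    target.parent.mkdir(parents=True, exist_ok=True)
    p.replace(target)  # same-filesystem move; use shutil.move across devices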

FAQ

  • Does it scan subdirectories?
    • Yes. It walks the entire tree under the given path.
  • Does it follow symlinks?
    • No, to avoid loops and surprises. You can change os.walk(..., followlinks=True) if you need it.
  • Can it handle huge files (GBs)?
    • Yes. Hashing streams in fixed-size chunks; memory usage stays low.
  • Are partial-hash collisions a risk?
    • No for correctness: full hashes verify before deletion. The partial step only saves work.
  • Can I replace duplicates with hard links instead of deleting?
    • Not in this minimal script. You could keep one file, remove the duplicate, and then create a hard link to the keeper at the duplicate's path with os.link, as long as both paths are on the same filesystem (see the sketch below).
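A minimal sketch of that idea, assuming POSIX-style hard-link semantics; the temporary suffix is an illustrative choice that avoids a window where no file exists at the duplicate's path:

import os
from pathlib import Path

def replace_with_hardlink(keeper: Path, duplicate: Path) -> None:
    # Create a second directory entry for the keeper's inode, then atomically
    # swap it in over the duplicate. Both paths must be on the same filesystem.
    tmp = duplicate.with_name(duplicate.name + ".dedupe-tmp")  # illustrative temp name
    os.link(keeper, tmp)
    os.replace(tmp, duplicate)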

Series: Automate boring tasks with Python
