Overview
Automate boring tasks with Python: scan a directory, find duplicate files, and optionally remove them safely. This guide shows a practical, fast approach using file size and cryptographic hashes to avoid false positives.
Quickstart
- You’ll run a single Python script that:
  - Groups files by size (cheap prefilter)
  - Hashes only the first N bytes (fast pre-check)
  - Confirms duplicates with a full content hash
  - Prints a dry-run report by default; deletes only with --delete
Minimal working example
Save as dedupe.py and run with Python 3.8+.
#!/usr/bin/env python3
import argparse
import hashlib
import os
from pathlib import Path
from typing import Dict, List, Iterable

CHUNK_SIZE = 1 << 20  # 1 MiB


def iter_files(root: Path) -> Iterable[Path]:
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            p = Path(dirpath) / name
            # Skip broken symlinks and non-regular files
            try:
                if not p.is_file():
                    continue
            except OSError:
                continue
            yield p


def sha256_first_n(path: Path, n: int) -> str:
    h = hashlib.sha256()
    try:
        with path.open('rb') as f:
            h.update(f.read(n))
    except OSError:
        return ""
    return h.hexdigest()


def sha256_full(path: Path, chunk_size: int = CHUNK_SIZE) -> str:
    h = hashlib.sha256()
    try:
        with path.open('rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
    except OSError:
        return ""
    return h.hexdigest()


def find_duplicates(root: Path, min_size: int, first_bytes: int) -> List[List[Path]]:
    by_size: Dict[int, List[Path]] = {}
    for p in iter_files(root):
        try:
            size = p.stat().st_size
        except OSError:
            continue
        if size < min_size:
            continue
        by_size.setdefault(size, []).append(p)
    dup_groups: List[List[Path]] = []
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        # Phase 1: partial hash
        by_partial: Dict[str, List[Path]] = {}
        for p in paths:
            h = sha256_first_n(p, first_bytes) if first_bytes > 0 else sha256_full(p)
            if not h:
                continue
            by_partial.setdefault(h, []).append(p)
        for plist in by_partial.values():
            if len(plist) < 2:
                continue
            # Phase 2: full hash
            by_full: Dict[str, List[Path]] = {}
            for p in plist:
                h = sha256_full(p)
                if not h:
                    continue
                by_full.setdefault(h, []).append(p)
            for glist in by_full.values():
                if len(glist) > 1:
                    dup_groups.append(glist)
    return dup_groups


def main():
    ap = argparse.ArgumentParser(description="Find and remove duplicate files by size and content hash.")
    ap.add_argument("path", type=Path, help="Root directory to scan")
    ap.add_argument("--delete", action="store_true", help="Actually delete duplicates (default is dry-run)")
    ap.add_argument("--min-size", type=int, default=1, help="Ignore files smaller than this many bytes (default: 1)")
    ap.add_argument("--first-bytes", type=int, default=1_048_576, help="Bytes to hash for the fast pre-check (default: 1MiB; 0=skip)")
    args = ap.parse_args()
    root = args.path.resolve()
    if not root.exists() or not root.is_dir():
        ap.error(f"Not a directory: {root}")
    groups = find_duplicates(root, min_size=args.min_size, first_bytes=args.first_bytes)
    total_would_delete = 0
    bytes_reclaimed = 0
    for group in groups:
        group = sorted(group)  # deterministic: keep the lexicographically first
        keeper = group[0]
        dups = group[1:]
        print(f"KEEP: {keeper}")
        for p in dups:
            try:
                size = p.stat().st_size
            except OSError:
                size = 0
            if args.delete:
                try:
                    p.unlink()
                    print(f"DEL : {p}")
                    bytes_reclaimed += size
                    total_would_delete += 1
                except OSError as e:
                    print(f"ERR : {p} ({e})")
            else:
                print(f"DUPE: {p}")
                bytes_reclaimed += size
                total_would_delete += 1
        print()
    mode = "Deleted" if args.delete else "Would delete"
    print(f"{mode} {total_would_delete} files; bytes reclaimed: {bytes_reclaimed}")
    if not args.delete:
        print("Dry-run only. Re-run with --delete to remove duplicates.")


if __name__ == "__main__":
    main()
Example usage:
python dedupe.py /path/to/scan # dry-run
python dedupe.py /path/to/scan --delete # actually delete duplicates
python dedupe.py /path/to/scan --min-size 4096
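A dry-run on a tree containing one photo stored in three places might print something like this (paths and sizes are purely illustrative):

KEEP: /path/to/scan/archive/IMG_0042.jpg
DUPE: /path/to/scan/backup/IMG_0042.jpg
DUPE: /path/to/scan/photos/2021/IMG_0042.jpg

Would delete 2 files; bytes reclaimed: 6291456
Dry-run only. Re-run with --delete to remove duplicates.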
How it works (numbered steps)
1. Walk the directory tree and collect regular files.
2. Group files by size. Unique sizes cannot be duplicates.
3. For size groups with 2+ files, compute a fast partial hash (first N bytes).
4. For partial-hash groups with 2+ files, compute a full content hash.
5. Files in the same size and full-hash group are byte-identical. Keep one; delete the rest (only if --delete is set).
CLI options
| Option | Default | Meaning |
|---|---|---|
| --delete | off | If set, delete duplicates; otherwise dry-run prints actions. |
| --min-size BYTES | 1 | Skip files smaller than this (commonly 1 or 4096). |
| --first-bytes BYTES | 1048576 | Bytes hashed in the fast pre-check (0 to skip and go straight to full hash). |
Pitfalls and safety
- Dry-run first. Confirm groups before deleting.
- Symlinks and special files are skipped; only regular files are processed.
- Hard links: two paths can point to the same inode. Deleting one keeps the data reachable via the other path, but the script still treats the pair as duplicates. If you want to leave hard-linked files untouched, add a filter using os.stat().st_nlink (see the sketch after this list).
- Permissions: some files may be unreadable or undeletable; the script logs errors and continues.
- Case sensitivity: on case-insensitive filesystems (the default on Windows and macOS), two paths that differ only by case refer to the same file, which can make reports and keeper selection confusing.
- Backups: deletions are permanent. Use version control or snapshots if available.
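For the hard-link point above, here is a minimal sketch of such a filter; is_hard_linked is a hypothetical helper name, and it reuses the Path import the script already has:

def is_hard_linked(path: Path) -> bool:
    # True if other directory entries share this file's data (same inode)
    try:
        return path.stat().st_nlink > 1
    except OSError:
        return False

# in iter_files(), just before `yield p`:
if is_hard_linked(p):
    continue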
Performance notes
- Size prefiltering removes most non-duplicates cheaply.
- Partial-hash pre-check reduces full hashing of large files. Raising --first-bytes reads more per file up front but rules out more non-duplicates before the expensive full hash; lowering it makes the pre-check faster at the risk of more full hashes. Setting it to 0 skips the pre-check entirely.
- Streaming I/O: full hashing reads in 1 MiB chunks to keep memory bounded. Tune CHUNK_SIZE for your storage (e.g., 4–8 MiB on fast SSDs).
- Parallelism: on SSDs or multi-disk arrays you can speed up hashing with a ThreadPoolExecutor; on a single spinning disk, concurrent reads cause seek thrashing and often slow things down. A minimal sketch follows the chunk-size example below.
- Skipping small files: raising --min-size to 4096 or 16384 can greatly speed up scans on directories with many tiny files.
Example change to hash chunk size:
CHUNK_SIZE = 4 << 20 # 4 MiB for faster sequential reads on SSD
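For the parallelism note above, a minimal sketch using concurrent.futures from the standard library; hash_many is a hypothetical helper that reuses the script's sha256_full and typing imports and would replace the sequential Phase 2 loop in find_duplicates. hashlib releases the GIL while hashing large buffers, so threads can overlap I/O and hashing even in CPython:

from concurrent.futures import ThreadPoolExecutor

def hash_many(paths: List[Path], workers: int = 4) -> Dict[Path, str]:
    # Hash files concurrently; unreadable files (empty digest) are dropped.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = pool.map(sha256_full, paths)  # results come back in input order
        return {p: h for p, h in zip(paths, digests) if h}

# in find_duplicates(), Phase 2 becomes:
by_full: Dict[str, List[Path]] = {}
for p, h in hash_many(plist).items():
    by_full.setdefault(h, []).append(p)

Keep workers small (2-4) on a single spinning disk; larger values mostly help on SSDs and multi-disk arrays.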
Customizations
- Keep the oldest or newest file instead of the lexicographically first:
# replace the keeper selection in main()
keeper = min(group, key=lambda p: p.stat().st_mtime)  # keep the oldest
# or
keeper = max(group, key=lambda p: p.stat().st_mtime)  # keep the newest
- Exclude certain extensions:
EXCLUDE = {".iso", ".zip"}
# in iter_files(), just before `yield p`:
if p.suffix.lower() in EXCLUDE:
    continue
- Move to a quarantine directory instead of deleting:
quarantine = Path("./.duplicates")
quarantine.mkdir(exist_ok=True)
# replace p.unlink() with
p.replace(quarantine / p.name)
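Note that Path.replace() silently overwrites an existing quarantine file with the same name and can fail if the quarantine directory sits on a different filesystem. A slightly safer sketch; quarantine_file is a hypothetical helper, and the numeric-suffix scheme is just one arbitrary way to avoid name clashes:

import shutil

def quarantine_file(p: Path, quarantine: Path) -> Path:
    # Move p into quarantine, appending a counter instead of overwriting name clashes.
    quarantine.mkdir(parents=True, exist_ok=True)
    target = quarantine / p.name
    n = 1
    while target.exists():
        target = quarantine / f"{p.stem}.{n}{p.suffix}"
        n += 1
    shutil.move(str(p), str(target))  # handles cross-filesystem moves, unlike Path.replace()
    return target

In main(), you would then call quarantine_file(p, quarantine) where the script currently calls p.unlink().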
FAQ
- Does it scan subdirectories?
- Yes. It walks the entire tree under the given path.
- Does it follow symlinks?
- No, to avoid symlink loops and surprises. Pass followlinks=True to os.walk() if you need that behavior.
- Can it handle huge files (GBs)?
- Yes. Hashing streams in fixed-size chunks; memory usage stays low.
- Are partial-hash collisions a risk?
- No for correctness: full hashes verify before deletion. The partial step only saves work.
- Can I replace duplicates with hard links instead of deleting?
- Not in this minimal script. You could keep one file, remove the duplicate, and call os.link(keeper, duplicate_path) to re-create it as a hard link, but both paths must be on the same filesystem; a sketch follows.
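For reference, a minimal sketch of that replacement; relink_duplicate is a hypothetical helper that reuses the script's os and Path imports, and it creates the new link under a temporary name first so the duplicate path never disappears:

def relink_duplicate(keeper: Path, duplicate: Path) -> None:
    # Replace `duplicate` with a hard link to `keeper` (same filesystem only).
    tmp = duplicate.with_name(duplicate.name + ".tmp-link")
    os.link(keeper, tmp)        # new directory entry pointing at keeper's data
    os.replace(tmp, duplicate)  # atomically swap the link into place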