Fix Operation not permitted for SYS_ADMIN in Docker containers

Overview
Quickstart: fix the common mount failure
Minimal working example (Dockerfile + run)
Common fixes by operation
Diagnose systematically (least-to-most invasive)
Kubernetes equivalents
Pitfalls
Performance notes
Tiny FAQ

Overview

Docker isolates processes with namespaces, capabilities, seccomp, and LSMs (AppArmor/SELinux). Privileged operations (mount, loop devices, iptables, FUSE, eBPF) often fail with: Operation not permitted. This guide shows practical fixes using the least privilege necessary.

Key causes:

Missing capabilities (e.g., CAP_SYS_ADMIN, CAP_NET_ADMIN)
Seccomp profile blocks syscalls (e.g., mount)
AppArmor/SELinux confinement
Rootless mode or user namespaces
Missing device nodes (/dev/fuse, /dev/net/tun, /dev/loop*)
Read-only filesystem, locked sysctls, cgroup restrictions

Quickstart: fix the common mount failure

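When a container tries to mount, it needs CAP_SYS_ADMIN, and on AppArmor hosts an unconfined (or custom) AppArmor profile, because the docker-default profile denies mount. Docker's default seccomp profile permits mount once CAP_SYS_ADMIN is granted, so seccomp=unconfined is needed only when an older or custom seccomp profile is in effect; keep it in the commands below only if the mount still fails without it.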

Reproduce the failure:

# Fails with Operation not permitted
docker run --rm alpine sh -c 'mkdir -p /mnt && mount -t tmpfs tmpfs /mnt'

Minimal fix with least privilege:

# Add CAP_SYS_ADMIN; AppArmor unconfined is needed on AppArmor hosts (e.g., many Ubuntu hosts)
# seccomp=unconfined is usually redundant with Docker's default profile; drop it if the mount works without it
docker run --rm \
  --cap-add SYS_ADMIN \
  --security-opt seccomp=unconfined \
  --security-opt apparmor=unconfined \
  alpine sh -c 'mkdir -p /mnt && mount -t tmpfs tmpfs /mnt && mount | grep /mnt && umount /mnt'

If you must, use the sledgehammer:

# Broadest access; prefer targeted caps
docker run --rm --privileged alpine sh -c 'mount -t tmpfs tmpfs /mnt && umount /mnt'

Minimal working example (Dockerfile + run)

# Dockerfile
FROM alpine:3.20
RUN mkdir -p /mnt
CMD ["sh", "-c", "mount -t tmpfs tmpfs /mnt && echo 'mounted' && umount /mnt"]

Build and run:

docker build -t mount-test .
# Likely fails without privileges
docker run --rm mount-test
# Works with targeted privileges
docker run --rm \
  --cap-add SYS_ADMIN \
  --security-opt seccomp=unconfined \
  --security-opt apparmor=unconfined \
  mount-test

Common fixes by operation

Mount (tmpfs, bind, overlay):
- --cap-add SYS_ADMIN
- --security-opt seccomp=unconfined (only if the active seccomp profile still blocks mount after the capability is added)
- --security-opt apparmor=unconfined (host dependent)
FUSE (rclone, sshfs):
- --device /dev/fuse
- --cap-add SYS_ADMIN (often required for mount)
- seccomp=unconfined; apparmor=unconfined if needed
iptables, tc, routes:
- --cap-add NET_ADMIN --cap-add NET_RAW (NET_RAW is already in Docker's default set; add it back only if capabilities were dropped)
- For sysctls: --sysctl net.ipv4.ip_forward=1 (host must allow)
Loop devices, mkfs:
- --cap-add SYS_ADMIN --cap-add MKNOD (MKNOD is already in Docker's default set; add it back only if capabilities were dropped)
- --device /dev/loop-control --device /dev/loop0 --device /dev/loop1 ...
ptrace, debuggers (gdb, strace):
- --cap-add SYS_PTRACE
- --security-opt seccomp=unconfined for broader ptrace
eBPF or perf:
- Often requires --privileged or a tuned seccomp profile plus CAP_BPF/CAP_SYS_ADMIN depending on kernel
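The list above can be kept in one place as a small lookup helper. This is a minimal sketch: the function name flags_for and the operation keywords are made up for illustration, and it prints only capability and device flags, since the seccomp and AppArmor options depend on the host.

```shell
# Print the capability and device flags suggested above for one operation.
# flags_for and its keywords are hypothetical names, not part of Docker.
# seccomp/AppArmor options are intentionally left out: they are host dependent.
flags_for() {
  case "$1" in
    mount)  echo "--cap-add SYS_ADMIN" ;;
    fuse)   echo "--device /dev/fuse --cap-add SYS_ADMIN" ;;
    # NET_RAW and MKNOD are harmless to repeat if they are already granted
    net)    echo "--cap-add NET_ADMIN --cap-add NET_RAW" ;;
    loop)   echo "--cap-add SYS_ADMIN --cap-add MKNOD --device /dev/loop-control --device /dev/loop0" ;;
    ptrace) echo "--cap-add SYS_PTRACE" ;;
    *)      echo "unknown operation: $1" >&2; return 1 ;;
  esac
}

# Example (needs a Docker host; commented out so the snippet has no side effects).
# The substitution is unquoted on purpose so the flags split into arguments:
# docker run --rm $(flags_for fuse) alpine ls -l /dev/fuse
```

Keeping the flags per operation makes it easier to review what each service was actually granted.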

Diagnose systematically (least-to-most invasive)

Check Docker mode and isolation:

docker info | grep -i rootless
If the output mentions rootless, some privileged operations (mounting block devices, creating device nodes) will never work. Consider rootful Docker.
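The same check can be scripted against the daemon's security options, which also show whether AppArmor, SELinux, seccomp, or user namespaces are active. A sketch, assuming the usual bracketed name=... format printed by docker info --format '{{.SecurityOptions}}'; the helper names and the sample string are illustrative.

```shell
# List feature names from the output of:
#   docker info --format '{{.SecurityOptions}}'
security_features() {
  printf '%s\n' "$1" | tr -d '[]' | tr ' ' '\n' | sed -n 's/^name=\([a-z]*\).*/\1/p'
}

# Succeed if the given SecurityOptions string reports rootless mode.
is_rootless() {
  security_features "$1" | grep -qx rootless
}

# Illustrative sample; on a real host use:
#   opts=$(docker info --format '{{.SecurityOptions}}')
opts='[name=apparmor name=seccomp,profile=builtin name=rootless name=cgroupns]'
security_features "$opts"
is_rootless "$opts" && echo "rootless daemon"
```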

Verify the attempted operation and syscall:

Look at the exact command that fails and its error. For mount, start with CAP_SYS_ADMIN, then check AppArmor/SELinux, and only then seccomp.
Optional: strace the command in a test container to confirm the blocked syscall.

Add the minimal capability:

docker run --cap-add <CAP> ...
List current caps inside a running container: grep CapEff /proc/1/status (decode the hex mask with capsh --decode=<mask> where capsh is installed)
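Where capsh is not available, a single capability can be tested directly against the CapEff mask. A minimal sketch: has_cap is a hypothetical helper, the bit numbers come from linux/capability.h, and the sample mask is the value commonly seen in a default (unprivileged) Docker container.

```shell
# Succeed if bit number $2 is set in the hex capability mask $1 (a CapEff value).
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

CAP_NET_ADMIN=12   # bit numbers from linux/capability.h
CAP_SYS_ADMIN=21

# Inside a container: mask=$(awk '/^CapEff/ {print $2}' /proc/1/status)
mask=00000000a80425fb   # typical CapEff of a default Docker container
if has_cap "$mask" "$CAP_SYS_ADMIN"; then
  echo "SYS_ADMIN present"
else
  echo "SYS_ADMIN missing"
fi
```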

Address seccomp:

Default profile blocks some syscalls. Try --security-opt seccomp=unconfined as a test.
If that fixes it, adopt a custom, minimally-permissive seccomp profile later.
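One way to build such a profile is to start from a copy of Docker's default profile and append an allow rule for only the syscalls that were blocked. The sketch below assumes jq is installed and that the profile uses the usual top-level syscalls array; allow_syscalls and the file names are placeholders.

```shell
# Append an allow rule for comma-separated syscall names to a seccomp profile.
# $1: input profile (e.g. a copy of Docker's default profile)
# $2: output file
# $3: comma-separated syscall names
allow_syscalls() {
  jq --arg names "$3" \
     '.syscalls += [{names: ($names | split(",")), action: "SCMP_ACT_ALLOW"}]' \
     "$1" > "$2"
}

# Example (file names are placeholders; needs a Docker host):
# allow_syscalls default.json custom.json "mount,umount2"
# docker run --rm --cap-add SYS_ADMIN --security-opt seccomp=custom.json \
#   alpine sh -c 'mkdir -p /mnt && mount -t tmpfs tmpfs /mnt'
```

Take the syscall names from what strace or the audit log actually reported rather than guessing, so the profile stays as narrow as possible.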

Address AppArmor/SELinux:

AppArmor: --security-opt apparmor=unconfined
SELinux (on Fedora/CentOS hosts): prefer correct labels on bind mounts (:z or :Z). If still blocked, test with --security-opt label=disable (privileged implies this).
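Before changing either option, it helps to confirm which LSM is actually confining the process. On hosts with an LSM enabled, /proc/<pid>/attr/current holds the AppArmor profile name or the SELinux label. The classifier below is a rough sketch with a hypothetical function name, and the sample strings are illustrative.

```shell
# Classify the contents of /proc/<pid>/attr/current.
classify_confinement() {
  case "$1" in
    *:*:*:*)    echo selinux ;;     # user:role:type:level label
    unconfined) echo unconfined ;;  # AppArmor with no profile applied
    *" ("*")")  echo apparmor ;;    # e.g. "docker-default (enforce)"
    *)          echo unknown ;;
  esac
}

# Inside a container:
#   classify_confinement "$(cat /proc/1/attr/current 2>/dev/null)"
classify_confinement "docker-default (enforce)"
classify_confinement "system_u:system_r:container_t:s0:c12,c34"
```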

Map required devices:

--device /dev/fuse, /dev/net/tun, /dev/loop* as needed
For TUN: also CAP_NET_ADMIN
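A quick way to confirm the mapping took effect is to check for the device nodes from inside the container. A small sketch; missing_devices is a hypothetical helper name.

```shell
# Print each given path that is not a character or block device node.
missing_devices() {
  for dev in "$@"; do
    [ -c "$dev" ] || [ -b "$dev" ] || echo "$dev"
  done
}

# Run inside the container; any path printed still needs a --device mapping.
missing_devices /dev/fuse /dev/net/tun /dev/loop-control
```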

Consider namespaces and userns:

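If the Docker daemon uses userns-remap, container root maps to an unprivileged host UID, so capabilities apply only inside the user namespace and operations on host-owned resources fail. Test with --userns=host for the container, or disable remapping for that service.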

As a last resort, use --privileged:

Validate the operation works, then iterate to least privilege (caps + device + profiles).
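The escalation ladder can be automated so the narrowest working set of flags is found in one pass. This is a sketch for the mount case only: the function names are hypothetical, the ordering of the levels is one reasonable choice rather than a rule, and it assumes the host accepts apparmor=unconfined.

```shell
# Flag sets from narrowest to broadest, one per line (mount example).
escalation_levels() {
  cat <<'EOF'
--cap-add SYS_ADMIN
--cap-add SYS_ADMIN --security-opt apparmor=unconfined
--cap-add SYS_ADMIN --security-opt apparmor=unconfined --security-opt seccomp=unconfined
--privileged
EOF
}

# Run a command in an image under each level; print the first level that succeeds.
first_working_level() {
  image=$1; shift
  escalation_levels | while IFS= read -r flags; do
    # $flags is intentionally unquoted so it splits into separate arguments
    if docker run --rm $flags "$image" "$@" >/dev/null 2>&1 </dev/null; then
      echo "$flags"
      break
    fi
  done
}

# Example (needs a Docker host):
# first_working_level alpine sh -c 'mkdir -p /mnt && mount -t tmpfs tmpfs /mnt'
```

If only the last level works, a device mapping or another capability is probably missing; compare against the list in "Common fixes by operation" before settling on --privileged.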

Kubernetes equivalents

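For Pods, set securityContext on the container and, for AppArmor, an annotation whose key ends with the container name (mount in the example below). Kubernetes 1.30 and later also accept a securityContext.appArmorProfile field (type: Unconfined), which supersedes the annotation.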

Minimal pod with mount inside container:

apiVersion: v1
kind: Pod
metadata:
  name: mount-test
  annotations:
    container.apparmor.security.beta.kubernetes.io/mount: unconfined
spec:
  containers:
  - name: mount
    image: alpine:3.20
    command: ["sh", "-c", "mkdir -p /mnt && mount -t tmpfs tmpfs /mnt && sleep 5"]
    securityContext:
      privileged: false
      allowPrivilegeEscalation: true
      capabilities:
        add: ["SYS_ADMIN"]

Notes:

Admission policies may forbid privileged or SYS_ADMIN.
On SELinux hosts, use proper volume labels; cluster defaults may still block mounts.
For FUSE, expose /dev/fuse to the container (for example with a hostPath volume or a device plugin) and add the capability as above.

Pitfalls

Relying on --privileged hides missing device mappings and policy issues; prefer targeted caps.
Rootless Docker cannot perform many privileged operations regardless of caps.
Read-only rootfs or masked paths can mimic permission errors; check whether the container was started with --read-only and which paths docker inspect lists under MaskedPaths and ReadonlyPaths.
Bind-mounting host paths with restrictive SELinux labels causes EPERM; use :z or :Z on volumes on SELinux hosts.
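Kubernetes admission controls (Pod Security Admission, OPA/Gatekeeper, and PodSecurityPolicy on clusters older than 1.25) may reject the Pod or mutate it to drop capabilities; verify the effective securityContext at runtime.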
Default seccomp profiles vary by engine version; an upgrade can change behavior.

Performance notes

FUSE filesystems incur user-space context switch overhead versus kernel mounts; expect higher CPU and lower throughput.
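--privileged removes security isolation (capabilities, seccomp, device restrictions) rather than cgroup CPU and memory limits, but a privileged container can change host-wide settings and devices, so a misbehaving one can degrade every workload on the node.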
eBPF/perf inside containers can contend for system-wide resources; limit scope and sampling.
Excess capabilities and unconfined seccomp have negligible direct performance cost but increase attack surface; security incidents have far greater operational impact.
Loop devices and dm-crypt in containers add I/O overhead; consider host-managed storage instead.
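To put a rough number on the FUSE overhead mentioned above, the same sequential write can be run against a FUSE mount and a kernel filesystem. This is a crude sketch, not a benchmark: write_throughput is a hypothetical helper, the paths are placeholders, and it assumes a dd that supports bs=1M and conv=fsync (GNU coreutils or BusyBox).

```shell
# Write $2 MiB of zeros into directory $1, flush to storage,
# print dd's summary line, then remove the test file.
write_throughput() {
  target="$1/throughput-test.$$"
  dd if=/dev/zero of="$target" bs=1M count="$2" conv=fsync 2>&1 | tail -n 1
  rm -f "$target"
}

# Example: run once on the FUSE mount and once on a kernel filesystem, then compare.
# write_throughput /mnt/fuse 256
# write_throughput /tmp 256
```

Repeat each run a few times and use a size well above the page cache effects you care about; a single run mostly shows the order of magnitude.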

Tiny FAQ

Q: What does CAP_SYS_ADMIN cover? A: It is a broad capability used for many operations (mount, namespace control, device mgmt). Use it sparingly; prefer specific alternatives when available.

Q: Why does --cap-add SYS_ADMIN still fail? A: The syscall may be blocked by an LSM (AppArmor/SELinux) or by seccomp, or the needed device is missing. Check each in turn: unconfine AppArmor or fix SELinux labels, test with seccomp=unconfined, and map the required devices.

Q: Is --privileged the same as adding a few caps? A: No. --privileged grants all capabilities, disables seccomp/AppArmor constraints, and gives broad device access. Use only as a last resort.

Q: Does this work in rootless Docker? A: Operations that need real host privileges (mounting block devices, creating device nodes) are not possible in rootless mode. Use rootful Docker or move the operation to the host.
