Overview
Docker isolates processes with namespaces, capabilities, seccomp, and LSMs (AppArmor/SELinux). Privileged operations (mount, loop devices, iptables, FUSE, eBPF) often fail with: Operation not permitted. This guide shows practical fixes using the least privilege necessary.
Key causes:
- Missing capabilities (e.g., CAP_SYS_ADMIN, CAP_NET_ADMIN)
- Seccomp profile blocks syscalls (e.g., mount)
- AppArmor/SELinux confinement
- Rootless mode or user namespaces
- Missing device nodes (/dev/fuse, /dev/net/tun, /dev/loop*)
- Read-only filesystem, locked sysctls, cgroup restrictions
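Two of the causes above (missing device nodes and LSM confinement) can be checked on the host before touching any container flags. The following is a minimal sketch; the function names check_devices and active_lsms are hypothetical helpers, and it assumes a Linux host where securityfs exposes /sys/kernel/security/lsm.

```shell
# Report whether each device node passed as an argument exists on this host.
check_devices() {
    for dev in "$@"; do
        if [ -e "$dev" ]; then
            echo "present $dev"
        else
            echo "missing $dev"
        fi
    done
}

# Print the active Linux Security Modules, or "unknown" if the list is not exposed.
active_lsms() {
    if [ -r /sys/kernel/security/lsm ]; then
        cat /sys/kernel/security/lsm
        echo
    else
        echo "unknown"
    fi
}
```

For example, check_devices /dev/fuse /dev/net/tun /dev/loop-control lists which nodes could be passed with --device, and active_lsms shows whether apparmor or selinux appears in the list.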
Quickstart: fix the common mount failure
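A container that calls mount needs CAP_SYS_ADMIN. On AppArmor hosts (many Ubuntu systems) the docker-default profile also denies mount, so AppArmor usually has to be unconfined or replaced with a custom profile. Docker's default seccomp profile allows mount once CAP_SYS_ADMIN is granted, so seccomp=unconfined is normally needed only with a custom or older profile; treat it as a fallback and drop it if the mount works without it.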
- Reproduce the failure:
# Fails with Operation not permitted
docker run --rm alpine sh -c 'mkdir -p /mnt && mount -t tmpfs tmpfs /mnt'
- Minimal fix with least privilege:
# Add CAP_SYS_ADMIN; AppArmor unconfined is needed on many Ubuntu hosts
# seccomp=unconfined is a fallback for custom or older profiles; try dropping it first
docker run --rm \
--cap-add SYS_ADMIN \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
alpine sh -c 'mkdir -p /mnt && mount -t tmpfs tmpfs /mnt && mount | grep /mnt && umount /mnt'
- If you must, use the sledgehammer:
# Broadest access; prefer targeted caps
docker run --rm --privileged alpine sh -c 'mount -t tmpfs tmpfs /mnt && umount /mnt'
Minimal working example (Dockerfile + run)
# Dockerfile
FROM alpine:3.20
RUN mkdir -p /mnt
CMD ["sh", "-c", "mount -t tmpfs tmpfs /mnt && echo 'mounted' && umount /mnt"]
Build and run:
docker build -t mount-test .
# Likely fails without privileges
docker run --rm mount-test
# Works with targeted privileges
docker run --rm \
--cap-add SYS_ADMIN \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
mount-test
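To find the smallest set of flags that works on a given host, the image can be run at increasing privilege levels until one succeeds. This is a sketch under assumptions: first_working_level is a hypothetical helper, docker must be on PATH, and a non-zero exit from the container is treated as "this level is not enough".

```shell
# Run the image at increasing privilege levels and print the first level that succeeds.
# A level that fails for any reason simply falls through to the next one.
first_working_level() {
    image="$1"
    if docker run --rm "$image" >/dev/null 2>&1; then
        echo "default"
    elif docker run --rm --cap-add SYS_ADMIN "$image" >/dev/null 2>&1; then
        echo "SYS_ADMIN"
    elif docker run --rm --cap-add SYS_ADMIN \
            --security-opt apparmor=unconfined "$image" >/dev/null 2>&1; then
        echo "SYS_ADMIN + apparmor=unconfined"
    elif docker run --rm --cap-add SYS_ADMIN \
            --security-opt apparmor=unconfined \
            --security-opt seccomp=unconfined "$image" >/dev/null 2>&1; then
        echo "SYS_ADMIN + apparmor=unconfined + seccomp=unconfined"
    elif docker run --rm --privileged "$image" >/dev/null 2>&1; then
        echo "privileged"
    else
        echo "none"
        return 1
    fi
}
```

Calling first_working_level mount-test prints the lowest level at which the mount succeeded, which is the set of flags worth keeping.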
Common fixes by operation
- Mount (tmpfs, bind, overlay):
  - --cap-add SYS_ADMIN
  - --security-opt apparmor=unconfined (host dependent)
  - --security-opt seccomp=unconfined (only if a custom or older profile blocks mount; Docker's default profile allows it once SYS_ADMIN is added)
- FUSE (rclone, sshfs):
  - --device /dev/fuse
  - --cap-add SYS_ADMIN (often required for mount)
  - --security-opt seccomp=unconfined and --security-opt apparmor=unconfined, only if still blocked
- iptables, tc, routes:
  - --cap-add NET_ADMIN --cap-add NET_RAW
  - For sysctls: --sysctl net.ipv4.ip_forward=1 (namespaced net.* sysctls only; not available with --network=host)
- Loop devices, mkfs:
  - --cap-add SYS_ADMIN --cap-add MKNOD
  - --device /dev/loop-control --device /dev/loop0 --device /dev/loop1 ...
- ptrace, debuggers (gdb, strace):
  - --cap-add SYS_PTRACE
  - --security-opt seccomp=unconfined only if the default profile still blocks ptrace (older engines or kernels before 4.8)
- eBPF or perf:
  - Depends on the kernel: CAP_BPF and CAP_PERFMON on 5.8+, CAP_SYS_ADMIN on older kernels, possibly with a tuned seccomp profile; in practice many tools still need --privileged
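The list above can be captured as a small lookup so scripts start from the same flags every time. This is only a sketch: flags_for and its operation names are hypothetical, it covers capabilities and devices only, and seccomp or AppArmor options are meant to be added afterward if the operation is still blocked.

```shell
# Print the starting docker run flags (capabilities and devices only) for an operation.
flags_for() {
    case "$1" in
        mount)  echo "--cap-add SYS_ADMIN" ;;
        fuse)   echo "--device /dev/fuse --cap-add SYS_ADMIN" ;;
        net)    echo "--cap-add NET_ADMIN --cap-add NET_RAW" ;;
        loop)   echo "--cap-add SYS_ADMIN --cap-add MKNOD --device /dev/loop-control --device /dev/loop0" ;;
        ptrace) echo "--cap-add SYS_PTRACE" ;;
        *)      echo "unknown operation: $1" >&2; return 1 ;;
    esac
}
```

Usage relies on word splitting, so the substitution is left unquoted: docker run --rm $(flags_for net) alpine ip link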
Diagnose systematically (least-to-most invasive)
- Check Docker mode and isolation:
  - docker info | grep -i rootless
  - If the output mentions rootless, operations such as mounting block devices or creating device nodes cannot work regardless of flags. Consider rootful Docker.
- Verify the attempted operation and syscall:
  - Look at the exact failing command and its error. For mount, start with CAP_SYS_ADMIN, then check AppArmor/SELinux and seccomp.
  - Optional: strace the command in a test container to confirm the blocked syscall.
- Add the minimal capability:
  - docker run --cap-add <CAP> ...
  - List current caps inside a running container: grep CapEff /proc/1/status (decode the mask with capsh --decode=<mask>)
- Address seccomp:
  - Docker's default profile blocks some syscalls outright and allows others only when the matching capability is present. Try --security-opt seccomp=unconfined as a test.
  - If that fixes it, adopt a custom, minimally permissive seccomp profile later.
- Address AppArmor/SELinux:
  - AppArmor: --security-opt apparmor=unconfined
  - SELinux (on Fedora/CentOS hosts): prefer correct labels on bind mounts (:z or :Z). If still blocked, test with --security-opt label=disable (--privileged implies this).
- Map required devices:
  - Pass one --device flag per node: /dev/fuse, /dev/net/tun, /dev/loop-control and the specific /dev/loopN nodes, as needed
  - For TUN: also CAP_NET_ADMIN
- Consider namespaces and userns:
  - If the Docker daemon uses userns-remap, container root maps to an unprivileged host UID, so host-owned files and devices may be inaccessible. Test with --userns=host for the container, or disable remapping for that service.
- As a last resort, use --privileged:
  - Validate that the operation works, then iterate back to least privilege (caps + devices + profiles).
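The CapEff value read from /proc/1/status is a hexadecimal bitmask, so checking for one capability means testing one bit. The sketch below assumes the capability numbers from linux/capability.h (NET_ADMIN 12, NET_RAW 13, SYS_PTRACE 19, SYS_ADMIN 21, MKNOD 27, PERFMON 38, BPF 39); has_cap is a hypothetical helper name.

```shell
# Succeed if capability number $2 is set in the hex mask $1 (a CapEff value without "0x").
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Example with a mask commonly seen in default Docker containers:
#   has_cap 00000000a80425fb 27 && echo "CAP_MKNOD present"
#   has_cap 00000000a80425fb 21 || echo "CAP_SYS_ADMIN missing"
```

If capsh is not installed in the image, this gives a quick yes/no answer for the one capability being debugged.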
Kubernetes equivalents
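For Pods, set securityContext on the container and, for AppArmor, a per-container annotation whose key ends with the container name. Kubernetes 1.30 and later also offer the securityContext.appArmorProfile field as a replacement for the annotation.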
Minimal pod with mount inside container:
apiVersion: v1
kind: Pod
metadata:
  name: mount-test
  annotations:
    container.apparmor.security.beta.kubernetes.io/mount: unconfined
spec:
  containers:
  - name: mount
    image: alpine:3.20
    command: ["sh", "-c", "mkdir -p /mnt && mount -t tmpfs tmpfs /mnt && sleep 5"]
    securityContext:
      privileged: false
      allowPrivilegeEscalation: true
      capabilities:
        add: ["SYS_ADMIN"]
Notes:
- Admission policies may forbid privileged or SYS_ADMIN.
- On SELinux hosts, use proper volume labels; cluster defaults may still block mounts.
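- For FUSE, expose /dev/fuse to the container and add the capability as above. A plain hostPath volume may not be enough without privileged mode, because device access is also gated by the device cgroup; some clusters use a device plugin instead.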
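Because admission policies can reject or alter a Pod, it helps to compare what was requested with what is actually in effect. The following sketch assumes kubectl is configured for the cluster and that the container of interest is the first one in the Pod; pod_added_caps and pod_capeff are hypothetical helper names.

```shell
# Print the capabilities recorded in the Pod spec for its first container.
pod_added_caps() {
    kubectl get pod "$1" -o jsonpath='{.spec.containers[0].securityContext.capabilities.add}'
}

# Print the effective capability mask of PID 1 inside the running Pod.
pod_capeff() {
    kubectl exec "$1" -- grep CapEff /proc/1/status
}
```

Note that the example Pod above only sleeps for 5 seconds, so pod_capeff mount-test works only while the container is still running; lengthen the sleep when debugging.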
Pitfalls
- Relying on --privileged hides missing device mappings and policy issues; prefer targeted caps.
- Rootless Docker cannot perform many privileged operations regardless of caps.
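- Read-only rootfs or masked paths can mimic permission errors; check whether the container runs with --read-only and whether the target sits under a masked or read-only path (for example parts of /proc and /sys).
- Bind-mounting host paths with restrictive SELinux labels causes permission errors; use :z or :Z on volumes on SELinux hosts.
- Kubernetes admission control (PodSecurity, OPA/Gatekeeper, or PSP on clusters older than 1.25) may reject the Pod or mutate its capabilities; verify the effective securityContext at runtime.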
- Default seccomp profiles vary by engine version; an upgrade can change behavior.
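Several of these pitfalls can be ruled out by asking the engine how a container was actually started. This is a minimal sketch that reads fields from docker inspect (HostConfig.ReadonlyRootfs, CapAdd, SecurityOpt, MaskedPaths); container_restrictions is a hypothetical helper name.

```shell
# Summarize the security-relevant settings a container was started with.
container_restrictions() {
    docker inspect --format \
        'readonly={{.HostConfig.ReadonlyRootfs}} caps={{.HostConfig.CapAdd}} secopts={{.HostConfig.SecurityOpt}} masked={{.HostConfig.MaskedPaths}}' \
        "$1"
}
```

Run it against a container name or ID; a readonly=true value or an unexpectedly empty caps list often explains an error that looked like a missing capability.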
Performance notes
- FUSE filesystems incur user-space context switch overhead versus kernel mounts; expect higher CPU and lower throughput.
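- --privileged does not remove cgroup CPU or memory limits, but it lets the container change host-wide settings (sysctls, devices, kernel parameters); a misbehaving privileged container can therefore degrade other workloads on the node.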
- eBPF/perf inside containers can contend for system-wide resources; limit scope and sampling.
- Excess capabilities and unconfined seccomp have negligible direct performance cost but increase attack surface; security incidents have far greater operational impact.
- Loop devices and dm-crypt in containers add I/O overhead; consider host-managed storage instead.
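To put rough numbers on the storage overheads above, the same sequential write can be timed on each candidate path (a FUSE mount, a loop-backed mount, a plain volume). This is a crude sketch, not a benchmark: write_test is a hypothetical helper, it assumes a dd that accepts bs=1M and conv=fsync (GNU coreutils and BusyBox do), and it measures only sequential writes.

```shell
# Write N MiB (default 64) of zeros into directory $1, flush to disk, and print dd's summary line.
write_test() {
    target="$1/.write_test.$$"
    dd if=/dev/zero of="$target" bs=1M count="${2:-64}" conv=fsync 2>&1 | tail -n 1
    rm -f "$target"
}
```

Comparing write_test /mnt/fuse with write_test /data (both paths are placeholders) on the same host gives a first impression of the gap before investing in a proper benchmark.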
Tiny FAQ
Q: What does CAP_SYS_ADMIN cover? A: It is a broad capability used for many operations (mount, namespace control, device mgmt). Use it sparingly; prefer specific alternatives when available.
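Q: Why does --cap-add SYS_ADMIN still fail? A: The operation may be blocked by an LSM (AppArmor/SELinux) or by seccomp, or the needed device is missing. Check AppArmor/SELinux first, since Docker's default seccomp profile already allows mount once SYS_ADMIN is added; then test with seccomp=unconfined and map any missing devices.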
Q: Is --privileged the same as adding a few caps? A: No. --privileged grants all capabilities, disables seccomp/AppArmor constraints, and gives broad device access. Use only as a last resort.
Q: Does this work in rootless Docker? A: Many privileged operations (mounting block-device filesystems, creating device nodes) are not possible in rootless mode. Use rootful Docker or move the operation to the host.
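Scripts that choose flags automatically can detect rootless mode up front. The sketch below assumes that docker info lists name=rootless among its security options when the daemon runs rootless; is_rootless is a hypothetical helper name.

```shell
# Succeed if the Docker daemon reports rootless mode in its security options.
is_rootless() {
    docker info --format '{{.SecurityOptions}}' 2>/dev/null | grep -q 'name=rootless'
}
```

A typical use is a guard such as: is_rootless && echo "rootless daemon: run this step on the host instead"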