Escaping seccomp
Recall the "trust boundary" from the "Into the Jail" section: when a child process needs to perform a privileged action, it must ask the parent process to do it. In other words, to do anything useful, a sandboxed process has to communicate with the privileged process, and that communication requires at least some syscalls to be allowed. This relaxation opens up several attack vectors:
Permissive policies
Syscall confusion
Kernel vulnerabilities (in the syscall handlers)
System calls are complex, and there are a lot of them. Developers might avoid breaking functionality by erring on the side of permissiveness.
A well-known example is ptrace(). Depending on system configuration, allowing the ptrace() system call could let a sandboxed process "puppet" a non-sandboxed process.
Some less well-known effects include:
sendmsg() can transfer file descriptors between processes.
prctl() has many options, some with bizarre effects.
process_vm_writev() allows direct access to another process's memory.
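To make the sendmsg() point concrete, here is a minimal sketch of passing a file descriptor over a Unix socket with SCM_RIGHTS ancillary data. It is an illustration under assumptions, not an exploit: both ends of the socket live in one process for brevity, and the names `send_fd`, `recv_fd`, `helper`, and `sandboxed` are hypothetical. In a real escape, the sending side would be a cooperating or confused process outside the sandbox.

```python
import array
import os
import socket

def send_fd(sock, fd):
    # sendmsg() carries the descriptor as SCM_RIGHTS ancillary data.
    ancillary = [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))]
    sock.sendmsg([b"F"], ancillary)

def recv_fd(sock):
    # recvmsg() hands back a fresh descriptor that refers to the same open file.
    fds = array.array("i")
    _data, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, payload in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            usable = len(payload) - (len(payload) % fds.itemsize)
            fds.frombytes(payload[:usable])
    return fds[0]

# Hypothetical setup: "helper" stands in for a process outside the sandbox,
# "sandboxed" for the jailed process that may not open files itself.
helper, sandboxed = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
read_end, write_end = os.pipe()
os.write(write_end, b"secret")

send_fd(helper, read_end)
received = recv_fd(sandboxed)
leaked = os.read(received, 6)
print(leaked)
```

The receiving side never calls open(): a policy that blocks open() but allows sendmsg() and recvmsg() on an inherited socket can still end up with access to new files.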
Policies that allow both 32-bit and 64-bit syscalls can fail to properly sandbox one or the other mode.
Many 64-bit architectures are backwards compatible with their 32-bit ancestors. For example:
amd64 / x86_64 => x86
aarch64 => arm
mips64 => mips
powerpc64 => ppc
sparc64 => sparc
On some systems (including amd64), you can switch between 32-bit mode and 64-bit mode in the same process, so the kernel must be ready for either. However, syscall numbers differ between architectures, including the 32-bit and 64-bit variants of the same architecture. For example, the syscall number for execve is 0xb on x86 and 0x3b on x86-64; the syscall number for exit is 0x1 on x86 and 0x3c on x86-64. A filter that only compares syscall numbers cannot tell which table a number belongs to, which causes "syscall confusion" when both 32-bit and 64-bit syscalls are allowed.
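The confusion can be modeled in a few lines. The sketch below is a simulation, not a real seccomp filter: `naive_filter` and `arch_aware_filter` are hypothetical names, and the string `arch` argument stands in for the `arch` field that a real BPF filter reads from `struct seccomp_data`. The syscall numbers come from the Linux tables for each ABI.

```python
# A few syscall numbers from the Linux x86-64 and 32-bit x86 (i386) tables.
X86_64 = {"read": 0, "write": 1, "munmap": 11, "execve": 59, "exit": 60}
I386 = {"exit": 1, "read": 3, "write": 4, "execve": 11}

# Hypothetical allowlist written with only the x86-64 numbers in mind.
ALLOWED_NUMBERS = {
    X86_64["read"],
    X86_64["write"],
    X86_64["munmap"],
    X86_64["exit"],
}

def naive_filter(arch, nr):
    # Ignores the architecture entirely: only the raw number is compared.
    return nr in ALLOWED_NUMBERS

def arch_aware_filter(arch, nr):
    # Rejects every call that does not use the ABI the allowlist was written for.
    return arch == "x86_64" and nr in ALLOWED_NUMBERS

# execve is 11 on i386, the same number as the allowed munmap on x86-64.
print(naive_filter("i386", I386["execve"]))       # True: execve slips through
print(arch_aware_filter("i386", I386["execve"]))  # False
print(naive_filter("x86_64", X86_64["execve"]))   # False: blocked as intended
```

The policy looks safe when read as a list of 64-bit syscalls, yet the same numbers mean something else once the process switches to the 32-bit ABI. Checking the architecture before checking the number closes that gap.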
Even if the seccomp sandbox is correctly configured, attackers can still interact with the whitelisted syscalls.
As long as attackers can invoke some syscalls, they can reach the kernel code that handles them and trigger vulnerabilities there. Chrome sandbox escape exploits are real-world examples.
Think: what is your goal as an attacker? Is it always code execution?
Not really. Often, your goal is data exfiltration (like /flag!). Even if you can't directly communicate with the outside world, often you can send "smoke signals":
Runtime of a process (see sleep(x)) can convey a lot of data.
Clean termination or a crash? This can convey one bit.
Return value of a program (exit(x)) can convey one byte.
For a real-world example, attackers use DNS queries to bypass network egress filters. As long as you can communicate 1 bit, you can repeat the attack to get more and more bits!
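The one-bit channel can be sketched as a loop that reruns the victim once per bit. This is a minimal illustration under assumptions: the `CHILD` script is a hypothetical stand-in for a sandboxed process the attacker already controls, its hard-coded `SECRET` stands in for the contents of /flag, and "clean exit versus non-zero exit" plays the role of "clean termination or a crash".

```python
import subprocess
import sys

# Hypothetical victim: exits cleanly if the chosen bit is 0, fails if it is 1.
CHILD = r"""
import sys
SECRET = b"flag"              # stand-in for the contents of /flag
i = int(sys.argv[1])          # attacker-chosen bit index
bit = (SECRET[i // 8] >> (i % 8)) & 1
sys.exit(1 if bit else 0)     # one observable bit per run
"""

def leak_bit(index):
    # The observer sees only whether the process ended cleanly.
    result = subprocess.run([sys.executable, "-c", CHILD, str(index)])
    return 0 if result.returncode == 0 else 1

def leak_bytes(length):
    # Repeat the one-bit attack eight times per byte, least significant bit first.
    out = bytearray()
    for byte_index in range(length):
        value = 0
        for bit in range(8):
            value |= leak_bit(byte_index * 8 + bit) << bit
        out.append(value)
    return bytes(out)

recovered = leak_bytes(4)
print(recovered)
```

Each run reveals a single bit, so a secret of n bytes costs 8n runs. Encoding a whole byte in the exit status, or a value in the runtime, cuts the number of runs at the price of a noisier or more visible channel.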