Syscall

Motivation 1: OS-level API

System calls (syscalls) are the OS-level "API" for implementing programs. For example, we can implement ls and a shell using syscalls:

ls

Implement ls:

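// ls.c (simplified sketch: on Linux, read() on a directory fails with
// EISDIR, and a real ls uses getdents64() instead)
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[512];
	ssize_t n;
	int fd = open(".", O_RDONLY);

	// buf is not NUL-terminated, so print exactly n bytes
	while ((n = read(fd, buf, sizeof(buf))) > 0)
	{
		printf("%.*s", (int)n, buf);
	}
	close(fd);
	return 0;
}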

The syscalls used are:

  • open()

  • read()

  • write(): printf() actually calls write(1, buf, len).

shell

Implement shell:

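// shell.c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	char cmd[100];

	while (1)
	{
		// Print the prompt, then read one command name
		printf("[/usr/root]# ");
		fflush(stdout);
		if (scanf("%99s", cmd) != 1)
			break;
		// Call execvp() in child
		if (!fork())
		{
			char *argv[] = {cmd, NULL};
			execvp(cmd, argv);
			exit(1); // reached only if execvp() fails
		}
		// Parent waits for the child to exit
		wait(NULL);
	}
	return 0;
}

The syscalls used are:

  • write(): printf() actually calls write(1, buf, len).

  • read(): scanf() actually calls read(0, buf, len).

  • fork()

  • exec(): execvp() is a library wrapper around the execve() syscall.

  • wait()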

Motivation 2: User Mode => Kernel Mode

The architecture of most modern processors, with the exception of some embedded systems, involves a security model. For example, the rings model specifies multiple privilege levels under which software may be executed: a program is usually limited to its own address space so that it cannot access or modify other running programs or the operating system itself, and is usually prevented from directly manipulating hardware devices (e.g. the frame buffer or network devices).

  • Ring 3 => User Mode

  • Ring 0 => Kernel Mode.

  • The smaller the number, the higher the privilege.

However, many applications need access to these components, so system calls are made available by the operating system to provide well-defined, safe implementations for such operations. The operating system executes at the highest level of privilege, and allows applications to request services via system calls, which are often initiated via interrupts. An interrupt automatically puts the CPU into some elevated privilege level and then passes control to the kernel, which determines whether the calling program should be granted the requested service. If the service is granted, the kernel executes a specific set of instructions over which the calling program has no direct control, returns the privilege level to that of the calling program, and then returns control to the calling program. Pictorially:

          Interrupt
          (syscall)
User Mode ---------> Kernel Mode

Syscalls

Almost all programs have to interact with the outside world! This is primarily done via system calls (man syscalls). Each system call is documented in section 2 of the man pages (e.g., man 2 open).

To trigger a system call on amd64:

  1. Set rax to the system call number.

  2. Store arguments in rdi, rsi, etc. (more on this later).

  3. Execute the syscall instruction.

Below are some important syscalls.

fork

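The fork() syscall creates a nearly identical copy of the calling process: the child gets its own copy of the parent's address space and registers and resumes at the same point in the program. The two mainly differ in their PIDs and in the return value of fork(). The original process is called the parent and the newly created process is called the child. Pictorially: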

                                parent
   parent (calls fork())      |---------------->
-------------------------------  child
                              |---------------->

If fork() fails, it returns -1 in the parent and no child is created. On success, fork() returns the PID of the child in the parent and 0 in the child. Therefore, we can distinguish parent and child with a simple if statement:

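// fork.c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	pid_t pid = fork();
	if (pid < 0)
	{
		perror("fork");
		return 1;
	}
	// pid == 0 => child
	if (pid == 0)
	{
		printf("child: %d\n", (int)pid);
		exit(0);
	}
	// pid > 0 => parent (pid is the child's PID)
	printf("parent: %d\n", (int)pid);
	return 0;
}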

wait

After a parent process calls fork(), it can call wait() to wait for the child to finish its execution. Definition:

pid_t wait(int *wstatus);

exec

The exec() syscall executes another program in the current process, keeping the same PID. You can think of fork() as creating a box containing some stuff and of exec() as replacing the stuff inside. Usually we call fork() and then call exec(), but these two syscalls shouldn't be merged into one. The reason will be explained in the "Process" section:

Preemptive Multitasking

exec() is actually a family of functions. The most common one in pwn is execve():

int execve(const char *pathname, char *const argv[], char *const envp[]);

Oftentimes we need to call execve("/bin/sh", 0, 0) to spawn a shell.

exit

The exit() syscall ends a process. Definition:

void exit(int status);

By convention, exit(0) means success and any nonzero status, such as exit(1), means error.

open

The open() syscall opens a file. Definition:

int open(const char *pathname, int flags, ... /* mode_t mode */);

The return value is a newly created file descriptor, or -1 on error.

read

The read() syscall reads the content from a file descriptor into a buffer. Definition:

ssize_t read(int fd, void *buf, size_t count);

The return value is the number of bytes actually read, 0 at end of file, or -1 on error.

write

The write() syscall writes the content from a buffer into a file descriptor. Definition:

ssize_t write(int fd, const void *buf, size_t count);

The return value is the number of bytes actually written, or -1 on error.

strace

We can trace process syscalls using strace. For example, we can run strace whoami and observe the output:

execve("/usr/bin/whoami", ["whoami"], 0x7fffcfdd2110 /* 60 vars */) = 0
brk(NULL)                               = 0x55b49282e000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffe94f4b640) = -1 EINVAL (Invalid argument)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=110721, ...}) = 0
mmap(NULL, 110721, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f47e990d000
close(3)                                = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360A\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32, 848) = 32
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\237\333t\347\262\27\320l\223\27*\202C\370T\177"..., 68, 880) = 68
fstat(3, {st_mode=S_IFREG|0755, st_size=2029560, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f47e990b000
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32, 848) = 32
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\237\333t\347\262\27\320l\223\27*\202C\370T\177"..., 68, 880) = 68
mmap(NULL, 2037344, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f47e9719000
mmap(0x7f47e973b000, 1540096, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x22000) = 0x7f47e973b000
mmap(0x7f47e98b3000, 319488, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x19a000) = 0x7f47e98b3000
mmap(0x7f47e9901000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1e7000) = 0x7f47e9901000
mmap(0x7f47e9907000, 13920, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f47e9907000
close(3)                                = 0
arch_prctl(ARCH_SET_FS, 0x7f47e990c580) = 0
mprotect(0x7f47e9901000, 16384, PROT_READ) = 0
mprotect(0x55b491e7a000, 4096, PROT_READ) = 0
mprotect(0x7f47e9956000, 4096, PROT_READ) = 0
munmap(0x7f47e990d000, 110721)          = 0
brk(NULL)                               = 0x55b49282e000
brk(0x55b49284f000)                     = 0x55b49284f000
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=5699248, ...}) = 0
mmap(NULL, 5699248, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f47e91a9000
close(3)                                = 0
geteuid()                               = 1000
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(3)                                = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(3)                                = 0
openat(AT_FDCWD, "/etc/nsswitch.conf", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=542, ...}) = 0
read(3, "# /etc/nsswitch.conf\n#\n# Example"..., 4096) = 542
read(3, "", 4096)                       = 0
close(3)                                = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=110721, ...}) = 0
mmap(NULL, 110721, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f47e990d000
close(3)                                = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libnss_files.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\3005\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0644, st_size=51832, ...}) = 0
mmap(NULL, 79672, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f47e9195000
mmap(0x7f47e9198000, 28672, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x7f47e9198000
mmap(0x7f47e919f000, 8192, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xa000) = 0x7f47e919f000
mmap(0x7f47e91a1000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xb000) = 0x7f47e91a1000
mmap(0x7f47e91a3000, 22328, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f47e91a3000
close(3)                                = 0
mprotect(0x7f47e91a1000, 4096, PROT_READ) = 0
munmap(0x7f47e990d000, 110721)          = 0
openat(AT_FDCWD, "/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
lseek(3, 0, SEEK_CUR)                   = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=2824, ...}) = 0
read(3, "root:x:0:0:root:/root:/usr/bin/z"..., 4096) = 2824
close(3)                                = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x3), ...}) = 0
write(1, "ret2basic\n", 10ret2basic
)             = 10
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++

The whoami command executes /usr/bin/whoami:

execve("/usr/bin/whoami", ["whoami"], 0x7fffcfdd2110 /* 60 vars */) = 0

Then it opens libc:

openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3

reads from /etc/passwd:

openat(AT_FDCWD, "/etc/passwd", O_RDONLY|O_CLOEXEC) = 3

and prints out my name:

write(1, "ret2basic\n", 10ret2basic
)             = 10
