# Syscall

## Motivation 1: OS-level API

System calls (syscalls) are the OS-level "API" that programs are built on. For example, we can implement `ls` and a shell using syscalls:

### ls

Implement ls:

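```c
// ls.c (simplified sketch)
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[512];
	ssize_t n;
	int fd = open(".", O_RDONLY);

	// Note: on Linux, read() on a directory fails with EISDIR;
	// a real ls reads directory entries with getdents64() instead.
	while ((n = read(fd, buf, sizeof(buf))) > 0)
	{
		// buf is not NUL-terminated, so print exactly n bytes
		printf("%.*s", (int)n, buf);
	}
	return 0;
}
```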

The syscalls used are:

* `open()`
* `read()`
* `write()`: `printf()` actually calls `write(1, buf, len)`.

### shell

Implement `shell`:

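```c
// shell.c (simplified sketch: runs commands without arguments)
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	char cmd[100];

	while (1)
	{
		printf("[/usr/root]# ");
		fflush(stdout);
		if (scanf("%99s", cmd) != 1)
			break; // EOF or read error

		// Call execvp() in the child
		if (fork() == 0)
		{
			char *argv[] = {cmd, NULL};
			execvp(cmd, argv);
			perror("execvp"); // only reached if execvp() fails
			exit(1);
		}

		wait(NULL); // the parent waits for the child to finish
	}
	return 0;
}
```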

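The main syscalls used are:

* `read()`: `scanf()` is a library function that eventually calls `read(0, buf, len)`.
* `fork()`
* `execve()`: `execvp()` is a library wrapper that searches `PATH` and then calls `execve()`.
* `wait()`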

## Motivation 2: User Mode => Kernel Mode

The architecture of most modern processors, with the exception of some embedded systems, involves a security model. For example, the **rings** model specifies multiple privilege levels under which software may be executed: a program is usually limited to its own address space so that it cannot access or modify other running programs or the operating system itself, and is usually prevented from directly manipulating hardware devices (e.g. the frame buffer or network devices).

{% hint style="info" %}

* **Ring 3 => User Mode**
* **Ring 0 => Kernel Mode.**
* The smaller the number, the higher the privilege.
  {% endhint %}

However, many applications need access to these components, so system calls are made available by the operating system to provide well-defined, safe implementations for such operations. The operating system executes at the highest level of privilege, and allows applications to request services via system calls, which are often initiated via **interrupts**. An interrupt automatically puts the CPU into some elevated privilege level and then passes control to the kernel, which determines whether the calling program should be granted the requested service. If the service is granted, the kernel executes a specific set of instructions over which the calling program has no direct control, returns the privilege level to that of the calling program, and then returns control to the calling program. Pictorially:

```
          Interrupt
          (syscall)
User Mode ---------> Kernel Mode
```

## Syscalls

Almost all programs have to interact with the outside world! This is primarily done via system calls (`man syscalls`). Each system call is documented in section 2 of the man pages (e.g., `man 2 open`).

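On amd64, a system call is triggered as follows:

1. Set `rax` to the system call number.
2. Store the arguments in `rdi`, `rsi`, `rdx`, `r10`, `r8`, and `r9`, in that order (more on this later).
3. Execute the `syscall` instruction. The return value comes back in `rax`.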

Below are some important syscalls.

### fork

The `fork()` syscall creates an almost identical copy of the calling process (the child gets its own PID and its own copy of the address space, and `fork()` returns a different value in each process). The original process is called the **parent** and the newly-created process is called the **child**. Pictorially:

```c
                                parent
   parent (calls fork())      |---------------->
-------------------------------  child
                              |---------------->
```

If `fork()` fails, it returns -1 and no child is created. On success, `fork()` returns the PID of the child in the parent and 0 in the child. Therefore, we can distinguish parent and child with a simple `if` statement:

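```c
// fork.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	pid_t pid = fork();

	if (pid < 0)
	{
		perror("fork");
		exit(1);
	}
	// pid == 0 => child
	if (pid == 0)
	{
		printf("child: %d\n", pid);
		exit(0);
	}
	// pid > 0 => parent (pid is the child's PID)
	printf("parent: %d\n", pid);
	return 0;
}
```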

### wait

After a parent process calls `fork()`, it can call `wait()` to block until the child finishes its execution. Definition:

```c
pid_t wait(int *stat_addr);
```

### exec

The `exec()` syscall replaces the program running in the current process with another program while keeping the same PID. You can think of `fork()` as creating a box containing some stuff and of `exec()` as replacing the stuff inside. Usually we call `fork()` and then `exec()`, but these two syscalls shouldn't be merged into one. The reason is explained in the "Process" section:

{% content-ref url="/pages/XEl3LFp7rnmwi9x1G9OT" %}
[Preemptive Multitasking](/ctfnote/computer-science/computer-systems/preemptive-multitasking.md)
{% endcontent-ref %}

`exec()` is really a family of functions. The most common one in pwn is `execve()`:

```c
int execve(const char *pathname, char *const argv[], char *const envp[]);
```

Often we need to call `execve("/bin/sh", 0, 0)` to spawn a shell.

### exit

The `exit()` syscall ends a process. Definition:

```c
void exit(int status);
```

By convention, `exit(0)` means success and any nonzero status (commonly `exit(1)`) means error.

### open

The `open()` syscall opens a file. Definition:

```c
int open(const char *pathname, int flags, ... /* mode_t mode */);
```

The return value is a newly-created file descriptor, or -1 on error.

### read

The `read()` syscall reads the content from a file descriptor into a buffer. Definition:

```c
ssize_t read(int fd, void *buf, size_t count);
```

The return value is the number of bytes actually read, which may be less than `count`. A return value of 0 means end of file and -1 means error.

### write

The `write()` syscall writes the content from a buffer into a file descriptor. Definition:

```c
ssize_t write(int fd, const void *buf, size_t count);
```

The return value is the number of bytes actually written, which may be less than `count`, or -1 on error.

## strace

We can trace the syscalls made by a process using `strace`. For example, we can run `strace whoami` and observe the output:

```c
execve("/usr/bin/whoami", ["whoami"], 0x7fffcfdd2110 /* 60 vars */) = 0
brk(NULL)                               = 0x55b49282e000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffe94f4b640) = -1 EINVAL (Invalid argument)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=110721, ...}) = 0
mmap(NULL, 110721, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f47e990d000
close(3)                                = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360A\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32, 848) = 32
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\237\333t\347\262\27\320l\223\27*\202C\370T\177"..., 68, 880) = 68
fstat(3, {st_mode=S_IFREG|0755, st_size=2029560, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f47e990b000
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32, 848) = 32
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\237\333t\347\262\27\320l\223\27*\202C\370T\177"..., 68, 880) = 68
mmap(NULL, 2037344, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f47e9719000
mmap(0x7f47e973b000, 1540096, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x22000) = 0x7f47e973b000
mmap(0x7f47e98b3000, 319488, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x19a000) = 0x7f47e98b3000
mmap(0x7f47e9901000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1e7000) = 0x7f47e9901000
mmap(0x7f47e9907000, 13920, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f47e9907000
close(3)                                = 0
arch_prctl(ARCH_SET_FS, 0x7f47e990c580) = 0
mprotect(0x7f47e9901000, 16384, PROT_READ) = 0
mprotect(0x55b491e7a000, 4096, PROT_READ) = 0
mprotect(0x7f47e9956000, 4096, PROT_READ) = 0
munmap(0x7f47e990d000, 110721)          = 0
brk(NULL)                               = 0x55b49282e000
brk(0x55b49284f000)                     = 0x55b49284f000
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=5699248, ...}) = 0
mmap(NULL, 5699248, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f47e91a9000
close(3)                                = 0
geteuid()                               = 1000
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(3)                                = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(3)                                = 0
openat(AT_FDCWD, "/etc/nsswitch.conf", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=542, ...}) = 0
read(3, "# /etc/nsswitch.conf\n#\n# Example"..., 4096) = 542
read(3, "", 4096)                       = 0
close(3)                                = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=110721, ...}) = 0
mmap(NULL, 110721, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f47e990d000
close(3)                                = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libnss_files.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\3005\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0644, st_size=51832, ...}) = 0
mmap(NULL, 79672, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f47e9195000
mmap(0x7f47e9198000, 28672, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x7f47e9198000
mmap(0x7f47e919f000, 8192, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xa000) = 0x7f47e919f000
mmap(0x7f47e91a1000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xb000) = 0x7f47e91a1000
mmap(0x7f47e91a3000, 22328, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f47e91a3000
close(3)                                = 0
mprotect(0x7f47e91a1000, 4096, PROT_READ) = 0
munmap(0x7f47e990d000, 110721)          = 0
openat(AT_FDCWD, "/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
lseek(3, 0, SEEK_CUR)                   = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=2824, ...}) = 0
read(3, "root:x:0:0:root:/root:/usr/bin/z"..., 4096) = 2824
close(3)                                = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x3), ...}) = 0
write(1, "ret2basic\n", 10ret2basic
)             = 10
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++
```

First, `/usr/bin/whoami` is executed via `execve()`:

```c
execve("/usr/bin/whoami", ["whoami"], 0x7fffcfdd2110 /* 60 vars */) = 0
```

Then it opens libc:

```c
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
```

reads from `/etc/passwd`:

```c
openat(AT_FDCWD, "/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
```

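and prints out my username (the program's own output is interleaved with the trace, which is why `ret2basic` appears in the middle of the line):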

```c
write(1, "ret2basic\n", 10ret2basic
)             = 10
```

## Reference

{% embed url="<https://en.wikipedia.org/wiki/System_call>" %}
System call - Wikipedia
{% endembed %}

