The Linux Programming Interface

The Linux Kernel

Kernel: The central software that manages and allocates computer resources (i.e., the CPU, RAM, and devices).

Although it is possible to run programs on a computer without a kernel, the presence of a kernel greatly simplifies the writing and use of other programs, and increases the power and flexibility available to programmers. The kernel does this by providing a software layer to manage the limited resources of a computer.

Kernel Functionalities

  • Process scheduling

    • A computer has one or more CPUs, which execute the instructions of programs.

    • Like other UNIX systems, Linux is a preemptive multitasking OS.

    • Multitasking means that multiple processes (i.e., running programs) can simultaneously reside in memory and each may receive use of the CPUs.

    • Preemptive means that the rules governing which processes receive use of the CPU and for how long are determined by the kernel process scheduler (rather than by the processes themselves).

  • Memory management

    • Like most modern operating systems, Linux employs virtual memory management, a technique that confers two main advantages.

    • Advantage 1: processes are isolated from one another and from the kernel, so that one process can't read or modify the memory of another process or the kernel.

    • Advantage 2: only part of a process needs to be kept in memory, thereby lowering the memory requirements of each process and allowing more processes to be held in RAM simultaneously. This leads to better CPU utilization, since it increases the likelihood that, at any moment in time, there is at least one process that the CPUs can execute.

  • Provision of a file system

    • The kernel provides a file system on disk, allowing files to be created, retrieved, updated, deleted, and so on.

  • Creation and termination of processes

    • The kernel can load a new program into memory, providing it with the resources (e.g., CPU, memory, and access to files) that it needs in order to run.

    • Such an instance of a running program is termed a process.

    • Once a process has completed execution, the kernel ensures that the resources it uses are freed for subsequent reuse by later programs.

  • Access to devices

    • The devices (mice, monitors, keyboards, disk and tape drives, and so on) attached to a computer allow communication of information between the computer and the outside world, permitting I/O.

    • The kernel provides programs with an interface that standardizes and simplifies access to devices, while at the same time arbitrating access by multiple processes to each device.

  • Networking

    • The kernel transmits and receives network messages (packets) on behalf of user processes. This task includes routing of network packets to the target system.

  • Provision of a syscall API

    • Processes can request the kernel to perform various tasks using kernel entry points known as system calls (syscalls).

User Mode vs. Kernel Mode

  • Modern processor architectures typically allow the CPU to operate in at least two different modes: user mode and kernel mode. Hardware instructions allow switching from one mode to the other.

  • When running in user mode, the CPU can access only memory that is marked as being in user space; attempts to access memory in kernel space result in a hardware exception.

  • When running in kernel mode, the CPU can access both user and kernel memory space.

  • Certain operations can be performed only while the processor is operating in kernel mode. Examples include:

    • executing the halt instruction to stop the system

    • accessing the memory-management hardware

    • initiating device I/O operations.

Process Views vs. Kernel Views

The kernel knows and controls everything:

  • A process can create another process => A process can request the kernel to create another process.

  • A process can create a pipe => A process can request the kernel to create a pipe

  • A process can write data to a file => A process can request the kernel to write data to a file

  • A process can terminate by calling exit() => A process can request that the kernel terminate it by calling exit()

File I/O Model

Universality of I/O: The same syscalls (open(), read(), write(), close(), and so on) are used to perform I/O on all types of files, including devices. Thus, a program employing these syscalls will work on any type of file.

The kernel essentially provides one file type: a sequential stream of bytes, which, in the case of disk files, disks, and tape devices, can be randomly accessed using the lseek() syscall.
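As a minimal sketch of this universality, the helper below copies bytes between any two descriptors using only read() and write(). Nothing in it depends on whether the descriptors refer to a regular file, a pipe, or a terminal. The name copy_fd and the 4096-byte buffer size are illustrative assumptions, not part of the original notes.

```c
#include <errno.h>
#include <unistd.h>

/* Copy everything readable from in_fd to out_fd.
 * Returns the number of bytes copied, or -1 on error.
 * copy_fd is a hypothetical helper name used only for illustration. */
ssize_t copy_fd(int in_fd, int out_fd)
{
    char buf[4096];
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n == 0)
            return total;              /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;              /* interrupted by a signal: retry */
            return -1;
        }

        /* write() may transfer fewer bytes than requested, so loop. */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

Calling copy_fd(STDIN_FILENO, STDOUT_FILENO) from main() would behave like a bare-bones cat, whatever the shell has connected to those two descriptors.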

File Descriptors

The I/O syscalls refer to open files using a file descriptor, a (usually small) non-negative integer. A file descriptor is typically obtained by a call to open(), which takes a pathname argument specifying a file upon which I/O is to be performed.

Normally, a process inherits 3 open file descriptors when it is started by the shell:

  • Descriptor 0

    • standard input (stdin)

    • The file from which the process takes its input

  • Descriptor 1

    • standard output (stdout)

    • The file to which the process writes its output

  • Descriptor 2

    • standard error (stderr)

    • The file to which the process writes error messages and notification of exceptional or abnormal conditions.
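A short sketch of how a descriptor is obtained and how the three standard descriptors are used alongside it. The function name open_and_report is hypothetical. The one behavior relied on here, that open() returns the lowest-numbered unused descriptor, is standard POSIX behavior.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Open path read-only and report which descriptor the kernel handed back.
 * Returns the descriptor (the caller closes it), or -1 on failure.
 * open_and_report is a hypothetical helper name. */
int open_and_report(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1) {
        perror("open");        /* error text goes to descriptor 2 (stderr) */
        return -1;
    }

    /* With 0, 1, and 2 already open, the first open() typically yields 3,
     * because the kernel assigns the lowest unused descriptor number. */
    printf("%s opened as descriptor %d\n", path, fd);   /* descriptor 1 */
    return fd;
}
```

The constants STDIN_FILENO, STDOUT_FILENO, and STDERR_FILENO from <unistd.h> name descriptors 0, 1, and 2, which avoids hard-coding the numbers.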

The stdio Library

To perform file I/O, C programs typically employ I/O functions contained in the standard C library. This set of functions, referred to as the stdio library, includes:

  • fopen()

  • fclose()

  • scanf()

  • printf()

  • fgets()

  • fputs()

The stdio functions are layered on top of the I/O syscalls:

  • open()

  • close()

  • read()

  • write()
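The layering can be sketched as follows: fileno() exposes the descriptor underneath a stdio stream, and data buffered by stdio reaches the kernel only when the library calls write() on that descriptor. The function name layered_output is an assumption made for this example.

```c
#include <stdio.h>
#include <unistd.h>

/* Send one line through stdio and one directly through the syscall,
 * both ending up on the same descriptor.
 * Returns that descriptor, or -1 on error. */
int layered_output(FILE *stream)
{
    int fd = fileno(stream);           /* descriptor beneath the stream */
    if (fd == -1)
        return -1;

    fputs("via stdio (buffered in user space)\n", stream);
    if (fflush(stream) == EOF)         /* hands the buffered bytes to write() */
        return -1;

    const char msg[] = "via write() (straight to the kernel)\n";
    if (write(fd, msg, sizeof msg - 1) == -1)
        return -1;

    return fd;
}
```

Without the fflush() call, the two lines could appear out of order when the stream is fully buffered, since the stdio text would still be sitting in the library's user-space buffer when write() runs.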

Processes

Process: A process is an instance of an executing program.

Process Memory Layout

A process is logically divided into the following parts, known as segments:

  • Text (.text): the instructions of the program.

  • Data (.data for initialized and .bss for uninitialized): the static variables used by the program.

  • Heap: an area from which programs can dynamically allocate extra memory.

  • Stack: a piece of memory that grows and shrinks as functions are called and return and that is used to allocate storage for local variables and function call linkage information.
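To make the segments concrete, the sketch below prints one representative address from each. The variable and function names are illustrative. The exact addresses vary from run to run on systems with address-space randomization, so no particular values should be expected.

```c
#include <stdio.h>
#include <stdlib.h>

int initialized_global = 42;   /* initialized data segment (.data) */
int uninitialized_global;      /* uninitialized data segment (.bss), zero at start */

/* Print one representative address from each segment.
 * Returns 0 on success, -1 if the heap allocation fails. */
int print_layout(void)
{
    int local = 0;                            /* lives on the stack */
    int *dynamic = malloc(sizeof *dynamic);   /* lives on the heap */
    if (dynamic == NULL)
        return -1;

    /* Converting a function pointer to void * is not strictly ISO C,
     * but it is accepted by the usual Linux compilers. */
    printf("text  : %p\n", (void *)print_layout);
    printf("data  : %p\n", (void *)&initialized_global);
    printf("bss   : %p\n", (void *)&uninitialized_global);
    printf("heap  : %p\n", (void *)dynamic);
    printf("stack : %p\n", (void *)&local);

    free(dynamic);
    return 0;
}
```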

Process Creation and Program Execution

A process can create a new process using the fork() system call. The process that calls fork() is referred to as the parent process, and the new process is referred to as the child process.

The kernel creates the child process by making a duplicate of the parent process. The child inherits copies of the parent's data, stack, and heap segments, which it may then modify independently of the parent's copies. The program text, which is placed in memory marked as read-only, is shared by the two processes.

The child process goes on either to execute a different set of functions in the same code as the parent, or, frequently, to use the execve() syscall to load and execute an entirely new program. An execve() call destroys the existing text, data, stack, and heap segments, replacing them with new segments based on the code of the new program.
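A minimal sketch of the fork-then-exec pattern described above. The helper name run_program and the fallback exit status 127 (a shell convention for "command not found") are assumptions made for this example.

```c
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;   /* the caller's environment, passed to the new program */

/* Fork a child, replace its program image with the one at path, and
 * return the child's exit status (or -1 on failure).
 * run_program is a hypothetical helper name. */
int run_program(const char *path, char *const argv[])
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: still running the parent's code until execve() succeeds,
         * at which point text, data, stack, and heap are all replaced. */
        execve(path, argv, environ);
        _exit(127);          /* reached only if execve() failed */
    }

    /* Parent: fork() returned the child's PID. */
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, with char *args[] = { "sh", "-c", "exit 3", NULL }, the call run_program("/bin/sh", args) returns 3 on a system where /bin/sh exists.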

Process ID and Parent Process ID

Each process has a unique integer process identifier (PID). Each process also has a parent process identifier (PPID) attribute, which identifies the process that requested the kernel to create this process.
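A small sketch of the PID/PPID relationship, using getpid() and getppid(). The function name child_sees_parent is hypothetical. The check holds here because the parent is still alive, waiting for the child, while the child runs.

```c
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the child's PPID equals the parent's PID,
 * 0 if it does not, and -1 on error. */
int child_sees_parent(void)
{
    pid_t parent = getpid();       /* recorded before fork(), so copied to the child */

    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0)                  /* child: compare its PPID with the saved PID */
        _exit(getppid() == parent ? 0 : 1);

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```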

Process Termination and Termination Status

A process can terminate in one of two ways:

  1. By requesting its own termination using the _exit() system call (or the related exit() library function)

  2. By being killed by the delivery of a signal.

In either case, the process yields a termination status, a small nonnegative integer value that is available for inspection by the parent process using the wait() system call. In the case of a call to _exit(), the process explicitly specifies its own termination status. If a process is killed by a signal, the termination status is set according to the type of signal that caused the death of the process.

By convention, a termination status of 0 indicates that the process succeeded, and a nonzero status indicates that some error occurred. Most shells make the termination status of the last executed program available via a shell variable named $?.
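The two ways of terminating can be told apart in the parent with the macros from <sys/wait.h>. The sketch below is an illustration under assumed names (child_status and print_outcome are not from the original notes): the child either calls _exit() with a chosen status or kills itself with a signal.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that calls _exit(code), or, if sig is nonzero, sends
 * itself that signal first. Returns the raw wait status, or -1 on error. */
int child_status(int code, int sig)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        if (sig != 0)
            raise(sig);        /* a fatal signal ends the child here */
        _exit(code);           /* otherwise the child picks its own status */
    }

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return status;
}

/* Decode a wait status the way a shell would before setting $?. */
void print_outcome(int status)
{
    if (WIFEXITED(status))
        printf("exited with status %d\n", WEXITSTATUS(status));
    else if (WIFSIGNALED(status))
        printf("killed by signal %d\n", WTERMSIG(status));
}
```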

Process User and Group Identifiers (Credentials)

Each process has a number of associated user IDs (UIDs) and group IDs (GIDs). These include:

  • Real user ID (rUID) and real group ID (rGID): These identify the user and group to which the process belongs. A new process inherits these IDs from its parent. A login shell gets its real user ID and real group ID from the corresponding fields in the system password file.

  • Effective user ID (eUID) and effective group ID (eGID): These two IDs (in conjunction with the supplementary group IDs discussed in a moment) are used in determining the permissions that the process has when accessing protected resources such as files and interprocess communication objects. Typically, the process's effective IDs have the same values as the corresponding real IDs. Changing the effective IDs is a mechanism that allows a process to assume the privileges of another user or group, as described in a moment.

  • Supplementary group IDs: These IDs identify additional groups to which a process belongs. A new process inherits its supplementary group IDs from its parent. A login shell gets its supplementary group IDs from the system group file.

Privileged Processes

Privileged process => eUID == 0. Such a process bypasses the permission restrictions normally applied by the kernel.

By contrast, the term unprivileged (or nonprivileged) is applied to processes run by other users. Such processes have a nonzero eUID and must abide by the permission rules enforced by the kernel.

A process may be privileged because it was created by another privileged process, for example, by a login shell started by root. Another way a process may become privileged is via the Set-UID mechanism, which allows a process to assume an eUID that is the same as the UID of the program file that it is executing.

Capabilities

Since kernel 2.2, Linux divides the privileges traditionally accorded to the superuser into a set of distinct units called capabilities. Each privileged operation is associated with a particular capability, and a process can perform an operation only if it has the corresponding capability.

root process = all capabilities enabled

Granting a subset of capabilities to a process allows it to perform some of the operations normally permitted to the superuser, while preventing it from performing others.

The init Process

When booting the system, the kernel creates a special process called init, the "parent of all processes", which is derived from the program file /sbin/init. All processes on the system are created (using fork()) either by init or by one of its descendants. The init process always has the process ID 1 and runs with superuser privileges. The init process can't be killed (not even by the superuser), and it terminates only when the system is shut down. The main task of init is to create and monitor a range of processes required by a running system.

Further Information

Read the "Processes" section:

Memory Mappings

mmap: The mmap() syscall creates a new memory mapping in the calling process's virtual address space.

Mappings fall into two categories:

  • A file mapping maps a region of a file into the calling process's virtual memory. Once mapped, the file's contents can be accessed by operations on the bytes in the corresponding memory region. The pages of the mapping are automatically loaded from the file as required.

  • By contrast, an anonymous mapping doesn't have a corresponding file. Instead, the pages of the mapping are initialized to 0.

The memory in one process's mapping may be shared with mappings in other processes. This can occur either because two processes map the same region of a file or because a child process created by fork() inherits a mapping from its parent.

When two or more processes share the same pages, each process may see the changes made by other processes to the contents of the pages, depending on whether the mapping is created as private or shared:

  • When a mapping is private, modifications to the contents of the mapping are not visible to other processes and are not carried through to the underlying file.

  • When a mapping is shared, modifications to the contents of the mapping are visible to other processes sharing the same mapping and are carried through to the underlying file.
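The difference between private and shared mappings can be sketched with an anonymous mapping inherited across fork(). The function name value_after_child_write is hypothetical. MAP_ANONYMOUS is a widely supported flag that requests a mapping with no backing file.

```c
#define _DEFAULT_SOURCE            /* exposes MAP_ANONYMOUS in glibc */
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create an anonymous mapping with the given visibility (MAP_SHARED or
 * MAP_PRIVATE), let a child write 42 into it, and return the value the
 * parent sees afterwards (or -1 on error). */
int value_after_child_write(int visibility)
{
    int *p = mmap(NULL, sizeof *p, PROT_READ | PROT_WRITE,
                  visibility | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    /* Anonymous pages start zero-filled, so *p is 0 at this point. */

    pid_t pid = fork();            /* the child inherits the mapping */
    if (pid == -1) {
        munmap(p, sizeof *p);
        return -1;
    }
    if (pid == 0) {
        *p = 42;                   /* visible to the parent only if shared */
        _exit(0);
    }

    int seen = -1;
    if (waitpid(pid, NULL, 0) != -1)
        seen = *p;
    munmap(p, sizeof *p);
    return seen;
}
```

With MAP_SHARED the parent reads back the child's 42. With MAP_PRIVATE the child's write lands on its own copy of the page, so the parent still reads 0.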

Further Information

Read the "Memory Mappings" section:

Interprocess Communication and Synchronization

A running Linux system consists of numerous processes, many of which operate independently of each other. Some processes, however, cooperate to achieve their intended purposes, and these processes need methods of communicating with one another and synchronizing their actions.

One way for processes to communicate is by reading and writing information in disk files. However, for many applications, this is too slow and inflexible. Therefore, Linux, like all modern UNIX implementations, provides a rich set of mechanisms for interprocess communication (IPC), including the following:

  • Signals, which are used to indicate that an event has occurred;

  • Pipes (familiar to shell users as the | operator) and FIFOs, which can be used to transfer data between processes;

  • Sockets, which can be used to transfer data from one process to another, either on the same host computer or on different hosts connected by a network;

  • File Locking, which allows a process to lock regions of a file in order to prevent other processes from reading or updating the file contents;

  • Message Queues, which are used to exchange messages (packets of data) between processes;

  • Semaphores, which are used to synchronize the actions of processes;

  • Shared Memory, which allows two or more processes to share a piece of memory. When one process changes the contents of the shared memory, all of the other processes can immediately see the changes.
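As one illustration from this list, the sketch below transfers a message from a child process to its parent through a pipe. The function name pipe_roundtrip and its buffer-handling conventions are assumptions made for this example.

```c
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Send msg from a child process to the parent through a pipe.
 * Stores up to cap - 1 bytes plus a terminating '\0' in out and
 * returns the number of bytes received, or -1 on error. */
ssize_t pipe_roundtrip(const char *msg, char *out, size_t cap)
{
    int fds[2];                    /* fds[0]: read end, fds[1]: write end */
    if (cap == 0 || pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {                /* child: writer */
        close(fds[0]);
        size_t len = strlen(msg);
        _exit(write(fds[1], msg, len) == (ssize_t)len ? 0 : 1);
    }

    /* Parent: reader. It must close its copy of the write end,
     * otherwise read() would never report end of file. */
    close(fds[1]);
    size_t total = 0;
    while (total < cap - 1) {
        ssize_t n = read(fds[0], out + total, cap - 1 - total);
        if (n <= 0)
            break;                 /* end of file or error */
        total += (size_t)n;
    }
    out[total] = '\0';

    close(fds[0]);
    waitpid(pid, NULL, 0);
    return (ssize_t)total;
}
```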

Further Information

Read the "Interprocess Communication" section:

Signals

Signals are often described as "software interrupts". The arrival of a signal informs a process that some event or exceptional condition has occurred. There are various types of signals, each of which identifies a different event or condition. Each signal type is identified by a different integer, defined with symbolic names of the form SIGxxxx.

Signals are sent to a process by the kernel, by another process (with suitable permissions), or by the process itself. For example, the kernel may send a signal to a process when one of the following occurs:

  • The user typed the interrupt character (usually Control-C) on the keyboard

  • One of the process's children has terminated

  • A timer (alarm clock) set by the process has expired

  • The process attempted to access an invalid memory address

Within the shell, the kill command can be used to send a signal to a process. The kill() system call provides the same facility within programs.

When a process receives a signal, it takes one of the following default actions, depending on the signal:

  • It ignores the signal

  • It is killed by the signal

  • It is suspended until later being resumed by receipt of a special-purpose signal

For most signal types, instead of accepting the default signal action, a program can choose to ignore the signal (useful if the default action for the signal is something other than being ignored), or to establish a signal handler. A signal handler is a programmer-defined function that is automatically invoked when the signal is delivered to the process. This function performs some action appropriate to the condition that generated the signal.

In the interval between the time it is generated and the time it is delivered, a signal is said to be pending for a process. Normally, a pending signal is delivered as soon as the receiving process is next scheduled to run, or immediately if the process is already running. However, it is also possible to block a signal by adding it to the process's signal mask. If a signal is generated while it is blocked, it remains pending until it is later unblocked (i.e., removed from the signal mask).

Further Information

Read the "Signals" section:

Threads

Threads can be pictured as a set of processes that share the same virtual memory, as well as a range of other attributes. Each thread executes the same program code and shares the same data area and heap. However, each thread has its own stack containing local variables and function call linkage information.

Threads can communicate with each other via the global variables that they share. The threading API provides condition variables and mutexes, which are primitives that enable the threads of a process to communicate and synchronize their actions, in particular, their use of shared variables. Threads can also communicate with one another using IPC and synchronization mechanisms.

The primary advantages of using threads are that they make it easy to share data (via global variables) between cooperating threads and that some algorithms transpose more naturally to a multithreaded implementation than to a multiprocess implementation. Furthermore, a multithreaded application can transparently take advantage of the possibilities for parallel processing on multiprocessor hardware.
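A minimal sketch of threads sharing a global variable and using a mutex to synchronize access to it, assuming the POSIX threads (pthreads) API. The names worker and run_workers and the constants are illustrative. On older toolchains the program may need to be compiled with the -pthread option.

```c
#include <pthread.h>

#define NTHREADS 4
#define NITER    100000

static long counter = 0;                                  /* shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* protects counter */

/* Each thread increments the shared counter NITER times. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITER; i++) {     /* i is on this thread's own stack */
        pthread_mutex_lock(&lock);
        counter++;                        /* read-modify-write under the mutex */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Start NTHREADS workers, wait for them, and return the final count,
 * or -1 if not all threads could be created. */
long run_workers(void)
{
    pthread_t tids[NTHREADS];
    int started = 0;

    counter = 0;
    while (started < NTHREADS &&
           pthread_create(&tids[started], NULL, worker, NULL) == 0)
        started++;

    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    return started == NTHREADS ? counter : -1;
}
```

Without the mutex, the increments from different threads could interleave and some updates would be lost, so the final count would often fall short of NTHREADS * NITER.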

Further Information

Read the "Threads" section:

Syscalls

A syscall is a controlled entry point into the kernel, allowing a process to request that the kernel perform some action on the process's behalf.

The kernel makes a range of services accessible to programs via the syscall API. These services include, for example, creating a new process, performing I/O, and creating a pipe for interprocess communication. Before going into the details of how a system call works, we note some general points:

  • A syscall changes the processor state from user mode to kernel mode, so that the CPU can access protected kernel memory.

  • The set of system calls is fixed. Each system call is identified by a unique number.

  • Each system call may have a set of arguments that specify information to be transferred from user space (i.e., the process's virtual address space) to kernel space and vice versa.

From a programming point of view, invoking a syscall looks much like calling a C function. However, behind the scenes, many steps occur during the execution of a system call. To illustrate this, we consider the steps in the order that they occur on a specific hardware implementation, the x86-32. The steps are as follows:

  1. The application program makes a syscall by invoking a wrapper function in the C library.

  2. The wrapper function must make all of the syscall arguments available to the syscall trap-handling routine (described shortly). These arguments are passed to the wrapper via the stack, but the kernel expects them in specific registers. The wrapper function copies the arguments to these registers.

  3. Since all syscalls enter the kernel in the same way, the kernel needs some method of identifying the system call. To permit this, the wrapper function copies the syscall number into %eax.

  4. The wrapper function executes a trap machine instruction (int 0x80), which causes the processor to switch from user mode to kernel mode and execute code pointed to by location 0x80 (128 decimal) of the system’s trap vector.

  5. In response to the trap to location 0x80, the kernel invokes its system_call() routine to handle the trap. This handler:

    • Saves register values onto the kernel stack.

    • Checks the validity of the syscall number.

    • Invokes the appropriate syscall service routine, which is found by using the syscall number to index a table of all syscall service routines (the kernel variable sys_call_table). If the syscall service routine has any arguments, it first checks their validity; for example, it checks that addresses point to valid locations in user memory. Then the service routine performs the required task, which may involve modifying values at addresses specified in the given arguments and transferring data between user memory and kernel memory (e.g., in I/O operations). Finally, the service routine returns a result status to the system_call() routine.

    • Restores register values from the kernel stack and places the syscall return value on the stack.

    • Returns to the wrapper function, simultaneously returning the processor to user mode.

  6. If the return value of the syscall service routine indicated an error, the wrapper function sets the global variable errno using this value. The wrapper function then returns to the caller, providing an integer return value indicating the success or failure of the syscall.
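Two points from the steps above can be observed from user space without writing assembly: syscalls are identified by number, and the wrapper reports failure through errno. The sketch below uses the glibc syscall() function, which is a generic wrapper and not the int 0x80 sequence itself. The function names raw_getpid and errno_from_bad_close are hypothetical.

```c
#define _DEFAULT_SOURCE            /* exposes syscall() in glibc */
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Invoke getpid by its syscall number through the generic wrapper. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);    /* SYS_getpid is the number for getpid */
}

/* Make a syscall that is certain to fail and return the errno value
 * that the wrapper function set. */
int errno_from_bad_close(void)
{
    errno = 0;
    if (close(-1) == -1)           /* -1 is never a valid descriptor */
        return errno;              /* EBADF: bad file descriptor */
    return 0;
}
```

The value returned by raw_getpid() matches what the ordinary getpid() wrapper returns, since both end up in the same kernel service routine.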

The following diagram illustrates the above sequence using the example of the execve() syscall:
