From fread to syscall
Introduction: The Performance Chasm Between Abstraction and the Kernel
In software development, it is a common experience to write a piece of code that seems simple, only to discover it contains a surprising performance bottleneck. This report embarks on such a journey—a deep-dive investigation into the seemingly straightforward task of generating random bytes. The analysis peels back the layers of abstraction, starting from a familiar C standard library function and descending through the operating system's kernel interface, the intricacies of the compilation toolchain, and finally to the bare metal of CPU-specific instructions.
The central thesis of this investigation is the essential trade-off in systems programming between the convenience, portability, and safety of high-level abstractions and the raw performance and control offered by low-level interfaces. This is not a simple declaration that "low-level is faster," but rather a nuanced exploration of why certain approaches are faster, what hidden costs they entail, and how an engineer must navigate these trade-offs to build efficient, robust, and secure software.
Part 1: The Deceptive Simplicity of Standard Library I/O
The Canonical C Approach
For a C programmer on a Unix-like system, a natural and idiomatic starting point for obtaining random data is to read from the special device file `/dev/urandom`. This file provides an interface to the kernel's cryptographically secure pseudorandom number generator (CSPRNG). The following implementation represents a well-engineered, portable approach using standard file I/O.
```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

void randombytes(uint8_t *out, size_t outlen) {
    FILE *f = fopen("/dev/urandom", "rb");
    if (!f) {
        arc4random_buf(out, outlen);
        return;
    }
    size_t ret = fread(out, 1, outlen, f);
    if (ret != outlen) {
        fprintf(stderr, "fread() failed: %zu\n", ret);
        exit(EXIT_FAILURE);
    }
    fclose(f);
}
```
This code demonstrates good software design principles. It is portable, relying on the standard C library, and robust, providing a fallback mechanism and user-friendly error messages.
Deconstructing the Performance Bottleneck
Despite its robust design, when placed in a benchmark loop of one million calls, the performance of this C implementation is shockingly poor.
```
Total elapsed time: 12.8768 seconds
Average time per call: 12876.77 nanoseconds
```
Nearly 13 seconds to complete. The reason for this inefficiency is a classic performance anti-pattern: calling `fopen()` and `fclose()` within a tight loop. The performance penalty does not stem from the data transfer itself (`fread`) but from the immense overhead of repeatedly setting up and tearing down the entire file I/O apparatus for every single call.

Each `fopen`/`fclose` cycle initiates a full resource lifecycle involving two expensive context switches to transition between user mode and kernel mode. The kernel must parse the file path, verify permissions, and allocate a file descriptor, while the C library allocates and manages a `FILE` struct containing a user-space I/O buffer. This implementation pattern also completely negates the C library's primary performance optimization: buffering. The library is designed to perform large reads from the kernel into its buffer and serve smaller `fread()` requests from this fast, local cache. By calling `fclose()` immediately, the program destroys this buffer after every read, forcing a new, expensive trip to the kernel for the very next request.
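The anti-pattern also points to its own fix: perform the expensive setup once and let the library's buffering work across calls. A hedged sketch keeping this report's exit-on-error style (the BSD fallback is omitted for brevity; a production version would also handle `fork` and thread safety):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch: open /dev/urandom once and keep the stream alive, so the
 * FILE buffer survives between calls instead of being destroyed. */
void randombytes(uint8_t *out, size_t outlen) {
    static FILE *f = NULL;
    if (!f) {
        f = fopen("/dev/urandom", "rb");
        if (!f) {
            perror("fopen");
            exit(EXIT_FAILURE);
        }
    }
    if (fread(out, 1, outlen, f) != outlen) {
        fprintf(stderr, "fread() failed\n");
        exit(EXIT_FAILURE);
    }
}
```

With the stream kept open, consecutive small requests are served from the `FILE` buffer and only occasionally hit the kernel.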
Part 2: Speaking Directly to the Kernel
To eliminate the abstraction overhead, it is necessary to bypass the C library entirely and communicate directly with the Linux kernel using its native interface: the system call.
The Modern Interface: The `getrandom` System Call
The `getrandom` system call, introduced in Linux kernel 3.17, is the modern, purpose-built interface for obtaining random bytes. It is superior to reading from `/dev/urandom` because it does not consume a file descriptor (a finite resource), it functions correctly in `chroot` jails, and it blocks until the kernel's entropy pool is sufficiently initialized, preventing a critical class of security vulnerabilities.
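Before glibc 2.25 shipped a wrapper for it, the call could be issued through the generic `syscall(2)` interface. A sketch (the retry loop is a defensive assumption: `getrandom` may return fewer bytes than requested for large buffers or when interrupted by a signal):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch: invoke the getrandom system call via syscall(2). */
void randombytes(uint8_t *out, size_t outlen) {
    while (outlen > 0) {
        long n = syscall(SYS_getrandom, out, outlen, 0);
        if (n < 0) {            /* e.g. interrupted by a signal */
            perror("getrandom");
            exit(EXIT_FAILURE);
        }
        out += n;               /* advance past the bytes we got   */
        outlen -= (size_t)n;    /* and ask again for the remainder */
    }
}
```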
The C Wrapper (`glibc`) Implementation
Before diving into pure assembly, let's look at the standard way to use this modern interface from C. The `glibc` library provides a simple wrapper function.
```c
#include <stdint.h>
#include <sys/random.h>

void randombytes(uint8_t *out, size_t outlen) {
    getrandom(out, outlen, 0);
}
```
This code is clean, direct, and uses the recommended modern API. Its performance is a world apart from the `fopen`/`fread` method.
```
Total elapsed time: 0.1175 seconds
Average time per call: 117.51 nanoseconds
Throughput: 8510072 calls/second
```
This is a massive improvement, demonstrating that the primary bottleneck was indeed the file I/O management, not the act of getting random data itself. The `glibc` getrandom function is a thin wrapper that directly invokes the underlying syscall, acting as a lightweight and portable way to access the kernel's functionality.
Anatomy of a Syscall in x86-64 Assembly
For the ultimate level of control, we can implement the same logic in pure assembly. This code is brutally efficient, performing only the requested action.
```nasm
section .text
global randombytes
randombytes:
    mov eax, 318        ; syscall number for getrandom on x86-64
                        ; rdi and rsi are pre-loaded with 'out'
                        ; and 'outlen' by the C caller
    xor edx, edx        ; flags = 0
    syscall
    ; exit(1) if getrandom did not fill the entire buffer
    cmp rax, rsi
    jne .Lerror
    ret
.Lerror:
    mov edi, 1          ; exit status
    mov eax, 60         ; syscall number for exit
    syscall
```
```
Total elapsed time: 0.2994 seconds
Average time per call: 299.44 nanoseconds
```
From 12.8 seconds to 0.3 seconds: a speedup of over 40x. This is achieved by replacing the high-level, stateful file I/O abstraction with a single, direct, stateless request to the kernel. The code adheres to the x86-64 System V ABI, loading the syscall number (318) into `rax` and using the `syscall` instruction to trigger the kernel transition. The kernel places the return value in `rax`, which is then compared against the requested number of bytes to ensure success.
A Tale of Two Philosophies
The vast difference in performance and features between the initial C and assembly versions reflects a fundamental design tension. The perceived "bloat" of the `fopen` approach is the price of providing a portable, robust, and user-friendly abstraction.
| Feature | C Version (`fread` on `/dev/urandom`) | Assembly Version (direct `getrandom` syscall) |
|---|---|---|
| Core Method | High-level file I/O via libc | Direct kernel request via `syscall` instruction |
| Abstraction Layer | High: glibc `FILE*` object, user-space buffering, error handling | None: directly interfaces with the kernel ABI |
| Dependencies | `libc.so.6` | Linux kernel ABI (version 3.17+) |
| Portability | High: standard C functions with a BSD-style fallback (`arc4random_buf`) | None: Linux x86-64 specific (hardcoded syscall number and ABI) |
| State Management | Stateful: managed by the `FILE*` struct | Stateless: each call is an independent request |
| Error Handling | Verbose and user-friendly (`fprintf` to `stderr`) | Machine-readable and silent (exits with status code 1) |
| Security | Relies on filesystem access; non-blocking reads can return low-entropy data | Blocks by default until the kernel's entropy pool is properly initialized |
Part 3: The Unseen World of the Linker and System Security
Mixing handwritten assembly with C code brings us face-to-face with the tool that binds them together: the linker.
A Primer on Dynamic Linking: The PLT and GOT
When C code calls a shared library function like `printf`, its address is not known at compile time, especially with security features like Address Space Layout Randomization (ASLR). The linker solves this using the Procedure Linkage Table (PLT) and the Global Offset Table (GOT). This system enables lazy binding, deferring the resolution of a function's address until its first call to improve startup time.
On the first call, a stub in the PLT jumps to the dynamic linker, which looks up the function's true runtime address. The linker then patches the GOT entry, overwriting a placeholder with the real address. All subsequent calls from the PLT stub jump directly to the real function via the patched GOT, bypassing the expensive lookup.
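The lookup the dynamic linker performs on that first call can also be invoked by hand through the `dlfcn` API, which is a convenient way to watch runtime symbol resolution in action (a sketch; on glibc older than 2.34 this needs linking with `-ldl`):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/* Sketch: ask the dynamic linker for a symbol's runtime address by
 * hand. A lazily bound PLT stub triggers essentially this lookup
 * the first time the function is called. */
static void *resolve(const char *name) {
    /* RTLD_DEFAULT searches the global scope of the running process. */
    return dlsym(RTLD_DEFAULT, name);
}
```

Printing `resolve("printf")` across two runs of an ASLR-enabled binary shows different addresses, which is exactly why the address cannot be baked in at link time.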
The Specter of the Executable Stack
A more immediate concern is the common warning: `... implies executable stack`. This is a critical security alarm indicating that the Non-Executable (NX) stack has been disabled. The NX bit is a hardware-level defense that allows the OS to mark memory regions like the stack as non-executable, preventing attackers from injecting and running malicious code in a "stack smashing" exploit.

The warning appears because the C compiler adds a special marker section, `.note.GNU-stack`, to its object files to request a non-executable stack, but a handwritten assembly file lacks this marker by default. To uphold the security contract, the assembly programmer must explicitly add the following directive:
```gas
.section .note.GNU-stack,"",@progbits
```

That is the GNU assembler (GAS) syntax; in NASM, which the listings above use, the equivalent is `section .note.GNU-stack noalloc noexec nowrite progbits`.
One can verify this security property using `readelf -l <executable> | grep GNU_STACK`. The flags at the end of the line should be `RW` (Read, Write) for a secure stack, not `RWE` (Read, Write, Execute).
Part 4: The Final Optimization: Outperforming the Standard
The ultimate optimization strategy is to minimize the number of expensive transitions into the kernel. While the `glibc` wrapper for `getrandom` is a direct syscall, a manually buffered approach can be even faster by amortizing the high fixed cost of a system call over many smaller requests.
Implementing a User-Space Buffer in Assembly
To achieve this, the assembly code is enhanced with a static buffer and state-tracking variables in the `.bss` section, giving the function state that persists between calls. The control flow becomes more sophisticated:
- Check Buffer: On entry, the function checks if its internal buffer can satisfy the request.
- Fast Path: If yes, it copies the data from the buffer and returns immediately, avoiding a system call.
- Slow Path (Refill): If not, it makes a single, large `getrandom` syscall to refill its internal buffer (e.g., requesting 256 bytes).
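The same strategy is easier to read in C. A sketch of the buffered design (assumes the glibc `getrandom()` wrapper; the 256-byte pool mirrors the text; not thread-safe, and requests larger than the pool fall through to a direct call):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/random.h>

#define POOL_SIZE 256
static uint8_t pool[POOL_SIZE];
static size_t pool_left = 0;        /* unread bytes left in the pool */

void randombytes(uint8_t *out, size_t outlen) {
    if (outlen > POOL_SIZE) {       /* oversized request: go direct  */
        if (getrandom(out, outlen, 0) != (ssize_t)outlen)
            abort();
        return;
    }
    if (outlen > pool_left) {       /* slow path: one large refill   */
        if (getrandom(pool, POOL_SIZE, 0) != POOL_SIZE)
            abort();
        pool_left = POOL_SIZE;
    }
    /* fast path: copy out of the pool; for a fixed 16-byte size the
     * compiler typically emits a wide SSE move much like movups */
    memcpy(out, pool + (POOL_SIZE - pool_left), outlen);
    pool_left -= outlen;
}
```

With a 256-byte pool and 16-byte requests, only one call in sixteen pays the syscall cost; the other fifteen never leave user space.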
Micro-optimization: The Right Tool for a 16-Byte Job
For a small, fixed-size copy like the 16-byte request in the benchmark, a generic instruction like `rep movsb` is suboptimal due to its startup overhead. The SSE instruction set provides a far better tool: `movups` (Move Unaligned Packed Single-Precision). This instruction uses the CPU's wide 128-bit XMM registers and parallel data paths to move 16 bytes in a single, non-looped operation. The final, optimized copy logic is a single targeted pair of instructions:
```nasm
.Lcopy_16_bytes_fast_path:
    movups xmm0, [rbx]   ; load 16 bytes from our internal buffer into an XMM register
    movups [rdi], xmm0   ; store 16 bytes from the XMM register to the user's buffer
```
```
Total elapsed time: 0.0738 seconds
Average time per call: 73.84 nanoseconds
Throughput: 13543565 calls/second
```
This represents the pinnacle of manual optimization, combining a high-level buffering strategy with a low-level, CPU-specific micro-optimization for the data transfer.
Conclusion: A Holistic View of Performance
This investigation revealed a clear performance hierarchy for obtaining random bytes:
- Looped fopen/fread/fclose: Drastically slow due to the massive overhead of repeated resource management and context switching.
- Looped direct getrandom syscall: A significant improvement that eliminates library overhead but still incurs the context-switch cost for every call.
- Buffered getrandom syscall: The fastest approach, as it intelligently amortizes the high fixed cost of a system call across many smaller requests, minimizing user-kernel transitions.
The journey from a simple C function to a highly optimized assembly routine illustrates a critical principle: true performance engineering requires a holistic understanding of the entire system stack. An effective engineer must understand high-level library strategies (like buffering), operating system costs (context switching), toolchain behavior (linker security flags), and CPU micro-architectural details (instruction selection). This deep dive serves as a powerful case study in this multi-layered approach to building fast, robust, and secure software.
The inclusion of `arc4random_buf` is a hallmark of defensive programming. This function, which originated in BSD operating systems, serves as an excellent alternative on systems where `/dev/urandom` might be inaccessible, showcasing a commitment to portability.