Processor CPUID Faulting and Dynamic Loader Interposing

Recently, I’ve been experimenting with Intel’s Transactional Synchronization Extensions (TSX), an x86 instruction set extension that implements Restricted Transactional Memory (RTM), a hardware implementation of transactional memory. But executing various instructions within a transaction can cause it to spuriously fail, so I’d like to avoid executing those instructions at runtime entirely. Achieving this turned out to be much more difficult than expected; for example, the GNU C Library (glibc) uses runtime hardware capability detection to execute the most optimal function supported by the hardware. I ended up using hardware support for CPUID faulting to dynamically manipulate the hardware capability flags, together with a shim dynamic loader to ensure that my override is installed as early as possible. The implementation of libcpuidoverride is open-source and available online.

Thanks to Sol Boucher for help with the ELF format, dynamic linker, and the kernel system call interface!

Background: Transactional Memory

For those unfamiliar, transactional memory is an alternative concurrency control mechanism that simplifies concurrent programming by providing developers with first-class transactional regions that guarantee atomicity, consistency, and isolation. In comparison with existing techniques, developers do not need to explicitly reason about placement, duration, or granularity of low-level mutual exclusion primitives (such as locks), which are error-prone and can result in well-known concurrency bugs such as resource starvation and race conditions.

Unfortunately, a number of operations can cause a hardware transaction to fail, due to various microarchitectural design choices. For example, the underlying hardware implementation of transactional memory on Intel’s Haswell processor buffers transactional operations within the L1 cache, and modifies the cache coherency protocol to detect when contention occurs. If a transaction performs enough operations to exceed the L1 cache size, then a capacity abort may occur. Intel provides a reference document that enumerates the various operations that may cause a transaction to abort, one of which is VZEROUPPER.
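To make this concrete, here is a minimal sketch of a hardware transaction using the RTM intrinsics from immintrin.h (compiled with -mrtm); increment() and its atomic fallback are illustrative, not part of libcpuidoverride:

    #include <immintrin.h>   /* RTM intrinsics: _xbegin(), _xend(); compile with -mrtm */
    #include <stdio.h>

    static int counter;

    /* If the transaction aborts (e.g. a capacity abort, or executing an
     * instruction such as VZEROUPPER inside the transactional region),
     * _xbegin() returns an abort status instead of _XBEGIN_STARTED. */
    static void increment(void)
    {
        unsigned int status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            counter++;                           /* transactional region */
            _xend();
        } else {
            __sync_fetch_and_add(&counter, 1);   /* non-transactional fallback path */
        }
    }

    int main(void)
    {
        increment();
        printf("counter = %d\n", counter);
        return 0;
    }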

This instruction is executed by many of glibc’s vectorized functions (e.g. __memset_avx2_unaligned_erms()) that utilize Intel’s Advanced Vector Extensions (AVX), in order to zero the upper half of the 256-bit YMM AVX2 registers. This is necessary because the lower 128 bits of the YMM registers alias with the 128-bit XMM registers, which are part of Intel’s older Streaming SIMD Extensions (SSE). By explicitly zeroing these registers, the processor avoids having to save and restore their upper halves when transitioning between SSE and AVX instructions, which improves performance.

But since this instruction causes transaction aborts, I’d like to prevent glibc (or any other library) from utilizing the AVX extensions.

Hardware Capability Detection

Linux Kernel

Hardware capabilities can be detected using the CPUID instruction, which returns various information about the processor and system configuration. An input value passed in the EAX register, commonly referred to as a leaf, is used to determine what information is requested. Certain leaves may support requesting more fine-grained information by passing another input value in the ECX register, commonly referred to as a sub-leaf.
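As a rough illustration, the following sketch queries leaf 0x07, sub-leaf 0x00 directly with inline assembly (GCC/Clang on x86-64 assumed) and inspects two of its EBX feature bits:

    #include <stdint.h>
    #include <stdio.h>

    /* Execute CPUID for a given leaf (EAX) and sub-leaf (ECX). */
    static void cpuid(uint32_t leaf, uint32_t subleaf,
                      uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx)
    {
        __asm__ volatile("cpuid"
                         : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx)
                         : "a"(leaf), "c"(subleaf));
    }

    int main(void)
    {
        uint32_t a, b, c, d;
        cpuid(0x07, 0x00, &a, &b, &c, &d);   /* structured extended feature flags */
        printf("AVX2: %u\n", (b >> 5) & 1);  /* CPUID.07H:EBX bit 5 */
        printf("RTM:  %u\n", (b >> 11) & 1); /* CPUID.07H:EBX bit 11 */
        return 0;
    }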

The Linux kernel provides information about certain architecture-specific hardware capabilities to ELF programs within the auxiliary vector, keyed under the types AT_HWCAP and AT_HWCAP2. Unfortunately, the corresponding bitmask values are somewhat poorly documented; within the Linux kernel source code, on non-x86 architectures they are typically defined within asm/hwcap.h and asm/hwcap2.h (e.g. HWCAP_THUMB on ARM), but on x86, they are defined within asm/cpufeatures.h (only the feature flags for leaf 0x01 register EDX, e.g. X86_FEATURE_FPU) and asm/hwcap2.h (e.g. HWCAP2_RING3MWAIT).
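A program can read these values at runtime with glibc’s getauxval(); a minimal sketch (the printed masks are raw bit values, whose meaning depends on the architecture as described above):

    #include <stdio.h>
    #include <sys/auxv.h>

    int main(void)
    {
        unsigned long hwcap  = getauxval(AT_HWCAP);   /* on x86: leaf 0x01 EDX feature flags */
        unsigned long hwcap2 = getauxval(AT_HWCAP2);  /* on x86: e.g. HWCAP2_RING3MWAIT */
        printf("AT_HWCAP  = %#lx\n", hwcap);
        printf("AT_HWCAP2 = %#lx\n", hwcap2);
        return 0;
    }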

GNU C Library (glibc)

A dynamically-loaded library like glibc can implement multiple versions of any given function (e.g. memset()) using a compiler feature known as function multi-versioning (FMV), which is implemented by the GNU Compiler Collection (GCC). Behind the scenes, the compiler inserts a layer of indirection: under lazy binding, the first call to a multi-versioned function actually executes a special dispatch function that performs hardware capability detection to identify the most optimal version, which is then executed directly on all subsequent calls. For the actual details of this process, refer to the Global Offset Table (GOT) and Procedure Linkage Table (PLT) sections within the System-V Application Binary Interface (ABI) AMD64 Architecture Processor Supplement. For a comprehensive discussion of writing a shared library, refer to How to Write Shared Libraries.
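For example, GCC’s target_clones attribute asks the compiler to emit several versions of a function plus the dispatch machinery; a minimal sketch (square() is just an illustrative function):

    #include <stdio.h>

    /* GCC emits an AVX2 version, a default version, and a resolver that
     * picks between them at bind time based on hardware capabilities. */
    __attribute__((target_clones("avx2", "default")))
    int square(int x)
    {
        return x * x;
    }

    int main(void)
    {
        printf("%d\n", square(7));
        return 0;
    }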

Programs that utilize the Executable and Linkable Format (ELF) mark these functions with a special STT_GNU_IFUNC indirect function symbol. This informs the dynamic loader to resolve the address of these function symbols dynamically, which can occur either eagerly at program startup (compile with -Wl,-z,now or set the environment variable LD_BIND_NOW) or lazily when first called (the default, or explicitly with -Wl,-z,lazy).
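Under the hood, this mechanism is built on indirect functions: a resolver returns the address of the implementation to use, and the symbol is marked STT_GNU_IFUNC so the dynamic loader invokes the resolver when processing its relocation. A hand-written sketch using GCC’s ifunc attribute (impl_avx2, impl_generic, and which_impl are made-up names):

    #include <stdio.h>

    /* Two candidate implementations. */
    static int impl_generic(void) { return 0; }
    static int impl_avx2(void)    { return 1; }

    typedef int impl_fn(void);

    /* The resolver runs when the dynamic loader resolves which_impl(),
     * and returns the version to bind. */
    static impl_fn *resolve_which_impl(void)
    {
        __builtin_cpu_init();
        return __builtin_cpu_supports("avx2") ? impl_avx2 : impl_generic;
    }

    /* Emits an STT_GNU_IFUNC symbol for which_impl. */
    int which_impl(void) __attribute__((ifunc("resolve_which_impl")));

    int main(void)
    {
        printf("selected implementation: %d\n", which_impl());
        return 0;
    }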

Hardware capability features from the kernel are inherited by the dynamic loader, and can be overridden using the LD_HWCAP_MASK environment variable. This environment variable is deprecated with glibc 2.26+ in favor of tunables, and should instead be specified using GLIBC_TUNABLES=glibc.cpu.hwcap_mask=. Aside from those inherited from AT_HWCAP, tunables also allow other glibc features to be manually overridden; for example, glibc.cpu.x86_non_temporal_threshold controls the non-temporal store execution threshold.

CPUID Faulting

Unfortunately, these existing mechanisms for controlling hardware feature detection are poorly documented, library-specific (e.g. LD_HWCAP_MASK or glibc tunables), and limited in scope (e.g. AT_HWCAP). Indeed, other developers have encountered similar problems on Stack Overflow, and ended up resorting to ugly workarounds like binary patching or even rebuilding glibc. A better approach would be a more generic method for overriding feature detection, perhaps by instrumenting or interrupting execution of the CPUID instruction.

It turns out that recent Intel processors (post-Ivy Bridge) do include such a feature, known as CPUID faulting (introduced as part of VT FlexMigration), which triggers a General Protection Fault (GPF) when a CPUID instruction is executed outside of ring 0. This allows the results of the CPUID instruction to be manipulated by a fault handler, which essentially overrides capability detection.

Thanks to the folks working on trace portability for the rr reversible debugger, this feature was exposed to user-mode programs under Linux kernel 4.12+, through the ARCH_SET_CPUID and ARCH_GET_CPUID subfunctions of the arch_prctl() system call. When enabled by calling arch_prctl(ARCH_SET_CPUID, 0) (not a typo), executing the CPUID instruction will generate a SIGSEGV signal that can be caught using a custom handler. This behavior is inherited through fork(), but not execve().
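A minimal sketch of enabling CPUID faulting for the current process (assuming x86-64 Linux 4.12+; the fallback constants mirror the kernel’s asm/prctl.h, and the handler simply exits rather than emulating CPUID):

    #include <signal.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef ARCH_SET_CPUID
    #define ARCH_GET_CPUID 0x1011   /* from asm/prctl.h */
    #define ARCH_SET_CPUID 0x1012
    #endif

    static void handler(int sig, siginfo_t *info, void *ucontext)
    {
        (void)sig; (void)info; (void)ucontext;
        /* A real handler would emulate CPUID and advance the instruction
         * pointer past the 2-byte opcode; here we just exit. */
        _exit(42);
    }

    int main(void)
    {
        struct sigaction sa = { .sa_sigaction = handler, .sa_flags = SA_SIGINFO };
        sigaction(SIGSEGV, &sa, NULL);

        /* 0 disables the CPUID instruction, i.e. enables faulting ("not a typo"). */
        if (syscall(SYS_arch_prctl, ARCH_SET_CPUID, 0) != 0) {
            perror("arch_prctl(ARCH_SET_CPUID, 0)");
            return 1;
        }

        unsigned int eax = 0;
        __asm__ volatile("cpuid" : "+a"(eax) : : "ebx", "ecx", "edx");  /* now raises SIGSEGV */
        return 0;
    }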

An initial implementation of libcpuidoverride utilized this approach to interrupt execution of the CPUID instruction. Four different designs came to mind for actually fetching the CPUID value from within the signal handler:

  1. Use ptrace() to control execution of the target process. This dynamic approach is elegant, because ptrace() can also be used to force the target process to enable CPUID faulting using PTRACE_SETREGS. When an appropriate SIGSEGV signal arrives, the external tracer can catch it, suppress the signal, and, as a separate process, execute the CPUID instruction and write back the overridden result into the tracee. Unfortunately, one major drawback is that most existing debugging and instrumentation tools (such as rr) already use ptrace(), and a process can have at most one tracer.

  2. Use LD_PRELOAD and -Wl,-z,initfirst to inject a dynamic library that is loaded first. __attribute__((constructor)) can be used to register a constructor function that is executed when the library is loaded, which can enable CPUID faulting and register a signal handler. When a SIGSEGV signal is received, disable CPUID faulting, execute the CPUID instruction, write back the overridden result, and re-enable CPUID faulting. A drawback of this approach is that calling arch_prctl() within a signal handler is not asynchronous signal safe. Note that although SIGSEGV is a synchronous signal, it may be delivered asynchronously.

  3. Same as (2), except instead of calling arch_prctl() or executing CPUID within a signal handler, pre-cache all valid CPUID leaves and sub-leaves within the constructor function, and simply reuse their values within the signal handler, which solves the asynchronous signal safety problem. Unfortunately, it turns out that there is no good way to enumerate all valid CPUID leaves and sub-leaves without hard-coding which sub-leaves are valid for each leaf (if any).

  4. Same as (2), except instead of calling arch_prctl() or executing CPUID within a signal handler, spawn a separate process in the same memory space using clone() with the CLONE_VM flag, which can execute the CPUID instruction and write back the overridden result. Unfortunately, directly calling the clone() system call (see below for why direct system calls are necessary) is fairly tricky, because the child stack must be manually set up, and some intricate assembly is necessary to ensure that both the parent and child return correctly from the system call.

Rather than spend time hard-coding CPUID sub-leaves or re-implementing a wrapper for the clone() system call, I initially implemented method (2) in libcpuidoverride.
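For reference, here is a rough sketch of what the method (2) signal handler looks like, assuming x86-64 Linux and registration as in the earlier example; the feature bits that are masked (AVX in leaf 0x01 ECX, AVX2 in leaf 0x07 EBX) are illustrative, and the arch_prctl() calls are exactly the non-async-signal-safe part discussed above:

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <ucontext.h>
    #include <unistd.h>
    #include <asm/prctl.h>   /* ARCH_SET_CPUID */

    void sigsegv_handler(int sig, siginfo_t *info, void *ctx)
    {
        ucontext_t *uc = ctx;
        greg_t *gregs = uc->uc_mcontext.gregs;
        const uint8_t *rip = (const uint8_t *)(uintptr_t)gregs[REG_RIP];
        (void)info;

        /* Only handle faults caused by the 2-byte CPUID opcode (0x0F 0xA2). */
        if (sig != SIGSEGV || rip[0] != 0x0f || rip[1] != 0xa2)
            _exit(128 + SIGSEGV);

        uint32_t leaf = (uint32_t)gregs[REG_RAX], subleaf = (uint32_t)gregs[REG_RCX];
        uint32_t a, b, c, d;

        syscall(SYS_arch_prctl, ARCH_SET_CPUID, 1);   /* temporarily allow CPUID again */
        __asm__ volatile("cpuid"
                         : "=a"(a), "=b"(b), "=c"(c), "=d"(d)
                         : "a"(leaf), "c"(subleaf));
        syscall(SYS_arch_prctl, ARCH_SET_CPUID, 0);   /* re-enable faulting */

        if (leaf == 0x01)                  c &= ~(1u << 28);  /* hide AVX */
        if (leaf == 0x07 && subleaf == 0)  b &= ~(1u << 5);   /* hide AVX2 */

        gregs[REG_RAX] = a; gregs[REG_RBX] = b;
        gregs[REG_RCX] = c; gregs[REG_RDX] = d;
        gregs[REG_RIP] += 2;               /* resume after the CPUID instruction */
    }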

Dynamic Loader Interposition

However, it turned out this approach didn’t work: despite successfully injecting my shared library using LD_PRELOAD, no faulting occurred when glibc was loaded. After doing some digging, it turned out that as of glibc 2.26+, hardware capability detection was moved out of libc and into the dynamic loader itself, for performance reasons. Rather than re-detecting hardware capabilities when resolving each indirect function symbol or when loading each dynamic library, the dynamic loader itself performs hardware capability detection once, and makes the results available when subsequently loading dynamic libraries.

As a result, designs (2) through (4) no longer function correctly, because the dynamic loader has already performed hardware capability detection by the time the preloaded library runs. Instead, I ended up writing a shim dynamic loader that first enables CPUID faulting and registers a signal handler, before handing control to the actual dynamic loader. This approach works because the actual dynamic loader must relocate itself anyway, since it may not be loaded at a predictable address (e.g. if Address Space Layout Randomization (ASLR) is enabled). This is the final method used by libcpuidoverride; the target executable only needs to be modified to use this library as its dynamic loader.

Although my shim dynamic loader does not need to actually load any shared libraries or relocate any symbols, it still needs to implement the kernel’s interface for ELF binaries. Additionally, since this is a dynamic loader, it cannot dynamically link against glibc, and must make system calls directly. Rather than reproduce the actual code, an overview of this process is provided below.
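Since the shim cannot call into glibc, every kernel interaction goes through a small hand-rolled system call wrapper along these lines (a sketch for x86-64; raw_syscall3() and raw_write() are illustrative names):

    #include <stdint.h>

    /* x86-64 Linux system call convention: number in rax; arguments in
     * rdi, rsi, rdx, r10, r8, r9; the kernel clobbers rcx and r11. */
    static long raw_syscall3(long nr, long a1, long a2, long a3)
    {
        long ret;
        __asm__ volatile("syscall"
                         : "=a"(ret)
                         : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                         : "rcx", "r11", "memory");
        return ret;
    }

    /* Example: write(2) is system call number 1 on x86-64. */
    static void raw_write(const char *msg, long len)
    {
        raw_syscall3(1, 1 /* stdout */, (long)msg, len);
    }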

  1. elf_stack_parse() parses the initial stack set up by the kernel’s ELF loader, which stores argc, argv, envp, and auxv on the program stack (a sketch of this layout appears after this list).

  2. elf_auxv_parse() parses the auxiliary vector to obtain the system page size, and some random values that are used to compute the base load address of the actual dynamic loader.

  3. parse_interpreter() maps the actual dynamic loader into memory and parses both the ELF header and ELF program headers.

  4. elf_phdr_load() is called to load all segments marked PT_LOAD into memory with correct permissions at the specified virtual offset relative to the base load address. Additional anonymous pages are mapped into memory if the memory size of a segment exceeds the file size; e.g. for the .bss section. Virtual addresses are aligned against the page size, and rounded accordingly. Additionally, the PT_GNU_STACK header is parsed to determine if the stack should be executable; e.g. for nested functions.

  5. The entry point address of the dynamic loader is computed and returned, and execution jumps to it with the original stack pointer set up by the kernel.
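To give a feel for step (1), here is a sketch of how the initial stack handed over by the kernel can be walked to recover argc, argv, envp, and the auxiliary vector (parse_initial_stack() and its struct are hypothetical names; the layout follows the System-V ABI):

    #include <elf.h>      /* Elf64_auxv_t, AT_NULL, AT_PAGESZ, ... */
    #include <stdint.h>

    struct initial_stack {
        long argc;
        char **argv;
        char **envp;
        Elf64_auxv_t *auxv;
    };

    /* sp points at the initial stack pointer set up by the kernel's ELF loader. */
    static struct initial_stack parse_initial_stack(void *sp)
    {
        struct initial_stack s;
        long *p = sp;

        s.argc = *p++;                 /* argc is at the lowest address */
        s.argv = (char **)p;
        p += s.argc + 1;               /* skip argv[] and its NULL terminator */
        s.envp = (char **)p;
        while (*p)                     /* skip envp[] up to its NULL terminator */
            p++;
        p++;
        s.auxv = (Elf64_auxv_t *)p;    /* auxiliary vector, terminated by AT_NULL */
        return s;
    }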