Recently, I’ve been experimenting with Intel’s Transactional Synchronization Extensions (TSX), an x86 instruction set extension that implements Restricted Transactional Memory (RTM), which is a hardware implementation of transactional memory. However, executing certain instructions can cause a hardware transaction to spuriously fail, so I’d like to avoid executing them at runtime altogether. Achieving this turned out to be much more difficult than expected; for example, the GNU C Library (glibc) uses runtime hardware capability detection to select the best function variant supported by the hardware. In the end, I used hardware support for CPUID faulting to dynamically manipulate the hardware capability flags, together with a shim dynamic loader to ensure that my override is installed as early as possible. The implementation of libcpuidoverride is open-source and available online.
Thanks to Sol Boucher for help with the ELF format, dynamic linker, and the kernel system call interface!
Background: Transactional Memory
For those unfamiliar, transactional memory is an alternative concurrency control mechanism that simplifies concurrent programming by providing developers with first-class transactional regions that guarantee atomicity, consistency, and isolation. In comparison with existing techniques, developers do not need to explicitly reason about placement, duration, or granularity of low-level mutual exclusion primitives (such as locks), which are error-prone and can result in well-known concurrency bugs such as resource starvation and race conditions.
Unfortunately, a number of operations can cause a hardware transaction to fail, due to various microarchitectural design choices. For example, the underlying hardware implementation of transactional memory on Intel’s Haswell processor buffers transactional operations within the L1 cache, and modifies the cache coherency protocol to detect when contention occurs. If a transaction performs enough operations to exceed the L1 cache size, then a capacity abort may occur. Intel provides a reference document that enumerates the various operations that may cause a transaction to abort, one of which is
This instruction is called by many of glibc’s vectorized functions (e.g.
__memset_avx2_unaligned_erms()) that utilize Intel’s Advanced Vector Extensions (AVX), in order to zero the upper half of the 256-bit YMM AVX2 registers. This is necessary because the lower 128 bits of the YMM registers alias the 128-bit XMM registers, which are part of Intel’s older Streaming SIMD Extensions (SSE). By explicitly zeroing the upper halves, the processor avoids saving and restoring them when transitioning between SSE and AVX instructions, which improves performance.
Since this instruction causes transaction aborts, I’d like to prevent glibc or any other library from using the AVX extensions.
Hardware Capability Detection
Hardware capabilities can be detected using the CPUID instruction, which returns various information about the processor and system configuration. An input value passed in the EAX register, commonly referred to as a leaf, determines what information is requested. Certain leaves support requesting more fine-grained information by passing another input value in the ECX register, commonly referred to as a sub-leaf.
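To make the leaf and sub-leaf mechanics concrete, here is a minimal sketch that queries leaf 7, sub-leaf 0 and tests individual bits of EBX. It assumes an x86 target and a compiler that ships `<cpuid.h>` (GCC or Clang); the helper names `leaf7_ebx_bit()`, `cpu_has_avx2()`, and `cpu_has_rtm()` are made up for illustration.

```c
#include <cpuid.h>    /* __get_cpuid_count (GCC/Clang, x86 only) */
#include <stdbool.h>

/* Query CPUID leaf 7, sub-leaf 0 and test one bit of the EBX output.
 * Returns false if the processor does not implement leaf 7. */
bool leaf7_ebx_bit(unsigned int bit)
{
    unsigned int eax, ebx, ecx, edx;

    /* The leaf goes into EAX and the sub-leaf into ECX. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return ((ebx >> bit) & 1u) != 0;
}

/* CPUID.(EAX=7,ECX=0):EBX bit 5 reports AVX2, bit 11 reports RTM. */
bool cpu_has_avx2(void) { return leaf7_ebx_bit(5); }
bool cpu_has_rtm(void)  { return leaf7_ebx_bit(11); }
```

This only reads what the processor reports; a complete AVX check would also confirm that the operating system has enabled the extended register state.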
The Linux kernel provides information about certain architecture-specific hardware capabilities to ELF programs within the auxiliary vector, keyed under the types
AT_HWCAP2. Unfortunately, the corresponding bitmask values are somewhat poorly documented; within the Linux kernel source code, on non-x86 architectures they are typically defined within
HWCAP_THUMB on ARM), but on x86, they are defined within
asm/cpufeatures.h (only the feature flags for leaf
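As a small illustration of reading these auxiliary vector entries from user space, the sketch below uses glibc's `getauxval()`. It assumes Linux with glibc 2.18 or later (for `AT_HWCAP2`); the helper names `hwcap_bit_set()` and `print_hwcaps()` are hypothetical, and the meaning of each bit is architecture-specific.

```c
#include <stdio.h>
#include <sys/auxv.h>   /* getauxval, AT_HWCAP, AT_HWCAP2 */

/* Return 1 if the given bit (0..63) is set in the kernel-provided
 * AT_HWCAP word, 0 otherwise. */
int hwcap_bit_set(unsigned int bit)
{
    return (int)((getauxval(AT_HWCAP) >> bit) & 1UL);
}

/* Print both capability words as the kernel passed them to this process. */
void print_hwcaps(void)
{
    printf("AT_HWCAP  = %#lx\n", getauxval(AT_HWCAP));
    printf("AT_HWCAP2 = %#lx\n", getauxval(AT_HWCAP2));
}
```

The same values can be inspected without writing any code by running a dynamically linked program with the `LD_SHOW_AUXV=1` environment variable set.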
GNU C Library (glibc)
A dynamically-loaded library like glibc can implement multiple versions of any given function (e.g.
memset()) using a compiler feature known as function multi-versioning (FMV), which is implemented by the GNU Compiler Collection (GCC). Behind the scenes, the compiler inserts a layer of indirection: under lazy binding, the first call to a multi-versioned function actually executes a special dispatch function that performs hardware capability detection to identify the best version, which is then executed directly on all subsequent calls. For the actual details of this process, refer to the Global Offset Table (GOT) and Procedure Linkage Table (PLT) sections within the System-V Application Binary Interface (ABI) AMD64 Architecture Processor Supplement. For a comprehensive discussion of writing a shared library, refer to How to Write Shared Libraries.
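The dispatch step can be written out by hand with GCC's `ifunc` attribute, which is roughly what multi-versioning builds on. The following is a minimal sketch, assuming GCC or Clang on x86-64 Linux; `count_zero()` and its two variants are invented for illustration and are not glibc functions.

```c
#include <stddef.h>

typedef size_t count_fn(const unsigned char *, size_t);

/* Baseline variant, compiled for the default target. */
static size_t count_zero_generic(const unsigned char *p, size_t n)
{
    size_t zeros = 0;
    for (size_t i = 0; i < n; i++)
        zeros += (p[i] == 0);
    return zeros;
}

/* Same logic, but the compiler may use AVX2 instructions here. */
__attribute__((target("avx2")))
static size_t count_zero_avx2(const unsigned char *p, size_t n)
{
    size_t zeros = 0;
    for (size_t i = 0; i < n; i++)
        zeros += (p[i] == 0);
    return zeros;
}

/* Resolver: run by the dynamic loader when it binds count_zero().
 * It performs the capability check and returns the chosen variant. */
static count_fn *resolve_count_zero(void)
{
    __builtin_cpu_init();   /* required before __builtin_cpu_supports in a resolver */
    return __builtin_cpu_supports("avx2") ? count_zero_avx2
                                          : count_zero_generic;
}

/* The public symbol is an indirect function bound through the resolver. */
size_t count_zero(const unsigned char *p, size_t n)
    __attribute__((ifunc("resolve_count_zero")));
```

Callers simply call `count_zero(buf, len)`; which variant runs depends on what the resolver saw when the symbol was bound, which is exactly the decision I want to influence.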
Programs that utilize the Executable and Linkable Format (ELF) mark these functions with a special
STT_GNU_IFUNC indirect function symbol. This informs the dynamic loader to resolve the address of these function symbols dynamically, which can occur either eagerly at program startup (compile with
-Wl,-z,now or environment variable
LD_BIND_NOW) or lazily when first called (default compile, or explicitly with
Hardware capability features from the kernel are inherited by the dynamic loader, and can be overridden using the
LD_HWCAP_MASK environment variable. This environment variable is deprecated as of glibc 2.26 in favor of tunables, and the mask should instead be specified using
GLIBC_TUNABLES=glibc.cpu.hwcap_mask=. Aside from those inherited from
AT_HWCAP, tunables also allow other glibc features to be manually overridden; for example,
glibc.cpu.x86_non_temporal_threshold for controlling the non-temporal store execution threshold.
Unfortunately, these existing mechanisms for controlling hardware feature detection are poorly documented, library-specific (e.g.
LD_HWCAP_MASK or glibc tunables), and limited in scope (e.g.
AT_HWCAP). Indeed, other developers have encountered similar problems on StackOverflow, and ended up resorting to ugly workarounds like binary patching or even rebuilding glibc. A better approach would be a more generic method for overriding feature detection, perhaps by instrumenting or interrupting execution of the CPUID instruction.
It turns out that recent Intel processors post-Ivy Bridge do include such a feature, known as VT FlexMigration, which triggers a General Protection Fault (GPF) when a
CPUID instruction is executed. This allows the results of the
CPUID instruction to be manipulated, which essentially overrides capability detection.
Thanks to the folks working on trace portability for the rr reversible debugger, this feature was exposed to user-mode programs under Linux kernel 4.12+, through the
ARCH_GET_CPUID subfunctions of the
arch_prctl() system call. When enabled by calling
arch_prctl(ARCH_SET_CPUID, 0) (not a typo; a zero argument disables direct execution of CPUID, which is what turns faulting on), executing the
CPUID instruction will generate a
SIGSEGV signal that can be caught using a custom handler. This behavior is inherited through
fork(), but not
An initial implementation of libcpuidoverride utilized this approach to interrupt execution of the
CPUID instruction. Four different designs came to mind for actually fetching the
CPUID value from within the signal handler:
1. Use ptrace() to control execution of the target process. This dynamic approach is elegant, because ptrace() can also be used to force the target process to enable CPUID faulting using PTRACE_SETREGS. When an appropriate SIGSEGV signal arrives, the external tracer can catch it, suppress the signal, and as a separate process, execute the CPUID instruction and write back the overridden result into the tracee. Unfortunately, one major drawback is that most existing debugging and instrumentation tools (such as rr) already use ptrace(), and a target process can have at most one tracer.
2. Use -Wl,-z,initfirst to inject a dynamic library that is loaded first. __attribute__((constructor)) can be used to register a constructor function that is executed when the library is loaded, which can enable CPUID faulting and register a signal handler. When a SIGSEGV signal is received, disable CPUID faulting, execute the CPUID instruction, write back the overridden result, and re-enable CPUID faulting. A drawback of this approach is that calling arch_prctl() within a signal handler is not async-signal-safe. Note that although SIGSEGV is a synchronous signal, it may be delivered asynchronously.
3. Same as (2), except instead of calling CPUID within a signal handler, pre-cache all valid CPUID leaves and sub-leaves within the constructor function, and simply reuse their values within the signal handler, which solves the async-signal-safety problem. Unfortunately, it turns out that there is no good way to enumerate all valid CPUID leaves and sub-leaves without hard-coding which sub-leaves are valid for each leaf (if any).
4. Same as (2), except instead of calling CPUID within a signal handler, spawn a separate process in the same memory space using the CLONE_VM flag, which can execute the CPUID instruction and write back the overridden result. Unfortunately, directly calling the clone() system call (see below for why direct system calls are necessary) is fairly tricky, because the child stack must be manually set up, and some intricate assembly is necessary to ensure that both the parent and child return correctly from the system call.
Rather than spend time hard-coding CPUID sub-leaves or re-implementing a wrapper for the
clone() system call, I initially implemented method (2) in libcpuidoverride.
Dynamic Loader Interposition
However, it turned out this approach didn’t seem to work – despite successfully injecting my shared library using
LD_PRELOAD, no faulting occurred when glibc was loaded. After some digging, I found that as of glibc 2.26, hardware capability detection was moved out of libc and into the dynamic loader itself, for performance reasons. Rather than re-detecting hardware capabilities when resolving each indirect function symbol or loading each dynamic library, the dynamic loader performs hardware capability detection once and makes the results available when subsequently loading dynamic libraries.
As a result, designs (2) through (4) no longer function correctly, because the dynamic loader has already performed hardware capability detection by the time the library is preloaded. Instead, I ended up writing a shim dynamic loader that first enables CPUID faulting and registers a signal handler, before calling the actual dynamic loader. This approach, which is the final method used by libcpuidoverride, is valid because the actual dynamic loader must relocate itself, since it may not be loaded at a predictable address (e.g. if Address Space Layout Randomization (ASLR) is enabled). The target executable only needs to be modified to use this library as the dynamic loader.
Although my shim dynamic loader does not need to actually load any shared libraries or relocate any symbols, it still does need to implement the kernel’s interface for ELF binaries. Additionally, since this is a dynamic loader, it cannot dynamically link against glibc, and must directly make system calls. Rather than reproduce the actual code, an overview of this process is provided below.
1. elf_stack_parse() parses the stack set up by the kernel’s ELF loader, which stores variables such as auxv on the program stack.
2. parse_interpreter() maps the actual dynamic loader into memory and parses both the ELF header and ELF program headers.
3. elf_phdr_load() is called to load all segments marked PT_LOAD into memory with correct permissions at the specified virtual offset relative to the base load address. Additional anonymous pages are mapped into memory if the memory size of a segment exceeds the file size; e.g. for the .bss section. Virtual addresses are aligned against the page size, and rounded accordingly. Additionally, the PT_GNU_STACK header is parsed to determine if the stack should be executable; e.g. for nested functions.
4. The entry point address of the dynamic loader is computed and returned, then executed with the original stack pointer set up by the kernel.
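The first step above can be sketched as follows. This is an illustrative reconstruction rather than the library's `elf_stack_parse()`: it assumes the usual Linux initial stack layout (argc, the argv pointers, a NULL, the environment pointers, a NULL, then auxiliary vector pairs ending in `AT_NULL`), and the names `struct initial_stack`, `parse_initial_stack()`, and `auxv_lookup()` are hypothetical.

```c
#include <elf.h>      /* AT_NULL, AT_BASE, AT_ENTRY, AT_PAGESZ */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of what the kernel placed on the initial stack. */
struct initial_stack {
    long argc;
    char **argv;
    char **envp;
    uintptr_t *auxv;   /* (type, value) pairs, terminated by AT_NULL */
};

/* Walk the initial stack, given the stack pointer at process entry. */
void parse_initial_stack(uintptr_t *sp, struct initial_stack *out)
{
    uintptr_t *p = sp;

    out->argc = (long)*p++;
    out->argv = (char **)p;
    p += out->argc + 1;          /* skip argv[] and its NULL terminator */

    out->envp = (char **)p;
    while (*p != 0)              /* skip the environment pointers */
        p++;
    out->auxv = p + 1;           /* auxv starts after envp's NULL */
}

/* Look up one auxiliary vector entry; returns 1 and stores the value
 * if the type is present, 0 otherwise. */
int auxv_lookup(const uintptr_t *auxv, uintptr_t type, uintptr_t *value)
{
    for (; auxv[0] != AT_NULL; auxv += 2) {
        if (auxv[0] == type) {
            *value = auxv[1];
            return 1;
        }
    }
    return 0;
}
```

A shim loader would typically consult entries such as `AT_PAGESZ` for the rounding in step 3, and would need to keep the stack contents consistent with what the real dynamic loader expects to find when control is handed over in step 4.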