Compilers normally do scary things to security-sensitive code. Consider, for example, the problem of wiping keys from memory once we are done working with them. This is hard to do in C. The naive way to zeroize the key buffer is to call memset(). However, since the buffer is never read after being overwritten, the compiler's dead-store elimination will very likely remove the memset() call intended to wipe the keys, working directly against our intention.1
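
As a concrete illustration, here is a minimal sketch of the volatile-function-pointer trick (one of the approaches referenced in the footnote; the names wipe_key and memset_v are mine). Because the pointer is volatile, the compiler must re-read it at every call site and cannot prove the call dead:

  #include <string.h>

  /* The volatile qualifier forces a real read of the pointer on every
   * call; the compiler cannot assume it still points to memset, so
   * dead-store elimination cannot remove the wipe. */
  static void *(*const volatile memset_v)(void *, int, size_t) = memset;

  void wipe_key(void *buf, size_t len) {
      memset_v(buf, 0, len);
  }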

Surprisingly enough, sometimes the compiler adds a bit of security. The following is an amusing example of how LLVM/clang's optimization flags manage to turn variable-time code into (seemingly) constant-time code. (This is an anecdote I found while looking into timing variability, and certainly not representative of average compiler behavior.)

A common variable-time function is the following AES xtime() implementation:2

  #include <stdint.h>

  uint8_t xtime(uint8_t x) {
      return (x & 0x80) ? ((x << 1) ^ 0x1b) : (x << 1);
  }

This function normally appears in byte-oriented AES implementations; it multiplies by 2 (or by the polynomial ‘x’, hence the name) in Rijndael’s finite field GF(2^8). As the code reads, it is “almost always” just a left shift by one, except when the input has its topmost bit set; in that case, we reduce the result by XORing with 0x1b.
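
As a side note (my illustration, not from the post): xtime() composes into the other small field multiplications AES needs, such as multiplication by 3 in MixColumns:

  /* 3·x = 2·x + x in GF(2^8): one xtime() plus one XOR */
  uint8_t mul3(uint8_t x) {
      return xtime(x) ^ x;
  }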

This is a wonderful schoolbook timing-leakage source. Let us inspect the code clang -O0 emits for x86_64:

  [figure: clang -O0 assembly listing for xtime()]

This output is easy to follow. The relevant bit is that the branch taken when the top bit is set does one extra thing: it XORs eax with Rijndael’s magic 0x1b constant.

Let us run this implementation through dudect and plot the corresponding cumulative distribution functions (CDFs) of the cycle counts:

  [figure: CDF comparison at -O0, fixed vs. random input]

The blue line corresponds to a fixed input of 0; the orange line corresponds to uniformly random inputs. The blue CDF tells us that most executions took around 40 to 60 cycles; the orange CDF says that most executions took slightly less than 100 cycles. The distributions are completely different: they will fail a Kolmogorov-Smirnov test at any reasonable significance level, and dudect complains loudly.
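
For the curious, here is a minimal sketch of how the two-sample Kolmogorov-Smirnov statistic can be computed over sorted cycle-count samples (my illustration; dudect's internals rely on a t-test rather than this code):

  #include <stddef.h>
  #include <stdint.h>

  /* D = max |F1(x) - F2(x)| over the two empirical CDFs, computed
   * with a single merge-style walk over the sorted samples. */
  double ks_statistic(const uint64_t *a, size_t n,
                      const uint64_t *b, size_t m) {
      size_t i = 0, j = 0;
      double d = 0.0;
      while (i < n && j < m) {
          if (a[i] < b[j])      i++;
          else if (b[j] < a[i]) j++;
          else                  { i++; j++; }  /* ties: advance both */
          double diff = (double)i / n - (double)j / m;
          if (diff < 0) diff = -diff;
          if (diff > d) d = diff;
      }
      return d;
  }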

Let us change the optimization flag and compile with -O1. We get a straight-line executable:

  [figure: clang -O1 assembly listing for xtime()]

All the juice is now in the cmovns instruction: a conditional move performed only if the sign flag is clear. The flag here reflects the top bit of the DIL register, which holds the lower 8 bits of the EDI register containing the input, so this is effectively the (x & 0x80) condition from the C code. Surprisingly, this results in constant-time code! Let us plot the same two distributions when compiled with clang -O1:3

  [figure: CDF comparison at -O1, fixed vs. random input]

(Only one line is visible: the blue line is hidden behind the orange one.) The execution times are lower (there is much less code to execute) and, most notably, the distribution is now independent of the input value! Constant-time code is nothing new, but I was amused to see an optimizing compiler produce it on its own.
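
To make the constant-time intent explicit in the source instead of relying on the optimizer, one can write a branchless version by hand (a sketch of mine, not from the post; the mask plays the same role as cmovns):

  uint8_t xtime_ct(uint8_t x) {
      /* 0xFF if the top bit of x is set, 0x00 otherwise */
      uint8_t mask = (uint8_t)-(x >> 7);
      /* always shift; XOR in the reduction constant only under the mask */
      return (uint8_t)((x << 1) ^ (0x1b & mask));
  }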

This is an example of the same source code exhibiting wildly different timing behavior. It follows that staring at the source code is not enough to determine whether code is constant-time. Even looking at the assembly output is probably not enough, unless you know all the microarchitectural details of the processor, which are often unpublished (for example, multipliers that exit early when some chunks of the operands are zero). I prefer to also measure execution times, and sometimes you get surprises like this one :-)
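
For completeness, a minimal sketch of what such a measurement can look like (my illustration, using the raw rdtsc cycle counter; a real harness like dudect serializes the pipeline and gathers many samples per input class):

  #include <stdint.h>
  #include <x86intrin.h>

  /* Roughly time one call to xtime() in cycles. Collect many of these
   * for a fixed input and for random inputs, then compare the two
   * distributions, as in the plots above. */
  static uint64_t time_one_call(uint8_t in) {
      uint64_t t0 = __rdtsc();
      volatile uint8_t sink = xtime(in);  /* volatile: keep the call alive */
      (void)sink;
      return __rdtsc() - t0;
  }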

Footnotes:

  1. There are ways to wipe keys in C: see OpenBSD’s explicit_bzero(), C11’s memset_s(), Colin Percival’s take on the issue, and some more hints.

  2. A proper constant-time AES implementation would use only arithmetic (or vector arithmetic) operations, probably in a bitsliced fashion to amortize costs. Emilia Käsper did that at CHES 2009, breaking the predicted theoretical peak performance of 10 cycles/byte.

  3. You can play with different compiler options yourself in the Compiler Explorer.