Benchmarking memcmp() for timing attacks

We were interested in timing attacks against memcmp() for various reasons (CVE-2017-2624) and decided to explore the topic a bit further.

Timing attacks against memcmp() have been successful in the past against light bulbs or even the Xbox 360.

The basic idea behind these kinds of attacks is, that memcmp() takes longer if the bytes of the memory areas match, since an additional byte will be compared. If this difference could be measured, an attacker could brute force secrets much more efficient by testing all values for the first byte, taking the one which took the longest to compare and then go on with testing all values for the second byte. The difference in timing is pretty small since only a few instructions differ in contrast to other widely abused timing attacks, like the timing attack against OpenSSH which allowed to enumerate the users. In the OpenSSH case, the password send to the daemon was only verified against the locally stored hash, if the user exists, if the user did not exists, there was no hash to compare against and the login attempt was rejected instantly. In order to exploit this issue efficiently a very long password used, which takes a long time to hash. This increased the time between rejection of a existing user and a non existing one enough to be easily measurable over the network.

In order to protect against this, especially in cryptographically relevant code, there are several implementations of a timing constant memory comparisons.

How big is the timing difference when comparing “AA” to “BA”? This highly depends on the implementation of memcmp(). Some implementations use a block size compare to speed things up, others use modern instructions like SSE for faster comparisons. Take a look at the glibc implementation which is quite lengthy to achieve the best speed for all use cases. We decided to do a quick benchmark.

This of course highly depends on the system and the load on the machine. In order to reduce side influences we used nice to give the process the highest priority and pinned it to a single CPU using taskset. Furthermore you want to disable powersaving during the test to make sure that the CPU speed is not changing.

Lets take a look of two runs, one comparing two and the other comparing 64 bytes. We run each comparison a 10000000 times and add the timings. To reduce the impact of other tasks even further, we interleave the tests:

for (i = 0; i < count; i++) {
	a = get_ticks();
	comp(cookiegood, cookie, comparesize);
	b = get_ticks();
	totalg += (b - a);

	a = get_ticks();
	comp(cookiebad, cookie, comparesize);
	b = get_ticks();
	totalb += (b - a);
}	

How to measure the timing? We have opted to use rdtsc, which returns the processors time stamp counter, which is usually considered a high-resolution timer with low overhead. This is another reason why we bind the program to a single CPU using taskset, since the counters on various CPUs could differ.

Disclaimer: In our benchmark, not all functions have the exactly the same
behaviour, some just return that there is a difference, others return
the difference between the first non-matching bytes.

What are the results of a run when only two bytes are compared? Note: our source can be found here

Version Equal Unequal Difference Faster Version
glibc 001395356192 001394972053 000000384139 b
naiive 001450446420 001399492504 000050953916 b
blockbased 001477100272 001407210544 000069889728 b
builtin 001393116852 001393273220 000000156368 g
fastmemeq 001517470488 001447338372 000070132116 b
consttime 001803111820 001803071588 000000040232 b
timingsafe 001783485212 001783893148 000000407936 g
crypto 001524643216 001524825004 000000181788 g
openssl 001382549144 001382676276 000000127132 g
sse2 001579218680 001583962136 000004743456 g
         

How does this differ for 64 bytes?

Version Equal Unequal Difference Faster Version
glibc 001459868600 001449436793 000010431807 b
naiive 006893595988 001406745332 005486850656 b
blockbased 002008822664 001523502340 000485320324 b
builtin 001453912568 001443792076 000010120492 b
fastmemeq 002505249908 001411098452 001094151456 b
consttime 015470300784 015470952360 000000651576 g
timingsafe 013019553596 012996265528 000023288068 b
crypto 006730822580 006727159920 000003662660 b
openssl 002673272920 002692763332 000019490412 g
sse2 002487218616 001835804324 000651414292 b
         

What we can easily see, even for the naiive implementation, we do not get a time that is 32 times longer when comparing equal strings. This means there is a high calling/setup overhead.

The second view reveals, that for some implementations, the equal memory areas are compared faster than the differing ones. This is for example the case for the consttime and openssl implementation in the 64 Byte run. The other timing safe implementations flicker. If several runs are performed sometimes the equal and sometimes the unequal cases take longer, depending on context switches and other load happening on the box. Furthermore it is clearly visible, that it is possible to perform timing attacks against memcmp() if the implementation is not fully optimised but not on all systems.

In conclusion, it highly depends on the circumstances if a memcmp() is subject to a timing attack.

Update: We were asked by @PaulM, if we did disable hyperthreading. In our experiments, HT did not make a difference, we assume because we bound the process to a single CPU.