Are You Smarter Than the Compiler?

So, are you? It’s extremely unlikely, but you probably knew that. I recently questioned whether I could refactor away a branch in some arithmetic code, and I had a fun time rediscovering first-hand that I am not, in fact, smarter.

The Scenario

On my current project, I was working on some code that would take an integer (from 0 to 23) representing an hour of the day and format it for display to the user. This being the United States, we wanted to display it in 12-hour time. I’m sure you’re already imagining the implementation of a function such that f(23)=11, f(1)=1, f(0)=12, and so on.

The easiest implementation you’re thinking of probably looks like this:

int normalize(int h) {
  h = h % 12;          // 0..23 down to 0..11
  if (h == 0) h = 12;  // midnight and noon display as 12, not 0
  return h;
}
…or, at least it did for me.

As a software consultant, my sharpest skills are in the expression of logic. The value I create is largely derived from my ability to analyze my clients’ business domain and synthesize from it logical machinery that is useful, succinctly expressed, maintainable, and—not least of all—correct. I don’t often run into situations where performance is a concern.

Despite this, I know (only) a few things about how computers work. I can tell you that modern processors implement a pipelining architecture, where they achieve greater performance by attempting to execute multiple machine instructions at the same time, or even out of order. I also know that branches, such as the if (h == 0) above, pose an interesting challenge to the processor.

What if the processor is trying to dispatch the conditional branch (if (h == 0)), but it has not yet finished computing the new value of h from the previous step? The CPU will guess. Perhaps it’ll decide that h is probably going to be zero, so it’ll continue along and set h to 12. Disastrously, if the processor later realizes that h % 12 did not yield zero, it will have to throw away that speculative work, flush its pipeline, and start over down the correct path.

I also know this code will be running on a phone—potentially very frequently—as a user quickly scrolls through items in a table view. Furthermore, my mathematical intuition tells me I should be able to express this function using another modulus instead of a branch, which would keep the processor’s pipeline chugging along happily. I wondered aloud whether it was worth my time to bother refactoring.

The Experiment

This is when Job, my partner on the project, chimed in. As I expected, he said there’s no way it could be worth our time, as modern compilers should be able to see right through such simple code. Regardless, we were both feeling exceptionally curious about what the compiler would actually do, and we couldn’t just leave it at that. So Job introduced me to gcc.godbolt.org, and we began to look at the assembly output for two implementations of this function. In particular, this is the code that would be generated by LLVM Clang 3.8 with the -Os flag.
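If you’d rather stay in the terminal than the browser, an invocation along these lines should produce comparable assembly, assuming Clang is installed and both functions live in a hypothetical file called normalize.c (-S emits assembly instead of an object file, and -o - sends it to standard output):

        clang -Os -S -o - normalize.c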

int branchingNormalize(int h) {
  h = h % 12;
  if (h == 0) h = 12;
  return h;
}

branchingNormalize:
        movsxd  rcx, edi
        imul    rax, rcx, 715827883
        mov     rdx, rax
        shr     rdx, 63
        shr     rax, 33
        add     eax, edx
        shl     eax, 2
        lea     eax, [rax + 2*rax]
        sub     ecx, eax
        mov     eax, 12
        cmovne  eax, ecx
        ret

int modulusNormalize(int h) {
  return ((h + 11) % 12) + 1;
}

modulusNormalize:
        add     edi, 11
        movsxd  rax, edi
        imul    rcx, rax, 715827883
        mov     rdx, rcx
        shr     rdx, 63
        sar     rcx, 33
        add     ecx, edx
        shl     ecx, 2
        lea     ecx, [rcx + 2*rcx]
        sub     eax, ecx
        inc     eax
        ret
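For the skeptical, a minimal harness along these lines can exhaustively check that the two implementations agree over every valid hour:

#include <assert.h>

int main(void) {
  // Compare the two implementations across every hour of the day.
  for (int h = 0; h < 24; h++) {
    int branching = h % 12 == 0 ? 12 : h % 12;  // same as branchingNormalize
    int modulus = ((h + 11) % 12) + 1;          // same as modulusNormalize
    assert(branching == modulus);
  }
  return 0;
}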

What’s Going On Here?

As Job predicted, these implementations look vastly more similar than dissimilar. They even share the same number of instructions.

Naturally, I first wanted to know where the branch was, if there was one at all. Indeed, there isn’t. Instead, the second-to-last instruction of branchingNormalize is cmovne, which is a conditional move. This might seem like a trivial distinction on the surface, but it’s actually quite significant. The conditional move may or may not modify the eax register, but it never changes which instructions need to be executed next. The processor is quite capable of handling data dependencies between instructions as it parallelizes and reorders them.
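In C terms, the tail of branchingNormalize behaves something like the following sketch (cmovModel is just an illustrative name, not anything the compiler generated):

// A rough C model of branchingNormalize's last three instructions.
int cmovModel(int h) {
  int remainder = h % 12;                          // what ecx holds after the sub
  int fallback = 12;                               // mov eax, 12
  return (remainder != 0) ? remainder : fallback;  // cmovne: selects a value,
                                                   // never redirects execution
}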

Looking beyond that, I was fairly confused about the rest of the instructions; there’s no integer division at all.

From grade school, you may remember how much more difficult it was to divide two numbers than to multiply them. You may also remember, from your college computer architecture class, that division is similarly more expensive for a processor. Personally, I had long buried both of those memories. After all, that’s why I have a computer, and it never seemed to make a fuss about it.

That’s totally okay, as it turns out: Clang has my back. Rather than ask the processor to perform division in pursuit of our remainder, it’s figured out that it can achieve the same thing with a combination of multiplication, bit shifting, addition, and subtraction. The magic constant 715827883 is 2^33 / 12, rounded up; multiplying by it and shifting right by 33 bits yields the quotient, and the extra shift by 63 fishes out the sign bit so the result rounds toward zero for negative inputs, as C requires. Multiplying that quotient back by 12 (the shl and lea pair) and subtracting it from the original value produces the remainder. Essentially, really cool magic.
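If you’d like to convince yourself, here’s a minimal sketch of my own (not taken from the compiler output); the loop bound is arbitrary:

#include <assert.h>
#include <stdint.h>

int main(void) {
  // 715827883 is 2^33 / 12, rounded up: Clang's fixed-point reciprocal of 12.
  const int64_t magic = 715827883;
  for (int32_t h = 0; h < 100000; h++) {
    int64_t quotient = (h * magic) >> 33;  // the imul and shift from the listing
    assert(quotient == h / 12);            // agrees with real division
    assert(h - quotient * 12 == h % 12);   // remainder via multiply and subtract
  }
  return 0;
}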

This was an extremely fun diversion, and it’s one of the many examples of why it’s great to work with such smart colleagues. Without Job’s enthusiasm, it’s doubtful I would have actually done this analysis.

At first thought, it made me slightly uncomfortable to see how well the compiler can optimize. If the compiler doesn’t care, should I? What’s the point in paying attention to those small details in my code unless I actually observe a problem?

Upon slightly deeper reflection, however, I find that fear to be silly. I create much more value for the world, and have more fun, when I’m thinking and working at a higher level. I want to be as close as possible to pure logic, not micro-managing (pun intended?) hardware.