Are You Smarter Than the Compiler?

So, are you? It’s extremely unlikely, but you probably knew that. I recently questioned whether I could refactor away a branch in some arithmetic code, and I had a fun time rediscovering first-hand that I am not, in fact, smarter.

The Scenario

On my current project, I was working on some code that would take an integer (from 0-23) representing an hour of the day and format it for display to the user. This being the United States, we wanted to display it in 12-hour time. I’m sure you’re already imagining the implementation of a function such that f(23)=11, f(1)=1, f(0)=12, and so on.

The easiest implementation you’re thinking of probably looks like this:


int normalize(int h) {
	h = h % 12;
	if (h == 0) h = 12;
	return h;
}

…or, at least it did for me.

As a software consultant, my sharpest skills are in the expression of logic. The value I create is largely derived from my ability to analyze my clients’ business domain and synthesize from it logical machinery that is useful, succinctly expressed, maintainable, and—not least of all—correct. I don’t often run into situations where performance is a concern.

Despite this, I know (only) a few things about how computers work. I can tell you that modern processors implement a pipelining architecture, where they achieve greater performance by attempting to execute multiple machine instructions at the same time, or even in different orders. I also know that branches, such as that if (h == 0) above, pose an interesting challenge to the processor.

What if the processor is trying to dispatch the conditional branch (if (h == 0)), but it has not yet finished computing the new value of h from the previous step? The CPU will guess. Perhaps it’ll decide that h is probably going to be zero, so it’ll continue along and set h to 12. Disastrously, if the processor comes to realize that h % 12 did not yield zero, it will effectively have to throw away the work it did, jump back to the if statement, and start over.

I also know this code will be running on a phone—potentially very frequently—as a user quickly scrolls through items in a table view. Furthermore, my mathematical intuition tells me I should be able to express this function using another modulus instead of a branch, which would keep the processor’s pipeline chugging along happily. I wondered aloud whether it was worth my time to bother refactoring.

The Experiment

This is when Job, my partner on the project, chimed in. As I expected, he said there’s no way it could be worth our time, as modern compilers should be able to see right through such simple code. Regardless, we were both feeling exceptionally curious about what the compiler would actually do, and we couldn’t just leave it at that. So Job introduced me to gcc.godbolt.org, and we began to look at the assembly output for two implementations of this function. In particular, this is the code that would be generated by LLVM Clang 3.8 with the -Os flag.


int branchingNormalize(int h) {
  h = h % 12;
  if (h == 0) h = 12;
  return h;
}

int modulusNormalize(int h) {
  return ((h + 11) % 12) + 1;
}
branchingNormalize(int):
        movsxd  rcx, edi
        imul    rax, rcx, 715827883
        mov     rdx, rax
        shr     rdx, 63
        shr     rax, 33
        add     eax, edx
        shl     eax, 2
        lea     eax, [rax + 2*rax]
        sub     ecx, eax
        mov     eax, 12
        cmovne  eax, ecx
        ret
modulusNormalize(int):
        add     edi, 11
        movsxd  rax, edi
        imul    rcx, rax, 715827883
        mov     rdx, rcx
        shr     rdx, 63
        sar     rcx, 33
        add     ecx, edx
        shl     ecx, 2
        lea     ecx, [rcx + 2*rcx]
        sub     eax, ecx
        inc     eax
        ret

What’s Going On Here?

As Job predicted, these implementations look vastly more similar than dissimilar. They even share the same number of instructions.

Naturally, I first wanted to know where the branch was—if there was one at all. Indeed, there isn’t. Instead, the second-to-last instruction of branchingNormalize is cmovne, which is a conditional move. This might seem to be a petty distinction on the surface, but it’s actually quite significant. This conditional move will end up potentially modifying the eax register, but it would never result in changing which instructions need to be executed next. The processor is quite capable of handling data dependencies between instructions as it parallelizes and reorders them.

Looking beyond that, I was fairly confused about the rest of the instructions; there’s no integer division at all.

From grade school, you may remember how much more difficult it was to divide two numbers than it was to multiply them. You may also remember, from your college computer architecture class, that division is similarly more expensive for processors. Personally, I had long buried both of those memories. After all, that’s why I have a computer, and it never seemed to make a fuss about it.

That’s totally okay, as it turns out—Clang has my back. Rather than ask the processor to perform division in pursuit of our remainder, it’s figured out that it can achieve the same thing by utilizing a combination of multiplication, bit shifting, addition, and subtraction. The constants 715827883, 33, and 63 were picked by the compiler so that the multiplication and shifts behave like a fixed-point multiplication by 1/12, tuned to the number of bits used to store the integers. Essentially, really cool magic.

Conclusion

This was an extremely fun diversion, and it’s one of the many examples of why it’s great to work with such smart colleagues. Without Job’s enthusiasm, it’s doubtful I would have actually done this analysis.

At first thought, it made me slightly uncomfortable to see how well the compiler can optimize. If the compiler doesn’t care, should I? What’s the point in paying attention to those small details in my code unless I actually observe a problem?

Upon slightly deeper reflection, however, I find that fear to be silly. I create much more value for the world, and have more fun, when I’m thinking and working at a higher level. I want to be as close as possible to pure logic, not micro-managing (pun intended?) hardware.

Conversation
  • joekinley says:

    The “branchingNormalize” function has a typo in the return. You return n not h.

  • Nathan Nyme says:

    Why not:

    int branchingNormalize(int h) {
        if (h > 12) return h - 12;
        return h;
    }

    • Chris Farber says:

      We were working with some time ranges that could spill over into the next day, therefore h may sometimes be > 24. Regardless, that’s not the point!

      • Nathan Nyme says:

        Ok, understood :)
        I thought the point was “how to obtain the most efficient asm” and my answer was “keep it simple and literal,” but you are right!

  • Gautham Ganapathy says:

    You could try this:

    int branchingNormalize(int h) {
        if (h >= 12)
            h -= 12;
        if (h == 0)
            h = 12;
        return h;
    }

    Results in the following code for the same compiler and options:
    branchingNormalize(int):                # @branchingNormalize(int)
            leal    -12(%rdi), %ecx
            cmpl    $11, %edi
            cmovlel %edi, %ecx
            testl   %ecx, %ecx
            movl    $12, %eax
            cmovnel %ecx, %eax
            retq

  • The Dude says:

    ((h - 1) % 12) + 1

    • DDT says:

      static int time_us[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 };
      #define f(a) time_us[(a)]

      Instead of writing a function, you make a macro:

      int time = f(14);

      time is 2 now.

  • Shon says:

    I spent the first few years of my career in the mid ’80s programming in assembly language, and at that time it was very easy to write faster assembly than the compiler would generate, at least for traditional CISC processors. As RISC became a big player and CISC started incorporating more RISC concepts like parallel pipelines and out-of-order execution, it became increasingly difficult to beat the compiler, and now it is, I would say, almost impossible without an incredible amount of work.

  • Dan McLeran says:

    Profiling would have told you to not waste your time optimizing here.

  • Dan Neely says:

    I’m curious if this is a widespread micro-optimization or currently a clang specific trick? Has anyone looked at what GCC or MSVC do?

  • TS says:

    In most cases time is non-negative, thus “unsigned int” might be more appropriate.
    Might result in different (simpler?) compiler output.

    Note that I wouldn’t call that compiler behaviour “smart” (it still can only apply transformations it knows and has no idea about the actual purpose of the code it compiles and the nature of the data processed) but rather “efficient”: no matter if the code is compiled for a different/newer platform, transformed to SIMD instructions, parallelized, or inlined, the compiler instantly adapts at almost zero cost, thus saving months or years of doing it manually or not at all.

  • TS says:

    Also, knowing that the compiler will optimize the algebra and control flow in a machine-friendly way anyway allows the programmer to focus on a different matter: optimizing the code for readability and clarity.
    Speed optimization is still a primarily human task, though with today’s compilers and machines it happens at the algorithmic and data-structure level rather than the machine-code level.
