Article summary
I have been familiar with the concept of tail recursion for some time, though I had never explored it in depth. In this post I’ll look at an example of what happens when a tail recursive function is optimized on x86 targeting OS X.
Recursion Limitations
Recursion is common in software, especially in functional programming languages. Unlike iteration (e.g., while or for loops), recursion has the unfortunate side effect of growing the execution stack (a finite region of memory that a program uses during its execution). This can cause problems when many repeated recursive calls are made, as the stack may run out of space. It’s also slower than iteration due to the extra stack frame creation and tear down.
Luckily, certain types of recursive calls can be re-written to execute in an iterative manner where they use constant stack space. Whether or not this optimization happens depends on the structure of the recursive call, the language, the compiler, and compiler options used.
Understanding Tail Calls
Let’s start by talking about tail calls. According to Wikipedia, a tail call is a function call that is made as the final action of a procedure. If that tail call invokes the same function, the function is said to be tail recursive. Let’s look at a few examples.
Example 1
// Example 1
int bar(int y); // defined elsewhere
int foo(int x) {
    return bar(x + 1); // this is a tail call
}
Example 2
int foo_recursive(int x) {
    if (x > 0)
        return foo_recursive(x - 1); // this is a tail recursive call
    return 0; // base case
}
Example 3
int foo_not_tail_call(int x) {
    if (x > 0)
        return 1 + foo_not_tail_call(x - 1); // this is not a tail call
    return 0; // base case
}
In Example 1, the call to bar is a tail call. In Example 2, the call foo_recursive makes to itself is a tail call, making it an example of tail recursion. In Example 3, the call in foo_not_tail_call is not a tail call, because an addition (1 + …) still has to happen after the call returns. I’ve found that some compilers can optimize the third scenario, probably by rewriting it to look more like Example 2, but general tail call optimization is a complex subject. To keep things simple, we’ll ignore the Example 3 scenario and focus on tail recursion (i.e., Example 2) for the rest of this post.
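To make the connection between Examples 2 and 3 concrete, here is a sketch (my own, not from the original examples) of how the non-tail-recursive function could be rewritten into tail-recursive form: the pending "1 +" is folded into a new accumulator parameter, so the recursive call becomes the final action. The names foo_acc and foo_tail_call are mine.

```c
#include <assert.h>

/* Example 3 rewritten in the shape of Example 2: the work that used to
 * happen after the recursive call returned (adding 1) is instead done
 * before the call, in the accumulator. */
static int foo_acc(int x, int acc) {
    if (x > 0)
        return foo_acc(x - 1, acc + 1); /* tail call: nothing happens after it */
    return acc;
}

int foo_tail_call(int x) {
    return foo_acc(x, 0); /* seed the accumulator */
}
```

With a base case returning 0, foo_not_tail_call(5) and foo_tail_call(5) compute the same value, but only the latter is a candidate for the optimization discussed below.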
Tail Recursion Deep Dive
With that general idea in mind, let’s look at an example and see how it gets optimized. Below is a factorial function that calls a recursive factorial_iterator function. The factorial_iterator function calls itself with a decreasing n value and an accumulator. This accumulator model is a common technique for making a function tail recursive (and therefore eligible for optimization).
int factorial_iterator(int n, int accumulator); // forward declaration

int factorial(int n) {
    return factorial_iterator(n, 1);
}

int factorial_iterator(int n, int accumulator) {
    if (n <= 1) {
        return accumulator;
    } else {
        accumulator = accumulator * n;
        return factorial_iterator(n - 1, accumulator);
    }
}
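One caveat worth noting before we look at the assembly: with a 32-bit int accumulator, the result overflows quickly (13! already exceeds INT_MAX), so the interesting behavior here is the calling pattern, not large factorials. A self-contained copy for experimenting, with the forward declaration C requires:

```c
#include <assert.h>

int factorial_iterator(int n, int accumulator); /* forward declaration */

int factorial(int n) {
    return factorial_iterator(n, 1);
}

int factorial_iterator(int n, int accumulator) {
    if (n <= 1)
        return accumulator;
    accumulator = accumulator * n;      /* do the work before the call... */
    return factorial_iterator(n - 1, accumulator); /* ...so this is a tail call */
}
```

For example, factorial(4) is 24 and factorial(12) is 479001600, the largest factorial that fits in a signed 32-bit int.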
I compiled this example using clang/LLVM on OS X Mavericks with no compiler optimizations, and have included a disassembly of the code below, annotated with comments describing what each line does. For a more detailed description of each x86 instruction, refer to Volume 2 of the Intel Software Developer's Manual. Note that depending on the compiler you're using, the platform you're targeting, its calling conventions, and so on, your disassembly might look quite different from mine.
0000000100000f20 <_factorial>: // Input: edi contains n
100000f20: push rbp // save stack base ptr
100000f21: mov rbp,rsp // setup frame base ptr
100000f24: sub rsp,0x10 // move stack ptr to end of frame
100000f28: mov esi,0x1 // initialize accumulator (esi) to 1
100000f2d: mov DWORD PTR [rbp-0x4],edi // spill edi (n) to the stack
100000f30: mov edi,DWORD PTR [rbp-0x4] // reload n into edi
100000f33: call 100000ed0 // save instruction ptr, jmp to factorial_iterator
100000f38: add rsp,0x10 // collapse stack frame
100000f3c: pop rbp // restore stack base ptr
100000f3d: ret // return, result is in eax
0000000100000ed0 <_factorial_iterator>: // Input: edi (n), esi (accumulator)
100000ed0: push rbp // save previous stack base pointer
100000ed1: mov rbp,rsp // setup frame base ptr
100000ed4: sub rsp,0x10 // adjust stack ptr to end of stack frame
100000ed8: mov DWORD PTR [rbp-0x8],edi // move edi (holds input param n) to the stack
100000edb: mov DWORD PTR [rbp-0xc],esi // move esi (holds input param accumulator) to stack
100000ede: cmp DWORD PTR [rbp-0x8],0x1 // compare input param n with 1 (to set flags)
100000ee5: jg 100000ef6 // if n > 1 jump to address 0x100000ef6
100000eeb: mov eax,DWORD PTR [rbp-0xc] // else move accumulator to eax
100000eee: mov DWORD PTR [rbp-0x4],eax // store it in the result slot
100000ef1: jmp 100000f15 // jmp to tear down
100000ef6: mov eax,DWORD PTR [rbp-0xc] // move accumulator into eax
100000ef9: imul eax,DWORD PTR [rbp-0x8] // accumulator = accumulator * n, result in eax
100000efd: mov DWORD PTR [rbp-0xc],eax // put result on stack
100000f00: mov eax,DWORD PTR [rbp-0x8] // load n into eax
100000f03: sub eax,0x1 // n = n - 1
100000f08: mov esi,DWORD PTR [rbp-0xc] // move current accumulator value into esi
100000f0b: mov edi,eax // move current n value into edi
100000f0d: call 100000ed0 // save instruction ptr on stack and jmp
100000f12: mov DWORD PTR [rbp-0x4],eax // store the returned value in the result slot
100000f15: mov eax,DWORD PTR [rbp-0x4] // move the result to eax
100000f18: add rsp,0x10 // collapse stack frame
100000f1c: pop rbp // restore stack base ptr
100000f1d: ret // return, result is in eax
There are a few nuances that are important to understand in the above disassembly. First, all input parameters and return values are passed in registers (not on the stack). The input parameter n is passed into _factorial via the edi register. Similarly, n is also passed into _factorial_iterator via edi, and accumulator is passed in via esi. All return values are returned in the eax register.
The _factorial_iterator function makes recursive calls to itself, which can be observed via the call 100000ed0 line. Each call creates a new stack frame (also called an activation record) for the newly-called function to use during its execution.
Non-optimized example
To help visualize how the stack grows and what happens to the stack and register state when the above code runs, I've created an animation that computes 4 factorial (i.e., n = 4). The animation steps through _factorial_iterator and displays the stack and register state after each instruction. Recall that the input n, which is 4 for this example, is passed in via the edi register.
I tried to make the above animation as accurate as possible while still capturing the fundamentals of how the register state and stack change as the code executes. The most important takeaway is that the stack grows with each recursive call. The animation computes 4!, which produces four stack frames. You can imagine how larger inputs would result in the stack consuming a lot of memory; for sufficiently large inputs, the stack can easily run out of space, resulting in a stack overflow.
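The per-call frame growth can also be observed directly in C. The probe below is a hypothetical experiment of my own (not from the post): it records the address of a local variable at each recursion depth, so that diffing adjacent entries shows each frame sitting below the previous one when compiled without optimization. With optimization enabled, the compiler may reuse a single frame, in which case the recorded addresses can all be identical.

```c
#include <assert.h>
#include <stdint.h>

static uintptr_t frame_addr[5]; /* address of `local` at depths 1..4 */

static void probe(int depth) {
    int local;                          /* lives in this call's stack frame */
    frame_addr[depth] = (uintptr_t)&local;
    if (depth < 4)
        probe(depth + 1);               /* recursive (tail) call */
}
```

Calling probe(1) fills frame_addr; at -O0, frame_addr[1] - frame_addr[2] is the per-call frame size (0x10 plus the saved rbp and return address in our disassembly).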
Optimized example
Now that we've seen what happens under normal (non-optimized) operating conditions, let's look at an optimized example. Tail call optimization in clang can be enabled via the -O1 optimization level flag; the process is similar for gcc. The optimized disassembly looks like this.
0000000100000f40 <_factorial>: // Input: edi contains n
100000f40: push rbp // save stack base ptr
100000f41: mov rbp,rsp // setup frame base ptr
100000f44: mov esi,0x1 // move 1 to esi (initialize accumulator)
100000f49: pop rbp // restore stack base ptr
100000f4a: jmp 100000f10 // jmp directly to factorial_iterator
0000000100000f10 <_factorial_iterator>: // Input: edi contains n, esi contains accumulator
100000f10: push rbp // save stack base ptr
100000f11: mov rbp,rsp // setup frame base ptr
100000f14: cmp edi,0x2 // compare n to 2
100000f17: jl 100000f2d // if n < 2 jmp to 0x100000f2d
100000f19: nop // padding to align the loop target
100000f20: imul esi,edi // accumulator = accumulator * n
100000f23: cmp edi,0x2 // compare n and 2
100000f26: lea eax,[rdi-0x1] // eax = n - 1
100000f29: mov edi,eax // n = eax
100000f2b: jg 100000f20 // if n > 2 jmp to 0x100000f20 (flags still set from cmp)
100000f2d: mov eax,esi // move accumulator to eax
100000f2f: pop rbp // restore base ptr
100000f30: ret // return, result is in eax
This time around, the assembly looks very different. Notice that we don't even call factorial_iterator from factorial; we just jmp directly to it. Similarly, there are no call instructions in factorial_iterator. In fact, there is no stack writing or reading at all! Instead, all computations are performed via register operations that amount to something like a while loop. For completeness, here is the much simpler optimized animation.
As we would expect, there is no stack activity. The optimized example also executes roughly a quarter as many instructions as the original.
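The loop the optimizer produced can be rendered back into C. This is my own sketch of what the optimized assembly computes, not code the compiler emitted; the comments map each line to the instructions in the listing above.

```c
#include <assert.h>

/* The recursion has become a loop that updates n and the accumulator
 * in registers, using constant stack space. */
int factorial_loop(int n) {
    int accumulator = 1;      /* mov esi,0x1             */
    while (n > 1) {           /* cmp edi,0x2 / jl exit   */
        accumulator *= n;     /* imul esi,edi            */
        n = n - 1;            /* lea eax,[rdi-0x1]; mov edi,eax */
    }
    return accumulator;       /* mov eax,esi             */
}
```

Note the loop condition matches the assembly's cmp edi,0x2: the loop runs while n is at least 2, and n of 0 or 1 falls straight through to return the accumulator.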
Details
For those interested, all the examples in this post were compiled with clang/LLVM on OS X Mavericks. The LLVM version is:
Apple LLVM version 6.0 (clang-600.0.51) (based on LLVM 3.5svn)
Target: x86_64-apple-darwin14.0.0
Thread model: posix
I used gobjdump to disassemble the compiled binaries and display them using Intel assembly syntax.
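For reference, the build and disassembly commands look roughly like this (the source file name is hypothetical; gobjdump ships with GNU binutils, installable via Homebrew):

```shell
# Build without optimization (recursive version) and with -O1
# (tail-call optimized version).
clang -o factorial_O0 factorial.c
clang -O1 -o factorial_O1 factorial.c

# Disassemble each binary, displaying Intel assembly syntax.
gobjdump -d -M intel factorial_O0
gobjdump -d -M intel factorial_O1
```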