Visualizing Garbage Collection Algorithms

by: Ken Fox

September 3, 2014

Article summary

Cleanup At The End: aka No GC
Reference Counting Collector
Mark-Sweep Collector
Mark-Compact Collector
Copying Collector

Most developers take automatic garbage collection for granted. It’s just another amazing feature provided by our language run-times to make our jobs easier.

But if you try to peek inside a modern garbage collector, it’s very difficult to see how they actually work. There are thousands of implementation details that will confuse you unless you already have a good understanding of what it’s trying to do and how they can go fantastically wrong.

I’ve built a toy with five different garbage collection algorithms. Small animations were created from the run-time behavior. You can find larger animations and the code to create them at github.com/kenfox/gc-viz. It surprised me how much a simple animation reveals about these important algorithms.

Cleanup At The End: aka No GC

The simplest possible way of cleaning up garbage is to just wait until a task is done and dispose of everything at once. This is a surprisingly useful technique, especially if you have a way of breaking up a task into pieces. The Apache web server, for example, creates a small pool of memory per request and throws the entire pool away when the request completes.

The small animation to the right represents a running program. The entire image represents the program’s memory. Memory starts out colored black, which means it isn’t used. Areas that flash bright green or yellow are memory reads or writes. The color decays over time so you can see how memory was used, but also see current activity. If you watch carefully, you can see patterns emerge where the program begins to ignore some memory. Those areas have become garbage — they are not used and not reachable by the program. Everything else that isn’t garbage is “live”.

The program easily fits in memory without needing to worry about cleaning up garbage while the program is running. I’m going to stick with this simple program for the rest of the examples.

Reference Counting Collector

Another simple solution is to keep a count of how many times you are using a resource (an object in memory, in this case) and dispose of it when the count drops to zero. This is the most common technique that developers use when they add garbage collection to an existing system —it’s the only garbage collector that easily integrates with other resource managers and existing code bases. Apple learned this lesson after releasing a mark-sweep collector for Objective-C. It caused so many problems that they retracted the feature and replaced it with an automated reference counting collector that works much better with existing code.

The animation shows the same program as above, but this time it will try to dispose of garbage by keeping a reference count on each object in memory. The red flashes indicate reference counting activity. A very useful property of reference counting is that garbage is detected as soon as possible — you can sometimes see a flash of red immediately followed by the area turning black.

Unfortunately reference counting has a lot of problems. Worst of all, it can’t handle cyclic structures. These are very common — anything with a parent or reverse reference creates a cycle which will leak memory. Reference counting also has very high overhead — you can see in the animation that red flashes are constantly happening even when memory use is not increasing. Arithmetic is fast on a modern CPU, but memory is slow, and the counters are being loaded and saved to memory often. All these counter updates also make it difficult to have read-only or thread-safe data.

Reference counting is an amortized algorithm (the overhead is spread over the run-time of the program), but it’s an accidentally amortized algorithm that can’t guarantee response times. For example, say a program is working with a very large tree structure. The last piece of the program that uses the tree will trigger the disposal of the entire tree, which Murphy will guarantee happens when the user least desires the delay. None of the other algorithms here are amortized either though, so accidentally amortized may be a feature depending on your data. (All of these algorithms do have concurrent or partially-concurrent variations, but those are beyond the capabilities of my toy program to demonstrate.)

Mark-Sweep Collector

Mark-sweep eliminates some of the problems of reference count. It can easily handle cyclic structures and it has lower overhead since it doesn’t need to maintain counts.

It gives up being able to detect garbage immediately. You can see that in the animation where there’s a period of activity without any red flashes, then suddenly a bunch of red flashes indicate where it is marking live objects. After marking is finished, it sweeps over all of memory and disposes of garbage. You can see that in the animation too — several areas turn black all at once instead of more spread out over time in the reference counting approach.

Mark-sweep requires more implementation consistency than reference counting and is more difficult to retrofit into existing systems. The mark phase requires being able to traverse all live data, even data encapsulated within an object. If an object doesn’t provide traversal, it’s probably too risky to attempt to retrofit mark-sweep into the code. The other weakness of mark-sweep is the fact the sweep phase must sweep over all of memory to find garbage. For systems that do not generate much garbage, this is not an issue, but modern functional programming style generates enormous amounts of garbage.

Mark-Compact Collector

One thing you may have noticed in the previous animations is that objects never move. Once an object is allocated in memory, it stays in the same place even if memory turns into a fragmented sea of islands surrounded by black. The next two algorithms change that, but with completely different approaches.

Mark-compact disposes of memory, not by just marking it free, but by moving objects down into the free space. Objects always stay in the same memory order — an object allocated before another object will always be lower in memory — but gaps caused by disposed objects will be closed up by objects moving down.

The crazy idea of moving objects means that new objects can always just be created at the end of used memory. This is called a “bump” allocator and is as cheap as stack allocation, but without the limitations of stack size. Some systems using bump allocators don’t even use call stacks for data storage, they just allocate call frames in the heap and treat them like any other object.

Another benefit, sometimes more theory than practice, is that when objects are compacted like this, programs have better memory access patterns that are friendly to modern hardware memory caches. It’s far from certain you will see this benefit, though — the memory allocators used in reference counting and mark-sweep are complex, but also very well debugged and very efficient.

Mark-compact is a complex algorithm requiring several passes over all allocated objects. In the animation you can see the red flashes of live object marking followed by lots of reads and writes as destinations are computed, objects are moved and finally references are fixed to point to where objects have moved. The main benefit of all this complexity is operating under extremely low memory overhead. Oracle’s Hotspot JVM uses several different garbage collection algorithms. The tenured object space uses mark-compact.

Copying Collector

The last algorithm I’ve animated is the foundation of most high-performance garbage collection systems. It’s a moving collector like mark-compact, but it’s incredibly simple. It uses two memory spaces and simply copies live objects back and forth between them. In practice, there are more than two spaces and the spaces are used for different generations of objects. New objects are created in one space, get copied to another space if they survive, and finally copied to a tenured space if they are very long-lived. If you hear a garbage collector described as generational or ephemeral, it’s usually a multi-space copy collector.

Other than simplicity and flexibility, the main advantage of this algorithm is that it only spends time on live objects. There is no separate mark phase that must be later swept or compacted. Objects are immediately copied during live object traversal, and object references are patched up by following a broken-heart reference where the object used to be.

In the animation, you can see there are several collections where almost all the data is copied from one space to the other. This is a terrible situation for this algorithm and shows one of the reasons why people talk about tuning garbage collectors. If you can size your memory and tune your allocations so that most objects are dead when the collection begins, then you get a fantastic combination of safe functional programming style and high performance.

Conversation

John Fisher says:

September 3, 2014

This is very interesting- great job with the visualizations!

Soner Gönül says:

September 3, 2014

Does .NET Framework use one of this algorithms?

Ken Fox says:

September 3, 2014

Yes. The CLR has a generational collector. It’s like the copy collector visualization above, but a lot more sophisticated. The main complication is that it can run concurrently with your code–the collector runs in another thread and it uses a write barrier to make sure that your application code always sees a consistent view of memory.

I don’t know any of the details behind Microsoft’s algorithms though, sorry.

Tom says:

September 3, 2014

“Arithmetic is fast on a modern CPU, but memory is slow, and the counters are being loaded and saved to memory often.”

This is true, but a GC algorithm can choose where to put the refcount data. OS X puts all refcounts in a hash table (not on the objects themselves), and with the huge CPU caches that x64 has these days, it’s probably already in the fastest memory it can be. iOS on ARM64 takes it a step further and puts the refcount (or at least the first 19 bits of it) right in the isa pointer, so you get the first half-a-million references almost for free.

In general “memory” is slow but if it’s already in L1 there’s essentially no penalty (it’s about as fast to read from L1 as it is to execute an instruction). If it’s in L2, it’s only a 10x penalty. Using memory isn’t inherently bad. What matters is whether the algorithm is putting the most frequently used data in the fastest tiers of the memory hierarchy.

“All these counter updates also make it difficult to have read-only or thread-safe data.”

Well, yeah, threads are a dangerous tool, and thread safety is a hard problem no matter what strategy you use. That’s probably why all the cool multiprocessing languages that are newer than C (Erlang, Clojure, etc.) use completely different concurrency primitives. :-)

“it’s an accidentally amortized algorithm that can’t guarantee response times.”

True, but remember that most memory management systems in the world don’t guarantee response times — even good ol’ malloc/free, and even if you’re not using virtual memory! Real-time systems are a whole other can of worms. A fun can of worms, for sure, but not one that’s relevant to most programmers using most GC systems.

Ken Fox says:

September 3, 2014

There’s no question that Apple has put a lot of effort into optimizing their reference counting system and I think that says a lot about how poorly a naive reference count algorithm performs. The external ref count table is a nice hack. They also avoid adding counts to the table when an object is constructed–it’s an optimization based on the weak generational hypothesis that most objects die young. Now that they’ve moved to ARC, Apple can do a lot more optimizations, but I haven’t been watching very closely since they announced the 64-bit object ISA field.

A lot of Apple’s runtime is Open Source, but it’s not very easy to put all the pieces together.

Christopher Smith says:

September 4, 2014

I’d argue that if anything copying collectors are more friendly to modern memory architectures. They do a reasonably good job of organizing related data based on time of allocation (which tends to correlated with access patterns), which tends to make caching work pretty well… effectively amortizing the costs of badly organized allocators in to “copies”, and leaving thing quite well optimized in between.

Kushagra Sinha says:

September 4, 2014

Thanks for this great post!

However, I disagree with “Unfortunately reference counting has a lot of problems. Worst of all, it can’t handle cyclic structures.”

PHP (>=5.3.0) uses reference counting but is still able to handle cyclic references. Details.

Another solution is to use weak_ptr like in C++.

Ken Fox says:

September 4, 2014

There are many extensions to reference counting–researchers keep looking at hybrid algorithms to improve GC performance. That PHP implementation is interesting and I hadn’t seen it before, thanks! PHP has a broken link to the paper, but you can find it in CiteSeer as “Concurrent cycle collection in reference counted systems” by Bacon and Rajan. (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.132.2464) There are surprisingly few references to the paper.

The extension to reference counting in the paper uses a tracing algorithm to find cycles. One of the practical implementation advantages of ref counts is that it doesn’t require tracing, so I’d almost give this a whole new category somewhere closer to mark-sweep.

If you are interested in reference counting, there’s an even more advanced hybrid published by Microsoft last year. It was discussed on Lambda the Ultimate: http://lambda-the-ultimate.org/node/4825.

On the subject of weak references, they can be added to any GC system. They are vital in ref count systems because of the cycle problem. Some people have spun them as describing object “ownership”, but I think that gives them a little too much importance. It’s a cool hack that gives extra information to the collector. If a programmer has to worry about them constantly, it’s a sign of a weak collector algorithm. (Non-existent in the case of C++!)

Joseph says:

September 4, 2014

By the way, were would you put Azul C4 (http://www.azulsystems.com/technology/c4-garbage-collector ) ?

Ken Fox says:

September 4, 2014

I’ve never used Azul’s JVM, so this is the first I’ve read about it–it looks very interesting. C4 is a generational concurrent mark-compact collector. Anytime you get into concurrent collectors the complexity goes way way up. The mark-compact animation above can help you understand logically how the algorithm works, but Azul is compacting memory by changing the CPU’s TLB memory map. My animation makes it look like copying is expensive and that may not be true for this collector.
- Joseph says:
  
  September 4, 2014
  
  Well, their previous garbage collector was pause less for large heaps, so quite something. They used to have to produce their own hardware, but since CPU makers (intel and AMD) did the changes they were asking for. AFAIK they were trying to get the proper “primitives” in the JVM in order to be able to stop producing their own, but it’s not the case yet.
  
  Over, their new C4 is unknown to me and seems just full of magic, hence the question. Thanks for your reponse :)

Wyatt says:

September 5, 2014

Thank you for not claiming ARC isn’t GC; I’m getting tired of telling people they need to read Jones & Lins. ;)

(Better yet, you properly cover the time-cost drawbacks of the increment and decrement, as well as the freeing of dependent chains, which most people seem content to sweep under the rug! Bravo!)

Rob says:

September 8, 2014

Re: “Cleanup At The End: aka No GC”

There are different kinds of “No GC”, and “Cleanup At The End” isn’t really a realistic one.
Maybe you could add a more realistic “No GC” annimation for something like C++ RAII, what is probably the most viable real-live No-GC option in modern day programming.

Ken Fox says:

September 8, 2014

Isn’t C++ RAII a good example of cleanup at the end? Allocations happen (usually upon entering scope) and then everything is destroyed when the scope exits. Inside the scope there is no GC. C++ also uses reference count techniques with shared_ptr and unique_ptr (which is a degenerate form of reference counting since there can be only 0 or 1 references).

Aaron Stump says:

September 12, 2014

Hello, Ken. I enjoyed reading this post! I have one quibble, though: you say that if the reference count of the first record R in a large, otherwise unreferenced data structure falls to zero, then we will have to spend some unbounded amount of time reclaiming all the records in that structure. Thus we will get an unbounded pause time, just like tracing collectors. I disagree with this point, because we can defer decrementing the reference counts of the children of the record R until we decide to satisfy some subsequent allocation request by giving back R for the requested new memory. At that moment, we can decrement the reference counts of R’s children. This really will amortize the time to reclaim a big data structure over the subsequent allocation requests that consume the records from that structure. So we can get bounded pause times with reference counting, contrary to what you were arguing.
Aaron

Ken Fox says:

September 14, 2014

Thanks! The response has been great and I appreciate the comments. Several people have complained about my presentation of reference counting. I’m not sure if this is because people are more familiar with it, or if I’m just biased towards showing the problems because it’s so popular.

There are extensions to simple reference counting that improve it in many ways, but damage its popular strengths. With deferred reference counts you lose eager destruction (so it has the same problems as other GCs for expensive resources like file handles). Amortizing reference counting the way you describe also seems to require a fair amount of knowledge in the allocator about what’s on the free list, so it might also be difficult to integrate with an uncooperative allocator (like a system malloc). Do you end up using extra memory to manage the free list?

There are a bunch of concurrent ref count algorithms that might even have better performance on today’s hardware than an amortized algorithm.
- Aaron Stump says:
  
  September 15, 2014
  
  I appreciate your desire to question commonly accepted ideas. Funnily, it is this same impulse which leads to my interest in reference counting! Within academia, reference counting is widely considered inferior to tracing collection. That explains the title of this inspiring International Symposium on Memory Management (ISMM) 2012 paper: “Down for the count? Getting reference counting back in the ring”. If you google this title you will find a PDF. The paper argues that reference counting can be made as performant as tracing collection, despite the well-known problem, which you also mentioned, that reference counting nominally requires a lot of memory traffic to increment and decrement the reference counts all the time.
  
  By the way, some researchers have tried to decrease this traffic in various ways. I have a paper, “Resource Typing in Guru” (2010, available from the Papers section of my web page) that shows how you can use static resource typing to tell the compiler that one reference R to a data value is derived from another such reference R’, in such a way that R cannot possibly be dropped before R’. In this case, we do not need to modify reference counts for R at all, resulting in a significant performance improvement.
  
  The biggest problem, in my opinion, with reference counting is the inability to collect cycles. If we have to run a tracing collector alongside reference counting, we are incurring all the engineering complexity of a tracing collector. I have started working recently on ways to use the type system to rule out cyclic data structures statically. This would let use reference counting without the need for a back-up tracing collector. (Of course, we would also be prevented from using cyclic data structures, but my hope is that we could break the cycles ourselves with intermediary arrays.)

stackvoid says:

October 13, 2014

This is nice job

Weris C. says:

November 6, 2014

It’s just an amazing thought of creating a technical device to remove the garbage. Believe me, I have never heard about the robotic garbage collector before it.The rates of the recycling and disposal professional services are going high day to day. So, this trick would save the money and time of every person. Thanks Ken, it’s a wonderful experience to read your post.

Jon Harrop says:

March 4, 2015

Nice animations. Couple of issues.

Firstly, you said that .NET’s GC works like your copying collector. Your collector looks like Cheney semi-space which is no longer used and is very inefficient because it repeatedly copies mature objects. .NET’s GC is just generational with mark-sweep for the old generation which is quite common (OCaml does this too). You bump allocate into a nursery (the nursery does not move) and you copy survivors into an old generation where you do mark-sweep and they typically never move again. The main difference between that and what you’ve done is that objects generally only move once and not repeatedly as yours do.

Secondly, you wrote “A very useful property of reference counting is that garbage is detected as soon as possible” which is a common misconception. For example, objects referred to by the stack frame of a caller cannot be dereferenced during the execution of a callee even if that object will never be referenced again. Everything reachable from any global variable is never collected even if the program never refers to it again. And so on. Like every other memory management solution, reference counting keeps objects alive longer than is strictly necessary in the general case, aka “floating garbage”. According to folklore, tracing garbage collectors keep a lot more floating garbage around but I am not aware of any study that has actually quantified this.

Ken Fox says:

March 5, 2015

Thanks for the detail on the .Net GC. I didn’t cover it in the main post because I’m not familiar with the implementation. (It’s Open Source now so I should fix that!) When I said it was a generational copying collector, I didn’t mean that it used a naive copy collector like what my animation showed. Sorry if that mislead anyone. Lots of systems use a generational copying collector with a non-copying algorithm for the tenured generation. I should add that to the visualizations since it’s not hard to demo.

I have to disagree with you on ref counting though. Your distinction is not useful to people trying to understand the strengths and weaknesses of different algorithms. We can always come up with scenarios where there’s a lag between a resource becoming garbage and actually released, but ref counts shorten that lag enough that it is very common to put expensive resources (not just memory) under ref count control. Files and DLLs for example are often ref count managed.