Decentralized Garbage Collection

When I visited Washington D.C. for ASPLOS/VEE, I was working on my paper about cascade failure, and had distributed systems on my mind. Some of the papers at the joint conference were about garbage collection. One of the more interesting presented an approach called leak pruning for dealing with logical leaks. After a while, I had an idea that merged a bit of these two fields.

You see, distributed systems and garbage collection are both about graph problems. In the case of distributed systems, it’s usually about minimizing traffic along the edges, or performing some kind of global computation by communicating with neighbors. In garbage collection, we’re concerned with reference cycles and reachability. But what if you ask: is it possible for an object (a node in the graph) to decide, for itself, when it has been disconnected from a root node? In answering this question I came up with a very simple approach. I’ve been sitting on it for a while, but just yesterday I typed the paper up in a hurry and submitted it to the Fun Ideas and Thoughts session of PLDI.

The primary idea is this: If root nodes are tagged with a high value, and heap nodes are tagged with a low value, and each object acts as a node performing a distributed computation, then the average value of any subgraph that becomes disconnected from the roots will continuously decay. When it falls below a threshold value, that whole cluster can safely be collected.
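To make this concrete, here is a minimal sketch in C of the per-object metadata and one possible propagation rule. The field names, constants, and the exact blend-and-decay formula are my own guesses rather than the formula from the paper; the point is just that a decay factor below one makes isolated clusters drift downward while reachable nodes keep being refreshed from the roots.

```c
#include <stddef.h>

/* Minimal sketch of the per-object metadata; field names are my own. */
typedef struct gc_node {
    double           worthiness;  /* high for roots, low for fresh heap objects */
    size_t           ref_count;   /* used later for the acyclic case            */
    struct gc_node **refs;        /* objects this node points to                */
    size_t           num_refs;
    int              in_queue;    /* avoids enqueueing a node twice             */
} gc_node;

#define ROOT_WORTHINESS   1.0e308 /* roughly DBL_MAX, as suggested later on     */
#define HEAP_WORTHINESS   1.0
#define DECAY             0.9     /* hypothetical per-propagation decay factor  */
#define COLLECT_THRESHOLD 1.0e-3  /* hypothetical "safe to collect" cutoff      */

/* One possible propagation rule: a referenced object blends its value with the
 * visiting node's value, then decays.  A cluster cut off from the roots only
 * sees low values circulating among its own members, so its average keeps
 * falling, while reachable nodes are continually refreshed by high values
 * flowing outward from the roots. */
static void propagate(const gc_node *from, gc_node *to)
{
    to->worthiness = DECAY * 0.5 * (to->worthiness + from->worthiness);
}
```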

There are some implementation details, though. It’s going to be easiest to process these updates through an event queue in a separate garbage collection thread, rather than as a real distributed computation. When updating a single node, if we add to the queue every node it points to, the queue will rapidly grow without bound. But we don’t actually care much about how quickly the average worthiness is computed, so we can easily get away with adding only those nodes that are not already present in the queue.
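A sketch of the collector thread’s loop, building on the gc_node fields above; gc_queue, queue_pop, and queue_push are hypothetical FIFO helpers, and the in_queue flag is what implements the “not already present” check:

```c
/* Pop one node and push its worthiness onward to the objects it references.
 * Only children that are not already queued get enqueued, which keeps the
 * queue from growing without bound. */
static void gc_step(gc_queue *q)
{
    gc_node *node = queue_pop(q);
    if (node == NULL)
        return;
    node->in_queue = 0;

    for (size_t i = 0; i < node->num_refs; i++) {
        gc_node *child = node->refs[i];
        propagate(node, child);
        if (!child->in_queue) {
            child->in_queue = 1;
            queue_push(q, child);
        }
    }
}
```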

The second problem is that references are directional. If we rely only on the queue to hold references to objects, it is possible for a set of nodes to become stranded, referenced neither by the queue nor by the program. For example, consider the deleted branch of a tree: we would initially process the root of the branch and proceed down to the leaves, but since the branch has no cycles we would lose all references to it, because processed nodes are not re-entered into the queue (the program asked for the branch to be deleted, and the garbage collector does not maintain a list of all allocation sites). This problem only happens for acyclic structures, though, so it can be solved by reference counting. When we go to update a node, it is safe to collect it if its reference count is zero, regardless of its worthiness value.
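Continuing the same sketch, the update-time check might look like the following; free_node() is a stand-in for whatever the allocator actually provides, and the threshold test is the one from the main idea above:

```c
/* A reference count of zero means the program no longer holds the node, and
 * without cycles nothing will re-enqueue it, so it can be freed regardless of
 * its worthiness; the worthiness test handles cyclic, disconnected clusters. */
static int try_collect(gc_node *node)
{
    if (node->ref_count == 0 || node->worthiness < COLLECT_THRESHOLD) {
        free_node(node);   /* hypothetical allocator hook */
        return 1;          /* collected */
    }
    return 0;              /* still considered live */
}
```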

The third problem is that the average worthiness value of the program-reachable cluster of nodes will go down with the number of objects held by the program. This might limit the program size, but I think that the natural range of a double should be enough to solve this problem. Root nodes can have the value MAX_DOUBLE, and even if the program holds 10^6 objects, the average only drops by six orders of magnitude, which is still enormous.
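A quick back-of-the-envelope check of that claim, using DBL_MAX in the role of MAX_DOUBLE:

```c
#include <stdio.h>
#include <float.h>

int main(void)
{
    double root_value = DBL_MAX;  /* ~1.8e308 */
    double objects    = 1.0e6;
    /* Even averaged over a million live objects, the reachable cluster's value
     * is around 1.8e302, astronomically far above any plausible threshold. */
    printf("average worthiness ~ %g\n", root_value / objects);
    return 0;
}
```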

A fourth problem is the decay rate of unreachable clusters. This will be a function of both the computed worthiness at the time of disconnection and the number of nodes in the cluster. I will probably have to adjust the formula given in my paper so that it decays faster. Unfortunately, this and the third problem above are in conflict with each other: high-valued root nodes allow programs with more objects, but they also extend the decay time.
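For illustration only, assuming the decay behaved like a simple per-round exponential (which is not necessarily the formula in the paper), the number of propagation rounds before a cluster crosses the threshold grows with the starting worthiness, which is exactly the tension:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    double decay = 0.9, threshold = 1.0e-3;

    /* Rounds until w0 * pow(decay, t) drops below the threshold:
     * t = log(threshold / w0) / log(decay).  A larger starting worthiness
     * (needed to support more live objects) directly stretches the decay time. */
    double w0_small = 1.0e6, w0_large = 1.0e300;
    printf("rounds from 1e6:   %.0f\n", log(threshold / w0_small) / log(decay));
    printf("rounds from 1e300: %.0f\n", log(threshold / w0_large) / log(decay));
    return 0;
}
```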

I don’t claim this approach is efficient, but I do claim that it’s simple to implement and should scale nicely. It might also be useful for process collection in a distributed system. I’ll know if the idea was too crazy when the review committee’s results are released on May 29.