madhadron, a programmer's miscellany

Progress by pruning trees

Episode Summary

There's a pattern of progress that's repeated itself several times in the history of programming: take a common problem that forms a graph, and find a useful way to prune it to a tree. Today we explore three such cases.

Episode Transcription

Today I want to talk about a particular pattern of progress that I've seen happen multiple times in the history of programming. It's a really simple idea: take a problem which forms a graph, and figure out a useful way to prune it into a tree.

This is how the most familiar graph algorithms we all learn work. Consider depth first traversal. Depth first traversal on a tree is very straightforward. I start with the root node and consider all its children. Then I pick one of those children, and consider all its children. I keep going until I hit a leaf node. Then I backtrack up one level and take the next child until I have exhausted all the children at that level. Then I backtrack up again, and continue until I have reached every node in the tree.
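To make that concrete, here is a minimal sketch of the traversal in Rust. The episode isn't tied to any language, and the Tree type here is invented just for illustration.

```rust
// A made-up tree type, just enough to show the traversal.
struct Tree {
    value: i32,
    children: Vec<Tree>,
}

// Visit the node, then recurse into each child in turn. Returning from the
// recursion is the "backtrack up one level" step described above.
fn depth_first(node: &Tree, visit_order: &mut Vec<i32>) {
    visit_order.push(node.value);
    for child in &node.children {
        depth_first(child, visit_order);
    }
}
```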

When I do the same thing on a directed graph, I can end up visiting the same node more than once, or even going in circles forever if the graph has a cycle. So as we traverse the graph, we keep a set of all the nodes we have already visited and ignore any edge we find that goes to an already visited node.

What this does is prune the graph to be a tree. We are taking the various paths to a node, choosing one path, and removing edges to cut all the others. If we did breadth first traversal we would choose a different path to keep among all of them. Dijkstra's algorithm and the A* algorithm choose which paths to keep based on weights on the edges.
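Here's the same idea as a rough sketch on a directed graph, with nodes as indices into an adjacency list. The visited set is what does the pruning: the edges we actually follow form a tree rooted at the starting node.

```rust
use std::collections::HashSet;

// Depth first traversal on a directed graph. Nodes are indices into an
// adjacency list; the visited set prunes edges that lead to nodes we have
// already seen.
fn depth_first_graph(adjacency: &[Vec<usize>], start: usize) -> Vec<usize> {
    let mut visited = HashSet::new();
    let mut order = Vec::new();
    let mut stack = vec![start];
    while let Some(node) = stack.pop() {
        if !visited.insert(node) {
            continue; // this edge led to an already visited node: prune it
        }
        order.push(node);
        for &next in &adjacency[node] {
            stack.push(next);
        }
    }
    order
}
```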

So, what important graphs in programming have we pruned into trees to good effect?

The most important one was when our field switched to structured programming. Before that, control flow in a program was done with GOTO. We annotated each line of a program with a number, and any other line could GOTO that number.

We turn this into a graph by breaking up the program into chunks of lines where only the last line of the chunk contains a GOTO. So for each chunk we accumulate lines without GOTOs until we get to one that has a GOTO. We add that last one, then start a new chunk.

For edges, we draw an edge from each chunk to each possible next chunk. If the GOTO is conditional, then this would be two edges, one to the next chunk in the program if the conditional is not true, and one to wherever the GOTO jumps if it is.
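If it helps to see it written down, here is one way to model that graph in Rust. The names are made up for illustration; compilers call these chunks "basic blocks," but nothing here is tied to any particular compiler.

```rust
// One chunk of lines: everything up to and including a line with a GOTO.
struct Chunk {
    lines: Vec<String>,     // the lines of the program in this chunk
    successors: Vec<usize>, // indices of the chunks control can reach next:
                            // one for an unconditional GOTO, two for a
                            // conditional one
}

// The whole program as a graph of chunks.
struct ControlFlowGraph {
    chunks: Vec<Chunk>,
    entry: usize,           // index of the chunk the program starts in
}
```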

This graph gets complicated. The usual description is "spaghetti." Any reasoning you might do about the state of the program, someone else can destroy with a few well placed GOTOs.

Structured programming replaces arbitrary GOTOs among chunks with nested blocks of conditionals and loops. A program that enters a block is guaranteed to exit that block at the end before entering another block at its same level. It may enter other blocks inside, but those will be exited before the outer block exits.

This makes control flow into a tree. Each block is a node, and all blocks it contains are children of that node. Reasoning about the state of the program is vastly simpler because you know the limited set of paths it came from.
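Here is a small Rust function, just to make the nesting concrete. Each block has one way in and one way out, so the blocks nest into a tree: the function body contains the if/else, the else arm contains the loop, and so on.

```rust
fn classify(n: i32) -> &'static str {
    if n < 0 {
        "negative"
    } else {
        let mut remaining = n;
        while remaining >= 10 {
            // the loop body is entered and exited as a unit
            remaining -= 10;
        }
        "non-negative"
    }
}
```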

So, that's the first and most important example of graph pruning as progress in programming.

The next one is quite a bit different. Let's consider build systems. The classic build system is make. In a Makefile you define a set of targets with commands to build those targets and dependencies on other targets that must be built first. It is explicitly a directed, acyclic graph of targets.

In practice, we have a set of conventions for some named targets we expect to find in a Makefile, like 'all' and 'test' and 'clean.' But there's no guarantee that those exist, and even if they do, there's no guarantee that they do anything sensible. We would expect a test target in a Java project to compile the code then run JUnit. If it doesn't? Too bad.

It would make much more sense for all Java projects to share their basic behavior: to all have a 'compile' target that builds bytecode, a 'test' target that runs all the unit tests, and a 'package' target that builds a jar.

Inevitably, though, we will have local differences, such as "Oh, we need to add this option when we run JUnit" or "There's an extra step here where we run a linter."

If we can specify common behavior plus local differences, that makes it much easier for someone coming into the project. A quick glance at the base behavior in use tells them the main targets they care about, and the local differences they need to understand are all clearly separated from the base behavior.

So the task is to specify a directed, acyclic graph and arbitrary patches on it. Just like with GOTOs, our patches must work correctly no matter what path we took through the graph to reach them. Plus, arbitrary diffs of graphs are a hard mathematical problem.

What happens if we prune these graphs? When we carry the pruning through, we find that having each target depend on one other target leads us to think of targets as more abstract stages rather than as handlers for particular tasks. Our build graph turns into a set of linear sequences of targets.

Handily, it turns out that diffs on a linear sequence are really easy. They are a combination of:

- making a change to the configuration of a target, such as adding an option when you run JUnit,
- adding a bit of behavior to happen right before a target runs, such as linting code before compiling it or running a preprocessor, or
- adding a bit of behavior to happen right after a target runs, such as cleaning up detritus created by the preprocessor and used during compilation.

Since we are only going through linear sequences, there is only one path that led to any of our local differences, so we can easily reason about whether it will behave correctly or not.

We can specify our common behavior as a set of linear sequences, so that the targets 'compile' and 'test' and 'package' are all defined and all share the same basic behavior. And we can specify our local changes to them in a clear way.
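Here is a sketch of that shape, in Rust for concreteness. The stage names and the hook structure are invented for illustration; Maven's real lifecycle model is richer than this, but the idea is the same: a fixed linear sequence of stages, plus three narrow kinds of local difference.

```rust
// A stage in a fixed, linear build lifecycle, plus the three kinds of local
// differences: configuration tweaks, steps to run before, steps to run after.
struct Stage {
    name: &'static str,
    options: Vec<String>, // configuration for the stage's own step
    before: Vec<fn()>,    // extra behavior to run just before the stage
    after: Vec<fn()>,     // extra behavior to run just after the stage
}

struct Lifecycle {
    stages: Vec<Stage>,   // a strictly linear sequence
}

impl Lifecycle {
    // Running a target runs every stage up to and including it, in order.
    // There is only one path to any stage, so it is easy to see exactly what
    // a local difference will do.
    fn run(&self, target: &str) {
        for stage in &self.stages {
            for &hook in &stage.before {
                hook();
            }
            println!("running {} with options {:?}", stage.name, stage.options);
            for &hook in &stage.after {
                hook();
            }
            if stage.name == target {
                break;
            }
        }
    }
}
```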

This is what Maven did. Unfortunately, they hid this clarity behind a mess of XML and weird implementations in Java, so it was decidedly nontrivial to implement a set of sequences, and the local differences ended up spread between XML and bits of implementation floating around in Java. That pretty much killed the idea in the Java community, and the systems that came next, like Gradle, all went back to directed, acyclic graphs of targets.

I don't think any other build system has tackled this idea again, which is a shame.

The last example of graph pruning I want to talk about is from Rust, specifically, Rust's answer to deciding when to release a piece of memory that a program was using.

There are three classical answers to this problem.

The first, epitomized by C, is to make it the programmer's problem. You malloc all chunks of memory, and you free them, and if you get it wrong, what will happen is undefined. The program might crash saying that you accessed freed memory. It might just pretend the memory is still there and potentially collide with something else. It might launch nuclear weapons. You cannot know, and it's your fault.

For some reason, this approach has enjoyed widespread popularity.

The second approach, which began in Lisp way back at the beginning, is to make it the computer's problem. As a programmer, I have my data. I have variables in scope that refer to some of them. Some of the data refers to other parts of the data. And every so often, the computer will stop the world, traverse the graph from my variables in scope through all the data, and delete anything that doesn't still have a path to it in that graph. There are other approaches, but this one, called "mark and sweep" garbage collection, is pretty typical.
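To make the two phases concrete, here is a toy mark and sweep collector in Rust over an index-based arena. Everything here is invented for illustration; real collectors are far more sophisticated.

```rust
// A value in our toy heap, with the indices of the values it refers to.
struct Value {
    references: Vec<usize>,
    marked: bool,
    allocated: bool,
}

struct Heap {
    values: Vec<Value>,
}

impl Heap {
    // Mark: traverse the graph from the variables in scope (the roots),
    // marking everything reachable.
    fn mark(&mut self, roots: &[usize]) {
        let mut stack: Vec<usize> = roots.to_vec();
        while let Some(i) = stack.pop() {
            let value = &mut self.values[i];
            if !value.allocated || value.marked {
                continue;
            }
            value.marked = true;
            stack.extend(value.references.iter().copied());
        }
    }

    // Sweep: anything still unmarked has no path from the roots, so free it.
    fn sweep(&mut self) {
        for value in &mut self.values {
            if value.allocated && !value.marked {
                value.allocated = false;
                value.references.clear();
            }
            value.marked = false; // reset for the next collection
        }
    }

    // Stop the world and collect.
    fn collect(&mut self, roots: &[usize]) {
        self.mark(roots);
        self.sweep();
    }
}
```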

This is far better for programmer productivity and happiness, and in most cases for program correctness, but it has two problems. First, you have the overhead of the garbage collection work. In latency sensitive applications, this can make your program unable to meet its requirements.

Second, it's really easy to accidentally keep a reference to a piece of data you no longer need. That bit of data now never gets garbage collected. If that mistake happens in a recurring process, like handling HTTP requests, then you steadily leak memory into data you are still retaining but don't actually want. This is such a common problem that it is typical to restart long-running programs that use garbage collection every day or every week just to clear the leaked memory.

The third approach is a variation of garbage collection. Instead of tracing out the graph of data to find what is still visible, we attach a number to every piece of data. When something references that data, we increment the number. When that reference is removed, we decrement the number. When the number reaches zero, we delete the data. This is how Apple's Cocoa libraries generally work.
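Rust's Rc type happens to implement the same scheme, so it is an easy place to watch the counter move:

```rust
use std::rc::Rc;

fn main() {
    let data = Rc::new(vec![1, 2, 3]);
    println!("count = {}", Rc::strong_count(&data)); // 1

    let another_handle = Rc::clone(&data);           // new reference: count goes up
    println!("count = {}", Rc::strong_count(&data)); // 2

    drop(another_handle);                            // reference removed: count goes down
    println!("count = {}", Rc::strong_count(&data)); // 1
} // the last handle is dropped here, the count hits zero, and the vector is freed
```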

We again pay an overhead for reference counting, since we need to increment, decrement, and check the reference count, but that task is local and doesn't require us to traverse the whole graph of values in memory. On the other hand, we can still easily end up with dangling references and leak memory.

So let's take stock of this. We have a graph with variables in scope and values in memory as nodes, and with references or pointers from one value to another as edges. What happens if we prune this into a tree? Now every value has at most one reference to it. Reference counting becomes trivial. It's also really hard to leave accidental references hanging around, so memory leakage becomes vastly less likely.

We can go further. If our values form a graph, then when we look at some point in our program, we have no way of knowing what references there are to the data in scope at that point. We can make no decisions about whether we can free any of that memory without lots of other information.

If our values form a tree, with at most one reference, then if we have a value in scope, that is, we have a reference to it, we must have the only reference. We can decide if we can free it or not. If we're a little careful we can probably even make that decision at compile time.

Now, how do we actually do that? Only being able to refer to a value in one place seems pretty limiting. Rust's answer was to instead pass that one reference from place to place. So if I call a function with this value, I lose the reference to it, and the variable in the function now holds it. We might want it back, of course, so we can also tell the function to borrow it instead, which means the reference will get returned to its original holder when the function exits.
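Here is a small example of that in Rust. `summarize` takes ownership of the vector, so the caller loses its reference; `peek` only borrows it and hands it back when the call returns. The function names are mine, not anything standard.

```rust
fn summarize(values: Vec<i32>) -> i32 {
    values.iter().sum()
} // `values` is freed here; the compiler knows this was the only reference

fn peek(values: &Vec<i32>) -> usize {
    values.len()
} // the borrow ends here and the caller still owns the vector

fn main() {
    let numbers = vec![1, 2, 3];
    peek(&numbers);     // lend the reference out and get it back
    summarize(numbers); // hand the reference over for good
    // peek(&numbers);  // would not compile: we no longer hold the reference
}
```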

You give up some things when you do this. For example, Rust won't let you make a circular linked list without opening up unsafe escape hatches and forcing the issue. Linked lists are notoriously hard to write in Rust because it turns out that different approaches to linked lists have very different semantics for how references are passed around.

On the other hand, you don't play Russian roulette like you do in C. You don't pay the overhead of garbage collection or automatic reference counting. And you don't need to worry about lingering references leaking memory, so you don't have to restart your program every night.

So, here we have three examples where the same gambit paid off in interesting ways. We started with a problem that formed a graph, figured out how to prune it to a tree, and saw what we would gain by it. In some cases like structured programming, the gain is incredible. In others, like Rust's memory management, it's merely impressive. I would like to say that it's too early to judge the impact of the Maven case, but it's probably really an historical curiosity at this point. I'm sure there are many more examples where there was no payoff and so we don’t know about them.

But if you find yourself with a knotty graph, it's probably worth trying to prune it.