Programmers, very correctly, don't generally think about their programs as dynamical systems. But there are a few mental models from physics that are useful tools for designing software. This episode goes over the idea of a steady state with perturbations applied to it and how to use it as a design tool.
Today I'm going to talk about one of the mental structures I brought into software with me from physics. Programmers don't usually think about their programs as if they were physical systems in motion.
Indeed, programming has spent years figuring out better ways to think about dynamic computations as frozen pieces of logic. We replaced the jumps among instructions of goto with structured programming and froze the dynamic location of a program into a tree. We replaced reading and writing to a flat space of memory with stack frames and structured allocation and deallocation and froze the dynamic nature of what might be behind a variable into a limited set of possibilities defined by the Hoare logic of a program fragment. We imposed type systems instead of deciding each time how to read a piece of memory and froze the question of whether we had done that right into a logical procedure. Even a sequence of statements in a program is given meaning, not by the dynamic response of a machine to instructions, but by Hoare logic.
But there are times when we really do want to think about a system dynamically. If our web server briefly receives a large spike of traffic, what happens? How does it recover afterwards? When we say that it recovers, we must mean that there is some state it recovers to. What is that state and how do we define it?
This situation maps really nicely onto a model from physics: a steady state with applied perturbations.
First, we assume that the system has a steady state. The iconic example of a steady state is high pressure steam flowing through a turbine. Nothing about this system is at rest. The turbine is whirling. The steam is flowing. But the rate of steam going through and the rate at which the turbine spins is steady.
What happens if we speed up the flow of steam and the rate of the turbine spinning? If we suddenly jam a shockwave of higher pressure through it, nothing good will come of it. We may explode the turbine. But if we increase the rate the turbine spins much more slowly, what physicists call a "quasi-static change," we can have it be pretty much in steady state at each moment during that change.
In practice, quasi-static changes can be pretty fast. Jet turbines spin 100 to 400 times a second. If we go from 100 rotations per second to 400 rotations per second over the course of ten seconds, that's an increase of one rotation per second every 30th of a second. In that time the turbine has undergone at least three rotations. This would still count as quasi-static.
Similarly when we have programs that do ongoing work in response to outside requests, we can think of them being in a steady state. Our web server's traffic may increase and decrease steadily over the course of the day, but these are quasi-static changes to the steady state.
Even before we start talking about perturbations, the idea of steady state is already useful for programmers. As we put a constant flow of traffic across the system its properties should remain stable. The memory allocated should remain the same. The size of caches should remain the same. The size of object pools and the number of worker threads should remain the same.
So when we are designing a server, we need to consider policies that will remain in steady state. Let's consider a few aspects as examples.
First, are we leaking memory? If we run a constant amount of traffic through our server, does its heap size steadily increase? Do we have some way to guarantee via our program's organization that this will not happen? Not planning for this leads to the common operational hack of restarting Java servers every night instead of figuring out why they are leaking memory.
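As a concrete check, here's a minimal sketch in Python of that kind of steady-state test: drive two equal windows of constant traffic through a handler and compare the traced heap between the windows. The `handle_request` function and the one-megabyte drift threshold are assumptions for illustration, not a real server.

```python
import tracemalloc

def handle_request(i):
    # Hypothetical request handler. A correct one does not
    # accumulate global state as traffic flows through it.
    return i * 2

def heap_growth_under_constant_traffic(requests=10_000):
    """Run two equal windows of traffic and report heap drift between them."""
    tracemalloc.start()
    for i in range(requests):
        handle_request(i)
    first, _ = tracemalloc.get_traced_memory()
    for i in range(requests):
        handle_request(i)
    second, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return second - first  # near zero if the heap is in steady state

growth = heap_growth_under_constant_traffic()
assert growth < 1_000_000  # less than ~1 MB of drift between windows
```

If the handler leaked, say, one buffer per request, `growth` would scale with the window size instead of staying flat, and this check would fail long before operations resorted to nightly restarts.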
Second, when we have a pool like an object pool or connection pool, do we stop allocating new members of the pool after an initial startup period?
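Here's a minimal sketch of such a pool in Python. The factory and the `max_size` of four are illustrative assumptions, and a real pool would also handle timeouts and health checks. The point is the policy: after `max_size` allocations, `acquire` only ever reuses or waits, it never allocates.

```python
import queue

class ConnectionPool:
    """Fixed-size pool: allocate up to max_size members, then only reuse."""
    def __init__(self, factory, max_size):
        self._factory = factory
        self._max_size = max_size
        self._created = 0
        self._idle = queue.Queue()

    def acquire(self):
        try:
            return self._idle.get_nowait()
        except queue.Empty:
            if self._created < self._max_size:
                self._created += 1
                return self._factory()
            # Steady state: block until a member is returned,
            # never allocate another one.
            return self._idle.get()

    def release(self, member):
        self._idle.put(member)

pool = ConnectionPool(factory=object, max_size=4)
members = [pool.acquire() for _ in range(4)]  # startup period: 4 allocations
for m in members:
    pool.release(m)
again = pool.acquire()
assert pool._created == 4  # no new allocations once warm
```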
Third, if we have a cache or map or other structure we accumulate over time as traffic flows through the system, does it end up in a steady state? For a cache, does it stay at a stable size after an initial startup period? Does the hit rate of the cache become steady? If we are accumulating state over traffic, does that state converge?
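As a sketch of a cache that does reach steady state, here is a capacity-bounded LRU cache in Python. Under steady traffic over a fixed working set, its size converges and its hit rate becomes stable. The capacity, working set size, and compute function are assumptions for illustration.

```python
from collections import OrderedDict

class LRUCache:
    """Capacity-bounded cache: size converges to the working set or capacity."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()
        self.hits = 0
        self.lookups = 0

    def get(self, key, compute):
        self.lookups += 1
        if key in self._data:
            self.hits += 1
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        value = compute(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used
        return value

    @property
    def hit_rate(self):
        return self.hits / self.lookups if self.lookups else 0.0

cache = LRUCache(capacity=100)
# Steady traffic over a working set of 80 keys:
for i in range(10_000):
    cache.get(i % 80, compute=lambda k: k * k)
assert len(cache._data) == 80   # size reached a steady state
assert cache.hit_rate > 0.9     # hit rate settled after the startup misses
```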
Finally, if we have queues between stages in our program, are they all empty almost all of the time in steady state? If a queue is not empty most of the time, then the stage after it is unable to keep up, and the queue's depth will steadily grow until we either run out of memory or reach the maximum size of the queue, at which point it applies back pressure on the previous stages until the whole system slows down. So in steady state, all queues are either empty or completely full. In a healthy steady state, they will all be empty.
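In Python, a bounded `queue.Queue` gives us both halves of that story for free: a maximum size, and back pressure, because `put` blocks when the queue is full. Here's a minimal two-stage sketch; the queue size and item count are illustrative.

```python
import queue
import threading

stage_queue = queue.Queue(maxsize=100)  # bounded queue between two stages

def producer(n):
    for i in range(n):
        # put() blocks when the queue is full -- that blocking
        # is the back pressure on this stage.
        stage_queue.put(i)

def consumer(n, out):
    for _ in range(n):
        out.append(stage_queue.get())

results = []
t1 = threading.Thread(target=producer, args=(1000,))
t2 = threading.Thread(target=consumer, args=(1000, results))
t1.start(); t2.start()
t1.join(); t2.join()

assert len(results) == 1000
assert stage_queue.empty()  # healthy steady state: the queue drains to empty
```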
Queueing theory provides a more formal description of the properties of systems in steady state, but it's often less revealing than just thinking of the situation intuitively as a physical system with dynamics.
So, we can use steady states as a design tool for our programs. But we also need to know how programs respond when they leave steady state. In physics we speak of applying a perturbation to a system, such as that shock wave of higher pressure entering the turbine we mentioned before.
We want to know how our system will respond to the perturbation while it's happening, and how it will relax again afterwards.
Now, with a big enough perturbation, our system may not go back to its steady state. A big enough pressure wave can damage our turbine or break open the steam pipes. Even if the steam flow goes back to what it was, the system isn't going back to the same steady state after that. When someone engineers a turbine, they have a range of perturbations that they plan for. The turbine blades may be engineered to be strong enough to withstand shockwaves up to a certain point, or there may be escape valves which open and shed steam when the pressure passes a certain point.
Coming back to our programs, we want to know what happens during the perturbation and what we must do in order for our server to relax back to the original steady state after the perturbation passes. In analogy with the turbine, we engineer the system to take a certain amount of additional traffic by provisioning hardware that isn't completely utilized during steady state operation, and we put in escape valves such as load shedding: when traffic passes a threshold, the system, instead of processing requests, immediately returns a code telling clients that the server is overloaded and to back off.
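Here's a minimal load-shedding sketch in Python. The in-flight limit of sixty-four and the handler are assumptions for illustration; the point is the shape of the escape valve: over the limit, reject immediately with an "overloaded" code rather than queueing the work.

```python
import threading

MAX_IN_FLIGHT = 64  # assumed threshold, set from provisioned headroom
_in_flight = 0
_lock = threading.Lock()

def handle(request, process):
    """Escape valve: immediately reject requests over the in-flight limit."""
    global _in_flight
    with _lock:
        if _in_flight >= MAX_IN_FLIGHT:
            # Shed load: don't queue the work, tell the client to back off.
            return (503, "overloaded, back off")
        _in_flight += 1
    try:
        return (200, process(request))
    finally:
        with _lock:
            _in_flight -= 1

status, body = handle("ping", lambda r: r.upper())
assert status == 200 and body == "PING"
```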
A perturbation could also be in the size of a request, such as a client trying to post a terabyte of data. Again, for modest increases in size we should have capacity in our system to handle the perturbation. For enormous increases, we need escape valves. We could have our code that reads requests abort when it passes some threshold and send back an error.
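A sketch of that escape valve in Python: read the body in chunks and abort the moment the running total passes the limit, so we never buffer the oversized payload. The 10 MB limit and chunk size are illustrative assumptions.

```python
import io

class RequestTooLarge(Exception):
    pass

MAX_BODY_BYTES = 10 * 1024 * 1024  # assumed limit for illustration

def read_body(stream, limit=MAX_BODY_BYTES, chunk_size=64 * 1024):
    """Read a request body, aborting as soon as it exceeds the limit,
    without ever holding the oversized payload in memory."""
    chunks, total = [], 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return b"".join(chunks)
        total += len(chunk)
        if total > limit:
            raise RequestTooLarge(f"body exceeds {limit} bytes")
        chunks.append(chunk)

small = read_body(io.BytesIO(b"x" * 1000))
assert small == b"x" * 1000
try:
    read_body(io.BytesIO(b"x" * 100), limit=50)
    assert False, "should have raised"
except RequestTooLarge:
    pass  # the escape valve fired, and we'd send back an error code
```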
As a general principle, in order to relax back to the original steady state after a perturbation, every dimension of the traffic must eventually hit a limit where the server starts shedding load and returning errors instead of compromising itself.
We can use the notion of perturbation to design our program's response very systematically. We take the traffic and look at all the ways it can vary. First, the number of requests per second. Then, for each kind of request, in how many places can its contents vary in length? And for each number in the request, what happens if it becomes extremely large?
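That enumeration turns directly into tests. Here's a sketch: a hypothetical handler with one numeric parameter, checked against extreme values of that parameter. The handler name and its valid range are assumptions for illustration.

```python
def handle_page_request(page_size):
    # Hypothetical handler with one numeric parameter from the request.
    # Validate before doing work sized by the parameter.
    if not (1 <= page_size <= 1000):
        raise ValueError("page_size out of range")
    return list(range(page_size))

# Perturbation checklist for a numeric parameter: zero, negative, enormous.
for extreme in (0, -1, 10**18):
    try:
        handle_page_request(extreme)
        assert False, f"{extreme} should have been rejected"
    except ValueError:
        pass  # the handler rejected the perturbation instead of absorbing it

assert len(handle_page_request(10)) == 10  # normal traffic still works
```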
So, to reiterate, we can usefully conceive of our program as a dynamical system that runs in steady state, possibly with quasi-static changes taking it to other steady states, with occasional perturbations. We look at the resources in our program and ensure that they all achieve a steady state instead of leaking or being oversubscribed. Then we look at our workload and list out all the ways it can experience perturbations, whether in amount of traffic or size of traffic or ridiculous size of some numerical parameter. For each of these we design our program and provision our hardware to handle modest perturbations and build escape valves for larger ones.
Our discussion of steady states assumed that our system would relax back to one. Another option is that, after a perturbation, the system will keep oscillating. For example, we might get a spike of traffic that causes the server to shed load and recover, then all the clients that backed off call back at once and overload the server again, and this repeats. Or a perturbation causes a cache to flush most of its usual contents, and afterwards the time spent refilling the cache slows the system enough that its traffic handling shifts and, after some time, causes the cache to flush again.
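One common damper for the first kind of oscillation lives in the clients rather than the server: randomized exponential backoff, which spreads retries out in time so the shed clients don't all return at once and re-overload the server. A minimal sketch, with illustrative base and cap values:

```python
import random

def backoff_delay(attempt, base=0.5, cap=60.0):
    """Randomized ("full jitter") exponential backoff: pick a delay uniformly
    between zero and an exponentially growing, capped ceiling, so retries
    from many clients spread out instead of arriving in synchronized waves."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Each retry attempt waits somewhere in a widening, capped window.
delays = [backoff_delay(a) for a in range(8)]
assert all(0 <= d <= 60.0 for d in delays)
```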
In interlinked services, these kinds of oscillating cascades can slosh back and forth until someone steps in to manually dampen them. Thinking about how to dampen oscillations is the next level of handling perturbations, but it usually cannot be done at the level of a single server. Dampening depends on the system more broadly.
Oscillations are not the only option. Physicists study even more exotic things, like strange attractors. But they are describing systems they find in nature. We don't tend to see these behaviors when describing our programs because, if we did, the first thing we would do is add damping to make them stop behaving so exotically. Complex and intricate exotica make great science but lousy engineering.