Scientists and programmers approach programs very differently, driven by their different needs, and that difference is reflected in the tools each of them requires.
Before I ended up back in software, I spent some years in physics and then in biology. I don't mean that I worked as a programmer in a lab. I was working on my PhD in biology, growing bacteria, working with microscopes, and occasionally wearing a biohazard suit. At some point I may spend a few episodes laying out something about that science, but today I'm going to talk about the key difference between computing for programmers and computing for scientists, and what that means for their tools.
The very short version is that the output of a programmer's work is a program text. The output of a scientist's computational work is an execution of a program and its results.
That's not particularly helpful unless you've spent years as a programmer and as a scientist, so let's go back and try it again a bit more slowly.
Consider a programmer writing a program. They type some statements into a text file, then compile it and run it. Perhaps they write some tests and run those. The tests describe an ideal state that the program is supposed to maintain. Once that ideal state is reached, the program is checked into source control. When other programmers get involved, the tooling is designed so that they can share that text and exercise it in the same way.
The underlying worldview in all of this is that the task is to produce a single program and evidence of its correctness. The programmer may decide that something else is the correct form and alter the program to meet that, but the program itself is the object of interest.
Contrast this to a scientist running a simulation. They type some statements into a text file, then compile and run it. They set up some inputs for it where they know how it should behave. If it doesn't behave as expected, they alter the program until it does. Then they try it on a condition where they don't know how it will behave, and then vary the inputs in additional runs to see if the results are robust or a numerical accident. Then they change the behavior of one part of the simulation and see how that affects the robustness. Over time they change their code and run it on a variety of inputs as a way of running experiments.
The underlying worldview for the scientist is quite different. The program is a means to an end in the present moment. They care about the experiment, the combination of input, program, and results. When they collaborate with other scientists, those scientists want to be able to run a particular combination of input and program and get the same output, and vary the program to explore other possibilities.
So on one hand we have programmers who are focused on the production of programs. On the other we have scientists who are focused on the production of experimental results as part of a research program.
What happens when scientists try to use tools meant for programmers? The usual flow is that they write the program, have some files defining an input for it, run it to get an output file, and then do it again with other inputs and outputs. Very quickly they end up with a giant mess of files, typically with only partial knowledge of which inputs were used to generate which outputs, and no record of which particular version of the program generated a particular output.
It's easy to say, "Oh, they should source control every version," or to propose some other disciplined approach, but in practice such discipline doesn't hold up.
Now, when you look at something like a hospital lab, you need to know what inputs produced what outputs with what program. In this setting, though, the program doesn't vary much. The lab assumes that there is a right way to run the analysis of a particular sample and wants a program that does that. If they learn something that changes what they think the right way is, they'll change the program. But the program here feels much more like the programmer's conception of it.
In these systems, the program gets labelled a "workflow" and the inputs and outputs are handled by a LIMS, which is an acronym for Laboratory Information Management System. That's a fancy name for a database that tracks which files were produced from which other files by which program.
There are a number of systems like this. When I was part of the Swiss Institute of Bioinformatics for a few years after I ceased to work as a wetlab biologist, one of the groups involved was writing such a system called KNIME (spelled K-N-I-M-E).
KNIME got used where there were established workflows, like running standard tools to map reads from DNA sequencing. It was the equivalent of a preprocessor, but for common kinds of data. But no one working as a scientist or directly collaborating with scientists used it. The team kept giving talks and asking what features they needed to add, but they were largely greeted with apathy, and most people kept running their bash scripts and trying to figure out which file came from which.
The key problem was that KNIME still treated programs as fixed things. Hooking a program into KNIME was a couple of hours of work if you knew the system well. Changing an already hooked-in program took less effort, but getting the change deployed in KNIME still took a significant amount of time.
I muddied the waters further by writing another workflow manager. It was about 500 lines of Python that you used as a library in your program. It had a few functions that used a SQLite database and a directory as a LIMS (which I named MiniLIMS). When you started a workflow, it created a temporary directory with read-only links to the input files you specified from the LIMS. You could call another function to capture an output file back to the LIMS. Each run had an execution id, and the LIMS tracked the executions along with their input files and the files captured from their output. Arbitrary text could be logged to an execution as well, so you could annotate what it was doing.
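To make the shape of that design concrete, here is a toy sketch of it. This is not the actual MiniLIMS code; the class, method, and table names are invented for illustration, but the moving parts match the description: a SQLite database plus a directory of files, with executions that link inputs in read-only and capture outputs back.

```python
import os
import shutil
import sqlite3
import tempfile

class ToyLIMS:
    """A toy sketch (not the real MiniLIMS) of the design described
    above: a SQLite database plus a directory of files, tracking
    which executions consumed and produced which files."""

    def __init__(self, path):
        os.makedirs(path, exist_ok=True)
        self.dir = path
        self.db = sqlite3.connect(os.path.join(path, "lims.db"))
        self.db.executescript("""
            CREATE TABLE IF NOT EXISTS files(
                id INTEGER PRIMARY KEY, name TEXT);
            CREATE TABLE IF NOT EXISTS executions(
                id INTEGER PRIMARY KEY, log TEXT DEFAULT '');
            CREATE TABLE IF NOT EXISTS provenance(
                execution_id INTEGER, file_id INTEGER,
                role TEXT); -- 'input' or 'output'
        """)

    def add_file(self, src):
        """Copy a file into the LIMS and return its id."""
        fid = self.db.execute(
            "INSERT INTO files(name) VALUES (?)",
            (os.path.basename(src),)).lastrowid
        shutil.copy(src, os.path.join(self.dir, str(fid)))
        return fid

    def start_execution(self, input_ids):
        """Create an execution: a temporary working directory with
        read-only hard links to the requested input files."""
        eid = self.db.execute(
            "INSERT INTO executions DEFAULT VALUES").lastrowid
        workdir = tempfile.mkdtemp()
        for fid in input_ids:
            name = self.db.execute(
                "SELECT name FROM files WHERE id=?", (fid,)).fetchone()[0]
            dest = os.path.join(workdir, name)
            os.link(os.path.join(self.dir, str(fid)), dest)
            os.chmod(dest, 0o444)  # read-only, so inputs can't be clobbered
            self.db.execute("INSERT INTO provenance VALUES (?,?,?)",
                            (eid, fid, "input"))
        return eid, workdir

    def capture(self, eid, path):
        """Copy an output file back into the LIMS and record it."""
        fid = self.add_file(path)
        self.db.execute("INSERT INTO provenance VALUES (?,?,?)",
                        (eid, fid, "output"))
        return fid

    def log(self, eid, text):
        """Append arbitrary annotation text to an execution."""
        self.db.execute("UPDATE executions SET log = log || ? WHERE id=?",
                        (text + "\n", eid))
```

The important part is that every file in the LIMS directory is reachable through the provenance table, so "which inputs produced this output, and in which run" is a query rather than an archaeology project.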
A little bit of Python magic made it nearly trivial to bind Unix shell programs into something that behaved nicely, but that was an artifact of bioinformatics, where everything is provided as a Unix shell program instead of a library. But that is a rant for another time.
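The magic itself is less mysterious than it sounds. One common version of the trick, sketched here with an invented class name, is to intercept attribute access and turn it into a subprocess call:

```python
import subprocess

class Shell:
    """A sketch of the kind of Python magic described above (the
    name is invented): looking up any attribute returns a function
    that runs the Unix program of the same name and returns its
    standard output as a string."""

    def __getattr__(self, name):
        def run(*args):
            result = subprocess.run(
                [name, *map(str, args)],
                capture_output=True, text=True, check=True)
            return result.stdout
        return run

sh = Shell()
```

With this, `sh.sort("reads.txt")` runs the `sort` program on that file, and from there it is a short step to wrapping the calls so their outputs are captured back into the LIMS.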
People doing science liked this workflow manager quite a bit. There was no more trying to handle a mass of files in a directory, and it did away with a lot of drudgery around piping programs together. Over ten years later, I still haven't thought of or found an improvement on most of its design.
Not all of it, though. There is still one major weakness. The execution captured the name of the program in the MiniLIMS, but not its contents. Scientists using it learned to copy their program before editing it and to update a logged text annotation describing what that particular version was doing. So you would have directories containing analysis1.py, analysis2.py, analysis3.py, and so on. And inevitably people forgot to copy something and lost the code that produced a particular result.
Stated this way, the solution is obvious: executions must capture not only inputs and outputs but also the exact source code used, and that triad of input, program, and output becomes the thing that scientists browse in MiniLIMS.
It's obvious, but now we're really at cross purposes with the tools programmers build for themselves. My fifty lines of Python depend on a Python runtime and whatever libraries I import. If I change the runtime or update a library to a new version, the program potentially doesn't do the same thing anymore. So I need to specify the environment along with my program. Do I put that in the program itself somehow? Do I create a new editor that exposes the environment in a sidebar alongside the program? This is not how programmers envision the tools they write for setting up development environments being used.
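For Python specifically, a crude sketch of recording the environment with each execution might look like this. The function name is invented, and a real system would need far more than this (OS packages, compilers, hardware), but it shows how much of the environment the standard library can already report:

```python
import platform
import sys
from importlib import metadata

def environment_snapshot():
    """Capture the interpreter version, platform, and installed
    library versions so an execution record can pin its
    environment. A sketch with an invented name; a real system
    would also need OS packages, compilers, and more."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with malformed metadata
            packages[name] = dist.version
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": packages,
    }
```

Logged into the execution record alongside the source text, a snapshot like this at least tells you when two runs of "the same" program were not actually running in the same world.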
We can come back to that very short version now: the output of a programmer's work is a program text. The output of a scientist's computational work is an execution of a program and its results. It's rather like the difference between someone who uses explosives to produce precisely shaped holes in the ground and someone who uses them to produce fireworks shows. The tools best suited for one are not at all the tools best suited for the other.