Why readable, documented code is especially important for scientists (and a three-step plan for getting there)
During my most recent teaching engagement I spent some time talking specifically about code readability and documentation. As often happens, presenting these ideas to a roomful of novice programmers helped to crystallize my thoughts on the topic, and made me realize that I'd never written about it – plus I thought that it would be an ideal topic for first post of the new year, since documentation is something that many programmers constantly resolve to do better at!
There's no shortage of articles and book chapters explaining the general importance of documenting your code – if you learned to program from a book or an online tutorial then it will almost certainly have been mentioned. The arguments in favour of documentation are well-rehearsed: it makes it easier for you to work on your own code over a long period of time, it makes it easier for others to contribute fixes and features, it forces you to think about the purpose of each section, etc. In this post, I want to talk about why documentation is particularly important for you – somebody who is using programming for carrying out scientific work. The basis of my argument is that for us as scientists, code serves two important features over and above simply being executed: it acts as both the ultimate supplementary methods, and it's the way in which you express your original ideas.
Code as supplementary methods
It's probably fair to say that most users of most programs don't care about how they actually work, as long as they do what they're supposed to. When you fire up an image editing program to brighten up a dull digital photo or rotate one where the horizon isn't straight, you probably aren't interested in exactly what transformation is being applied to the RGB values, or what trigonometry is being used to straighten the image – you're only interested in the end result.
Scientific software is different: the end-users are often extremely interested in how the program works internally, since understanding that is a part of understanding the results. And the ultimate way to resolve questions or disagreements about what a program is doing is to examine the source code. This is a great advantage we have when working in bioinformatics. For wetlab work, there is only so much information you can give in the pages of a journal about how an experiment was carried out. Using supplementary information can help, but even then you're limited to what the authors thought important enough to write down. For a bioinformatics experiment, however, one can always see exactly where the data came from and what happened to it, providing one has access to the source code. You can read about a piece of bioinformatics software in a journal, listen to a talk on it, discuss it with the authors, but at the end of the day if you still have questions about how it works, you can always go back to the source code.
The vast majority of programmers don't have to worry about their users wanting to read the source, but we do – so we should make readability and documentation a priority to make sure that it's as useful as possible.
Code as a way of expressing original ideas
The vast majority of software projects don't implement any ideas that are particularly original. This isn't a problem, it's just a reflection of the fact that many pieces of software do very similar things to other pieces of software, and do them in similar ways. There are fairly standard ways of writing a blog engine, a stock management program, an image manipulation program etc. We could make an argument, therefore, that for those categories of software it's not super-important that the code is really well documented, since it's unlikely to be doing anything surprising, and a reader can probably work out what's going on in each section by referring to other software that carries out the same task.
Scientific software is different. Yes, we tend to write scripts to carry out tedious everyday tasks like tweaking file formats and correcting sequence headers, but we also use it for implementing entirely new ideas about how to assemble genomes, or how to correct frameshift mutations, or how to pick species for phylogenetic analysis. We're far more likely than other programmers to write code that does something entirely new. As such our programs (at least the ones that do something interesting) are going to be harder to understand than yet another text editor or chat program.
As programmers, we're very lucky in that the language we use to implement our original ideas – code – is also an excellent way to communicate them to other researchers. But the usefulness of that language depends on whether we write it in a readable way and document it well.
Three steps to readable, documented code
Documentation is a skill that is learned over the course of a career, but here's an exercise that I often have my students do. Using a framework like this can make documenting your code less daunting if you've no idea where to start.
Step one: make sure your variable and function names are meaningful
Programmers are fond of talking about self-documenting code – i.e. code that doesn't require external documentation to be understood. A large part of this is using meaningful variable names. Examples of bad variable and function/method names include:
- Single-letter names e.g.
f(with the exception of variable names that follow common conventions such as
yfor co-ordinates or
ifor an index)
- Names that describe the type of data rather than the contents e.g.
- Names that are extremely generic e.g.
- Names that come in multiples e.g.
- Names that are excessively shortened e.g.
- Multiple names that are only distinguished by case or punctuation e.g.
- Names that are misspelled – the computer does not care about spelling but your readers might
Go through your code and look for any instances of the above, and replace them with good names. Good variable names tell us the job of the variable or function. This is also a good opportunity to replace so-called magic numbers – constants that appear in the code with no explanation – with meaningful variable names e.g. 64 might be replaced by
Example: we want to define two variables which hold the DNA sequence for a contig and a frame, then pass them to a method which will carry out protein translation and store the result. Here's how not to do it, even though the code is perfectly valid Python:
a = 2 b = 'ATGCGATTGGA' c = do_stuff(a, b)
This is much better:
frame = 2 contig_dna_seq = 'ATGCGATTGGA' contig_protein_seq = translate(frame, contig_dna_seq)
Step two: write brief comments explaining the reasoning behind particularly important or complex statements
For most programs, it's probably true to say that the complexity lies in a very small proportion of the code. There tends to be a lot of straightforward code concerned with parsing command-line options, opening files, getting user input, etc. The same applies to functions and methods: there are likely many statements that do things like unpacking tuples, iterating over lists, and concatenating strings. These lines of code, if you've followed step one above, are self-documenting – they don't require any additional commentary to understand, so there's no need to write comments for them.
This allows you to concentrate your documentation efforts on the few lines of code that are harder to understand – those whose purpose is not clear, or which are inherently difficult to understand. Here's one such example – this is the start of a function for processing a DNA sequence codon-by-codon (e.g. for producing a protein translation):
for codon_start in range(0, len(dna)-2, 3): codon_stop = codon_start+3 codon = dna[codon_start:codon_stop] ...
The first line is not trivial to understand, so we want to write a comment explaining it. Here's an example of how not to do it:
# iterate over numbers from zero to the length of # the dna sequence minus two in steps of three for codon_start in range(0, len(dna)-2, 3): ...
The reason that this is a bad comment is that it simply restates what the code does – it doesn't tell us why. Reading the comment leaves us no better off in knowing why the last start position is the length of the DNA sequence minus two. This is much better:
# get the start position for each codon # the final codon starts two bases before the end of the sequence # so we don't get an incomplete codon if the length isn't a multiple of three for codon_start in range(0, len(dna)-2, 3): ...
Now we can see from reading the comment that the reason for the -2 is to ensure that we don't end up processing a codon which is only one or two bases long in the event that there are incomplete codons at the end of the DNA sequence.
Go through your code and look for lines whose function isn't obvious just from reading them, and add explanations
Step three: add docstrings to your functions/methods/classes/modules
Functions and methods are the way that we break up our code into discrete, logical units, so it makes sense that we should also document them as discrete, logical units. Everything in this section also applies to methods, classes and modules, but it keep things readable I'll just refer to functions below.
Python has a very straightforward convention for documenting functions: we add a triple-quoted string at the start of the function which holds the documentation e.g.
def get_at_content(dna): """return the AT content of a DNA string. The string must be in upper case. The AT content is returned as a float""" length = len(dna) a_count = dna.count('A') t_count = dna.count('T') at_content = float(a_count + t_count) / length return at_content
This triple-quoted line is called a docstring. The advantage of including function documentation in this way as opposed to in a comment is that, because it uses a standard format, the docstring can be extracted automatically. This allows us to do useful things like automatically generate API documentation from docstrings, or provide interactive help when running the Python interpreter in a shell.
There are various different conventions for writing docstrings. As a rule, useful docstrings need to describe the order and types of the function arguments and the description and type of the return value. It's also helpful to mention any restrictions on the argument (for instance, as above, that the DNA string must be in upper case). The example above is written in a very unstructured way, but because triple-quoted strings can span multiple lines, we could also adopt a more structured approach:
def get_at_content(dna): """return the AT content of a DNA string. Arguments: a string containing a DNA sequence. The string must be in upper case. Returns: the AT content as a float""" ...
If you think it's helpful, you can also give examples of how to use the function in the docstring. Notice that we're not saying anything in the docstring about how the function works. The whole point of encapsulating code into functions is that we can change the implementation without worrying about how it will affect the calling code!
These three steps represent the minimum amount of work that you should do on any code that you plan on keeping around for more than a few weeks, or that you plan on showing to anybody else.