
Why readable, documented code is especially important for scientists (and a three-step plan for getting there)

During my most recent teaching engagement I spent some time talking specifically about code readability and documentation. As often happens, presenting these ideas to a roomful of novice programmers helped to crystallize my thoughts on the topic, and made me realize that I’d never written about it – plus I thought that it would be an ideal topic for the first post of the new year, since documentation is something that many programmers constantly resolve to do better at!

There’s no shortage of articles and book chapters explaining the general importance of documenting your code – if you learned to program from a book or an online tutorial (such as Python for Biologists) then it will almost certainly have been mentioned. The arguments in favour of documentation are well-rehearsed: it makes it easier for you to work on your own code over a long period of time, it makes it easier for others to contribute fixes and features, it forces you to think about the purpose of each section, etc. In this post, I want to talk about why documentation is particularly important for you – somebody who is using programming for carrying out scientific work. The basis of my argument is that for us as scientists, code serves two important functions over and above simply being executed: it acts as the ultimate supplementary methods, and it’s the way in which we express our original ideas.

Code as supplementary methods

It’s probably fair to say that most users of most programs don’t care about how they actually work, as long as they do what they’re supposed to. When you fire up an image editing program to brighten up a dull digital photo or rotate one where the horizon isn’t straight, you probably aren’t interested in exactly what transformation is being applied to the RGB values, or what trigonometry is being used to straighten the image – you’re only interested in the end result.

Scientific software is different: the end-users are often extremely interested in how the program works internally, since understanding that is a part of understanding the results. And the ultimate way to resolve questions or disagreements about what a program is doing is to examine the source code. This is a great advantage we have when working in bioinformatics. For wet-lab work, there is only so much information you can give in the pages of a journal about how an experiment was carried out. Using supplementary information can help, but even then you’re limited to what the authors thought important enough to write down. For a bioinformatics experiment, however, one can always see exactly where the data came from and what happened to it, providing one has access to the source code. You can read about a piece of bioinformatics software in a journal, listen to a talk on it, discuss it with the authors, but at the end of the day if you still have questions about how it works, you can always go back to the source code.

The vast majority of programmers don’t have to worry about their users wanting to read the source, but we do – so we should make readability and documentation a priority to make sure that it’s as useful as possible.

Code as a way of expressing original ideas

The vast majority of software projects don’t implement any ideas that are particularly original. This isn’t a problem, it’s just a reflection of the fact that many pieces of software do very similar things to other pieces of software, and do them in similar ways. There are fairly standard ways of writing a blog engine, a stock management program, an image manipulation program etc. We could make an argument, therefore, that for those categories of software it’s not super-important that the code is really well-documented, since it’s unlikely to be doing anything surprising, and a reader can probably work out what’s going on in each section by referring to other software that carries out the same task.

Scientific software is different. Yes, we tend to write scripts to carry out tedious everyday tasks like tweaking file formats and correcting sequence headers, but we also write code to implement entirely new ideas about how to assemble genomes, or how to correct frameshift mutations, or how to pick species for phylogenetic analysis. We’re far more likely than other programmers to write code that does something entirely new. As such our programs (at least the ones that do something interesting) are going to be harder to understand than yet another text editor or chat program.

As programmers, we’re very lucky in that the language we use to implement our original ideas – code – is also an excellent way to communicate them to other researchers. But the usefulness of that language depends on whether we write it in a readable way and document it well.

Three steps to readable, documented code

Documentation is a skill that is learned over the course of a career, but here’s an exercise that I often have my students do. Using a framework like this can make documenting your code less daunting if you’ve no idea where to start.

Step one: make sure your variable and function names are meaningful

Programmers are fond of talking about self-documenting code – i.e. code that doesn’t require external documentation to be understood. A large part of this is using meaningful variable names. Examples of bad variable and function/method names include:

  • Single-letter names e.g. a, b, f (with the exception of variable names that follow common conventions such as x and y for co-ordinates or i for an index)
  • Names that describe the type of data rather than the contents e.g. my_list, dict
  • Names that are extremely generic e.g. process_file(), do_stuff(), my_data
  • Names that come in multiples e.g. file1, file2
  • Names that are excessively shortened e.g. gen_ref_seq_uc
  • Multiple names that are only distinguished by case or punctuation e.g. input_file and inputfile, DNA_seq and dna_seq
  • Names that are misspelled – the computer does not care about spelling but your readers might

Go through your code and look for any instances of the above, and replace them with good names. Good variable names tell us the job of the variable or function. This is also a good opportunity to replace so-called magic numbers – constants that appear in the code with no explanation – with meaningful variable names e.g. 64 might be replaced by number_of_codons.
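As a minimal sketch of the magic-number replacement described above (the names bases, codon_length and codon_usage are invented for this illustration and are not from any real program), the bare constant can be swapped for a named value whose definition also shows where the number comes from:

```python
# Hypothetical illustration: all names and values are chosen for this sketch.
bases = "ACGT"
codon_length = 3

# Before: a bare 64 gives the reader no clue where the number comes from.
# codon_usage = [0] * 64

# After: the name says what the number means, and the arithmetic shows why.
number_of_codons = len(bases) ** codon_length
codon_usage = [0] * number_of_codons
```

Computing the value from its parts is optional; simply writing number_of_codons = 64 already removes the mystery for the reader.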

Example: we want to define two variables, one holding a reading frame and one holding the DNA sequence for a contig, then pass them to a function which will carry out protein translation and store the result. Here’s how not to do it, even though the code is perfectly valid Python:

a = 2
b = 'ATGCGATTGGA'
c = do_stuff(a, b)

This is much better:

frame = 2
contig_dna_seq = 'ATGCGATTGGA'
contig_protein_seq = translate(frame, contig_dna_seq)

Step two: write brief comments explaining the reasoning behind particularly important or complex statements

For most programs, it’s probably true to say that the complexity lies in a very small proportion of the code. There tends to be a lot of straightforward code concerned with parsing command-line options, opening files, getting user input, etc. The same applies to functions and methods: there are likely many statements that do things like unpacking tuples, iterating over lists, and concatenating strings. These lines of code, if you’ve followed step one above, are self-documenting – they don’t require any additional commentary to understand, so there’s no need to write comments for them.

This allows you to concentrate your documentation efforts on the few lines of code that are harder to understand – those whose purpose is not clear, or which are inherently difficult to understand. Here’s one such example – this is the start of a function for processing a DNA sequence codon-by-codon (e.g. for producing a protein translation):

for codon_start in range(0, len(dna)-2, 3):
    codon_stop = codon_start+3
    codon = dna[codon_start:codon_stop]
    ...

The first line is not trivial to understand, so we want to write a comment explaining it. Here’s an example of how not to do it:

# iterate over numbers from zero to the length of
# the dna sequence minus two in steps of three
for codon_start in range(0, len(dna)-2, 3):
    ...

The reason that this is a bad comment is that it simply restates what the code does – it doesn’t tell us why. Reading the comment leaves us no better off in knowing why the last start position is the length of the DNA sequence minus two. This is much better:

# get the start position for each codon
# the final codon starts two bases before the end of the sequence
# so we don't get an incomplete codon if the length isn't a multiple of three
for codon_start in range(0, len(dna)-2, 3):
    ...

Now we can see from reading the comment that the reason for the -2 is to ensure that we don’t end up processing a codon which is only one or two bases long in the event that there are incomplete codons at the end of the DNA sequence.
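To see the effect of the -2 in practice, here is a small runnable sketch that wraps the loop in a function. The function name split_into_codons is made up for this example; the loop itself is the one discussed above:

```python
def split_into_codons(dna):
    """Return the complete codons in a DNA string, reading from the first base."""
    codons = []
    # get the start position for each codon
    # the final codon starts two bases before the end of the sequence
    # so a trailing partial codon is skipped
    for codon_start in range(0, len(dna) - 2, 3):
        codon_stop = codon_start + 3
        codons.append(dna[codon_start:codon_stop])
    return codons


# eleven bases: three complete codons plus two left-over bases
print(split_into_codons('ATGCGATTGGA'))
# prints ['ATG', 'CGA', 'TTG']
```

The two left-over bases (GA) never appear in the output, which is exactly the behaviour that the good comment explains.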

Go through your code and look for lines whose function isn’t obvious just from reading them, and add explanations.

Step three: add docstrings to your functions/methods/classes/modules

Functions and methods are the way that we break up our code into discrete, logical units, so it makes sense that we should also document them as discrete, logical units. Everything in this section also applies to methods, classes and modules, but to keep things readable I’ll just refer to functions below.

Python has a very straightforward convention for documenting functions: we add a triple-quoted string at the start of the function which holds the documentation e.g.

def get_at_content(dna):
  """return the AT content of a DNA string.
     The string must be in upper case.
     The AT content is returned as a float"""
  length = len(dna)
  a_count = dna.count('A')
  t_count = dna.count('T')
  at_content = float(a_count + t_count) / length
  return at_content

This triple-quoted string is called a docstring. The advantage of including function documentation in this way as opposed to in a comment is that, because it uses a standard format, the docstring can be extracted automatically. This allows us to do useful things like automatically generate API documentation from docstrings, or provide interactive help when running the Python interpreter in a shell (take a look at the chapter on testing and documentation in Advanced Python for Biologists for an in-depth look at how this works).
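As a small illustration of that extraction, here is one way to get at a docstring from inside Python using the standard library’s inspect module. This is a sketch of the basic mechanism rather than an account of how any particular documentation tool works:

```python
import inspect


def get_at_content(dna):
    """return the AT content of a DNA string.
       The string must be in upper case.
       The AT content is returned as a float"""
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    return float(a_count + t_count) / length


# the raw docstring is stored on the function object itself
raw_docstring = get_at_content.__doc__

# inspect.getdoc() returns the same text with the indentation cleaned up
clean_docstring = inspect.getdoc(get_at_content)
print(clean_docstring)
```

In an interactive session, help(get_at_content) displays the same text alongside the function signature.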

There are various conventions for writing docstrings. As a rule, useful docstrings need to describe the order and types of the function arguments and the description and type of the return value. It’s also helpful to mention any restrictions on the arguments (for instance, as above, that the DNA string must be in upper case). The example above is written in a very unstructured way, but because triple-quoted strings can span multiple lines, we could also adopt a more structured approach:

def get_at_content(dna):
  """return the AT content of a DNA string.

     Arguments: a string containing a DNA sequence.
                The string must be in upper case.

     Returns: the AT content as a float"""
  ...

If you think it’s helpful, you can also give examples of how to use the function in the docstring. Notice that we’re not saying anything in the docstring about how the function works. The whole point of encapsulating code into functions is that we can change the implementation without worrying about how it will affect the calling code!
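One possible way to include a usage example is to write it in the style of an interactive session, which Python’s standard doctest module can then check automatically. The sketch below assumes that layout; it is one convention among several, not the only way to do it:

```python
import doctest


def get_at_content(dna):
    """return the AT content of a DNA string.

       Arguments: a string containing a DNA sequence.
                  The string must be in upper case.

       Returns: the AT content as a float

       Example:
       >>> get_at_content('ATGC')
       0.5
    """
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    return float(a_count + t_count) / length


# run the example in the docstring; nothing is printed if the output matches
doctest.run_docstring_examples(get_at_content, {'get_at_content': get_at_content})
```

The example shows a caller what to expect without saying anything about how the result is calculated, so it stays valid if the implementation changes.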

Summary

These three steps represent the minimum amount of work that you should do on any code that you plan on keeping around for more than a few weeks, or that you plan on showing to anybody else. As always, if you have questions or suggestions, leave a comment.


What you have in common with the Wright brothers

Warning: vast historical oversimplification below in pursuit of a point 🙂

Famously, the Wright brothers built and flew the first aircraft capable of sustained, powered flight in 1903. Looking at the famous photos with eyes used to seeing modern aircraft, it looks pretty airworthy:

[Image: photograph of the Wright Flyer]

There were plenty of other people working on heavier-than-air flying machines around that time, many with much more money and far more resources. So what was the key to the Wright brothers’ success? Did they invent a new type of engine? A new type of wing? Not really – their greatest invention was this:

[Image: the Wright brothers’ wind tunnel]

This unprepossessing-looking box is a wind tunnel, which the Wright brothers – realizing that it was far too time-consuming to test wing designs by building them full scale – used to test their aeronautical designs using models. The innovation that prompted their break-through was not an improvement to aircraft, but an improvement in the process for designing aircraft. By using a wind tunnel, they were simply able to make their mistakes faster than anyone else, and to learn from them. Others had to learn by building, and crashing, full-size aircraft.

This is far from an original observation, but I think it has some connection with programming. The story of the Wright brothers illustrates the power of rapid iterative improvement – their approach would probably be called “agile” if it were being used today. The difference between the Wright brothers and their contemporary rivals mirrors one that I often see between the different approaches my students take to writing code.

On the one hand, you have people who favour small, incremental improvements when writing a program or a function, testing each bit of code as soon as possible and uncovering bugs and mistakes early. Students who program in this way end up with programs and functions that resemble the Wright Flyer pictured above: crude and primitive, perhaps, but certainly fit-for-purpose and relatively unlikely to result in broken bones.

On the other hand, you have people who try to write an entire program or function all in one go, never testing any bit of it until the whole thing is written. Students who program in this way end up with programs and functions that resemble other products of early aviation:

[Image: another early flying machine]

As the picture above attests, this is a recipe for pain.
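For a concrete picture of the incremental approach, here is a minimal sketch in Python. The task and the function names (complement, reverse_complement) are invented for this illustration; the point is only that each small piece is checked before the next one is built on top of it:

```python
# Hypothetical example: the task and function names are invented for this sketch.

# Step 1: write the smallest useful piece and check it straight away.
def complement(base):
    """Return the complementary base for a single upper-case DNA base."""
    pairs = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
    return pairs[base]


assert complement('A') == 'T'
assert complement('G') == 'C'


# Step 2: build the next piece on top of code that is already known to work.
def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return ''.join(complement(base) for base in reversed(dna))


assert reverse_complement('ATGC') == 'GCAT'
```

If the second check fails, the problem is almost certainly in the few lines written since the first check passed, which is the programming equivalent of crashing a model rather than a full-size aircraft.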

 


The role of instructors in programming training

I’ve been spending a bit of time recently arranging to run some instructor-led training courses early next year (see my training page if this is something you’re interested in), which has got me thinking about the role of the instructor in teaching/learning programming. This is a pretty important question – essentially, why is an instructor-led course better than self-directed learning? – so I decided to write down my thoughts on the matter.

Roadblocks to learning

I want to address the question in a kind of roundabout way, by first talking about a phenomenon I have tentatively called roadblocks. These are the moments that occur whenever you’re learning any new skill in which there’s something that you don’t quite understand, which prevents you from making further progress. When you’re learning a new skill and you hit a roadblock, it’s not necessarily a bad thing – indeed, the point when you overcome a roadblock by correcting your understanding is probably the instant we would point to if asked exactly when “learning” occurs. The nature of programming, though, means that these roadblocks tend to come thick and fast, especially for beginners, and are particularly hard to get around.

Programming is unintuitive

It’s very easy for those of us who have been programming for a long time to forget just how unintuitive programming is. When you’re first learning to write code, there’s no obvious reason why variables or loops or dictionaries behave the way they do – these things seem profoundly arbitrary. Ironically, it’s only once you dig deeper into programming, and understand something about how these things are implemented, that their behaviour seems to flow predictably from the way they work. Once you know about stack frames, it’s easy to see why scoping works the way it does – but that knowledge usually arrives too late to save you from having to struggle with scoping issues earlier in your education. This unintuitive behaviour means that roadblocks arise more frequently than in other fields of learning.

Programming is relentlessly progressive

The practice of programming revolves around building up small, simple pieces of functionality (like statements) into bigger, more complicated ones (like functions, objects and entire programs). The process of learning to program follows the same pattern – you start by learning the most basic, atomic blocks (variable assignment, simple statements) then use this knowledge to bootstrap your ability to use more complicated things (loops, functions, etc.). The practical upshot of this is that you have to understand each of the simpler building blocks in order to make sense of the more complicated ones. In other words, when a roadblock comes along, you can’t continue to make progress by simply going around it and moving onto a different topic.

By way of analogy, imagine you’ve never cooked a meal in your life, and you decide to learn cooking from scratch using an online tutorial. You happily work your way through the various chapters with titles like Sauces and Pickles and Vegetable Soups until you come to the section on Bread Making. At this point, you run into a roadblock – none of your breads will rise, and you have no idea what you are doing wrong[1]. No matter – you can just skip the bread section for now, and move onto the chapter on Rice and Pasta.

Contrast this with the situation where you’re learning programming from scratch. You make it through Variable Assignment and Printing Strings, and are making good progress until you encounter a mental roadblock in the chapter on Loops. Somehow you just can’t get your head around the way that the loop variables acquire different values in each loop iteration. Just like in the cooking example, you decide to skip the troublesome section and work on something else for a while, so you move onto the next chapter, Processing Files. Unfortunately, the very first example involves using a loop to parse each line of an input file, and you can’t understand the example because you don’t understand loops. You try again, and skip forward to the chapter on Dictionaries, but again, half of the examples in this chapter use loops to either construct or iterate over dictionaries. Your forward progress is stalled until you can go back and really get to grips with the concept of loops.

Because programming is progressive in a way that cooking is not (and this analogy is in no way meant to belittle cooking as a skill!), the roadblocks that beginners encounter are harder to get around.

It’s hard to ask the right questions

This is another aspect of programming that experience tends to render invisible: when you encounter a roadblock in programming and need to ask for help, very often it’s difficult to know how to phrase the question. The lack of understanding that causes the student to need help in the first place also ensures that they’re unlikely to know what question to ask, or the right way to phrase it. I’m certainly not suggesting that this problem is unique to programming – it’s found in pretty much every technical field – but the highly abstract nature of the things that we talk about (“method calls”, “return values”, “function pointers”) makes it particularly tricky. This difficulty in communication explains why, even though the internet is bursting with forums, message boards and mailing lists populated by helpful people who are happy to assist novice programmers, it can take a long time to pin down the root cause of a student’s misunderstanding.

It’s hard to stay motivated

There are two main reasons why students find themselves on my courses – either they want to use programming to solve a problem, or they’ve been told to attend the course by some higher authority (an employer, a PhD supervisor). Overwhelmingly, it’s the ones who have a concrete problem to solve that tend to make better progress, not because of any intrinsic ability, but because the problem provides the motivation for them to persist with learning a difficult skill. Especially in the early stages, learning to program can seem a relatively thankless task, where the only payoff from successfully understanding a tricky new concept is the prospect of moving onto the next, even trickier one.

Of course, later on in the learning process it usually becomes clear how mindblowingly useful programming is (and I purposefully structure my courses to get students to this point sooner rather than later). Nevertheless, one of the biggest problems many students have when learning programming is simply running out of steam and becoming demoralized – a process that is usually triggered by encountering yet another roadblock.

Getting over roadblocks

The point that I am trying to make under the headings above is not simply that programming is hard, but rather that it’s hard in a particular way that is amenable to being solved by the presence of an instructor. The chief role of the instructor, as I see it, is to get students over these roadblocks as quickly and painlessly as possible. To illustrate the value of this, let’s consider a typical-case scenario facing the self-taught programmer…

Imagine that you have set aside a week from your busy schedule to get started with learning to program, something that you’ve been meaning to do for ages. You sit down at your desk on Monday morning with your chosen learn-to-program book in hand, and start working your way through the exercises. Given that you’re fairly computer-savvy – you know how to use a text editor and a command line – you should have no problem getting a good grounding in the basics by the end of the week.

Just before lunchtime, you run into a roadblock – some concept or example that you can’t seem to figure out. You’ve obviously misunderstood something, because even when you look at the example solution to the exercise, it doesn’t make sense. You decide to take a break for lunch. When you get back from lunch you still don’t understand the example, and resolve to read the chapter again from the start to see if you’ve missed something important. When this doesn’t help, you decide it’s time to ask for help. You post a question on the mailing list for your language of choice, explaining your problem. Then you alt-tab over to Facebook and kill some time while you wait for a reply…

Sometime near the end of the day you get a response. Excellent! It’s from an experienced programmer who’s prepared to help you work through the problem. They’ve emailed you back with a couple of clarifying questions that will help to figure out exactly which bit of the code you don’t understand. Unfortunately, they live on the other side of the world, so you will only be able to communicate during the brief period when you’re both awake. By the time you go home, you’re feeling a bit demotivated; you had planned on getting through at least two chapters per day but you’re still stuck on the first one, and you have a feeling that there’ll be many more roadblock moments to come.

Now, admittedly this is a bit of a gloomy picture – things will not always be this bad! There may be experienced colleagues you can talk to, or you may be able to get over your roadblock with the help of a second tutorial that explains the concept in a slightly different way, etc. But the overall pattern will, I think, be familiar to anyone (myself included) who describes themselves as a self-taught programmer.

Contrast this with what might happen in an instructor-led course. Just as before, you encounter a roadblock in one of the exercises and you can’t understand why your code isn’t working. After puzzling over it for a couple of minutes, you stick your hand up and the instructor comes over. Because the instructor has taught this material many times in the past, they’ve probably seen this exact problem before, and rather than just fixing the code for you, they can use this experience to quickly figure out the cause of your confusion by asking a couple of questions. They can then write a couple of lines of code illustrating the problem that you’re having and explaining how to solve it, while simultaneously clearing up the original source of your confusion, while you watch and ask questions in real-time.

In this way, five minutes after you encountered the roadblock you’ve already overcome it and can move on to the next section and keep making progress. Rather than feeling demotivated that you’ve wasted a bunch of time, instead you feel like you understand the material better for having struggled with it, and are increasingly confident that this programming business might actually turn out to be quite tractable.

I have exaggerated the two scenarios above to make the point, but the central idea remains: the main job of the instructor is to ensure that when a student encounters a roadblock, they overcome it rapidly and don’t simply give up[2].

Having read this far, if you think that instructor-led training could be useful to your organization, get in touch.


  1. Let’s imagine, for the purposes of this analogy, that you are using an expired batch of yeast. 

  2. Of course, there’s lots of other stuff that instructors do: they choose which content to teach and in what order, create learning material, tailor examples to the audience, etc.
