Why have a programming book for biologists?
If you’re reading this course, then you probably don’t need to be convinced that programming is becoming an increasingly essential part of the tool kit for biologists of all types. You might, however, need to be convinced that a course like this one, developed especially for biologists, can do a better job of teaching you to program than a general-purpose introductory programming course. Here are a few of the reasons why I think that is the case.
A biology-specific programming book allows us to use examples and exercises that use biological problems. This serves two important purposes: firstly, it provides motivation and demonstrates the types of problems that programming can help to solve. Experience has shown that beginners make much better progress when they are motivated by the thought of how the programs they write will make their life easier! Secondly, by using biological examples, the code and exercises throughout the book can form a library of useful code snippets, which we can refer back to when we want to solve real-life problems. In biology, as in all fields of programming, the same problems tend to recur time and time again, so it’s very useful to have this collection of examples to act as a reference – something that’s not possible with a general-purpose programming book.
A biology-specific programming book can also concentrate on the features of the language that are most useful to biologists. A language like Python has many features and in the course of learning it we inevitably have to concentrate on some and miss others out. The set of features which are important to us in biology are slightly different to those which are most useful for general-purpose programming – for example, we are much more interested in manipulating text (including things like DNA and protein sequences) than the average programmer. Also, there are several features of Python that would not normally be discussed in an introductory programming book, but which are very useful to biologists (for example, regular expressions and subprocesses). Having a biology-specific textbook allows us to include these features, along with explanations of why they are particularly useful to us.
A related point is that a textbook written just for biologists allows us to introduce features in a way that allows us to start writing useful programs right away. We can do this by taking into account the sorts of problems that repeatedly crop up in biology, and prioritising the features that are best at solving them. This book has been designed so that you should be able to start writing small but useful programs using only the tools in the first couple of chapters.
Let me start this section with the following statement: programming languages are overrated. What I mean by that is that people who are new to programming tend to worry far too much about what language to learn. The choice of programming language does matter, of course, but it matters far less than people think it does. To put it another way, choosing the “wrong” programming language is very unlikely to mean the difference between failure and success when learning. Other factors (motivation, having time to devote to learning, helpful colleagues) are far more important, yet receive less attention.
The reason that people place so much weight on the “what language should I learn?” question is that it’s a big, obvious question, and it’s not difficult to find people who will give you strong opinions on the subject. It’s also the first big question that beginners have to answer once they’ve decided to learn programming, so it assumes a great deal of importance in their minds.
Secondly, learning a first programming language gets you 90% of the way towards learning a second, third, and fourth one. Learning to think like a programmer in the way that you break down complex tasks into simple ones is a skill that cuts across all languages – so if you spend a few months learning Python and then discover that you really need to write in C, your time won’t have been wasted as you’ll be able to pick it up much quicker.
Thirdly, the kinds of problems that we want to solve in biology are generally amenable to being solved in any language, even though different programming languages are good at different things. In other words, as a beginner, your choice of language is vanishingly unlikely to prevent you from solving the problems that you need to solve.
Having said all that, when learning to program we do need to pick a language to work in, so we might as well pick one that’s going to make the job easier. Python is such a language for a number of reasons:
- It has a mostly-consistent syntax, so you can generally learn one way of doing things and then apply it in multiple places
- It has a sensible set of built-in libraries for doing lots of common tasks
- It is designed in such a way that there’s an obvious way of doing most things
- It’s one of the most widely-used languages in the world, and there’s a lot of advice, documentation and tutorials available on the web
- It’s designed in a way that lets you start to write useful programs as soon as possible
- Its use of indentation, while annoying to people who aren’t used to it, is great for beginners as it enforces a certain amount of readability
Python also has a couple of points to recommend it to biologists and scientists specifically:
- It’s widely used in the scientific community
- It has a couple of very well-designed libraries for doing complex scientific computing (although we won’t encounter them in this book)
- It lends itself well to being integrated with other, existing tools
- It has features which make it easy to manipulate strings of characters (for example, strings of DNA bases and protein amino acid residues, which we as biologists are particularly fond of)
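To give a flavour of the last point, here is a minimal sketch of the kind of string manipulation we have in mind. The DNA sequence and the variable names are made up purely for illustration; the tools used (`len`, `count` and `replace`) are all part of standard Python.

```python
# A made-up DNA sequence, used here only for illustration
dna = "ATGCGATACGCTTGA"

# How long is the sequence, and how many A bases does it contain?
length = len(dna)
a_count = dna.count("A")

# GC content: the fraction of bases that are either G or C
# (float() keeps the division exact in both Python 2 and Python 3)
gc_content = (dna.count("G") + dna.count("C")) / float(length)

# Transcribe to RNA by replacing every T with a U
rna = dna.replace("T", "U")

print(length)
print(a_count)
print(gc_content)
print(rna)
```

Don’t worry if the details are unclear at this stage; the point is simply that each of these tasks takes a single line.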
Python vs. Perl
For biologists, the question “what language should I learn” often really comes down to the question “should I learn Perl or Python?”, so let’s answer it head on. Perl and Python are both perfectly good languages for solving a wide variety of biological problems. However, after extensive experience teaching both Perl and Python to biologists, I’ve come to the conclusion that Python is an easier language to learn by virtue of being more consistent and more readable.
An important thing to understand about Perl and Python is that they are incredibly similar (despite the fact that they look very different), so the point above about learning a second language applies doubly. Many Python and Perl features have a one-to-one correspondence, and so learning Perl after learning Python will be relatively easy – much easier than, for example, moving to Java or C.
How to use this course
Programming books generally fall into two categories: reference-type books, which are designed for looking up specific bits of information, and tutorial-type books, which are designed to be read cover-to-cover. This book is an example of the latter – code samples in later chapters often use material from previous ones, so you need to make sure you read the chapters in order. Exercises or examples from one chapter are sometimes used to illustrate the need for features that are introduced in the next.
There are a number of fundamental programming concepts that are relevant to material in multiple different chapters. In this book, rather than introduce these concepts all in one go, I’ve tried to explain them as they become necessary. This results in a tendency for earlier chapters to be longer than later ones, as they involve the introduction of more new concepts.
A certain amount of jargon is necessary if we want to talk about programs and programming concepts. I’ve tried to define each new technical term at the point where it’s introduced, and then use it thereafter with occasional reminders of the meaning.
Chapters tend to follow a predictable structure. They generally start with a few paragraphs outlining the motivation behind the features that they will cover – why do they exist, what problems do they allow us to solve, and why are they useful in biology specifically? These are followed by the main body of the chapter in which we discuss the relevant features and how to use them. The length of the chapters varies quite a lot – sometimes we want to cover a topic briefly, other times we need more depth. Each chapter ends with a brief recap outlining what we have learned, followed by exercises and solutions (more on that topic below).
I’ve deliberately limited the scope of this course to introductory material, in order to keep the size manageable. As a result, there are lots of useful techniques and tools that I’ve had to leave out. The good stuff that I couldn’t fit into this book forms the basis of my second book, Advanced Python for Biologists.
There are several tools and techniques that are discussed only briefly in this course, but in much more depth in Advanced Python for Biologists. When we’re talking about these, I have mentioned the relevant chapters in the text. Hopefully this should allow you to easily find the corresponding bit in the advanced book when you want to read about a particular topic in more depth.
A couple of notes on typography: bold type is used to emphasize important points and italics for technical terms and file names. Where code is mixed in with normal text it’s written in a mono-spaced font like this. Occasionally there are footnotes[1] to provide additional information that is interesting to know but not crucial to understanding, or to give links to web pages.
Example code is highlighted thus:
Some example code goes here
and example output (i.e. what we see on the screen when we run the code) is highlighted like this:
Some output goes here
Often we want to look at the code and the output it produces together. In these situations, you’ll see a block of Python code immediately followed by a block of output. Both blocks of Python code and blocks of output have line numbers so that we can refer in the text to a particular line. Other blocks of text (usually file contents or typed command lines) don’t have any kind of border and look like this:
contents of a file
Exercises and solutions
The final part of each chapter is a set of exercises and solutions. The number and complexity of exercises differ greatly between chapters depending on the nature of the material. As a rule, early chapters have a large number of simple exercises, while later chapters have a small number of more complex ones. Many of the exercise problems are written in a deliberately vague manner, and the exact details of how the solutions work are up to you (very much like real-life programming!). You can always look at the solutions to see one possible way of tackling the problem, but there are often multiple valid approaches.
I strongly recommend that you try tackling the exercises yourself before reading the solutions; there really is no substitute for practical experience when learning to program. I also encourage you to adopt an attitude of curious experimentation when working on the exercises – if you find yourself wondering if a particular variation on a problem is solvable, or if you recognize a closely-related problem from your own work, try solving it! Continuous experimentation is a key part of developing as a programmer, and the quickest way to find out what a particular function or feature will do is to try it.
The example solutions to exercises are written in a different way to most programming textbooks: rather than simply present the finished solution, I have outlined the thought processes involved in solving the exercises and shown how the solution is built up step-by-step. Hopefully this approach will give you an insight into the problem-solving mindset that programming requires. It’s probably a good idea to read through the solutions even if you successfully solve the exercise problems yourself, as they sometimes suggest an approach that is not immediately obvious.
Getting in touch
One of the most convincing arguments for presenting a course like this one in the form of a web page is that it can be continually updated and tweaked based on reader feedback. So, if you find anything that is hard to understand, or you think may contain an error, please get in touch – just drop me an email at firstname.lastname@example.org and I promise to get back to you.
Setting up your environment
All that you need in order to follow the examples and exercises in this book is a standard Python installation and a text editor. All the code in this book will run on either Linux, Mac or Windows machines. The slight differences between operating systems are explained in the text (mostly in chapter 9). If you have a choice of operating systems on which to learn Python, I recommend Linux, Mac OSX and Windows in that order, simply because the UNIX-based operating systems (Linux and OSX) are more amenable to programming in general.
The process of installing Python depends on the type of computer you’re running on. If you’re running a mainstream Linux distribution like Ubuntu, Python is probably already installed. To find out, open a terminal and type
If you see some output along these lines:
Python 2.7.3 (default, Apr 10 2013, 05:13:16)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
Then you are ready to go. If your Linux installation doesn’t already have Python installed, try installing it with your package manager (the command will probably be either sudo apt-get install python or sudo yum install python). If this doesn’t work, then download the package from the Python download page.
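Once Python is installed, you can also ask it which version it is from inside a program. The following is a minimal sketch using the standard `sys` module; the variable names `major` and `minor` are just illustrative choices.

```python
import sys

# sys.version_info describes the interpreter that is running this code;
# its first two items are the major and minor version numbers
major = sys.version_info[0]
minor = sys.version_info[1]

print("Running Python " + str(major) + "." + str(minor))
```

The exact numbers you see will depend on your own installation.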
Running Python programs
A Python program is just a normal text file that contains Python code. To run it we must first open up a command line. On Linux and Mac computers, the application to do this will be called something along the lines of “terminal”. On Windows, it is known as “command prompt”.
To run a Python program, we just type the path to the Python executable followed by the name of the file that contains the code we want to run[2]. On a Linux or Mac machine, the path will be something like:
On Windows, it will be something like:
To run a Python program, it’s generally easiest to be in the same folder as it. By convention, Python programs are given the extension .py, so to run a program called test.py, we just type:
There are a couple of tricks that can be useful when experimenting with programs. (Don’t worry if these two options make no sense to you right now – they will do so later on in the book, once you’ve learned what statements and variables actually are.) Firstly, you can run Python in an interactive (or “shell”) mode by running it without the name of a program file. This allows you to type individual statements and see the result straight away.
Secondly, you can run Python with the -i option, which will cause it to run your program and then enter interactive mode. This can be handy if you want to examine the state of variables after your code has run.
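As a small illustration, suppose that test.py contains the following lines. The variable names and the sequence are invented for this sketch.

```python
# Possible contents of test.py (names and values are made up)
dna = "ATGC"
base_count = len(dna)

# This line runs as normal, whether or not we use the -i option
print(base_count)
```

If we run this program with the -i option, it prints its output and then leaves us at the interactive prompt, where the variables dna and base_count are still available to inspect.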
Python 2 vs. Python 3
As will quickly become clear if you spend any amount of time on the official Python website, there are two versions of Python currently available. The Python world is, at the time of writing, in the middle of a transition from version 2 to version 3. A discussion of the pros and cons of each version is well beyond the scope of this book[3], but here’s what you need to know: install Python 3 if possible, but if you end up with Python 2, don’t worry – all the code examples in the book will work with both versions.
If you’re going to use Python 2, there is just one thing that you have to do in order to make some of the code examples work: include this line at the start of all your programs:
from __future__ import division
We won’t go into the explanation behind this line, except to say that it’s necessary in order to correct a small quirk with the way that Python 2 handles division of numbers.
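For the curious, here is a minimal sketch of the effect. With the line in place, dividing one whole number by another gives the same answer in both versions of Python; the variable names are only for illustration.

```python
from __future__ import division

# With the line above, / always gives the exact answer
half = 7 / 2

# The // operator discards the remainder and gives a whole number
whole = 7 // 2

print(half)
print(whole)
```

In Python 3 the first line has no effect, so it’s harmless to leave it in.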
Depending on what version you use, you might see slight differences between the output in this book and the output you get when you run the code on your computer. I’ve tried to note these differences in the text where possible.
Since a Python program is just a text file, you can create and edit it with any text editor of your choice. Note that by a text editor I don’t mean a word processor – do not try to edit Python programs with Microsoft Word, LibreOffice Writer, or similar tools, as they tend to insert special formatting marks that Python cannot read. When choosing a text editor, there is one feature that is essential to have (OK, so it’s not strictly essential, but you will find life much easier if you have it), and one which is nice to have. The essential feature is something that’s usually called tab emulation. The effect of this feature at first seems quite odd; when enabled, it replaces any tab characters that you type with an equivalent number of space characters (usually set to four). The reason why this is useful is discussed at length in chapter 4, but here’s a brief explanation: Python is very fussy about your use of tabs and spaces, and unless you are very disciplined when typing, it’s easy to end up with a mixture of tabs and spaces in your programs. This causes very infuriating problems, because they look the same to you, but not to Python! Tab emulation fixes the problem by making it effectively impossible for you to type a tab character. The feature that is nice to have is syntax highlighting. This will apply different colours to different parts of your Python code, and can help you spot errors more easily.
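To see why tabs and spaces cause confusion, consider this minimal sketch. It stores two lines of text that would look identical in many editors, one indented with a tab and one with four spaces; the variable names are invented for the example. The `expandtabs` method is a standard string tool that does roughly what tab emulation does for us as we type.

```python
# Two lines that can look the same on screen: one starts with a tab
# character (written \t), the other with four spaces
tab_line = "\tprint(dna)"
space_line = "    print(dna)"

# As far as Python is concerned, they are different pieces of text
print(tab_line == space_line)

# Replacing the tab with spaces (one tab stop = four columns)
# makes the two lines identical
converted = tab_line.expandtabs(4)
print(converted == space_line)
```

The first comparison is false and the second is true, which is exactly the kind of invisible difference that tab emulation protects us from.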
On the web and elsewhere you may see references to Python IDEs. IDE stands for Integrated Development Environment, and they typically combine a text editor with a collection of other useful programming tools. While they can speed up development for experienced programmers, they’re not a good idea for beginners as they complicate things, so I don’t recommend you use them.
Reading the documentation
Part of the teaching philosophy that I’ve used in writing this book is that it’s better to introduce a few useful features and functions rather than overwhelm you with a comprehensive list. The best place to go when you do want a complete list of the options available in Python is the official documentation which, compared to many languages, is very readable.
Ready to get started? Move on to the next section – printing and manipulating text.
[1] like this one
[2] When we refer to “a Python program” in this book, we are usually talking about the text file that holds the code.
[3] You might encounter writing online that makes the 2 to 3 changeover seem like a big deal, and it is – but only for existing, large projects. When writing code from scratch, as you’ll be doing when learning, you’re unlikely to run into any problems.