Scientific Python antipatterns advent calendar day one
So there is a project that I’ve been meaning to do for ages, which is to compile a list of types of problematic code that I often see in scientific software.
Because of my consulting/training work, I get to see a lot of code written by folks who are primarily researchers, not programmers. The same patterns show up again and again, so I thought it would be helpful to compile a bunch of them in one place.
I’ll post one tiny example per day, with the intention that each should take only a couple of minutes to read.
If you want to read them all but can’t be bothered checking this website each day, sign up for the mailing list and I’ll send a single email at the end with links to them all.
Enough housekeeping; the first antipattern is….
Not using sets enough
A very common problem in data-driven programming is finding the unique items in some collection. Let’s give ourselves some short lowercase strings to work with:
import random, string

# generate 1000 random strings of 2 characters
strings = [''.join(random.choices(string.ascii_lowercase, k=2)) for _ in range(1_000)]

# show the first 10
strings[:10]
There are probably some duplicates in there. Here’s the antipattern:
unique_list = []
for s in strings:
    if s not in unique_list:
        unique_list.append(s)
The code works, and we certainly get the list of unique strings we wanted:
len(unique_list)
534
but there are two reasons why this is an antipattern. Firstly, performance: let’s measure how long it takes to find the unique strings in a much bigger list of 10,000 random strings. This time we will make the random strings 5 characters long, to reflect the common real-world case where duplicates are relatively rare.
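Generating the bigger input might look something like this (big_strings is a name I’m introducing here, so that the original strings list sticks around for later comparisons):

# generate 10,000 random strings of 5 characters
big_strings = [''.join(random.choices(string.ascii_lowercase, k=5)) for _ in range(10_000)]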
%%timeit
unique_list = []
for s in big_strings:
    if s not in unique_list:
        unique_list.append(s)
303 ms ± 6.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This second input list is ten times bigger than the first, so we might expect the loop to take ten times longer than it does on the original list. In fact, it takes around 100 times as long. This is easily explained: with the bigger input the loop for s in big_strings executes ten times more often, and each s not in unique_list check has roughly ten times as many items to scan, which multiplies out to a 100 times greater execution time.
This behaviour, where the execution time grows with the square of the input size (quadratic, or O(n²), scaling), is something we’d like to avoid, as it often results in code that works fine on small test datasets but is intolerably slow on real ones.
The fix is to use Python’s built-in set type, which is designed explicitly for this purpose.
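Timed on the original 1,000-string list, the fix is a one-liner (a minimal sketch; unique_set is just a name I’ve picked):

%%timeit
unique_set = set(strings)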
9.45 μs ± 188 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
This is much faster in absolute terms than our original code, and it also has much better scaling behaviour, as we can see if we repeat the test with the large input list:
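Assuming the big_strings list from earlier:

%%timeit
unique_set = set(big_strings)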
257 μs ± 4.38 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Rather than the 100-fold increase in execution time that we saw with our original code, this is much closer to a linear increase with the size of the input. Switching to a set also solves the second problem, readability: the single call to the set function replaces the whole loop, and makes it immediately obvious to the reader what that line’s job is.
As a bonus, set objects have all sorts of useful methods for common problems like the following (there’s a quick example just after the list):
finding things that are in both sets (intersection)
finding things that are in one set but not another (difference)
combining two sets to make a larger set (union)
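For example, with two small made-up sets:

a = {'cat', 'dog', 'fish'}
b = {'dog', 'fish', 'bird'}

a & b  # intersection: {'dog', 'fish'}
a - b  # difference: {'cat'}
a | b  # union: {'bird', 'cat', 'dog', 'fish'}

The operators have named equivalents (a.intersection(b) and so on) if you prefer to spell things out.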
Note: a particular habit of people who learned to program in Perl is to use a dictionary to solve this problem:
unique_dict = {}
for s in big_strings:
    if s not in unique_dict:
        unique_dict[s] = 1

len(unique_dict)
9995
This works, and has better performance than the list-based solution, but is totally unnecessary in Python - just use a set!
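For comparison, the set version of that whole loop is a single expression (using the same big_strings input):

len(set(big_strings))  # 9995, the same answer with no loop in sight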
One more time: if you want to see the rest of these little write-ups, sign up for the mailing list.