Scientific Python antipatterns advent calendar day twenty one

For today, another of those patterns that starts off as a good idea, then becomes a hazard with real data. As a reminder, I’ll post one tiny example per day with the intention that they should only take a couple of minutes to read.

If you want to read them all but can’t be bothered checking this website each day, sign up for the mailing list:

and I’ll send a single email at the end with links to them all.

`print` vs `logging`

When we’re exploring data in a notebook, print is great for understanding what our code is doing - we have used printed output in many of the code examples in this series. It’s easy enough to illustrate how we might use it: imagine that we have a dictionary storing sample counts, and we want to flag up any that are zero:

counts = {
    'apple' : 32,
    'banana' : 0,
    'strawberry': 12,
    'mango' : 0,
    # etc. etc.
}

for fruit, count in counts.items():
    if count == 0:
        print(f'Zero count for {fruit}')
    # continue processing data

Zero count for banana
Zero count for mango

Note that this is not quite the same as our error checking example from day 10. In this scenario we are imagining that zero counts are not necessarily invalid - just that we would like to record them somewhere so that we can check them.

We can see from the output above that the logic works perfectly, and identifies the zero counts in a way that’s very easy to read. But there are many situations where this approach will be less useful:

when we run this code on a real dataset, that may have millions of data points, even a small proportion of zeros will generate far too much output for a human to deal with
when we add more printed output to the code, the zero-warning messages will get mixed in and be hard to identify
if we run this code and then later in our analysis want to check if some samples had zero counts, it will be too late - the printed output will have scrolled off the top of the screen
if this code ever gets executed as part of a pipeline, or script, or on a computer cluster, where there isn’t a human looking at the screen, no-one will see the zero warning messages.

All of these problems can be quite easily fixed by switching to logging, i.e. writing the messages to a file rather than to the screen. A very minimal setup:

with open('zero_counts.txt', 'w') as zeros:
    for fruit, count in counts.items():
        if count == 0:
            zeros.write(fruit + '\n')
        # continue processing data

is already much better. Now we will always have a record of which samples had zero counts that we can look at or analyse any time we want. For a real program, of course, we would probably want to generate a filename dynamically to avoid overwriting.
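One way to generate such a filename is to put a timestamp in it. The following is only a sketch: the helper name zero_counts_path, its parameters and the filename pattern are all made up for illustration, not part of any library.

```python
from datetime import datetime
from pathlib import Path


def zero_counts_path(directory='.', now=None):
    """Build a timestamped path like zero_counts_20240102_030405.txt.

    Hypothetical helper: the name and filename pattern are illustrative.
    """
    # `now` is a parameter only so the function is easy to test
    now = now or datetime.now()
    return Path(directory) / f'zero_counts_{now:%Y%m%d_%H%M%S}.txt'


counts = {'apple': 32, 'banana': 0, 'strawberry': 12, 'mango': 0}

# each run writes to a new file named after the time it started
log_path = zero_counts_path()
with open(log_path, 'w') as zeros:
    for fruit, count in counts.items():
        if count == 0:
            zeros.write(fruit + '\n')
        # continue processing data
```

Two runs started within the same second would still share a filename, so this reduces the risk of overwriting rather than removing it.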

In situations like this it can still be useful to have some printed output as a summary:

zero_counts = 0
with open('zero_counts.txt', 'w') as zeros:
    for fruit, count in counts.items():
        if count == 0:
            zeros.write(fruit + '\n')
            zero_counts += 1
        # continue processing data

print(f'Processed {len(counts)} items, found {zero_counts} zeros')

Processed 4 items, found 2 zeros

There are a few problems that still exist with this code. It will sometimes be useful to

have more context about the warnings, like timestamps
have different levels of logging, to distinguish between info, warning, errors, etc.
be able to turn on logging at different levels

If any of those are necessary, then it’s probably a good idea to switch to using the logging module from the standard library. That would be too big an example for this post though, so we will save that for another time!
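A full treatment can wait, but just to give a flavour, here is a very small sketch of the same loop using the logging module. The logger name fruit_counts, the filename zero_counts.log and the message format are arbitrary choices made for this illustration.

```python
import logging

# a named logger; 'fruit_counts' is an arbitrary, illustrative name
logger = logging.getLogger('fruit_counts')
# INFO and above are recorded; use logging.WARNING to drop the summary line
logger.setLevel(logging.INFO)

# send records to a file, each with a timestamp and a level
handler = logging.FileHandler('zero_counts.log', mode='w')
handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
logger.addHandler(handler)

counts = {'apple': 32, 'banana': 0, 'strawberry': 12, 'mango': 0}

zero_counts = 0
for fruit, count in counts.items():
    if count == 0:
        logger.warning('Zero count for %s', fruit)
        zero_counts += 1
    # continue processing data

logger.info('Processed %d items, found %d zeros', len(counts), zero_counts)
```

Each line in zero_counts.log then starts with the time and the level (WARNING or INFO), which covers the three points above in a basic way.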

One more time: if you want to see the rest of these little write-ups, sign up for the mailing list:
