Scientific Python antipatterns advent calendar day fifteen

For today, an antipattern that often only becomes obvious once your data gets big: reading an entire file into memory. As a reminder, I’ll post one tiny example per day with the intention that they should only take a couple of minutes to read.

If you want to read them all but can’t be bothered checking this website each day, sign up for the mailing list:

Sign up for the mailing list

and I’ll send a single email at the end with links to them all.

Reading whole files into memory

Imagine we have a comma separated values (CSV) file containing some fruit data:

!head fruits.csv
name,colour,size
Strawberry,red,small
Blueberry,blue,small
Raspberry,red,small
Blackberry,black,small
Cranberry,red,small
Gooseberry,green,small
Redcurrant,red,small
Blackcurrant,black,small
White currant,white,small

Let’s try to do some simple data processing. What are the names of all the pink fruits?

lines = open('fruits.csv').readlines()

for line in lines:
    name, colour, size = line.split(',')
    if colour == 'pink':
        print(name)
Grapefruit (Pink)
Apple (Pink Lady)
Dragon fruit (Pink skin)
Pitaya (Red flesh)
Pitaya (White flesh)
Rose apple
Peach (Donut)

This works, but readlines reads the whole file into memory as a single list of lines:

len(lines)
345

For small test datasets, this will not be a problem. For real data, it may be. In this case, the code is trivially fixed by just getting rid of the readlines - we don’t even need to change the loop:

lines = open('fruits.csv')

for line in lines:
    name, colour, size = line.split(',')
    if colour == 'pink':
        print(name)
Grapefruit (Pink)
Apple (Pink Lady)
Dragon fruit (Pink skin)
Pitaya (Red flesh)
Pitaya (White flesh)
Rose apple
Peach (Donut)

as Python will happily iterate over the lines in a file, reading them one at a time rather than all at once.
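To get a rough feel for the difference in memory use, here is a sketch using the standard library's tracemalloc module. The file big_fruits.csv and the three helper functions are made up for illustration, and the exact numbers will vary between machines and Python versions:

```python
import os
import tempfile
import tracemalloc

# Build a throwaway CSV file (hypothetical data, just for the measurement).
path = os.path.join(tempfile.mkdtemp(), 'big_fruits.csv')
with open(path, 'w') as output:
    output.write('name,colour,size\n')
    for i in range(100_000):
        output.write(f'Fruit {i},red,small\n')


def count_with_readlines(path):
    # Holds every line of the file in one list before the loop starts.
    with open(path) as f:
        lines = f.readlines()
    return sum(1 for line in lines if line.split(',')[1] == 'red')


def count_with_iteration(path):
    # Holds only the current line (plus a small read buffer) at any time.
    with open(path) as f:
        return sum(1 for line in f if line.split(',')[1] == 'red')


def peak_bytes(function, path):
    # Peak memory allocated by Python while the function runs.
    tracemalloc.start()
    result = function(path)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak


red_list, peak_list = peak_bytes(count_with_readlines, path)
red_iter, peak_iter = peak_bytes(count_with_iteration, path)
print(f'readlines: {peak_list:,} bytes, iteration: {peak_iter:,} bytes')
```

Both functions return the same count; only the peak memory differs. The with blocks are there so that the files get closed promptly, and are not needed for the comparison itself.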

It becomes a bit harder to avoid readlines when we have some code that takes advantage of Python’s list features. For example, if we want to just process the first few lines, and skip the header line, it’s easily done with a list:

lines = open('fruits.csv').readlines()

for line in lines[1:10]:
    name, colour, size = line.rstrip('\n').split(',')
    print(f'a {size} {colour} {name.lower()}')
a small red strawberry
a small blue blueberry
a small red raspberry
a small black blackberry
a small red cranberry
a small green gooseberry
a small red redcurrant
a small black blackcurrant
a small white white currant

If we try to drop in the open file as a replacement:

lines = open('fruits.csv')

for line in lines[1:10]:
    name, colour, size = line.rstrip('\n').split(',')
    print(f'a {size} {colour} {name.lower()}')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[37], line 3
      1 lines = open('fruits.csv')
----> 3 for line in lines[1:10]:
      4     name, colour, size = line.rstrip('\n').split(',')
      5     print(f'a {size} {colour} {name.lower()}')

TypeError: '_io.TextIOWrapper' object is not subscriptable

it doesn’t work: an open file is an iterator rather than a list, so Python does not let us use index or slice syntax on it.

A similar example would be getting every fiftieth line - easy with a list:

lines = open('fruits.csv').readlines()

for line in lines[::50]:
    name, colour, size = line.rstrip('\n').split(',')
    print(f'a {size} {colour} {name.lower()}')
a size colour name
a small orange mandarin
a medium yellow ataulfo mango
a medium yellow cowa (garcinia cowa)
a small green green fig
a large green pawpaw
a small red alpine strawberry

but doesn’t work with an open file:

lines = open('fruits.csv')

for line in lines[::50]:
    name, colour, size = line.rstrip('\n').split(',')
    print(f'a {size} {colour} {name.lower()}')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[39], line 3
      1 lines = open('fruits.csv')
----> 3 for line in lines[::50]:
      4     name, colour, size = line.rstrip('\n').split(',')
      5     print(f'a {size} {colour} {name.lower()}')

TypeError: '_io.TextIOWrapper' object is not subscriptable

Again, the slice syntax does not work.

The easiest fix for both of these is probably to use enumerate, which you might remember from day three. This gives us pairs of line numbers and lines, produced one at a time as we loop rather than stored as a list in memory. With a bit of rearrangement we can easily reproduce our code to get the first ten lines, minus the header:

for line_number, line in enumerate(open('fruits.csv')):
    if 0 < line_number < 10:
        name, colour, size = line.rstrip('\n').split(',')
        print(f'a {size} {colour} {name.lower()}')
a small red strawberry
a small blue blueberry
a small red raspberry
a small black blackberry
a small red cranberry
a small green gooseberry
a small red redcurrant
a small black blackcurrant
a small white white currant

We can even add in a little optimisation to skip the rest of the lines as soon as we reach the 11th line in the file - break tells Python to stop the loop:

for line_number, line in enumerate(open('fruits.csv')):
    if 0 < line_number < 10:
        name, colour, size = line.rstrip('\n').split(',')
        print(f'a {size} {colour} {name.lower()}')
    if line_number == 10:
        break
a small red strawberry
a small blue blueberry
a small red raspberry
a small black blackberry
a small red cranberry
a small green gooseberry
a small red redcurrant
a small black blackcurrant
a small white white currant
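To check that break really does stop the loop from pulling any more lines, here is a small sketch. The counting_lines helper, the counter dictionary and the throwaway file are all assumptions made for the demonstration; the helper simply counts each line as it is handed to the loop:

```python
import os
import tempfile

# A throwaway file with a header and 1000 data lines (made-up data).
path = os.path.join(tempfile.mkdtemp(), 'many_fruits.csv')
with open(path, 'w') as output:
    output.write('name,colour,size\n')
    for i in range(1000):
        output.write(f'Fruit {i},red,small\n')

counter = {'lines_read': 0}


def counting_lines(file):
    # Hypothetical helper: pass lines through unchanged while counting them.
    for line in file:
        counter['lines_read'] += 1
        yield line


names = []
with open(path) as f:
    for line_number, line in enumerate(counting_lines(f)):
        if 0 < line_number < 10:
            names.append(line.split(',')[0])
        if line_number == 10:
            break

print(len(names), counter['lines_read'])  # prints: 9 11
```

Only eleven of the 1001 lines ever reach the loop. The file object itself still reads from disk in buffered chunks, so slightly more than eleven lines' worth of bytes may be read behind the scenes.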

Another nice trick here is to read the first line and discard it with the next function, which makes the rest of the loop simpler:

lines = enumerate(open('fruits.csv'))
next(lines)

for line_number, line in lines:
    if line_number < 10:
        name, colour, size = line.rstrip('\n').split(',')
        print(f'a {size} {colour} {name.lower()}')
    else:
        break
a small red strawberry
a small blue blueberry
a small red raspberry
a small black blackberry
a small red cranberry
a small green gooseberry
a small red redcurrant
a small black blackcurrant
a small white white currant
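A variation on the same trick, sketched here on the assumption that the header is worth keeping: next returns the line it consumes, so we can store the column names rather than throwing them away. The throwaway file below stands in for fruits.csv, and passing None as a second argument to next avoids a StopIteration error if the file turns out to be empty:

```python
import os
import tempfile

# A tiny stand-in for fruits.csv (made-up for the demonstration).
path = os.path.join(tempfile.mkdtemp(), 'few_fruits.csv')
with open(path, 'w') as output:
    output.write('name,colour,size\n')
    output.write('Strawberry,red,small\n')
    output.write('Blueberry,blue,small\n')

with open(path) as f:
    # Consume the header line; None comes back if the file is empty.
    header = next(f, None)
    columns = header.rstrip('\n').split(',') if header else []

    # The loop carries on from the second line of the file.
    rows = []
    for line in f:
        values = line.rstrip('\n').split(',')
        rows.append(dict(zip(columns, values)))

print(columns)
print(rows[0])
```

Each row ends up as a dictionary keyed by column name, so later code can write row['colour'] instead of relying on the order of the columns.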

Of course, once we have our enumeration set up we can do other things involving line numbers quite easily. Here’s how to reproduce our example that prints every fiftieth line:

for line_number, line in enumerate(open('fruits.csv')):
    if line_number % 50 == 0:
        name, colour, size = line.rstrip('\n').split(',')
        print(f'a {size} {colour} {name.lower()}')
a size colour name
a small orange mandarin
a medium yellow ataulfo mango
a medium yellow cowa (garcinia cowa)
a small green green fig
a large green pawpaw
a small red alpine strawberry
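For completeness, the standard library also has a tool aimed at exactly this problem: itertools.islice, which takes start, stop and step arguments much like a slice but works on any iterator, including an open file. This is an alternative to the enumerate approach rather than part of it, and the throwaway file is made up for the demonstration:

```python
import os
import tempfile
from itertools import islice

# A throwaway file: a header followed by 200 numbered data lines.
path = os.path.join(tempfile.mkdtemp(), 'numbered_fruits.csv')
with open(path, 'w') as output:
    output.write('name,colour,size\n')
    for i in range(1, 201):
        output.write(f'Fruit {i},red,small\n')

# Equivalent of lines[1:10]: skip the header, stop before line 10.
with open(path) as f:
    first_nine = [line.split(',')[0] for line in islice(f, 1, 10)]

# Equivalent of lines[::50]: every fiftieth line, starting with the header.
with open(path) as f:
    every_fiftieth = [line.split(',')[0] for line in islice(f, 0, None, 50)]

print(first_nine)
print(every_fiftieth)
```

Unlike list slicing, islice does not accept negative indices, since it cannot know where the end of an iterator is without reading all of it.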
