
Measuring memory usage in Python

Monitoring memory usage

When dealing with very large bioinformatics datasets, we might start to worry about running out of memory. We can find out how much memory our program has used so far by importing the resource module and calling resource.getrusage(resource.RUSAGE_SELF).ru_maxrss.

On Linux this gives us an answer in kilobytes, so normally we divide by 1000 to get MB. (On macOS the same field is reported in bytes instead.)
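Here is a minimal sketch of a helper that returns the peak usage in MB regardless of platform (peak_memory_mb is just a name made up for this example, not part of the resource module):

import resource
import sys

def peak_memory_mb():
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == 'darwin':
        return peak / 1000000
    return peak / 1000

print(peak_memory_mb())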

We can use this to investigate the behaviour of different bits of code in Python. For example, how much more memory is used by a dict than a list?

import resource

my_list = []

# create a list with ten million elements, checking the peak memory
# usage after every millionth append
for i in range(10000000):
    my_list.append('abcdefg')
    if len(my_list) % 1000000 == 0:
        print(len(my_list), resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000)

1000000 15.404
2000000 23.248
3000000 31.124
4000000 38.78
5000000 46.7
6000000 54.568
7000000 62.224
8000000 70.144
9000000 77.8
10000000 85.72

Now how about a dictionary:

import resource

# the same experiment, but with a dict
my_dict = {}
for i in range(10000000):
    my_dict[i] = 'abcdefg'
    if len(my_dict) % 1000000 == 0:
        print(len(my_dict), resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000)

1000000 103.576
2000000 199.88
3000000 392.512
4000000 392.512
5000000 392.512
6000000 777.728
7000000 777.728
8000000 777.728
9000000 777.728
10000000 777.728

Interestingly, the dict uses roughly ten times more space to store roughly the same amount of information! When using a dict, we make an explicit trade-off between memory and lookup speed. Remember, the 'goal' of a dict is to allow very fast lookup of values by key even when it gets very large.
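A quick back-of-the-envelope calculation using the readings above shows where the difference comes from (this assumes, as is the case in CPython, that every element is a reference to the same cached 'abcdefg' string object, so the containers themselves account for almost all of the growth):

# memory growth between the one million and ten million element readings,
# divided by the nine million elements added in between
list_bytes_per_element = (85.72 - 15.404) * 1000000 / 9000000
dict_bytes_per_entry = (777.728 - 103.576) * 1000000 / 9000000

print(list_bytes_per_element)   # ~7.8 bytes: one pointer per element
print(dict_bytes_per_entry)     # ~75 bytes: key, hash and pointer, plus spare slots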

Another interesting difference is the way in which the memory usage grows. The size of a list grows linearly, whereas the size of a dict roughly doubles, then stays the same for a while. This is an artefact of how dicts are stored internally: resizing the hash table is a computationally expensive process, so Python tries to do it as rarely as possible.
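We can watch this resizing happen directly with sys.getsizeof, which reports the size of the dict object itself (its hash table, not the keys and values it contains). A minimal sketch:

import sys

d = {}
previous_size = sys.getsizeof(d)
for i in range(10000):
    d[i] = 'abcdefg'
    size = sys.getsizeof(d)
    if size != previous_size:
        # the hash table has just been resized
        print(len(d), size)
        previous_size = size

The exact thresholds and sizes vary between Python versions, but the same step pattern appears: long flat stretches punctuated by sudden jumps.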

Python is generally quite good at reclaiming unneeded space. For example, here is a script that creates two lists, each of nine million elements:

import resource

list1 = []
list2 = []

# fill the first list, checking peak memory after every million appends
for j in range(9):
    for i in range(1000000):
        list1.append('abcdefg')
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000)

# then fill the second list in the same way
for j in range(9):
    for i in range(1000000):
        list2.append('abcdefg')
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000)

9.488
13.508
17.344
21.304
25.132
29.092
33.052
37.012
40.972
44.872
48.584
52.544
56.364
60.324
64.284
68.244
72.116
76.076

As expected, this uses roughly twice the memory of a single list. Look what happens when we empty the first list before filling the second one:

import resource

list1 = []
list2 = []
for j in range(9):
    for i in range(1000000):
        list1.append('abcdefg')
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000)

# replace the first list with a new empty one, dropping the only
# reference to the nine million element list
list1 = []

for j in range(9):
    for i in range(1000000):
        list2.append('abcdefg')
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000)
9.488
13.508
17.344
21.304
25.132
29.092
33.052
37.012
40.972
40.972
40.972
40.972
40.972
40.972
40.972
40.972
40.972
40.972

Python is able to 'reclaim' the memory used by the first list, and re-use it for the second list, so the memory usage is constant in the second half of the script. (This process is known as garbage collection and is a very interesting computer science problem in its own right.)
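In CPython the space is actually reclaimed as soon as the last reference to the list disappears, thanks to reference counting; the garbage collector proper only has to deal with objects that refer to each other in cycles. So deleting the name has the same effect as rebinding it. A quick sketch (the exact numbers will vary from machine to machine):

import resource

list1 = ['abcdefg'] * 9000000
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000)

del list1   # drop the only reference, so the list is freed immediately

list2 = ['abcdefg'] * 9000000   # re-uses the space that was just freed
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000)

Both prints should report roughly the same peak, since the second list fits in the space the first one vacated.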

Important: the number reported by getrusage is the maximum amount of memory used so far. In other words, it does not tell us how much memory is being used right now, but the peak over the lifetime of the program. Most of the time that is actually the number we are interested in, as it tells us how much RAM we need to run a program on a particular dataset.
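If we do want the current figure rather than the peak, the standard library's tracemalloc module reports both for memory allocated by Python objects (a sketch; note that tracemalloc adds overhead and only counts Python-level allocations, not the whole process size):

import tracemalloc

tracemalloc.start()
data = ['abcdefg'] * 1000000
data = []   # release the big list again
current, peak = tracemalloc.get_traced_memory()
print(current / 1000000, peak / 1000000)   # current MB, peak MB
tracemalloc.stop()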

