Machine learning for biology part four

Introduction

In the first three parts of this series, we saw how we could take a dataset of measurements of different species of penguins:

In [36]:
import pandas as pd
import seaborn as sns

df = (
    pd.read_csv(
        "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv",
    )
    .dropna()  # missing data will confuse things
    [['species', 'bill_length_mm', 'flipper_length_mm']]
)


sns.relplot(
    data = df,
    x = 'bill_length_mm',
    y = 'flipper_length_mm',
    hue = 'species',
    height=8,
    hue_order = ['Adelie', 'Gentoo', 'Chinstrap']
)
Out[36]:
<seaborn.axisgrid.FacetGrid at 0x7f9b49a88f70>

and write a function that would attempt to predict the species of a new penguin for which we only know the measurements:

In [37]:
def guess_species(bill_length, flipper_length, k):
    
    # calculate distances and add to the dataframe
    flipper_length_difference = training_data['flipper_length_mm'] - flipper_length
    bill_length_difference = training_data['bill_length_mm'] - bill_length
    overall_distance = np.sqrt(                             
        flipper_length_difference ** 2   
        + bill_length_difference ** 2    
    )
    training_data['distance to new point'] = overall_distance
    
    # find closest points and calculate most common species
    most_common_species = (
        training_data.sort_values('distance to new point') 
        .head(k)                               
        ['species']                             
        .mode()[0]                              
    )
    
    # the most common species is our guess
    return most_common_species

Our current version of the function, above, takes an extra argument in addition to the measurements: k, the number of penguins from the training dataset to compare the new penguin to. In order for it to work, we need to split the dataset into training and testing datasets - see part three for an explanation of this code, and see the Biological Data Exploration book for an explanation of all the syntax used:

In [38]:
import numpy as np

df['assignment'] = np.random.choice(['training', 'testing'], 333, p=[0.8, 0.2])
training_data = df[df['assignment'] == 'training']
test_data = df[df['assignment'] == 'testing']

The problem of scoring

The problem that we were considering at the end of part three was how to come up with a fair way of evaluating the performance of the classification function. This is necessary in order to pick the best value of k (i.e. how many nearest neighbours to consider). But having a reliable scoring system is also necessary for many other parts of the machine learning workflow:

  • deciding how much to trust the predictions of our classifier function
  • choosing between different machine learning algorithms
  • deciding which measurements to include in our list of features.

To recap, the problem with our existing method of randomly assigning penguins to either training or testing is that the exact same parameters give different scores depending on the random assignment. In part three, we illustrated this by running the same prediction code in a loop:

In [39]:
for _ in range(10):
    # randomly assign the rows to either training or test
    df['assignment'] = np.random.choice(['training', 'testing'], 333, p=[0.8, 0.2])
    training_data = df[df['assignment'] == 'training']
    test_data = df[df['assignment'] == 'testing']
    
    # calculate score
    predictions = test_data.apply(
        lambda p : 
            guess_species(p['bill_length_mm'], p['flipper_length_mm'], 10)
        , axis=1
    )

    score = (predictions == test_data['species']).value_counts(normalize=True)[True]
    print(score)
0.90625
0.9523809523809523
0.925
0.8412698412698413
0.9076923076923077
0.9310344827586207
0.9423076923076923
0.9180327868852459
0.9538461538461539
0.95

and noting that the scores varied quite a bit.

Averaging scores

One obvious way to get a more consistent score is to simply run this process lots of times and take an average. If there really is a systematic difference between different values of k, the lucky and unlucky splits should even out and leave us with a fair estimate. We can get pandas to do the work of calculating the mean score over multiple random test/training data splits:

In [40]:
scores = []

# get the mean score for 100 random train/test splits
for _ in range(100):
    # randomly assign the rows to either training or test
    df['assignment'] = np.random.choice(['training', 'testing'], 333, p=[0.8, 0.2])
    training_data = df[df['assignment'] == 'training']
    test_data = df[df['assignment'] == 'testing']
    
    # calculate score
    predictions = test_data.apply(
        lambda p : 
            guess_species(p['bill_length_mm'], p['flipper_length_mm'], 10)
        , axis=1
    )

    score = (predictions == test_data['species']).value_counts(normalize=True)[True]
    scores.append(score)
    
pd.Series(scores).mean()
Out[40]:
0.9400861867364911

and averaged over the 100 iterations in the code above, the score is very consistent between runs. The problem with this approach is the amount of time it takes. We're using a relatively simple algorithm here, so each iteration doesn't take too long to complete. But many machine learning algorithms are very computationally expensive, and remember that in this approach we are doing the complete training of the model 100 times just to evaluate a single set of parameters. With more complex algorithms we might be doing a parameter search over lots of different parameters, and real-world datasets will likely be larger and more complex as well. So this simple approach isn't practical for performance reasons.

Cross validation

The solution that comes to the rescue from this computational complexity is called cross-validation. In this approach, rather than splitting our dataset into training and testing data randomly, we do it systematically. We will stick with our 80%/20% training/testing split and walk through how it works.

First, we divide our dataset up into 5 chunks. It's important that the data are randomly distributed between the chunks (more on that later) so we'll shuffle the rows in the dataset first:

In [41]:
# this is the easiest way to randomize row order
df = df.sample(frac=1)
df.head()
Out[41]:
species bill_length_mm flipper_length_mm assignment
169 Gentoo 49.2 221.0 training
129 Adelie 44.1 210.0 training
136 Adelie 35.6 191.0 testing
202 Gentoo 46.6 210.0 training
138 Adelie 37.0 185.0 training

To figure out the size of the chunks we will simply divide the number of rows by 5:

In [42]:
len(df) / 5
Out[42]:
66.6

and truncate to get an integer:

In [43]:
int(len(df) / 5)
Out[43]:
66

So at this point we know that each chunk of our dataset will contain 66 penguins (333 doesn't divide exactly by 5, so the last few rows simply get left out). To split up the dataset into chunks we could do various clever things with loops, but to keep things super-simple we'll just do it manually:

In [44]:
chunk1 = df.iloc[0:66]
chunk2 = df.iloc[66:132]
chunk3 = df.iloc[132:198]
chunk4 = df.iloc[198:264]
chunk5 = df.iloc[264:330]
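
For the record, one of those "clever things with loops" could be as simple as a list comprehension. This is just a sketch, and it produces the same five chunks as the manual slicing above:

# a sketch: build the same five chunks with a list comprehension
chunk_size = int(len(df) / 5)
chunks = [df.iloc[i * chunk_size : (i + 1) * chunk_size] for i in range(5)]

# chunks[0] corresponds to chunk1, chunks[1] to chunk2, and so on

We'll stick with the explicitly named chunks for the rest of this article to keep the code easy to follow.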

Next, we run the complete training/testing process five times, each time using one of the five chunks as the testing data and the remaining four as the training data. To make this code a bit simpler, we'll first modify the classifier function to take the training dataset as an argument:

In [45]:
def guess_species(bill_length, flipper_length, k, training_data):
    
    # calculate distances and add to the dataframe
    flipper_length_difference = training_data['flipper_length_mm'] - flipper_length
    bill_length_difference = training_data['bill_length_mm'] - bill_length
    overall_distance = np.sqrt(                             
        flipper_length_difference ** 2   
        + bill_length_difference ** 2    
    )
    training_data['distance to new point'] = overall_distance

    # find closest points and calculate most common species
    most_common_species = (
        training_data.sort_values('distance to new point') 
        .head(k)                               
        ['species']                             
        .mode()[0]                              
    )
    
    # the most common species is our guess
    return most_common_species

So this is how we would calculate the score using the first chunk as the test data (Chapter 16 of Biological Data Exploration will give you the details on pd.concat):

In [46]:
# the training data is made up of chunks 2,3,4,5
training_data = pd.concat([chunk2,chunk3,chunk4,chunk5])

# only use chunk 1 for testing
predictions = chunk1.apply(
        lambda p : 
            guess_species(p['bill_length_mm'], p['flipper_length_mm'], 10, training_data)
        , axis=1
    )

score = (predictions == chunk1['species']).value_counts(normalize=True)[True]
score
Out[46]:
0.9545454545454546

To stop the code for the complete cross validation from becoming too cumbersome, we'll write a quick helper function that will calculate the score for a single testing chunk and a list of training chunks. Notice that this function needs to take k as an argument so that it can pass it on to guess_species:

In [47]:
def score_single_chunk(testing_chunk, training_chunks, k):
    # the training data is made up of all the training chunks combined
    training_data = pd.concat(training_chunks)

    # only use the testing chunk for testing
    predictions = testing_chunk.apply(
            lambda p : 
                guess_species(p['bill_length_mm'], p['flipper_length_mm'], k, training_data)
            , axis=1
        )

    score = (predictions == testing_chunk['species']).value_counts(normalize=True)[True]
    return score

Now the cross-validation process is simply a case of feeding the chunks into this function in the different permutations. There are many elegant ways to do this, but for our purposes this is ugly but readable:

In [48]:
all_scores = []
all_scores.append(score_single_chunk(chunk1, [chunk2, chunk3, chunk4, chunk5], 10))
all_scores.append(score_single_chunk(chunk2, [chunk1, chunk3, chunk4, chunk5], 10))
all_scores.append(score_single_chunk(chunk3, [chunk1, chunk2, chunk4, chunk5], 10))
all_scores.append(score_single_chunk(chunk4, [chunk1, chunk2, chunk3, chunk5], 10))
all_scores.append(score_single_chunk(chunk5, [chunk1, chunk2, chunk3, chunk4], 10))

all_scores
Out[48]:
[0.9545454545454546,
 0.9696969696969697,
 0.8787878787878788,
 0.9393939393939394,
 0.9393939393939394]
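
One of the more elegant approaches mentioned above is to loop over the chunks, letting each one take a turn as the testing data while the rest make up the training data. The following is just a sketch, and it produces exactly the same five scores as the explicit calls above:

# a sketch of a loop-based version of the cross validation above
chunks = [chunk1, chunk2, chunk3, chunk4, chunk5]

all_scores = []
for i, testing_chunk in enumerate(chunks):
    # every chunk except the current one goes into the training data
    training_chunks = chunks[:i] + chunks[i + 1:]
    all_scores.append(score_single_chunk(testing_chunk, training_chunks, 10))

all_scores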

Once we have our collection of five scores, we can summarize them however we like:

In [49]:
np.mean(all_scores)
Out[49]:
0.9363636363636363

Although this single number isn't a very exciting outcome, it's an important step in being able to fairly measure the performance of a classifier. In cross-validation we have an approach that is both deterministic - i.e. doesn't involve any randomness - and reasonably efficient.

Actually, that first bit is not quite true. Remember that at the start of the process, we shuffled the dataframe to get the rows in a random order. If we run the process again, we'll get a different random order, and hence a different division of the rows into chunks, and hence slightly different scores. If we do the experiment:

In [52]:
for _ in range(10):
    df = df.sample(frac=1)
    chunk1 = df.iloc[0:66]
    chunk2 = df.iloc[66:132]
    chunk3 = df.iloc[132:198]
    chunk4 = df.iloc[198:264]
    chunk5 = df.iloc[264:330]
    
    all_scores = []
    all_scores.append(score_single_chunk(chunk1, [chunk2, chunk3, chunk4, chunk5], 10))
    all_scores.append(score_single_chunk(chunk2, [chunk1, chunk3, chunk4, chunk5], 10))
    all_scores.append(score_single_chunk(chunk3, [chunk1, chunk2, chunk4, chunk5], 10))
    all_scores.append(score_single_chunk(chunk4, [chunk1, chunk2, chunk3, chunk5], 10))
    all_scores.append(score_single_chunk(chunk5, [chunk1, chunk2, chunk3, chunk4], 10))
    
    print(np.mean(all_scores))
0.9424242424242426
0.9424242424242426
0.9424242424242426
0.9333333333333333
0.9393939393939394
0.9393939393939394
0.9393939393939394
0.9393939393939394
0.9393939393939394
0.9424242424242424

We'll see that while the scores do vary a bit, they vary much less than with random training/testing assignment, and the larger the dataset, the smaller this variation will be.
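
As a side note, if we ever need the whole process to be exactly reproducible, we can fix the seed used for the shuffle; pandas' sample method takes a random_state argument for this (the 42 below is an arbitrary choice - any fixed integer will do). A minimal sketch:

# shuffling with a fixed seed gives the same row order, the same chunks,
# and therefore exactly the same cross validation scores on every run
df = df.sample(frac=1, random_state=42)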

The importance of row order

While we're talking about the effects of order, it's a good time to illustrate why we have to make sure our dataset isn't sorted before carrying out cross validation. The logic is simple: if our penguins are sorted by, for example, bill length, then each chunk will contain penguins with bill lengths very different from those in the other chunks.

With our randomly shuffled dataframe, the bill lengths of penguins in the first and last chunks are very similar:

In [53]:
chunk1['bill_length_mm'].mean(), chunk5['bill_length_mm'].mean()
Out[53]:
(44.16515151515151, 43.366666666666674)

But if we sort the dataframe first, they end up very different:

In [54]:
sorted_df = df.sort_values('bill_length_mm')
chunk1 = sorted_df.iloc[0:66]
chunk2 = sorted_df.iloc[66:132]
chunk3 = sorted_df.iloc[132:198]
chunk4 = sorted_df.iloc[198:264]
chunk5 = sorted_df.iloc[264:330]

chunk1['bill_length_mm'].mean(), chunk5['bill_length_mm'].mean()
Out[54]:
(36.44545454545455, 51.06515151515151)

With such different data points in the training and testing sets, it's no surprise that if we try to carry out cross validation on our sorted dataframe:

In [55]:
all_scores = []
all_scores.append(score_single_chunk(chunk1, [chunk2, chunk3, chunk4, chunk5], 10))
all_scores.append(score_single_chunk(chunk2, [chunk1, chunk3, chunk4, chunk5], 10))
all_scores.append(score_single_chunk(chunk3, [chunk1, chunk2, chunk4, chunk5], 10))
all_scores.append(score_single_chunk(chunk4, [chunk1, chunk2, chunk3, chunk5], 10))
all_scores.append(score_single_chunk(chunk5, [chunk1, chunk2, chunk3, chunk4], 10))

np.mean(all_scores)
Out[55]:
0.9242424242424242

We get a worse mean score (0.92 rather than 0.94).

Picking the best value for k

Now that we have our cross validation function set up, we can try to solve the problem set out at the beginning of this article: picking the best value for k. Because we've already done the hard work, this turns out to be quite straightforward - a simple loop and range does the job:

In [56]:
for k in range(1,20):
    all_scores = []
    all_scores.append(score_single_chunk(chunk1, [chunk2, chunk3, chunk4, chunk5], k))
    all_scores.append(score_single_chunk(chunk2, [chunk1, chunk3, chunk4, chunk5], k))
    all_scores.append(score_single_chunk(chunk3, [chunk1, chunk2, chunk4, chunk5], k))
    all_scores.append(score_single_chunk(chunk4, [chunk1, chunk2, chunk3, chunk5], k))
    all_scores.append(score_single_chunk(chunk5, [chunk1, chunk2, chunk3, chunk4], k))

    score_for_k = np.mean(all_scores)
    print(k, score_for_k)
1 0.9151515151515153
2 0.8939393939393941
3 0.9212121212121213
4 0.9212121212121213
5 0.9242424242424242
6 0.9333333333333333
7 0.9272727272727274
8 0.9272727272727274
9 0.9272727272727274
10 0.9242424242424242
11 0.9242424242424242
12 0.9212121212121213
13 0.9242424242424242
14 0.9242424242424242
15 0.9242424242424242
16 0.9242424242424242
17 0.9242424242424242
18 0.9212121212121213
19 0.9242424242424242

The output here shows us that the optimum value for k for our dataset is 6, but also that the problem doesn't seem particularly sensitive to the value we pick. Values for k between about 3 and 20 give roughly similar scores.
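
If we'd rather see this pattern at a glance than scan a column of numbers, one option - just a sketch, reusing the seaborn library we imported at the start - is to collect the mean score for each value of k in a dataframe and plot it:

# collect the mean score for each value of k, then plot score against k
results = []
for k in range(1, 20):
    all_scores = []
    all_scores.append(score_single_chunk(chunk1, [chunk2, chunk3, chunk4, chunk5], k))
    all_scores.append(score_single_chunk(chunk2, [chunk1, chunk3, chunk4, chunk5], k))
    all_scores.append(score_single_chunk(chunk3, [chunk1, chunk2, chunk4, chunk5], k))
    all_scores.append(score_single_chunk(chunk4, [chunk1, chunk2, chunk3, chunk5], k))
    all_scores.append(score_single_chunk(chunk5, [chunk1, chunk2, chunk3, chunk4], k))
    results.append({'k': k, 'score': np.mean(all_scores)})

sns.relplot(data=pd.DataFrame(results), x='k', y='score', kind='line')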

Summary

At the conclusion of part four of our machine learning mini-series, we have covered a lot of ground. In part one, we introduced the idea of a classifier function that would assign penguins to species based on their measurements. In part two, we wrote a function that used pandas to implement the K-nearest-neighbours algorithm and took a look at its behaviour. In part three, we saw how we could get slightly different results by varying the value of k, and learned about splitting our dataset into training and testing sets. And in this part, we introduced the idea of cross validation as a better way to evaluate the performance of our classification function.

If you've been following this series of articles, you might have noticed that we haven't used any machine learning specific libraries so far. Instead, we've used a combination of pandas and numpy to implement the various tools of machine learning for ourselves. This is deliberate; by looking at how each tool works in detail rather than treating them as black boxes we have gained a lot of insight into the logic behind them.

In the next part, we'll switch from writing our own classification and cross-validation functions to using the sklearn package. This will allow us to reproduce the work we've done so far with much less code, and then to dive into some more complicated machine learning problems.
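
To give a rough idea of what that will look like, here is a sketch using scikit-learn's KNeighborsClassifier and cross_val_score (names we haven't introduced yet, and whose details - such as how the folds are chosen - differ slightly from our manual chunks):

# a rough preview of the next part: k-nearest-neighbours plus five-fold
# cross validation using scikit-learn rather than our own functions
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

features = df[['bill_length_mm', 'flipper_length_mm']]
labels = df['species']

scores = cross_val_score(KNeighborsClassifier(n_neighbors=10), features, labels, cv=5)
scores.mean()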

If you don't want to miss it, sign up for the Python for Biologists newsletter just below this article, and follow @pythonforbiolg on twitter.