Machine learning for biology part three

Introduction

In parts one and two we considered a dataset consisting of bill and flipper length for penguins:

In [1]:
import pandas as pd

df = (
    pd.read_csv(
    "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv",
    )
    .dropna() # missing data will confuse things
)[['species', 'bill_length_mm', 'flipper_length_mm']]

import seaborn as sns

sns.relplot(
    data = df,
    x = 'bill_length_mm',
    y = 'flipper_length_mm',
    hue = 'species',
    height=8,
    hue_order = ['Adelie', 'Gentoo', 'Chinstrap']
)
Out[1]:
<seaborn.axisgrid.FacetGrid at 0x7f34cb567be0>

and implemented a function that could guess the species of a new penguin based on its measurements, by finding the penguins with the most similar measurements:

In [14]:
import numpy as np

def guess_species(bill_length, flipper_length):
    
    # calculate distances and add to the dataframe
    flipper_length_difference = df['flipper_length_mm'] - flipper_length
    bill_length_difference = df['bill_length_mm'] - bill_length
    overall_distance = np.sqrt(                             
        flipper_length_difference ** 2   
        + bill_length_difference ** 2    
    )
    df['distance to new point'] = overall_distance
    
    # find closest points and calculate most common species
    most_common_species = (
        df.sort_values('distance to new point') 
        .head(10)                               
        ['species']                             
        .mode()[0]                              
    )
    
    # the most common species is our guess
    return most_common_species

guess_species(45, 211)
Out[14]:
'Gentoo'

Before we start with anything new, we should introduce a few specific terms to bring our descriptions into line with machine learning terminology:

  • the function that we wrote is a classifier, since its job is to start with measurements and generate a label
  • the output of the function is its prediction (this sounds a bit more scientific than guess)
  • the collection of penguin measurements that the function uses is the training set
  • the process of inputting the training set (which in this case we do by simply having the function use our existing dataframe) is called fitting
  • the training set consists of two parts: the measurements which we call features, and the species names which we call classes. (I know this is confusing since in biology class is a taxonomic level, but we will ignore that for now)

It might be interesting at this point to note that although we've implemented this function to work specifically for our dataset, the general approach - to predict the class of a new data point, count the classes of the closest points and pick the class that appears most often - seems like it would work for other pairs of measurements. And if we can figure out a way to calculate distances between points in more than two dimensions, it might even work for data points with more than two measurements. We will come back to this idea later in this series. For now, back to the function that is in front of us.
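
In fact, the straight-line (Euclidean) distance formula extends naturally to any number of dimensions: square the difference along each axis, add them up, and take the square root. Here's a minimal sketch using hypothetical measurements with a made-up third feature (body mass in grams), just for illustration:

```python
import numpy as np

# hypothetical penguin measurements with a made-up third feature:
# bill length, flipper length, body mass in grams
new_penguin = np.array([45.0, 211.0, 4500.0])
known_penguin = np.array([39.1, 181.0, 3750.0])

# Euclidean distance generalises to any number of dimensions:
# square the differences along each axis, sum them, take the square root
distance = np.sqrt(np.sum((new_penguin - known_penguin) ** 2))

# numpy has a shortcut that computes exactly the same thing
shortcut = np.linalg.norm(new_penguin - known_penguin)
```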

The most interesting bit of terminology is the name of the algorithm that we have implemented in the function - K nearest neighbors. I will use the American spelling for neighbors as that's how the function name is spelled in various machine learning libraries.

The K bit of the name is because there is a parameter that we can tweak in this algorithm: the number of closest points that we want to consider when making our prediction. In the version of the function above we have used 10, but that's an arbitrary choice, and we can easily rewrite the function so that K is an argument:

In [16]:
def guess_species(bill_length, flipper_length, k):
    
    # calculate distances and add to the dataframe
    flipper_length_difference = df['flipper_length_mm'] - flipper_length
    bill_length_difference = df['bill_length_mm'] - bill_length
    overall_distance = np.sqrt(                             
        flipper_length_difference ** 2   
        + bill_length_difference ** 2    
    )
    df['distance to new point'] = overall_distance
    
    # find closest points and calculate most common species
    most_common_species = (
        df.sort_values('distance to new point') 
        .head(k)                               
        ['species']                             
        .mode()[0]                              
    )
    
    # the most common species is our guess
    return most_common_species

We can now fairly easily find sets of measurements for which we get different predictions depending on which value of K we pick:

In [13]:
print("k=2 : ", guess_species(44, 210, 2))
print("k=5 : ", guess_species(44, 210, 5))
k=2 :  Adelie
k=5 :  Gentoo
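
As an aside, with even values of k a tie is possible: the k nearest neighbours might be split evenly between two species. Our function doesn't handle this case explicitly - pandas' mode() returns all tied values in sorted order, so taking element [0] silently picks the alphabetically first species. A minimal sketch of that behaviour:

```python
import pandas as pd

# with k=2, the two nearest neighbours can belong to different species: a tie
neighbours = pd.Series(['Gentoo', 'Adelie'])

# mode() returns all tied values in sorted order,
# so [0] silently picks the alphabetically first species
tie_break = neighbours.mode()[0]
print(tie_break)  # -> Adelie
```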

This raises an obvious question: which value of K should we pick to get the most accurate results from our classifier function? To answer it we need to look at scoring a classifier in a bit more detail. We have already seen one way to do this in part two: we wrote some code to run the prediction function on all of the real-life penguin measurements, then compare the predictions to the true species and ask what fraction of the time they matched:

In [22]:
# generate predicted species for k=10
predictions = df.apply(
    lambda p : 
        guess_species(p['bill_length_mm'], p['flipper_length_mm'], 10)
    , axis=1
)

# compare predictions to true species and get the fraction that match
(predictions == df['species']).value_counts(normalize=True)[True]
Out[22]:
0.9519519519519519
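
A small side note: indexing the value_counts result with [True] would raise a KeyError in the (unlikely) event that no predictions matched. Since the comparison produces a boolean Series, taking its mean gives the same fraction more robustly; a sketch with toy labels:

```python
import pandas as pd

# toy predictions and true labels (hypothetical values)
predictions = pd.Series(['Gentoo', 'Adelie', 'Gentoo', 'Gentoo'])
true_species = pd.Series(['Gentoo', 'Adelie', 'Adelie', 'Gentoo'])

# True counts as 1 and False as 0, so the mean of the boolean
# comparison is exactly the fraction of matching predictions
accuracy = (predictions == true_species).mean()
print(accuracy)  # -> 0.75
```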

So it seems like a trivial matter to try this for a range of values of k and just pick the one that gives the highest score:

In [24]:
for k in range(1, 10):
    # generate predicted species 
    predictions = df.apply(
        lambda p : 
            guess_species(p['bill_length_mm'], p['flipper_length_mm'], k)
        , axis=1
    )

    # compare predictions to true species and get the fraction that match
    score = (predictions == df['species']).value_counts(normalize=True)[True]
    print(k, score)
1 1.0
2 0.963963963963964
3 0.9669669669669669
4 0.963963963963964
5 0.960960960960961
6 0.960960960960961
7 0.960960960960961
8 0.954954954954955
9 0.948948948948949

However, under this scoring scheme, a k value of 1 will always give a perfect score. It's easy to see why when we think about the test data: because each point we are testing is actually in the dataset, the closest neighbor will always be the point itself. So by setting k to 1 the only point that will be used in the prediction will be the same one we are testing. In other words, we are testing the function by measuring how well it can predict the classes of data points that it has already seen.

In machine learning, this is a big mistake. After all, the whole goal of building a classifier is to get reliable predictions for new points (in our case, new penguins). So to fairly measure the performance, we should test the function by using it to generate predictions for measurements where we know the correct species, but which are not in the training set.

The easiest way to do this is to randomly assign each penguin (i.e. each row in our dataframe) to be used either for training or testing, but not both. One way to do this is by adding a new column to the dataframe, which will be 'training' 80% of the time and 'testing' 20% of the time. numpy can do most of the work for us:

In [70]:
df['assignment'] = np.random.choice(
    ['training', 'testing'],    # these are the items to randomly choose
    len(df),                    # this is the number of choices we want (one per row)
    p=[0.8, 0.2]                # these are the probabilities of picking each item 
)
df
Out[70]:
species bill_length_mm flipper_length_mm distance to new point assignment
0 Adelie 39.1 181.0 20.302955 training
1 Adelie 39.5 186.0 16.077624 training
2 Adelie 40.3 195.0 10.344564 testing
4 Adelie 36.7 193.0 14.396180 training
5 Adelie 39.3 190.0 13.520725 training
... ... ... ... ... ...
339 Chinstrap 55.8 207.0 NaN training
340 Chinstrap 43.5 202.0 7.803204 training
341 Chinstrap 49.6 193.0 5.035871 training
342 Chinstrap 50.8 210.0 12.014991 training
343 Chinstrap 50.2 198.0 NaN training

333 rows × 5 columns

and if we do value_counts on the resulting column we can see the actual numbers:

In [53]:
df['assignment'].value_counts()
Out[53]:
training    269
testing      64
Name: assignment, dtype: int64

Since the assignments are generated randomly we'll get slightly different absolute numbers each time we run the code, but the split will always be roughly 80%/20%. These percentages are arbitrary of course; we could instead choose to use 10% or 30% of the data for testing.
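
As another aside, a common alternative way to get the same kind of split - without adding a column - is to sample 80% of the rows for training and use whatever is left for testing. A sketch with a toy stand-in dataframe:

```python
import pandas as pd

# toy stand-in for the penguin dataframe
df = pd.DataFrame({'measurement': range(100)})

# sample 80% of the rows for training...
training_data = df.sample(frac=0.8)

# ...and use everything that wasn't sampled for testing
test_data = df.drop(training_data.index)

print(len(training_data), len(test_data))  # -> 80 20
```

Unlike the random-assignment approach, this gives an exact 80/20 split every time.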

Now we can use the new column to split the penguins up into a training dataset and a testing dataset, which we can store in variables for convenience:

In [55]:
training_data = df[df['assignment'] == 'training'].copy() # copy so we can safely add a distance column later
test_data = df[df['assignment'] == 'testing']
len(training_data), len(test_data)
Out[55]:
(269, 64)

These new dataframes are simply subsets of the original rows. Now we are prepared to do a fairer test of the function. We'll replace the reference to df with a reference to training_data, which will make sure that the function can only "see" the species of the points in the training set:

In [57]:
def guess_species(bill_length, flipper_length, k):
    
    # calculate distances and add to the dataframe
    flipper_length_difference = training_data['flipper_length_mm'] - flipper_length
    bill_length_difference = training_data['bill_length_mm'] - bill_length
    overall_distance = np.sqrt(                             
        flipper_length_difference ** 2   
        + bill_length_difference ** 2    
    )
    training_data['distance to new point'] = overall_distance
    
    # find closest points and calculate most common species
    most_common_species = (
        training_data.sort_values('distance to new point') 
        .head(k)                               
        ['species']                             
        .mode()[0]                              
    )
    
    # the most common species is our guess
    return most_common_species

And when we run our testing code, we will only use the test data:

In [59]:
# generate predicted species for k=10
predictions = test_data.apply(
    lambda p : 
        guess_species(p['bill_length_mm'], p['flipper_length_mm'], 10)
    , axis=1
)

# compare predictions to true species and get the fraction that match
(predictions == test_data['species']).value_counts(normalize=True)[True]
Out[59]:
0.90625

In this version of the test, the training data and testing data are completely separate - we use one set of data points to train the classifier (i.e. to find the nearest neighbors) and an entirely different set of data points to test it. In other words, we are now asking how well the function can predict the species of a penguin it has never seen before, which is a scenario much closer to real-world use.
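
One practical note: because the assignment is random, re-running the notebook gives a different split (and a slightly different score) each time. If we wanted a reproducible split, we could use a seeded random generator; a sketch (the seed value 42 is arbitrary):

```python
import numpy as np

# two generators seeded with the same value produce identical assignments
first = np.random.default_rng(42).choice(['training', 'testing'], 333, p=[0.8, 0.2])
second = np.random.default_rng(42).choice(['training', 'testing'], 333, p=[0.8, 0.2])

print((first == second).all())  # -> True
```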

Let's now put the new testing code in a loop, and calculate a score for each value of k from 1 to 9:

In [65]:
for k in range(1, 10):
    # generate predicted species 
    predictions = test_data.apply(
        lambda p : 
            guess_species(p['bill_length_mm'], p['flipper_length_mm'], k)
        , axis=1
    )

    # compare predictions to true species and get the fraction that match
    score = (predictions == test_data['species']).value_counts(normalize=True)[True]
    print(k, score)
1 0.921875
2 0.90625
3 0.9375
4 0.921875
5 0.953125
6 0.9375
7 0.9375
8 0.921875
9 0.921875

Finally, it seems like we have a clear answer: for this particular problem, we get the most accurate predictions with k=5 (i.e. by considering the five closest points when trying to classify a new point), with worse predictions if we make k either higher or lower.

Unfortunately, it's a bit more complicated than that. The result we have obtained is only valid for this particular split of the data into training and testing subsets. Remember: there's nothing special about the particular split we have obtained here - we got it by randomly assigning each row to either the training or test data. To illustrate this, look at what happens when we repeat the splitting process ten times and then calculate the score for each different random split. Notice that in this loop we are keeping k fixed at 10, and just doing the same experiment repeatedly:

In [68]:
for _ in range(10):
    # randomly assign the rows to either training or test
    df['assignment'] = np.random.choice(['training', 'testing'], len(df), p=[0.8, 0.2])
    training_data = df[df['assignment'] == 'training'].copy() # copy so the function can safely add its distance column
    test_data = df[df['assignment'] == 'testing']
    
    # calculate score
    predictions = test_data.apply(
        lambda p : 
            guess_species(p['bill_length_mm'], p['flipper_length_mm'], 10)
        , axis=1
    )

    score = (predictions == test_data['species']).value_counts(normalize=True)[True]
    print(score)
0.9516129032258065
0.9642857142857143
0.9558823529411765
0.9743589743589743
0.9242424242424242
0.9636363636363636
0.9538461538461539
0.9130434782608695
0.9583333333333334
0.9365079365079365

The scores vary a lot, from a near-perfect 97% accuracy to a much less impressive 91% accuracy. Thinking about it logically, we can see why this must be the case. In some iterations, due to random chance, we will get a training set and test set that are very similar, in which case the function will perform well. But in some iterations, the training and test sets will be very different, leading to a lower score. If we are very unlucky, we might get a split where most of the Gentoo penguins end up in the training set, but most of the Adelie penguins end up in the test set. Due to the simple species-counting logic that our function uses, it will very rarely predict Adelie.
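
One simple way to quantify this spread is to collect the scores in a list and summarise them; a sketch using the ten scores printed above:

```python
import numpy as np

# the ten scores from the ten random splits above
scores = [
    0.9516129032258065, 0.9642857142857143, 0.9558823529411765,
    0.9743589743589743, 0.9242424242424242, 0.9636363636363636,
    0.9538461538461539, 0.9130434782608695, 0.9583333333333334,
    0.9365079365079365,
]

# the mean is a more stable estimate of performance than any single split,
# and the standard deviation tells us how much the splits disagree
print(np.mean(scores), np.std(scores))
```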

We will finish this part of the series here. In the next part, we'll pick up the idea of scoring a classifier and look at a few ways to get round the problem of randomness.