Machine learning for biology part three
Wed 20 October 2021

In the first two parts of this series we loaded the Palmer penguins dataset and plotted bill length against flipper length for the three species:

import pandas as pd
import seaborn as sns
df = (
    pd.read_csv(
        "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv",
    )
    .dropna()  # missing data will confuse things
)[['species', 'bill_length_mm', 'flipper_length_mm']]
sns.relplot(
    data = df,
    x = 'bill_length_mm',
    y = 'flipper_length_mm',
    hue = 'species',
    height = 8,
    hue_order = ['Adelie', 'Gentoo', 'Chinstrap']
)
and implemented a function that could guess the species of a new penguin based on its measurements, by finding the penguins with the most similar measurements:
import numpy as np

def guess_species(bill_length, flipper_length):

    # calculate distances and add to the dataframe
    flipper_length_difference = df['flipper_length_mm'] - flipper_length
    bill_length_difference = df['bill_length_mm'] - bill_length
    overall_distance = np.sqrt(
        flipper_length_difference ** 2
        + bill_length_difference ** 2
    )
    df['distance to new point'] = overall_distance

    # find closest points and calculate most common species
    most_common_species = (
        df.sort_values('distance to new point')
        .head(10)
        ['species']
        .mode()[0]
    )

    # the most common species is our guess
    return most_common_species
guess_species(45, 211)
If you haven't read parts one and two of this series, or if it's been a while, then I recommend at least skimming them so that you will be able to jump in here. And as a reminder, all of the pandas/seaborn magic that we are using for data processing and visualization is explained in the Biological Data Exploration book.
Naming things
Before we start with anything new, we should introduce a few specific terms to bring our descriptions into line with machine learning terminology:
- the function that we wrote is a classifier, since its job is to start with measurements and generate a label
- the output of the function is its prediction (this sounds a bit more scientific than guess)
- the collection of penguin measurements that the function uses is the training set
- the process of inputting the training set (which in this case we do by simply having the function use our existing dataframe) is called fitting
- the training set consists of two parts: the measurements, which we call features, and the species names, which we call classes (I know this is confusing, since in biology a class is a taxonomic level, but we will ignore that for now) - see the short sketch just below this list for how these map onto our dataframe
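To make those last two terms concrete, here's a minimal sketch of how they map onto the dataframe we already have (just an illustration; we won't need these variables later):

# the features are the measurement columns
features = df[['bill_length_mm', 'flipper_length_mm']]

# the classes are the species names that we want to predict
classes = df['species']

features.head(), classes.unique()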
It might be interesting at this point to note that although we've implemented this function to work specifically for our dataset, the general approach - to predict the class of a new data point, count the classes of the closest points and pick the class that appears most often - seems like it would work for other pairs of measurements. And if we can figure out a way to calculate distances between points in more than two dimensions, it might even work for data points with more than two measurements. We will come back to this idea later in this series. For now, back to the function that is in front of us.
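As a quick preview of that idea (we won't need it for the rest of this part), the Euclidean distance calculation we're already using extends naturally to any number of measurements: square the difference for each measurement, add them up, and take the square root. Here's a rough sketch, using a hypothetical distance function and made-up measurement values:

# a sketch of the same distance calculation for any number of measurements
def distance(point_a, point_b):
    differences = np.array(point_a) - np.array(point_b)
    return np.sqrt(np.sum(differences ** 2))

# two measurements, just like our existing code
print(distance([45, 211], [44, 210]))

# three measurements (e.g. if we also included body mass) work just as well
print(distance([45, 211, 4500], [44, 210, 4300]))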
The most interesting bit of terminology is the name of the algorithm that we have implemented in the function - K nearest neighbors. I will use the American spelling for neighbors, as that's how the function name is spelled in various machine learning libraries.
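As an aside (and jumping ahead of ourselves a little), this is the algorithm that scikit-learn provides as KNeighborsClassifier. A minimal sketch of the same prediction using that library might look something like this, though we'll stick with our own function for the rest of this part:

from sklearn.neighbors import KNeighborsClassifier

# fit a classifier on the features and classes from our dataframe
classifier = KNeighborsClassifier(n_neighbors=10)
classifier.fit(df[['bill_length_mm', 'flipper_length_mm']], df['species'])

# predict the species of the same new penguin as before
classifier.predict(pd.DataFrame(
    [[45, 211]],
    columns=['bill_length_mm', 'flipper_length_mm']
))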
Choosing parameters
The K bit of the name is because there is a parameter that we can tweak in this algorithm: the number of closest points that we want to consider when making our prediction. In the version of the function above we have used 10, but that's an arbitrary choice, and we can easily rewrite the function so that k is an argument:
def guess_species(bill_length, flipper_length, k):

    # calculate distances and add to the dataframe
    flipper_length_difference = df['flipper_length_mm'] - flipper_length
    bill_length_difference = df['bill_length_mm'] - bill_length
    overall_distance = np.sqrt(
        flipper_length_difference ** 2
        + bill_length_difference ** 2
    )
    df['distance to new point'] = overall_distance

    # find closest points and calculate most common species
    most_common_species = (
        df.sort_values('distance to new point')
        .head(k)
        ['species']
        .mode()[0]
    )

    # the most common species is our guess
    return most_common_species
We can now fairly easily find sets of measurements for which we get different predictions depending on which value of k we pick:
print("k=2 : ", guess_species(44, 210, 2))
print("k=5 : ", guess_species(44, 210, 5))
This raises an obvious question: which value of k should we pick to get the most accurate results from our classifier function? To answer it, we need to look in a bit more detail at how to score a classifier. We have already seen one way to do this in part two: we wrote some code to run the prediction function on all of the real-life penguin measurements, then compare the predictions to the true species and ask what fraction of the time they matched:
# generate predicted species for k=10
predictions = df.apply(
    lambda p: guess_species(p['bill_length_mm'], p['flipper_length_mm'], 10),
    axis=1
)

# compare predictions to true species and get the fraction that match
(predictions == df['species']).value_counts(normalize=True)[True]
Chapter 13 of the Biological Data Exploration book has more details on how apply works.
So it seems like a trivial matter to try this for a bunch of different values of k and just pick the value that gives the highest score:
for k in range(1, 10):

    # generate predicted species
    predictions = df.apply(
        lambda p: guess_species(p['bill_length_mm'], p['flipper_length_mm'], k),
        axis=1
    )

    # compare predictions to true species and get the fraction that match
    score = (predictions == df['species']).value_counts(normalize=True)[True]
    print(k, score)
However, under this scoring scheme, a k value of 1 will always give a perfect score. It's easy to see why when we think about the test data: because each point we are testing is actually in the dataset, the closest neighbor will always be the point itself. So by setting k to 1, the only point that will be used in the prediction will be the same one we are testing. In other words, we are testing the function by measuring how well it can predict the classes of data points that it has already seen.
In machine learning, this is a big mistake. After all, the whole goal of building a classifier is to get reliable predictions for new points (in our case, new penguins). So to fairly measure the performance, we should test the function by using it to generate predictions for measurements where we know the correct species, but which are not in the training set.
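To see that self-matching behavior directly, here's a quick check using the function we just wrote: take a penguin that is already in the dataframe and "predict" its species with k set to 1.

# pick a penguin that is already in the dataset
existing_penguin = df.iloc[0]

# predict its species using only its single nearest neighbor
prediction = guess_species(
    existing_penguin['bill_length_mm'],
    existing_penguin['flipper_length_mm'],
    1
)

# the closest point is the penguin itself (distance zero),
# so we get its own species back
prediction == existing_penguin['species']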
Training and testing
The easiest way to do this is to randomly assign each penguin (i.e. each row in our dataframe) to be used either for training or testing, but not both. One way to do this is by adding a new column to the dataframe, which will be 'training' 80% of the time and 'testing' 20% of the time. numpy can do most of the work for us with its random.choice function:
df['assignment'] = np.random.choice(
    ['training', 'testing'],  # these are the items to randomly choose
    333,                      # this is the number of choices we want (one per row in the dataframe)
    p=[0.8, 0.2]              # these are the probabilities of picking each item
)

df
and if we do value_counts on the resulting column we can see the actual numbers:
df['assignment'].value_counts()
Since the assignments are generated randomly we'll get slightly different absolute numbers each time we run the code, but the split will always be roughly 80%/20%. These percentages are arbitrary, of course; we could instead choose to use 10% or 30% of the data for testing.
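Incidentally, if you want to get exactly the same random split every time you run the code (useful if you're following along and comparing numbers), one option is to use a seeded random number generator; the seed value here is an arbitrary choice:

# a reproducible version of the same assignment, using a fixed seed
rng = np.random.default_rng(42)
df['assignment'] = rng.choice(
    ['training', 'testing'],
    len(df),      # one choice per row
    p=[0.8, 0.2]
)

df['assignment'].value_counts()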
Now we can use the new column to split the penguins up into a training dataset and a testing dataset, which we can store in variables for convenience:
training_data = df[df['assignment'] == 'training']
test_data = df[df['assignment'] == 'testing']
len(training_data), len(test_data)
These new dataframes are simply subsets of the original rows. Now we are prepared to do a fairer test of the function. We'll replace the reference to df with a reference to training_data, which will make sure that the function can only "see" the species of the points in the training set:
def guess_species(bill_length, flipper_length, k):

    # calculate distances and add to the dataframe
    flipper_length_difference = training_data['flipper_length_mm'] - flipper_length
    bill_length_difference = training_data['bill_length_mm'] - bill_length
    overall_distance = np.sqrt(
        flipper_length_difference ** 2
        + bill_length_difference ** 2
    )
    training_data['distance to new point'] = overall_distance

    # find closest points and calculate most common species
    most_common_species = (
        training_data.sort_values('distance to new point')
        .head(k)
        ['species']
        .mode()[0]
    )

    # the most common species is our guess
    return most_common_species
And when we run our testing code, we will only use the test data:
# generate predicted species for k=10
predictions = test_data.apply(
    lambda p: guess_species(p['bill_length_mm'], p['flipper_length_mm'], 10),
    axis=1
)

# compare predictions to true species and get the fraction that match
(predictions == test_data['species']).value_counts(normalize=True)[True]
Take a look at chapter 3 of the Biological Data Exploration book for a bit more on value_counts.
In this version of the test, the training data and testing data are completely separate - we use one set of data points to train the classifier (i.e. to find the nearest neighbors) and an entirely different set of data points to test it. In other words, we are now asking how well the function can predict the species of a penguin it has never seen before, which is much closer to how the classifier would be used in the real world.
Let's now put the new testing code in a loop, and calculate a score for each value of k from 1 to 9:
for k in range(1, 10):

    # generate predicted species
    predictions = test_data.apply(
        lambda p: guess_species(p['bill_length_mm'], p['flipper_length_mm'], k),
        axis=1
    )

    # compare predictions to true species and get the fraction that match
    score = (predictions == test_data['species']).value_counts(normalize=True)[True]
    print(k, score)
Finally, it seems like we have a clear answer: for this particular problem, we get the most accurate predictions with k=5 (i.e. by considering the five closest points when trying to classify a new point), with worse predictions if we make k either higher or lower.
Unfortunately, it's a bit more complicated than that. The result we have obtained is only true for the specific split of data into training and testing subsets. Remember: there's nothing special about the particular split we have obtained here - we got it by randomly assigning each row to either the training or test data. To illustrate this, look what happens when we repeat the splitting process ten times and then calculate the score for each different random split. Notice that in this loop we are keeping k fixed at 10, and just doing the same experiment repeatedly:
for _ in range(10):

    # randomly assign the rows to either training or test
    df['assignment'] = np.random.choice(['training', 'testing'], 333, p=[0.8, 0.2])
    training_data = df[df['assignment'] == 'training']
    test_data = df[df['assignment'] == 'testing']

    # calculate score
    predictions = test_data.apply(
        lambda p: guess_species(p['bill_length_mm'], p['flipper_length_mm'], 10),
        axis=1
    )
    score = (predictions == test_data['species']).value_counts(normalize=True)[True]
    print(score)
The scores vary a lot, from a near-perfect 97% accuracy to a much less impressive 91% accuracy. Thinking about it logically, we can see why this must be the case. In some iterations, due to random chance, we will get a training set and test set that are very similar, in which case the function will perform well. But in some iterations, the training and test sets will be very different, leading to a lower score. If we are very unlucky, we might get a split where most of the Gentoo penguins end up in the training set, but most of the Adelie penguins end up in the test set. Due to the simple species-counting logic that our function uses, it will very rarely predict Adelie.
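One simple way to put a number on this spread is to collect the scores from the repeated splits into a list and look at their mean and standard deviation; we'll see more principled ways of dealing with this randomness in the next part, so treat this as a rough sketch:

# repeat the random split ten times, keeping the scores this time
scores = []
for _ in range(10):
    df['assignment'] = np.random.choice(['training', 'testing'], 333, p=[0.8, 0.2])
    training_data = df[df['assignment'] == 'training']
    test_data = df[df['assignment'] == 'testing']
    predictions = test_data.apply(
        lambda p: guess_species(p['bill_length_mm'], p['flipper_length_mm'], 10),
        axis=1
    )
    scores.append(
        (predictions == test_data['species']).value_counts(normalize=True)[True]
    )

# the mean gives a more stable overall estimate of performance,
# and the standard deviation tells us how much it varies between splits
np.mean(scores), np.std(scores)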
Another way to visualise this randomness is to draw multiple prediction charts for different random selections of training data. We can use the code from the previous part, and put it in a loop:
# ignore some pandas warnings that we don't care about
pd.options.mode.chained_assignment = None

# make 10 different prediction grids with the same value of k
bill_lengths = np.arange(35, 60, 1)
flipper_lengths = np.arange(170, 230, 1)

data = []
for iteration in range(10):

    # randomly assign the rows to either training or test
    df['assignment'] = np.random.choice(['training', 'testing'], 333, p=[0.8, 0.2])
    training_data = df[df['assignment'] == 'training']
    test_data = df[df['assignment'] == 'testing']

    for bill_length in bill_lengths:
        for flipper_length in flipper_lengths:
            guess = guess_species(bill_length, flipper_length, 3)
            data.append((bill_length, flipper_length, guess, iteration))

data = pd.DataFrame(data, columns=['bill_length', 'flipper_length', 'prediction', 'iteration'])

# plot the predictions on multiple charts
sns.relplot(
    data = data,
    x = 'bill_length',
    y = 'flipper_length',
    hue = 'prediction',
    hue_order = ['Adelie', 'Gentoo', 'Chinstrap'],
    col = 'iteration',
    col_wrap = 5,
    height = 2,
    edgecolor = 'none'
)
Here I've used k=3 to make the effect easier to see in small charts. Notice how the outlines of the prediction regions are slightly different in each iteration, due to the random sampling of data points in the training set, even though we are keeping k exactly the same.
Chapter 8 of the Biological Data Exploration book has details of how to create this kind of small multiple plot.
We will finish this part of the series here. In the next part, we'll pick up the idea of scoring a classifier and look at a few ways to get round the problem of randomness. If you don't want to miss it, sign up for the Python for Biologists newsletter just below this article.