Machine learning for biology training course
Overview
Recent advances in deep neural networks have led to increased interest among researchers in both cutting-edge deep learning models and classical machine learning techniques, which have historically been underused in biology. In this course we will start with the very basics of machine learning and, using biological examples throughout, use code and visualisations to explore all the components that go into making practical use of ML techniques for research. We will discuss how the various algorithms work at a high level rather than going into mathematical detail, and will focus on the practicalities of ML work: how to gather, clean, and preprocess datasets; how to choose between different ML tools; how to score and evaluate models; and how to take advantage of pre-trained models for deep learning tasks.
At the end of the course, participants will have built and trained several different types of model from scratch and be confident applying ML techniques to their own research problems.
Who the course is suitable for
The course is aimed at researchers at any career stage who have little or no experience of machine learning and want to learn how to apply ML to research problems. Some Python knowledge is helpful but not strictly required; we will use Python for all the code examples, but Python itself is not the focus of the course and we will rely on plenty of visualisations to help explain what the code is doing. Similarly, a little familiarity with the pandas/matplotlib/seaborn stack is useful but not necessary for following the code behind some of the data processing steps. We will make use of example datasets that should be easily understood by anyone with a background in biology. If anyone is unsure about the suitability of the course I am more than happy to chat directly - just drop me an email at martin@pythonforbiologists.com.
Brief syllabus for sessions
- introduction, AI/ML/deep learning, classification/regression/clustering, data/hardware/software requirements, training/inference, ML workflows, laptop setup
- classification, scoring, visualisation, confusion matrices, K nearest neighbours, parameter searching, test/training data, cross validation
- sklearn/numpy/pandas data model, KNN in sklearn, stratified splitting, balanced/unbalanced data, sorted data, extra features, scaling, under/over fitting, feature selection
- binary classification problems, scoring recall/sensitivity/specificity, encoding categorical data, decision trees, support vector machines, interpretability
- regression, scoring and visualising, classification vs. regression view, feature count and interpretability, feature extraction and high-dimensional data
- unsupervised learning, clustering, homogeneity/completeness, cluster parameters, dimension reduction, PCA, scoring, dimension reduction in ML workflows
- artificial neural networks, perceptron model, weights/training/inference, backpropagation, ANN architecture, convergence/training loss, complexity/interpretability
- deep neural networks, convolutions/recurrence/transformers, embeddings, data/hardware requirements, deploying pre-trained models, fine-tuning existing models
- ML workshop (sessions 9 & 10), complete workflows, data scraping/cleaning/merging, students’ own datasets, case studies with evaluation
Detailed syllabus for sessions
1 - Intro, history, background and environment
In the first session we cover some background:
- the relationship between machine learning and AI
- classical machine learning vs. deep learning
- the essential concept of learning from data
We discuss some of the ways that we can organise ML approaches:
- supervised vs. unsupervised
- regression vs classification
and start to consider the spectrum between simple and complicated methods.
We point out how a number of properties of ML methods:
- parameter count
- computational requirements
- data requirements and interpretability
tend to scale together.
Next, we turn to the practicalities of using ML methods for research - how datasets will be obtained, the difference between training and inference, the use of pre-training and fine-tuning. This leads to an overview of universal issues when using ML - how to score and evaluate models, how to choose between them, how to visualise their behaviour, and how feature engineering and selection fit into an ML workflow.
We’ll use a chunk of time in this first session to describe the setup/environment - the datasets, packages, programming environment etc. - and make sure that everyone’s computer is set up for smoothly running the code and exercises for the rest of the course.
2 - Core concepts of classification
In this session we dive straight into a simple one-feature classification problem. We start off by writing a manual classifier, which lets us get used to the core concepts of features and classes. We can use this very simple example to address two of the most important questions for understanding ML models:
- how can we score them, and
- how can we visualise their behaviour
We also look at the concept of a confusion matrix - this will be important later when talking about different scoring metrics.
Taking a detailed look at our manual classifier reveals that this approach will not scale for a number of reasons, so we give an intuitive explanation of the K-Nearest-Neighbours algorithm. Pandas allows us to write a simple implementation, and we can use the tools we’ve already built to contrast this with our manual classifier. Now that we have a parameterised algorithm we can discuss the idea of systematic parameter searching that will form the basis of training more complex models.
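Since the session builds this up in code, here is a minimal sketch of what a one-feature KNN classifier can look like using pandas alone; the column names and values are hypothetical stand-ins for the session’s dataset.

```python
import pandas as pd

# hypothetical training data: one numeric feature and a class label
train = pd.DataFrame({
    "length":  [1.2, 1.5, 3.1, 3.4, 5.0, 5.2],
    "species": ["A", "A", "B", "B", "C", "C"],
})

def knn_predict(value, k=3):
    # distance from the new observation to every training point
    distances = (train["length"] - value).abs()
    # take the k closest points and return their most common class
    nearest = train.loc[distances.nsmallest(k).index, "species"]
    return nearest.mode()[0]

print(knn_predict(3.0))  # -> "B"
```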
At this point we can cover in detail an incredibly important point: the division of data into training/test sets - why this is necessary and how to do it. We also introduce the idea of cross validation.
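A sketch of the train/test division in the same spirit, using only pandas; the tiny DataFrame is again a hypothetical stand-in.

```python
import pandas as pd

# any labelled dataset will do; this one is purely illustrative
df = pd.DataFrame({"length":  [1.2, 1.5, 3.1, 3.4, 5.0, 5.2, 2.0, 4.1],
                   "species": ["A", "A", "B", "B", "C", "C", "A", "B"]})

# shuffle the rows so any ordering in the file can't leak into the split
shuffled = df.sample(frac=1, random_state=42)

# hold back a quarter of the rows as a test set, train on the rest
n_test = len(shuffled) // 4
test_set, train_set = shuffled.iloc[:n_test], shuffled.iloc[n_test:]
```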
Exercise: building a KNN classifier for a new dataset, including parameter optimisation, visualisation and scoring. At the end, having two working examples lets us pinpoint what is common to all classification problems and get an intuitive sense of how this will apply to more complicated models.
3 - Sklearn and adding features
In this session we have two main goals:
- explaining the architecture of the sklearn package
- getting started with the idea of feature engineering
We start by quickly recapping the role of sklearn/numpy/pandas/etc in the Python ML ecosystem, before diving into the practicalities of how to use sklearn. Talking a little bit about the data model, how models are represented, etc. now will save a lot of time later on. This allows us to quickly reproduce the workflow from the previous session with a fraction of the code.
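To give a flavour of how compact the workflow becomes, here is a minimal sketch of the session 2 KNN workflow in sklearn, with the built-in iris dataset standing in for the course data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# the built-in iris dataset stands in for the course data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# the whole KNN workflow from session 2, in three lines
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on unseen data
```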
We can point out explicitly how using existing implementations of ML algorithms lets us focus on higher-level concerns, and briefly cover a few ideas and best practices that are more easily explained using sklearn (see the sketch after this list):
- stratified splitting
- balanced/unbalanced datasets
- potential pitfalls of sorted data
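A minimal sketch of the first and last points, assuming scikit-learn’s built-in iris data as a stand-in (its rows happen to be stored sorted by class):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # the iris rows are stored sorted by class

# shuffling (the default) protects against the sorted input order, and
# stratify=y keeps the class proportions identical in the two splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
```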
We can now do something that would be hard with our manual implementation but is very easy with sklearn: introduce extra features to our classification problem. We use the visualisation approaches that we’ve already learned to see the effect of additional features, and illustrate the importance of scaling - another aspect of feature engineering. This is a natural time to mention a few other types of feature engineering.
Now that we have a two-dimensional classifier, we can really investigate the effect of different parameters on the classifier’s behaviour, and see how this fits into the universal issue of over/under fitting. We finish on the idea of feature selection: given a dataset with many potential features, how do we scalably pick useful ones? There are some intuitive approaches like sequential selection, and some univariate ones that leverage existing statistical knowledge.
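Here is a sketch of how scaling and univariate selection can fit together in a single pipeline, using the built-in breast cancer dataset as a stand-in; the choice of k=10 is an illustrative assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)  # 30 candidate features

# scale first, then keep the 10 features with the strongest univariate signal
model = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=10),
    KNeighborsClassifier(n_neighbors=5),
)
print(cross_val_score(model, X, y).mean())
```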
Exercise: take a new dataset with many features and experiment with feature selection to optimise a classification model, paying attention to scaling and overfitting.
4 - Binary classification and new models
In this session we are going to cover two main topics: binary classification (as opposed to the multiclass problems we have been tackling so far) and some new algorithms. First, we will introduce binary classification as a particularly common and useful type of classification tool. Contrasting it with the kind of classification problem we’ve been looking at so far, we see that it makes some aspects of our workflow easier, and some (particularly scoring) harder.
We can refer back to our previous discussion of confusion matrices to explain recall/sensitivity/specificity/true positives/negatives and the trade-offs between them. At this point we can also discuss how to represent categorical data as features. The subtleties of different types of encoding (ordinal, one-hot, etc) and the practicalities of how to create them are another chance to touch on the importance of feature engineering to avoid bias.
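As a quick illustration of the two most common encodings, here is a sketch on a made-up categorical column; the habitat categories are purely illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

habitat = pd.DataFrame({"habitat": ["marine", "freshwater", "soil", "marine"]})

# ordinal encoding: a single integer column, which implies an (often spurious) order
print(OrdinalEncoder().fit_transform(habitat))

# one-hot encoding: one binary column per category, no implied order
print(pd.get_dummies(habitat["habitat"]))
```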
We will also go into a fair amount of detail on two completely new algorithms:
- support vector machines (SVM)
- decision trees
With three entirely different classification algorithms in front of us we can use visualisation tools to compare their behaviour and to think about their interpretability. We can also start some benchmarking to directly measure their computational requirements, and some scoring to explicitly compare them and begin to answer an overarching question of many ML projects: how do we choose which type of model to use?
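As an illustration of that comparison, here is a minimal sketch that scores the three classifiers on the same cross-validation folds, with the built-in iris data again standing in for the session’s dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

models = {
    "KNN": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
}
# same data, same folds, so the scores are directly comparable
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```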
Exercise: using a complex dataset with mixed data types, carry out preprocessing with feature engineering/selection then compare KNN/decision tree/SVM. This exercise will show how different criteria for success in terms of recall/precision will sometimes lead to different model choices.
5 - Regression
Having spent a good deal of time discussing and solving classification problems, we now turn our attention to the other common type of ML problem: regression. After introducing a simple example, we can point out similarities to and differences with classification, particularly with regard to our practical workflow. Visualisation and scoring will be very different, but many ideas around feature engineering/selection and parameter searching will be the same.
Despite the classification/regression dichotomy, a few examples will make it clear that many problems can be stated in both regression and classification terms, and that many algorithms work for both with slight tweaks. The behaviour of regression models with categorical features leads to particularly interesting visualisations which allow us to check our intuitive understanding of how the algorithms work. We look at the effect of feature count on interpretability, and think about another way to organise ML methods based on whether prediction changes are linear or stepwise.
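A sketch of the minimal regression workflow, mirroring the classification one; the built-in diabetes dataset is a convenient stand-in, and score() now reports R² rather than accuracy.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# the same fit/score workflow as classification, but score() now reports R²
for model in (LinearRegression(), KNeighborsRegressor()):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```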
At this point we will introduce a new, large, unstructured dataset to use as an example of feature extraction, for which domain-specific knowledge will be useful. This example demonstrates how it’s surprisingly easy to end up with many thousands of features, and allows us to recap ideas from session 3: which approaches to feature selection are viable for such large numbers.
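One hypothetical illustration of how quickly feature counts explode with unstructured sequence data: counting k-mers turns a handful of sequences into hundreds or thousands of numeric features. This is just a sketch, not the dataset used in the session.

```python
from itertools import product
import pandas as pd

# hypothetical unstructured input: a handful of DNA sequences
sequences = ["ATGCGTACGT", "TTGCGTACGA", "ATGAATACGT"]

k = 4  # 4-mers give 4**4 = 256 features; k=6 would already give 4096
kmers = ["".join(p) for p in product("ACGT", repeat=k)]

# one row per sequence, one column per possible k-mer
counts = pd.DataFrame(
    [[seq.count(kmer) for kmer in kmers] for seq in sequences],
    columns=kmers,
)
print(counts.shape)  # (3, 256)
```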
Exercise: creating features from scratch using an unstructured dataset, finding ones that have a strong predictive value, and checking that results make intuitive sense. We will get comfortable using ML techniques to identify patterns that are impossible to see with simple visualisation, and with interpreting a confusion matrix with many classes.
6 - Unsupervised learning
This session covers the use of a few tools that are widely used in science, but often not regarded as machine learning.
By waiting until we have gone over other types of problem in depth, we can more easily point out the analogies between clustering and classification:
- the same tension between over- and under-fitting
- the choice and use of scoring criteria
- the process of finding good parameters
We can also identify the fundamental differences when working with unlabelled data, and the proxy measurements we have when lacking a ground truth. We cover a few common clustering algorithms and, as usual, rely on visualisation to get an intuitive sense of how they behave. When discussing scoring we can draw out the analogy between precision/recall from binary classification and homogeneity/completeness in clustering. A particular source of trouble in clustering is when the number of clusters becomes part of the parameter search - the extremes of this approach can be easily seen once we have discussed homogeneity and completeness.
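A sketch of how that scoring looks in practice, clustering the built-in iris data and scoring against the held-back labels; pushing the number of clusters up or down shows the homogeneity/completeness trade-off directly.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score, completeness_score

X, y = load_iris(return_X_y=True)

# the labels y are never shown to the clustering algorithm,
# only used afterwards to score the result
for n_clusters in (2, 3, 10):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    print(n_clusters,
          homogeneity_score(y, labels),
          completeness_score(y, labels))
```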
We follow up with a survey of dimensionality reduction, its role in visualisation, and its slightly different role in larger ML workflows as an alternative to feature extraction. We will explicitly point out a few common misconceptions about PCA:
- although PCA plots are often coloured by class label, the labels are not used in the algorithm itself
- there is nothing special about 2 dimensions
This is a natural place to point to some methods that do make use of labelled data.
We finish up with an explanation of how the amount of dimensionality reduction is itself a parameter that can be tuned for model performance. For both parts of this session we will make use of datasets that we are already familiar with - a great shortcut to understanding - and focus more on the intuitive explanation of the algorithms than their mathematics and implementation.
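A minimal sketch of that last idea, assuming a scaler + PCA + KNN pipeline: the number of retained components goes into the grid search like any other parameter.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])
# treat the number of PCA dimensions as a tunable parameter
search = GridSearchCV(pipe, {"pca__n_components": [2, 5, 10, 20],
                             "knn__n_neighbors": [3, 5, 9]})
search.fit(X, y)
print(search.best_params_, search.best_score_)
```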
Exercise: take a novel unlabelled dataset and try to reconstruct clusters using various algorithms, before using the labels to evaluate the clusterings after the fact. Use PCA to preprocess our very high-dimensional dataset from the previous session and investigate the trade-off between accuracy and computational complexity with different numbers of dimensions.
7 - Artificial neural networks
We start this session with a description (and very simple code implementation) of artificial neurons (perceptrons) then quickly move on to making single-layer neural networks using sklearn. We experiment with various different ways of assigning weights and running inference with them.
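As a hint at how little machinery is involved at this stage, here is a minimal sketch of a single artificial neuron with hand-picked weights, before any training is involved.

```python
import numpy as np

def neuron(inputs, weights, bias):
    # weighted sum of the inputs, passed through a step activation
    return int(np.dot(inputs, weights) + bias > 0)

# weights chosen by hand: this neuron fires when the second input dominates
print(neuron([1.0, 0.2], weights=[-1.0, 2.0], bias=0.0))  # 0
print(neuron([0.2, 1.0], weights=[-1.0, 2.0], bias=0.0))  # 1
```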
Once we have an intuitive grasp of the behaviour, we are ready to look at a high-level overview of the backpropagation algorithm and how it determines the way that ANNs are trained, covering concepts with confusing names like:
- epochs
- batches
- iterations/learning rate
- convergence
- training loss
Playing about with even a simple neural network will make it clear that these models have far more parameters than anything else we have seen so far, and that strategies for parameter searching are of paramount importance.
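To anchor those names to something concrete, here is a sketch of where they appear among the parameters of scikit-learn’s MLPClassifier; the specific values are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10,),  # architecture
                  learning_rate_init=0.01,   # learning rate
                  batch_size=32,             # batch size
                  max_iter=500,              # cap on training iterations
                  random_state=0),
)
model.fit(X, y)
# the loss curve shows convergence (or the lack of it) during training
print(model.named_steps["mlpclassifier"].loss_curve_[-5:])
```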
With some simple visualisation tools, we can explore the effect of neural network architecture, and see how this allows for customisation of ANNs to a degree not seen in other models. We can now lay out the great divide in the way that ANNs are used in scientific projects. Some projects will involve designing and training neural networks from scratch, an endeavour that requires very large datasets, specialised hardware and dedicated libraries that use extensive parallelism (i.e. not sklearn).
Alternatively, some will use existing huge, complex pre-trained models that would be impossible for a researcher to design/train, but which can be fine-tuned with modest datasets and modest hardware.
Exercise: define, train and evaluate a NN model for a new, complex scientific dataset, exploring the effect of different architectures. This will be complex enough to make clear the point that we sacrifice a great deal of interpretability for the power of even single-layer neural networks.
8 - Pre-trained models and fine-tuning
In this session we look at the various ideas and components behind the current crop of very large, deep neural networks. These break down into a few categories. Clever architectural features:
- convolutional networks
- recurrent neural networks
- transformers
will take the simple networks from the previous session and customise them for particular tasks. Also key is the idea of embeddings, a kind of feature engineering in which complex inputs are represented as positions in very high dimensional space. Lastly are the contributions of sheer scale - truly enormous datasets and the incredible hardware required to use them.
These ideas are all explored at a very high level, since the implementation is irrelevant to researchers wanting to actually use them. We spend more time discussing the practical side: since we will not be building or training these models from scratch, it’s more important to know how to download and employ existing models.
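As a taste of how little code deploying an existing model can involve, here is a sketch using the Hugging Face transformers pipeline API; the checkpoint name and image path are illustrative assumptions, and `pip install transformers torch pillow` is needed first.

```python
from transformers import pipeline

# downloads a pre-trained image classifier the first time it runs;
# the checkpoint name is one plausible choice among many
classifier = pipeline("image-classification",
                      model="google/vit-base-patch16-224")

# any local image file would do here; the filename is a placeholder
print(classifier("example_micrograph.jpg")[:3])  # top three predicted labels
```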
We’ll next look at a few deep learning models that are plausibly useful for scientific work. We will walk through the code and dependencies necessary to download and run a couple of different examples, pointing out the similarities with the much simpler models that we have discussed in earlier sessions. This is a good point at which to explicitly lay out what tasks we are likely to spend our time doing for scientific ML projects:
- assembling/cleaning data
- feature engineering and selection
- comparing models
- parameter searching
- model evaluation
Exercise: download and deploy a pre-trained image model, then fine-tune it on a biological dataset. How many examples are required for good performance? Experiment with generating synthetic randomised data and measure the effect on model evaluation.
9 & 10 - ML workshop
The last two sessions are set aside for students to work on complete ML workflows, involving:
- data gathering/merging/cleaning
- feature extraction/engineering
- feature selection
- model selection
- parameter searching and model evaluation
With real-world datasets we will likely have to write custom code to e.g. scrape data from websites, merge multiple datasets, clean and filter human-curated data files, etc. Students are encouraged to bring their own datasets, but suitable examples can easily be sourced for those who do not have them yet.
Depending on students’ particular interests, either the students or the trainer will present some case studies at the very end of the course, showing the application of tools and ideas from the course to real scientific problems.