# 11 ML model selection with MESS simulations

## 11.1 Key questions

- What do I **do** with all these MESS simulations?
- How do I perform ML inference using MESS simulations in R?

## 11.2 Lesson objectives

After this lesson, learners should be able to…

- Use MESS simulations and RandomForest ML to perform community assembly model selection (classification)
- Evaluate uncertainty in classification accuracy using confusion matrices
- Apply ML classification to empirical data and interpret the results in terms of the biological story
- Brainstorm applications to other systems/empirical datasets

## 11.3 Planned exercises

- Lecture: How does ML work with MESS simulations?
- Motivating community assembly model classification
- Get the MESS simulations and do some set-up
- Overview of major steps of machine learning process
- Implement ML assembly model classification
- Hands-on Exercise: Predicting `assembly_model` of mystery simulations

### 11.3.1 Lecture: How does ML work with MESS simulations?


### 11.3.2 Motivating community assembly model classification

The first step is now to assess the model of community assembly that best fits the data. The three models are `neutral`, in which all individuals are ecologically equivalent; `competition`, in which species have traits, and differential survival probability is weighted by the distance of traits from the trait mean in the local community (closer to the trait mean == higher probability of death); and `filtering`, in which there is an environmental optimum, and proximity of trait values to this optimum is positively correlated with survival probability.

Basically, we want to know: are individuals in the local community ecologically equivalent, and if not, are they interacting more with each other or more with the local environment? To answer this, we are going to use **simulation-based supervised machine learning**, which is what the rest of this lesson is all about.

### 11.3.3 Run MESS simulations

OK, so now we know what we want to know: are the “data” (the Hill number summary statistics for abundance, traits, and genetics) sufficient to distinguish among community assembly models? How do we go about evaluating this question? Well, the first step is to generate a large number of simulations under each of the three assembly models, so that we have a representative sample of the patterns in the data characteristic of each assembly model.

**Coding exercise:** Create a new `MESS$Core` object and use a `for` loop to run 2 simulations for each of the three community assembly models: `neutral`, `competition`, and `filtering`. If you get stuck, a sketch of one possible solution is shown below.
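Note that this sketch only shows the general shape of the exercise: the `set_param()` and `run()` method names and arguments are assumptions for illustration, so check the MESS documentation for the exact interface.

```
## A minimal sketch; `set_param()` and `run()` are assumed method names and
## may differ in your MESS installation; consult the package documentation.
core <- MESS$Core("model-selection-sims")

for (model in c("neutral", "competition", "filtering")) {
  core$set_param("assembly_model", model)  # hypothetical parameter setter
  core$run(sims = 2)                       # hypothetical: run 2 sims per model
}
```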

#### 11.3.3.1 Download the pre-baked simulations

Since it can take quite some time to run a number of simulations sufficient for model selection and parameter estimation, we will use a suite of pre-baked simulations generated ahead of time. We want to practice good organizational habits for our simulations and .R scripts, so as a challenge, before downloading the simulations file, **make a new directory in your home directory called “MESS-inference” and change directory into it.**

Once you are in the `/home/rstudio/MESS-inference` directory, you may fetch the simulations from the workshop site using `wget`:

```
wget https://github.com/role-model/process-models-workshop/raw/main/data/MESS-SIMOUT.csv.gz
gunzip MESS-SIMOUT.csv.gz
wc -l MESS-SIMOUT.csv
```

```
MESS-SIMOUT.csv.gz  100%[====================>]  35.24M  1.12MB/s    in 32s
2023-06-15 14:31:54 (1.12 MB/s) - ‘MESS-SIMOUT.csv.gz’ saved [36954693/36954693]

3000 MESS-SIMOUT.csv
```

### 11.3.4 Implement ML assembly model classification

#### 11.3.4.1 Overview of major steps of simulation-based machine learning process

Before we dive in, it may be useful to have a roadmap of what we’ll be doing. The *major* steps of implementing simulation-based machine learning are:

- Load the simulations and do some set-up
- Transform the simulations
- Train the ML model
- Evaluate how well it trained
- Apply it to real data
- Clean up and you’re done

#### 11.3.4.2 **Do some set-up**

Looking in the “Files” tab, double click on the new “MESS-inference” folder and you should see your new “MESS-SIMOUT.csv.gz” file. Now you can create a “New Blank File->Rscript” in this directory by clicking on the small green plus. When prompted to enter a file name, use “MESS-classification.R”.

Now we will install a couple of necessary packages: `randomForest` and `caret`, two machine learning packages for R.

```
## install.packages() only has to happen ONCE within your R environment. If you
## create a new file you don't need to run install.packages() again; you can
## skip straight to loading the libraries.
install.packages("randomForest")
install.packages(pkgs = "caret", dependencies = c("Depends", "Imports"))
```

```
library(randomForest)
library(caret)
## Make sure you're _in_ the MESS-inference working directory
setwd("/home/rstudio/MESS-inference")
```

#### 11.3.4.3 **Load the pre-baked simulations**

Simulation results are composed of ‘parameters’ and ‘data’: parameters are what go into the simulation model, and data are what come out.

Previously we used `core$load_sims()` to load the simulations generated by a specific MESS `Core` object, but here we have a *file* that we got from the internet. There was a `Core` object that created it somewhere, but we don’t have access to it. Luckily, there is a utility function (`MESS$load_local_sims()`) which allows us to load data from an external file. Let’s use it now:

`simdata <- MESS$load_local_sims("MESS-SIMOUT.csv")[[1]]`

The simulated data is in the same form we have already looked at previously, so we don’t need to inspect it in detail again; there’s just a lot more of it.
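If you would like a quick sanity check that the load worked, you can peek at the dimensions and the first few column names (this assumes `simdata` behaves as an ordinary data.frame, which is how we treat it below):

```
## Quick sanity check: roughly 3000 rows, one per simulation
dim(simdata)
head(colnames(simdata))
```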

#### 11.3.4.4 **Transform the simulations**

In some cases parameters/data must be reshaped in some way to make it more appropriate for the ML approach.

One constraint of the `randomForest` classification process is that the “target” variable (or “response” variable) must be an R `factor`, which is essentially a categorical variable. We can convert the `assembly_model` column in our `simdata` data.frame to a factor, and then count the number of elements in each `level` with `table()`.

```
simdata$assembly_model <- as.factor(simdata$assembly_model)
table(simdata$assembly_model)
```

```
competition   filtering     neutral 
       1000        1000         999 
```

#### 11.3.4.5 **Split the data into training/testing sets**

If you train a machine learning model with **all** of your data, then you won’t have any way to really evaluate its performance. It’s very common to split data into ‘training’ and ‘testing’ sets, where the training data is used to train the model and the testing data is used to evaluate how well it performs.

```
## 70/30 train/test split
tmp <- sample(2, nrow(simdata), replace = TRUE, prob = c(0.7, 0.3))
train <- simdata[tmp==1,]
test <- simdata[tmp==2,]
```
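Note that `sample()` draws at random, so your split (and the exact numbers in the output below) will differ slightly from ours. If you want a reproducible split, you can fix R's random seed before sampling, for example:

```
## Optional: fix the random seed so the train/test split is reproducible
set.seed(42)
```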

#### 11.3.4.6 **Train the model on the training set**

The goal here is to learn a mapping between patterns in the data and the community assembly model.

The `randomForest()` function takes two important arguments:

- The `data` upon which to operate, in this case the `train` set of training data
- A `formula` describing the desired model of the relationship between the dependent variable (on the left of the `~`) and the independent variables (on the right of the `~`)

The formula below indicates that we wish to express `assembly_model` as a function of `local_S`, `pi_h1`, `abund_h1`, and `trait_h1`. We chose these 4 predictor variables to demonstrate an ML model which considers all the axes of biodiversity, without being overly complex and including **all** the Hill numbers for each axis of data. There’s no reason to believe this is the *best* model, and you could certainly experiment with manipulating the independent variables in this formula.

```
## Experiment with results for different axes of data!
rf <- randomForest(assembly_model ~ local_S + pi_h1 + abund_h1 + trait_h1, data=train, proximity=TRUE)
print(rf)
```

```
Call:
 randomForest(formula = assembly_model ~ local_S + pi_h1 + abund_h1 + trait_h1, data = train, proximity = TRUE)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 19.41%
Confusion matrix:
            competition filtering neutral class.error
competition         538        33     124   0.2258993
filtering            24       571      89   0.1652047
neutral              61        71     560   0.1907514
```
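As an aside, the out-of-bag (OOB) confusion matrix printed above is also stored on the fitted model object, so you can work with it programmatically:

```
## The OOB confusion matrix (with per-class error) lives on the rf object
rf$confusion
```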

#### 11.3.4.7 **Test the model on the testing set**

How well does the trained model recover the ‘known’ assembly model in simulations that it hasn’t seen yet?

Now we have trained the `rf` model and it’s time to evaluate how well it can classify simulations that it hasn’t previously seen. This is exactly what the testing dataset is **for!** We can use the testing data, which the rf model has not encountered, to determine the accuracy of classification on *new* data.

We use the built-in R function `predict()` to feed the `test` data to the trained `rf` model, and it will return classification predictions for each simulation in the testing set.

```
test_predictions <- predict(rf, test)
cm <- confusionMatrix(test_predictions, test$assembly_model)
cm
```

```
Confusion Matrix and Statistics

             Reference
Prediction    competition filtering neutral
  competition         233        13      32
  filtering            23       270      28
  neutral              49        33     247

Overall Statistics

               Accuracy : 0.8082
                 95% CI : (0.7814, 0.833)
    No Information Rate : 0.3405
    P-Value [Acc > NIR] : < 2e-16

                  Kappa : 0.7122

 Mcnemar's Test P-Value : 0.08011

Statistics by Class:

                     Class: competition Class: filtering Class: neutral
Sensitivity                      0.7639           0.8544         0.8046
Specificity                      0.9278           0.9167         0.8680
Pos Pred Value                   0.8381           0.8411         0.7508
Neg Pred Value                   0.8892           0.9242         0.8998
Prevalence                       0.3287           0.3405         0.3308
Detection Rate                   0.2511           0.2909         0.2662
Detection Prevalence             0.2996           0.3459         0.3545
Balanced Accuracy                0.8459           0.8855         0.8363
```

The simplest way to evaluate classification accuracy is to look at the “Confusion matrix”. In the testing set we *know* the true assembly model the data was generated under, and we want to know how often the trained RF model is able to recover the known assembly model from the held out testing data. The confusion matrix reports the contingency table of reference vs predicted assembly model outcomes. With perfect classification you would expect 100% agreement between the ‘Reference’ model and the ‘Prediction’ model, with all values in the confusion matrix falling along the diagonal. Off-diagonal elements represent the frequency of mis-classification and the category of the mis-classified prediction (e.g. predictions of ‘neutral’ when the known assembly model was ‘competition’).

Already we can see that classification accuracy is relatively high, with the large majority of testing simulations being accurately classified, and relatively few (~19%, per the reported accuracy of 0.81) being mis-classified in one way or another.

It can sometimes be useful to visualize confusion matrices using the `heatmap` function in base R, which produces a plot representing exactly the data in the contingency table above.

`heatmap(cm$table, Rowv=NA, Colv=NA, margins=c(12, 12), xlab="Prediction", ylab="Reference")`
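If you prefer numbers to colors, the same table can be summarized as per-class accuracy by dividing the diagonal (correct predictions) by the column totals (the true counts per class):

```
## Per-class accuracy: correct predictions / total true simulations per class
diag(cm$table) / colSums(cm$table)
```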

#### 11.3.4.8 **Getting prediction probabilities for one individual simulation**

When we call `predict(rf, test)` we are using the classifier to make predictions for **all** the simulations in the testing set. This takes the single assembly model with the highest prediction probability as the predicted value (essentially it is a point estimate). Often it is useful (particularly with empirical data) to have a more comprehensive view of the prediction probabilities across classes. We can pass in just one simulated dataset and ask `predict` for `type="prob"`, which will return exactly these probabilities.

```
# Take the first row of testing data as just one simulation to predict.
# `type="prob"` indicates that we wish to return the prediction probability
# of each assembly_model, rather than the point estimate.
p_emp <- predict(rf, test[1, ], type="prob")
p_emp
```

```
  competition filtering neutral
1       0.016      0.04   0.944
```

**Activity:** Spend a few moments making predictions for several other simulated datasets by changing the `1` above to `2` or `3` or some other integer value, just to get the hang of it.
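You can also request probabilities for several simulations at once by passing a range of rows, for example the first five rows of the testing set:

```
## Prediction probabilities for the first five testing simulations
predict(rf, test[1:5, ], type="prob")
```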

### 11.3.5 **Predict assembly model of empirical data**

And remember to “embrace the uncertainty” in the model classification.

Now we will do a brief analysis of some empirical data, so you can see that the classification prediction process for empirical data is identical to that for simulated data: training the model and calling `predict`.

For this exercise we will use genetic data from a soil arthropod metabarcoding study published by Noguerales et al. (2021). The data comprises soil micro-arthropod COI sequences sampled from four different habitat types in the montane forests of Cyprus. Fetch the empirical data from Noguerales et al. (2021), which we have again preprocessed and stored on GitHub.

In the terminal, `wget` the data for one habitat type from the workshop repository:

```
## Be sure you are in the right working directory inside the terminal
cd /home/rstudio/MESS-inference
wget https://github.com/role-model/process-models-workshop/raw/main/empirical_data/Qa_hills.txt
```

Now go back into the R console and load the data for this sample. It might also help to take a quick look at what the data for this sample looks like, so also print it to the screen.

```
Qa_emp <- read.csv("/home/rstudio/MESS-inference/Qa_hills.txt")
Qa_emp
```

```
  local_S   pi_h1    pi_h2    pi_h3
1     156 48.5767 41.83594 37.69666
```

The empirical data was generated from a metabarcoding dataset, and for this exercise we provide the data already collapsed into summary statistics: the number of species in the local community (`local_S`) and the first three Hill numbers of genetic diversity (`pi_h1`, `pi_h2`, `pi_h3`). Earlier, when we fit the randomForest model, we used the first Hill numbers of abundance, traits, and genetic data, so before we can `predict()` the community assembly model with the real data we need to *re-fit* the rf classifier so its predictor variables match those of the empirical data.

```
# Fit the model including only genetic Hill numbers to the training set
rf <- randomForest(assembly_model ~ local_S + pi_h1 + pi_h2 + pi_h3, data=train, proximity=TRUE)
predict(rf, Qa_emp, type="prob")
```

```
  competition filtering neutral
1       0.172     0.742   0.086
```
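One caution: since we just swapped in a different set of predictors, it may be worth re-checking classification accuracy on the held-out testing set before putting much weight on the empirical result. A quick sketch:

```
## Re-check accuracy of the genetics-only classifier on the testing set
genetic_predictions <- predict(rf, test)
confusionMatrix(genetic_predictions, test$assembly_model)
```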

### 11.3.6 Hands-on Exercise: Predicting `assembly_model` of mystery simulations

Time allowing, you may wish to try to classify some simulated multi-dimensional biodiversity data with unknown (to you) `assembly_model` values. There is a new simulated data file which contains 20 mystery simulations with random `assembly_model` values. Choose one of these and try to re-run the classification procedure we just completed.

First go into your terminal and download the mystery simulations:

```
cd /home/rstudio/MESS-inference
wget https://github.com/role-model/process-models-workshop/raw/main/data/Mystery-data.csv
```

Now see if you can load the mystery simulations and do the classification inference on one or a couple of them. Try to do this on your own, but if you get stuck you can check the hint below.
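One possible sketch of the approach (assuming `Mystery-data.csv` contains the same summary-statistic columns used to train the classifier):

```
## A sketch; assumes the mystery data shares the training predictor columns
mystery <- read.csv("/home/rstudio/MESS-inference/Mystery-data.csv")
predict(rf, mystery[1, ], type="prob")
```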

A link to the key containing the true `assembly_model` values is hidden below. Don’t peek until you have a guess for your simulated data! How close did you get?

### 11.3.7 More to explore with the R `randomForest` package

This is just a tiny taste of the `randomForest` R package, and a tiny molecule of everything possible with machine learning, but we hope it gives you the flavor nonetheless.
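For example, randomForest's built-in variable importance measures report how much each predictor contributed to classification accuracy, a natural next step for exploring which axes of data are most informative:

```
## Which predictors were most informative for classification?
importance(rf)   # mean decrease in Gini impurity per predictor
varImpPlot(rf)   # the same information as a dotchart
```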

## 11.4 Key points

- Machine learning models can be used to “classify” categorical response variables given multi-dimensional input data.
- Major components of machine learning inference include gathering and transforming data, splitting data into training and testing sets, training the ML model, and evaluating model performance.
- ML inference is **inherently** uncertain, so it is important to evaluate uncertainty in ML prediction accuracy and propagate this into any downstream interpretation when applying ML to empirical data.