12 ML parameter estimation with MESS simulations
12.1 Key questions
- How do I estimate parameter values for continuous MESS model parameters?
- How do I evaluate the uncertainty of ML parameter estimates?
12.2 Lesson objectives
After this lesson, learners should be able to…
- Use simulations and randomForest ML to perform parameter estimation for key MESS model parameters.
- Evaluate uncertainty in parameter estimation accuracy using cross-validation simulations.
- Apply ML parameter estimation to empirical data and interpret the results in terms of the biological story they tell.
- Brainstorm applications to real data.
12.3 Planned exercises
- Motivating MESS process-model parameter estimation
- Implement ML parameter estimation
- Hands-on Exercise: Predicting MESS parameters of mystery simulations
- Posterior predictive simulations (if time)
12.3.1 Motivating MESS process-model parameter estimation
Now that we have identified the neutral model as the most probable, we can estimate the parameters of the empirical data under this model. Essentially, we are asking: “What parameter values of the model generate summary statistics most similar to those of the empirical data?”
12.3.2 Implement ML parameter estimation
Let’s start with a new, clean R script: create a new file and save it as “MESS-Regression.R”, then load the necessary libraries and the simulated data.
12.3.2.1 Load libraries, setwd, and load the simulated data
library(randomForest)
library(caret)
library(reticulate)
setwd("/home/rstudio/MESS-inference")
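# MESS is assumed to be the Python module already imported through reticulate
# (for example with MESS <- import("MESS")); adjust this to match your setup
simdata = MESS$load_local_sims("MESS-SIMOUT.csv")[[1]]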
12.3.2.2 Extract simulations generated under the ‘filtering’ assembly model
Let’s pretend that the most probable model from the classification procedure was ‘environmental filtering’. If our goal is to estimate parameters under this model, then we want to exclude from the ML inference any simulations that were not generated under it. This can be achieved by selecting the rows of the simdata data.frame that have “filtering” as the assembly_model.
simdata = simdata[simdata$assembly_model == "filtering", ]
simdata
12.3.2.3 Split train and test data as normal
Again, we’ll split the data into training and testing sets.
tmp <- sample(2, nrow(simdata), replace = TRUE, prob = c(0.7, 0.3))
train <- simdata[tmp==1,]
test <- simdata[tmp==2,]
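Because sample draws a random group label for every row, the split differs from run to run and the 70/30 proportions are only approximate. The following is a minimal sketch of how one might make the split reproducible and check the resulting set sizes. It uses a made-up data frame (toy, with hypothetical values standing in for simdata) and an arbitrary seed; neither comes from the lesson data.

```r
# Toy stand-in for simdata (hypothetical values, for illustration only)
set.seed(42)  # arbitrary seed so the split is identical on every run
toy <- data.frame(colrate = runif(1000, 0.001, 0.01),
                  local_S = sample(10:100, 1000, replace = TRUE))

# Label each row 1 (train, ~70%) or 2 (test, ~30%)
grp <- sample(2, nrow(toy), replace = TRUE, prob = c(0.7, 0.3))
toy_train <- toy[grp == 1, ]
toy_test  <- toy[grp == 2, ]

# Every row lands in exactly one of the two sets
nrow(toy_train) + nrow(toy_test) == nrow(toy)

# Realized proportions are close to, but not exactly, 0.7 and 0.3
prop.table(table(grp))
```

Setting the seed once at the top of a script is usually enough; the same idea would apply to the real simdata split.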
12.3.2.4 Train the ML regressor as normal
Train the ML model to perform regression, since our focal dependent variable (colrate) takes continuous values. The randomForest package auto-detects whether the dependent variable is continuous or categorical, so there’s nothing more to do there. We will use the same formula as before, specifying local_S and the first Hill number on each axis of biodiversity as the predictor variables.
rf <- randomForest(colrate ~ local_S + pi_h1 + abund_h1 + trait_h1, data=train, proximity=TRUE)
rf
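To see the auto-detection in action, one can inspect the type element of a fitted forest. The sketch below uses a small made-up data set (toy, x1, x2, y and cls are hypothetical names, not MESS outputs) and fits the same predictors against a numeric response and then against a factor response.

```r
library(randomForest)

# Toy data (hypothetical): a continuous response that depends on x1
set.seed(1)
toy <- data.frame(x1 = runif(200), x2 = runif(200))
toy$y <- 2 * toy$x1 + rnorm(200, sd = 0.1)

# Numeric response -> randomForest fits a regression forest
rf_reg <- randomForest(y ~ x1 + x2, data = toy, ntree = 100)
rf_reg$type   # "regression"

# Factor response -> randomForest fits a classification forest
toy$cls <- factor(ifelse(toy$y > 1, "high", "low"))
rf_cls <- randomForest(cls ~ x1 + x2, data = toy, ntree = 100)
rf_cls$type   # "classification"

# Relative importance of each predictor in the regression forest
importance(rf_reg)
```

For the MESS forest trained above, the same check would be rf$type, and importance(rf) gives a first impression of which summary statistics carry the most information about colrate.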
12.3.2.5 Plot results of predictions for colrate of held-out test set
For a regression analysis we can’t use a confusion matrix because the response is continuous, so instead we evaluate prediction accuracy by making a scatter plot showing predictions as a function of known simulated values. With perfect prediction accuracy the points in the scatter plot would fall along the identity line.
preds <- predict(rf, test)
plot(preds, test$colrate)
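A scatter plot is easier to read with the identity line drawn on it, and a couple of numeric summaries help when comparing models. The sketch below is one possible way to do this, using made-up vectors (truth and pred are hypothetical stand-ins for test$colrate and preds) so that it runs on its own.

```r
# Hypothetical true values and predictions
# (stand-ins for test$colrate and preds)
set.seed(7)
truth <- runif(100, 0.001, 0.01)
pred  <- truth + rnorm(100, sd = 0.001)

# Root mean squared error, in the units of the parameter
rmse <- sqrt(mean((pred - truth)^2))

# Squared Pearson correlation between predictions and true values
r2 <- cor(pred, truth)^2

# True values on the x axis, predictions on the y axis,
# with the identity line (intercept 0, slope 1) for reference
plot(truth, pred, xlab = "Simulated colrate", ylab = "Predicted colrate")
abline(0, 1, col = "red")

c(rmse = rmse, r2 = r2)
```

Points scattered tightly around the red line indicate accurate predictions; a systematic tilt away from it would suggest the forest over- or under-estimates part of the parameter range.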
12.3.2.6 Predict colrate of test simulation and plot distribution of predictions
Now we can practice making predictions for a single simulation and looking at the uncertainty in the prediction. As we did before, we can select one simulation by using the test[#, ] row selection strategy on our test data.frame.
When we ask for predict.all=TRUE, predict also returns the prediction of each individual tree in the rf, and we can plot the distribution of these per-tree predictions using hist to show a histogram.
preds <- predict(rf, test[2, ], predict.all=TRUE)
# The predicted value is element [[1]]
print(preds[[1]])
# A vector of predictions for each tree in the forest [[2]]
hist(preds[[2]])
0.001191852
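Beyond the histogram, the per-tree predictions can be summarized numerically. The sketch below is a rough illustration on made-up data (toy, rf_toy and tree_preds are hypothetical names); it reports the standard deviation and the central 95% range of the per-tree predictions. This describes the spread among trees and should not be read as a formal confidence interval.

```r
library(randomForest)

# Toy regression data (hypothetical)
set.seed(3)
toy <- data.frame(x1 = runif(300), x2 = runif(300))
toy$y <- 2 * toy$x1 + rnorm(300, sd = 0.2)

# Train on the first 250 rows, hold out the rest
rf_toy <- randomForest(y ~ x1 + x2, data = toy[1:250, ], ntree = 200)

# Per-tree predictions for one held-out row
p <- predict(rf_toy, toy[251, ], predict.all = TRUE)
tree_preds <- as.vector(p$individual)   # one value per tree

# For regression, the aggregate prediction is the mean over trees
all.equal(unname(p$aggregate), mean(tree_preds))

# Spread among trees: standard deviation and central 95% range
sd(tree_preds)
quantile(tree_preds, c(0.025, 0.975))
```

Applied to the MESS forest, the same summaries would be computed from preds[[2]]; a wide range relative to the predicted value signals that the trees disagree about colrate for that simulation.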
What can we say about the uncertainty on parameter estimation for this one simulation?
12.3.3 Hands-on Exercise: Predicting MESS parameters of mystery simulations
Now see if you can load the mystery simulations and do the regression inference on one or a couple of them. Try to do this on your own, but if you get stuck you can check the hint here:
A link to the key containing the true colrate values is hidden below. Don’t peek until you have a guess for your simulated data! How close did you get?
12.4 Key points
- Machine learning models can be used to “estimate parameters” of continuous response variables given multi-dimensional input data.
- Major components of machine learning inference include gathering and transforming data, splitting data into training and testing sets, training the ML model, and evaluating model performance.
- ML inference is inherently uncertain, so it is important to evaluate uncertainty in ML prediction accuracy and propagate it into any downstream interpretation when applying ML to empirical data.