# 12 ML parameter estimation with MESS simulations

## 12.1 Key questions

- How do I estimate parameter values for continuous MESS model parameters?
- How do I evaluate the uncertainty of ML parameter estimates?

## 12.2 Lesson objectives

After this lesson, learners should be able to…

- Use simulations and randomForest ML to perform parameter estimation for key MESS model parameters.
- Evaluate uncertainty in parameter estimation accuracy using cross-validation simulations.
- Apply ML parameter estimation to empirical data and interpret the results in terms of the biological story.
- Brainstorm applications to real data.

## 12.3 Planned exercises

- Motivating MESS process-model parameter estimation
- Implement ML parameter estimation
- Hands-on Exercise: Predicting MESS parameters of mystery simulations
- Posterior predictive simulations (if time)

### 12.3.1 Motivating MESS process-model parameter estimation

Now that we have identified the neutral model as the most probable, we can estimate parameters of the empirical data given this model. Essentially, we are asking the question “What are the parameters of the model that generate summary statistics most similar to those of the empirical data?”

### 12.3.2 Implement ML parameter estimation

Let’s start with a new clean Rscript, so create a new file and save it as “MESS-Regression.R”, then load the necessary libraries and simulated data.

#### 12.3.2.1 Load libraries, setwd, and load the simulated data

```
library(randomForest)
library(caret)
library(reticulate)
setwd("/home/rstudio/MESS-inference")
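# Assumption: MESS is the Python package, imported here through reticulate
MESS <- import("MESS")
# Keep the first element of the object returned by load_local_sims
simdata = MESS$load_local_sims("MESS-SIMOUT.csv")[[1]]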
```

#### 12.3.2.2 Extract simulations generated under the ‘filtering’ assembly model

Let’s pretend that the most probable model from the classification procedure was ‘environmental filtering’. If our goal is to estimate parameters under this model, then we want to *exclude* from the ML inference any simulations that do not fit our most probable model. This can be achieved by selecting the rows in the `simdata` data.frame that have “filtering” as the `assembly_model`.

```
simdata = simdata[simdata$assembly_model == "filtering", ]
simdata
```
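The row selection above is ordinary data.frame subsetting with a logical vector. As a minimal sketch, the made-up `toy` table below (hypothetical values, not MESS output) shows which rows are kept and a quick way to confirm the result:

```r
# Toy stand-in for simdata; all values are invented for illustration
toy <- data.frame(
  assembly_model = c("neutral", "filtering", "competition", "filtering"),
  colrate = c(0.001, 0.004, 0.002, 0.008)
)

# Keep only the rows simulated under the "filtering" model
toy_filt <- toy[toy$assembly_model == "filtering", ]

# Confirm which models remain and how many rows were retained
table(toy_filt$assembly_model)
nrow(toy_filt)
```

Running the same two checks on the real `simdata` is a cheap way to catch a misspelled model name, which would silently return zero rows.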

#### 12.3.2.3 Split train and test data as normal

Again, we’ll split the data into training and testing sets.

```
tmp <- sample(2, nrow(simdata), replace = TRUE, prob = c(0.7, 0.3))
train <- simdata[tmp==1,]
test <- simdata[tmp==2,]
```
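Because `sample()` draws a label of 1 or 2 independently for every row, the split is only approximately 70/30 and differs between runs. The sketch below illustrates this on a hypothetical `toy` data.frame; the call to `set.seed()` and the seed value are assumptions added here for reproducibility, not part of the lesson code.

```r
set.seed(42)  # arbitrary seed so the split can be reproduced

# Toy stand-in for simdata with 1000 rows (invented values)
toy <- data.frame(id = 1:1000, colrate = runif(1000, 0.001, 0.01))

# Label each row 1 (train) or 2 (test) with probabilities 0.7 and 0.3
tmp <- sample(2, nrow(toy), replace = TRUE, prob = c(0.7, 0.3))
toy_train <- toy[tmp == 1, ]
toy_test <- toy[tmp == 2, ]

# Realized proportions are close to, but not exactly, 0.7 and 0.3
nrow(toy_train) / nrow(toy)
nrow(toy_test) / nrow(toy)
```

Every row lands in exactly one of the two sets, so no simulation is used for both training and testing.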

#### 12.3.2.4 Train the ML regressor as normal

Train the ML model to perform regression, as our focal dependent variable takes continuous values. The `randomForest` package auto-detects whether the dependent variable is continuous or categorical, so there’s nothing more to do there. We will use the same formula as before, specifying `local_S` and the first Hill number on each axis of biodiversity as the predictor variables.

```
rf <- randomForest(colrate ~ local_S + pi_h1 + abund_h1 + trait_h1, data=train, proximity=TRUE)
rf
```
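The auto-detection can be checked directly: a fitted `randomForest` object records its mode in the `type` element. The following is a small sketch on synthetic data (the `toy` data.frame, its columns, and `ntree = 50` are illustrative assumptions, not MESS simulations or lesson settings).

```r
library(randomForest)
set.seed(1)

# Synthetic predictors and responses (not MESS simulations)
toy <- data.frame(x1 = runif(200), x2 = runif(200))
toy$y_num <- 2 * toy$x1 + rnorm(200, sd = 0.1)       # continuous response
toy$y_cat <- factor(ifelse(toy$x2 > 0.5, "a", "b"))  # categorical response

# Same function call; only the type of the response differs
rf_num <- randomForest(y_num ~ x1 + x2, data = toy, ntree = 50)
rf_cat <- randomForest(y_cat ~ x1 + x2, data = toy, ntree = 50)

rf_num$type  # "regression"
rf_cat$type  # "classification"
```

If `rf$type` ever reports "classification" for `colrate`, the column was most likely read in as a factor or character rather than as a number.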

#### 12.3.2.5 Plot predictions of colrate for the held-out test set

For a regression analysis we can’t use a confusion matrix because the response is continuous, so instead we evaluate prediction accuracy by making a scatter plot showing predictions as a function of known simulated values. With perfect prediction accuracy the points in the scatter plot would fall along the identity line.

```
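preds <- predict(rf, test)
# Known values on the x-axis, predictions on the y-axis
plot(test$colrate, preds, xlab="Simulated colrate", ylab="Predicted colrate")
abline(0, 1)  # identity line: perfect predictions would fall along it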
```
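The scatter plot can be complemented with a numeric summary. The sketch below assumes root-mean-square error and squared correlation as accuracy measures; these are a choice made here for illustration, not something the lesson prescribes. The vectors `known` and `predicted` are made-up stand-ins for `test$colrate` and `preds`.

```r
# Hypothetical known and predicted values, standing in for
# test$colrate and preds
known <- c(0.001, 0.002, 0.004, 0.006, 0.008)
predicted <- c(0.0012, 0.0025, 0.0035, 0.0058, 0.0071)

# Root-mean-square error: typical size of a prediction error,
# in the same units as the parameter
rmse <- sqrt(mean((predicted - known)^2))

# Squared Pearson correlation between known and predicted values
r2 <- cor(known, predicted)^2

rmse
r2
```

A small RMSE relative to the range of simulated values, together with a squared correlation near 1, corresponds to points lying close to the identity line.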

#### 12.3.2.6 Predict `colrate` of test simulation and plot distribution of predictions

Now we can practice making predictions for a *single* simulation and looking at uncertainty in the prediction. As before, we can select one simulation by using the `test[#, ]` row selection strategy on our `test` data.frame.

When we ask for `predict.all=TRUE`, the call returns the prediction value for each tree in the `rf`, and we can plot the aggregation of these predictions using `hist` to show a histogram.

```
preds <- predict(rf, test[2, ], predict.all=TRUE)
# The predicted value is element [[1]]
print(preds[[1]])
# A vector of predictions for each tree in the forest [[2]]
hist(preds[[2]])
```

` 0.001191852`

**What can we say about the uncertainty on parameter estimation for this one simulation?**
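One way to put numbers on this is to summarize the per-tree predictions alongside the histogram. The sketch below uses synthetic data rather than MESS simulations; the `toy` data.frame, `ntree = 100`, and the choice of a central 95% range are assumptions made for illustration. The spread among trees is an informal indicator of uncertainty, not a formal confidence interval.

```r
library(randomForest)
set.seed(7)

# Synthetic data standing in for the MESS simulations (hypothetical)
toy <- data.frame(x1 = runif(300), x2 = runif(300))
toy$y <- 2 * toy$x1 + rnorm(300, sd = 0.2)

# Train on the first 250 rows and hold out the rest
rf_toy <- randomForest(y ~ x1 + x2, data = toy[1:250, ], ntree = 100)

# Per-tree predictions for one held-out row
p <- predict(rf_toy, toy[251, ], predict.all = TRUE)
tree_preds <- as.vector(p$individual)

# Point estimate, spread, and central 95% range of the tree predictions
p$aggregate
sd(tree_preds)
quantile(tree_preds, probs = c(0.025, 0.975))
```

A wide range relative to the values the parameter can take suggests that the summary statistics carry little information about it for this particular simulation.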

### 12.3.3 Hands-on Exercise: Predicting MESS parameters of mystery simulations

Now see if you can load the mystery simulations and do the regression inference on one or a couple of them. Try to do this on your own, but if you get stuck you can check the hint here:

A link to the key containing the true `colrate` values is hidden below. Don’t peek until you have a guess for your simulated data! How close did you get?

## 12.4 Key points

- Machine learning models can be used to “estimate parameters” of continuous response variables given multi-dimensional input data.
- Major components of machine learning inference include gathering and transforming data, splitting data into training and testing sets, training the ML model, and evaluating model performance.
- ML inference is **inherently** uncertain, so it is important to evaluate uncertainty in ML prediction accuracy and propagate this into any downstream interpretation when applying ML to empirical data.