2.3 Simulating predictors at different hierarchical levels
As well as simulating continuous predictors at the level of the observation, we can also simulate predictors at different hierarchical levels. Let’s take the example of a situation where we have repeated measures of individuals. The individuals have traits that are consistently expressed, whilst the environment varies between observations. We can describe variation at these different hierarchical levels as:
\[ y_{ij} = \beta_0 + \boldsymbol{x}_{i} \boldsymbol{\beta}_x + \boldsymbol{u}_j \boldsymbol{\beta}_u + \epsilon_{ij} \] \[ \boldsymbol{x}_i \sim \mathcal{N}(\boldsymbol{\mu}_x, \Sigma_x) \] \[ \boldsymbol{u}_j \sim \mathcal{N}(\boldsymbol{\mu}_u, \Sigma_u) \] \[ \epsilon_i \sim \mathcal{N}(0,\sigma^2_\epsilon) \]
where \(U\) is a matrix of predictors that vary at the individual level (denoted by the subscript \(j\)), and \(X\) is a matrix of predictors at the observation level (denoted by the index \(i\)).
In order to simulate from this model, we need a data structure and parameters for each of these levels. To do this, we can either specify a data structure generated using make_structure
(outlined previously in Section 2.1), or a pre-existing data structure, to the simulate_population
function. We then add an item to the parameter list, the name of which matches on of the grouping factors in the data structure, and specify the parameters for predictors that vary at that level in the same way as outlined in the previous section (1). This is similar to the fixed factors above, but we are now assuming that the variable is drawn randomly from a distribution, rather than the effects at each level being fixed.
Lets imagine that we simulate behaviour, that is a functions of an individual’s size and physiology, and also varies in response to the environment, here temperature and rainfall:
<- simulate_population(
squid_data data_structure = make_structure(structure = "individual(500)", repeat_obs=2),
parameters = list(
individual = list(
names = c("size","physiology"),
beta = c(0.1,0.2)
),observation = list(
names=c("temperature","rainfall"),
beta = c(0.2,-0.1)
),residual = list(
vcov = 0.5
)
),response_names="behaviour"
)
<- get_population_data(squid_data)
data
coef(lm(behaviour ~ size + physiology + temperature + rainfall , data))
## (Intercept) size physiology temperature rainfall
## 0.04239715 0.08633998 0.18335983 0.20147836 -0.08308933
Here, we have simulated 4 predictors, ‘size’ and ‘physiology’ that vary at the level of the individual, and ‘temperature’ and ‘rainfall’ that vary at the level of the observation. To keep things simple, we will simulate them all as unit normal variables (mean=0 and variance=1). Note, the names of the different grouping factors in the parameter list (here ‘individual’) needs to exactly match those in the data structure. The order does not, however, have to be the same. There are circumstances in which we may want to simulate two sets of effects at the same hierarchical level (for example see permanent environment effects in Section 4.1), in this case we can call them different things in the parameter list, but link them back to the grouping factor, by providing a group
name. For example the following will produce the same simulation as above:
<- simulate_population(
squid_data data_structure = make_structure(structure = "individual(500)", repeat_obs=2),
parameters = list(
ind1 = list(
group="individual",
names = c("size"),
beta = c(0.1)
),ind2 = list(
group="individual",
names = c("physiology"),
beta = c(0.2)
),observation = list(
names=c("temperature","rainfall"),
beta = c(0.2,-0.1)
),residual = list(
vcov = 0.5
)
),response_names="behaviour"
)
It is also worth noting that predictors do not have to be simulated for every grouping factor in the data structure - in this way no variation at that level can be simulated.
2.3.1 Simulating ‘random’ effects
In essence, random effects (random intercepts) represent an unobserved/latent predictor (or group of predictors), which varies at a given hierarchical level. In a mixed effect model, the effect at each level of the grouping factor is unknown, and estimated by the model (and assumed to come from a normal distribution). When simulating this, however, we can simply simulate an additional predictor at a particular hierarchical level (\(z\)) with mean 0 and a given variance (\(\sigma^2_z\)).
\[ y_{ij} = \beta_0 + u_j + \epsilon_{ij} \] \[ u_j \sim \mathcal{N}(0,\sigma^2_u) \] \[ \epsilon_i \sim \mathcal{N}(0,\sigma^2_\epsilon) \]
For example, we can simulate some among-individual variation as follows:
<- simulate_population(
squid_data data_structure = make_structure(structure = "individual(500)", repeat_obs=2),
parameters = list(
individual = list(
vcov = 0.5
),residual = list(
vcov = 0.5
)
)
)
<- get_population_data(squid_data)
data head(data)
## y individual_effect residual individual squid_pop
## 1 -0.86084775 -0.09309309 -0.7677547 1 1
## 2 -1.12731207 -0.09309309 -1.0342190 1 1
## 3 1.75527564 1.62397047 0.1313052 2 1
## 4 1.88106328 1.62397047 0.2570928 2 1
## 5 0.08515369 -0.63170479 0.7168585 3 1
## 6 -1.01588408 -0.63170479 -0.3841793 3 1
library(lme4)
<- function(x) print(summary(x), correlation=FALSE, show.resids=FALSE, ranef.comp = c("Variance"))
short_summary
short_summary(lmer(y ~ 1 + (1|individual), data))
## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ 1 + (1 | individual)
## Data: data
##
## REML criterion at convergence: 2662.7
##
## Random effects:
## Groups Name Variance
## individual (Intercept) 0.5209
## Residual 0.4643
## Number of obs: 1000, groups: individual, 500
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 0.01823 0.03881 0.47
Note that here we haven’t specified any variable names. In this case the simulated predictors are named by the grouping factors (e.g. individual_effect).
2.3.2 Incorporating existing data structures
We could also use an existing data structure, taking the grouping factors and levels from an existing dataset and input them to simulate_population
. To demonstrate this, we can use the blue tit dataset provided with the MCMCglmm package. This is a dataset with some continuous variables (tarsus, back (coloration) and hatchdate), and some grouping factors (animal, dam, fosternest and sex), the latter providing a data structure from which to simulate.
library(MCMCglmm)
data(BTdata)
head(BTdata)
## tarsus back animal dam fosternest hatchdate sex
## 1 -1.89229718 1.1464212 R187142 R187557 F2102 -0.6874021 Fem
## 2 1.13610981 -0.7596521 R187154 R187559 F1902 -0.6874021 Male
## 3 0.98468946 0.1449373 R187341 R187568 A602 -0.4279814 Male
## 4 0.37900806 0.2555847 R046169 R187518 A1302 -1.4656641 Male
## 5 -0.07525299 -0.3006992 R046161 R187528 A2602 -1.4656641 Fem
## 6 -1.13519543 1.5577219 R187409 R187945 C2302 0.3502805 Fem
<- simulate_population(
squid_data data_structure = BTdata[,c("dam","fosternest")],
parameters = list(
dam = list(
vcov = 0.2
),fosternest = list(
vcov = 0.3
),residual = list(
vcov = 0.5
)
)
)
<- get_population_data(squid_data)
data head(data)
## y dam_effect fosternest_effect residual dam fosternest
## 1 -0.9961429 0.32127788 -0.5187692 -0.79865156 R187557 F2102
## 2 0.4241295 -0.25097001 0.8402955 -0.16519603 R187559 F1902
## 3 -1.3080452 0.14237053 -0.1642529 -1.28616283 R187568 A602
## 4 -0.5971299 -0.07265687 -0.7234984 0.19902540 R187518 A1302
## 5 -1.4897618 -0.93654434 -0.6329443 0.07972682 R187528 A2602
## 6 -1.3758694 0.02953667 0.4727818 -1.87818792 R187945 C2302
## squid_pop
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1