2.3 Simulating predictors at different hierarchical levels

As well as simulating continuous predictors at the level of the observation, we can also simulate predictors at different hierarchical levels. Let’s take the example of a situation where we have repeated measures of individuals. The individuals have traits that are consistently expressed, whilst the environment varies between observations. We can describe variation at these different hierarchical levels as:

\[ y_{ij} = \beta_0 + \boldsymbol{x}_{i} \boldsymbol{\beta}_x + \boldsymbol{u}_j \boldsymbol{\beta}_u + \epsilon_{ij} \] \[ \boldsymbol{x}_i \sim \mathcal{N}(\boldsymbol{\mu}_x, \Sigma_x) \] \[ \boldsymbol{u}_j \sim \mathcal{N}(\boldsymbol{\mu}_u, \Sigma_u) \] \[ \epsilon_i \sim \mathcal{N}(0,\sigma^2_\epsilon) \]

where \(U\) is a matrix of predictors that vary at the individual level (denoted by the subscript \(j\)), and \(X\) is a matrix of predictors at the observation level (denoted by the index \(i\)).

In order to simulate from this model, we need a data structure and parameters for each of these levels. To do this, we can either specify a data structure generated using make_structure (outlined previously in Section 2.1), or a pre-existing data structure, to the simulate_population function. We then add an item to the parameter list, the name of which matches on of the grouping factors in the data structure, and specify the parameters for predictors that vary at that level in the same way as outlined in the previous section (1). This is similar to the fixed factors above, but we are now assuming that the variable is drawn randomly from a distribution, rather than the effects at each level being fixed.

Lets imagine that we simulate behaviour, that is a functions of an individual’s size and physiology, and also varies in response to the environment, here temperature and rainfall:

squid_data <- simulate_population(
  data_structure = make_structure(structure = "individual(500)", repeat_obs=2),
  parameters = list(
    individual = list(
      names = c("size","physiology"),
      beta = c(0.1,0.2)
    ),
    observation = list(
      names=c("temperature","rainfall"),
      beta = c(0.2,-0.1)
    ),
    residual = list(
      vcov = 0.5
    )
  ),
  response_names="behaviour"
)

data <- get_population_data(squid_data)

coef(lm(behaviour ~ size + physiology + temperature + rainfall , data))

## (Intercept)        size  physiology temperature    rainfall 
## -0.03402370  0.07313193  0.17416616  0.24872731 -0.11561841

Here, we have simulated 4 predictors, ‘size’ and ‘physiology’ that vary at the level of the individual, and ‘temperature’ and ‘rainfall’ that vary at the level of the observation. To keep things simple, we will simulate them all as unit normal variables (mean=0 and variance=1). Note, the names of the different grouping factors in the parameter list (here ‘individual’) needs to exactly match those in the data structure. The order does not, however, have to be the same. There are circumstances in which we may want to simulate two sets of effects at the same hierarchical level (for example see permanent environment effects in Section 4.1), in this case we can call them different things in the parameter list, but link them back to the grouping factor, by providing a group name. For example the following will produce the same simulation as above:

squid_data <- simulate_population(
  data_structure = make_structure(structure = "individual(500)", repeat_obs=2),
  parameters = list(
    ind1 = list(
      group="individual",
      names = c("size"),
      beta = c(0.1)
    ),
    ind2 = list(
      group="individual",
      names = c("physiology"),
      beta = c(0.2)
    ),
    observation = list(
      names=c("temperature","rainfall"),
      beta = c(0.2,-0.1)
    ),
    residual = list(
      vcov = 0.5
    )
  ),
  response_names="behaviour"
)

It is also worth noting that predictors do not have to be simulated for every grouping factor in the data structure - in this way no variation at that level can be simulated.

2.3.1 Simulating ‘random’ effects

In essence, random effects (random intercepts) represent an unobserved/latent predictor (or group of predictors), which varies at a given hierarchical level. In a mixed effect model, the effect at each level of the grouping factor is unknown, and estimated by the model (and assumed to come from a normal distribution). When simulating this, however, we can simply simulate an additional predictor at a particular hierarchical level (\(z\)) with mean 0 and a given variance (\(\sigma^2_z\)).

\[ y_{ij} = \beta_0 + u_j + \epsilon_{ij} \] \[ u_j \sim \mathcal{N}(0,\sigma^2_u) \] \[ \epsilon_i \sim \mathcal{N}(0,\sigma^2_\epsilon) \]

For example, we can simulate some among-individual variation as follows:

squid_data <- simulate_population(
  data_structure = make_structure(structure = "individual(500)", repeat_obs=2),
  parameters = list(
    individual = list(
      vcov = 0.5
    ),
    residual = list(
      vcov = 0.5
    )
  )
)

data <- get_population_data(squid_data)
head(data)

##              y individual_effect   residual individual squid_pop
## 1  0.602356146         0.1594673  0.4428889          1         1
## 2  1.204485552         0.1594673  1.0450183          1         1
## 3 -0.681867088        -0.3816181 -0.3002490          2         1
## 4 -1.436016418        -0.3816181 -1.0543983          2         1
## 5  0.006081218         0.7306521 -0.7245709          3         1
## 6  0.177797638         0.7306521 -0.5528544          3         1

library(lme4)
short_summary <- function(x) print(summary(x), correlation=FALSE, show.resids=FALSE, ranef.comp = c("Variance"))

short_summary(lmer(y ~ 1 + (1|individual), data))

## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ 1 + (1 | individual)
##    Data: data
## 
## REML criterion at convergence: 2694
## 
## Random effects:
##  Groups     Name        Variance
##  individual (Intercept) 0.4741  
##  Residual               0.5103  
## Number of obs: 1000, groups:  individual, 500
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  0.03428    0.03819   0.898

Note that here we haven’t specified any variable names. In this case the simulated predictors are named by the grouping factors (e.g. individual_effect).

2.3.2 Incorporating existing data structures

We could also use an existing data structure, taking the grouping factors and levels from an existing dataset and input them to simulate_population. To demonstrate this, we can use the blue tit dataset provided with the MCMCglmm package. This is a dataset with some continuous variables (tarsus, back (coloration) and hatchdate), and some grouping factors (animal, dam, fosternest and sex), the latter providing a data structure from which to simulate.

library(MCMCglmm)
data(BTdata)
head(BTdata)

##        tarsus       back  animal     dam fosternest  hatchdate  sex
## 1 -1.89229718  1.1464212 R187142 R187557      F2102 -0.6874021  Fem
## 2  1.13610981 -0.7596521 R187154 R187559      F1902 -0.6874021 Male
## 3  0.98468946  0.1449373 R187341 R187568       A602 -0.4279814 Male
## 4  0.37900806  0.2555847 R046169 R187518      A1302 -1.4656641 Male
## 5 -0.07525299 -0.3006992 R046161 R187528      A2602 -1.4656641  Fem
## 6 -1.13519543  1.5577219 R187409 R187945      C2302  0.3502805  Fem

squid_data <- simulate_population(
  data_structure = BTdata[,c("dam","fosternest")],
  parameters = list(
    dam = list(
      vcov = 0.2
    ),
    fosternest = list(
      vcov = 0.3
    ),
    residual = list(
      vcov = 0.5
    )
  )
)

data <- get_population_data(squid_data)
head(data)

##            y    dam_effect fosternest_effect    residual     dam fosternest
## 1  0.3414224  0.0004465141        0.26993761  0.07103831 R187557      F2102
## 2  0.4892094  0.0512931487       -0.13437510  0.57229137 R187559      F1902
## 3  1.6103895  0.0428057278        0.69045892  0.87712480 R187568       A602
## 4 -1.3354583 -0.2351044549       -0.61425069 -0.48610310 R187518      A1302
## 5 -0.1467267  0.9155448975       -0.44713888 -0.61513276 R187528      A2602
## 6  1.3456794  0.5799862495        0.02555989  0.74013326 R187945      C2302
##   squid_pop
## 1         1
## 2         1
## 3         1
## 4         1
## 5         1
## 6         1