1.6 Non-Gaussian phenotypes
To simulate non-Gaussian data, we can specify a link function and a family as arguments to simulate_population()
. Underneath the predictors are being simulated as multivariate normal (on the latent scale), and then the resulting phenotype is transformed (onto the expected scale) and then binomial or Poisson sampling is applied (the observed scale).
Here is an example to simulate Poisson distributed data: \[ y_i \sim Poisson(\lambda_i) \] \[ \lambda_i = exp( \beta_0 + \boldsymbol{x}_{i} \boldsymbol{\beta} + \epsilon_i ) \] \[ \boldsymbol{x}_i \sim \mathcal{N}(\boldsymbol{\mu}_x, \Sigma_x) \] \[ \epsilon_i \sim \mathcal{N}(0,\sigma^2_\epsilon) \]
The only change in the code that is needed is the addition of the link and family arguments.
<- simulate_population(
squid_data parameters = list(
intercept = 1.75,
observation = list(
names = c("temperature","rainfall"),
beta = c(0.2,0.1)
),residual = list(
vcov = 0.2
)
),n = 2000,
family = "poisson",
link = "log"
)
<- get_population_data(squid_data)
data head(data)
## y temperature rainfall residual squid_pop
## 1 5 0.80122572 1.51171175 -0.009892707 1
## 2 11 0.19315205 -0.09121688 0.555220760 1
## 3 7 -0.05978012 0.27519677 -0.158106894 1
## 4 2 -1.94741262 0.79497686 -0.587682748 1
## 5 9 -0.62235194 1.65578543 0.112234603 1
## 6 8 0.07763423 -0.65812238 0.090078300 1
plot(table(data$y), ylab="Frequency", xlab="z")
glm(y ~ temperature + rainfall, data, family="poisson")
##
## Call: glm(formula = y ~ temperature + rainfall, family = "poisson",
## data = data)
##
## Coefficients:
## (Intercept) temperature rainfall
## 1.8627 0.1947 0.1189
##
## Degrees of Freedom: 1999 Total (i.e. Null); 1997 Residual
## Null Deviance: 5409
## Residual Deviance: 4692 AIC: 11730
Available families are ‘gaussian’, ‘poisson’ or ‘binomial’ and link functions ‘identity’, ‘log’, ‘inverse’, ‘sqrt’, ‘logit’, ‘probit’ and ‘cloglog’.
1.6.1 Transforming across scales
It can be difficult to simulate data with a certain mean and variance on the ‘observed’ scale when using parameters on the ‘latent’ scale. {squidSim} provides two functions lat2exp()
and exp2lat()
that help do with the log transformation. lat2exp()
transforms means and (co)variances from the latent (normal) scale to the expected (log-normal) scale, whilst exp2lat()
does the reverse.
Let us take an example. We want to simulate some count data that has a mean of 6 and variance 15. In a Poisson distribution, the mean is equal to the variance. This means that a variance of 6 will be added on to the expected scale variation through the stochastic Poisson sampling process. This means that we want to simulate data on the expected scale with a variance of 9. We can use the exp2lat()
function to work out the mean and variance on the latent scale that will give us what we want on the expected scale.
<- exp2lat(mean=6, cov=9)
latent_params latent_params
## $mean
## [1] 1.680188
##
## $cov
## [,1]
## [1,] 0.2231436
This gives us a list of parameters with which we can simulate:
<- simulate_population(
squid_data parameters = list(
intercept = latent_params[["mean"]],
residual = list(
vcov = latent_params[["cov"]]
)
),n = 2000,
family = "poisson",
link = "log"
)
<- get_population_data(squid_data)
data
mean(data$y)
## [1] 5.896
var(data$y)
## [1] 15.13175
As you can see, we retrieve very similar values to those we were after.
We can do with with other transformation. Another one with a simple transformation across scales is the probit. The probit transformation is simply the cumulative distribution function of a normal distribution, and so we can use the functions pnorm()
and qnorm()
to transform between latent and expected.