7.2 Missing data
The missing data methods allows generation of unbalanced data. Missing data is generated through creating probabilities of being sampled using logistic regression. Missingness can then either be random, or a function of any of the simulated variables.
This methods allows the different classes of missing data to be generated:
- Missing Completely At Random (MCAR)
- All observations have an equal probability of being sampled
- Missing At Random (MAR)
- Probability of missingness is dependent on variables correlated with the response variable (i.e. a predictor variable)
- Missing Not At Random (MNAR)
- Probability of missingness is dependent on the response variable itself
7.2.1 MCAR
Missing completely at random occurs when the probability of missingness is not dependent on anything. This can be implemented through a logistic regression, where only the intercept is specified: \[ logit(p) = beta_0 \]
Note this intercept is on the logit scale, so 0 is equivalent to 0.5.
<- simulate_population(
pop_data data_structure = make_structure("individual(100)",repeat_obs=5),
parameters = list(
individual = list(
vcov = 0.1
),observation= list(
names = c("environment"),
beta =c(0.5)
), residual = list(
vcov = 0.8
)
),
sample_type = "missing",
sample_param = "0"
)
<- get_sample_data(pop_data)
sample_data nrow(sample_data)
## [1] 276
7.2.2 MAR
Missing at random occurs when the probability of missingness is dependent on a predictor variable (or a variables correlated with y). This can be implemented through a logistic regression, where the predictor variable(s) is a predictor(s) of y: \[ logit(p) = beta_0 + beta_1*environment \]
<- simulate_population(
pop_data data_structure = make_structure("individual(100)",repeat_obs=5),
parameters = list(
individual = list(
vcov = 0.1
),observation= list(
names = c("environment"),
beta =c(0.5)
), residual = list(
vcov = 0.8
)
),
sample_type = "missing",
sample_param = "0.5*environment"
)
<- get_sample_data(pop_data)
sample_data nrow(sample_data)
## [1] 247
The predictor variables are scaled (mean 0, variance 1), so the slopes are directly comparable across traits, and intercept represents the mean (on the logit scale).
7.2.3 MNAR
Missing not at random occurs when the probability of missingness is dependent on the response variable itself variable (i.e. y). This can be implemented through a logistic regression, where the predictor variable is y: \[ logit(p) = beta_0 + beta_1*y \] Again y is scaled.
<- simulate_population(
pop_data data_structure = make_structure("individual(100)",repeat_obs=5),
parameters = list(
individual = list(
vcov = 0.8
),observation= list(
names = c("environment"),
beta =c(0.1)
), residual = list(
vcov = 0.5
)
),
sample_type = "missing",
sample_param = "0.5*y"
)
<- get_sample_data(pop_data)
sample_data nrow(sample_data)
## [1] 274
Lets try and visualise this. We know there is lots f between individual variation, and we know sampling is based on phenotype, so we would expect an association between number of observations and phenotype:
<- data.frame(
ind_data n=as.vector(table(sample_data$individual)),
mean=tapply(sample_data$y,sample_data$individual,mean)
)boxplot(mean~n,ind_data)