graphical_evaluation_simulations.Rmd
The following artificial data sets are simulated, with response values always depending on regressor variables, in order to obtain a realistic scenario for regression models:
set.seed(0)
## regressors
d <- data.frame(
x = runif(500, -1, 1),
z = runif(500, -1, 1)
)
## parameters
d <- transform(d,
lambda = exp(1 + 0.5 * x),
theta = 2,
mu = 50 + 22 * x,
sigma = 22,
sigmaz = exp(3 + 1 * z)
)
## responses
d <- transform(d,
ynorm = rnorm(500, mean = mu, sd = sigma),
yhnorm = rnorm(500, mean = mu, sd = sigmaz),
ytnorm = crch::rtnorm(500, mean = mu, sd = sigma, left = 0),
ycnorm = crch::rcnorm(500, mean = mu, sd = sigma, left = 0, right = 100),
yt = mu + sigma * rt(500, df = 4),
ylaplace = mu + rmutil::rlaplace(500, s = sigma),
yrskew = sn::rsn(500, xi = mu, omega = sigma, alpha = 5, tau = 0),
ylskew = sn::rsn(500, xi = mu, omega = sigma, alpha = -5, tau = 0),
yunif = mu + sigma * runif(500, min = -1, max = 1),
ypois = rpois(500, lambda = lambda),
ynegbin = rnbinom(500, mu = lambda, size = theta),
yzip = ifelse(runif(500) < 0.25, 0, rpois(500, lambda = lambda))
)
As a reference, a linear normal model is fitted to simulated normally distributed data without conditional heteroscedasticity.
## correct model fit
m1 <- lm(ynorm ~ x, data = d)
The model fit experiences no misspecifications resulting in a well calibrated predictive distribution (compare the case “well calibrated” in the summary plot of Graphical Evaluation: Methodology):
In the “conditional heteroscedasticity” example, two models are fitted to simulated data from the normal distribution where both the location and scale parameters depend on a regressor. The model m2
estimated with lm()
cannot account for conditional heteroscedasticity, unlike the model m3
which is fitted with crch()
.
## not accounting for heteroscedasticity
m2 <- lm(yhnorm ~ x, data = d)
### accounting for heteroscedasticity
m3 <- crch(yhnorm ~ x | z, data = d)
Not accounting for conditional heteroscedasticity results here in a model fit where the tails of the distribution are too light (compare the case “tails too light” in the summary plot of Graphical Evaluation: Methodology):
To investigate the graphical evaluation tools under the misspecification of not accounting for censoring in the estimation, homoscedastic normally distributed data are simulated with censoring at 0 and 100. The model m4
fitted with lm()
cannot account for the censored data, contrary to the model m5
estimated with crch()
.
## not accounting for censoring
m4 <- lm(ycnorm ~ x, data = d)
## accounting for censoring
m5 <- crch(ycnorm ~ x | 1, data = d, left = 0, right = 100)
Not accounting for the censoring at 0 and 100 leads to an incorrect distributional assumption in the tails of the predictive distribution, similar to the case “tails too heavy” discussed in Graphical Evaluation: Methodology:
Here, the homoscedastic normally distributed response values are truncated at zero. The model m6
fitted with lm()
cannot account correctly for this, contrary to the model m7
estimated with crch()
.
## not accounting for truncation
m6 <- lm(ytnorm ~ x, data = d)
# not accounting for truncation
m7 <- trch(ytnorm ~ x | 1, data = d, left = 0)
Not accounting for the truncation at 0 leads to an incorrect distributional assumption in the lower tail of the predictive distribution:
df = 4
)In the following example, data is simulated from the student-t distribution with 4 degrees of freedom. The model m8
incorrectly assumes normally distributed data, while the model m9
correctly incorporates t-distributed data.
## not accounting for t_4 distributed data
m8 <- lm(yt ~ x, data = d)
## accounting for t_4 distributed data
m9 <- crch(yt ~ x | 1, data = d, dist = "student")
Not accounting for the underlying student-t response distribution (df = 4
), which is characterized by heavier tails than the normal distribution, leads to a model fit where the tails of the predictive distribution are to light (compare the previous example with “conditional heteroscedasticity” and the case “tails too light” in the summary plot of Graphical Evaluation: Methodology):
In the example “skewness”, both right-skewed and left-skewed data are simulated, which are not correctly accounted for by the linear models m10
and m11
.
## not accounting for right skewed data
m10 <- lm(yrskew ~ x, data = d)
## not accounting for left skewed data
m11 <- lm(ylskew ~ x, data = d)
Not accounting for right-skewed or left-skewed data leads to a estimated model with a predictive distribution which is too skewed to the left or right, respectively (compare the cases “too skew to the left/right” in the summary plot of Graphical Evaluation: Methodology):
In the following, two further examples are shown, one with simulated data from a uniform distribution, the second with simulated data from a Laplace distribution. For both data sets, a linear model is estimated without considering the true underlying response distribution.
## not accounting for uniform distributed data
m12 <- lm(yunif ~ x, data = d)
## not accounting for laplace distributed data
m13 <- lm(ylaplace ~ x, data = d)
Uniformly distributed data generally have lighter tails than the normal distribution (depending on the lower and upper bounds, of course), while the Laplace distribution here is characterized by heavier tails and a higher probability mass in the center of the distribution than in the normal distribution. Therefore, the uniformly and Laplace distributed data lead to misspecified predictive distributions where the tails are either too heavy or too light, respectively (compare the previous examples “conditional heteroscedasticity” and “Student-t distribution”, where the tails are too light, as well aa the cases “tails too light/heavy” in the summary plot of Graphical Evaluation: Methodology).
Here, uniformly distributed data result in a misspecifed model where the tails of the predictive distribution are too heavy:
Here, Laplace distributed data result in a misspecifed model where the tails of the predictive distribution are too light:
As a reference, first a Poisson model is fitted to Poisson distributed data.
## correct model fit
m14 <- glm(ypois ~ x, data = d, family = poisson)
The model fit experiences no misspecifications resulting in a well calibrated predictive distribution (compare the case “well calibrated” in the summary plot of Graphical Evaluation: Methodology):
Two models are fitted to simulated data from a negative binomial distribution. In model m15
an underlying Poisson distribution is incorrectly assumed, in model m16
the negative binomial distribution is correctly accounted for by using glm.nb()
.
## misspecified distribution leading to an overdispersed predictive distribution
m15 <- glm(ynegbin ~ x, data = d, family = poisson)
## correct distributional assumption
m16 <- MASS::glm.nb(ynegbin ~ x, data = d)
The wrong distribution assumption results in not accounting for the overdispersion in the data, which corresponds to an underdispersed predictive disitriubtion (compare the case “underdispersed” in the summary plot of Graphical Evaluation: Methodology):
Two models are fitted to simulated count data with excess zeros; the model m17
is unable to account for zero inflation, while model m18
fits a zero-inflation regression model that combines a point mass at zero with a Poisson model for higher counts and therefore fits the excess zeros perfectly.
## not accouting for zero inflation
m17 <- glm(yzip ~ x, data = d, family = poisson)
## accouting for zero inflation
m18 <- pscl::zeroinfl(yzip ~ x | 1, data = d, dist = "poisson")
Excess zeros are inevitably linked to overdispersion, so the misspecified model fit without accounting for zero inflation has similar patterns to an underdispersive predictive distribution (compare the previous example with “negative binomial” distributed data and “underdispersed” in the summary plot of Graphical Evaluation: Methodology):