A Poisson model without regressors is fitted to two simulated datasets, one from a Poisson and one from a negative binomial distribution. Both distributions have mean \(\mu = 3\); the negative binomial has shape parameter \(\Theta = 2\):
library("MASS") ## provides glm.nb()
d1a <- data.frame(
yp = rpois(100, lambda = 3),
ynb = rnbinom(100, mu = 3, size = 2)
)
## correct Poisson model fit (DGP: Poisson)
m1a_p <- glm(yp ~ 1, family = poisson, data = d1a)
## incorrect Poisson model fit (DGP: neg. binomial)
m1a_nb1 <- glm(ynb ~ 1, family = poisson, data = d1a)
## correct neg. binomial model fit (DGP: Neg. binomial)
m1a_nb2 <- glm.nb(ynb ~ 1, data = d1a)

In the first and third lines, a Poisson model and a negative binomial model are fitted to simulated data from a Poisson and a negative binomial distribution, respectively. In the second line, the response distribution is misspecified: a Poisson model is fitted to data simulated from a negative binomial distribution. The following characteristics can be seen in the model assessment graphs for the misspecified model (2nd line):
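A numeric counterpart to the graphical assessment of the misspecified Poisson fit is the Pearson dispersion statistic. The following is a minimal sketch in base R; the seed, the larger sample size `n` and the helper `pearson_dispersion()` are assumptions made for illustration and are not part of the original example.

```r
## Minimal sketch: Pearson dispersion for intercept-only Poisson fits.
## Seed, sample size and helper name are illustrative assumptions.
set.seed(1)
n <- 1000
d <- data.frame(
  yp  = rpois(n, lambda = 3),
  ynb = rnbinom(n, mu = 3, size = 2)
)

fit_p  <- glm(yp ~ 1, family = poisson, data = d)   # correctly specified
fit_nb <- glm(ynb ~ 1, family = poisson, data = d)  # misspecified

## Sum of squared Pearson residuals over the residual degrees of freedom;
## values near 1 are consistent with the Poisson assumption Var(y) = mu.
pearson_dispersion <- function(fit) {
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}

disp_p  <- pearson_dispersion(fit_p)
disp_nb <- pearson_dispersion(fit_nb)

## Theoretical variance-to-mean ratio of the negative binomial: 1 + mu / size
ratio_nb <- 1 + 3 / 2

print(c(poisson = disp_p, negbin = disp_nb, theory_negbin = ratio_nb))
```

A dispersion clearly above 1 for the second fit indicates more variability in the data than a Poisson model with the same mean allows.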
Artificial data from negative binomial and Poisson distribution with regressors:
d1b <- data.frame(
x = runif(500, 0, 2)
)
d1b$mu <- 1 + 2 * d1b$x
d1b$yp <- rpois(500, lambda = d1b$mu)
d1b$ynb <- rnbinom(500, mu = d1b$mu, size = 2)
## correctly specified Poisson model fit
m1b_p <- glm(yp ~ x, data = d1b, family = poisson)
## misspecified Poisson model fit
m1b_nb1 <- glm(ynb ~ x, data = d1b, family = poisson)
## correctly specified Negbin model fit
m1b_nb2 <- glm.nb(ynb ~ x, data = d1b)

As in example 1a, the response distribution of the 2nd model is misspecified: a Poisson model is fitted to data simulated from a negative binomial distribution. Hence, similar characteristics can be seen in the model assessment graphs for the misspecified model (2nd line).
Artificial data from a Student t distribution with four degrees of freedom without regressors:
library("crch") ## provides crch(), rtnorm() and rcnorm()
d2a <- data.frame(
yt = rt(1000, 4)
)
## correct student-t fit
m2a_t <- crch(yt ~ 1, dist = "student", data = d2a)
## incorrect normal fit
m2a_n <- crch(yt ~ 1, dist = "gaussian", data = d2a)

A wrong assumption about the underlying response distribution, here estimating a normal distribution instead of a Student t distribution with four degrees of freedom, leads to an underestimation of the heavy tails and an underdispersive model fit (shown in the 2nd line):
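The underestimated tails can also be checked numerically. The sketch below uses base R only and replaces the `crch()` fit by the sample mean and standard deviation, which is an assumption made to keep the example self-contained; the seed and the sample size are likewise illustrative.

```r
## Minimal sketch: tail behaviour of Student-t(4) data under a Gaussian fit.
## The Gaussian fit is approximated by sample mean and sd (an assumption).
set.seed(1)
n <- 10000
yt <- rt(n, df = 4)

## Standardized residuals under the fitted normal distribution
z <- (yt - mean(yt)) / sd(yt)

## Observed share of residuals beyond +/- 3 versus the share
## that a normal distribution predicts
tail_obs <- mean(abs(z) > 3)
tail_exp <- 2 * pnorm(-3)

## PIT values under the fitted normal, counted in ten equal-width bins
pit <- pnorm(z)
pit_counts <- table(cut(pit, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE))

print(c(observed = tail_obs, expected = tail_exp))
print(pit_counts)
```

Under a correctly specified model the bin counts would be roughly equal and the observed tail share would be close to the expected one.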
Artificial data from a Student t distribution with four degrees of freedom with regressors:
d2b <- data.frame(
x = runif(1000, 0, 1)
)
d2b$yt <- d2b$x + 1 * rt(1000, 4)
d2b$yn <- d2b$x + 1 * rnorm(1000)
## correct student-t fit
m2b_t <- crch(yt ~ x, dist = "student", data = d2b)
## incorrect normal fit
m2b_n1 <- crch(yt ~ x, dist = "gaussian", data = d2b)
## correct normal fit
m2b_n2 <- crch(yn ~ x, dist = "gaussian", data = d2b)

As in example 2a, estimating a normal distribution instead of a Student t distribution with four degrees of freedom leads to an underestimation of the heavy tails and an underdispersive model fit (shown in the 2nd line):
Artificial data from a Gaussian, Uniform and Laplace (mu = 0, s = 0.4) distribution without regressors:
d3a <- data.frame(
yn = rnorm(1000),
yu = runif(1000, min = -6, max = 6),
yl = rlaplace(1000, m = 0, s = 0.4)
)
## correct gaussian fit
m3a_n <- crch(yn ~ 1, dist = "gaussian", data = d3a)
## incorrect fit (underdispersive model fit)
## [U-shaped PIT, S-shaped QQ-plot, "thin tails" worm plot]
m3a_u <- crch(yu ~ 1, dist = "gaussian", data = d3a)
## incorrect fit (overdispersive model fit)
## [inverse U-shaped PIT, inverse S-shaped QQ-plot, "fat tails" worm plot]
m3a_l <- crch(yl ~ 1, dist = "gaussian", data = d3a)

A wrong assumption about the underlying distribution, here assuming a normal distribution for data from a uniform or a Laplace distribution, leads to an underdispersive and an overdispersive model fit, respectively (shown in the 2nd and 3rd lines):
For under-dispersed data, the largest and smallest values are less extreme than expected and there are fewer outliers, i.e., the true underlying distribution has thinner tails than the assumed one.
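The difference in tail weight between the three data-generating distributions can be summarized by the sample excess kurtosis, which is 0 for a normal distribution. This is a minimal sketch in base R; the Laplace draws are generated as a scaled difference of two exponentials as a stand-in for `rlaplace()`, and the helper `excess_kurtosis()`, the seed and the sample size are assumptions for illustration.

```r
## Minimal sketch: excess kurtosis as a numeric summary of tail weight.
set.seed(1)
n <- 10000

yn <- rnorm(n)
yu <- runif(n, min = -6, max = 6)

## Laplace(0, s) via the difference of two independent standard exponentials
## (base-R stand-in for rlaplace(); an assumption of this sketch)
s <- 0.4
yl <- s * (rexp(n) - rexp(n))

## Fourth standardized moment minus 3 (zero for a normal distribution)
excess_kurtosis <- function(y) {
  z <- (y - mean(y)) / sqrt(mean((y - mean(y))^2))
  mean(z^4) - 3
}

kurt <- c(
  normal  = excess_kurtosis(yn),
  uniform = excess_kurtosis(yu),
  laplace = excess_kurtosis(yl)
)
print(round(kurt, 2))
```

Negative values indicate thinner tails than the normal distribution (uniform data), positive values indicate heavier tails (Laplace data).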
Artificial data from a Gaussian, Uniform and Laplace distribution with regressors:
d3b <- data.frame(
x = runif(1000, 0, 1)
)
d3b$yn <- 0 + 1 * rnorm(1000, mean = d3b$x)
d3b$yu <- 0 + 1 * runif(1000, min = d3b$x - 1, max = d3b$x + 1)
d3b$yl <- 0 + 1 * rlaplace(1000, m = d3b$x, s = 0.4)
## correct gaussian fit
m3b_n <- crch(yn ~ x, dist = "gaussian", data = d3b)
## incorrect fit (underdispersive model fit)
## [U-shaped PIT, S-shaped QQ-plot, "thin tails" worm plot]
m3b_u <- crch(yu ~ x, dist = "gaussian", data = d3b)
## incorrect fit (overdispersive model fit)
## [inverse U-shaped PIT, inverse S-shaped QQ-plot, "fat tails" worm plot]
m3b_l <- crch(yl ~ x, dist = "gaussian", data = d3b)

As in example 3a, assuming a normal distribution for data from a uniform or a Laplace distribution leads to an underdispersive and an overdispersive model fit, respectively (shown in the 2nd and 3rd lines):
Artificial data from a Normal distribution with no skewness, as well as from a right skewed, and left skewed Normal distribution without regressors:
library("sn") ## provides rsn()
d4a <- data.frame(
yn = rsn(n = 100, xi = 0, omega = 1, alpha = 0, tau = 0),
yrs = rsn(n = 100, xi = 0, omega = 1, alpha = 5, tau = 0),
yls = rsn(n = 100, xi = 0, omega = 1, alpha = -5, tau = 0)
)
## correct gaussian fit
m4a_n <- crch(yn ~ 1, data = d4a, dist = "gaussian")
## incorrect fit: right-skewed residuals
m4a_rs <- crch(yrs ~ 1, data = d4a, dist = "gaussian")
## incorrect fit: left-skewed residuals
m4a_ls <- crch(yls ~ 1, data = d4a, dist = "gaussian")

A wrong assumption about the underlying distribution, here a normal distribution instead of a right-skewed or left-skewed normal distribution, leads to misspecified model fits (shown in the 2nd and 3rd lines):
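The direction of the skewness that a Gaussian fit cannot capture is visible in the sample skewness of the data. The sketch below avoids extra packages by using the standard stochastic representation of the skew-normal distribution; the helpers `r_skew_normal()` and `sample_skewness()`, the seed and the sample size are hypothetical names and values chosen for illustration.

```r
## Minimal sketch: sample skewness for symmetric and skewed normal data.
set.seed(1)
n <- 10000

## Skew-normal draws with xi = 0 and omega = 1 via the representation
## delta * |U0| + sqrt(1 - delta^2) * U1, delta = alpha / sqrt(1 + alpha^2)
r_skew_normal <- function(n, alpha) {
  delta <- alpha / sqrt(1 + alpha^2)
  delta * abs(rnorm(n)) + sqrt(1 - delta^2) * rnorm(n)
}

## Third standardized moment (zero for a symmetric distribution)
sample_skewness <- function(y) {
  z <- (y - mean(y)) / sqrt(mean((y - mean(y))^2))
  mean(z^3)
}

skew <- c(
  symmetric = sample_skewness(r_skew_normal(n, alpha = 0)),
  right     = sample_skewness(r_skew_normal(n, alpha = 5)),
  left      = sample_skewness(r_skew_normal(n, alpha = -5))
)
print(round(skew, 2))
```

Because an intercept-only Gaussian fit only shifts and rescales the data, its residuals inherit this skewness.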
Artificial data from a Normal distribution with no skewness, as well as from a right skewed, and left skewed Normal distribution with regressors:
d4b <- data.frame(
x = rnorm(1000, 15, 1.2),
z = rnorm(1000, -0.5, 0.4)
)
d4b$mu <- 0.4 + 0.9 * d4b$x
d4b$sigma <- exp(0.3 + 0.6 * exp(d4b$z))
d4b$yn <- rsn(n = 1000, xi = d4b$mu, omega = d4b$sigma, alpha = 0, tau = 0)
d4b$yrs <- rsn(n = 1000, xi = d4b$mu, omega = d4b$sigma, alpha = 5, tau = 0)
d4b$yls <- rsn(n = 1000, xi = d4b$mu, omega = d4b$sigma, alpha = -5, tau = 0)
## correct gaussian fit
m4b_n <- crch(yn ~ x | z, data = d4b, dist = "gaussian")
## incorrect fit: right-skewed residuals
## [curved (positive skewed) QQ-Plot, U-shape wormplot]
m4b_rs <- crch(yrs ~ x | z, data = d4b, dist = "gaussian")
## incorrect fit: left-skewed residuals
## [curved (negative skewed) QQ-Plot, inverse U-shape wormplot]
m4b_ls <- crch(yls ~ x | z, data = d4b, dist = "gaussian")

As in example 4a, assuming a normal distribution instead of a right-skewed or left-skewed normal distribution leads to misspecified model fits (shown in the 2nd and 3rd lines):
Artificial data from a normal distribution with and without truncation at zero. Data generating process without regressors:
d5a <- data.frame(
ytn = rtnorm(1000, 0.3, 0.6, left = 0),
yn = rtnorm(1000, 0.3, 0.6, left = -Inf)
)
## correct truncated normal fit
m5a_tn <- crch(ytn ~ 1, data = d5a, dist = "gaussian", truncated = TRUE, left = 0)
## incorrect normal fit
m5a_n1 <- crch(ytn ~ 1, data = d5a, dist = "gaussian", truncated = TRUE, left = -Inf)
## correct normal fit
m5a_n2 <- crch(yn ~ 1, data = d5a, dist = "gaussian", truncated = TRUE, left = -Inf)

A wrong assumption about the underlying distribution, here ignoring the truncation at zero, leads to the misspecified model fit in the 2nd line:
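One way to see the misspecification numerically is to look at the probability mass that an untruncated normal fit places below zero, where no observations can occur. This is a minimal base-R sketch; the truncated draws are generated by inverse-CDF sampling as a stand-in for `rtnorm()`, and the Gaussian fit is approximated by the sample mean and standard deviation. Both choices, as well as the seed and the sample size, are assumptions for illustration.

```r
## Minimal sketch: mass below zero under an untruncated normal fit.
set.seed(1)
n <- 10000
mu <- 0.3
sigma <- 0.6

## Inverse-CDF sampling from N(mu, sigma^2) truncated to (0, Inf)
u <- runif(n, min = pnorm(0, mu, sigma), max = 1)
ytn <- qnorm(u, mu, sigma)

## Untruncated Gaussian fit via sample moments (an assumption of this sketch)
m_hat <- mean(ytn)
s_hat <- sd(ytn)

## Probability the fitted normal assigns to values below zero
## versus the observed share of such values
p_below <- pnorm(0, m_hat, s_hat)
share_below <- mean(ytn < 0)

print(c(fitted_mass_below_zero = p_below, observed_share_below_zero = share_below))
```

The fitted normal reserves a visible share of its mass for negative values, although the data contain none.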
Artificial data from a normal distribution with and without truncation at zero. Data generating process with regressors:
d5b <- data.frame(
x = rtnorm(1000, 0.4, 0.6, left = 0),
z = rnorm(1000, -0.6, 0.2)
)
d5b$mu <- 0.2 + 1.8 * d5b$x
d5b$sigma <- exp(0.1 + 0.6 * exp(d5b$z))
d5b$ytn <- rtnorm(1000, mean = d5b$mu, sd = d5b$sigma, left = 0)
d5b$yn <- rtnorm(1000, mean = d5b$mu, sd = d5b$sigma)
## correct truncated normal fit
m5b_tn <- crch(ytn ~ x | z, data = d5b, dist = "gaussian", truncated = TRUE, left = 0)
## incorrect normal fit
m5b_n1 <- crch(ytn ~ x | z, data = d5b, dist = "gaussian", truncated = TRUE, left = -Inf)
## correct normal fit
m5b_n2 <- crch(yn ~ x | z, data = d5b, dist = "gaussian", truncated = TRUE, left = -Inf)

As in example 5a, ignoring the truncation at zero leads to the misspecified model fit in the 2nd line:
Artificial data from a normal distribution with and without censoring at zero. Data generating process without regressors:
d6a <- data.frame(
ycn = rcnorm(1000, 0.4, 0.6, left = 0),
yn = rcnorm(1000, 0.4, 0.6, left = -Inf)
)
## correct censored normal fit
m6a_cn <- crch(ycn ~ 1, data = d6a, dist = "gaussian", truncated = FALSE, left = 0)
## incorrect normal fit
m6a_n1 <- crch(ycn ~ 1, data = d6a, dist = "gaussian", truncated = FALSE, left = -Inf)
## correct normal fit
m6a_n2 <- crch(yn ~ 1, data = d6a, dist = "gaussian", truncated = FALSE, left = -Inf)

A wrong assumption about the underlying distribution, here ignoring the censoring at zero, leads to the misspecified model fit in the 2nd line:
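The effect of ignoring censoring can be illustrated by comparing naive moment estimates with a maximum likelihood fit that accounts for the point mass at zero. The sketch below is written in base R; the censored draws are generated with `pmax()` as a stand-in for `rcnorm()`, and the hand-written likelihood `negloglik()` with `optim()` stands in for `crch()`. These substitutions, the seed and the sample size are assumptions for illustration.

```r
## Minimal sketch: censoring at zero, naive versus censored-likelihood estimates.
set.seed(1)
n <- 10000
mu <- 0.4
sigma <- 0.6

## Left-censoring at zero: latent negative values are recorded as 0
ycn <- pmax(rnorm(n, mu, sigma), 0)

## Observed point mass at zero versus its theoretical probability
share_zero <- mean(ycn == 0)
p_zero <- pnorm(0, mu, sigma)

## Negative log-likelihood of a normal distribution left-censored at zero;
## sigma is parameterized on the log scale to keep it positive
negloglik <- function(par, y) {
  m <- par[1]
  s <- exp(par[2])
  cens <- y <= 0
  -(sum(cens) * pnorm(0, m, s, log.p = TRUE) +
      sum(dnorm(y[!cens], m, s, log = TRUE)))
}

opt <- optim(c(mean(ycn), log(sd(ycn))), negloglik, y = ycn)

est <- rbind(
  naive    = c(mu = mean(ycn), sigma = sd(ycn)),
  censored = c(mu = opt$par[1], sigma = exp(opt$par[2]))
)
print(c(share_zero = share_zero, p_zero = p_zero))
print(round(est, 3))
```

The naive estimates treat the zeros as ordinary observations and therefore miss the latent location and scale, whereas the censored likelihood targets the parameters of the latent normal distribution.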
Artificial data from a normal distribution with and without censoring at zero. Data generating process with regressors:
d6b <- data.frame(
x = rcnorm(1000, 0.4, 0.6, left = 0),
z = rnorm(1000, -0.6, 0.2)
)
d6b$mu <- 0.2 + 1.8 * d6b$x
d6b$sigma <- exp(0.1 + 0.6 * exp(d6b$z))
d6b$ycn <- rcnorm(1000, mean = d6b$mu, sd = d6b$sigma, left = 0)
d6b$yn <- rcnorm(1000, mean = d6b$mu, sd = d6b$sigma)
## correct censored normal fit
m6b_cn <- crch(ycn ~ x | z, data = d6b, dist = "gaussian", truncated = FALSE, left = 0)
## incorrect normal fit
m6b_n1 <- crch(ycn ~ x | z, data = d6b, dist = "gaussian", truncated = FALSE, left = -Inf)
## correct normal fit
m6b_n2 <- crch(yn ~ x | z, data = d6b, dist = "gaussian", truncated = FALSE, left = -Inf)

As in example 6a, ignoring the censoring at zero leads to the misspecified model fit in the 2nd line: