Graphics for Assessing Goodness of Fit

1 Overview

Marginal calibration: rootogram.

Conditional calibration: PIT histogram, Q-Q plot of randomized quantile residuals, worm plot.

Notation: as introduced in the previous section.

Application: goals scored in the FIFA 2018 World Cup.

## Poisson regression for the number of goals scored per team and match
data("FIFA2018", package = "distributions3")
m <- glm(goals ~ difference, data = FIFA2018, family = poisson)

2 Rootogram

The rootogram is a graphical tool for assessing the goodness of fit of a parametric univariate distributional model, with estimated parameters $\hat{\theta}_i = (\hat{\theta}_{i1}, \ldots, \hat{\theta}_{iK})$ and $f(\cdot)$ describing the density or probability mass function, in terms of marginal calibration. Rootograms evaluate graphically whether the observed frequencies $\mathrm{obs}_j$ match the expected frequencies $\mathrm{exp}_j$ by plotting histogram-like rectangles or bars for the observed frequencies and a curve for the fitted frequencies, both on a square-root scale. In the form presented here, it was implemented by Kleiber and Zeileis (2016), building on work of Tukey (1977).

In the most general form, given observations $y_i$ $(i = 1, \ldots, n)$ of a random variable that are divided into subsets by a set of breakpoints $b_0, b_1, b_2, \ldots$, the observed and expected frequencies are given by

$$\mathrm{obs}_j = \sum_{i=1}^{n} w_i \cdot I\left(y_i \in (b_j, b_{j+1}]\right), \qquad \mathrm{exp}_j = \sum_{i=1}^{n} w_i \cdot \left\{F(b_{j+1} \mid \hat{\theta}_i) - F(b_j \mid \hat{\theta}_i)\right\},$$

with $F(\cdot)$ being the CDF corresponding to the modeled distribution $f(\cdot)$ and $w_i$ being optional observation-specific weights. The weights are typically needed either for survey data or in situations with model-based weights ().
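
To make the definition concrete, the frequencies for the Poisson model fitted above can be computed directly in base R. This is only a minimal sketch: the unit weights and the breakpoints halfway between the integers are illustrative choices, not part of the definition.

lambda <- fitted(m)                        # estimated Poisson means per observation
y      <- FIFA2018$goals
w      <- rep(1, length(y))                # illustrative unit weights w_i
breaks <- seq(-0.5, max(y) + 0.5, by = 1)  # illustrative breakpoints b_0, b_1, ...
## observed_j: weighted count of y_i falling into (b_j, b_{j+1}]
## expected_j: weighted sum of F(b_{j+1} | theta_i) - F(b_j | theta_i)
observed <- sapply(seq_len(length(breaks) - 1), function(j)
  sum(w * (y > breaks[j] & y <= breaks[j + 1])))
expected <- sapply(seq_len(length(breaks) - 1), function(j)
  sum(w * (ppois(breaks[j + 1], lambda) - ppois(breaks[j], lambda))))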

For a discrete variable $y_i$, the observed and expected frequencies can be simplified and are given for each integer $j$ by

$$\mathrm{obs}_j = \sum_{i=1}^{n} I(y_i = j), \qquad \mathrm{exp}_j = \sum_{i=1}^{n} f(j \mid \hat{\theta}_i),$$

with the indicator function $I(\cdot)$ (). As rootograms are best known for count data, the latter form is quite common.
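
For the Poisson GLM fitted above, this discrete form amounts to just a few lines of base R; again a sketch for illustration, not the implementation used by rootogram() below.

lambda   <- fitted(m)
y        <- FIFA2018$goals
counts   <- 0:max(y)
observed <- sapply(counts, function(j) sum(y == j))           # obs_j
expected <- sapply(counts, function(j) sum(dpois(j, lambda))) # exp_j
cbind(goals = counts, observed, expected = round(expected, 1))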

Different styles of rootograms have been proposed and are discussed extensively in Kleiber and Zeileis (2016). As default, they propose a so-called “hanging” rootogram, which aligns all deviations along the horizontal axis: the rectangles are drawn from $\sqrt{\mathrm{exp}_j}$ down to $\sqrt{\mathrm{exp}_j} - \sqrt{\mathrm{obs}_j}$, so that they “hang” from the curve of the expected frequencies.
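
Using the observed and expected frequencies from the sketch above, a hanging rootogram can be drawn by hand in base R to illustrate the idea; the rootogram() call below produces the properly styled version.

## bars hang from sqrt(expected_j) down to sqrt(expected_j) - sqrt(observed_j)
plot(counts, sqrt(expected), type = "n",
     ylim = range(0, sqrt(expected), sqrt(expected) - sqrt(observed)),
     xlab = "Goals", ylab = "sqrt(Frequency)")
rect(counts - 0.4, sqrt(expected) - sqrt(observed), counts + 0.4, sqrt(expected),
     col = "lightgray")
lines(counts, sqrt(expected), type = "o", pch = 19)
abline(h = 0, lty = 2)  # deviations are aligned along the horizontal axis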

The concept of comparing observed and expected frequencies graphically was also introduced in the seminal work on assessing calibration and sharpness of a predictive probability model by Gneiting, Balabdaoui, and Raftery (2007) and, building on this, applied to count data by Czado, Gneiting, and Held (2009). However, since in both cases either the deviations or the expected and observed frequencies are presented only as lines connecting the respective frequencies, deviations are more difficult to detect than in the rootograms introduced by Tukey (1977) and further enhanced by Kleiber and Zeileis (2016).

rootogram(m)

3 PIT histogram

As described in the introduction, to check for probabilistic calibration of a regression model, Dawid (1984) proposed the use of the probability integral transform (PIT), which is simply the predictive cumulative distribution function (CDF) evaluated at the observations. PIT values have been used under various names (e.g., ; ; ); to emphasize their similar properties to residuals, we follow Warton, Thibaut, and Wang (2017) and refer to them as PIT residuals from now on.

For a continuous random variable $y_i$ $(i = 1, \ldots, n)$, PIT residuals are defined as

$$u_i = F(y_i \mid \hat{\theta}_i),$$

where $F(\cdot)$ denotes the CDF of the modeled distribution $f(\cdot)$ with estimated parameters $\hat{\theta}_i = (\hat{\theta}_{i1}, \ldots, \hat{\theta}_{iK})$. If the estimated model is a good approximation to the true data-generating process, the observations can be regarded as draws from the predictive distribution and the PIT residuals $u_i$ are approximately uniformly distributed on $[0, 1]$. Plotting a histogram of the PIT residuals and checking it for uniformity is therefore a common empirical way of checking for calibration (; ). Deviations from uniformity point to underlying forecast errors and model deficiencies: U-shaped histograms indicate underdispersed predictive distributions, inverted-U-shaped histograms indicate overdispersion, and skewed histograms suggest that the central tendency is biased (; ).
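
Since the FIFA model is a count model, the continuous case is illustrated here with a small simulated Gaussian example (purely hypothetical data and model, used only to show the mechanics): the PIT residuals are the fitted normal CDF evaluated at the observations.

set.seed(1)
x  <- runif(500)
yc <- rnorm(500, mean = 1 + 2 * x, sd = 0.5)                # simulated response
mg <- lm(yc ~ x)
u  <- pnorm(yc, mean = fitted(mg), sd = summary(mg)$sigma)  # u_i = F(y_i | theta_i)
hist(u, freq = FALSE)  # approximately uniform for a well-calibrated model
abline(h = 1, lty = 2)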

When considering discrete response distributions, or distributions with a discrete component (e.g., in the case of censoring), the PIT $u_i$ for a discrete random variable $y_i$ can be generated as a random draw from the interval $[F(y_i - 1 \mid \hat{\theta}_i), F(y_i \mid \hat{\theta}_i)]$. Even though this introduces some randomness into the graphical representation of the PIT residuals, for a large number of observations the impact on the graphical evaluation when repeating the calculation (i.e., drawing new values $u_i$) is typically rather small. For small data sets, we recommend increasing the number of random draws, which considerably reduces the randomness in the graphical display.
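
For the Poisson GLM fitted above, randomized PIT residuals can be sketched as follows, drawing each $u_i$ uniformly from the interval given above.

lambda <- fitted(m)
y      <- FIFA2018$goals
set.seed(1)
## u_i drawn uniformly from [F(y_i - 1 | theta_i), F(y_i | theta_i)]
u <- runif(length(y), min = ppois(y - 1, lambda), max = ppois(y, lambda))
hist(u, freq = FALSE)
abline(h = 1, lty = 2)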

Alternatively, a nonrandomized PIT histogram was introduced by Czado, Gneiting, and Held (2009), where, rather than building on randomized pointwise PIT residuals $u_i$, the expected fraction of the CDF along the interval $[F(y_i - 1 \mid \hat{\theta}_i), F(y_i \mid \hat{\theta}_i)]$ is used. This is asymptotically equivalent to drawing an infinite number of random PIT residuals.
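
A minimal sketch of this nonrandomized variant for the Poisson GLM above: each observation contributes to a bin the expected fraction of its interval $[F(y_i - 1 \mid \hat{\theta}_i), F(y_i \mid \hat{\theta}_i)]$ falling into that bin, obtained here by averaging the observation-wise conditional PIT CDFs.

lambda <- fitted(m)
y      <- FIFA2018$goals
p_lo   <- ppois(y - 1, lambda)  # F(y_i - 1 | theta_i)
p_hi   <- ppois(y,     lambda)  # F(y_i     | theta_i)
## average conditional PIT CDF over all observations
pit_cdf <- function(u) mean(pmin(pmax((u - p_lo) / (p_hi - p_lo), 0), 1))
breaks  <- seq(0, 1, by = 0.1)
height  <- diff(sapply(breaks, pit_cdf)) / diff(breaks)  # bin heights on density scale
barplot(height, width = 0.1, space = 0)
abline(h = 1, lty = 2)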

pithist(m)

4 Q-Q residuals plot

Quantile residuals are simply the inverse cumulative distribution function of the standard normal distribution, $\Phi^{-1}$, evaluated at the PIT residuals $u_i$ $(i = 1, \ldots, n)$. Hence, they can be defined as

$$\hat{r}_i = \Phi^{-1}\left(F(y_i \mid \hat{\theta}_i)\right) = \Phi^{-1}(u_i),$$

where $F(\cdot)$ again denotes the cumulative distribution function (CDF) of the modeled distribution $f(\cdot)$ with estimated parameters $\hat{\theta}_i = (\hat{\theta}_{i1}, \ldots, \hat{\theta}_{iK})$ (). As before, for discrete or partly discrete responses, the approach includes some randomization to achieve continuous $u_i$ values; quantile residuals are therefore often referred to as randomized quantile residuals in the literature ().

In case of a correct model fit, the values $u_i$ are uniformly distributed on the unit interval and the quantile residuals should, at least approximately, be standard normally distributed. Hence, to check for normality, quantile residuals can be compared graphically to the theoretical quantiles of the standard normal distribution, where strong deviations from the bisecting line indicate a misspecified model fit.
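
A minimal sketch for the Poisson GLM above, reusing randomized PIT residuals as in the previous section and mapping them to the normal scale:

lambda <- fitted(m)
y      <- FIFA2018$goals
set.seed(1)
u <- runif(length(y), min = ppois(y - 1, lambda), max = ppois(y, lambda))
r <- qnorm(u)         # (randomized) quantile residuals
qqnorm(r); qqline(r)  # compare against standard normal quantiles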

Mathematically, the Q-Q plot consists of the tuples

$$(z_{(1)}, \hat{r}_{(1)}), \ldots, (z_{(n)}, \hat{r}_{(n)}),$$

where $\hat{r}_{(i)}$ denotes the $i$th order statistic of the quantile residuals, so that $\hat{r}_{(1)} \le \hat{r}_{(2)} \le \ldots \le \hat{r}_{(n)}$, and $z_{(i)}$ are the corresponding ordered standard normal quantiles $\Phi^{-1}(p_i)$, evaluated at the cumulative proportions $p_i = (i - 0.5)/n$ for $n > 10$. This graphical evaluation is well known as the normal probability plot or normal Q-Q plot (). Due to the transformation of the PIT residuals $u_i$ to the normal scale, their extreme values are more widely spread, so that normal Q-Q plots are better suited than, for example, PIT histograms to detect violations of the distributional assumption in its tails. An additional advantage of Q-Q plots is that they avoid the need to define breakpoints, as typically required for histogram-style evaluations ().
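
The same plot can be constructed by hand from these tuples, reusing the quantile residuals r from the previous sketch, which makes the role of the cumulative proportions explicit:

n <- length(r)
p <- (seq_len(n) - 0.5) / n  # cumulative proportions p_i
z <- qnorm(p)                # ordered standard normal quantiles z_(i)
plot(z, sort(r), xlab = "Theoretical quantiles", ylab = "Quantile residuals")
abline(0, 1, lty = 2)        # bisecting line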

But Q-Q plots can also be applied to check whether residuals follow any other known distribution, by employing the inverse cumulative distribution function of the distribution of interest instead of $\Phi^{-1}$ in the computation and comparing the quantile residuals $\hat{r}_i$ to the respective theoretical quantiles. This is then called a theoretical quantile-quantile plot, or Q-Q plot for short ().

qqrplot(m, confint = "line")

5 Worm plot

As small to medium deviations can be quite hard to detect in Q-Q plots, untilting the plot by subtracting the theoretical quantiles makes patterns of departure from a now horizontal reference line much easier to detect. Mathematically, the tuples in the plot are therefore

$$(z_{(1)}, \hat{r}_{(1)} - z_{(1)}), \ldots, (z_{(n)}, \hat{r}_{(n)} - z_{(n)}),$$

where, as before, $\hat{r}_{(i)}$ denotes the order statistics of the empirical quantile residuals and $z_{(i)}$ the ordered statistics of the corresponding standard normal quantiles. This so-called de-trended Q-Q plot () is best known from the application of Buuren and Fredriks (2001) and is therefore usually referred to as a worm plot, following their naming.
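
Reusing z and the sorted quantile residuals from the Q-Q plot sketch above, the worm plot is simply the de-trended version of the same tuples:

plot(z, sort(r) - z, xlab = "Theoretical quantiles", ylab = "Deviation")
abline(h = 0, lty = 2)  # departures now appear around a horizontal line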

wormplot(m, confint = "line")

6 References

Brockwell, A. E. 2007. “Universal Residuals: A Multivariate Transformation.” Statistics & Probability Letters 77 (14): 1473–78. https://doi.org/10.1016/j.spl.2007.02.008.
Buuren, Stef van, and Miranda Fredriks. 2001. “Worm Plot: A Simple Diagnostic Device for Modelling Growth Reference Curves.” Statistics in Medicine 20 (8): 1259–77. https://doi.org/10.1002/sim.746.
Czado, Claudia, Tilmann Gneiting, and Leonhard Held. 2009. “Predictive Model Assessment for Count Data.” Biometrics 65 (4): 1254–61. https://doi.org/10.1111/j.1541-0420.2009.01191.x.
Dawid, A. P. 1984. “Present Position and Potential Developments: Some Personal Views: Statistical Theory: The Prequential Approach.” Journal of the Royal Statistical Society: Series A (General) 147 (2): 278–92. https://doi.org/10.2307/2981683.
Diebold, Francis X., Todd A. Gunther, and Anthony S. Tay. 1998. “Evaluating Density Forecasts with Applications to Financial Risk Management.” International Economic Review 39 (4): 863–83. https://doi.org/10.2307/2527342.
Dunn, Peter K., and Gordon K. Smyth. 1996. “Randomized Quantile Residuals.” Journal of Computational and Graphical Statistics 5 (3): 236–44. https://doi.org/10.2307/1390802.
Friendly, Michael. 1991. SAS System for Statistical Graphics. 1st ed. Cary, NC: SAS Institute Inc.
Gneiting, Tilmann, Fadoua Balabdaoui, and Adrian E. Raftery. 2007. “Probabilistic Forecasts, Calibration and Sharpness.” Journal of the Royal Statistical Society: Series B (Methodological) 69 (2): 243–68. https://doi.org/10.1111/j.1467-9868.2007.00587.x.
Hoaglin, David C. 2006. “Using Quantiles to Study Shape.” In Exploring Data Tables, Trends, and Shapes, 417–60. John Wiley & Sons, Ltd. https://doi.org/10.1002/9781118150702.ch10.
Kleiber, Christian, and Achim Zeileis. 2016. “Visualizing Count Data Regressions Using Rootograms.” The American Statistician 70 (3): 296–303. https://doi.org/10.1080/00031305.2016.1173590.
Klein, Nadja, Thomas Kneib, Stefan Lang, and Alexander Sohn. 2015. “Bayesian Structured Additive Distributional Regression with an Application to Regional Income Inequality in Germany.” Annals of Applied Statistics 9: 1024–52. https://doi.org/10.1214/15-aoas823.
Smith, J. Q. 1985. “Diagnostic Checks of Non-Standard Time Series Models.” Journal of Forecasting 4 (3): 283–91. https://doi.org/10.1002/for.3980040305.
Tukey, John W. 1977. Exploratory Data Analysis. Addison-Wesley.
Warton, David I., Loïc Thibaut, and Yi Alice Wang. 2017. “The PIT-Trap—A ‘Model-Free’ Bootstrap Procedure for Inference about Regression Models with Discrete, Multivariate Responses.” PLOS ONE 12 (7): 1–18. https://doi.org/10.1371/journal.pone.0181790.