This vignette provides two real cases discussing the graphical evaluation of the marginal and probabilistic calibration of probabilistic regression models. The first example provides a short case study modeling the goals at the 2018 FIFA World Cup employing a Poisson distribution. It is a direct extension of the vignette “The Poisson Distribution: From Basic Probability Theory to Regression Models” given in distributions3. The second example models 3 day-accumulated precipitation sums using a Logistic distribution censored at zero accounting for non-negative precipitation sums. The use case builds heavily on the vignette “Heteroscedastic Censored and Truncated Regression with crch” given in crch by Messner, Mayr, and Zeileis (2016).

Goals in the 2018 FIFA World Cup

This use-case employs the Poisson distribution for modeling count data along predicted probabilities for the number of goals in soccer matches from the 2018 FIFA World Cup. The full analysis with an illustrative introduction from basic probability theory to regression models using the R package distributions3 is given here.

To investigate the number of goals scored per match in the 2018 FIFA World Cup, the FIFA2018 data set provides two rows, one for each team, for each of the 64 matches during the tournament. In the following, we treat the goals scored by the two teams in the same match as independent which is a realistic assumption for this particular data set. We just remark briefly that there are also bivariate generalizations of the Poisson distribution that would allow for correlated observations but which are not considered here.

In addition to the goals, the data set provides some basic meta-information for the matches (an ID, team name abbreviations, type of match, group vs. knockout stage) as well as some further covariates that we will revisit later in this document. The data looks like this:

data("FIFA2018", package = "distributions3")
head(FIFA2018)
##   goals team match type stage logability difference
## 1     5  RUS     1    A group  0.1530732  0.8638406
## 2     0  KSA     1    A group -0.7107673 -0.8638406
## 3     0  EGY     2    A group -0.2066409 -0.4438080
## 4     1  URU     2    A group  0.2371671  0.4438080
## 5     3  RUS     3    A group  0.1530732  0.3597142
## 6     1  EGY     3    A group -0.2066409 -0.3597142

For now, we will focus on the goals variable only. A brief summary yields

summary(FIFA2018$goals)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.297   2.000   6.000

showing that the teams scored between \(0\) and \(6\) goals per match with an average of \(\bar y = 1.297\) from the observations \(y_i\) (\(i = 1, \dots, 128\)). The corresponding table of observed relative frequencies are:

prop.table(table(FIFA2018$goals))
## 
##         0         1         2         3         4         5         6 
## 0.2578125 0.3750000 0.2500000 0.0781250 0.0156250 0.0156250 0.0078125

(Note that in recent versions of R using proportions() rather than prop.table() is recommended.)

This confirms that goals are relatively rare events in a soccer game with each team scoring zero to two goals per match in almost 90 percent of the matches. Below we show that this observed frequency distribution can be approximated very well by a Poisson distribution which can subsequently be used to obtain predicted probabilities for the goals scored in a match.

From basic probability theory to regression models

In a first step, we simply assume that goals are scored with a constant mean over all teams and fit a single Poisson distribution for the number of goals. To do so, we obtain a point estimate of the Poisson parameter by using the empirical mean \(\hat \lambda = \bar y = 1.297\) and set up the corresponding distribution object:

p_const <- distributions3::Poisson(lambda = mean(FIFA2018$goals))
p_const
## [1] "Poisson distribution (lambda = 1.297)"

This actually corresponds to the maximum likelihood estimator for this distribution fitting a generalized linear model (GLM) to the data that links the expected number of goals per team/match \(\lambda_i\) to the linear predictor \(x_i^\top \beta\) with regressor vector \(x_i^\top\) and corresponding coefficient vector \(\beta\) using a log-link: \(\log(\lambda_i) = x_i^\top \beta\).

Here, in the simplest case fitting an intercept-only model without further regressors, the regressor vector can be written as \(x_i^\top = 1\) and the maximum likelihood estimator \(\hat \beta\) with corresponding inference, predictions, residuals, etc. can be obtained using the glm() function from base R with family = poisson:

m_ic <- glm(goals ~ 1, data = FIFA2018, family = poisson)
m_ic
## 
## Call:  glm(formula = goals ~ 1, family = poisson, data = FIFA2018)
## 
## Coefficients:
## (Intercept)  
##        0.26  
## 
## Degrees of Freedom: 127 Total (i.e. Null);  127 Residual
## Null Deviance:       144.2 
## Residual Deviance: 144.2     AIC: 372.9

The corresponding prediction for the number of goals can be obtained manually from the extracted coef() by applying exp() (as the inverse of the log-link).

lambda_zero <- exp(coef(m_ic)[1])
lambda_zero
## (Intercept) 
##    1.296875

Or equivalently the predict() function can be used with type = "response" in order to get the expected \(\hat \lambda_i\) (rather than just the linear predictor \(x_i^\top \hat \beta\) that is predicted by default).

predict(m_ic, newdata = data.frame(difference = 0), type = "response")
##        1 
## 1.296875

This yields the same predicted Poisson distribution, identical to \(p_const\), for all teams and matches in this tournament – most likely an incorrect assumption:

p_ic <- distributions3::Poisson(lambda = fitted(m_ic))
head(p_ic)
##                                       1                                       2 
## "Poisson distribution (lambda = 1.297)" "Poisson distribution (lambda = 1.297)" 
##                                       3                                       4 
## "Poisson distribution (lambda = 1.297)" "Poisson distribution (lambda = 1.297)" 
##                                       5                                       6 
## "Poisson distribution (lambda = 1.297)" "Poisson distribution (lambda = 1.297)"

To account for different expected performances from the teams in the 2018 FIFA World Cup, the FIFA2018 data provides an estimated logability for each team. These have been estimated by Zeileis, Leitner, and Hornik (2018) prior to the start of the tournament (2018-05-20) based on quoted odds from 26 online bookmakers using the bookmaker consensus model of Leitner, Zeileis, and Hornik (2010). The difference in logability between a team and its opponent is a useful predictor for the number of goals scored.

Hence, the intercept-only model m_ic can be extended by using the regressor vector \(x_i^\top = (1, \mathtt{difference}_i)\):

m_reg <- glm(goals ~ difference, data = FIFA2018, family = poisson)
m_reg
## 
## Call:  glm(formula = goals ~ difference, family = poisson, data = FIFA2018)
## 
## Coefficients:
## (Intercept)   difference  
##      0.2127       0.4134  
## 
## Degrees of Freedom: 127 Total (i.e. Null);  126 Residual
## Null Deviance:       144.2 
## Residual Deviance: 128.7     AIC: 359.4

Here, the slope of \(0.413\) can be interpreted as an ability elasticity of the number of goals scored. This is because the difference of the log-abilities can also be understood as the log of the ability ratio. Thus, when the ability ratio increases by \(1\) percent, the expected number of goals increases approximately by \(0.413\) percent.

In comparison to the intercept-only model m_ic, this yields a more realistic outcome providing a different predicted Poisson distribution for each team/match in the tournament. We can set up the vector of all \(128\) Poisson() distribution objects by extracting the vector of all fitted point estimates \((\hat \lambda_1, \dots, \hat \lambda_{128})^\top\):

p_reg <- distributions3::Poisson(lambda = fitted(m_reg))
head(p_reg)
##                                        1 
## "Poisson distribution (lambda = 1.7680)" 
##                                        2 
## "Poisson distribution (lambda = 0.8655)" 
##                                        3 
## "Poisson distribution (lambda = 1.0297)" 
##                                        4 
## "Poisson distribution (lambda = 1.4862)" 
##                                        5 
## "Poisson distribution (lambda = 1.4354)" 
##                                        6 
## "Poisson distribution (lambda = 1.0661)"

Model evaluation: Marginal calibration

Marginal calibration is generally concerned with whether the observed frequencies match the frequencies expected by the model. For discrete observations, frequencies for the observations themselves can be considered; for continuous observations or more generally, frequencies for intervals of observations are being used. Here, the expected frequencies are computed by differences between the predictive CDFs \(F( \cdot )\), evaluated at the interval breaks. Hence, marginal calibration is always obtained on the observation scale compared to the probabilistic calibration performed on the probability scale. This makes it especially useful for evaluating count data models which predictions are often close to zero.

For evaluating the marginal calibration of the Poisson models for the number of goals scored, we compute the expected absolute frequencies by averaging across the expectations per team/match from both regression model and compare these with the observed absolute frequencies:

observed <- table(FIFA2018$goals)
expected_ic <- distributions3::pdf(p_ic, 0:6)
expected_ic <- colMeans(expected_ic) * sum(observed)
expected_reg <- distributions3::pdf(p_reg, 0:6)
expected_reg <- colMeans(expected_reg) * sum(observed)
cbind(observed, expected_ic, expected_reg) 
##   observed expected_ic expected_reg
## 0       33  34.9932527    37.679930
## 1       48  45.3818746    43.541948
## 2       32  29.4273093    27.450378
## 3       10  12.7211806    12.574218
## 4        2   4.1244453     4.684102
## 5        2   1.0697780     1.501012
## 6        1   0.2312281     0.426588

Graphically, this comparison can be shown as a so-called rootogram (Kleiber and Zeileis 2016) where the gray bars represent the observed frequencies overlayed by the expected frequencies as a red line. By using the function rootogram(), we can reproduce the previous results plugging in the model objects m_ic and m_reg:

r1 <- rootogram(m_ic, fitted = TRUE, style = "standing", scale = "raw", ref = FALSE, plot = FALSE)
r2 <- rootogram(m_reg, fitted = TRUE, style = "standing", scale = "raw", ref = FALSE, plot = FALSE)
c(r1, r2)
## A `rootogram` object with `scale = 'raw'` and `style = 'standing'`
## 
##    observed   expected mid width group
## 1        33 34.9932527   0   0.9     1
## 2        48 45.3818746   1   0.9     1
## 3        32 29.4273093   2   0.9     1
## 4        10 12.7211806   3   0.9     1
## 5         2  4.1244453   4   0.9     1
## 6         2  1.0697780   5   0.9     1
## 7         1  0.2312281   6   0.9     1
## 8        33 37.6799296   0   0.9     2
## 9        48 43.5419481   1   0.9     2
## 10       32 27.4503775   2   0.9     2
## 11       10 12.5742178   3   0.9     2
## 12        2  4.6841019   4   0.9     2
## 13        2  1.5010116   5   0.9     2
## 14        1  0.4265880   6   0.9     2
ggplot2::autoplot(c(r1, r2))

Further, in order to stabilize the variances of the discrepancies, i.e., putting more focus on discrepancies for small frequencies, we can show the frequencies on a square root scale:

ggplot2::autoplot(c(r1, r2), scale = "sqrt")  # FIXME: (ML) names got lost

In a second step, we employ a so-called hanging rootogram where bars representing the square-root of the observed frequencies are “hanging” from the square-root of the expected frequencies in the red line. Here, the offset around the x-axis compared to a reference line clearly shows the difference between the two frequencies for the intercept-only model m_ic and the full probabilistic model m_reg. For both models the differences are reasonably close to zero indicating a rather good marginal fit:

ggplot2::autoplot(c(r1, r2), scale = "sqrt", style = "hanging", ref = TRUE)  # FIXME: (ML) names got lost

So far we have only evaluated the marginal calibration comparing whether the observed frequencies match the frequencies expected by estimated model on the observational scale. In the next step we will evaluate the probabilistic calibration which is performed on the probability scale.

Model evaluation: Probabilistic calibration

Probabilistic calibration is usually assessed using probability integral transform (PIT) values (Dawid 1984; Diebold, Gunther, and Tay 1998; Gneiting, Balabdaoui, and Raftery 2007) or so-called PIT residuals (Warton 2017). These are simply the predictive cumulative distribution function (CDF) evaluated at the observations

\[u_i = F(y_i | \, \hat{\boldsymbol{\theta}}_i),\]

where \(F( \cdot )\) denotes the CDF of the modeled distribution \(f( \cdot )\) with estimated parameters \(\hat{\boldsymbol{\theta}}_{i}\). PIT residuals have the desirable property, that if the model is a good approximation to the true data-generating process, i.e., the observation is drawn from the predictive distribution, the PIT residuals \(u_i\) are approximately uniformly distributed on \([0, 1]\) for continuous predictive distributions \(F( \cdot )\). Plotting the histogram of the PIT residuals and checking for uniformity is therefore a common empirical way of checking for calibration:

p1 <- pithist(m_reg)

To focus more on violations of the distributional assumption in the tails of the distribution, PIT residuals can also be transformed to the normal scale or another scale if preferred. This allows the extreme values to be spread more widely, so that less values are included in the outer intervals. However, this has the drawback that detecting misspecification is more complicated than checking for uniformity:

p2 <- pithist(m_reg, trafo = qnorm, type = "random", simint = FALSE) # `type = "expected"` does not work, check!

Discretization, i.e., the representation of values in intervals, has the advantage that it usually improves the readability of graphical representations. However, discretization can also hide certain misspecifications: breakpoints are often quite arbitrary and can mask misspecifications of relatively few but extreme values in the outer tails of the distribution. Here, classical Q-Q plots are clearly superior to PIT histograms:

q1 <- qqrplot(m_reg, confint = "line", simint = FALSE)

Since small to medium deviations are nevertheless difficult to detect in Q-Q plots, untilting the plot makes detecting pattern of departure from a now horizontal line much easier. Here, expected quantiles are subtracted from the observed quantiles:

w1 <- wormplot(m_reg, confint = "line", simint = FALSE)

This so-called de-trended Q-Q plot is best known as worm plot (Buuren and Fredriks 2001).

Precipitation forecasts

This use-case discusses a weather forecast example application of censored regression models. The example is taken from the vignette “Heteroscedastic Censored and Truncated Regression with crch” provided for the package crch by Messner, Mayr, and Zeileis (2016).

Weather forecasts are usually based on numerical weather prediction (NWP) models, which take the current state of the atmosphere and calculate future weather by numerically simulating the main atmospheric processes. However, due to uncertain initial conditions and unknown or unresolved processes, these numerical predictions are always subject to errors. To estimate these errors, many weather centers produce what are called ensemble forecasts: multiple NWP runs that use different initial conditions and model formulations. Unfortunately, these ensemble forecasts cannot account for all sources of error, so they are often still biased and uncalibrated. Therefore, they are often calibrated and corrected for systematic errors through statistical post-processing.

One popular post-processing method is heteroscedastic linear regression where the ensemble mean is used as regressor for the location and the ensemble standard deviation or variance is used as regressor for the scale (Gneiting et al. 2005). The following example applies heteroscedastic censored regression with a logistic distribution assumption to precipitation data in Innsbruck (Austria).

The RainIbk data set contains observed 3 day-accumulated precipitation amounts (rain) and the corresponding 11 member ensemble forecasts of total accumulated precipitation amount between 5 and 8 days in advance (rainfc.1, rainfc.2, … rainfc.11). In previous studies it has been shown that it is of advantage to model the square root of precipitation rather than precipitation itself. Thus all precipitation amounts are square rooted before ensemble mean and standard deviation are derived. Furthermore, events with no variation in the ensemble are omitted:

data("RainIbk", package = "crch")

RainIbk <- sqrt(RainIbk)
RainIbk$ensmean <- apply(RainIbk[,grep('^rainfc',names(RainIbk))], 1, mean)
RainIbk$enssd <- apply(RainIbk[,grep('^rainfc',names(RainIbk))], 1, sd)
RainIbk <- subset(RainIbk, enssd > 0)

From linear regression to distributional regression models

For comparison we fit a homoscedastic linear regression model using the least squares approach for rain with ensmean as regressor. As a more appropriate model for rain accounting for non-negative precipitation sums, we fit a hetereoscedastic censored model assuming an underlying logistic response distribution with ensmean as regressor for the location and log(enssd) as regressor for the scale:

## linear model
m_lm <- lm(rain ~ ensmean, data = RainIbk)

## heteroscedastic censored regression with a logistic distribution assumption
m_hclog <- crch::crch(rain ~ ensmean | log(enssd), data = RainIbk, left = 0,
  dist = "logistic")

In the scatterplot of rain against ensmean it can be seen, that precipitation is clearly non-negative with many zero observations. Thus the censored regression model m_hclog is more suitable than the linear regression model to estimate the underlying relationship:

plot(rain ~ ensmean, data = RainIbk, pch = 19, col = gray(0, alpha = 0.2))
abline(coef(m_lm)[1:2], col = 3, lwd = 2)
abline(coef(m_hclog)[1:2], col = 4, lwd = 2)

legend("topright", lwd = c(2, 1), lty = c(1, 1), col = c(3, 4), 
  c("m_lm", "m_hclog"), bty = "n")

Model evaluation: Marginal calibration

The significant underfitting of zero observations is also very apparent in the rootogramm of the linear model, as it does not correctly account for the point mass at zero. Additionally, a weak wavelike pattern indicates a slight overfitting of precipitation sums between zero and 5mm and an underfitting of precipitation sums above. In contrast, the censored logistic regression m_hclog provides a pretty good marginal fit, where the expected squared frequencies closely match the squared observed frequencies:

r1 <- rootogram(m_lm, plot = FALSE)
r2 <- rootogram(m_hclog, plot = FALSE)

ggplot2::autoplot(c(r1, r2))

Model evaluation: Probabilistic calibration

The PIT histogram of the linear model m_lm is skewed, indicating a misfit in terms of the probabilistic calibration, whereas the PIT histogram of the linear model is rather uniformly distributed.

p1 <- pithist(m_lm, plot = FALSE)
p2 <- pithist(m_hclog, plot = FALSE)

ggplot2::autoplot(c(p1, p2))

For a better comparison of the two model fits, alternatively a line-style PIT histogram can also be shown:

ggplot2::autoplot(c(p1, p2), colour = c(3, 4), style = "line", single_graph = TRUE, confint_col = 1, confint = "line", ref = FALSE)

To further analyse the misfit of the linear model in terms of probabilistic calibration, we transform the PIT residuals to the normal scale and increase the number of breaks in the PIT histogram. Transforming to the normal scale spreads the values at the tails of the distribution further apart, and increasing the breaks prevents potential masking of small scale patterns.

We can see that the distribution of PIT residuals is right skewed with more values between -2 and 0, but less values between 0 and 2 compared to the standard normal distribution:

pithist(m_lm, breaks = 50, trafo = qnorm, xlim = c(-3, 3))

This is also reflected in the pattern of the Q-Q plot of the linear model: Due to the high frequency in values between -2 and 0, the observed quantile residuals increase slower in this region relative to the standard normal quantiles. However, the low frequency of quantile residuals above zero leads to an higher increase in quantile residuals compared to the standard normal quantiles.

In summary, the Q-Q plot of the linear model has a positive curvature, bending up at both ends of the distributional tails. The censored logistic regression m_hclog follows more or less the reference line indicating normally distributed quantile residuals:

q1 <- qqrplot(m_lm, plot = FALSE)
q2 <- qqrplot(m_hclog, plot = FALSE)
ggplot2::autoplot(c(q1, q2), colour = c(3, 4), alpha = 0.8, single_graph = TRUE, simint = FALSE, confint ="line")

Consistent with the pattern of right-skewed PIT residuals corresponding to a “too left-skewed” fitted distribution, a U-shaped pattern is evident in the worm plot of the linear model; the censored logistic is more or less flat around zero.

w1 <- wormplot(m_lm, plot = FALSE)
w2 <- wormplot(m_hclog, plot = FALSE)
ggplot2::autoplot(c(w1, w2), colour = c(3, 4), alpha = 0.8, single_graph = TRUE, simint = FALSE, confint ="line")

References

Buuren, Stef van, and Miranda Fredriks. 2001. “Worm Plot: A Simple Diagnostic Device for Modelling Growth Reference Curves.” Statistics in Medicine 20 (8): 1259–77. https://doi.org/10.1002/sim.746.

Dawid, A. P. 1984. “Present Position and Potential Developments: Some Personal Views: Statistical Theory: The Prequential Approach.” Journal of the Royal Statistical Society: Series A (General) 147 (2): 278–92. https://doi.org/10.2307/2981683.

Diebold, Francis X., Todd A. Gunther, and Anthony S. Tay. 1998. “Evaluating Density Forecasts with Applications to Financial Risk Management.” International Economic Review 39 (4): 863–83. https://doi.org/10.2307/2527342.

Gneiting, Tilmann, Fadoua Balabdaoui, and Adrian E. Raftery. 2007. “Probabilistic Forecasts, Calibration and Sharpness.” Journal of the Royal Statistical Society: Series B (Methodological) 69 (2): 243–68. https://doi.org/10.1111/j.1467-9868.2007.00587.x.

Gneiting, Tilmann, Adrian E. Raftery, Anton H. Westveld, and Tom Goldman. 2005. “Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation.” Monthly Weather Review 133 (5): 1098–1118. https://doi.org/http://dx.doi.org/10.1175/MWR2904.1.

Kleiber, Christian, and Achim Zeileis. 2016. “Visualizing Count Data Regressions Using Rootograms.” The American Statistician 70 (3): 296–303. https://doi.org/10.1080/00031305.2016.1173590.

Leitner, Christoph, Achim Zeileis, and Kurt Hornik. 2010. “Forecasting Sports Tournaments by Ratings of (Prob)abilities: A Comparison for the EURO 2008.” International Journal of Forecasting 26 (3): 471–81. https://doi.org/10.1016/j.ijforecast.2009.10.001.

Messner, Jakob W., Georg J. Mayr, and Achim Zeileis. 2016. “Heteroscedastic Censored and Truncated Regression with crch.” The R Journal 8 (1): 173–81. https://doi.org/10.32614/RJ-2016-012.

Warton, Loïc AND Wang, David I. AND Thibaut. 2017. “The Pit-Trap—a ‘Model-Free’ Bootstrap Procedure for Inference About Regression Models with Discrete, Multivariate Responses.” PLOS ONE 12 (7): 1–18. https://doi.org/10.1371/journal.pone.0181790.

Zeileis, Achim, Christoph Leitner, and Kurt Hornik. 2018. “Probabilistic Forecasts for the 2018 FIFA World Cup Based on the Bookmaker Consensus Model.” Working Paper 2018-09. Working Papers in Economics; Statistics, Research Platform Empirical; Experimental Economics, Universität Innsbruck. https://EconPapers.RePEc.org/RePEc:inn:wpaper:2018-09.