Probabilistic Forecasting Infrastructure

1 Overview

The functionality provided by the topmodels package can be broadly placed into three groups:

Numeric quantities: Functions which, based on fitted model objects (like lm, glm, etc.), compute quantities of interest, such as predicted probabilities, quantiles, residuals, etc.
Visualizations: Functions which can help assess goodness of fit for fitted model objects, leveraging the numeric quantities from above.
Under the hood: Functions which can extract/predict probability distributions as proper S3 objects and provide standard methods for working with these distributions.

The S3 framework for probability distributions (3. above) is actually set up in the distributions3 package that topmodels builds upon. Actually, all numeric and graphical functions (1. and 2. above) have “smart” default methods. This means that if necessary methods can be defined for them but all the default methods work out of the box if the distributions3 functionality from 3. is provided. The table below shows all of the functions that become directly available when interfacing the distributions3 infrastructure.

Function	Description
Numeric quantities
`procast()`	Probabilistic forecasts (probabilities, quantiles, etc.) based on model objects
`proscore()`	Evaluate scoring rules for procasts
`proresiduals()`	Residuals for probabilistic regression models (quantile, PIT, Pearson, …)
Visualizations
`pithist()`	PIT histograms
`qqrplot()`	Q-Q plots for quantile residuals
`wormplot()`	Worm plots for quantile residuals
`rootogram()`	Rootograms of observed and fitted frequencies
`reliagram()`	(Extended) reliability diagrams
Under the hood
`prodist()`	Fitted and predicated probability distributions based on model objects
`Normal()`,	Probability distribution objects
`Poisson()`, …	With methods `pdf()`, `cdf()`, `quantile()`, `mean()`, …

Thus, to connect a new class of models to the topmodels tools the following building blocks need to be provided:

A class with all necessary methods for the probability distribution of the response variable.
A prodist() method for the model object, which typically first predicts the parameters and then sets up the distribution object from a.

Below we illustrate both steps by setting up a so-called “Tobit” model (Tobin 1958). This is a model with a normally-distributed response, left-censored at zero, which can in R be fitted with the crch() function (Messner, Mayr, and Zeileis 2016) from the package of the same name (among other packages). For illustration we first set up a new Tobit() distributions object (which is a special case of the CensoredNormal() class provided in crch). Subsequently, we add a prodist() method for crch objects.

2 Adding a new distribution

To illustrate how to set up a distributions3 constructor function, we employ the “Tobit” model, i.e., a normal distribution, left-censored at zero. Note that in practice we could just use the CensoredNormal() distribution provided by the crch() package (with arbitrary left and/or right censoring) as well as the underlying dcnorm(), pcnorm(), qcnorm(), and rcnorm() functions. However, here we avoid doing so and set up the Tobit() class constructor and accompanying methods from scratch, just for illustration.

2.1 Class constructor

First, the constructor (or generator) function for the distribution object should set up a data frame, containing the distribution’s parameters, with a custom class "Tobit" inheriting from "distribution". In case of the Tobit distribution, the parameters are the mean mu \(= \mu\) and standard deviation sigma \(= \sigma\) of the underlying uncensored normal distribution.

Tobit <- function(mu = 0, sigma = 1) {
  n <- c(length(mu), length(sigma))
  stopifnot("parameter lengths do not match (only scalars are allowed to be recycled)" =
    all(n %in% c(1L, max(n))))
  d <- data.frame(mu = mu, sigma = sigma)
  class(d) <- c("Tobit", "distribution")
  return(d)
}

This is already sufficient for setting up a vector Y containing three different Tobit distributions:

Y <- Tobit(mu = 1:3, sigma = c(1, 1, 4))
Y

[1] "Tobit(mu = 1, sigma = 1)" "Tobit(mu = 2, sigma = 1)"
[3] "Tobit(mu = 3, sigma = 4)"

Printing, subsetting, and some coercion functions already work due to suitable, flexible methods for "distribution" objects in general.

length(Y)

[1] 3

Y[-2]

[1] "Tobit(mu = 1, sigma = 1)" "Tobit(mu = 3, sigma = 4)"

as.matrix(Y)

     mu sigma
[1,]  1     1
[2,]  2     1
[3,]  3     4

2.2 Methods

Having constructed a distributions object like the "Tobit" object Y above, the next step is to perform standard tasks on it, such as computing densities, probabilities, moments, etc. While in base R the familiar functions of type ddist() (density), pdist() (cumulative probability), qdist() (quantile), and rdist() (random numbers) are typically used (e.g., where dist = norm or pois etc.), the distributions3 employs an object-oriented approach. Thus, rather than having the specification of the distribution as part of the function name, it is captured in the class of the object. Based on that suitable generic functions like pdf() (density), cdf() (cumulative probability), etc. can be provided for each distribution class. See the table below for an overview of generic functions and their purpose.

Function	Package	Description
Distributional
`pdf()`	`distributions3`	Probability density function (or probability mass function, typically via `ddist()`)
`log_pdf()`	`distributions3`	Log-density (or log-likelihood, typically via `ddist(..., log = TRUE)`)
`cdf()`	`distributions3`	Cumulative distribution function (typically via `pdist()`)
`quantile()`	`stats`	Compute quantiles (typically via `qdist()`)
`random()`	`distributions3`	Simulate random samples (typically via `rdist()`)
`crps()`	`scoringRules`	(Continuous) ranked probability scored (typically via `crps_dist()`)
Moments
`mean()`	`base`	Expectation
`variance()`	`distributions3`	Variance
`skewness()`	`distributions3`	Skewness
`kurtosis()`	`distributions3`	Excess kurtosis
Support
`support()`	`distributions3`	Maximum and minimum of the support of the probability distribution
`is_discrete()`	`distributions3`	Determine whether a distribution is discrete on its support
`is_continuous()`	`distributions3`	Determine whether a distribution is continuous on its support

In the following we just show how the mean() and the cdf() method can be set up and applied. Some further details are provided at the end of this section, see also the underlying source code in the packages.

First, the extractor functions for moments of the distribution can be typically be defined as functions of the list of parameters stored in the object:

mean.Tobit <- function(x, ...) {
  m <- x$mu * pnorm(x$mu/x$sigma) + x$sigma * dnorm(x$mu/x$sigma)
  setNames(m, names(x))
}

In a Tobit distribution the expectation is that of the underlying uncensored normal distribution (mu), suitably adjusted with the cumulative distribution function and probability density function of the uncensored distribution. For the three distributions in Y this yields:

mean(Y)

[1] 1.083 2.008 3.525

In the first two the effect of censoring is rather small, while in the third distribution it is more pronounced (due to the higher variance).

For evaluating the usual d/p/q/r functions in a standardized way, taking care of dimensions and naming etc., the distributions3 package provides the auxiliary function apply_dpqr():

cdf.Tobit <- function(d, x, drop = TRUE, elementwise = NULL, lower.tail = TRUE, log.p = FALSE, ...) {
  FUN <- function(at, d) {
    p <- pnorm(at, mean = d$mu, sd = d$sigma, lower.tail = lower.tail, log.p = log.p)
    p[rep_len(at, length(p)) < 0] <- if(lower.tail) {
      if(log.p) -Inf else 0
    } else {
      if(log.p) 0 else 1
    }
    p
  }
  apply_dpqr(d = d, FUN = FUN, at = x, type = "probability", elementwise = elementwise, drop = drop)
}

This can be used to evaluate all distribution functions at the same argument (returning a vector by default):

cdf(Y, 0)

[1] 0.15866 0.02275 0.22663

Or to evaluate all distribution functions at several arguments (returning a matrix by default):

cdf(Y, c(0, 5))

         p_0    p_5
[1,] 0.15866 1.0000
[2,] 0.02275 0.9987
[3,] 0.22663 0.6915

Finally, if the distribution and the argument have the same length, the default is to do the evaluating elementwise (always returning a vector):

cdf(Y, 2:0)

[1] 0.8413 0.1587 0.2266

cdf(Y, 2:0, elementwise = TRUE)

[1] 0.8413 0.1587 0.2266

But the evaluation can also be forced to be carried out for each combination of distribution and argument (always returning a matrix):

cdf(Y, 2:0, elementwise = FALSE)

        p_2    p_1     p_0
[1,] 0.8413 0.5000 0.15866
[2,] 0.5000 0.1587 0.02275
[3,] 0.4013 0.3085 0.22663

These two additional methods are sufficient for the following illustrations. More details are provided below at the end of this section.

3 Adding a new model

When setting up the procasting infrastructure, the heavy lifting is done with the creation of a suitable distribution class and methods. Interfacing a new model requires only a new prodist() method for setting up the probability distribution object based on a fitted model (and potentially a newdata set). The idea is to extract or predict the distribution parameters (mu and sigma in case of the Tobit distribution) and subsequently call the distribution class constructor (preserving observation names, if any).

prodist.crch <- function(object, newdata = NULL, na.action = na.pass, ...) {
  par <- predict(object, newdata = newdata, na.action = na.action, type = "parameter", ...)
  Tobit(mu = setNames(par$location, rownames(par)), sigma = par$scale)
}

To illustrate how this can be used in practice we fit a heteroscedastic Tobit model using the crch() function to a precipitation dataset from Innsbruck, Austria. As precipitation is often zero and never can become negative, a Tobit model is typically a good starting point for probabilistic modeling and forecasting. (See FIXME for more details.)

The data can be preprocessed in the following way:

data("RainIbk", package = "crch")
RainIbk <- sqrt(RainIbk)
RainIbk$ensmean <- apply(RainIbk[, grep('^rainfc', names(RainIbk))], 1, mean)
RainIbk$enssd   <- apply(RainIbk[, grep('^rainfc', names(RainIbk))], 1, sd)
RainIbk <- subset(RainIbk, enssd > 0)

And the model is fitted via:

library("crch")
m <- crch(rain ~ ensmean | log(enssd), data = RainIbk, left = 0)

Then, the predicted Tobit distributions for the first two observations can be obtained with the prodist() method obtained above:

RainIbk2 <- head(RainIbk, 2)
prodist(m, newdata = RainIbk2)

                       2000-01-04                        2000-01-05 
"Tobit(mu = 1.205, sigma = 2.16)" "Tobit(mu = 0.565, sigma = 1.98)"

Then, the mean() or cdf() method could be applied to this distribution vector. Alternatively, instead of calling these methods “by hand”, we can use the convenience procast() method to do so:

procast(m, newdata = RainIbk2, type = "mean")

            mean
2000-01-04 1.597
2000-01-05 1.105

procast(m, newdata = RainIbk2, type = "cdf", at = 0)

           probability
2000-01-04      0.2888
2000-01-05      0.3878

Instead of "mean" and "cdf", we could also use the aliases "response" and "probability", respectively.

Similarly, other functions such as proscore(), proresiduals(), pithist(), or rootogram() can be used once all of the methods from the table above are defined.

4 Further details

Most of the remaining methods for the distribution objects follow the same structure as the mean() and cdf() method above, respectively. However, the random() and the support-related methods are slightly different. Hence, these are also briefly illustrated in the following.

Drawing random() samples also uses apply_dpqr() with the argument n which is assured to always be a positive integer.

random.Tobit <- function(x, n = 1L, drop = TRUE, ...) {
  n <- make_positive_integer(n)
  if (n == 0L) {
    return(numeric(0L))
  }
  FUN <- function(at, d) {
    y <- rnorm(n = at, mean = d$mu, sd = d$sigma)
    y[y < 0] <- 0
    y
  }
  apply_dpqr(d = x, FUN = FUN, at = n, type = "random", drop = drop)
}

If drop = TRUE (the default), then one random sample for each distribution yields a vector, while several random samples for each distribution yields a matrix:

random(Y, 1)

[1] 0.03797 1.18499 5.94178

random(Y, 3)

       r_1   r_2    r_3
[1,] 1.487 1.593 0.6649
[2,] 2.136 2.136 0.3355
[3,] 1.784 7.920 0.4718

The support() method should return a matrix of "min" and "max" for the distribution. The make_support() function helps to set the right names and dimension.

support.Tobit <- function(d, drop = TRUE, ...) {
  min <- rep(0, length(d))
  max <- rep(Inf, length(d))
  make_support(min, max, d, drop = drop)
}
support(Y)

     min max
[1,]   0 Inf
[2,]   0 Inf
[3,]   0 Inf

Finally, the is_discrete() and is_continuous() methods should return TRUE for distributions that are discrete or continuous, respectively, on the entire support. Thus, for mixed discrete-continuous distributions like the Tobit, both methods should return FALSE:

is_discrete.Tobit <- function(d, ...) {
  setNames(rep.int(FALSE, length(d)), names(d))
}
is_continuous.Tobit <- function(d, ...) {
  setNames(rep.int(TRUE, length(d)), names(d))
}
is_discrete(Y)

[1] FALSE FALSE FALSE

is_continuous(Y)

[1] TRUE TRUE TRUE

5 References

Messner, Jakob W., Georg J. Mayr, and Achim Zeileis. 2016. “Heteroscedastic Censored and Truncated Regression with crch.” The R Journal 8 (1): 173–81. https://doi.org/10.32614/RJ-2016-012.

Tobin, J. 1958. “Estimation of Relationships for Limited Dependent Variables.” Econometrica 26 (1): 24–36. https://doi.org/10.2307/1907382.