Probabilistic Forecasting Infrastructure

1 Overview

The functionality provided by the topmodels package can be broadly placed into three groups:

  1. Numeric quantities: Functions which, based on fitted model objects (like lm, glm, etc.), compute quantities of interest, such as predicted probabilities, quantiles, residuals, etc.
  2. Visualizations: Functions which can help assess goodness of fit for fitted model objects, leveraging the numeric quantities from above.
  3. Under the hood: Functions which can extract/predict probability distributions as proper S3 objects and provide standard methods for working with these distributions.

The S3 framework for probability distributions (3. above) is actually set up in the distributions3 package that topmodels builds upon. Actually, all numeric and graphical functions (1. and 2. above) have “smart” default methods. This means that if necessary methods can be defined for them but all the default methods work out of the box if the distributions3 functionality from 3. is provided. The table below shows all of the functions that become directly available when interfacing the distributions3 infrastructure.

Function Description
Numeric quantities
procast() Probabilistic forecasts (probabilities, quantiles, etc.) based on model objects
proscore() Evaluate scoring rules for procasts
proresiduals() Residuals for probabilistic regression models (quantile, PIT, Pearson, …)
Visualizations
pithist() PIT histograms
qqrplot() Q-Q plots for quantile residuals
wormplot() Worm plots for quantile residuals
rootogram() Rootograms of observed and fitted frequencies
reliagram() (Extended) reliability diagrams
Under the hood
prodist() Fitted and predicated probability distributions based on model objects
Normal(), Probability distribution objects
Poisson(), … With methods pdf(), cdf(), quantile(), mean(), …

Thus, to connect a new class of models to the topmodels tools the following building blocks need to be provided:

  1. A class with all necessary methods for the probability distribution of the response variable.
  2. A prodist() method for the model object, which typically first predicts the parameters and then sets up the distribution object from a.

Below we illustrate both steps by setting up a so-called “Tobit” model (Tobin 1958). This is a model with a normally-distributed response, left-censored at zero, which can in R be fitted with the crch() function (Messner, Mayr, and Zeileis 2016) from the package of the same name (among other packages). For illustration we first set up a new Tobit() distributions object (which is a special case of the CensoredNormal() class provided in crch). Subsequently, we add a prodist() method for crch objects.

2 Adding a new distribution

To illustrate how to set up a distributions3 constructor function, we employ the “Tobit” model, i.e., a normal distribution, left-censored at zero. Note that in practice we could just use the CensoredNormal() distribution provided by the crch() package (with arbitrary left and/or right censoring) as well as the underlying dcnorm(), pcnorm(), qcnorm(), and rcnorm() functions. However, here we avoid doing so and set up the Tobit() class constructor and accompanying methods from scratch, just for illustration.

2.1 Class constructor

First, the constructor (or generator) function for the distribution object should set up a data frame, containing the distribution’s parameters, with a custom class "Tobit" inheriting from "distribution". In case of the Tobit distribution, the parameters are the mean mu \(= \mu\) and standard deviation sigma \(= \sigma\) of the underlying uncensored normal distribution.

Tobit <- function(mu = 0, sigma = 1) {
  n <- c(length(mu), length(sigma))
  stopifnot("parameter lengths do not match (only scalars are allowed to be recycled)" =
    all(n %in% c(1L, max(n))))
  d <- data.frame(mu = mu, sigma = sigma)
  class(d) <- c("Tobit", "distribution")
  return(d)
}

This is already sufficient for setting up a vector Y containing three different Tobit distributions:

Y <- Tobit(mu = 1:3, sigma = c(1, 1, 4))
Y
[1] "Tobit(mu = 1, sigma = 1)" "Tobit(mu = 2, sigma = 1)"
[3] "Tobit(mu = 3, sigma = 4)"

Printing, subsetting, and some coercion functions already work due to suitable, flexible methods for "distribution" objects in general.

[1] 3
Y[-2]
[1] "Tobit(mu = 1, sigma = 1)" "Tobit(mu = 3, sigma = 4)"
     mu sigma
[1,]  1     1
[2,]  2     1
[3,]  3     4

2.2 Methods

Having constructed a distributions object like the "Tobit" object Y above, the next step is to perform standard tasks on it, such as computing densities, probabilities, moments, etc. While in base R the familiar functions of type ddist() (density), pdist() (cumulative probability), qdist() (quantile), and rdist() (random numbers) are typically used (e.g., where dist = norm or pois etc.), the distributions3 employs an object-oriented approach. Thus, rather than having the specification of the distribution as part of the function name, it is captured in the class of the object. Based on that suitable generic functions like pdf() (density), cdf() (cumulative probability), etc. can be provided for each distribution class. See the table below for an overview of generic functions and their purpose.

Function Package Description
Distributional
pdf() distributions3 Probability density function (or probability mass function, typically via ddist())
log_pdf() distributions3 Log-density (or log-likelihood, typically via ddist(..., log = TRUE))
cdf() distributions3 Cumulative distribution function (typically via pdist())
quantile() stats Compute quantiles (typically via qdist())
random() distributions3 Simulate random samples (typically via rdist())
crps() scoringRules (Continuous) ranked probability scored (typically via crps_dist())
Moments
mean() base Expectation
variance() distributions3 Variance
skewness() distributions3 Skewness
kurtosis() distributions3 Excess kurtosis
Support
support() distributions3 Maximum and minimum of the support of the probability distribution
is_discrete() distributions3 Determine whether a distribution is discrete on its support
is_continuous() distributions3 Determine whether a distribution is continuous on its support

In the following we just show how the mean() and the cdf() method can be set up and applied. Some further details are provided at the end of this section, see also the underlying source code in the packages.

First, the extractor functions for moments of the distribution can be typically be defined as functions of the list of parameters stored in the object:

mean.Tobit <- function(x, ...) {
  m <- x$mu * pnorm(x$mu/x$sigma) + x$sigma * dnorm(x$mu/x$sigma)
  setNames(m, names(x))
}

In a Tobit distribution the expectation is that of the underlying uncensored normal distribution (mu), suitably adjusted with the cumulative distribution function and probability density function of the uncensored distribution. For the three distributions in Y this yields:

mean(Y)
[1] 1.083 2.008 3.525

In the first two the effect of censoring is rather small, while in the third distribution it is more pronounced (due to the higher variance).

For evaluating the usual d/p/q/r functions in a standardized way, taking care of dimensions and naming etc., the distributions3 package provides the auxiliary function apply_dpqr():

cdf.Tobit <- function(d, x, drop = TRUE, elementwise = NULL, lower.tail = TRUE, log.p = FALSE, ...) {
  FUN <- function(at, d) {
    p <- pnorm(at, mean = d$mu, sd = d$sigma, lower.tail = lower.tail, log.p = log.p)
    p[rep_len(at, length(p)) < 0] <- if(lower.tail) {
      if(log.p) -Inf else 0
    } else {
      if(log.p) 0 else 1
    }
    p
  }
  apply_dpqr(d = d, FUN = FUN, at = x, type = "probability", elementwise = elementwise, drop = drop)
}

This can be used to evaluate all distribution functions at the same argument (returning a vector by default):

cdf(Y, 0)
[1] 0.15866 0.02275 0.22663

Or to evaluate all distribution functions at several arguments (returning a matrix by default):

cdf(Y, c(0, 5))
         p_0    p_5
[1,] 0.15866 1.0000
[2,] 0.02275 0.9987
[3,] 0.22663 0.6915

Finally, if the distribution and the argument have the same length, the default is to do the evaluating elementwise (always returning a vector):

cdf(Y, 2:0)
[1] 0.8413 0.1587 0.2266
cdf(Y, 2:0, elementwise = TRUE)
[1] 0.8413 0.1587 0.2266

But the evaluation can also be forced to be carried out for each combination of distribution and argument (always returning a matrix):

cdf(Y, 2:0, elementwise = FALSE)
        p_2    p_1     p_0
[1,] 0.8413 0.5000 0.15866
[2,] 0.5000 0.1587 0.02275
[3,] 0.4013 0.3085 0.22663

These two additional methods are sufficient for the following illustrations. More details are provided below at the end of this section.

3 Adding a new model

When setting up the procasting infrastructure, the heavy lifting is done with the creation of a suitable distribution class and methods. Interfacing a new model requires only a new prodist() method for setting up the probability distribution object based on a fitted model (and potentially a newdata set). The idea is to extract or predict the distribution parameters (mu and sigma in case of the Tobit distribution) and subsequently call the distribution class constructor (preserving observation names, if any).

prodist.crch <- function(object, newdata = NULL, na.action = na.pass, ...) {
  par <- predict(object, newdata = newdata, na.action = na.action, type = "parameter", ...)
  Tobit(mu = setNames(par$location, rownames(par)), sigma = par$scale)
}

To illustrate how this can be used in practice we fit a heteroscedastic Tobit model using the crch() function to a precipitation dataset from Innsbruck, Austria. As precipitation is often zero and never can become negative, a Tobit model is typically a good starting point for probabilistic modeling and forecasting. (See FIXME for more details.)

The data can be preprocessed in the following way:

data("RainIbk", package = "crch")
RainIbk <- sqrt(RainIbk)
RainIbk$ensmean <- apply(RainIbk[, grep('^rainfc', names(RainIbk))], 1, mean)
RainIbk$enssd   <- apply(RainIbk[, grep('^rainfc', names(RainIbk))], 1, sd)
RainIbk <- subset(RainIbk, enssd > 0)

And the model is fitted via:

library("crch")
m <- crch(rain ~ ensmean | log(enssd), data = RainIbk, left = 0)

Then, the predicted Tobit distributions for the first two observations can be obtained with the prodist() method obtained above:

RainIbk2 <- head(RainIbk, 2)
prodist(m, newdata = RainIbk2)
                       2000-01-04                        2000-01-05 
"Tobit(mu = 1.205, sigma = 2.16)" "Tobit(mu = 0.565, sigma = 1.98)" 

Then, the mean() or cdf() method could be applied to this distribution vector. Alternatively, instead of calling these methods “by hand”, we can use the convenience procast() method to do so:

procast(m, newdata = RainIbk2, type = "mean")
            mean
2000-01-04 1.597
2000-01-05 1.105
procast(m, newdata = RainIbk2, type = "cdf", at = 0)
           probability
2000-01-04      0.2888
2000-01-05      0.3878

Instead of "mean" and "cdf", we could also use the aliases "response" and "probability", respectively.

Similarly, other functions such as proscore(), proresiduals(), pithist(), or rootogram() can be used once all of the methods from the table above are defined.

4 Further details

Most of the remaining methods for the distribution objects follow the same structure as the mean() and cdf() method above, respectively. However, the random() and the support-related methods are slightly different. Hence, these are also briefly illustrated in the following.

Drawing random() samples also uses apply_dpqr() with the argument n which is assured to always be a positive integer.

random.Tobit <- function(x, n = 1L, drop = TRUE, ...) {
  n <- make_positive_integer(n)
  if (n == 0L) {
    return(numeric(0L))
  }
  FUN <- function(at, d) {
    y <- rnorm(n = at, mean = d$mu, sd = d$sigma)
    y[y < 0] <- 0
    y
  }
  apply_dpqr(d = x, FUN = FUN, at = n, type = "random", drop = drop)
}

If drop = TRUE (the default), then one random sample for each distribution yields a vector, while several random samples for each distribution yields a matrix:

random(Y, 1)
[1] 0.000 2.593 4.435
random(Y, 3)
       r_1    r_2   r_3
[1,] 2.902 2.2861 1.117
[2,] 1.446 0.3598 0.000
[3,] 0.000 0.0000 0.000

The support() method should return a matrix of "min" and "max" for the distribution. The make_support() function helps to set the right names and dimension.

support.Tobit <- function(d, drop = TRUE, ...) {
  min <- rep(0, length(d))
  max <- rep(Inf, length(d))
  make_support(min, max, d, drop = drop)
}
support(Y)
     min max
[1,]   0 Inf
[2,]   0 Inf
[3,]   0 Inf

Finally, the is_discrete() and is_continuous() methods should return TRUE for distributions that are discrete or continuous, respectively, on the entire support. Thus, for mixed discrete-continuous distributions like the Tobit, both methods should return FALSE:

is_discrete.Tobit <- function(d, ...) {
  setNames(rep.int(FALSE, length(d)), names(d))
}
is_continuous.Tobit <- function(d, ...) {
  setNames(rep.int(TRUE, length(d)), names(d))
}
is_discrete(Y)
[1] FALSE FALSE FALSE
is_continuous(Y)
[1] TRUE TRUE TRUE

5 References

Messner, Jakob W., Georg J. Mayr, and Achim Zeileis. 2016. “Heteroscedastic Censored and Truncated Regression with crch.” The R Journal 8 (1): 173–81. https://doi.org/10.32614/RJ-2016-012.
Tobin, J. 1958. “Estimation of Relationships for Limited Dependent Variables.” Econometrica 26 (1): 24–36. https://doi.org/10.2307/1907382.