Probabilistic Forecasting Infrastructure
1 Overview
The functionality provided by the topmodels
package can be broadly placed into three groups:
-
Numeric quantities: Functions which, based on fitted model objects (like
lm
,glm
, etc.), compute quantities of interest, such as predicted probabilities, quantiles, residuals, etc. - Visualizations: Functions which can help assess goodness of fit for fitted model objects, leveraging the numeric quantities from above.
- Under the hood: Functions which can extract/predict probability distributions as proper S3 objects and provide standard methods for working with these distributions.
The S3 framework for probability distributions (3. above) is actually set up in the distributions3
package that topmodels
builds upon. Actually, all numeric and graphical functions (1. and 2. above) have “smart” default methods. This means that if necessary methods can be defined for them but all the default methods work out of the box if the distributions3
functionality from 3. is provided. The table below shows all of the functions that become directly available when interfacing the distributions3
infrastructure.
Function | Description |
---|---|
Numeric quantities | |
procast() |
Probabilistic forecasts (probabilities, quantiles, etc.) based on model objects |
proscore() |
Evaluate scoring rules for procasts |
proresiduals() |
Residuals for probabilistic regression models (quantile, PIT, Pearson, …) |
Visualizations | |
pithist() |
PIT histograms |
qqrplot() |
Q-Q plots for quantile residuals |
wormplot() |
Worm plots for quantile residuals |
rootogram() |
Rootograms of observed and fitted frequencies |
reliagram() |
(Extended) reliability diagrams |
Under the hood | |
prodist() |
Fitted and predicated probability distributions based on model objects |
Normal() , |
Probability distribution objects |
Poisson() , … |
With methods pdf() , cdf() , quantile() , mean() , … |
Thus, to connect a new class of models to the topmodels
tools the following building blocks need to be provided:
- A class with all necessary methods for the probability distribution of the response variable.
- A
prodist()
method for the model object, which typically first predicts the parameters and then sets up the distribution object from a.
Below we illustrate both steps by setting up a so-called “Tobit” model (Tobin 1958). This is a model with a normally-distributed response, left-censored at zero, which can in R be fitted with the crch()
function (Messner, Mayr, and Zeileis 2016) from the package of the same name (among other packages). For illustration we first set up a new Tobit()
distributions object (which is a special case of the CensoredNormal()
class provided in crch
). Subsequently, we add a prodist()
method for crch
objects.
2 Adding a new distribution
To illustrate how to set up a distributions3
constructor function, we employ the “Tobit” model, i.e., a normal distribution, left-censored at zero. Note that in practice we could just use the CensoredNormal()
distribution provided by the crch()
package (with arbitrary left and/or right censoring) as well as the underlying dcnorm()
, pcnorm()
, qcnorm()
, and rcnorm()
functions. However, here we avoid doing so and set up the Tobit()
class constructor and accompanying methods from scratch, just for illustration.
2.1 Class constructor
First, the constructor (or generator) function for the distribution object should set up a data frame, containing the distribution’s parameters, with a custom class "Tobit"
inheriting from "distribution"
. In case of the Tobit distribution, the parameters are the mean mu
\(= \mu\) and standard deviation sigma
\(= \sigma\) of the underlying uncensored normal distribution.
This is already sufficient for setting up a vector Y
containing three different Tobit distributions:
Y <- Tobit(mu = 1:3, sigma = c(1, 1, 4))
Y
[1] "Tobit(mu = 1, sigma = 1)" "Tobit(mu = 2, sigma = 1)"
[3] "Tobit(mu = 3, sigma = 4)"
Printing, subsetting, and some coercion functions already work due to suitable, flexible methods for "distribution"
objects in general.
2.2 Methods
Having constructed a distributions object like the "Tobit"
object Y
above, the next step is to perform standard tasks on it, such as computing densities, probabilities, moments, etc. While in base R the familiar functions of type ddist()
(density), pdist()
(cumulative probability), qdist()
(quantile), and rdist()
(random numbers) are typically used (e.g., where dist
= norm
or pois
etc.), the distributions3
employs an object-oriented approach. Thus, rather than having the specification of the distribution as part of the function name, it is captured in the class of the object. Based on that suitable generic functions like pdf()
(density), cdf()
(cumulative probability), etc. can be provided for each distribution class. See the table below for an overview of generic functions and their purpose.
Function | Package | Description |
---|---|---|
Distributional | ||
pdf() |
distributions3 |
Probability density function (or probability mass function, typically via ddist() ) |
log_pdf() |
distributions3 |
Log-density (or log-likelihood, typically via ddist(..., log = TRUE) ) |
cdf() |
distributions3 |
Cumulative distribution function (typically via pdist() ) |
quantile() |
stats |
Compute quantiles (typically via qdist() ) |
random() |
distributions3 |
Simulate random samples (typically via rdist() ) |
crps() |
scoringRules |
(Continuous) ranked probability scored (typically via crps_dist() ) |
Moments | ||
mean() |
base |
Expectation |
variance() |
distributions3 |
Variance |
skewness() |
distributions3 |
Skewness |
kurtosis() |
distributions3 |
Excess kurtosis |
Support | ||
support() |
distributions3 |
Maximum and minimum of the support of the probability distribution |
is_discrete() |
distributions3 |
Determine whether a distribution is discrete on its support |
is_continuous() |
distributions3 |
Determine whether a distribution is continuous on its support |
In the following we just show how the mean()
and the cdf()
method can be set up and applied. Some further details are provided at the end of this section, see also the underlying source code in the packages.
First, the extractor functions for moments of the distribution can be typically be defined as functions of the list of parameters stored in the object:
In a Tobit distribution the expectation is that of the underlying uncensored normal distribution (mu
), suitably adjusted with the cumulative distribution function and probability density function of the uncensored distribution. For the three distributions in Y
this yields:
mean(Y)
[1] 1.083 2.008 3.525
In the first two the effect of censoring is rather small, while in the third distribution it is more pronounced (due to the higher variance).
For evaluating the usual d
/p
/q
/r
functions in a standardized way, taking care of dimensions and naming etc., the distributions3
package provides the auxiliary function apply_dpqr()
:
cdf.Tobit <- function(d, x, drop = TRUE, elementwise = NULL, lower.tail = TRUE, log.p = FALSE, ...) {
FUN <- function(at, d) {
p <- pnorm(at, mean = d$mu, sd = d$sigma, lower.tail = lower.tail, log.p = log.p)
p[rep_len(at, length(p)) < 0] <- if(lower.tail) {
if(log.p) -Inf else 0
} else {
if(log.p) 0 else 1
}
p
}
apply_dpqr(d = d, FUN = FUN, at = x, type = "probability", elementwise = elementwise, drop = drop)
}
This can be used to evaluate all distribution functions at the same argument (returning a vector by default):
cdf(Y, 0)
[1] 0.15866 0.02275 0.22663
Or to evaluate all distribution functions at several arguments (returning a matrix by default):
cdf(Y, c(0, 5))
p_0 p_5
[1,] 0.15866 1.0000
[2,] 0.02275 0.9987
[3,] 0.22663 0.6915
Finally, if the distribution and the argument have the same length, the default is to do the evaluating elementwise (always returning a vector):
cdf(Y, 2:0)
[1] 0.8413 0.1587 0.2266
cdf(Y, 2:0, elementwise = TRUE)
[1] 0.8413 0.1587 0.2266
But the evaluation can also be forced to be carried out for each combination of distribution and argument (always returning a matrix):
cdf(Y, 2:0, elementwise = FALSE)
p_2 p_1 p_0
[1,] 0.8413 0.5000 0.15866
[2,] 0.5000 0.1587 0.02275
[3,] 0.4013 0.3085 0.22663
These two additional methods are sufficient for the following illustrations. More details are provided below at the end of this section.
3 Adding a new model
When setting up the procasting infrastructure, the heavy lifting is done with the creation of a suitable distribution class and methods. Interfacing a new model requires only a new prodist()
method for setting up the probability distribution object based on a fitted model (and potentially a newdata
set). The idea is to extract or predict the distribution parameters (mu
and sigma
in case of the Tobit distribution) and subsequently call the distribution class constructor (preserving observation names, if any).
To illustrate how this can be used in practice we fit a heteroscedastic Tobit model using the crch()
function to a precipitation dataset from Innsbruck, Austria. As precipitation is often zero and never can become negative, a Tobit model is typically a good starting point for probabilistic modeling and forecasting. (See FIXME for more details.)
The data can be preprocessed in the following way:
And the model is fitted via:
Then, the predicted Tobit distributions for the first two observations can be obtained with the prodist()
method obtained above:
RainIbk2 <- head(RainIbk, 2)
prodist(m, newdata = RainIbk2)
2000-01-04 2000-01-05
"Tobit(mu = 1.205, sigma = 2.16)" "Tobit(mu = 0.565, sigma = 1.98)"
Then, the mean()
or cdf()
method could be applied to this distribution vector. Alternatively, instead of calling these methods “by hand”, we can use the convenience procast()
method to do so:
procast(m, newdata = RainIbk2, type = "mean")
mean
2000-01-04 1.597
2000-01-05 1.105
procast(m, newdata = RainIbk2, type = "cdf", at = 0)
probability
2000-01-04 0.2888
2000-01-05 0.3878
Instead of "mean"
and "cdf"
, we could also use the aliases "response"
and "probability"
, respectively.
Similarly, other functions such as proscore()
, proresiduals()
, pithist()
, or rootogram()
can be used once all of the methods from the table above are defined.
4 Further details
Most of the remaining methods for the distribution objects follow the same structure as the mean()
and cdf()
method above, respectively. However, the random()
and the support-related methods are slightly different. Hence, these are also briefly illustrated in the following.
Drawing random()
samples also uses apply_dpqr()
with the argument n
which is assured to always be a positive integer.
If drop = TRUE
(the default), then one random sample for each distribution yields a vector, while several random samples for each distribution yields a matrix:
random(Y, 1)
[1] 0.000 2.593 4.435
random(Y, 3)
r_1 r_2 r_3
[1,] 2.902 2.2861 1.117
[2,] 1.446 0.3598 0.000
[3,] 0.000 0.0000 0.000
The support()
method should return a matrix of "min"
and "max"
for the distribution. The make_support()
function helps to set the right names and dimension.
support.Tobit <- function(d, drop = TRUE, ...) {
min <- rep(0, length(d))
max <- rep(Inf, length(d))
make_support(min, max, d, drop = drop)
}
support(Y)
min max
[1,] 0 Inf
[2,] 0 Inf
[3,] 0 Inf
Finally, the is_discrete()
and is_continuous()
methods should return TRUE
for distributions that are discrete or continuous, respectively, on the entire support. Thus, for mixed discrete-continuous distributions like the Tobit, both methods should return FALSE
: