procast.Rmd
The functionality provided by the topmodels
package can
be broadly placed into three groups:
lm
, glm
, etc.), compute
quantities of interest, such as predicted probabilities, quantiles,
residuals, etc.The S3 framework for probability distributions (3. above) is actually
set up in the distributions3
package that
topmodels
builds upon. Actually, all numeric and graphical
functions (1. and 2. above) have “smart” default methods. This means
that if necessary methods can be defined for them but all the
default methods work out of the box if the distributions3
functionality from 3. is provided. The table below shows all of the
functions that become directly available when interfacing the
distributions3
infrastructure.
Function | Description |
---|---|
Numeric quantities | |
procast() |
Probabilistic forecasts (probabilities, quantiles, etc.) based on model objects |
proscore() |
Evaluate scoring rules for procasts |
proresiduals() |
Residuals for probabilistic regression models (quantile, PIT, Pearson, …) |
Visualizations | |
pithist() |
PIT histograms |
qqrplot() |
Q-Q plots for quantile residuals |
wormplot() |
Worm plots for quantile residuals |
rootogram() |
Rootograms of observed and fitted frequencies |
reliagram() |
(Extended) reliability diagrams |
Under the hood | |
prodist() |
Fitted and predicated probability distributions based on model objects |
Normal() , |
Probability distribution objects |
Poisson() , … |
With methods pdf() , cdf() ,
quantile() , mean() , … |
Thus, to connect a new class of models to the topmodels
tools the following building blocks need to be provided:
prodist()
method for the model object, which
typically first predicts the parameters and then sets up the
distribution object from a.Below we illustrate both steps by setting up a so-called “Tobit”
model (Tobin 1958). This is a model with a
normally-distributed response, left-censored at zero, which can in R be
fitted with the crch()
function (Messner, Mayr, and Zeileis 2016) from the
package of the same name (among other packages). For illustration we
first set up a new Tobit()
distributions object (which is a
special case of the CensoredNormal()
class provided in
crch
). Subsequently, we add a prodist()
method
for crch
objects.
To illustrate how to set up a distributions3
constructor
function, we employ the “Tobit” model, i.e., a normal distribution,
left-censored at zero. Note that in practice we could just use the
CensoredNormal()
distribution provided by the
crch()
package (with arbitrary left and/or right censoring)
as well as the underlying dcnorm()
, pcnorm()
,
qcnorm()
, and rcnorm()
functions. However,
here we avoid doing so and set up the Tobit()
class
constructor and accompanying methods from scratch, just for
illustration.
First, the constructor (or generator) function for the distribution
object should set up a data frame, containing the distribution’s
parameters, with a custom class "Tobit"
inheriting from
"distribution"
. In case of the Tobit distribution, the
parameters are the mean mu
\(=
\mu\) and standard deviation sigma
\(= \sigma\) of the underlying uncensored
normal distribution.
Tobit <- function(mu = 0, sigma = 1) {
n <- c(length(mu), length(sigma))
stopifnot("parameter lengths do not match (only scalars are allowed to be recycled)" =
all(n %in% c(1L, max(n))))
d <- data.frame(mu = mu, sigma = sigma)
class(d) <- c("Tobit", "distribution")
return(d)
}
This is already sufficient for setting up a vector Y
containing three different Tobit distributions:
Y <- Tobit(mu = 1:3, sigma = c(1, 1, 4))
Y
## [1] "Tobit distribution (mu = 1, sigma = 1)"
## [2] "Tobit distribution (mu = 2, sigma = 1)"
## [3] "Tobit distribution (mu = 3, sigma = 4)"
Printing, subsetting, and some coercion functions already work due to
suitable, flexible methods for "distribution"
objects in
general.
length(Y)
## [1] 3
Y[-2]
## [1] "Tobit distribution (mu = 1, sigma = 1)"
## [2] "Tobit distribution (mu = 3, sigma = 4)"
as.matrix(Y)
## mu sigma
## [1,] 1 1
## [2,] 2 1
## [3,] 3 4
Having constructed a distributions object like the
"Tobit"
object Y
above, the next step is to
perform standard tasks on it, such as computing densities,
probabilities, moments, etc. While in base R the familiar functions of
type ddist()
(density), pdist()
(cumulative
probability), qdist()
(quantile), and rdist()
(random numbers) are typically used (e.g., where dist
=
norm
or pois
etc.), the
distributions3
employs an object-oriented approach. Thus,
rather than having the specification of the distribution as part of the
function name, it is captured in the class of the object. Based on that
suitable generic functions like pdf()
(density),
cdf()
(cumulative probability), etc. can be provided for
each distribution class. See the table below for an overview of generic
functions and their purpose.
Function | Package | Description |
---|---|---|
Distributional | ||
pdf() |
distributions3 |
Probability density function (or probability mass
function, typically via ddist() ) |
log_pdf() |
distributions3 |
Log-density (or log-likelihood, typically via
ddist(..., log = TRUE) ) |
cdf() |
distributions3 |
Cumulative distribution function (typically via
pdist() ) |
quantile() |
stats |
Compute quantiles (typically via
qdist() ) |
random() |
distributions3 |
Simulate random samples (typically via
rdist() ) |
crps() |
scoringRules |
(Continuous) ranked probability scored (typically via
crps_dist() ) |
Moments | ||
mean() |
base |
Expectation |
variance() |
distributions3 |
Variance |
skewness() |
distributions3 |
Skewness |
kurtosis() |
distributions3 |
Excess kurtosis |
Support | ||
support() |
distributions3 |
Maximum and minimum of the support of the probability distribution |
is_discrete() |
distributions3 |
Determine whether a distribution is discrete on its support |
is_continuous() |
distributions3 |
Determine whether a distribution is continuous on its support |
In the following we just show how the mean()
and the
cdf()
method can be set up and applied. Some further
details are provided at the end of this section, see also the underlying
source code in the packages.
First, the extractor functions for moments of the distribution can be typically be defined as functions of the list of parameters stored in the object:
mean.Tobit <- function(x, ...) {
m <- x$mu * pnorm(x$mu/x$sigma) + x$sigma * dnorm(x$mu/x$sigma)
setNames(m, names(x))
}
In a Tobit distribution the expectation is that of the underlying
uncensored normal distribution (mu
), suitably adjusted with
the cumulative distribution function and probability density function of
the uncensored distribution. For the three distributions in
Y
this yields:
mean(Y)
## [1] 1.083 2.008 3.525
In the first two the effect of censoring is rather small, while in the third distribution it is more pronounced (due to the higher variance).
For evaluating the usual
d
/p
/q
/r
functions in
a standardized way, taking care of dimensions and naming etc., the
distributions3
package provides the auxiliary function
apply_dpqr()
:
cdf.Tobit <- function(d, x, drop = TRUE, elementwise = NULL, lower.tail = TRUE, log.p = FALSE, ...) {
FUN <- function(at, d) {
p <- pnorm(at, mean = d$mu, sd = d$sigma, lower.tail = lower.tail, log.p = log.p)
p[rep_len(at, length(p)) < 0] <- if(lower.tail) {
if(log.p) -Inf else 0
} else {
if(log.p) 0 else 1
}
p
}
apply_dpqr(d = d, FUN = FUN, at = x, type = "probability", elementwise = elementwise, drop = drop)
}
This can be used to evaluate all distribution functions at the same argument (returning a vector by default):
cdf(Y, 0)
## [1] 0.15866 0.02275 0.22663
Or to evaluate all distribution functions at several arguments (returning a matrix by default):
## p_0 p_5
## [1,] 0.15866 1.0000
## [2,] 0.02275 0.9987
## [3,] 0.22663 0.6915
Finally, if the distribution and the argument have the same length, the default is to do the evaluating elementwise (always returning a vector):
cdf(Y, 2:0)
## [1] 0.8413 0.1587 0.2266
cdf(Y, 2:0, elementwise = TRUE)
## [1] 0.8413 0.1587 0.2266
But the evaluation can also be forced to be carried out for each combination of distribution and argument (always returning a matrix):
cdf(Y, 2:0, elementwise = FALSE)
## p_2 p_1 p_0
## [1,] 0.8413 0.5000 0.15866
## [2,] 0.5000 0.1587 0.02275
## [3,] 0.4013 0.3085 0.22663
These two additional methods are sufficient for the following illustrations. More details are provided below at the end of this section.
When setting up the procasting infrastructure, the heavy lifting is
done with the creation of a suitable distribution class and methods.
Interfacing a new model requires only a new prodist()
method for setting up the probability distribution object based on a
fitted model (and potentially a newdata
set). The idea is
to extract or predict the distribution parameters (mu
and
sigma
in case of the Tobit distribution) and subsequently
call the distribution class constructor (preserving observation names,
if any).
prodist.crch <- function(object, newdata = NULL, na.action = na.pass, ...) {
par <- predict(object, newdata = newdata, na.action = na.action, type = "parameter", ...)
Tobit(mu = setNames(par$location, rownames(par)), sigma = par$scale)
}
To illustrate how this can be used in practice we fit a
heteroscedastic Tobit model using the crch()
function to a
precipitation dataset from Innsbruck, Austria. As precipitation is often
zero and never can become negative, a Tobit model is typically a good
starting point for probabilistic modeling and forecasting. (See FIXME
for more details.)
The data can be preprocessed in the following way:
data("RainIbk", package = "crch")
RainIbk <- sqrt(RainIbk)
RainIbk$ensmean <- apply(RainIbk[, grep('^rainfc', names(RainIbk))], 1, mean)
RainIbk$enssd <- apply(RainIbk[, grep('^rainfc', names(RainIbk))], 1, sd)
RainIbk <- subset(RainIbk, enssd > 0)
And the model is fitted via:
Then, the predicted Tobit distributions for the first two
observations can be obtained with the prodist()
method
obtained above:
## 2000-01-04
## "Tobit distribution (mu = 1.205, sigma = 2.16)"
## 2000-01-05
## "Tobit distribution (mu = 0.565, sigma = 1.98)"
Then, the mean()
or cdf()
method could be
applied to this distribution vector. Alternatively, instead of calling
these methods “by hand”, we can use the convenience
procast()
method to do so:
procast(m, newdata = RainIbk2, type = "mean")
## mean
## 2000-01-04 1.597
## 2000-01-05 1.105
procast(m, newdata = RainIbk2, type = "cdf", at = 0)
## probability
## 2000-01-04 0.2888
## 2000-01-05 0.3878
Instead of "mean"
and "cdf"
, we could also
use the aliases "response"
and "probability"
,
respectively.
Similarly, other functions such as proscore()
,
proresiduals()
, pithist()
, or
rootogram()
can be used once all of the methods from the
table above are defined.
Most of the remaining methods for the distribution objects follow the
same structure as the mean()
and cdf()
method
above, respectively. However, the random()
and the
support-related methods are slightly different. Hence, these are also
briefly illustrated in the following.
Drawing random()
samples also uses
apply_dpqr()
with the argument n
which is
assured to always be a positive integer.
random.Tobit <- function(x, n = 1L, drop = TRUE, ...) {
n <- make_positive_integer(n)
if (n == 0L) {
return(numeric(0L))
}
FUN <- function(at, d) {
y <- rnorm(n = at, mean = d$mu, sd = d$sigma)
y[y < 0] <- 0
y
}
apply_dpqr(d = x, FUN = FUN, at = n, type = "random", drop = drop)
}
If drop = TRUE
(the default), then one random sample for
each distribution yields a vector, while several random samples for each
distribution yields a matrix:
random(Y, 1)
## [1] 0.000 2.255 0.000
random(Y, 3)
## r_1 r_2 r_3
## [1,] 0.9944 0.000 0.7173
## [2,] 2.6216 1.753 1.4463
## [3,] 7.5936 2.023 5.5159
The support()
method should return a matrix of
"min"
and "max"
for the distribution. The
make_support()
function helps to set the right names and
dimension.
support.Tobit <- function(d, drop = TRUE, ...) {
min <- rep(0, length(d))
max <- rep(Inf, length(d))
make_support(min, max, d, drop = drop)
}
support(Y)
## min max
## [1,] 0 Inf
## [2,] 0 Inf
## [3,] 0 Inf
Finally, the is_discrete()
and
is_continuous()
methods should return TRUE
for
distributions that are discrete or continuous, respectively, on the
entire support. Thus, for mixed discrete-continuous distributions like
the Tobit, both methods should return FALSE
:
is_discrete.Tobit <- function(d, ...) {
setNames(rep.int(FALSE, length(d)), names(d))
}
is_continuous.Tobit <- function(d, ...) {
setNames(rep.int(TRUE, length(d)), names(d))
}
is_discrete(Y)
## [1] FALSE FALSE FALSE
## [1] TRUE TRUE TRUE