Method for Numerically Evaluating the CRPS of Probability Distributions

Method to the crps generic function from the scoringRules package for numerically evaluating the (continuous) ranked probability score (CRPS) of any probability distributions3 object.

crps.distribution(
  y,
  x,
  drop = TRUE,
  elementwise = NULL,
  gridsize = 500L,
  batchsize = 10000L,
  applyfun = NULL,
  cores = NULL,
  method = NULL,
  ...
)

crps.Beta(y, x, drop = TRUE, elementwise = NULL, ...)

crps.Bernoulli(y, x, drop = TRUE, elementwise = NULL, ...)

crps.Binomial(y, x, drop = TRUE, elementwise = NULL, ...)

crps.Erlang(y, x, drop = TRUE, elementwise = NULL, ...)

crps.Exponential(y, x, drop = TRUE, elementwise = NULL, ...)

crps.Gamma(y, x, drop = TRUE, elementwise = NULL, ...)

crps.GEV(y, x, drop = TRUE, elementwise = NULL, ...)

crps.Geometric(y, x, drop = TRUE, elementwise = NULL, ...)

crps.Gumbel(y, x, drop = TRUE, elementwise = NULL, ...)

crps.HyperGeometric(y, x, drop = TRUE, elementwise = NULL, ...)

crps.Logistic(y, x, drop = TRUE, elementwise = NULL, ...)

crps.LogNormal(y, x, drop = TRUE, elementwise = NULL, ...)

crps.NegativeBinomial(y, x, drop = TRUE, elementwise = NULL, ...)

crps.Normal(y, x, drop = TRUE, elementwise = NULL, ...)

crps.Poisson(y, x, drop = TRUE, elementwise = NULL, ...)

crps.StudentsT(y, x, drop = TRUE, elementwise = NULL, ...)

crps.Uniform(y, x, drop = TRUE, elementwise = NULL, ...)

crps.XBetaX(y, x, drop = TRUE, elementwise = NULL, method = "cdf", ...)

crps.GAMLSS(y, x, drop = TRUE, elementwise = NULL, ...)

crps.BAMLSS(y, x, drop = TRUE, elementwise = NULL, ...)

Arguments

y: A distribution object, e.g., as created by Normal or Binomial.
x: A vector of elements whose CRPS should be determined given the distribution y.
drop: logical. Should the result be simplified to a vector if possible?
elementwise: logical. Should each distribution in y be evaluated at all elements of x (elementwise = FALSE, yielding a matrix)? Or, if y and x have the same length, should the evaluation be done element by element (elementwise = TRUE, yielding a vector)? The default of NULL means that elementwise = TRUE is used if the lengths match and otherwise elementwise = FALSE is used.
gridsize: positive size of the grid used to approximate the CDF for the numerical calculation of the CRPS.
batchsize: maximum batch size. Used to split the input into batches. Lower values reduce required memory but may increase computation time.
applyfun: an optional lapply-style function with arguments function(X, FUN, ...). It is used to compute the CRPS for each element of y. The default is to use the basic lapply function unless the cores argument is specified (see below).
cores: numeric. If set to an integer the applyfun is set to mclapply with the desired number of cores, except on Windows where parLapply with makeCluster(cores) is used.
method: character. Should the grid be set up on the observation scale and method = "cdf" be used to compute the corresponding probabilities? Or should the grid be set up on the probability scale and method = "quantile" be used to compute the corresponding observations? By default, "cdf" is used for discrete observations whose range is smaller than the gridsize and "quantile" otherwise.
...: currently not used.

Value

In case of a single distribution object, either a numeric vector of length(x) (if drop = TRUE, default) or a matrix with

length(x) columns (if drop = FALSE). In case of a vectorized distribution object, a matrix with length(x) columns containing all possible combinations.

Details

The (continuous) ranked probability score (CRPS) for (univariate) probability distributions can be computed based on the the object-oriented infrastructure provided by the distributions3 package. The general crps.distribution method does so by using numeric integration based on the cdf and/or quantile methods (for more details see below). Additionally, if dedicated closed-form CRPS computations are provided by the scoringRules package for the specified distribution, then these are used because they are both computationally faster and numerically more precise. For example, the crps method for Normal objects leverages crps_norm rather than relying on numeric integration.

The general method for any distribution object uses the following strategy for numerical CRPS computation. By default (if the method argument is NULL), it distinguishes distributions whose entire support is continuous, or whose entire support is discrete, or mixed discrete-continuous distribution using is_continuous and is_discrete, respectively.

For continuous and mixed distributions, an equidistant grid of gridsize + 5 probabilities is drawn for which the corresponding quantiles for each distribution y are calculated (including the observation x). The calculation of the CRPS then uses a trapezoidal approximation for the numeric integration. For discrete distributions, gridsize equidistant quantiles (in steps of 1) are drawn and the corresponding probabilities from the cdf are calculated for each distribution y (including the observation x) and the CRPS calculated using numeric integration. If the gridsize in steps of 1 is not sufficient to cover the required range, the method falls back to the procedure used for continuous and mixed distributions to approximate the CRPS.

If the method argument is set to either "cdf" or "quantile", then the specific strategy for setting up the grid of observations and corresponding probabilities can be enforced. This can be useful if for a certain distribution class, only a cdf or only a quantile method is available or only one of them is numerically stable or computationally efficient etc.

The numeric approximation requires to set up a matrix of dimension length(y) * (gridsize + 5) (or length(y) * (gridsize + 1)) which may be very memory intensive if length(y) and/or gridsize are large. Thus, the data is split batches of (approximately) equal size, not larger than batchsize. Thus, the memory requirement is reduced to batchsize * (gridsize + 5) in each step. Hence, a smaller value of batchsize will reduce memory footprint but will slightly increase computation time.

The error (deviation between numerical approximation and analytic solution) has been shown to be in the order of 1e-2 for a series of distributions tested. Accuracy can be increased by increasing gridsize and will be lower for a smaller gridsize.

For parallelization of the numeric computations, a suitable applyfun can be provided that carries out the integration for each element of y. To facilitate setting up a suitable applyfun using the basic parallel package, the argument cores is provided for convenience. When used, y is split into B equidistant batches; at least B = cores batches or a multiple of cores with a maximum size of batchsize. On systems running Windows parlapply is used, else mclapply.

Examples

 if(!requireNamespace("scoringRules")) {
  if(interactive() || is.na(Sys.getenv("_R_CHECK_PACKAGE_NAME_", NA))) {
    stop("not all packages required for the example are installed")
  } else q() }

set.seed(6020)

## three normal distributions X and observations x
library("distributions3")
X <- Normal(mu = c(0, 1, 2), sigma = c(2, 1, 1))
x <- c(0, 0, 1)

## evaluate crps
## using infrastructure from scoringRules (based on closed-form analytic equations)
library("scoringRules")
crps(X, x)
#> [1] 0.4673900 0.6024414 0.6024414

## using general distribution method explicitly (based on numeric integration)
crps.distribution(X, x)
#> [1] 0.4674058 0.6024482 0.6024482

## analogously for Poisson distribution
Y <- Poisson(c(0.5, 1, 2))
crps(Y, x)
#> [1] 0.1631650 0.4762224 0.4991650
crps.distribution(Y, x)
#> [1] 0.1631650 0.4762224 0.4991650