Uni Duisburg-Essen Logo HECF Logo

Data Science Methods for Forecasting in Energy and Economics

Jonathan Berrisch

University of Duisburg-Essen, House of Energy, Climate, and Finance

2025-06-30

The beginning: June 2020

Artwork by @allison_horst

High-Level View

Motivation

Energy market liberalization created complex, interconnected trading systems
Renewable energy transition introduces uncertainty and volatility from weather-dependent generation
Traditional point forecasts are inadequate for modern energy markets with increasing uncertainty
Risk inherently is a probabilistic concept
Probabilistic forecasting essential for risk management, planning and decision making in volatile energy environments
Online learning methods needed for fast-updating models with streaming energy data

Image generated by Sora (OpenAI)

Overview of the Thesis

Berrisch, J., & Ziel, F. (2023). CRPS learning. Journal of Econometrics, 237(2), 105221.
Berrisch, J., & Ziel, F. (2024). Multivariate probabilistic CRPS learning with an application to day-ahead electricity prices. International Journal of Forecasting, 40(4), 1568–1586.
Hirsch, S., Berrisch, J., & Ziel, F. (2024). Online Distributional Regression. arXiv preprint arXiv:2407.08750.
Berrisch, J., & Ziel, F. (2022). Distributional modeling and forecasting of natural gas prices. Journal of Forecasting, 41(6), 1065–1086.
Berrisch, J., Pappert, S., Ziel, F., & Arsova, A. (2023). Modeling volatility and dependence of European carbon and energy prices. Finance Research Letters, 52, 103503.
Berrisch, J., Narajewski, M., & Ziel, F. (2023). High-resolution peak demand estimation using generalized additive models and deep neural networks. Energy and AI, 13, 100236.
Berrisch, J. (2025). rcpptimer: Rcpp Tic-Toc Timer with OpenMP Support. arXiv preprint arXiv:2501.15856.

Overview of the Thesis

Berrisch, J., & Ziel, F. (2023). CRPS learning. Journal of Econometrics, 237(2), 105221.
Berrisch, J., & Ziel, F. (2024). Multivariate probabilistic CRPS learning with an application to day-ahead electricity prices. International Journal of Forecasting, 40(4), 1568–1586.
Hirsch, S., Berrisch, J., & Ziel, F. (2024). Online Distributional Regression. arXiv preprint arXiv:2407.08750.
Berrisch, J., & Ziel, F. (2022). Distributional modeling and forecasting of natural gas prices. Journal of Forecasting, 41(6), 1065–1086.
Berrisch, J., Pappert, S., Ziel, F., & Arsova, A. (2023). Modeling volatility and dependence of European carbon and energy prices. Finance Research Letters, 52, 103503.
Berrisch, J., Narajewski, M., & Ziel, F. (2023). High-resolution peak demand estimation using generalized additive models and deep neural networks. Energy and AI, 13, 100236.
Berrisch, J. (2025). rcpptimer: Rcpp Tic-Toc Timer with OpenMP Support. arXiv preprint arXiv:2501.15856.

Overview of the Thesis

Berrisch, J., & Ziel, F. (2023). CRPS learning. Journal of Econometrics, 237(2), 105221.
Berrisch, J., & Ziel, F. (2024). Multivariate probabilistic CRPS learning with an application to day-ahead electricity prices. International Journal of Forecasting, 40(4), 1568–1586.
Hirsch, S., Berrisch, J., & Ziel, F. (2024). Online Distributional Regression. arXiv preprint arXiv:2407.08750.
Berrisch, J., & Ziel, F. (2022). Distributional modeling and forecasting of natural gas prices. Journal of Forecasting, 41(6), 1065–1086.
Berrisch, J., Pappert, S., Ziel, F., & Arsova, A. (2023). Modeling volatility and dependence of European carbon and energy prices. Finance Research Letters, 52, 103503.
Berrisch, J., Narajewski, M., & Ziel, F. (2023). High-resolution peak demand estimation using generalized additive models and deep neural networks. Energy and AI, 13, 100236.
Berrisch, J. (2025). rcpptimer: Rcpp Tic-Toc Timer with OpenMP Support. arXiv preprint arXiv:2501.15856.

Overview

Online Distributional Regression

Distributional Modeling and Forecasting of Natural Gas Prices

Reduces estimation time by 2-3 orders of magnitude

Maintains competitive forecasting accuracy

Real-World Validation in Energy Markets

Python Package ondil on Github and PyPi

Forecasting of Day-Ahead and Month-Ahead prices

To capture the full distribution of future prices

Extensive data analysis from 2011-2020

State-Space models, skewed Student’s t distribution

Python Package sstudentt on Github and PyPi

Overview

High-Resolution Peak Demand Estimation Using Generalized Additive Models and Deep Neural Networks

rcpptimer: Rcpp Tic-Toc Timer with OpenMP Support

Predict high-resolution electricity peaks using only low-resolution data

Combine GAMs and DNN’s for superior accuracy

Won Western Power Distribution Competition

Won Best-Student-Presentation Award

R Package rcpptimer on Github and CRAN

Provides Rcpp bindings for cpptimer

tic-toc class for timing C++ code

Supports nested, overlapping and scoped timers

Supports OpenMP parallelism

//[[Rcpp::depends(rcpptimer)]]
#include <rcpptimer.h>

//[[Rcpp::export]]
void ex()
{
  Rcpp::Timer timer;
  timer.tic();
  // Code to be timed
  timer.toc();
}

CRPS Learning

Berrisch, J., & Ziel, F. (2023). Journal of Econometrics, 237(2), 105221.

Introduction

The Idea:

  • Combine multiple forecasts instead of choosing one

  • Combination weights may vary over time, over the distribution or both

2 Popular options for combining distributions:

  • Combining across quantiles (this paper)
    • Horizontal aggregation, vincentization
  • Combining across probabilities
    • Vertical aggregation
  • Combining at an angle
    • Taylor & Meng (2023)

The Framework of Prediction under Expert Advice

 

Each day, \(t = 1, 2, ... T\)

  • The forecaster receives predictions \(\widehat{X}_{t,k}\) from \(K\) experts
  • The forecaster assigns weights \(w_{t,k}\) to each expert
  • The forecaster calculates the prediction:

\[\begin{equation} \widetilde{X}_{t} = \sum_{k=1}^K w_{t,k} \widehat{X}_{t,k}. \label{eq_forecast_def} \end{equation}\]

  • The realization for \(t\) is observed

The experts can be institutions, persons, or models

The forecasts can be point-forecasts (i.e., mean or median) or full predictive distributions

Cesa-Bianchi & Lugosi (2006)

The Regret

Weights are updated sequentially according to the past performance of the \(K\) experts.

That is, a loss function \(\ell\) is needed. This is used to compute the cumulative regret \(R_{t,k}\)

\[\begin{equation} R_{t,k} = \widetilde{L}_{t} - \widehat{L}_{t,k} = \sum_{i = 1}^t \ell(\widetilde{X}_{i},Y_i) - \ell(\widehat{X}_{i,k},Y_i)\label{eq:regret} \end{equation}\]

The cumulative regret:

  • Indicates the predictive accuracy of the expert \(k\) until time \(t\).
  • Measures how much the forecaster regrets not having followed the expert’s advice

Popular loss functions for point forecasting Gneiting (2011):

\(\ell_2\) loss:

\[\begin{equation} \ell_2(x, y) = | x -y|^2 \label{eq:elltwo} \end{equation}\]

Strictly proper for mean prediction

\(\ell_1\) loss:

\[\begin{equation} \ell_1(x, y) = | x -y| \label{eq:ellone} \end{equation}\]

Strictly proper for median predictions

 

The naive combination

\[\begin{equation} w_{t,k}^{\text{Naive}} = \frac{1}{K}\label{eq:naive_combination} \end{equation}\]

The exponentially weighted average forecaster (EWA)

\[\begin{equation} \begin{aligned} w_{t,k}^{\text{EWA}} & = \frac{e^{\eta R_{t,k}} }{\sum_{k = 1}^K e^{\eta R_{t,k}}}\\ & = \frac{e^{-\eta \ell(\widehat{X}_{t,k},Y_t)} w^{\text{EWA}}_{t-1,k} }{\sum_{k = 1}^K e^{-\eta \ell(\widehat{X}_{t,k},Y_t)} w^{\text{EWA}}_{t-1,k}} \end{aligned}\label{eq:exp_combination} \end{equation}\]

Risk

In stochastic settings, the cumulative Risk should be analyzed Wintenberger (2017):

\[\begin{align} &\underbrace{\widetilde{\mathcal{R}}_t = \sum_{i=1}^t \mathbb{E}[\ell(\widetilde{X}_{i},Y_i)|\mathcal{F}_{i-1}]}_{\text{Cumulative Risk of Forecaster}} \\ &\underbrace{\widehat{\mathcal{R}}_{t,k} = \sum_{i=1}^t \mathbb{E}[\ell(\widehat{X}_{i,k},Y_i)|\mathcal{F}_{i-1}]}_{\text{Cumulative Risk of Experts}} \label{eq_def_cumrisk} \end{align}\]

(7) expected loss of the algorithm (lower = better)

Optimal Convergence


The selection problem

\[\begin{equation} \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) \stackrel{t\to \infty}{\rightarrow} a \quad \text{with} \quad a \leq 0. \label{eq_opt_select} \end{equation}\] The forecaster is asymptotically not worse than the best expert.

The convex aggregation problem

\[\begin{equation} \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\pi} \right) \stackrel{t\to \infty}{\rightarrow} b \quad \text{with} \quad b \leq 0 . \label{eq_opt_conv} \end{equation}\] The forecaster is asymptotically not worse than the best convex combination in hindsight (oracle).

Optimal rates with respect to selection \(\eqref{eq_opt_select}\) and convex aggregation \(\eqref{eq_opt_conv}\) Wintenberger (2017):

\[\begin{align} \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) & = \mathcal{O}\left(\frac{\log(K)}{t}\right)\label{eq_optp_select} \end{align}\]

\[\begin{align} \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\pi} \right) & = \mathcal{O}\left(\sqrt{\frac{\log(K)}{t}}\right) \label{eq_optp_conv} \end{align}\]

Algorithms can satisfy both \(\eqref{eq_optp_select}\) and \(\eqref{eq_optp_conv}\) depending on:

  • The loss function
  • Regularity conditions on \(Y_t\) and \(\widehat{X}_{t,k}\)
  • The weighting scheme

Optimal Convergence


Requirements:

EWA satisfies optimal selection convergence \(\eqref{eq_optp_select}\) in a deterministic setting if:

Loss \(\ell\) is exp-concave

Learning-rate \(\eta\) is chosen correctly

Optimal convex aggregation convergence \(\eqref{eq_optp_conv}\) can be satisfied by applying the kernel-trick:

\[\begin{align} \ell^{\nabla}(x,y) = \ell'(\widetilde{X},y) x \end{align}\]

\(\ell'\) is the subgradient of \(\ell\) at forecast combination \(\widetilde{X}\)

Those results can be converted to any stochastic setting Wintenberger (2017)

Probabilistic Setting


An appropriate choice:

\[\begin{equation*} \text{CRPS}(F, y) = \int_{\mathbb{R}} {(F(x) - \mathbb{1}\{ x > y \})}^2 dx \label{eq:crps} \end{equation*}\]

It’s strictly proper (Gneiting & Raftery, 2007).

Using the CRPS, we can calculate time-adaptive weights \(w_{t,k}\). However, what if the experts’ performance varies in parts of the distribution?

Utilize this relation:

\[\begin{equation*} \text{CRPS}(F, y) = 2 \int_0^{1} \text{QL}_p(F^{-1}(p), y) dp.\label{eq_crps_qs} \end{equation*}\]

… to combine quantiles of the probabilistic forecasts individually using the quantile-loss QL.

CRPS Learning Optimality

QL is convex, but not exp-concave

Bernstein Online Aggregation (BOA) lets us weaken the exp-concavity condition. It satisfies that there exists a \(C>0\) such that for \(x>0\) it holds that

\[\begin{equation} P\left( \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\pi} \right) \leq C \log(\log(t)) \left(\sqrt{\frac{\log(K)}{t}} + \frac{\log(K)+x}{t}\right) \right) \geq 1-e^{-x} \label{eq_boa_opt_conv} \end{equation}\]

if the loss function is convex.

Almost optimal w.r.t. convex aggregation \(\eqref{eq_optp_conv}\) Wintenberger (2017).

The same algorithm satisfies that there exists a \(C>0\) such that for \(x>0\) it holds that \[\begin{equation} P\left( \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) \leq C\left(\frac{\log(K)+\log(\log(Gt))+ x}{\alpha t}\right)^{\frac{1}{2-\beta}} \right) \geq 1-2e^{-x} \label{eq_boa_opt_select} \end{equation}\]

if the loss \(\ell\) is \(G\)-Lipschitz and weak exp-concave in its first coordinate

Almost optimal w.r.t. selection \(\eqref{eq_optp_select}\) Gaillard & Wintenberger (2018).

We show that this holds for QL under feasible conditions.

Proposition 1: The Power of Flexibility

\[\begin{align} 2\overline{\widehat{\mathcal{R}}}^{\text{QL}}_{t,\min} & \leq \widehat{\mathcal{R}}^{\text{CRPS}}_{t,\min} \label{eq_risk_ql_crps_expert} \\ 2\overline{\widehat{\mathcal{R}}}^{\text{QL}}_{t,\pi} & \leq \widehat{\mathcal{R}}^{\text{CRPS}}_{t,\pi} . \label{eq_risk_ql_crps_convex} \end{align}\]

Pointwise can outperform constant procedures

\(\text{QL}\) is convex: almost optimal convergence w.r.t. convex aggregation \(\eqref{eq_boa_opt_conv}\)

For almost optimal convergence w.r.t. selection \(\eqref{eq_boa_opt_select}\) we need (Gaillard & Wintenberger, 2018):

A1: Lipschitz Continuity

A2: Weak Exp-Concavity

QL is Lipschitz continuous with \(G=\max(p, 1-p)\):

A1 holds

A1: Lipschitz Continuity

For some \(G>0\) it holds for all \(x_1,x_2\in \mathbb{R}\) and \(t>0\) that

\[ | \ell(x_1, Y_t)-\ell(x_2, Y_t) | \leq G |x_1-x_2|\]

A2 Weak Exp-Concavity

For some \(\alpha>0\), \(\beta\in[0,1]\) it holds for all \(x_1,x_2 \in \mathbb{R}\) and \(t>0\) that

\[\begin{align*} \mathbb{E}[ & \ell(x_1, Y_t)-\ell(x_2, Y_t) | \mathcal{F}_{t-1}] \leq \\ & \mathbb{E}[ \ell'(x_1, Y_t)(x_1 - x_2) |\mathcal{F}_{t-1}] \\ & + \mathbb{E}\left[ \left. \left( \alpha(\ell'(x_1, Y_t)(x_1 - x_2))^{2}\right)^{1/\beta} \right|\mathcal{F}_{t-1}\right] \end{align*}\]

The strongest case is \(\beta=1\) (Strong Convexity)

Conditional quantile risk: \(\mathcal{Q}_p(x) = \mathbb{E}[ \text{QL}_p(x, Y_t) | \mathcal{F}_{t-1}]\).

convexity properties of \(\mathcal{Q}_p\) depend on the conditional distribution \(Y_t|\mathcal{F}_{t-1}\).

Proposition 2

Let \(Y\) be a univariate random variable with (Radon-Nikodym) \(\nu\)-density \(f\), then for the second subderivative of the quantile risk \(\mathcal{Q}_p(x) = \mathbb{E}[ \text{QL}_p(x, Y) ]\) of \(Y\) it holds for all \(p\in(0,1)\) that \(\mathcal{Q}_p'' = f.\) Additionally, if \(f\) is a continuous Lebesgue-density with \(f\geq\gamma>0\) for some constant \(\gamma>0\) on its support \(\text{spt}(f)\) then \(\mathcal{Q}_p\) is \(\gamma\)-strongly convex.

This implies satisfaction of condition A2 with \(\beta=1\) and \(\alpha = \gamma / 2G^2\) (Gaillard & Wintenberger, 2018)

Theorem 1

The gradient based fully adaptive Bernstein online aggregation (BOAG) applied pointwise for all \(p\in(0,1)\) on \(\text{QL}\) satisfies \(\eqref{eq_boa_opt_conv}\) with minimal CRPS given by

\[\widehat{\mathcal{R}}_{t,\pi} = 2\overline{\widehat{\mathcal{R}}}^{\text{QL}}_{t,\pi}.\]

If \(Y_t|\mathcal{F}_{t-1}\) is bounded and has a pdf \(f_t\) satisfying \(f_t>\gamma >0\) on its support \(\text{spt}(f_t)\) then \(\eqref{eq_boa_opt_select}\) holds with \(\beta=1\) and

\[\widehat{\mathcal{R}}_{t,\min} = 2\overline{\widehat{\mathcal{R}}}^{\text{QL}}_{t,\min}\]

BOAG with \(\text{QL}\) satisfies \(\eqref{eq_boa_opt_conv}\) and \(\eqref{eq_boa_opt_select}\)

First, rewrite \(QL_p\) using the check function:

\[\begin{align} \rho_p(z) = z(\mathbb{1}(0 < z) - p) \label{eq:check} \end{align}\]

\[\begin{align} QL_p(x, y) &= (\mathbb{1}(y < x) - p)(x - y) \\ &= \rho_p(x-y) \end{align}\]

Now we can express the quantile risk as:

\[\begin{align} \mathcal{Q}_p(x) = \mathbb{E}[\rho_p(x-y)] = ∫ \rho_p(x-y)f(y)dy \end{align}\]

This integral form is where the convolution becomes apparent. A convolution of functions is defined as:

\[\begin{align} (g * h)(x) &= ∫ g(z)h(x - z)dz \\ &= ∫ g(x - z)h(z)dz \end{align}\]

They are commutative.

Using exchangability of subgradients in covolutions:

\[\begin{align} \mathcal{Q}''_p(x) = \rho_p'' * f \end{align}\]

To find \(\mathcal{Q}''_p(x)\) we rewrite \(\eqref{eq:check}\):

\[\begin{align} \rho_p(z) &= \begin{cases} z(1 - p) = z - zp, & \text{if } 0 < z \\ z(0 - p) = -zp, & \text{if } 0 \geq z \end{cases} \\ \rho_p'(z) &= \begin{cases} 1 - p, & \text{if } z > 0 \\ -p, & \text{if } z < 0 \end{cases} \end{align}\]

The function \(\rho'_p(z)\) jumps from \(-p\) to \(1-p\) at \(0\). So:

\[\begin{align} \rho''_p(x) = \delta_0(z) \quad \text{(Dirac Delta)} \end{align}\]

Now the magical part :

\[\begin{align} \mathcal{Q}''_p(x) = \delta_0 * f = f \end{align}\]

Because Dirac Delta is the identity element for convolutions.

A Probabilistic Example

Simple Example:

\[\begin{align} Y_t & \sim \mathcal{N}(0,\,1) \\ \widehat{X}_{t,1} & \sim \widehat{F}_{1} = \mathcal{N}(-1,\,1) \\ \widehat{X}_{t,2} & \sim \widehat{F}_{2} = \mathcal{N}(3,\,4) \label{eq:dgp_sim1} \end{align}\]

  • True weights vary over \(p\)
  • Figures show the ECDF and calculated weights using \(T=25\) realizations
  • Pointwise solution creates rough estimates
  • Pointwise is better than constant
  • Smooth solution is better than pointwise

The Smoothing Procedures

Penalized cubic B-Splines for smoothing weights:

Let \(\varphi=(\varphi_1,\ldots, \varphi_L)\) be bounded basis functions on \((0,1)\) Then we approximate \(w_{t,k}\) by

\[\begin{align} w_{t,k}^{\text{smooth}} = \sum_{l=1}^L \beta_l \varphi_l = \beta'\varphi \end{align}\]

with parameter vector \(\beta\). The latter is estimated to penalize \(L_2\)-smoothing which minimizes

\[\begin{equation} \| w_{t,k} - \beta' \varphi \|^2_2 + \lambda \| \mathcal{D}^{d} (\beta' \varphi) \|^2_2 \label{eq_function_smooth} \end{equation}\]

with differential operator \(\mathcal{D}\)

Computation is easy, since we have an analytical solution

We receive the constant solution for high values of \(\lambda\) when setting \(d=1\)

Represent weights as linear combinations of bounded basis functions:

\[\begin{equation} w_{t,k} = \sum_{l=1}^L \beta_{t,k,l} \varphi_l = \boldsymbol \beta_{t,k}' \boldsymbol \varphi \end{equation}\]

A popular choice are are B-Splines as local basis functions

\(\boldsymbol \beta_{t,k}\) is calculated using a reduced regret matrix:

\[\begin{equation} \underbrace{\boldsymbol r_{t}}_{\text{LxK}} = \frac{L}{P} \underbrace{\boldsymbol B'}_{\text{LxP}} \underbrace{\left({\boldsymbol{QL}}_{\mathcal{P}}^{\nabla}(\widetilde{\boldsymbol X}_{t},Y_t)- {\boldsymbol{QL}}_{\mathcal{P}}^{\nabla}(\widehat{\boldsymbol X}_{t},Y_t)\right)}_{\text{PxK}} \end{equation}\]

\(\boldsymbol r_{t}\) is transformed from PxK to LxK

If \(L = P\) it holds that \(\boldsymbol \varphi = \boldsymbol{I}\) For \(L = 1\) we receive constant weights

Weights converge to the constant solution if \(L\rightarrow 1\)

The Proposed CRPS-Learning Algorithm


Initialization:

Array of expert predictions: \(\widehat{X}_{t,p,k}\)

Vector of Prediction targets: \(Y_t\)

Starting Weights: \(\boldsymbol w_0=(w_{0,1},\ldots, w_{0,K})\)

Penalization parameter: \(\lambda\geq 0\)

B-spline and penalty matrices \(\boldsymbol B\) and \(\boldsymbol D\) on \(\mathcal{P}= (p_1,\ldots,p_M)\)

Hat matrix: \[\boldsymbol{\mathcal{H}} = \boldsymbol B(\boldsymbol B'\boldsymbol B+ \lambda (\alpha \boldsymbol D_1'\boldsymbol D_1 + (1-\alpha) \boldsymbol D_2'\boldsymbol D_2))^{-1} \boldsymbol B'\]

Cumulative Regret: \(R_{0,k} = 0\)

Range parameter: \(E_{0,k}=0\)

Starting pseudo-weights: \(\boldsymbol \beta_0 = \boldsymbol B^{\text{pinv}}\boldsymbol w_0(\boldsymbol{\mathcal{P}})\)

Core:

for( t in 1:T ) {

    \(\widetilde{\boldsymbol X}_{t} = \text{Sort}\left( \boldsymbol w_{t-1}'(\boldsymbol P) \widehat{\boldsymbol X}_{t} \right)\) # Prediction

    \(\boldsymbol r_{t} = \frac{L}{M} \boldsymbol B' \left({\boldsymbol{QL}}_{\boldsymbol{\mathcal P}}^{\nabla}(\widetilde{\boldsymbol X}_{t},Y_t)- {\boldsymbol{QL}}_{\boldsymbol{\mathcal P}}^{\nabla}(\widehat{\boldsymbol X}_{t},Y_t)\right)\)

    \(\boldsymbol E_{t} = \max(\boldsymbol E_{t-1}, \boldsymbol r_{t}^+ + \boldsymbol r_{t}^-)\)

    \(\boldsymbol V_{t} = \boldsymbol V_{t-1} + \boldsymbol r_{t}^{ \odot 2}\)

    \(\boldsymbol \eta_{t} =\min\left( \left(-\log(\boldsymbol \beta_{0}) \odot \boldsymbol V_{t}^{\odot -1} \right)^{\odot\frac{1}{2}} , \frac{1}{2}\boldsymbol E_{t}^{\odot-1}\right)\)

    \(\boldsymbol R_{t} = \boldsymbol R_{t-1}+ \boldsymbol r_{t} \odot \left( \boldsymbol 1 - \boldsymbol \eta_{t} \odot \boldsymbol r_{t} \right)/2 + \boldsymbol E_{t} \odot \mathbb{1}\{-2\boldsymbol \eta_{t}\odot \boldsymbol r_{t} > 1\}\)

    \(\boldsymbol \beta_{t} = K \boldsymbol \beta_{0} \odot \boldsymbol {SoftMax}\left( - \boldsymbol \eta_{t} \odot \boldsymbol R_{t} + \log( \boldsymbol \eta_{t}) \right)\)

    \(\boldsymbol w_{t}(\boldsymbol P) = \underbrace{\boldsymbol B(\boldsymbol B'\boldsymbol B+ \lambda (\alpha \boldsymbol D_1'\boldsymbol D_1 + (1-\alpha) \boldsymbol D_2'\boldsymbol D_2))^{-1} \boldsymbol B'}_{\boldsymbol{\mathcal{H}}} \boldsymbol B \boldsymbol \beta_{t}\)

}

Simulation Study

Data Generating Process of the simple probabilistic example:

\[\begin{align*} Y_t &\sim \mathcal{N}(0,\,1)\\ \widehat{X}_{t,1} &\sim \widehat{F}_{1}=\mathcal{N}(-1,\,1) \\ \widehat{X}_{t,2} &\sim \widehat{F}_{2}=\mathcal{N}(3,\,4) \end{align*}\]

  • Constant solution \(\lambda \rightarrow \infty\)
  • Pointwise Solution of the proposed BOAG
  • Smoothed Solution of the proposed BOAG
    • Weights are smoothed during learning
    • Smooth weights are used to calculate Regret, adjust weights, etc.

New DGP:

\[\begin{align*} Y_t &\sim \mathcal{N}\left(\frac{\sin(0.005 \pi t )}{2},\,1\right) \\ \widehat{X}_{t,1} &\sim \widehat{F}_{1} = \mathcal{N}(-1,\,1) \\ \widehat{X}_{t,2} &\sim \widehat{F}_{2} = \mathcal{N}(3,\,4) \end{align*}\]

Changing optimal weights

Single run example depicted aside

No forgetting leads to long-term constant weights

 

Application Study

Data:

  • Forecasting European emission allowances (EUA)
  • Daily month-ahead prices
  • Jan 13 - Dec 20 (Phase III, 2092 Obs)
  • Rolling Window (length 250 ~ 1 year)

Combination methods:

  • Online
    • Naive, BOAG, EWAG, ML-PolyG, BMA
  • Batch
    • QRlin, QRconv

Tuning parameter grids:

  • Smoothing Penalty: \(\Lambda= \{0\}\cup \{2^x|x\in \{-4,-3.5,\ldots,12\}\}\)
  • Learning Rates: \(\mathcal{E}= \{2^x|x\in \{-1,-0.5,\ldots,9\}\}\)

Simple exponential smoothing with additive errors (ETS-ANN):

\[\begin{align*} Y_{t} = l_{t-1} + \varepsilon_t \quad \text{with} \quad l_t = l_{t-1} + \alpha \varepsilon_t \quad \text{and} \quad \varepsilon_t \sim \mathcal{N}(0,\sigma^2) \end{align*}\]

Quantile regression (QuantReg): For each \(p \in \mathcal{P}\) we assume:

\[\begin{align*} F^{-1}_{Y_t}(p) = \beta_{p,0} + \beta_{p,1} Y_{t-1} + \beta_{p,2} |Y_{t-1}-Y_{t-2}| \end{align*}\]

ARIMA(1,0,1)-GARCH(1,1) with Gaussian errors (ARMA-GARCH):

\[\begin{align*} Y_{t} = \mu + \phi(Y_{t-1}-\mu) + \theta \varepsilon_{t-1} + \varepsilon_t \quad \text{with} \quad \varepsilon_t = \sigma_t Z, \quad \sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2 \quad \text{and} \quad Z_t \sim \mathcal{N}(0,1) \end{align*}\]

ARIMA(0,1,0)-I-EGARCH(1,1) with Gaussian errors (I-EGARCH):

\[\begin{align*} Y_{t} = \mu + Y_{t-1} + \varepsilon_t \quad \text{with} \quad \varepsilon_t = \sigma_t Z, \quad \log(\sigma_t^2) = \omega + \alpha Z_{t-1}+ \gamma (|Z_{t-1}|-\mathbb{E}|Z_{t-1}|) + \beta \log(\sigma_{t-1}^2) \quad \text{and} \quad Z_t \sim \mathcal{N}(0,1) \end{align*}\]

ARIMA(0,1,0)-GARCH(1,1) with student-t errors (I-GARCHt):

\[\begin{align*} Y_{t} = \mu + Y_{t-1} + \varepsilon_t \quad \text{with} \quad \varepsilon_t = \sigma_t Z, \quad \sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2 \quad \text{and} \quad Z_t \sim t(0,1, \nu) \end{align*}\]

ETS QuantReg ARMA-GARCH I-EGARCH I-GARCHt
2.101(>.999) 1.358(>.999) 0.52(0.993) 0.511(0.999) -0.037(0.406)
BOAG EWAG ML-PolyG BMA QRlin QRconv
Pointwise -0.170(0.055) -0.089(0.175) -0.141(0.112) 0.032(0.771) 3.482(>.999) -0.019(0.309)
B-Constant -0.118(0.146) -0.049(0.305) -0.090(0.218) 0.038(0.834) 4.002(>.999) 0.539(0.996)
P-Constant -0.138(0.020) -0.070(0.137) -0.133(0.026) 0.039(0.851) 5.275(>.999) 0.009(0.683)
B-Smooth -0.173(0.062) -0.065(0.276) -0.141(0.118) -0.042(0.386) - -
P-Smooth -0.182(0.039) -0.107(0.121) -0.160(0.065) 0.040(0.804) 3.495(>.999) -0.012(0.369)

CRPS difference to Naive. Negative values correspond to better performance (the best value is bold).
Additionally, we show the p-value of the DM-test, testing against Naive. The cells are colored with respect to their values (the greener better).

Multivariate Probabilistic CRPS Learning with an Application to Day-Ahead Electricity Prices

Berrisch, J., & Ziel, F. (2024). International Journal of Forecasting, 40(4), 1568-1586.

Multivariate CRPS Learning

We extend the B-Smooth and P-Smooth procedures to the multivariate setting:

Let \(\boldsymbol{\psi}^{\text{mv}}=(\psi_1,\ldots, \psi_{D})\) and \(\boldsymbol{\psi}^{\text{pr}}=(\psi_1,\ldots, \psi_{P})\) be two sets of bounded basis functions on \((0,1)\):

\[\begin{equation*} \boldsymbol w_{t,k} = \boldsymbol{\psi}^{\text{mv}} \boldsymbol{b}_{t,k} {\boldsymbol{\psi}^{pr}}' \end{equation*}\]

with parameter matrix \(\boldsymbol b_{t,k}\). The latter is estimated to penalize \(L_2\)-smoothing which minimizes

\[\begin{align} & \| \boldsymbol{\beta}_{t,d, k}' \boldsymbol{\varphi}^{\text{pr}} - \boldsymbol b_{t, d, k}' \boldsymbol{\psi}^{\text{pr}} \|^2_2 + \lambda^{\text{pr}} \| \mathcal{D}_{q} (\boldsymbol b_{t, d, k}' \boldsymbol{\psi}^{\text{pr}}) \|^2_2 + \nonumber \\ & \| \boldsymbol{\beta}_{t, p, k}' \boldsymbol{\varphi}^{\text{mv}} - \boldsymbol b_{t, p, k}' \boldsymbol{\psi}^{\text{mv}} \|^2_2 + \lambda^{\text{mv}} \| \mathcal{D}_{q} (\boldsymbol b_{t, p, k}' \boldsymbol{\psi}^{\text{mv}}) \|^2_2 \nonumber \end{align}\]

with differential operator \(\mathcal{D}_q\) of order \(q\)

We have an analytical solution.

Linear combinations of bounded basis functions:

\[\begin{equation} \underbrace{\boldsymbol w_{t,k}}_{D \text{ x } P} = \sum_{j=1}^{\widetilde D} \sum_{l=1}^{\widetilde P} \beta_{t,j,l,k} \varphi^{\text{mv}}_{j} \varphi^{\text{pr}}_{l} = \underbrace{\boldsymbol \varphi^{\text{mv}}}_{D\text{ x }\widetilde D} \boldsymbol \beta_{t,k} \underbrace{{\boldsymbol\varphi^{\text{pr}}}'}_{\widetilde P \text{ x }P} \nonumber \end{equation}\]

A popular choice: B-Splines

\(\boldsymbol \beta_{t,k}\) is calculated using a reduced regret matrix:

\(\underbrace{\boldsymbol r_{t,k}}_{\widetilde P \times \widetilde D} = \boldsymbol \varphi^{\text{pr}} \underbrace{\left({\boldsymbol{QL}}_{\mathcal{P}}^{\nabla}(\widetilde{\boldsymbol X}_{t},Y_t)- {\boldsymbol{QL}}_{\mathcal{P}}^{\nabla}(\widehat{\boldsymbol X}_{t},Y_t)\right)}_{\text{PxD}}\boldsymbol \varphi^{\text{mv}}\)

If \(\widetilde P = P\) it holds that \(\boldsymbol \varphi^{pr} = \boldsymbol{I}\) (pointwise)

For \(\widetilde P = 1\) we receive constant weights

Application

Data

  • Day-Ahead electricity price forecasts from Marcjasz et al. (2023)
  • Produced using probabilistic neural networks
  • 24-dimensional distributional forecasts
  • Distribution assumptions: JSU and Normal
  • 8 experts (4 JSU, 4 Normal)
  • 27th Dec. 2018 to 31st Dec. 2020 (736 days)
  • We extract 99 quantiles (percentiles)

Setup

Evaluation: Exclude first 182 observations

Extensions: Penalized smoothing | Forgetting

Tuning strategies:

  • Bayesian Fix
    • Sophisticated Baesian Search algorithm
  • Online
    • Dynamic based on past performance
  • Bayesian Online
    • First Bayesian Fix then Online

Computation Time: ~30 Minutes

Results

JSU1 JSU2 JSU3 JSU4 Norm1 Norm2 Norm3 Norm4 Naive
1.487 1.444 1.499 1.374 1.414 1.535 1.420 1.422 1.295
Description Parameter Tuning BOA ML-Poly EWA
Constant 1.2933 1.2966 1.3188
Pointwise 1.2936 1.3010 1.3101
FTL 1.3752 1.3692 1.3863
B Constant Pr 1.2936 1.3000 1.3432
B Constant Mv 1.2918 1.2945 1.3076
Forget Bayesian Fix 1.2930 1.2956 1.3096
Full Bayesian Fix 1.2905 1.2902 1.2870.
Smooth.forget Bayesian Fix 1.2911 1.2912 1.2869.
Smooth Bayesian Fix 1.2918 1.2917 1.2873.
Forget Bayesian Online 1.2855** 1.2961 1.3098
Full Bayesian Online 1.2919 1.2873. 1.2873.
Smooth.forget Bayesian Online 1.2845** 1.2862* 1.2864.
Smooth Bayesian Online 1.2918 1.2918 1.2874.
Forget Sampling Online 1.2855** 1.2961 1.3114
Full Sampling Online 1.2886 1.2861* 1.2873.
Smooth.forget Sampling Online 1.2845*** 1.2867* 1.2866.
Smooth Sampling Online 1.2918 1.2917 1.2877.
Coloring w.r.t. test statistic: <-5 -4 -3 -2 -1 0 1 2 3 4 >5
Significance denoted by: . p < 0.1; * p < 0.05; ** p < 0.01; *** p < 0.001;

Results

Non-Equidistant Knots

Non-central beta distribution Johnson et al. (1995):

\[\begin{equation*} \mathcal{B}(x, a, b, c) = \sum_{j=0}^{\infty} e^{-c/2} \frac{\left( \frac{c}{2} \right)^j}{j!} I_x \left( a + j , b \right) \end{equation*}\]

Penalty and \(\lambda\) need to be adjusted accordingly Li & Cao (2022)

Using non equidistant knots in profoc is straightforward:

mod <- online(
  y = Y,
  experts = experts,
  tau = 1:99 / 100,
  b_smooth_pr = list(
    knots = 9,
    mu = 0.3,
    sigma = 1,
    nonc = 0,
    tailweight = 1,
    deg = 3
  )
)

Basis specification b_smooth_pr is internally passed to make_basis_mats().

Profoc adjusts (and scales) penatly

Wrap-Up

Potential Downsides:

  • Pointwise optimization can induce quantile crossing
    • Can be solved by sorting the predictions

Important:

  • The choice of the learning rate is crucial
  • The loss function has to meet certain criteria

Upsides:

  • Pointwise learning outperforms the Naive solution significantly
  • Online learning is much faster than batch methods
  • Smoothing further improves the predictive performance
  • Asymptotically not worse than the best convex combination

The profoc R Package:

  • Implements all algorithms discussed above
  • Is written using RcppArmadillo its fast
  • Accepts vectors for most parameters
    • The best parameter combination is chosen online
  • Implements
    • Forgetting, Fixed Share
    • Different loss functions + gradients

Pubications:

Berrisch, J., & Ziel, F. (2023). CRPS learning. Journal of Econometrics, 237(2), 105221.

Berrisch, J., & Ziel, F. (2024). Multivariate probabilistic CRPS learning with an application to day-ahead electricity prices. International Journal of Forecasting, 40(4), 1568-1586.

Modeling Volatility and Dependence of European Carbon and Energy Prices

Berrisch, J., Pappert, S., Ziel, F., & Arsova, A. (2023). Finance Research Letters, 52, 103503.

Motivation

Understanding European Emission Allowances (EUA)

Portfolio & Risk Management,

Sustainability Planing

Political decisions

EUA prices are connected to energy markets

How can the dynamics be characterized?

Several Questions arise:

  • Data (Pre)processing
  • Modeling Approach
  • Evaluation

Data

Daily Observations: 03/15/2010 - 10/14/2022

EUA, Natural Gas, Brent Crude Oil, Coal

  • normalized w.r.t. \(\text{CO}_2\) emissions
  • Adjusted for inflation by Eurostat’s HICP, excluding energy

Emission-adjusted prices reflect one tonne of \(\text{CO}_2\)

Log transformation of the data to stabilize the variance

ADF Test: All series are stationary in first differences

Johansen’s likelihood ratio trace test suggests two cointegrating relationships (only in levels)

Data

Modeling Approach: Notation


  • Let \(\boldsymbol{X}_t\) be a \(K\)-dimensional vector at time \(t\)
  • The forecasting target:
    • Conditional joint distribution
    • \(F_{\boldsymbol{X}_t|\mathcal{F}_{t-1}}\)
    • \(\mathcal{F}_{t}\) is the sigma field generated by all information available up to and including time \(t\)

Sklars theorem: decompose target into

  • marginal distributions: \(F_{X_{k,t}|\mathcal{F}_{t-1}}\) for \(k=1,\ldots, K\), and
  • copula function: \(C_{\boldsymbol{U}_{t}|\mathcal{F}_{t - 1}}\)

Let \(\boldsymbol{x}_t= (x_{1,t},\ldots, x_{K,t})^\intercal\) be the realized values

It holds that:

\[\begin{align} F_{\boldsymbol{X}_t|\mathcal{F}_{t-1}}(\boldsymbol{x}_t) = C_{\boldsymbol{U}_{t}|\mathcal{F}_{t - 1}}(\boldsymbol{u}_t) \nonumber \end{align}\]

with: \(\boldsymbol{u}_t =(u_{1,t},\ldots, u_{K,t})^\intercal\), \(u_{k,t} = F_{X_{k,t}|\mathcal{F}_{t-1}}(x_{k,t})\)

For brevity we drop the conditioning on \(\mathcal{F}_{t-1}\).

The model can be specified as follows

\[\begin{align} F(\boldsymbol{x}_t) = C \left[\mathbf{F}(\boldsymbol{x}_t; \boldsymbol{\mu}_t, \boldsymbol{ \sigma }_{t}^2, \boldsymbol{\nu}, \boldsymbol{\lambda}); \Xi_t, \Theta\right] \nonumber \end{align}\]

\(\Xi_{t}\) denotes time-varying dependence parameters \(\Theta\) denotes time-invariant dependence parameters

We take \(C\) as the \(t\)-copula

Modeling Approach: The General Framework

Individual marginal distributions:

\[\mathbf{F} = (F_1, \ldots, F_K)^{\intercal}\]

Generalized non-central t-distributions

  • Time varying:
    • expectation \(\boldsymbol{\mu}_t = (\mu_{1,t}, \ldots, \mu_{K,t})^{\intercal}\)
    • variance: \(\boldsymbol{\sigma}_{t}^2 = (\sigma_{1,t}^2, \ldots, \sigma_{K,t}^2)^{\intercal}\)
  • Time invariant
    • degrees of freedom: \(\boldsymbol{\nu} = (\nu_1, \ldots, \nu_K)^{\intercal}\)
    • noncentrality: \(\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_K)^{\intercal}\)

VECM Model

\[\begin{align} \Delta \boldsymbol{\mu}_t = \Pi \boldsymbol{x}_{t-1} + \Gamma \Delta \boldsymbol{x}_{t-1} \nonumber \end{align}\]

where \(\Pi = \alpha \beta^{\intercal}\) is the cointegrating matrix of rank \(r\), \(0 \leq r\leq K\).

GARCH model

\[\begin{align} \sigma_{i,t}^2 = & \omega_i + \alpha^+_{i} (\epsilon_{i,t-1}^+)^2 + \alpha^-_{i} (\epsilon_{i,t-1}^-)^2 + \beta_i \sigma_{i,t-1}^2 \nonumber \end{align}\]

where \(\epsilon_{i,t-1}^+ = \max\{\epsilon_{i,t-1}, 0\}\)

Positive vs. negative innovations (capture leverage effects).

Time-varying dependence parameters

\[\begin{align*} \Xi_{t} = & \Lambda\left(\boldsymbol{\xi}_{t}\right) \\ \xi_{ij,t} = & \eta_{0,ij} + \eta_{1,ij} \xi_{ij,t-1} + \eta_{2,ij} z_{i,t-1} z_{j,t-1}, \end{align*}\]

\(z_{i,t}\) is the \(i\)-th standardized residual from time series \(i\)

\(\Lambda(\cdot)\) is a link function:

  • ensures that \(\Xi_{t}\) is a valid variance covariance matrix
  • ensures that \(\Xi_{t}\) does not exceed its support space and remains semi-positive definite

Study Design and Evaluation


Rolling-window forecasting study

  • 3257 observations total
  • Window size: 1000 days (~ four years)
  • We sample 250 of 2227 starting points
  • We draw \(2^{12}= 2048\) trajectories 30 steps ahead

Estimation

Joint (quasi) maximum likelihood estimation:

\[\begin{align*} f_{\mathbf{X}_t}(\mathbf{x}_t | \mathcal{F}_{t-1}) = c\left[\mathbf{F}(\mathbf{x}_t;\boldsymbol{\mu}_t, \boldsymbol{\sigma}_{t}^2, \boldsymbol{\nu}, \boldsymbol{\lambda});\Xi_t, \Theta\right] \cdot \\ \prod_{i=1}^K f_{X_{i,t}}(\mathbf{x}_t;\boldsymbol{\mu}_t, \boldsymbol{\sigma}_{t}^2, \boldsymbol{\nu}, \boldsymbol{\lambda}) \end{align*}\]

The copula density \(c\) can be derived analytically.

Evaluation

Our main objective is the Energy Score (ES)

\[\begin{align*} \text{ES}_t(F, \mathbf{x}_t) = \mathbb{E}_{F} \left(||\tilde{\mathbf{X}}_t - \mathbf{x}_t||_2\right) - \\ \frac{1}{2} \mathbb{E}_F \left(||\tilde{\mathbf{X}}_t - \tilde{\mathbf{X}}_t'||_2 \right) \end{align*}\]

where \(\mathbf{x}_t\) is the observed \(K\)-dimensional realization and \(\tilde{\mathbf{X}}_t\), respectively \(\tilde{\mathbf{X}}_t'\) are independent random vectors distributed according to \(F\)

For univariate cases the Energy Score becomes the Continuous Ranked Probability Score (CRPS)

Results

Relative improvement in ES compared to \(\text{RW}^{\sigma, \rho}\)

Cellcolor: w.r.t. test statistic of Diebold-Mariano test (wether the model outperformes the benchmark, greener = better).

Model \(\text{ES}^{\text{All}}_{1-30}\) \(\text{ES}^{\text{EUA}}_{1-30}\) \(\text{ES}^{\text{Oil}}_{1-30}\) \(\text{ES}^{\text{NGas}}_{1-30}\) \(\text{ES}^{\text{Coal}}_{1-30}\) \(\text{ES}^{\text{All}}_{1}\) \(\text{ES}^{\text{All}}_{5}\) \(\text{ES}^{\text{All}}_{30}\)
\(\textrm{RW}^{\sigma, \rho}_{}\) 161.96 10.06 37.94 146.73 13.22 5.56 13.28 34.29
\(\textrm{RW}^{\sigma_t, \rho_t}_{}\) 9.40 3.75 -0.41 11.39 4.13 10.34 9.10 7.59
\(\textrm{RW}^{\sigma, \rho_t}_{\textrm{ncp}, \textrm{log}}\) 12.04 6.16 -0.56 14.33 7.35 9.22 9.82 10.02
\(\textrm{RW}^{\sigma, \rho}_{\textrm{log}}\) 12.10 6.25 -0.59 14.44 7.31 9.04 9.66 9.91
\(\textrm{VECM}^{\textrm{r0}, \sigma_t, \rho_t}_{\textrm{lev}, \textrm{ncp}}\) 9.68 -0.72 0.32 11.74 3.70 10.82 10.50 8.21
\(\textrm{VECM}^{\textrm{r0}, \sigma, \rho_t}_{\textrm{log}}\) 12.15 6.10 -0.70 14.57 7.80 8.05 9.99 10.04
\(\textrm{ETS}^{\sigma}\) 9.94 5.75 0.08 13.05 7.83 6.96 7.74 6.21
\(\textrm{ETS}^{\sigma}_{\textrm{log}}\) 8.12 7.80 -0.51 11.17 8.54 5.05 6.14 2.66
\(\textrm{VES}^{\sigma}\) 5.50 -4.43 -3.22 6.29 4.68 -25.99 -2.42 3.07
\(\textrm{VES}^{\sigma}_{\textrm{log}}\) 7.68 3.31 -4.34 9.07 8.30 -22.11 1.07 4.32
Coloring w.r.t. test statistic: <-5 -4 -3 -2 -1 0 1 2 3 4 >5
  • Benchmarks:
    • \(\text{RW}^{\sigma, \rho}\): Random walk with constant volatility and correlation
    • Univariate \(\text{ETS}^{\sigma}\) with constant volatility
    • Vector ETS \(VES^{\sigma}\) with constant volatility
  • Heteroscedasticity is a main driver of ES
  • The VECM model without cointegration (VAR) is the best performing model in terms of ES overall
  • For EUA, the ETS Benchmark is the best performing model in terms of ES
  • CRPS solely evaluates the marginal distributions
    • The cross-sectional dependence is ignored
  • VES models deliver poor performance in short horizons
  • For Oil prices the RW Benchmark can’t be outperformed 30 steps ahead
  • Both VECM models generally deliver good performance

Relative improvement in CRPS compared to \(\text{RW}^{\sigma, \rho}\)

EUA
Oil
NGas
Coal
Model H1 H5 H30 H1 H5 H30 H1 H5 H30 H1 H5 H30
\(\textrm{RW}^{\sigma, \rho}_{}\) 0.4 0.9 2.1 1.5 3.4 9.1 4.7 11.6 29.8 0.3 0.9 2.8
\(\textrm{RW}^{\sigma_t, \rho_t}_{}\) 5.6 6.0 2.8 2.1 2.7 -0.8 12.6 10.5 9.6 10.7 6.5 2.1
\(\textrm{RW}^{\sigma, \rho_t}_{\textrm{ncp}, \textrm{log}}\) 5.1 8.7 5.0 0.7 0.8 -0.4 11.4 11.5 12.4 8.0 7.3 6.7
\(\textrm{RW}^{\sigma, \rho}_{\textrm{log}}\) 4.7 8.9 5.2 0.0 0.3 -0.6 11.2 11.4 12.4 7.7 7.5 6.6
\(\textrm{VECM}^{\textrm{r0}, \sigma_t, \rho_t}_{\textrm{lev}, \textrm{ncp}}\) 3.6 0.6 -1.6 2.7 3.0 0.0 13.1 12.2 10.4 11.8 7.2 1.5
\(\textrm{VECM}^{\textrm{r0}, \sigma, \rho_t}_{\textrm{log}}\) 4.2 8.9 5.1 0.2 0.4 -0.8 9.9 11.8 12.7 7.8 7.9 7.3
\(\textrm{ETS}^{\sigma}\) 0.2 6.8 5.7 1.1 0.9 -0.2 10.9 11.3 10.9 7.5 6.7 5.6
\(\textrm{ETS}^{\sigma}_{\textrm{log}}\) 1.0 8.6 8.0 0.1 0.7 -0.6 8.9 9.4 7.1 7.3 7.8 6.7
\(\textrm{VES}^{\sigma}\) -38.5 -6.4 -5.4 -33.3 -6.1 -2.4 -26.6 -2.6 3.6 -37.5 -5.5 4.7
\(\textrm{VES}^{\sigma}_{\textrm{log}}\) -32.4 2.8 1.8 -30.4 -6.2 -3.2 -22.0 1.8 5.4 -27.0 2.3 6.4
Coloring w.r.t. test statistic: <-5 -4 -3 -2 -1 0 1 2 3 4 >5

RMSE measures the performance of the forecasts at their mean

Some models beat the benchmarks at short horizons

Conclusion: the Improvements seen before must be attributed to other parts of the multivariate predictive distribution

Relative improvement in RMSE compared to \(\text{RW}^{\sigma, \rho}\)

EUA
Oil
NGas
Coal
Model H1 H5 H30 H1 H5 H30 H1 H5 H30 H1 H5 H30
\(\textrm{RW}^{\sigma, \rho}_{}\) 0.9 2.0 5.0 2.9 6.4 16.7 17.8 42.8 85.4 0.9 2.9 7.0
\(\textrm{RW}^{\sigma_t, \rho_t}_{}\) -0.1 -0.1 0.7 0.0 -0.3 -0.1 -0.2 0.3 1.3 -0.2 0.0 -1.8
\(\textrm{RW}^{\sigma, \rho_t}_{\textrm{ncp}, \textrm{log}}\) -270.5 -154.1 -139.9 0.5 -0.5 -2.9 -0.8 0.7 -1.6 0.3 -31.2 -24.5
\(\textrm{RW}^{\sigma, \rho}_{\textrm{log}}\) -705.0 -265.4 -125.2 0.6 0.2 -0.2 -0.4 0.1 -1.6 -0.9 -0.3 -8.3
\(\textrm{VECM}^{\textrm{r0}, \sigma_t, \rho_t}_{\textrm{lev}, \textrm{ncp}}\) -0.9 0.2 0.5 0.5 0.2 0.0 -0.4 0.7 0.2 1.4 0.1 0.2
\(\textrm{VECM}^{\textrm{r0}, \sigma, \rho_t}_{\textrm{log}}\) -271.5 -191.3 -114.3 1.7 -12.3 -3.6 -0.6 1.6 -4.1 0.0 -0.8 -6.7
\(\textrm{ETS}^{\sigma}\) -0.3 0.3 1.6 0.7 0.1 -0.1 0.1 -0.1 0.2 -2.4 -3.9 2.5
\(\textrm{ETS}^{\sigma}_{\textrm{log}}\) -1.0 0.4 1.6 0.9 0.0 -0.1 -1.9 -1.9 -13.9 -0.3 -3.6 -1.8
\(\textrm{VES}^{\sigma}\) -37.4 -8.9 -6.0 -27.9 -7.4 -2.8 -27.2 -9.5 -2.4 -41.7 -1.2 1.6
\(\textrm{VES}^{\sigma}_{\textrm{log}}\) -37.6 -9.2 -7.8 -26.8 -7.3 -3.0 -27.0 -6.8 -3.5 -41.2 -2.2 -0.3
Coloring w.r.t. test statistic: <-5 -4 -3 -2 -1 0 1 2 3 4 >5

Final Remarks

Contributions

Theoretical

Probabilistic Online Learning:

Aggregation
Regression

Practical

Applications

Energy Commodities
Electricity Prices
Electricity Load

Well received by the academic community:

of papers already published

108 citations since 2020 (Google Scholar)

Software

R Packages:

profoc, rcpptimer, dccpp

Python Packages:

ondil, sstudentt

Contributions to other projects:

RcppArmadillo
gamlss
NixOS/nixpkgs
OpenPrinting/foomatic-db

Awards:

Berrisch, J., Narajewski, M., & Ziel, F. (2023):

Won Western Power Distribution Competition
Won Best-Student-Presentation Award

Questions!

Artwork by @allison_horst

References

Berrisch, J. (2025). Rcpptimer: Rcpp tic-toc timer with OpenMP support. arXiv Preprint arXiv:2501.15856, abs/2501.15856. DOI: 10.48550/arXiv.2501.15856
Berrisch, J., Narajewski, M., & Ziel, F. (2023). High-resolution peak demand estimation using generalized additive models and deep neural networks. Energy and AI, 13, 100236. DOI: 10.1016/j.egyai.2023.100236
Berrisch, J., Pappert, S., Ziel, F., & Arsova, A. (2023). Modeling volatility and dependence of european carbon and energy prices. Finance Research Letters, 52, 103503. DOI: 10.1016/j.frl.2022.103503
Berrisch, J., & Ziel, F. (2022). Distributional modeling and forecasting of natural gas prices. Journal of Forecasting, 41(6), 1065–1086. DOI: 10.1002/for.2853
Berrisch, J., & Ziel, F. (2023). CRPS learning. Journal of Econometrics, 237(2, Part C), 105221. DOI: 10.1016/j.jeconom.2021.11.008
Berrisch, J., & Ziel, F. (2024). Multivariate probabilistic CRPS learning with an application to day-ahead electricity prices. International Journal of Forecasting, 40(4), 1568–1586. DOI: 10.1016/j.ijforecast.2024.01.005
Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games (pp. I–XII, 1–394). Cambridge university press. DOI: 10.1017/CBO9780511546921
Gaillard, P., & Wintenberger, O. (2018). Efficient online algorithms for fast-rate regret bounds under sparsity. In S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Proceedings of the 32nd international conference on neural information processing systems (pp. 7026–7036). Cornell University. DOI: 10.48550/arxiv.1805.09174
Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494), 746–762. DOI: 10.1198/jasa.2011.r10138
Gneiting, T., & Raftery, A.E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359–378. DOI: 10.1198/016214506000001437
Hirsch, S., Berrisch, J., & Ziel, F. (2024). Online distributional regression. arXiv Preprint arXiv:2407.08750, abs/2407.08750. DOI: 10.48550/arXiv.2407.08750
Johnson, N.L., Kotz, S., & Balakrishnan, N. (1995). Continuous univariate distributions, volume 2 (Vol. 289). John wiley & sons.
Li, Z., & Cao, J. (2022). General p-splines for non-uniform b-splines. arXiv Preprint. DOI: 10.48550/arXiv.2201.06808
Marcjasz, G., Narajewski, M., Weron, R., & Ziel, F. (2023). Distributional neural networks for electricity price forecasting. Energy Economics, 125, 106843. DOI: 10.1016/j.eneco.2023.106843
Taylor, J.W., & Meng, X. (2023). Angular combining of forecasts of probability distributions. arXiv Preprint arXiv:2305.16735. DOI: 10.48550/arXiv.2305.16735
Wintenberger, O. (2017). Optimal learning with bernstein online aggregation. Machine Learning, 106(1), 119–141. DOI: 10.1007/s10994-016-5592-6