Data Science Methods for Forecasting in Energy and Economics

Jonathan Berrisch

University of Duisburg-Essen, House of Energy, Climate, and Finance

2025-06-30

The beginning: June 2020

High-Level View

Motivation

	Energy market liberalization created complex, interconnected trading systems
	Renewable energy transition introduces uncertainty and volatility from weather-dependent generation
	Traditional point forecasts are inadequate for modern energy markets with increasing uncertainty
	Risk inherently is a probabilistic concept
	Probabilistic forecasting essential for risk management, planning and decision making in volatile energy environments
	Online learning methods needed for fast-updating models with streaming energy data

Overview of the Thesis

	Berrisch, J., & Ziel, F. (2023). CRPS learning. Journal of Econometrics, 237(2), 105221.
	Berrisch, J., & Ziel, F. (2024). Multivariate probabilistic CRPS learning with an application to day-ahead electricity prices. International Journal of Forecasting, 40(4), 1568–1586.
	Hirsch, S., Berrisch, J., & Ziel, F. (2024). Online Distributional Regression. arXiv preprint arXiv:2407.08750.
	Berrisch, J., & Ziel, F. (2022). Distributional modeling and forecasting of natural gas prices. Journal of Forecasting, 41(6), 1065–1086.
	Berrisch, J., Pappert, S., Ziel, F., & Arsova, A. (2023). Modeling volatility and dependence of European carbon and energy prices. Finance Research Letters, 52, 103503.
	Berrisch, J., Narajewski, M., & Ziel, F. (2023). High-resolution peak demand estimation using generalized additive models and deep neural networks. Energy and AI, 13, 100236.
	Berrisch, J. (2025). rcpptimer: Rcpp Tic-Toc Timer with OpenMP Support. arXiv preprint arXiv:2501.15856.

Overview of the Thesis

	Berrisch, J., & Ziel, F. (2023). CRPS learning. Journal of Econometrics, 237(2), 105221.
	Berrisch, J., & Ziel, F. (2024). Multivariate probabilistic CRPS learning with an application to day-ahead electricity prices. International Journal of Forecasting, 40(4), 1568–1586.
	Hirsch, S., Berrisch, J., & Ziel, F. (2024). Online Distributional Regression. arXiv preprint arXiv:2407.08750.
	Berrisch, J., & Ziel, F. (2022). Distributional modeling and forecasting of natural gas prices. Journal of Forecasting, 41(6), 1065–1086.
	Berrisch, J., Pappert, S., Ziel, F., & Arsova, A. (2023). Modeling volatility and dependence of European carbon and energy prices. Finance Research Letters, 52, 103503.
	Berrisch, J., Narajewski, M., & Ziel, F. (2023). High-resolution peak demand estimation using generalized additive models and deep neural networks. Energy and AI, 13, 100236.
	Berrisch, J. (2025). rcpptimer: Rcpp Tic-Toc Timer with OpenMP Support. arXiv preprint arXiv:2501.15856.

Overview of the Thesis

	Berrisch, J., & Ziel, F. (2023). CRPS learning. Journal of Econometrics, 237(2), 105221.
	Berrisch, J., & Ziel, F. (2024). Multivariate probabilistic CRPS learning with an application to day-ahead electricity prices. International Journal of Forecasting, 40(4), 1568–1586.
	Hirsch, S., Berrisch, J., & Ziel, F. (2024). Online Distributional Regression. arXiv preprint arXiv:2407.08750.
	Berrisch, J., & Ziel, F. (2022). Distributional modeling and forecasting of natural gas prices. Journal of Forecasting, 41(6), 1065–1086.
	Berrisch, J., Pappert, S., Ziel, F., & Arsova, A. (2023). Modeling volatility and dependence of European carbon and energy prices. Finance Research Letters, 52, 103503.
	Berrisch, J., Narajewski, M., & Ziel, F. (2023). High-resolution peak demand estimation using generalized additive models and deep neural networks. Energy and AI, 13, 100236.
	Berrisch, J. (2025). rcpptimer: Rcpp Tic-Toc Timer with OpenMP Support. arXiv preprint arXiv:2501.15856.

Overview

Online Distributional Regression

Distributional Modeling and Forecasting of Natural Gas Prices

Reduces estimation time by 2-3 orders of magnitude

Maintains competitive forecasting accuracy

Real-World Validation in Energy Markets

Python Package ondil on Github and PyPi

Forecasting of Day-Ahead and Month-Ahead prices

To capture the full distribution of future prices

Extensive data analysis from 2011-2020

State-Space models, skewed Student’s t distribution

Python Package sstudentt on Github and PyPi

Overview

High-Resolution Peak Demand Estimation Using Generalized Additive Models and Deep Neural Networks

rcpptimer: Rcpp Tic-Toc Timer with OpenMP Support

Predict high-resolution electricity peaks using only low-resolution data

Combine GAMs and DNN’s for superior accuracy

Won Western Power Distribution Competition

Won Best-Student-Presentation Award

R Package rcpptimer on Github and CRAN

Provides Rcpp bindings for cpptimer

tic-toc class for timing C++ code

Supports nested, overlapping and scoped timers

Supports OpenMP parallelism

//[[Rcpp::depends(rcpptimer)]]
#include <rcpptimer.h>

//[[Rcpp::export]]
void ex()
{
  Rcpp::Timer timer;
  timer.tic();
  // Code to be timed
  timer.toc();
}

CRPS Learning

Berrisch, J., & Ziel, F. (2023). Journal of Econometrics, 237(2), 105221.

Introduction

The Idea:

Combine multiple forecasts instead of choosing one
Combination weights may vary over time, over the distribution or both

2 Popular options for combining distributions:

Combining across quantiles (this paper)
- Horizontal aggregation, vincentization
Combining across probabilities
- Vertical aggregation
Combining at an angle
- Taylor & Meng (2023)

Time
Distribution

d3 = require("d3@7")

cdf_data = FileAttachment("assets/crps_learning/weights_plot/cdf_data.csv").csv({ typed: true })

function updateChartInner(g, x, y, linesGroup, color, line, data) {
  // Update axes with transitions
  x.domain(d3.extent(data, d => d.x));
  g.select(".x-axis").transition().duration(1500).call(d3.axisBottom(x).ticks(10));
  y.domain([0, d3.max(data, d => d.y)]);
  g.select(".y-axis").transition().duration(1500).call(d3.axisLeft(y).ticks(5));

  // Group data by basis function
  const dataByFunction = Array.from(d3.group(data, d => d.b));
  const keyFn = d => d[0];

  // Update basis function lines
  const u = linesGroup.selectAll("path").data(dataByFunction, keyFn);
  u.join(
    enter => enter.append("path").attr("fill","none").attr("stroke-width",3)
      .attr("stroke", (_, i) => color(i)).attr("d", d => line(d[1].map(pt => ({x: pt.x, y: 0}))))
      .style("opacity",0),
    update => update,
    exit => exit.transition().duration(1000).style("opacity",0).remove()
  )
  .transition().duration(1000)
    .attr("d", d => line(d[1]))
    .attr("stroke", (_, i) => color(i))
    .style("opacity",1);
}

chart = {
  // State variable for selected mu parameter
  let selectedMu = 1;
  
  const filteredData = () => cdf_data.filter(d =>
    Math.abs(selectedMu - d.mu) < 0.001
  );
  
  const container = d3.create("div")
    .style("max-width", "none")
    .style("width", "100%");
    
  const controlsContainer = container.append("div")
    .style("display", "flex")
    .style("gap", "20px")
    .style("align-items", "center");
  
  // Single slider control for mu
  const sliderContainer = controlsContainer.append('div')
    .style('display','flex')
    .style('align-items','center')
    .style('gap','10px')
    .style('flex','1');
    
  sliderContainer.append('label')
    .text('Naive:')
    .style('font-size','20px');
    
  const muSlider = sliderContainer.append('input')
    .attr('type','range')
    .attr('min', 0)
    .attr('max', 1)
    .attr('step', 0.1)
    .property('value', selectedMu)
    .on('input', function(event) {
      selectedMu = +this.value;
      muDisplay.text(selectedMu.toFixed(1));
      updateChart(filteredData());
    })
    .style('width', '100%')
    //.style('-webkit-appearance', 'none')
    .style('appearance', 'none')
    .style('height', '8px')
    .style('background', '#BDBDBDFF');
    
  const muDisplay = sliderContainer.append('span')
    .text(selectedMu.toFixed(1))
    .style('font-size','20px');
  
  // Add Reset button
  controlsContainer.append('button')
    .text('Reset')
    .style('font-size', '20px')
    .style('align-self', 'center')
    .style('margin-left', 'auto')
    .on('click', () => {
      selectedMu = 1;
      muSlider.property('value', selectedMu);
      muDisplay.text(selectedMu.toFixed(1));
      updateChart(filteredData());
    });

   // Build SVG
   const width = 600;
   const height = 450;
   const margin = {top: 40, right: 20, bottom: 40, left: 40};
   const innerWidth = width - margin.left - margin.right;
   const innerHeight = height - margin.top - margin.bottom;

   // Set controls container width to match SVG plot width
   controlsContainer.style("max-width", "none").style("width", "100%");
   // Distribute each control evenly and make sliders full-width
   controlsContainer.selectAll("div").style("flex", "1").style("min-width", "0px");
   controlsContainer.selectAll("input").style("width", "100%").style("box-sizing", "border-box");
   
   // Create scales
   const x = d3.scaleLinear()
     .range([0, innerWidth]);
     
   const y = d3.scaleLinear()
     .range([innerHeight, 0]);
   
   // Create a color scale for the basis functions
   const color = d3.scaleOrdinal(["#80C684FF", "#FFD44EFF", "#D81A5FFF"]);
   
   // Create SVG
   const svg = d3.create("svg")
     .attr("width", "100%")
     .attr("height", "auto")
     .attr("viewBox", [0, 0, width, height])
     .attr("preserveAspectRatio", "xMidYMid meet")
     .attr("style", "max-width: 100%; height: auto;");
   
   // Create the chart group
   const g = svg.append("g")
     .attr("transform", `translate(${margin.left},${margin.top})`);
   
   // Add axes
   const xAxis = g.append("g")
     .attr("transform", `translate(0,${innerHeight})`)
     .attr("class", "x-axis")
     .call(d3.axisBottom(x).ticks(10))
     .style("font-size", "20px");
   
   const yAxis = g.append("g")
     .attr("class", "y-axis")
     .call(d3.axisLeft(y).ticks(5))
     .style("font-size", "20px");
   
   // Add a horizontal line at y = 0
   g.append("line")
     .attr("x1", 0)
     .attr("x2", innerWidth)
     .attr("y1", y(0))
     .attr("y2", y(0))
     .attr("stroke", "#000")
     .attr("stroke-opacity", 0.2);
   
   // Add gridlines
   g.append("g")
     .attr("class", "grid-lines")
     .selectAll("line")
     .data(y.ticks(5))
     .join("line")
     .attr("x1", 0)
     .attr("x2", innerWidth)
     .attr("y1", d => y(d))
     .attr("y2", d => y(d))
     .attr("stroke", "#ccc")
     .attr("stroke-opacity", 0.5);
   
   // Create a line generator
   const line = d3.line()
     .x(d => x(d.x))
     .y(d => y(d.y))
     .curve(d3.curveBasis);
   
   // Group to contain the basis function lines
   const linesGroup = g.append("g")
     .attr("class", "basis-functions");
   
   // Store the current basis functions for transition
   let currentBasisFunctions = new Map();
   
   // Function to update the chart with new data
   function updateChart(data) {
     updateChartInner(g, x, y, linesGroup, color, line, data);
   }
   
   // Store the update function
   svg.node().update = updateChart;
   
   // Initial render
   updateChart(filteredData());
   
   container.node().appendChild(svg.node());
   return container.node();
}

The Framework of Prediction under Expert Advice

Each day, \(t = 1, 2, ... T\)

The forecaster receives predictions \(\widehat{X}_{t,k}\) from \(K\) experts
The forecaster assigns weights \(w_{t,k}\) to each expert
The forecaster calculates the prediction:

\[\begin{equation} \widetilde{X}_{t} = \sum_{k=1}^K w_{t,k} \widehat{X}_{t,k}. \label{eq_forecast_def} \end{equation}\]

The realization for \(t\) is observed

The experts can be institutions, persons, or models

The forecasts can be point-forecasts (i.e., mean or median) or full predictive distributions

Cesa-Bianchi & Lugosi (2006)

The Regret

Weights are updated sequentially according to the past performance of the \(K\) experts.

That is, a loss function \(\ell\) is needed. This is used to compute the cumulative regret \(R_{t,k}\)

\[\begin{equation} R_{t,k} = \widetilde{L}_{t} - \widehat{L}_{t,k} = \sum_{i = 1}^t \ell(\widetilde{X}_{i},Y_i) - \ell(\widehat{X}_{i,k},Y_i)\label{eq:regret} \end{equation}\]

The cumulative regret:

Indicates the predictive accuracy of the expert \(k\) until time \(t\).
Measures how much the forecaster regrets not having followed the expert’s advice

Popular loss functions for point forecasting Gneiting (2011):

\(\ell_2\) loss:

\[\begin{equation} \ell_2(x, y) = | x -y|^2 \label{eq:elltwo} \end{equation}\]

Strictly proper for mean prediction

\(\ell_1\) loss:

\[\begin{equation} \ell_1(x, y) = | x -y| \label{eq:ellone} \end{equation}\]

Strictly proper for median predictions

Popular Aggregation Algorithms

The naive combination

\[\begin{equation} w_{t,k}^{\text{Naive}} = \frac{1}{K}\label{eq:naive_combination} \end{equation}\]

The exponentially weighted average forecaster (EWA)

\[\begin{equation} \begin{aligned} w_{t,k}^{\text{EWA}} & = \frac{e^{\eta R_{t,k}} }{\sum_{k = 1}^K e^{\eta R_{t,k}}}\\ & = \frac{e^{-\eta \ell(\widehat{X}_{t,k},Y_t)} w^{\text{EWA}}_{t-1,k} }{\sum_{k = 1}^K e^{-\eta \ell(\widehat{X}_{t,k},Y_t)} w^{\text{EWA}}_{t-1,k}} \end{aligned}\label{eq:exp_combination} \end{equation}\]

Risk

In stochastic settings, the cumulative Risk should be analyzed Wintenberger (2017):

\[\begin{align} &\underbrace{\widetilde{\mathcal{R}}_t = \sum_{i=1}^t \mathbb{E}[\ell(\widetilde{X}_{i},Y_i)|\mathcal{F}_{i-1}]}_{\text{Cumulative Risk of Forecaster}} \\ &\underbrace{\widehat{\mathcal{R}}_{t,k} = \sum_{i=1}^t \mathbb{E}[\ell(\widehat{X}_{i,k},Y_i)|\mathcal{F}_{i-1}]}_{\text{Cumulative Risk of Experts}} \label{eq_def_cumrisk} \end{align}\]

(7) expected loss of the algorithm (lower = better)

Optimal Convergence

The selection problem

\[\begin{equation} \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) \stackrel{t\to \infty}{\rightarrow} a \quad \text{with} \quad a \leq 0. \label{eq_opt_select} \end{equation}\] The forecaster is asymptotically not worse than the best expert.

The convex aggregation problem

\[\begin{equation} \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\pi} \right) \stackrel{t\to \infty}{\rightarrow} b \quad \text{with} \quad b \leq 0 . \label{eq_opt_conv} \end{equation}\] The forecaster is asymptotically not worse than the best convex combination in hindsight (oracle).

Optimal rates with respect to selection \(\eqref{eq_opt_select}\) and convex aggregation \(\eqref{eq_opt_conv}\) Wintenberger (2017):

\[\begin{align} \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) & = \mathcal{O}\left(\frac{\log(K)}{t}\right)\label{eq_optp_select} \end{align}\]

\[\begin{align} \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\pi} \right) & = \mathcal{O}\left(\sqrt{\frac{\log(K)}{t}}\right) \label{eq_optp_conv} \end{align}\]

Algorithms can satisfy both \(\eqref{eq_optp_select}\) and \(\eqref{eq_optp_conv}\) depending on:

The loss function
Regularity conditions on \(Y_t\) and \(\widehat{X}_{t,k}\)
The weighting scheme

Optimal Convergence

Requirements:

EWA satisfies optimal selection convergence \(\eqref{eq_optp_select}\) in a deterministic setting if:

Loss \(\ell\) is exp-concave

Learning-rate \(\eta\) is chosen correctly

Optimal convex aggregation convergence \(\eqref{eq_optp_conv}\) can be satisfied by applying the kernel-trick:

\[\begin{align} \ell^{\nabla}(x,y) = \ell'(\widetilde{X},y) x \end{align}\]

\(\ell'\) is the subgradient of \(\ell\) at forecast combination \(\widetilde{X}\)

Those results can be converted to any stochastic setting Wintenberger (2017)

Probabilistic Setting

An appropriate choice:

\[\begin{equation*} \text{CRPS}(F, y) = \int_{\mathbb{R}} {(F(x) - \mathbb{1}\{ x > y \})}^2 dx \label{eq:crps} \end{equation*}\]

It’s strictly proper (Gneiting & Raftery, 2007).

Using the CRPS, we can calculate time-adaptive weights \(w_{t,k}\). However, what if the experts’ performance varies in parts of the distribution?

Utilize this relation:

\[\begin{equation*} \text{CRPS}(F, y) = 2 \int_0^{1} \text{QL}_p(F^{-1}(p), y) dp.\label{eq_crps_qs} \end{equation*}\]

… to combine quantiles of the probabilistic forecasts individually using the quantile-loss QL.

QL is convex, but not exp-concave

Bernstein Online Aggregation (BOA) lets us weaken the exp-concavity condition. It satisfies that there exists a \(C>0\) such that for \(x>0\) it holds that

\[\begin{equation} P\left( \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\pi} \right) \leq C \log(\log(t)) \left(\sqrt{\frac{\log(K)}{t}} + \frac{\log(K)+x}{t}\right) \right) \geq 1-e^{-x} \label{eq_boa_opt_conv} \end{equation}\]

if the loss function is convex.

Almost optimal w.r.t. convex aggregation \(\eqref{eq_optp_conv}\) Wintenberger (2017).

The same algorithm satisfies that there exists a \(C>0\) such that for \(x>0\) it holds that \[\begin{equation} P\left( \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) \leq C\left(\frac{\log(K)+\log(\log(Gt))+ x}{\alpha t}\right)^{\frac{1}{2-\beta}} \right) \geq 1-2e^{-x} \label{eq_boa_opt_select} \end{equation}\]

if the loss \(\ell\) is \(G\)-Lipschitz and weak exp-concave in its first coordinate

Almost optimal w.r.t. selection \(\eqref{eq_optp_select}\) Gaillard & Wintenberger (2018).

We show that this holds for QL under feasible conditions.

Proposition 1: The Power of Flexibility

\[\begin{align} 2\overline{\widehat{\mathcal{R}}}^{\text{QL}}_{t,\min} & \leq \widehat{\mathcal{R}}^{\text{CRPS}}_{t,\min} \label{eq_risk_ql_crps_expert} \\ 2\overline{\widehat{\mathcal{R}}}^{\text{QL}}_{t,\pi} & \leq \widehat{\mathcal{R}}^{\text{CRPS}}_{t,\pi} . \label{eq_risk_ql_crps_convex} \end{align}\]

Pointwise can outperform constant procedures

\(\text{QL}\) is convex: almost optimal convergence w.r.t. convex aggregation \(\eqref{eq_boa_opt_conv}\)

For almost optimal convergence w.r.t. selection \(\eqref{eq_boa_opt_select}\) we need (Gaillard & Wintenberger, 2018):

A1: Lipschitz Continuity

A2: Weak Exp-Concavity

QL is Lipschitz continuous with \(G=\max(p, 1-p)\):

A1 holds

A1: Lipschitz Continuity

For some \(G>0\) it holds for all \(x_1,x_2\in \mathbb{R}\) and \(t>0\) that

\[ | \ell(x_1, Y_t)-\ell(x_2, Y_t) | \leq G |x_1-x_2|\]

A2 Weak Exp-Concavity

For some \(\alpha>0\), \(\beta\in[0,1]\) it holds for all \(x_1,x_2 \in \mathbb{R}\) and \(t>0\) that

\[\begin{align*} \mathbb{E}[ & \ell(x_1, Y_t)-\ell(x_2, Y_t) | \mathcal{F}_{t-1}] \leq \\ & \mathbb{E}[ \ell'(x_1, Y_t)(x_1 - x_2) |\mathcal{F}_{t-1}] \\ & + \mathbb{E}\left[ \left. \left( \alpha(\ell'(x_1, Y_t)(x_1 - x_2))^{2}\right)^{1/\beta} \right|\mathcal{F}_{t-1}\right] \end{align*}\]

The strongest case is \(\beta=1\) (Strong Convexity)

Conditional quantile risk: \(\mathcal{Q}_p(x) = \mathbb{E}[ \text{QL}_p(x, Y_t) | \mathcal{F}_{t-1}]\).

convexity properties of \(\mathcal{Q}_p\) depend on the conditional distribution \(Y_t|\mathcal{F}_{t-1}\).

Proposition 2

Let \(Y\) be a univariate random variable with (Radon-Nikodym) \(\nu\)-density \(f\), then for the second subderivative of the quantile risk \(\mathcal{Q}_p(x) = \mathbb{E}[ \text{QL}_p(x, Y) ]\) of \(Y\) it holds for all \(p\in(0,1)\) that \(\mathcal{Q}_p'' = f.\) Additionally, if \(f\) is a continuous Lebesgue-density with \(f\geq\gamma>0\) for some constant \(\gamma>0\) on its support \(\text{spt}(f)\) then \(\mathcal{Q}_p\) is \(\gamma\)-strongly convex.

This implies satisfaction of condition A2 with \(\beta=1\) and \(\alpha = \gamma / 2G^2\) (Gaillard & Wintenberger, 2018)

Theorem 1

The gradient based fully adaptive Bernstein online aggregation (BOAG) applied pointwise for all \(p\in(0,1)\) on \(\text{QL}\) satisfies \(\eqref{eq_boa_opt_conv}\) with minimal CRPS given by

\[\widehat{\mathcal{R}}_{t,\pi} = 2\overline{\widehat{\mathcal{R}}}^{\text{QL}}_{t,\pi}.\]

If \(Y_t|\mathcal{F}_{t-1}\) is bounded and has a pdf \(f_t\) satisfying \(f_t>\gamma >0\) on its support \(\text{spt}(f_t)\) then \(\eqref{eq_boa_opt_select}\) holds with \(\beta=1\) and

\[\widehat{\mathcal{R}}_{t,\min} = 2\overline{\widehat{\mathcal{R}}}^{\text{QL}}_{t,\min}\]

BOAG with \(\text{QL}\) satisfies \(\eqref{eq_boa_opt_conv}\) and \(\eqref{eq_boa_opt_select}\)

First, rewrite \(QL_p\) using the check function:

\[\begin{align} \rho_p(z) = z(\mathbb{1}(0 < z) - p) \label{eq:check} \end{align}\]

\[\begin{align} QL_p(x, y) &= (\mathbb{1}(y < x) - p)(x - y) \\ &= \rho_p(x-y) \end{align}\]

Now we can express the quantile risk as:

\[\begin{align} \mathcal{Q}_p(x) = \mathbb{E}[\rho_p(x-y)] = ∫ \rho_p(x-y)f(y)dy \end{align}\]

This integral form is where the convolution becomes apparent. A convolution of functions is defined as:

\[\begin{align} (g * h)(x) &= ∫ g(z)h(x - z)dz \\ &= ∫ g(x - z)h(z)dz \end{align}\]

They are commutative.

Using exchangability of subgradients in covolutions:

\[\begin{align} \mathcal{Q}''_p(x) = \rho_p'' * f \end{align}\]

To find \(\mathcal{Q}''_p(x)\) we rewrite \(\eqref{eq:check}\):

\[\begin{align} \rho_p(z) &= \begin{cases} z(1 - p) = z - zp, & \text{if } 0 < z \\ z(0 - p) = -zp, & \text{if } 0 \geq z \end{cases} \\ \rho_p'(z) &= \begin{cases} 1 - p, & \text{if } z > 0 \\ -p, & \text{if } z < 0 \end{cases} \end{align}\]

The function \(\rho'_p(z)\) jumps from \(-p\) to \(1-p\) at \(0\). So:

\[\begin{align} \rho''_p(x) = \delta_0(z) \quad \text{(Dirac Delta)} \end{align}\]

Now the magical part :

\[\begin{align} \mathcal{Q}''_p(x) = \delta_0 * f = f \end{align}\]

Because Dirac Delta is the identity element for convolutions.

A Probabilistic Example

Simple Example:

\[\begin{align} Y_t & \sim \mathcal{N}(0,\,1) \\ \widehat{X}_{t,1} & \sim \widehat{F}_{1} = \mathcal{N}(-1,\,1) \\ \widehat{X}_{t,2} & \sim \widehat{F}_{2} = \mathcal{N}(3,\,4) \label{eq:dgp_sim1} \end{align}\]

True weights vary over \(p\)
Figures show the ECDF and calculated weights using \(T=25\) realizations
Pointwise solution creates rough estimates
Pointwise is better than constant
Smooth solution is better than pointwise

CDFs
Weights

The Smoothing Procedures

Penalized Smoothing
Basis Smoothing

Penalized cubic B-Splines for smoothing weights:

Let \(\varphi=(\varphi_1,\ldots, \varphi_L)\) be bounded basis functions on \((0,1)\) Then we approximate \(w_{t,k}\) by

\[\begin{align} w_{t,k}^{\text{smooth}} = \sum_{l=1}^L \beta_l \varphi_l = \beta'\varphi \end{align}\]

with parameter vector \(\beta\). The latter is estimated to penalize \(L_2\)-smoothing which minimizes

\[\begin{equation} \| w_{t,k} - \beta' \varphi \|^2_2 + \lambda \| \mathcal{D}^{d} (\beta' \varphi) \|^2_2 \label{eq_function_smooth} \end{equation}\]

with differential operator \(\mathcal{D}\)

Computation is easy, since we have an analytical solution

We receive the constant solution for high values of \(\lambda\) when setting \(d=1\)

Represent weights as linear combinations of bounded basis functions:

\[\begin{equation} w_{t,k} = \sum_{l=1}^L \beta_{t,k,l} \varphi_l = \boldsymbol \beta_{t,k}' \boldsymbol \varphi \end{equation}\]

A popular choice are are B-Splines as local basis functions

\(\boldsymbol \beta_{t,k}\) is calculated using a reduced regret matrix:

\[\begin{equation} \underbrace{\boldsymbol r_{t}}_{\text{LxK}} = \frac{L}{P} \underbrace{\boldsymbol B'}_{\text{LxP}} \underbrace{\left({\boldsymbol{QL}}_{\mathcal{P}}^{\nabla}(\widetilde{\boldsymbol X}_{t},Y_t)- {\boldsymbol{QL}}_{\mathcal{P}}^{\nabla}(\widehat{\boldsymbol X}_{t},Y_t)\right)}_{\text{PxK}} \end{equation}\]

\(\boldsymbol r_{t}\) is transformed from PxK to LxK

If \(L = P\) it holds that \(\boldsymbol \varphi = \boldsymbol{I}\) For \(L = 1\) we receive constant weights

Weights converge to the constant solution if \(L\rightarrow 1\)

The Proposed CRPS-Learning Algorithm

Initialization:

Array of expert predictions: \(\widehat{X}_{t,p,k}\)

Vector of Prediction targets: \(Y_t\)

Starting Weights: \(\boldsymbol w_0=(w_{0,1},\ldots, w_{0,K})\)

Penalization parameter: \(\lambda\geq 0\)

B-spline and penalty matrices \(\boldsymbol B\) and \(\boldsymbol D\) on \(\mathcal{P}= (p_1,\ldots,p_M)\)

Hat matrix: \[\boldsymbol{\mathcal{H}} = \boldsymbol B(\boldsymbol B'\boldsymbol B+ \lambda (\alpha \boldsymbol D_1'\boldsymbol D_1 + (1-\alpha) \boldsymbol D_2'\boldsymbol D_2))^{-1} \boldsymbol B'\]

Cumulative Regret: \(R_{0,k} = 0\)

Range parameter: \(E_{0,k}=0\)

Starting pseudo-weights: \(\boldsymbol \beta_0 = \boldsymbol B^{\text{pinv}}\boldsymbol w_0(\boldsymbol{\mathcal{P}})\)

Core:

for( t in 1:T ) {

\(\widetilde{\boldsymbol X}_{t} = \text{Sort}\left( \boldsymbol w_{t-1}'(\boldsymbol P) \widehat{\boldsymbol X}_{t} \right)\) # Prediction

\(\boldsymbol r_{t} = \frac{L}{M} \boldsymbol B' \left({\boldsymbol{QL}}_{\boldsymbol{\mathcal P}}^{\nabla}(\widetilde{\boldsymbol X}_{t},Y_t)- {\boldsymbol{QL}}_{\boldsymbol{\mathcal P}}^{\nabla}(\widehat{\boldsymbol X}_{t},Y_t)\right)\)

\(\boldsymbol E_{t} = \max(\boldsymbol E_{t-1}, \boldsymbol r_{t}^+ + \boldsymbol r_{t}^-)\)

\(\boldsymbol V_{t} = \boldsymbol V_{t-1} + \boldsymbol r_{t}^{ \odot 2}\)

\(\boldsymbol \eta_{t} =\min\left( \left(-\log(\boldsymbol \beta_{0}) \odot \boldsymbol V_{t}^{\odot -1} \right)^{\odot\frac{1}{2}} , \frac{1}{2}\boldsymbol E_{t}^{\odot-1}\right)\)

\(\boldsymbol R_{t} = \boldsymbol R_{t-1}+ \boldsymbol r_{t} \odot \left( \boldsymbol 1 - \boldsymbol \eta_{t} \odot \boldsymbol r_{t} \right)/2 + \boldsymbol E_{t} \odot \mathbb{1}\{-2\boldsymbol \eta_{t}\odot \boldsymbol r_{t} > 1\}\)

\(\boldsymbol \beta_{t} = K \boldsymbol \beta_{0} \odot \boldsymbol {SoftMax}\left( - \boldsymbol \eta_{t} \odot \boldsymbol R_{t} + \log( \boldsymbol \eta_{t}) \right)\)

\(\boldsymbol w_{t}(\boldsymbol P) = \underbrace{\boldsymbol B(\boldsymbol B'\boldsymbol B+ \lambda (\alpha \boldsymbol D_1'\boldsymbol D_1 + (1-\alpha) \boldsymbol D_2'\boldsymbol D_2))^{-1} \boldsymbol B'}_{\boldsymbol{\mathcal{H}}} \boldsymbol B \boldsymbol \beta_{t}\)

}

Simulation Study

BOA
Comparison to EWA and ML-Poly
Study Forget

Data Generating Process of the simple probabilistic example:

\[\begin{align*} Y_t &\sim \mathcal{N}(0,\,1)\\ \widehat{X}_{t,1} &\sim \widehat{F}_{1}=\mathcal{N}(-1,\,1) \\ \widehat{X}_{t,2} &\sim \widehat{F}_{2}=\mathcal{N}(3,\,4) \end{align*}\]

Constant solution \(\lambda \rightarrow \infty\)
Pointwise Solution of the proposed BOAG
Smoothed Solution of the proposed BOAG
- Weights are smoothed during learning
- Smooth weights are used to calculate Regret, adjust weights, etc.

QL Deviation
CRPS vs. Lambda
Knots

New DGP:

\[\begin{align*} Y_t &\sim \mathcal{N}\left(\frac{\sin(0.005 \pi t )}{2},\,1\right) \\ \widehat{X}_{t,1} &\sim \widehat{F}_{1} = \mathcal{N}(-1,\,1) \\ \widehat{X}_{t,2} &\sim \widehat{F}_{2} = \mathcal{N}(3,\,4) \end{align*}\]

Changing optimal weights

Single run example depicted aside

No forgetting leads to long-term constant weights

Application Study

Overview
Experts
Results

Data:

Forecasting European emission allowances (EUA)
Daily month-ahead prices
Jan 13 - Dec 20 (Phase III, 2092 Obs)
Rolling Window (length 250 ~ 1 year)

Combination methods:

Online
- Naive, BOAG, EWAG, ML-PolyG, BMA
Batch
- QRlin, QRconv

Tuning parameter grids:

Smoothing Penalty: \(\Lambda= \{0\}\cup \{2^x|x\in \{-4,-3.5,\ldots,12\}\}\)
Learning Rates: \(\mathcal{E}= \{2^x|x\in \{-1,-0.5,\ldots,9\}\}\)

Simple exponential smoothing with additive errors (ETS-ANN):

\[\begin{align*} Y_{t} = l_{t-1} + \varepsilon_t \quad \text{with} \quad l_t = l_{t-1} + \alpha \varepsilon_t \quad \text{and} \quad \varepsilon_t \sim \mathcal{N}(0,\sigma^2) \end{align*}\]

Quantile regression (QuantReg): For each \(p \in \mathcal{P}\) we assume:

\[\begin{align*} F^{-1}_{Y_t}(p) = \beta_{p,0} + \beta_{p,1} Y_{t-1} + \beta_{p,2} |Y_{t-1}-Y_{t-2}| \end{align*}\]

ARIMA(1,0,1)-GARCH(1,1) with Gaussian errors (ARMA-GARCH):

\[\begin{align*} Y_{t} = \mu + \phi(Y_{t-1}-\mu) + \theta \varepsilon_{t-1} + \varepsilon_t \quad \text{with} \quad \varepsilon_t = \sigma_t Z, \quad \sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2 \quad \text{and} \quad Z_t \sim \mathcal{N}(0,1) \end{align*}\]

ARIMA(0,1,0)-I-EGARCH(1,1) with Gaussian errors (I-EGARCH):

\[\begin{align*} Y_{t} = \mu + Y_{t-1} + \varepsilon_t \quad \text{with} \quad \varepsilon_t = \sigma_t Z, \quad \log(\sigma_t^2) = \omega + \alpha Z_{t-1}+ \gamma (|Z_{t-1}|-\mathbb{E}|Z_{t-1}|) + \beta \log(\sigma_{t-1}^2) \quad \text{and} \quad Z_t \sim \mathcal{N}(0,1) \end{align*}\]

ARIMA(0,1,0)-GARCH(1,1) with student-t errors (I-GARCHt):

\[\begin{align*} Y_{t} = \mu + Y_{t-1} + \varepsilon_t \quad \text{with} \quad \varepsilon_t = \sigma_t Z, \quad \sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2 \quad \text{and} \quad Z_t \sim t(0,1, \nu) \end{align*}\]

Significance
QL
Cumulative Loss Difference
Weights (BOAG P-Smooth)
Weights (Last)

ETS	QuantReg	ARMA-GARCH	I-EGARCH	I-GARCHt
2.101^(>.999)	1.358^(>.999)	0.52^(0.993)	0.511^(0.999)	-0.037^(0.406)

	BOAG	EWAG	ML-PolyG	BMA	QRlin	QRconv
Pointwise	-0.170^(0.055)	-0.089^(0.175)	-0.141^(0.112)	0.032^(0.771)	3.482^(>.999)	-0.019^(0.309)
B-Constant	-0.118^(0.146)	-0.049^(0.305)	-0.090^(0.218)	0.038^(0.834)	4.002^(>.999)	0.539^(0.996)
P-Constant	-0.138^(0.020)	-0.070^(0.137)	-0.133^(0.026)	0.039^(0.851)	5.275^(>.999)	0.009^(0.683)
B-Smooth	-0.173^(0.062)	-0.065^(0.276)	-0.141^(0.118)	-0.042^(0.386)	-	-
P-Smooth	-0.182^(0.039)	-0.107^(0.121)	-0.160^(0.065)	0.040^(0.804)	3.495^(>.999)	-0.012^(0.369)

CRPS difference to Naive. Negative values correspond to better performance (the best value is bold).
Additionally, we show the p-value of the DM-test, testing against Naive. The cells are colored with respect to their values (the greener better).

Multivariate Probabilistic CRPS Learning with an Application to Day-Ahead Electricity Prices

Berrisch, J., & Ziel, F. (2024). International Journal of Forecasting, 40(4), 1568-1586.

Multivariate CRPS Learning

We extend the B-Smooth and P-Smooth procedures to the multivariate setting:

Penalized Smoothing
Basis Smoothing

Let \(\boldsymbol{\psi}^{\text{mv}}=(\psi_1,\ldots, \psi_{D})\) and \(\boldsymbol{\psi}^{\text{pr}}=(\psi_1,\ldots, \psi_{P})\) be two sets of bounded basis functions on \((0,1)\):

\[\begin{equation*} \boldsymbol w_{t,k} = \boldsymbol{\psi}^{\text{mv}} \boldsymbol{b}_{t,k} {\boldsymbol{\psi}^{pr}}' \end{equation*}\]

with parameter matrix \(\boldsymbol b_{t,k}\). The latter is estimated to penalize \(L_2\)-smoothing which minimizes

\[\begin{align} & \| \boldsymbol{\beta}_{t,d, k}' \boldsymbol{\varphi}^{\text{pr}} - \boldsymbol b_{t, d, k}' \boldsymbol{\psi}^{\text{pr}} \|^2_2 + \lambda^{\text{pr}} \| \mathcal{D}_{q} (\boldsymbol b_{t, d, k}' \boldsymbol{\psi}^{\text{pr}}) \|^2_2 + \nonumber \\ & \| \boldsymbol{\beta}_{t, p, k}' \boldsymbol{\varphi}^{\text{mv}} - \boldsymbol b_{t, p, k}' \boldsymbol{\psi}^{\text{mv}} \|^2_2 + \lambda^{\text{mv}} \| \mathcal{D}_{q} (\boldsymbol b_{t, p, k}' \boldsymbol{\psi}^{\text{mv}}) \|^2_2 \nonumber \end{align}\]

with differential operator \(\mathcal{D}_q\) of order \(q\)

We have an analytical solution.

Linear combinations of bounded basis functions:

\[\begin{equation} \underbrace{\boldsymbol w_{t,k}}_{D \text{ x } P} = \sum_{j=1}^{\widetilde D} \sum_{l=1}^{\widetilde P} \beta_{t,j,l,k} \varphi^{\text{mv}}_{j} \varphi^{\text{pr}}_{l} = \underbrace{\boldsymbol \varphi^{\text{mv}}}_{D\text{ x }\widetilde D} \boldsymbol \beta_{t,k} \underbrace{{\boldsymbol\varphi^{\text{pr}}}'}_{\widetilde P \text{ x }P} \nonumber \end{equation}\]

A popular choice: B-Splines

\(\boldsymbol \beta_{t,k}\) is calculated using a reduced regret matrix:

\(\underbrace{\boldsymbol r_{t,k}}_{\widetilde P \times \widetilde D} = \boldsymbol \varphi^{\text{pr}} \underbrace{\left({\boldsymbol{QL}}_{\mathcal{P}}^{\nabla}(\widetilde{\boldsymbol X}_{t},Y_t)- {\boldsymbol{QL}}_{\mathcal{P}}^{\nabla}(\widehat{\boldsymbol X}_{t},Y_t)\right)}_{\text{PxD}}\boldsymbol \varphi^{\text{mv}}\)

If \(\widetilde P = P\) it holds that \(\boldsymbol \varphi^{pr} = \boldsymbol{I}\) (pointwise)

For \(\widetilde P = 1\) we receive constant weights

Application

Data

Day-Ahead electricity price forecasts from Marcjasz et al. (2023)
Produced using probabilistic neural networks
24-dimensional distributional forecasts
Distribution assumptions: JSU and Normal
8 experts (4 JSU, 4 Normal)
27th Dec. 2018 to 31st Dec. 2020 (736 days)
We extract 99 quantiles (percentiles)

Setup

Evaluation: Exclude first 182 observations

Extensions: Penalized smoothing | Forgetting

Tuning strategies:

Bayesian Fix
- Sophisticated Baesian Search algorithm
Online
- Dynamic based on past performance
Bayesian Online
- First Bayesian Fix then Online

Computation Time: ~30 Minutes

Results

JSU1	JSU2	JSU3	JSU4	Norm1	Norm2	Norm3	Norm4	Naive
1.487	1.444	1.499	1.374	1.414	1.535	1.420	1.422	1.295

Description	Parameter Tuning	BOA	ML-Poly	EWA
Constant		1.2933	1.2966	1.3188
Pointwise		1.2936	1.3010	1.3101
FTL		1.3752	1.3692	1.3863
B Constant Pr		1.2936	1.3000	1.3432
B Constant Mv		1.2918	1.2945	1.3076
Forget	Bayesian Fix	1.2930	1.2956	1.3096
Full	Bayesian Fix	1.2905	1.2902	1.2870.
Smooth.forget	Bayesian Fix	1.2911	1.2912	1.2869.
Smooth	Bayesian Fix	1.2918	1.2917	1.2873.
Forget	Bayesian Online	1.2855**	1.2961	1.3098
Full	Bayesian Online	1.2919	1.2873.	1.2873.
Smooth.forget	Bayesian Online	1.2845**	1.2862*	1.2864.
Smooth	Bayesian Online	1.2918	1.2918	1.2874.
Forget	Sampling Online	1.2855**	1.2961	1.3114
Full	Sampling Online	1.2886	1.2861*	1.2873.
Smooth.forget	Sampling Online	1.2845***	1.2867*	1.2866.
Smooth	Sampling Online	1.2918	1.2917	1.2877.

Coloring w.r.t. test statistic: <-5 -4 -3 -2 -1 0 1 2 3 4 >5

Significance denoted by: . p < 0.1; * p < 0.05; ** p < 0.01; *** p < 0.001;

Constant
Pointwise
B Constant PR
B Constant MV
Smooth.Forget

Results

Chosen Parameters
Weights: Hour 16:00-17:00
Weights: Median

Non-Equidistant Knots

Knot Placement Illustration
Knot Placement Details

bsplineData = FileAttachment("assets/mcrps_learning/basis_functions.csv").csv({ typed: true })

// Defined above
// function updateChartInner(g, x, y, linesGroup, color, line, data) {
//   // Update axes with transitions
//   x.domain([0, d3.max(data, d => d.x)]);
//   g.select(".x-axis").transition().duration(1500).call(d3.axisBottom(x).ticks(10));
//   y.domain([0, d3.max(data, d => d.y)]);
//   g.select(".y-axis").transition().duration(1500).call(d3.axisLeft(y).ticks(5));

//   // Group data by basis function
//   const dataByFunction = Array.from(d3.group(data, d => d.b));
//   const keyFn = d => d[0];

//   // Update basis function lines
//   const u = linesGroup.selectAll("path").data(dataByFunction, keyFn);
//   u.join(
//     enter => enter.append("path").attr("fill","none").attr("stroke-width",3)
//       .attr("stroke", (_, i) => color(i)).attr("d", d => line(d[1].map(pt => ({x: pt.x, y: 0}))))
//       .style("opacity",0),
//     update => update,
//     exit => exit.transition().duration(1000).style("opacity",0).remove()
//   )
//   .transition().duration(1000)
//     .attr("d", d => line(d[1]))
//     .attr("stroke", (_, i) => color(i))
//     .style("opacity",1);
// }

chart0 = {
  // State variables for selected parameters
  let selectedMu = 0.5;
  let selectedSig = 1;
  let selectedNonc = 0;
  let selectedTailw = 1;
  const filteredData = () => bsplineData.filter(d =>
    Math.abs(selectedMu - d.mu) < 0.001 &&
    d.sig === selectedSig &&
    d.nonc === selectedNonc &&
    d.tailw === selectedTailw
  );
  const container = d3.create("div")
    .style("max-width", "none")
    .style("width", "100%");;
  const controlsContainer = container.append("div")
    .style("display", "flex")
    .style("gap", "20px");
  // slider controls
  const sliders = [
    { label: 'Mu', get: () => selectedMu, set: v => selectedMu = v, min: 0.1, max: 0.9, step: 0.2 },
    { label: 'Sigma', get: () => Math.log2(selectedSig), set: v => selectedSig = 2 ** v, min: -2, max: 2, step: 1 },
    { label: 'Noncentrality', get: () => selectedNonc, set: v => selectedNonc = v, min: -4, max: 4, step: 2 },
    { label: 'Tailweight', get: () => Math.log2(selectedTailw), set: v => selectedTailw = 2 ** v, min: -2, max: 2, step: 1 }
  ];
  // Build slider controls with D3 data join
  const sliderCont = controlsContainer.selectAll('div').data(sliders).join('div')
    .style('display','flex').style('align-items','center').style('gap','10px')
    .style('flex','1').style('min-width','0px');
  sliderCont.append('label').text(d => d.label + ':').style('font-size','20px');
  sliderCont.append('input')
    .attr('type','range').attr('min', d => d.min).attr('max', d => d.max).attr('step', d => d.step)
    .property('value', d => d.get())
    .on('input', function(event, d) {
      const val = +this.value; d.set(val);
      d3.select(this.parentNode).select('span').text(d.label.match(/Sigma|Tailweight/) ? 2**val : val);
      updateChart(filteredData());
    })
    .style('width', '100%');
  sliderCont.append('span').text(d => (d.label.match(/Sigma|Tailweight/) ? d.get() : d.get()))
    .style('font-size','20px');
  
  // Add Reset button to clear all sliders to their defaults
  controlsContainer.append('button')
    .text('Reset')
    .style('font-size', '20px')
    .style('align-self', 'center')
    .style('margin-left', 'auto')
    .on('click', () => {
      // reset state vars
      selectedMu = 0.5;
      selectedSig = 1;
      selectedNonc = 0;
      selectedTailw = 1;
      // update input positions
      sliderCont.selectAll('input').property('value', d => d.get());
      // update displayed labels
      sliderCont.selectAll('span')
        .text(d => d.label.match(/Sigma|Tailweight/) ? (2**d.get()) : d.get());
      // redraw chart
      updateChart(filteredData());
    });

   // Build SVG
   const width = 1200;
   const height = 450;
   const margin = {top: 40, right: 20, bottom: 40, left: 40};
   const innerWidth = width - margin.left - margin.right;
   const innerHeight = height - margin.top - margin.bottom;

   // Set controls container width to match SVG plot width
   controlsContainer.style("max-width", "none").style("width", "100%");
   // Distribute each control evenly and make sliders full-width
   controlsContainer.selectAll("div").style("flex", "1").style("min-width", "0px");
   controlsContainer.selectAll("input").style("width", "100%").style("box-sizing", "border-box");
   
   // Create scales
   const x = d3.scaleLinear()
     .domain([0, 1])
     .range([0, innerWidth]);
     
   const y = d3.scaleLinear()
     .domain([0, 1])
     .range([innerHeight, 0]);
   
   // Create a color scale for the basis functions
   const color = d3.scaleOrdinal(["#9B26B0FF", "#3F51B4FF", "#02A9F3FF", "#009687FF", "#8BC34AFF", "#FFEB3AFF", "#FF9800FF", "#795447FF"]);
   
   // Create SVG
   const svg = d3.create("svg")
     .attr("width", "100%")
     .attr("height", "auto")
     .attr("viewBox", [0, 0, width, height])
     .attr("preserveAspectRatio", "xMidYMid meet")
     .attr("style", "max-width: 100%; height: auto;");
   
   // Create the chart group
   const g = svg.append("g")
     .attr("transform", `translate(${margin.left},${margin.top})`);
   
   // Add axes
   const xAxis = g.append("g")
     .attr("transform", `translate(0,${innerHeight})`)
     .attr("class", "x-axis")
     .call(d3.axisBottom(x).ticks(10))
     .style("font-size", "20px");
   
   const yAxis = g.append("g")
     .attr("class", "y-axis")
     .call(d3.axisLeft(y).ticks(5))
     .style("font-size", "20px");
   
   // Add a horizontal line at y = 0
   g.append("line")
     .attr("x1", 0)
     .attr("x2", innerWidth)
     .attr("y1", y(0))
     .attr("y2", y(0))
     .attr("stroke", "#000")
     .attr("stroke-opacity", 0.2);
   
   // Add gridlines
   g.append("g")
     .attr("class", "grid-lines")
     .selectAll("line")
     .data(y.ticks(5))
     .join("line")
     .attr("x1", 0)
     .attr("x2", innerWidth)
     .attr("y1", d => y(d))
     .attr("y2", d => y(d))
     .attr("stroke", "#ccc")
     .attr("stroke-opacity", 0.5);
   
   // Create a line generator
   const line = d3.line()
     .x(d => x(d.x))
     .y(d => y(d.y))
     .curve(d3.curveBasis);
   
   // Group to contain the basis function lines
   const linesGroup = g.append("g")
     .attr("class", "basis-functions");
   
   // Store the current basis functions for transition
   let currentBasisFunctions = new Map();
   
   // Function to update the chart with new data
   function updateChart(data) {
     updateChartInner(g, x, y, linesGroup, color, line, data);
   }
   
   // Store the update function
   svg.node().update = updateChart;
   
   // Initial render
   updateChart(filteredData());
   
   container.node().appendChild(svg.node());
   return container.node();
}

Non-central beta distribution Johnson et al. (1995):

\[\begin{equation*} \mathcal{B}(x, a, b, c) = \sum_{j=0}^{\infty} e^{-c/2} \frac{\left( \frac{c}{2} \right)^j}{j!} I_x \left( a + j , b \right) \end{equation*}\]

Penalty and \(\lambda\) need to be adjusted accordingly Li & Cao (2022)

Using non equidistant knots in profoc is straightforward:

mod <- online(
  y = Y,
  experts = experts,
  tau = 1:99 / 100,
  b_smooth_pr = list(
    knots = 9,
    mu = 0.3,
    sigma = 1,
    nonc = 0,
    tailweight = 1,
    deg = 3
  )
)

Basis specification b_smooth_pr is internally passed to make_basis_mats().

Profoc adjusts (and scales) penatly

Wrap-Up

Potential Downsides:

Pointwise optimization can induce quantile crossing
- Can be solved by sorting the predictions

Important:

The choice of the learning rate is crucial
The loss function has to meet certain criteria

Upsides:

Pointwise learning outperforms the Naive solution significantly
Online learning is much faster than batch methods
Smoothing further improves the predictive performance
Asymptotically not worse than the best convex combination

The profoc R Package:

Implements all algorithms discussed above
Is written using RcppArmadillo its fast
Accepts vectors for most parameters
- The best parameter combination is chosen online
Implements
- Forgetting, Fixed Share
- Different loss functions + gradients

Pubications:

Berrisch, J., & Ziel, F. (2023). CRPS learning. Journal of Econometrics, 237(2), 105221.

Berrisch, J., & Ziel, F. (2024). Multivariate probabilistic CRPS learning with an application to day-ahead electricity prices. International Journal of Forecasting, 40(4), 1568-1586.

Modeling Volatility and Dependence of European Carbon and Energy Prices

Berrisch, J., Pappert, S., Ziel, F., & Arsova, A. (2023). Finance Research Letters, 52, 103503.

Motivation

Understanding European Emission Allowances (EUA)

Portfolio & Risk Management,

Sustainability Planing

Political decisions

EUA prices are connected to energy markets

How can the dynamics be characterized?

Several Questions arise:

Data (Pre)processing
Modeling Approach
Evaluation

Data

Daily Observations: 03/15/2010 - 10/14/2022

EUA, Natural Gas, Brent Crude Oil, Coal

normalized w.r.t. \(\text{CO}_2\) emissions
Adjusted for inflation by Eurostat’s HICP, excluding energy

Emission-adjusted prices reflect one tonne of \(\text{CO}_2\)

Log transformation of the data to stabilize the variance

ADF Test: All series are stationary in first differences

Johansen’s likelihood ratio trace test suggests two cointegrating relationships (only in levels)

Data

Modeling Approach: Notation

Let \(\boldsymbol{X}_t\) be a \(K\)-dimensional vector at time \(t\)
The forecasting target:
- Conditional joint distribution
- \(F_{\boldsymbol{X}_t|\mathcal{F}_{t-1}}\)
- \(\mathcal{F}_{t}\) is the sigma field generated by all information available up to and including time \(t\)

Sklars theorem: decompose target into

marginal distributions: \(F_{X_{k,t}|\mathcal{F}_{t-1}}\) for \(k=1,\ldots, K\), and
copula function: \(C_{\boldsymbol{U}_{t}|\mathcal{F}_{t - 1}}\)

Let \(\boldsymbol{x}_t= (x_{1,t},\ldots, x_{K,t})^\intercal\) be the realized values

It holds that:

\[\begin{align} F_{\boldsymbol{X}_t|\mathcal{F}_{t-1}}(\boldsymbol{x}_t) = C_{\boldsymbol{U}_{t}|\mathcal{F}_{t - 1}}(\boldsymbol{u}_t) \nonumber \end{align}\]

with: \(\boldsymbol{u}_t =(u_{1,t},\ldots, u_{K,t})^\intercal\), \(u_{k,t} = F_{X_{k,t}|\mathcal{F}_{t-1}}(x_{k,t})\)

For brevity we drop the conditioning on \(\mathcal{F}_{t-1}\).

The model can be specified as follows

\[\begin{align} F(\boldsymbol{x}_t) = C \left[\mathbf{F}(\boldsymbol{x}_t; \boldsymbol{\mu}_t, \boldsymbol{ \sigma }_{t}^2, \boldsymbol{\nu}, \boldsymbol{\lambda}); \Xi_t, \Theta\right] \nonumber \end{align}\]

\(\Xi_{t}\) denotes time-varying dependence parameters \(\Theta\) denotes time-invariant dependence parameters

We take \(C\) as the \(t\)-copula

Modeling Approach: The General Framework

Individual marginal distributions:

\[\mathbf{F} = (F_1, \ldots, F_K)^{\intercal}\]

Generalized non-central t-distributions

Time varying:
- expectation \(\boldsymbol{\mu}_t = (\mu_{1,t}, \ldots, \mu_{K,t})^{\intercal}\)
- variance: \(\boldsymbol{\sigma}_{t}^2 = (\sigma_{1,t}^2, \ldots, \sigma_{K,t}^2)^{\intercal}\)
Time invariant
- degrees of freedom: \(\boldsymbol{\nu} = (\nu_1, \ldots, \nu_K)^{\intercal}\)
- noncentrality: \(\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_K)^{\intercal}\)

VECM Model

\[\begin{align} \Delta \boldsymbol{\mu}_t = \Pi \boldsymbol{x}_{t-1} + \Gamma \Delta \boldsymbol{x}_{t-1} \nonumber \end{align}\]

where \(\Pi = \alpha \beta^{\intercal}\) is the cointegrating matrix of rank \(r\), \(0 \leq r\leq K\).

GARCH model

\[\begin{align} \sigma_{i,t}^2 = & \omega_i + \alpha^+_{i} (\epsilon_{i,t-1}^+)^2 + \alpha^-_{i} (\epsilon_{i,t-1}^-)^2 + \beta_i \sigma_{i,t-1}^2 \nonumber \end{align}\]

where \(\epsilon_{i,t-1}^+ = \max\{\epsilon_{i,t-1}, 0\}\) …

Positive vs. negative innovations (capture leverage effects).

Time-varying dependence parameters

\[\begin{align*} \Xi_{t} = & \Lambda\left(\boldsymbol{\xi}_{t}\right) \\ \xi_{ij,t} = & \eta_{0,ij} + \eta_{1,ij} \xi_{ij,t-1} + \eta_{2,ij} z_{i,t-1} z_{j,t-1}, \end{align*}\]

\(z_{i,t}\) is the \(i\)-th standardized residual from time series \(i\)

\(\Lambda(\cdot)\) is a link function:

ensures that \(\Xi_{t}\) is a valid variance covariance matrix
ensures that \(\Xi_{t}\) does not exceed its support space and remains semi-positive definite

Study Design and Evaluation

Rolling-window forecasting study

3257 observations total
Window size: 1000 days (~ four years)
We sample 250 of 2227 starting points
We draw \(2^{12}= 2048\) trajectories 30 steps ahead

Estimation

Joint (quasi) maximum likelihood estimation:

\[\begin{align*} f_{\mathbf{X}_t}(\mathbf{x}_t | \mathcal{F}_{t-1}) = c\left[\mathbf{F}(\mathbf{x}_t;\boldsymbol{\mu}_t, \boldsymbol{\sigma}_{t}^2, \boldsymbol{\nu}, \boldsymbol{\lambda});\Xi_t, \Theta\right] \cdot \\ \prod_{i=1}^K f_{X_{i,t}}(\mathbf{x}_t;\boldsymbol{\mu}_t, \boldsymbol{\sigma}_{t}^2, \boldsymbol{\nu}, \boldsymbol{\lambda}) \end{align*}\]

The copula density \(c\) can be derived analytically.

Evaluation

Our main objective is the Energy Score (ES)

\[\begin{align*} \text{ES}_t(F, \mathbf{x}_t) = \mathbb{E}_{F} \left(||\tilde{\mathbf{X}}_t - \mathbf{x}_t||_2\right) - \\ \frac{1}{2} \mathbb{E}_F \left(||\tilde{\mathbf{X}}_t - \tilde{\mathbf{X}}_t'||_2 \right) \end{align*}\]

where \(\mathbf{x}_t\) is the observed \(K\)-dimensional realization and \(\tilde{\mathbf{X}}_t\), respectively \(\tilde{\mathbf{X}}_t'\) are independent random vectors distributed according to \(F\)

For univariate cases the Energy Score becomes the Continuous Ranked Probability Score (CRPS)

Results

Energy Scores
CRPS Scores
RMSE
Evolution of Linear Dependence \(\Xi\)
Predictive Quantiles

Relative improvement in ES compared to \(\text{RW}^{\sigma, \rho}\)

Cellcolor: w.r.t. test statistic of Diebold-Mariano test (wether the model outperformes the benchmark, greener = better).

Model	\(\text{ES}^{\text{All}}_{1-30}\)	\(\text{ES}^{\text{EUA}}_{1-30}\)	\(\text{ES}^{\text{Oil}}_{1-30}\)	\(\text{ES}^{\text{NGas}}_{1-30}\)	\(\text{ES}^{\text{Coal}}_{1-30}\)	\(\text{ES}^{\text{All}}_{1}\)	\(\text{ES}^{\text{All}}_{5}\)	\(\text{ES}^{\text{All}}_{30}\)
\(\textrm{RW}^{\sigma, \rho}_{}\)	161.96	10.06	37.94	146.73	13.22	5.56	13.28	34.29
\(\textrm{RW}^{\sigma_t, \rho_t}_{}\)	9.40	3.75	-0.41	11.39	4.13	10.34	9.10	7.59
\(\textrm{RW}^{\sigma, \rho_t}_{\textrm{ncp}, \textrm{log}}\)	12.04	6.16	-0.56	14.33	7.35	9.22	9.82	10.02
\(\textrm{RW}^{\sigma, \rho}_{\textrm{log}}\)	12.10	6.25	-0.59	14.44	7.31	9.04	9.66	9.91
\(\textrm{VECM}^{\textrm{r0}, \sigma_t, \rho_t}_{\textrm{lev}, \textrm{ncp}}\)	9.68	-0.72	0.32	11.74	3.70	10.82	10.50	8.21
\(\textrm{VECM}^{\textrm{r0}, \sigma, \rho_t}_{\textrm{log}}\)	12.15	6.10	-0.70	14.57	7.80	8.05	9.99	10.04
\(\textrm{ETS}^{\sigma}\)	9.94	5.75	0.08	13.05	7.83	6.96	7.74	6.21
\(\textrm{ETS}^{\sigma}_{\textrm{log}}\)	8.12	7.80	-0.51	11.17	8.54	5.05	6.14	2.66
\(\textrm{VES}^{\sigma}\)	5.50	-4.43	-3.22	6.29	4.68	-25.99	-2.42	3.07
\(\textrm{VES}^{\sigma}_{\textrm{log}}\)	7.68	3.31	-4.34	9.07	8.30	-22.11	1.07	4.32

Coloring w.r.t. test statistic: <-5 -4 -3 -2 -1 0 1 2 3 4 >5

Benchmarks:
- \(\text{RW}^{\sigma, \rho}\): Random walk with constant volatility and correlation
- Univariate \(\text{ETS}^{\sigma}\) with constant volatility
- Vector ETS \(VES^{\sigma}\) with constant volatility
Heteroscedasticity is a main driver of ES
The VECM model without cointegration (VAR) is the best performing model in terms of ES overall
For EUA, the ETS Benchmark is the best performing model in terms of ES

CRPS solely evaluates the marginal distributions
- The cross-sectional dependence is ignored
VES models deliver poor performance in short horizons
For Oil prices the RW Benchmark can’t be outperformed 30 steps ahead
Both VECM models generally deliver good performance

Relative improvement in CRPS compared to \(\text{RW}^{\sigma, \rho}\)

	EUA			Oil			NGas			Coal
Model	H1	H5	H30	H1	H5	H30	H1	H5	H30	H1	H5	H30
\(\textrm{RW}^{\sigma, \rho}_{}\)	0.4	0.9	2.1	1.5	3.4	9.1	4.7	11.6	29.8	0.3	0.9	2.8
\(\textrm{RW}^{\sigma_t, \rho_t}_{}\)	5.6	6.0	2.8	2.1	2.7	-0.8	12.6	10.5	9.6	10.7	6.5	2.1
\(\textrm{RW}^{\sigma, \rho_t}_{\textrm{ncp}, \textrm{log}}\)	5.1	8.7	5.0	0.7	0.8	-0.4	11.4	11.5	12.4	8.0	7.3	6.7
\(\textrm{RW}^{\sigma, \rho}_{\textrm{log}}\)	4.7	8.9	5.2	0.0	0.3	-0.6	11.2	11.4	12.4	7.7	7.5	6.6
\(\textrm{VECM}^{\textrm{r0}, \sigma_t, \rho_t}_{\textrm{lev}, \textrm{ncp}}\)	3.6	0.6	-1.6	2.7	3.0	0.0	13.1	12.2	10.4	11.8	7.2	1.5
\(\textrm{VECM}^{\textrm{r0}, \sigma, \rho_t}_{\textrm{log}}\)	4.2	8.9	5.1	0.2	0.4	-0.8	9.9	11.8	12.7	7.8	7.9	7.3
\(\textrm{ETS}^{\sigma}\)	0.2	6.8	5.7	1.1	0.9	-0.2	10.9	11.3	10.9	7.5	6.7	5.6
\(\textrm{ETS}^{\sigma}_{\textrm{log}}\)	1.0	8.6	8.0	0.1	0.7	-0.6	8.9	9.4	7.1	7.3	7.8	6.7
\(\textrm{VES}^{\sigma}\)	-38.5	-6.4	-5.4	-33.3	-6.1	-2.4	-26.6	-2.6	3.6	-37.5	-5.5	4.7
\(\textrm{VES}^{\sigma}_{\textrm{log}}\)	-32.4	2.8	1.8	-30.4	-6.2	-3.2	-22.0	1.8	5.4	-27.0	2.3	6.4

Coloring w.r.t. test statistic: <-5 -4 -3 -2 -1 0 1 2 3 4 >5

RMSE measures the performance of the forecasts at their mean

Some models beat the benchmarks at short horizons

Conclusion: the Improvements seen before must be attributed to other parts of the multivariate predictive distribution

Relative improvement in RMSE compared to \(\text{RW}^{\sigma, \rho}\)

	EUA			Oil			NGas			Coal
Model	H1	H5	H30	H1	H5	H30	H1	H5	H30	H1	H5	H30
\(\textrm{RW}^{\sigma, \rho}_{}\)	0.9	2.0	5.0	2.9	6.4	16.7	17.8	42.8	85.4	0.9	2.9	7.0
\(\textrm{RW}^{\sigma_t, \rho_t}_{}\)	-0.1	-0.1	0.7	0.0	-0.3	-0.1	-0.2	0.3	1.3	-0.2	0.0	-1.8
\(\textrm{RW}^{\sigma, \rho_t}_{\textrm{ncp}, \textrm{log}}\)	-270.5	-154.1	-139.9	0.5	-0.5	-2.9	-0.8	0.7	-1.6	0.3	-31.2	-24.5
\(\textrm{RW}^{\sigma, \rho}_{\textrm{log}}\)	-705.0	-265.4	-125.2	0.6	0.2	-0.2	-0.4	0.1	-1.6	-0.9	-0.3	-8.3
\(\textrm{VECM}^{\textrm{r0}, \sigma_t, \rho_t}_{\textrm{lev}, \textrm{ncp}}\)	-0.9	0.2	0.5	0.5	0.2	0.0	-0.4	0.7	0.2	1.4	0.1	0.2
\(\textrm{VECM}^{\textrm{r0}, \sigma, \rho_t}_{\textrm{log}}\)	-271.5	-191.3	-114.3	1.7	-12.3	-3.6	-0.6	1.6	-4.1	0.0	-0.8	-6.7
\(\textrm{ETS}^{\sigma}\)	-0.3	0.3	1.6	0.7	0.1	-0.1	0.1	-0.1	0.2	-2.4	-3.9	2.5
\(\textrm{ETS}^{\sigma}_{\textrm{log}}\)	-1.0	0.4	1.6	0.9	0.0	-0.1	-1.9	-1.9	-13.9	-0.3	-3.6	-1.8
\(\textrm{VES}^{\sigma}\)	-37.4	-8.9	-6.0	-27.9	-7.4	-2.8	-27.2	-9.5	-2.4	-41.7	-1.2	1.6
\(\textrm{VES}^{\sigma}_{\textrm{log}}\)	-37.6	-9.2	-7.8	-26.8	-7.3	-3.0	-27.0	-6.8	-3.5	-41.2	-2.2	-0.3

Coloring w.r.t. test statistic: <-5 -4 -3 -2 -1 0 1 2 3 4 >5

Final Remarks

Contributions

Theoretical

Probabilistic Online Learning:

Aggregation
Regression

Practical

Applications

Energy Commodities
Electricity Prices
Electricity Load

Well received by the academic community:

of papers already published

108 citations since 2020 (Google Scholar)

Software

R Packages:

profoc, rcpptimer, dccpp

Python Packages:

ondil, sstudentt

Contributions to other projects:

RcppArmadillo
gamlss
NixOS/nixpkgs
OpenPrinting/foomatic-db

Awards:

Berrisch, J., Narajewski, M., & Ziel, F. (2023):

Won Western Power Distribution Competition
Won Best-Student-Presentation Award

Questions!

Artwork by @allison_horst

References

Berrisch, J. (2025). Rcpptimer: Rcpp tic-toc timer with OpenMP support. arXiv Preprint arXiv:2501.15856, abs/2501.15856. DOI: 10.48550/arXiv.2501.15856

Berrisch, J., Narajewski, M., & Ziel, F. (2023). High-resolution peak demand estimation using generalized additive models and deep neural networks. Energy and AI, 13, 100236. DOI: 10.1016/j.egyai.2023.100236

Berrisch, J., Pappert, S., Ziel, F., & Arsova, A. (2023). Modeling volatility and dependence of european carbon and energy prices. Finance Research Letters, 52, 103503. DOI: 10.1016/j.frl.2022.103503

Berrisch, J., & Ziel, F. (2022). Distributional modeling and forecasting of natural gas prices. Journal of Forecasting, 41(6), 1065–1086. DOI: 10.1002/for.2853

Berrisch, J., & Ziel, F. (2023). CRPS learning. Journal of Econometrics, 237(2, Part C), 105221. DOI: 10.1016/j.jeconom.2021.11.008

Berrisch, J., & Ziel, F. (2024). Multivariate probabilistic CRPS learning with an application to day-ahead electricity prices. International Journal of Forecasting, 40(4), 1568–1586. DOI: 10.1016/j.ijforecast.2024.01.005

Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games (pp. I–XII, 1–394). Cambridge university press. DOI: 10.1017/CBO9780511546921

Gaillard, P., & Wintenberger, O. (2018). Efficient online algorithms for fast-rate regret bounds under sparsity. In S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Proceedings of the 32nd international conference on neural information processing systems (pp. 7026–7036). Cornell University. DOI: 10.48550/arxiv.1805.09174

Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494), 746–762. DOI: 10.1198/jasa.2011.r10138

Gneiting, T., & Raftery, A.E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359–378. DOI: 10.1198/016214506000001437

Hirsch, S., Berrisch, J., & Ziel, F. (2024). Online distributional regression. arXiv Preprint arXiv:2407.08750, abs/2407.08750. DOI: 10.48550/arXiv.2407.08750

Johnson, N.L., Kotz, S., & Balakrishnan, N. (1995). Continuous univariate distributions, volume 2 (Vol. 289). John wiley & sons.

Li, Z., & Cao, J. (2022). General p-splines for non-uniform b-splines. arXiv Preprint. DOI: 10.48550/arXiv.2201.06808

Marcjasz, G., Narajewski, M., Weron, R., & Ziel, F. (2023). Distributional neural networks for electricity price forecasting. Energy Economics, 125, 106843. DOI: 10.1016/j.eneco.2023.106843

Taylor, J.W., & Meng, X. (2023). Angular combining of forecasts of probability distributions. arXiv Preprint arXiv:2305.16735. DOI: 10.48550/arXiv.2305.16735

Wintenberger, O. (2017). Optimal learning with bernstein online aggregation. Machine Learning, 106(1), 119–141. DOI: 10.1007/s10994-016-5592-6