Distributed ARIMA models for Ultra-long Time Series

Xiaoqian Wang, Beihang University
Joint work with Yanfei Kang, Rob J Hyndman and Feng Li

2020-10-26

1 / 26

Background

Ultra-long time series are becoming increasingly common.
- hourly electricity demands
- daily maximum temperatures
- streaming data generated in real-time
Forecasting these time series is challenging.
- time-consuming training process
- hardware requirements
- unrealistic assumption that the DGP remains invariant over a long time interval
Several attempts have been made in the literature.
- discard the earliest observations
- allow the model itself to evolve over time
- apply a model-free prediction
- develop methods using Spark's MLlib library

2 / 26

P1: Ultra-long time series are becoming increasingly common. Examples include hourly electricity demands spanning several years, daily maximum temperatures recorded for hundreds of years, and streaming data generated in real time.

P2: It is challenging to deal with such long time series. We identify three significant challenges:

time-consuming training process, especially parameter optimization
hardware requirements
unrealistic assumption that the DGP remains invariant over a long time interval

P3: Forecasters have made some attempts to address these limitations.

(1) A straightforward approach is to discard the earliest observations. However, it only works well for forecasting a few future values.
(2) Allow the model itself to evolve over time, as with ARIMA and ETS.
(3) Apply a model-free prediction, assuming that the series changes slowly and smoothly with time. Approaches (2) and (3) require considerable computational time in model fitting and parameter optimization, making them less practically useful in modern enterprises.
(4) Develop methods using Spark's MLlib library. However, the platform does not support multi-step prediction, so the multi-step time series prediction problem has to be converted into H sub-problems, where H is the forecast horizon.
We argue that there is a preferable way to resolve these challenges.

Electricity load data

The Global Energy Forecasting Competition 2017 (GEFCom2017)
Hourly electricity load data of zones spanning New England
- $8$ bottom zones & $2$ aggregated zones
- without considering the hierarchical configuration
Ranging from March 1, 2003 to April 30, 2017 ( $124{,}171$ time points)
- training periods
  March 1, 2003 - December 31, 2016
- test periods
  January 1, 2017 - April 30, 2017 ( $h = 2{,}879$ )

3 / 26

We use electricity load data from the Global Energy Forecasting Competition 2017 as an example.
The dataset contains 10 hourly electricity demand series for New England. There are 8 bottom zones and 2 aggregated zones, but we do not consider the hierarchical configuration here.
Each series spans 14 years. We aim to forecast observations of the next 4 months.

Electricity load data

The electricity demand example for NEMASSBOST zone from GEFCom2017 dataset.

4 / 26

Here is an example series. The top panel shows the whole series, the bottom panel shows a clip of half a month.

The yearly and hourly seasonal patterns can be observed.
The DGP of the series has changed over the 14 years, although the change is not obvious due to the large electricity demand values.

Distributed forecasting

Zhu, Li & Wang (2019) tackle regression problems on distributed systems by developing a distributed least squares approximation (DLSA) method.
Local estimators are computed by worker nodes in a parallel manner.
Local estimators are delivered to the master node to approximate global estimators by taking a weighted average.

5 / 26

In this work, we want to find a better way to resolve all challenges associated with forecasting ultra-long time series. (...) Inspired by this, we aim to extend the DLSA method to solve the time series modeling problem.

Distributed forecasting

Parameter estimation problem

Consider an ultra-long time series $\{y_{t}; t = 1, 2, \dots, T\}$ and define $S = \{1, 2, \dots, T\}$ to be the timestamp sequence.
The parameter estimation problem can be formulated as $f (θ, Σ | y_{t}, t \in S) .$
Suppose the whole time series is split into $K$ subseries with contiguous time intervals, that is $S = \cup_{k = 1}^{K} S_{k}$ .
The parameter estimation problem is transformed into $K$ sub-problems and one combination problem as follows: $f (θ, Σ | y_{t}, t \in S) = g (f_{1} (θ_{1}, Σ_{1} | y_{t}, t \in S_{1}), \dots, f_{K} (θ_{K}, Σ_{K} | y_{t}, t \in S_{K})) .$

6 / 26

We formulate the parameter estimation problem as shown in this formula.
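
The split into $K$ contiguous subseries and the $K$ local sub-problems can be sketched as follows. This is a minimal illustration, not the Spark implementation: the function names `split_contiguous` and `estimate_locally` are hypothetical, and the sample mean stands in for a real ARIMA fit on each block.

```python
import numpy as np

def split_contiguous(y, num_subseries):
    """Split a series into K contiguous, non-overlapping subseries S_1..S_K."""
    y = np.asarray(y, dtype=float)
    if not 1 <= num_subseries <= len(y):
        raise ValueError("num_subseries must be between 1 and len(y)")
    # np.array_split keeps the time order; block lengths differ by at most one.
    return np.array_split(y, num_subseries)

def estimate_locally(subseries_list, fit):
    """Apply a local fitting routine f_k to each subseries (the K sub-problems)."""
    return [fit(block) for block in subseries_list]

# Toy usage: the "model" is just the sample mean (an assumption for illustration).
blocks = split_contiguous(np.arange(1, 11), 3)
local_estimates = estimate_locally(blocks, fit=np.mean)
```

On a cluster, the list comprehension in `estimate_locally` would be the part that runs in parallel on the worker nodes; the combination function $g$ is then applied on the master node.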

Distributed forecasting

Illustration of forecasting problem and time series split.

7 / 26

Distributed forecasting

The proposed framework for time series forecasting on a distributed system.

8 / 26

Step 1: Preprocessing.
Step 2: Modeling.
Step 3: Linear transformation.
Step 4: Estimator combination.
Step 5: Forecasting.

Why focus on ARIMA models

Advantages

ARIMA models are among the most widely used statistical time series approaches.
ARIMA models can handle non-stationary series via differencing, as well as seasonal patterns.
ARIMA models frequently serve as benchmark methods for model combination because of their excellent performance in forecasting.
ARIMA models can be converted into AR representations (linear form).
Hyndman & Khandakar (2008) developed an automatic ARIMA forecasting algorithm that makes the order selection process easy to implement.

Limitations

The likelihood function of such a time series model can hardly scale up.

9 / 26

We suggest that our approach is general and can be applied to other types of forecasting models. In this work, however, we focus on ARIMA models because of their good properties.

The linear form makes it easy to combine the estimators.

Limitations:
Because of the time dependence, the likelihood function of such a time series model can hardly scale up, making it infeasible for massive time series forecasting.

Automatic ARIMA modeling

The procedure of an automatic ARIMA forecasting framework, taking the auto.arima() algorithm as an example.

10 / 26

Automatic ARIMA modeling mainly consists of 3 steps... The order selection and model refit steps are time-consuming for ultra-long time series: the time spent forecasting one electricity demand series ranges from 20 minutes to 2 hours. It is therefore necessary to develop a new approach that makes ARIMA models work well for ultra-long series.

AR representation

A seasonal ARIMA model is generally defined as $\begin{aligned} (1 - \sum_{i = 1}^{p} ϕ_{i} B^{i}) (1 - \sum_{i = 1}^{P} Φ_{i} B^{i m}) (1 - B)^{d} {(1 - B^{m})}^{D} (y_{t} - μ_{0} - μ_{1} t) \\ = (1 + \sum_{i = 1}^{q} θ_{i} B^{i}) (1 + \sum_{i = 1}^{Q} Θ_{i} B^{i m}) ε_{t} . \end{aligned}$
Let the term $y_{t} - μ_{0} - μ_{1} t$ be denoted as $x_{t}$ .
By utilizing the polynomial multiplication, the seasonal ARIMA model is converted into a non-seasonal ARMA( $u, v$ ) model (possibly non-stationary) $\begin{aligned} (1 - \sum_{i = 1}^{u} ϕ_{i}^{'} B^{i}) x_{t} = (1 + \sum_{i = 1}^{v} θ_{i}^{'} B^{i}) ε_{t} . \end{aligned}$

11 / 26

AR representation

The AR representation of the ARMA( $u, v$ ) can be obtained by long division of AR and MA polynomials.
Given two polynomials $ϕ^{'} (B) = (1 - \sum_{i = 1}^{u} ϕ_{i}^{'} B^{i})$ and $θ^{'} (B) = (1 + \sum_{i = 1}^{v} θ_{i}^{'} B^{i})$ , we have $π (B) x_{t} = \frac{ϕ^{'} (B)}{θ^{'} (B)} x_{t} = ε_{t},$ where $π (B) = (1 - \sum_{i = 1}^{\infty} π_{i} B^{i})$ .
The linear representation of the original seasonal ARIMA model can be given by $y_{t} = β_{0} + β_{1} t + \sum_{i = 1}^{\infty} π_{i} y_{t - i} + ε_{t},$ where $β_{0} = μ_{0} (1 - \sum_{i = 1}^{\infty} π_{i}) + μ_{1} \sum_{i = 1}^{\infty} i π_{i}$ and $β_{1} = μ_{1} (1 - \sum_{i = 1}^{\infty} π_{i})$ .

12 / 26

For a general seasonal ARIMA model, by using multiplication and long division of polynomials, we can obtain the final converted linear representation in this form. In this way, all ARIMA models fitted to the subseries can be converted into this linear form.
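
The long division $π (B) = ϕ^{'} (B) / θ^{'} (B)$ can be carried out numerically by matching coefficients of $B^{i}$ in $π (B) θ^{'} (B) = ϕ^{'} (B)$. The sketch below assumes the ARMA( $u, v$ ) coefficients are already available and truncates the infinite AR representation at `n_terms`; the function names `arma_to_ar` and `trend_coefficients` are hypothetical.

```python
import numpy as np

def arma_to_ar(phi, theta, n_terms):
    """Return pi_1..pi_n with pi(B) = phi'(B) / theta'(B) = 1 - sum_i pi_i B^i."""
    phi = np.asarray(phi, dtype=float)
    theta = np.asarray(theta, dtype=float)
    # a[i] is the coefficient of B^i in pi(B): a[0] = 1 and a[i] = -pi_i.
    a = np.zeros(n_terms + 1)
    a[0] = 1.0
    for i in range(1, n_terms + 1):
        # Coefficient of B^i in phi'(B) = 1 - sum_i phi_i B^i.
        c_i = -phi[i - 1] if i <= len(phi) else 0.0
        # Match coefficients of B^i in pi(B) * theta'(B) = phi'(B).
        acc = sum(theta[j - 1] * a[i - j] for j in range(1, min(i, len(theta)) + 1))
        a[i] = c_i - acc
    return -a[1:]

def trend_coefficients(mu0, mu1, pi):
    """beta_0 and beta_1 of the linear form, computed from truncated pi weights."""
    pi = np.asarray(pi, dtype=float)
    lags = np.arange(1, len(pi) + 1)
    beta0 = mu0 * (1 - pi.sum()) + mu1 * np.sum(lags * pi)
    beta1 = mu1 * (1 - pi.sum())
    return beta0, beta1

# ARMA(1,1) with phi = 0.5 and theta = 0.4: pi_i = (phi + theta) * (-theta)^(i-1).
pi = arma_to_ar(phi=[0.5], theta=[0.4], n_terms=5)
```

For a pure AR model (empty `theta`) the routine returns the AR coefficients padded with zeros, which is a quick sanity check on the recursion.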

Estimators combination

Some excellent statistical properties of the global estimator obtained by DLSA have been proved (Zhu, Li & Wang, 2019).
We extend the DLSA method to solve the time series modeling problem.
Let $L (θ; y_{t})$ be a twice-differentiable loss function. Then $\begin{aligned} L (θ) & = \frac{1}{T} \sum_{k = 1}^{K} \sum_{t \in S_{k}} L (θ; y_{t}) \\ & = \frac{1}{T} \sum_{k = 1}^{K} \sum_{t \in S_{k}} \{ L (θ; y_{t}) - L ({\hat{θ}}_{k}; y_{t}) \} + c_{1} \\ & \approx \frac{1}{T} \sum_{k = 1}^{K} \sum_{t \in S_{k}} {(θ - {\hat{θ}}_{k})}^{⊤} \ddot{L} ({\hat{θ}}_{k}; y_{t}) (θ - {\hat{θ}}_{k}) + c_{2} \\ & \approx \sum_{k = 1}^{K} {(θ - {\hat{θ}}_{k})}^{⊤} (\frac{T_{k}}{T} {\hat{Σ}}_{k}^{- 1}) (θ - {\hat{θ}}_{k}) + c_{2} . \end{aligned}$

13 / 26

The next stage entails solving the problem of combining the local estimators.
The approximation uses Taylor's theorem and the relationship between the Hessian and the covariance matrix for Gaussian random variables.
This leads to a weighted least squares form.

Estimators combination

The global estimator calculated by minimizing the weighted least squares takes the following form $\begin{aligned} \tilde{θ} & = {(\sum_{k = 1}^{K} \frac{T_{k}}{T} {\hat{Σ}}_{k}^{- 1})}^{- 1} (\sum_{k = 1}^{K} \frac{T_{k}}{T} {\hat{Σ}}_{k}^{- 1} {\hat{θ}}_{k}), \\ \tilde{Σ} & = {(\sum_{k = 1}^{K} \frac{T_{k}}{T} {\hat{Σ}}_{k}^{- 1})}^{- 1} . \end{aligned}$
${\hat{Σ}}_{k}$ is not known and has to be estimated.
We approximate a GLS estimator by an OLS estimator (e.g., Hyndman et al., 2011) while still obtaining consistency.
We consider approximating ${\hat{Σ}}_{k}$ by ${\hat{σ}}_{k}^{2} I$ for each subseries.

14 / 26

The global estimator and its covariance matrix can be obtained. The covariance matrix of each subseries is not known, so we approximate it by ${\hat{σ}}_{k}^{2} I$ .
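
The weighted least squares combination above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: the function name `combine_estimators` is hypothetical, the local estimates are made-up toy numbers, and each ${\hat{Σ}}_{k}$ is supplied as ${\hat{σ}}_{k}^{2} I$ as on the slide.

```python
import numpy as np

def combine_estimators(theta_hats, sigma_hats, lengths):
    """Combine local estimators with weights (T_k / T) * Sigma_k^{-1}."""
    theta_hats = [np.asarray(t, dtype=float) for t in theta_hats]
    total = float(sum(lengths))
    dim = theta_hats[0].shape[0]
    weight_sum = np.zeros((dim, dim))
    weighted_theta = np.zeros(dim)
    for theta_k, sigma_k, t_k in zip(theta_hats, sigma_hats, lengths):
        w_k = (t_k / total) * np.linalg.inv(np.asarray(sigma_k, dtype=float))
        weight_sum += w_k
        weighted_theta += w_k @ theta_k
    sigma_tilde = np.linalg.inv(weight_sum)    # global covariance
    theta_tilde = sigma_tilde @ weighted_theta # global estimator
    return theta_tilde, sigma_tilde

# Two toy subseries of equal length, with Sigma_k approximated by sigma_k^2 * I.
theta_tilde, sigma_tilde = combine_estimators(
    theta_hats=[[1.0, 2.0], [3.0, 4.0]],
    sigma_hats=[1.0 * np.eye(2), 3.0 * np.eye(2)],
    lengths=[100, 100],
)
```

In this toy case the first subseries has the smaller variance, so the combined estimate is pulled toward its local estimate rather than sitting at the plain average.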

Point forecasts

The $h$ -step-ahead forecast can be calculated as ${\hat{y}}_{T + h | T} = {\tilde{β}}_{0} + {\tilde{β}}_{1} (T + h) + \begin{cases} \sum_{i = 1}^{p} {\tilde{π}}_{i} y_{T + 1 - i}, & h = 1, \\ \sum_{i = 1}^{h - 1} {\tilde{π}}_{i} {\hat{y}}_{T + h - i | T} + \sum_{i = h}^{p} {\tilde{π}}_{i} y_{T + h - i}, & 1 < h \leq p, \\ \sum_{i = 1}^{p} {\tilde{π}}_{i} {\hat{y}}_{T + h - i | T}, & h > p . \end{cases}$

15 / 26

Then, the point forecasts and prediction intervals can be obtained.
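
The recursive point forecast can be sketched as a loop that feeds earlier forecasts back in as lagged values once the observed history runs out. The function name `ar_point_forecasts` is hypothetical, and the inputs are assumed to be the combined coefficients ${\tilde{β}}_{0}$, ${\tilde{β}}_{1}$ and ${\tilde{π}}_{1}, \dots, {\tilde{π}}_{p}$, with observations indexed $t = 1, \dots, T$.

```python
import numpy as np

def ar_point_forecasts(y, beta0, beta1, pi, horizon):
    """Recursive h-step-ahead forecasts from a linear AR(p) form with trend."""
    y = np.asarray(y, dtype=float)
    pi = np.asarray(pi, dtype=float)
    T, p = len(y), len(pi)
    if p == 0 or T < p:
        raise ValueError("need p >= 1 and at least p observations")
    # path holds the observations followed by the forecasts made so far.
    path = list(y)
    for h in range(1, horizon + 1):
        # Most recent p values, newest first: observed where available, else forecast.
        recent = np.array(path[-p:][::-1])
        path.append(beta0 + beta1 * (T + h) + float(pi @ recent))
    return np.array(path[T:])

# Toy usage with made-up coefficients and no trend.
forecasts = ar_point_forecasts([1.0, 2.0, 3.0], 0.0, 0.0, [0.5, 0.25], horizon=2)
```

Keeping observations and forecasts in one list means the three cases of the formula do not need separate branches: the slice simply picks up whichever values are available at each lag.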

Prediction intervals

In the global model, the squared standard error of the $h$ -step-ahead forecast is formally expressed as ${\tilde{σ}}_{h}^{2} = \begin{cases} {\tilde{σ}}^{2}, & h = 1, \\ {\tilde{σ}}^{2} (1 + \sum_{i = 1}^{h - 1} {\tilde{θ}}_{i}^{2}), & h > 1, \end{cases}$ where ${\tilde{σ}}^{2} = \operatorname{tr} (\tilde{Σ}) / p$ .
The central $(1 - α) \times 100 \%$ prediction interval of the $h$ -step-ahead forecast can be defined as ${\hat{y}}_{T + h | T} \pm Φ^{- 1} (1 - α / 2) {\tilde{σ}}_{h} .$
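
A minimal sketch of the interval computation, assuming the point forecasts, ${\tilde{σ}}^{2}$ and the coefficients ${\tilde{θ}}_{i}$ from the variance formula are already available (the slide does not spell out how the ${\tilde{θ}}_{i}$ are obtained, so they are taken as given here). The function name `prediction_intervals` is hypothetical.

```python
from statistics import NormalDist

import numpy as np

def prediction_intervals(point_forecasts, sigma2, theta, alpha=0.05):
    """Central (1 - alpha) * 100% intervals for horizons h = 1..H."""
    f = np.asarray(point_forecasts, dtype=float)
    theta = np.asarray(theta, dtype=float)
    horizon = len(f)
    if len(theta) < horizon - 1:
        raise ValueError("need at least H - 1 theta coefficients")
    # sigma_h^2 = sigma^2 * (1 + sum_{i=1}^{h-1} theta_i^2); the sum is empty for h = 1.
    cum = np.concatenate(([0.0], np.cumsum(theta[: horizon - 1] ** 2)))
    sigma_h = np.sqrt(sigma2 * (1.0 + cum))
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # standard normal quantile
    return f - z * sigma_h, f + z * sigma_h

# Toy usage with made-up numbers.
lower, upper = prediction_intervals([10.0, 10.0], sigma2=4.0, theta=[0.5])
```

The interval width grows with the horizon because each additional step adds a non-negative term to the cumulative sum.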

16 / 26

Experimental setup

Number of subseries: $150$
- each subseries has a length of about $800$
- the lengths of the hourly time series in M4 range from $700$ to $900$
- the time consumed by automatic ARIMA modeling process is within 5 minutes
AR order: $2000$
The experiments are carried out on a Spark-on-YARN cluster
- one master node and two worker nodes
- Each node contains 32 logical cores, 64 GB RAM and two 80GB SSD local hard drives

17 / 26

Distributed forecasting results

Benchmarking the performance of DARIMA against ARIMA model and its AR representation.

Setting the AR order to $2000$ is reasonable.
DARIMA always outperforms the benchmark method for both point forecasts and prediction intervals.

18 / 26

Distributed forecasting results

Benchmarking the performance of DARIMA against ARIMA for various forecast horizons.

If long-term observations are considered, DARIMA is preferable, especially for interval forecasting.
The achieved performance improvements of DARIMA become more pronounced as the forecast horizon increases.

19 / 26

Distributed forecasting results

20 / 26

Distributed forecasting results

Our approach has captured the decreasing yearly seasonal trend.
Both DARIMA and ARIMA have captured the hourly seasonality, while DARIMA results in forecasts closer to the true future values than ARIMA.

21 / 26

Distributed forecasting results

MSIS results across different confidence levels

Execution time

22 / 26

Sensitivity analysis

Number of split subseries

Relationship between the forecasting performance of DARIMA models and the number of subseries.

The number of subseries should be controlled within a reasonable range, with too few or too many subseries causing poor forecasting performance.

23 / 26

In particular, the relationship between the number of subseries and the MSIS values appears as a concave curve.

Sensitivity analysis

Maximum values of model orders

forecasting performance
computational efficiency
broader range of candidate models

24 / 26

Summary

A distributed time series forecasting framework using the industry-standard MapReduce framework.
The local estimators trained on each subseries are combined using weighted least squares to minimize a global loss function.
Our framework
- works better than competing methods for long-term forecasting.
- achieves improved computational efficiency in optimizing the model parameters.
- allows the DGP of each subseries to vary.
- can be viewed as a model combination approach.

25 / 26

Thanks!

Spark implementation: https://github.com/xqnwang/darima
Website: https://xqnwang.rbind.io

26 / 26
