ML-based Time Series Modelling with MARX
If you came here looking for applied Marxism, you will be disappointed. MARX here refers to Moving Average Rotation of X as introduced in Goulet Coulombe, Leroux, Surprenant, and Stevanovic (2021).
Key supervised ML algorithms can be seen (very generally) as regularized nonlinear nonparametric estimation of the conditional mean. All three terms are important, but the first is the object of interest for today: regularization, or what Bayesians would call priors. If you think it only matters in small-data applications, think again. Convolutional neural nets work great for computer vision because they nest several notions of what makes sense and what does not when telling a dog apart from a cat (like translation invariance). And yet the algorithm is trained on huge amounts of data. Similar things can be said of machine translation. So prior knowledge about the nature of the problem matters.
In many setups, off-the-shelf regularization is just fine. However, when the data matrix has a special structure, like including many lags of many variables, it might not be. Sometimes it is easy to reconfigure an algorithm to implement a new prior (like ridge). Most times it is not (like tree ensembles or even neural nets). Hence, it can be convenient that, in certain time series situations, one can simply transform the data to induce a new prior and run the same off-the-shelf algorithm. That is what Moving Average Rotation of X (MARX) does, as we originally proposed in an article with colleagues at UQAM. By transforming the data using 2-3 lines of code (loops of means, available here), one can get an alternative regularization in ML time series models. This matters since choosing an inadequate prior scheme is like imposing a bad (soft) constraint and often results in over-regularized models that throw the baby out with the bathwater. No big surprises here: winning at the bias-variance trade-off implies finding constraints that reduce variance without blowing up bias.
Here's the motivation for MARX. In many applications, especially when high-frequency macro data is involved (e.g., daily or weekly), what appears to be the most appropriate prior for the coefficients of a lag polynomial of a given variable is
β(p)=β(p-1)+u(p) , u(p)~N(0,σ_u)
As opposed to
β(p)~N(0,σ_β)
which is explicit in ridge and implicit in everything else (like random forest; see ESL on that). What does the first equation mean? It means the partial effect of day t-p on day t (or month/quarter, in mixed-frequency applications) should be similar to that of days t-p-1 and t-p+1. In many contexts, this is more sensible than β(p)~N(0,σ_β), which basically says that each day is likely irrelevant and that this irrelevance is iid across p's. In the paper, for our monthly example, we say
For instance, it seems more likely that the average of March, April, and May employment growth could impact, say, inflation, than only May’s. Mechanically, this means we expect March, April, and May’s coefficients to be close to one another.
Obviously, we did not invent that kind of prior, which dates back (at least) to Shiller (1973). What we propose is a way of implementing it without altering any algorithm, which is extremely useful, especially when priors/regularization are not stated explicitly. It turns out that replacing X_{t,p,k} by AVG(X_{t,1:p,k}) for every p and k in the dataset, and throwing the result at your off-the-shelf ML tool of choice, does just that. MARXs are simply moving averages of increasing window size. Of course, the rationale for this comes from simple derivations that you may find in the paper. They are inspired by previous work I did on time-varying parameters and ridge regressions (as well as a very old Bayesian literature on the matter).
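To make the transformation concrete, here is a minimal sketch in Python. The code and the name marx_transform are mine, not the paper's implementation linked above, and it assumes lags run from 1 to P for a single raw predictor series:

import numpy as np

def marx_transform(x, P):
    # x : 1-D array holding one raw predictor series
    # P : number of lags to include
    # Column p-1 of the output is AVG(X_{t,1:p}), the average of lags 1 through p,
    # so the MARX regressors are moving averages of increasing window size.
    lags = np.column_stack([np.roll(x, p) for p in range(1, P + 1)])  # plain lag matrix
    lags = lags[P:]                                                   # drop rows contaminated by the wrap-around
    return np.cumsum(lags, axis=1) / np.arange(1, P + 1)              # cumulative means across the lag dimension

Doing this for every predictor k and stacking the resulting blocks side by side gives the MARX design matrix.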
In fact, if MARX is used in models where no standardization is done afterwards (unlike what is commonly done with, say, ridge and neural networks), using AVG(X_{t,1:p,k}) rather than SUM(X_{t,1:p,k}) induces
β(p)=β(p-1)+u(p) , u(p)~N(0, σ_u/p²)
which, loosely in the spirit of a Minnesota prior, says that the regularization strength increases with p. However, here it implies that lag coefficients are expected to be increasingly similar, whereas in a Minnesota prior it means they are expected to be increasingly irrelevant. Of course, which one is preferable depends on the underlying DGP. One thing that is certain, though, is that the former penalizes distant lags much less aggressively and allows long-range dependencies (or memory) to live on, albeit in a restricted/grouped way.
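For intuition, here is a rough back-of-the-envelope version of the derivation, in my own notation and indexing (which may differ slightly from the paper's). For a single predictor (dropping the k index), let Z(t,p) = AVG(X_{t,1:p}) = (1/p) Σ_{j=1:p} X(t-j) be the MARX regressor for lag p, with coefficient γ(p). Reordering the double sum Σ_p γ(p) Z(t,p) into Σ_j β(j) X(t-j) gives

β(j) = Σ_{p=j:P} γ(p)/p , so β(j) - β(j+1) = γ(j)/j

Hence a plain ridge prior γ(j)~N(0,σ_γ) on the rotated regressors amounts to a random-walk prior on the original lag polynomial whose innovations shrink like 1/j² in variance, which is (up to indexing) the σ_u/p² above. Using SUM instead of AVG drops the 1/p factor and leaves the innovation variance constant across lags, as in the first equation.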
Anyhow, let’s wrap up. If you have (higher frequency) time series data and need sensible regularization in an off-the-shelf ML model with many lags, try MARX. Another obvious application is estimating UMIDAS models with ML methods.
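As a rough end-to-end illustration (the toy data and settings below are placeholders of my own, not from the paper), the rotated blocks can be fed straight into any off-the-shelf learner, here scikit-learn's RandomForestRegressor:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
T, K, P = 500, 2, 12                               # toy dimensions: observations, predictors, lags
X_raw = rng.standard_normal((T, K))                # stand-in for K predictor series
y = rng.standard_normal(T)                         # stand-in target

# MARX design matrix: one block of cumulative-mean lags per predictor (marx_transform defined above)
X_marx = np.hstack([marx_transform(X_raw[:, k], P) for k in range(K)])
y_trimmed = y[P:]                                  # drop the first P targets to stay aligned with the lag rows

rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X_marx, y_trimmed)

Comparing this against the same model fit on the plain lag matrix is a cheap way to see whether the MARX prior helps on a given dataset.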
——
P.S.: Of course, equations don’t look too good here given this website’s limitations. Pretty ones can be found in the paper.