On the Faltering Virtues of Complexity

There’s been a wave of pushback lately against the “virtue of complexity” paradigm, popularized by Kelly, Malamud, and Zhou (KMZ). Their headline result was both surprising and seductive: the more complex your model—in terms of parameters—the better it will perform out-of-sample when predicting asset prices, and by extension, the better the resulting Sharpe ratios. For a solid presentation of the paper by Bryan Kelly, check this out.

In short, there is no trade-off anymore: just embrace complexity. But that narrative is beginning to fray.


The Pushback

Recent critiques have emerged on multiple fronts. Theoretically, some have pointed out that the “virtue” relies heavily on a specific set of assumptions. Empirically, others have noted questionable benchmarking choices. Perhaps most strikingly, it has been shown that supposedly complex strategies don’t look so complex after all once you examine the information they’re actually using. While KMZ have yet to respond, and some critiques may well be addressable, many of the concerns reflect statistical common sense and carry intuitive weight.

IMHO, this reckoning was written in the stars. Applying double descent theory to asset pricing was always bound to hit turbulence. That’s not to say double descent is wrong or without merit—I’ve used it myself to justify overparameterized architectures in macro forecasting. But the claim that overparameterization is universally superior—especially outside deep neural networks trained via gradient descent—is less convincing. Even in the ridgeless regression literature, interpolation doesn’t consistently outperform standard regularization; results remain highly context-dependent. In complex models with multiple jointly estimated modules, interpolation can help smooth numerical optimization itself and reduce sensitivity to initialization. But that’s not what KMZ are doing: they are approximating a single-layer wide network with random Fourier features in a ridgeless regression.
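
To make that setup concrete, here is a minimal sketch in Python of a KMZ-style exercise: random Fourier features fed into a ridgeless (minimum-norm) regression. Everything below is simulated and purely illustrative; the sample size, bandwidth, and feature count are my own toy choices, not theirs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a return-prediction problem: few observations, a handful of
# predictors, and a weak signal buried in noise. None of this is KMZ's data.
T, P = 60, 5000                        # observations vs. random features
X = rng.standard_normal((T, 5))        # five illustrative predictors
y = 0.1 * np.tanh(X[:, 0]) + rng.standard_normal(T)

# Random Fourier features: z_j(x) = sqrt(2/P) * cos(w_j' x + b_j)
gamma = 1.0                            # illustrative bandwidth
W = gamma * rng.standard_normal((5, P))
b = rng.uniform(0.0, 2.0 * np.pi, P)
Z = np.sqrt(2.0 / P) * np.cos(X @ W + b)

# Ridgeless regression: the minimum-norm least-squares solution, i.e. the
# limit of ridge as the penalty shrinks to zero. pinv returns it directly.
beta = np.linalg.pinv(Z) @ y

# With P > T the model interpolates: in-sample residuals are numerically zero.
print(np.abs(Z @ beta - y).max())
```

With more random features than observations, the fit interpolates the training data; whether that helps out-of-sample is precisely what the debate is about.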

Complexity ≠ Parameters

One of the more subtle but crucial misunderstandings is equating the number of parameters with model complexity past the interpolation threshold. In practice, what double descent often delivers—especially in the overparameterized regime where parameters outnumber observations—is not a more complex model, but a more regularized one.

The model doesn’t “see” more—it simply re-expresses the same functional space under a more stable numerical structure, which makes a huge difference for gradient-based optimization. In truth, the real determinants of effective model complexity remain what they’ve always been: the number of observations, the signal-to-noise ratio, and the true functional form of the underlying data-generating process. Therefore, when KMZ state that “The virtue of complexity is present even in extremely data-scarce environments, e.g., for predictive models with less than twenty observations and tens of thousands of predictors,” it would be more accurate to replace “complexity” with “overparameterization,” as the latter—particularly in data-scarce settings—does not necessarily imply the former.
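
A quick numerical way to see this is to track the effective degrees of freedom of the ridgeless fit, i.e., the trace of the hat matrix that maps observed outcomes into fitted values. In the toy setting below (twenty observations and up to ten thousand random features, echoing the data-scarce case in the quote above, with all numbers simulated), that trace grows with the number of features only until it hits the sample size, then stops:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 20                                  # "extremely data-scarce": 20 observations
x = rng.standard_normal((T, 1))

def effective_dof(P, gamma=1.0):
    """Trace of the hat matrix of a ridgeless fit on P random Fourier features."""
    W = gamma * rng.standard_normal((1, P))
    b = rng.uniform(0.0, 2.0 * np.pi, P)
    Z = np.sqrt(2.0 / P) * np.cos(x @ W + b)
    H = Z @ np.linalg.pinv(Z)           # hat matrix: fitted values = H @ y
    return np.trace(H)

for P in (10, 100, 1_000, 10_000):
    print(P, round(effective_dof(P), 1))
# Prints roughly 10, 20, 20, 20: effective degrees of freedom are capped at T,
# no matter how many features (nominal parameters) the model carries.
```

Past the interpolation threshold, adding features changes the basis the model works in, not the capacity the data can actually identify.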

As it turns out, a brand-new theoretical paper by Hasan Fallahgoul proves just this. In his own words:

Sample Complexity Reality: I derive sample complexity bounds showing when reliable learning becomes information-theoretically impossible under weak signal-to-noise ratios typical in finance. The math reveals that most practical applications fall in the "impossible learning" regime.
Effective Complexity Truth: VC-dimension analysis reveals that ridgeless regression's effective complexity is bounded by sample size rather than nominal feature dimension. Methods claiming to leverage thousands of parameters are actually operating in much simpler spaces.

Oh, well.

The ongoing discussion is nonetheless scientifically illuminating, and it is safe to bet that KMZ will reenter the ring forcefully. Until then, it seems like common statistical sense might make sense after all.

Some Lessons for Complexity in Macroeconomic Forecasting

Macroeconomic forecasting is another field where the allure of complexity can be strong. I’ve worked with deep neural networks featuring close to a million parameters trained on just 250 quarterly observations. Does that mean my models were more “complex” than ones that already had four times as many parameters as observations? Not really. What made them work wasn’t raw complexity, but the stabilizing effect of overparameterization during optimization. It was a form of implicit regularization—one that worked well without needing endless hyperparameter tuning.
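
To illustrate that implicit-regularization effect in the most stripped-down way I can think of (a linear model rather than a deep network, on simulated data), gradient descent started at zero and run without any penalty converges to the minimum-norm interpolating solution, the same one a ridgeless regression would return:

```python
import numpy as np

rng = np.random.default_rng(2)
T, P = 50, 500                          # ten times more parameters than observations
X = rng.standard_normal((T, P))
y = rng.standard_normal(T)              # purely illustrative targets

# Plain gradient descent on squared error, started at zero, with no penalty term.
beta = np.zeros(P)
lr = 0.5 / np.linalg.norm(X, 2) ** 2    # conservative step size
for _ in range(5_000):
    beta -= lr * X.T @ (X @ beta - y)

beta_min_norm = np.linalg.pinv(X) @ y   # explicit minimum-norm interpolator

print(np.allclose(X @ beta, y, atol=1e-6))          # both fit the data exactly
print(np.allclose(beta, beta_min_norm, atol=1e-6))  # ...and they coincide
```

In a deep network the mechanics are messier, but the flavor is the same: much of the regularization comes from the optimizer, not from the parameter count.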

A particularly important insight comes from the recent Nagel paper, which shows that even if a model has a million parameters, its predictions can still be expressed as a linear combination of in-sample observations—essentially as a kernel. My co-authors and I have used this idea in three papers now, most notably in our work on the dual interpretation of machine learning forecasts. There, we argue that when a model is structurally complex—featuring many regressors, nonlinearities, and overparameterization—it is often easier to interpret the model through its dual representation: by analyzing the weights placed on observed outcomes rather than on predictors.
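
Here is a minimal sketch of that dual reading, again with simulated data and an illustrative random-feature model rather than the actual specification in any of those papers. The same forecast can be computed either from the thousands of primal coefficients or from one weight per in-sample observation, and it is the latter vector that the dual interpretation examines:

```python
import numpy as np

rng = np.random.default_rng(3)
T, D, P = 200, 3, 20_000                # 200 in-sample periods, 3 predictors,
X = rng.standard_normal((T, D))         # and 20,000 random features ("parameters")
y = rng.standard_normal(T)              # illustrative outcomes
x_new = rng.standard_normal((1, D))     # the point we want a forecast for

gamma = 1.0                             # illustrative bandwidth
W = gamma * rng.standard_normal((D, P))
b = rng.uniform(0.0, 2.0 * np.pi, P)

def featurize(A):
    return np.sqrt(2.0 / P) * np.cos(A @ W + b)

Z, z_new = featurize(X), featurize(x_new)
Z_pinv = np.linalg.pinv(Z)              # P x T

# Primal form: a 20,000-coefficient ridgeless regression.
beta = Z_pinv @ y
primal_forecast = (z_new @ beta)[0]

# Dual form: the same forecast as a weighted sum of the 200 observed outcomes.
w = (z_new @ Z_pinv)[0]                 # one weight per in-sample observation
dual_forecast = w @ y

print(np.isclose(primal_forecast, dual_forecast))    # identical by construction
print(np.argsort(np.abs(w))[::-1][:5])               # most heavily weighted past periods
```

The two numbers agree by construction; what changes is the object you get to inspect.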

Nagel applies this perspective cleverly to market timing strategies, showing that in the double descent limit, these observation weights tend to follow remarkably simple financial patterns. This resonates with points I’ve been making in seminars: due to overparameterization and regularization, models that appear extremely complex in their primal form (with intricate predictor interactions) can be surprisingly sparse and interpretable in their dual form—when we examine how they actually reweight past outcomes to make forecasts.

In macroeconomic forecasting, this doesn’t mean the models are trivial. A forecast may indeed arise from a rich and dense interplay of variables. But the resulting predictions can often be traced back to a sparse, intuitive set of historical episodes—a narrative that’s much easier to communicate and interpret.

P.S.: for economists new to double descent, interpolation, and the like, I recommend checking my slides on the matter as a springboard into the ML literature.
