Is Random Forest a Risk-Averse Investor?
These are some observations that came out of the AEA meeting in Philadelphia last weekend—from the discussion of our paper "Dual Interpretation of Machine Learning Forecasts" (with Karin Klieber and Maximilian Göbel) by Ulrich Müller, and the conversations that followed.
Here's a fact about Random Forests that changes how you think about them: every RF prediction is a weighted average of training observations, where the weights are non-negative and sum to one. In the ML canon, RF (and plain decision trees) are unique in this regard. While we discuss this to some extent in the paper, I thought the point could benefit from some additional thought.
It's Averages All the Way Down
When you think about it, a tree's prediction is just the average of the training observations in the leaf where the query lands. The weights are 1/n for the n observations in that leaf, and zero for everyone else.
A forest is an average of trees. An average of averages is just an average.
That's why RF weights can never be negative, unlike the implicit weights of anything fit by least squares, where negative entries emerge naturally from the math.
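To see the weights with your own eyes, here's a minimal sketch using scikit-learn on synthetic data (bootstrap is switched off so each leaf value is a plain average, which makes the recovered weights reproduce the forecast exactly; all numbers are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 0.5 * rng.normal(size=200)

# bootstrap=False so each tree's leaf value is a plain average of the
# training points in that leaf; max_features<1 keeps the trees diverse
rf = RandomForestRegressor(n_estimators=300, max_features=0.5,
                           bootstrap=False, random_state=0).fit(X, y)
x_new = rng.normal(size=(1, 3))

# Recover the implicit weights: in each tree, the training points sharing
# the query's leaf split a weight of one equally; then average over trees.
train_leaves = rf.apply(X)        # (n_train, n_trees) leaf indices
query_leaf = rf.apply(x_new)[0]   # (n_trees,) leaf index of the query
w = np.zeros(len(X))
for t in range(rf.n_estimators):
    in_leaf = train_leaves[:, t] == query_leaf[t]
    w[in_leaf] += 1.0 / in_leaf.sum()
w /= rf.n_estimators

print(w.min() >= 0, np.isclose(w.sum(), 1.0))   # True True: on the simplex
print(np.isclose(w @ y, rf.predict(x_new)[0]))  # True: the weights ARE the forecast

# OLS admits the same decomposition, but its weights go negative:
w_ols = (x_new @ np.linalg.solve(X.T @ X, X.T)).ravel()
print((w_ols < 0).any())                        # True: implicit short positions
```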
Why This Matters
In order of non-triviality:
1. Random Forests cannot extrapolate.
This is well known. But the simplex framing makes it unavoidable: if all weights are non-negative and sum to one, the prediction is a convex combination of training outcomes. There is simply no way to escape the range of what you've seen before.
If your training outcomes are between 0 and 100, your predictions will be between 0 and 100. Not "usually"—always.
This is one reason I developed the Macro Random Forest approach in previous work: adding a linear component in the leaves restores the ability to extrapolate, which matters for economic forecasting.
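Here's a toy sketch of that hard boundary, on synthetic data (the linear model used for contrast is plain OLS):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 1))
y = 100 * X[:, 0]                  # training outcomes live in [0, 100]

rf = RandomForestRegressor(random_state=0).fit(X, y)
ols = LinearRegression().fit(X, y)

X_far = np.array([[2.0], [5.0]])   # far outside the training range
print(rf.predict(X_far))           # both pinned near 100: convex combos of y
print(ols.predict(X_far))          # about 200 and 500: the fitted line keeps going
```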
2. No short position: RF cannot use "opposites."
In RF, if current conditions are very dissimilar from some historical observation, that observation gets a weight near zero. It's ignored.
In linear models, high dissimilarity shows up as a negative weight. The model then uses that observation's opposite, i.e., its reflection through zero. When ridge assigns a weight of -0.5 to some observation, it's saying: "take half of the opposite of that outcome."
Concrete example: CPI inflation crashed in 2008Q4 after an oil price collapse. If 2021 conditions look like the opposite of late 2008—rapid acceleration, rising commodities—a linear model can use the negative of that 2008 outcome to predict high inflation. RF cannot. It either finds similar historical spikes (like the 1970s) or drifts toward the unconditional mean. Arguably, the latter is more aligned with how humans actually think about prediction: finding similar situations or nothing.
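Here's a minimal sketch of such a short position, using the closed-form ridge weights on synthetic data (the penalty lam and the data are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -0.5, 0.3, 0.0]) + 0.3 * rng.normal(size=100)
lam = 1.0

# Ridge forecast at x is w(x)'y, with w(x)' = x'(X'X + lam*I)^{-1} X'
def ridge_weights(x):
    return x @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)

x_query = -X[0]            # today looks like the mirror image of observation 0
w = ridge_weights(x_query)
print(w[0] < 0)            # True: ridge shorts observation 0
print(w.sum())             # need not be 1: leverage is allowed too
```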
The Conservative Investor Analogy
Think of the weights as a portfolio.
Random Forest is an investor who never uses leverage and never short-sells. Every position is long. The portfolio return must lie within the range of individual asset returns.
Linear models freely use leverage and short positions. They can bet against observations by assigning negative weights.
The Risk-Return Tradeoff
The simplex constraint makes RF fairly conservative. It doesn't exploit symmetry at all. This drastically reduces the pool of historical examples the algorithm can draw from—a pool that is already not very deep when we think about typical time series regressions in macroeconomics.
But this conservatism has benefits. RF is famously robust (not prone to overfitting, stable, resistant to outliers). The simplex constraint is a strong form of regularization: you can't blow up by taking extreme positions on reflected data that may not behave symmetrically.
Linear models can amplify signals through leverage and short positions. Efficient when symmetry holds. Dangerous when it doesn't—recessions might not look like mirror-image expansions. This is true of machine learning methods like ridge regression, but also of classical econometric workhorses like VARs and other models estimated equation-by-equation.
Interestingly, the attention mechanism in transformers brings back some of RF's qualities. The softmax operation forces attention weights to be non-negative and sum to one—right back on the simplex. So when large language models attend to context, they're implicitly making the same conservative choice as random forests, but in a framework amenable to gradient-based optimization.
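A quick numerical check, with made-up queries, keys, and values for a single attention head:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_tokens = 8, 10
q = rng.normal(size=d)               # one query
K = rng.normal(size=(n_tokens, d))   # keys, one per context token
V = rng.normal(size=(n_tokens, 1))   # values

scores = K @ q / np.sqrt(d)          # scaled dot-product attention
a = np.exp(scores - scores.max())
a /= a.sum()                         # softmax: the weights...

print(a.min() >= 0, np.isclose(a.sum(), 1.0))  # True True: ...land on the simplex
out = a @ V                          # the output is a convex combination of values
```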
The Bottom Line
Random Forest, despite its exotic name, might actually be the less risky modeling strategy.
Linear models use polar opposites to squeeze more out of limited data. When symmetry holds, this is clever—you're recycling knowledge. When it fails, you're amplifying noise.
RF makes no such bet. It only uses what actually happened—but by ruling out reflections, its effective sample is sparser. So linear models risk bias from bad reflections; RF risks variance from a thinner pool.
Which is riskier in practice? Empirical question. In my experience, RF is far more risk-averse than standard linear models.