The number of trees in Random Forest is not a tuning parameter.

Here is a rather wonkish rant about something that often comes up in discussions with machine learning folks – or econ ones immersing themselves in ML. In one sentence: the number of trees (B) in Random Forest has never been and will never be a tuning parameter. If I hear again about tuning B (and derivatives thereof), I might lose it. Believe it or not, this misconception is not specific to tree-ensemble dilettantes.

Obviously, the confusion stems from RF being very close in appearance to Boosted Trees (BT), where the number of trees is a tuning parameter. Indeed, in the case of BT, tuning the number of trees is necessary because, by the nature of additive modeling, adding terms increases model complexity, and too much or too little of it will harm hold-out sample performance.
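In symbols (a stylized sketch, with ν the learning rate and each g_b a small tree fit to the current residuals):

$$\hat{f}^{\,\mathrm{BT}}_{B}(x) \;=\; \sum_{b=1}^{B} \nu\, g_b(x)$$

Each added term expands the fitted function, so B directly indexes complexity and must be tuned, jointly with ν.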

The number of trees plays a different role in RF. It is the number of randomly estimated classifiers/predictors we average, not add up recursively (as in BT). Each tree is a partially random function of X, and the world has yet to see numerous averaged draws from the same distribution being less reliable than a few. Accordingly, nobody ensembling neural networks to average out initialization noise would prefer 20 runs over 100. We simply want as many draws as necessary to stabilize our conditional mean, so that it is a fixed function given X – i.e., one we won't trade for a different one by changing the seed. Therefore, if there is a trade-off, it does not involve bias and variance, but computational burden. That is, in essence, what one can read in The Elements of Statistical Learning (ESL).
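To see it in ESL's terms: RF averages B identically distributed trees, and for trees with variance σ² and pairwise correlation ρ,

$$\hat{f}^{\,\mathrm{RF}}_{B}(x) \;=\; \frac{1}{B}\sum_{b=1}^{B} T_b(x), \qquad \operatorname{Var}\left[\hat{f}^{\,\mathrm{RF}}_{B}(x)\right] \;=\; \rho\,\sigma^{2} \;+\; \frac{1-\rho}{B}\,\sigma^{2}.$$

The second term vanishes as B grows, so more trees can only stabilize the prediction toward its B → ∞ limit. There is no complexity dial here, only a computational bill.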

I can already hear it: yes, but I once saw a RF with 200 trees do worse than one with 500 trees? Please try again on a large enough test sample. Or test the significance of the difference between MSEs via some variation of a Diebold-Mariano test. Estimation uncertainty also pertains to test set evaluation results.
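For concreteness, here is a minimal sketch of such a test on squared-error losses (variable names are mine, not from any particular package); for i.i.d. test observations it amounts to a paired t-test on the loss differential:

```python
import numpy as np
from scipy import stats

def dm_test(y_test, pred_a, pred_b):
    """Diebold-Mariano-style test of equal predictive accuracy (squared loss)."""
    d = (y_test - pred_a) ** 2 - (y_test - pred_b) ** 2   # loss differential
    dm = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))     # studentized mean
    p_value = 2 * stats.norm.sf(abs(dm))                  # two-sided p-value
    return dm, p_value

# e.g., dm, p = dm_test(y_test, rf_200.predict(X_test), rf_500.predict(X_test))
```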

And you would think all of this is obvious. Well, far from it, apparently. Take the very well-known paper by Gu, Kelly, and Xiu (2020) in The Review of Financial Studies:

Hyperparameters include, for example, the penalization parameters in lasso and elastic net, the number of iterated trees in boosting, the number of random trees in a forest, and the depth of the trees.

Now, Gu et al. (2020) are certainly not alone. Even yours truly, in a very early draft of a paper with colleagues at UQAM, used to tune B. Admittedly, that goes way back to when I was learning ML. Nonetheless, it shows how pervasive the idea is, especially among those recently introduced to tree ensembles. So the exposition itself must be the culprit.

Unfortunately, things don’t end there. In the deep learning realm, take the influential Belkin et al. (2019) paper on double descent in deep neural nets, published in the Proceedings of the National Academy of Sciences. They manufacture an empirical demonstration of the phenomenon by implicitly treating the number of trees as a tuning parameter that increases complexity in RF. Don’t get me wrong, double descent is cool and indeed a surprising feature of neural networks – but not of random forests. In my own paper on the surprising origins of the non-overfitting properties of RF, I dedicate a section to dissecting Belkin et al.’s (2019) artificial example of double descent in RF.

The misconception they leverage is, again, that B could be a tuning parameter, i.e., something that drives model complexity up and down. It is not. So how did this happen? It is true that increasing the number of trees can give the impression of increased model capacity/complexity: going from 10 trees to, say, 50 will indeed decrease training error. This is precisely what Belkin et al. (2019) exploit. However, it is an illusion, because RF with very few trees is… a random model.

Consider a simple analogous example. You have 1000 regressors and 300 observations. You “bag” 5 linear regression models, each including 100 randomly selected regressors. This is an example to which many (ESL included) have referred to understand some of RF’s properties. If you bag 200 regressions rather than 5, training fit is indeed very likely to increase. But not because complexity went up. Rather, it is simply because the B=5 model is… a random model. Any other B=5 model is likely to look very different, and, indeed, one or two of them may, by sheer luck, perform better than B=200. In other words, the uncertainty over “variable inclusion” has been properly integrated out at B=200, but very likely not at B=5. Of course, seeds should not be model inputs, so we should discard B=5. Taking the Bayesian view of bagging, RF with B=5 amounts to approximating the posterior mean with a mere 5 draws from the posterior. Not great. So the fact that B=200 fits both the training and test data better should not come as a surprise.
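A minimal simulation of that thought experiment (the numbers match the paragraph above; the sparse signal and other implementation details are assumptions of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, p_sub = 300, 1000, 100           # observations, regressors, regressors per model
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:30] = 1.0                        # illustrative sparse signal
y = X @ beta + rng.standard_normal(n)

def bagged_fit(B, seed):
    """Average B OLS fits, each on a random subset of p_sub regressors."""
    rng_b = np.random.default_rng(seed)
    preds = np.zeros(n)
    for _ in range(B):
        cols = rng_b.choice(p, size=p_sub, replace=False)
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        preds += X[:, cols] @ coef
    return preds / B

# B=5 fit swings wildly from seed to seed; B=200 is stable across seeds
for B in (5, 200):
    r2 = [1 - np.mean((y - bagged_fit(B, s)) ** 2) / np.var(y) for s in range(10)]
    print(f"B={B:3d}: in-sample R2 across seeds: "
          f"mean={np.mean(r2):.3f}, sd={np.std(r2):.3f}")
```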

Anyhow, this could go on and on, so let’s wrap up: it is about time this misconception stopped and we started providing our students with an understanding of RF that makes statistical sense.

——
P.S.: What should you tune, then? Mtry, the number of randomly eligible features at each split, can be important for balancing bias and variance (although variance is always reasonably contained with RF). Then, if you have a small data set (observations-wise) and a low signal-to-noise ratio, maybe depth or minimal.node.size. The two play essentially the same role, guiding the complexity of the tree base learners. In very hostile learning environments, making the underlying trees more “humble” can sometimes help. Finally, in environments where observations are scarce, one might want to build the ensemble using subsampling rather than sampling with replacement, and tune subsampling.rate. The latter can reasonably go from 0.5 to 1. subsampling.rate=1 obviously means no bagging. While that may sound crazy, it’s not. In high-dimensional setups like macro forecasting, where regressors are abundant and observations much less so, a lot of randomness in tree-building can be obtained from Mtry alone. Hence, setting subsampling.rate=1 allows for slightly more complex trees while maintaining a substantial level of diversification across the tree portfolio.
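In scikit-learn terms, a sketch of tuning the parameters actually worth tuning, with B fixed and comfortably large. The mapping is my rough translation (ranger’s mtry ≈ max_features, min.node.size ≈ min_samples_leaf, subsampling rate ≈ max_samples), and note that sklearn samples with replacement, so true subsampling as described above isn’t directly exposed there:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# n_estimators (B) is fixed at a comfortably large value -- it is NOT tuned.
rf = RandomForestRegressor(n_estimators=500, random_state=0)

param_grid = {
    "max_features":     [0.1, 0.33, 1.0],   # ~ Mtry, as a fraction of regressors
    "min_samples_leaf": [1, 5, 20],         # ~ minimal node size (tree complexity)
    "max_samples":      [0.5, 0.75, None],  # ~ sampling rate (None = full bootstrap;
                                            #   sklearn draws WITH replacement)
}
search = GridSearchCV(rf, param_grid, cv=5, scoring="neg_mean_squared_error")
# search.fit(X_train, y_train); search.best_params_
```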

Now, with all that being said, RF with default tuning parameters very often does an impressive job, at least in my experience and research projects (mostly regression, with a bit of classification).
