Cheap Trick Inference for ML Performance Evaluation and Variable Importance Calculations
It is natural to want to know whether the difference between the performance of two predictions on a given test set is merely due to sampling variability. A related question arises when one computes Variable Importance (VI) measures for Random Forests or other algorithms: is the contribution of a given variable statistically significant? VI is usually computed by randomly permuting the rows of the variable of interest and comparing out-of-bag performance. Of course, you have probably recognized that this is just the nonparametric equivalent of a t-test in a model where computing degrees of freedom and conducting theoretical derivations would be daunting.
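To fix ideas, here is a minimal sketch of permutation-based VI in Python. Everything in it is illustrative: the data are synthetic, and for simplicity it evaluates on a held-out test set rather than on the out-of-bag samples.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: only the first two columns actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)

mse_full = mean_squared_error(y_te, rf.predict(X_te))
for k in range(X.shape[1]):
    X_perm = X_te.copy()
    X_perm[:, k] = rng.permutation(X_perm[:, k])  # shuffle predictor k only
    mse_perm = mean_squared_error(y_te, rf.predict(X_perm))
    # VI = how much the test loss increases when variable k is scrambled
    print(f"X_{k}: VI = {mse_perm - mse_full:.4f}")
```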
Anyhow, computing such confidence intervals is often done by bootstrapping, which implies even more computation. Alternatively, one can conduct a t-test on the loss differentials of two models. That is, if interested in MSEs, test the null hypothesis that the vector of d_i = err_{A,i}^2 - err_{B,i}^2 has mean zero. For the VI of variable k, model A is the full model and model B uses the shuffled predictor X_k. Of course, this is just a cross-sectional simplification of the well-known Diebold-Mariano test for time series, which also means that any loss function is accommodated. And since these errors are obtained "out-of-bag" or simply out-of-sample, there is no need for degrees-of-freedom accounting.
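Concretely, the test reduces to a one-sample t-test on the d_i's. A minimal sketch, again on synthetic data (the function name and inputs are illustrative, not a fixed API):

```python
import numpy as np
from scipy import stats

def loss_differential_ttest(y, pred_a, pred_b):
    """t-test of H0: E[d_i] = 0, with d_i = err_{A,i}^2 - err_{B,i}^2."""
    d = (y - pred_a) ** 2 - (y - pred_b) ** 2
    return stats.ttest_1samp(d, popmean=0.0)

# Illustration: model A is slightly more accurate than model B.
rng = np.random.default_rng(0)
y = rng.normal(size=500)
pred_a = y + rng.normal(scale=0.9, size=500)
pred_b = y + rng.normal(scale=1.0, size=500)

res = loss_differential_ttest(y, pred_a, pred_b)
print(f"t = {res.statistic:.3f}, p = {res.pvalue:.4f}")
```

For VI, pred_a and pred_b would be the predictions from the intact and shuffled test sets in the earlier sketch, e.g. rf.predict(X_te) and rf.predict(X_perm).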