In the previous post, Naïve Random Forest Classifier, I looked at the performance of a Random Forest trained on ~5'000 cases with ~3'500 features.
Presumably, only a fraction of the features is true signal and the rest is noise. Fitting to noise is a major sin in Machine Learning; taking preventive measures against it is therefore one of the most important preprocessing tasks when building a model.
First, why a low signal-to-noise ratio is dangerous.
The Elements of Statistical Learning has two quite impressive examples:
In the experiment above, pure Gaussian noise was generated, and out of 5'000 features the 100 most correlated with the outcome were chosen by looking at the whole data set. Cross-validation on 50 subsamples then showed the 100 chosen features to be statistically significant. Just to remind you: that was pure noise. So, what has just happened?
You are not allowed to peek into the whole data set before validating results on a held-out subsample. If the whole data set has been examined prior to fitting the model, there is a chance of finding statistically significant features where only pure noise exists.
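To make the mechanics concrete, here is a minimal scikit-learn sketch of that experiment (my own reconstruction, not the book's code): univariate F-score selection stands in for "most correlated with the outcome", a 1-NN classifier plays the model, and 5-fold CV replaces the book's subsampling scheme. Selecting features on the whole data set before cross-validating produces impressive accuracy on pure noise; keeping the selection inside the folds does not.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_samples, n_features = 50, 5000
X = rng.standard_normal((n_samples, n_features))  # pure Gaussian noise
y = rng.integers(0, 2, n_samples)                 # labels independent of X

clf = KNeighborsClassifier(n_neighbors=1)

# Wrong: pick the 100 "best" features on the whole data set, then cross-validate.
X_peeked = SelectKBest(f_classif, k=100).fit_transform(X, y)
wrong_cv = cross_val_score(clf, X_peeked, y, cv=5).mean()

# Right: keep the selection step inside each cross-validation fold.
pipe = make_pipeline(SelectKBest(f_classif, k=100), clf)
right_cv = cross_val_score(pipe, X, y, cv=5).mean()

print(f"peeking CV accuracy: {wrong_cv:.2f}")  # typically far above chance (0.5)
print(f"honest CV accuracy:  {right_cv:.2f}")  # hovers around chance
```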
In the above picture, in-sample cross-validated error is plotted against out-of-sample test error for RF and GBM models. As the signal-to-noise ratio drops (2/5, 2/25, 2/50, 2/100, 2/150), the gap between in-sample and out-of-sample error widens much faster for the Random Forest than for the Gradient Boosting Machine.
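A rough sketch of this kind of comparison, under my own assumptions about the setup (reading 2/5, 2/25, ... as 2 informative features against 5, 25, ... pure-noise features; the figure's exact data-generating process may differ):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

for n_noise in (5, 25, 50, 100, 150):
    # 2 informative features plus an increasing number of noise features
    X, y = make_classification(n_samples=1000, n_features=2 + n_noise,
                               n_informative=2, n_redundant=0,
                               n_clusters_per_class=1, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=42)

    for name, model in (("RF", RandomForestClassifier(random_state=42)),
                        ("GBM", GradientBoostingClassifier(random_state=42))):
        # in-sample cross-validated error vs. out-of-sample test error
        cv_err = 1 - cross_val_score(model, X_train, y_train, cv=5).mean()
        test_err = 1 - model.fit(X_train, y_train).score(X_test, y_test)
        print(f"noise={n_noise:3d}  {name:3s}  CV err={cv_err:.3f}  "
              f"test err={test_err:.3f}  gap={test_err - cv_err:+.3f}")
```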
Random Forest, by default, tries only a random subset of sqrt(p) features at every split; thus, if the signal is only 10 features out of a total of 10'000, the probability that a given split sees no signal feature at all, and therefore fits to noise, is ~90%. This shows up in the picture as a widening gap between in-sample and out-of-sample error for RF as the signal-to-noise ratio drops. Note that this effect is much less pronounced for GBM, which by default considers all features at every split.
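A back-of-the-envelope check of that ~90% figure (my own calculation, assuming the classification default of mtry = sqrt(10'000) = 100 candidate features per split):

```python
from math import comb

p, m, signal = 10_000, 100, 10  # total features, candidates per split, signal features

# Probability that a random draw of m candidate features contains no signal feature
p_no_signal = comb(p - signal, m) / comb(p, m)
print(f"{p_no_signal:.3f}")     # ~0.904, i.e. roughly 90%
```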