In the previous post, titled Naïve Random Forest Classifier, I considered the performance of a Random Forest trained on ~5’000 cases with ~3’500 features.

Presumably, only a fraction of the features is true signal and the rest is noise. Fitting to noise is a major sin in Machine Learning; thus, guarding against it is one of the most important preprocessing tasks when fitting a model.

First, why is a low signal-to-noise ratio dangerous?

The Elements of Statistical Learning has two quite impressive examples:

  1. Discovering ‘signal’ in pure noise.

[Figure: wrong-and-right-cv]

In the experiment above, pure Gaussian noise was generated, and out of 5’000 features the 100 most correlated with the outcome were chosen by looking at the whole data set. Then cross-validation on 50 subsamples showed the 100 chosen features to be statistically significant. Just to remind you: that was pure noise. So, what has just happened?

The reason:

It is not allowed to peek at the whole data set before validating results on a held-out subsample. If the whole data set has been examined before fitting the model (here, to pick the features), the held-out subsamples have already influenced that choice, so there is a chance of finding statistically significant features where only pure noise exists.
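The effect is easy to reproduce at a smaller scale. Below is a minimal sketch, not the book's exact setup: the sizes (50 cases, 2’000 noise features, 20 selected), the 5-fold split, the 1-nearest-neighbour classifier and all function names are my own assumptions, chosen so that the script runs quickly with the standard library only. It compares feature selection done once on all rows (the wrong way) with selection repeated inside every training fold (the right way).

```python
import random

def abs_corr(xs, ys):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return abs(sxy) / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def top_features(X, y, rows, k):
    """Indices of the k features most correlated with y, using only `rows`."""
    ys = [y[i] for i in rows]
    scores = [abs_corr([X[i][j] for i in rows], ys) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]

def predict_1nn(X, y, train, i, feats):
    """Label of the training row closest to row i on the chosen features."""
    nearest = min(train, key=lambda t: sum((X[t][j] - X[i][j]) ** 2 for j in feats))
    return y[nearest]

def cv_error(X, y, k, n_folds, select_inside_folds):
    """Cross-validated misclassification rate of 1-NN on k selected features."""
    n = len(y)
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    # Wrong way: the features are picked once, with the held-out rows included.
    global_feats = None if select_inside_folds else top_features(X, y, list(range(n)), k)
    errors = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        # Right way: the features are re-selected from the training rows only.
        feats = top_features(X, y, train, k) if select_inside_folds else global_feats
        errors += sum(predict_1nn(X, y, train, i, feats) != y[i] for i in test)
    return errors / n

rng = random.Random(0)
n_cases, n_features, n_selected = 50, 2000, 20     # assumed, scaled-down sizes
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_cases)]
y = [i % 2 for i in range(n_cases)]                # labels unrelated to X

wrong_err = cv_error(X, y, n_selected, 5, select_inside_folds=False)
right_err = cv_error(X, y, n_selected, 5, select_inside_folds=True)
print(f"selection on all data:     CV error = {wrong_err:.2f}")
print(f"selection inside each fold: CV error = {right_err:.2f}")
```

Since the labels are independent of the features, no classifier can do better than chance on new data. The first number should nevertheless look optimistic, while the second should stay near 50%.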

  2. Low signal-to-noise ratio.

[Figure: noise-vs-signal-in-RF-vs-GBM]

In the picture above, the in-sample cross-validated error is shown against the out-of-sample test error for RF and GBM models. We see that as the signal-to-noise ratio drops (2/5, 2/25, 2/50, 2/100, 2/150), the gap between in-sample and out-of-sample error for Random Forest expands much faster than that for Gradient Boosting Machine.

The reason:

Random forest, by default, considers only N = \sqrt{\text{number of features}} randomly chosen features on every split. Thus if, e.g., the signal is only 10 features but the total number of features is 10’000, then N = 100 and the probability that all candidates for a given split are noise is ~90%. This shows in the picture as a widening gap between in-sample and out-of-sample error for RF as the signal-to-noise ratio drops. Note that this effect is less pronounced for GBM.
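The ~90% figure can be checked with a few lines of arithmetic. This is a small sketch under the assumption that the candidate features for a split are drawn uniformly at random without replacement, so the count of signal features among them is hypergeometric; the function name `prob_all_noise` is mine, introduced only for illustration.

```python
from math import comb, isqrt

def prob_all_noise(n_features, n_signal, n_candidates):
    """Probability that a random subset of n_candidates features
    (drawn without replacement) contains no signal feature at all."""
    return comb(n_features - n_signal, n_candidates) / comb(n_features, n_candidates)

n_features, n_signal = 10_000, 10
n_candidates = isqrt(n_features)   # sqrt(10'000) = 100 candidates per split

prob = prob_all_noise(n_features, n_signal, n_candidates)
print(f"{n_candidates} candidates per split, P(all noise) = {prob:.3f}")
```

A quick approximation gives the same answer: each of the 10 signal features misses the candidate set with probability about 0.99, and 0.99^10 is roughly 0.90.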

Lessons to learn:

  • Increase the signal-to-noise ratio by pre-filtering features for importance or by stability solutions (regression and classification).
  • Use models that are more robust against noise, such as logistic regression or Gradient Boosting Machine.

© 2014 In R we trust.