This is the third post in an exercise to predict post popularity on the NYTimes website. Two previous posts described fitting a simple Random Forest model and the theory behind the need for feature selection (or feature reduction): Naïve Random Forest Classifier; Fitting models on low signal-to-noise data. Though theoretically the need for feature selection is…

In the previous post, titled Naïve Random Forest Classifier, I considered the performance of a Random Forest trained on ~5,000 cases with ~3,500 features. Presumably, only a fraction of the features is true signal and the rest is noise. Fitting to noise is a major sin in Machine Learning; thus, taking preventive measures against doing so is…

This exercise is about predicting the ‘popularity’ of posts on the NYTimes website. A ‘popular’ post is defined as one with 20 or more comments. The data consists of 8,402 entries altogether across the training and testing sets. The fields are: NewsDesk = the New York Times desk that produced the story (Business, Culture, Foreign, etc.) SectionName = the…
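The labeling rule above (a post is ‘popular’ if it has 20 or more comments) can be sketched as follows. This is a minimal illustration, not the post's actual code; the comment-count field name `NumComments` is an assumption, since the excerpt does not name that field.

```python
# Minimal sketch of the popularity label. The field name
# "NumComments" is hypothetical -- the excerpt lists NewsDesk and
# SectionName but not the comment-count column.
POPULARITY_THRESHOLD = 20  # a 'popular' post has 20+ comments

def label_popular(posts):
    """Return a parallel list of 0/1 popularity labels."""
    return [1 if p["NumComments"] >= POPULARITY_THRESHOLD else 0
            for p in posts]

posts = [
    {"NewsDesk": "Business", "NumComments": 42},
    {"NewsDesk": "Culture",  "NumComments": 3},
    {"NewsDesk": "Foreign",  "NumComments": 20},  # boundary case counts as popular
]
labels = label_popular(posts)
print(labels)  # [1, 0, 1]
```

Note the boundary: exactly 20 comments qualifies as popular, matching the "20 or more" definition.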

© 2014 In R we trust.