This exercise is about predicting the 'popularity' of posts on the NYTimes website.
A 'popular' post is defined as one with 20 or more comments.
The data consists of 8402 entries altogether across the training and test sets.
The fields are:
- NewsDesk = the New York Times desk that produced the story (Business, Culture, Foreign, etc.)
- SectionName = the section the article appeared in (Opinion, Arts, Technology, etc.)
- SubsectionName = the subsection the article appeared in (Education, Small Business, Room for Debate, etc.)
- Headline = the title of the article
- Snippet = a small portion of the article text
- Abstract = a summary of the blog article, written by the New York Times
- WordCount = the number of words in the article
- PubDate = the publication date
- UniqueID = a unique identifier for each article
The plan is to use Python's sklearn package to build several classifiers (Random Forest, Gradient Boosting Classifier, Logistic Regression) while exploring the effectiveness of different approaches to feature selection and model regularization.
The first attempt, described in this post, builds a naïve Random Forest classifier upon the following features:
- NewsDesk, SectionName, SubsectionName
- Bag of uni- and bi-grams extracted from Headline and Abstract
- Part of day (morning, afternoon, evening, etc.) and day of week extracted from PubDate.
The approach is 'naïve' because no feature filtering whatsoever is employed: about 3'500 features are thrown at a Random Forest without any hyperparameter tuning, the model is trained on ~5'200 cases and then checked on a held-out validation set of 1'307 cases.
The performance of this model will be used later as a benchmark when more sophisticated models are developed.
The main purpose of the project is to:
- find the model with the best cross-validated AUC score on the train set (a minimal sketch of this measurement follows the list)
- that generalizes well to the held-out set
- with no human intervention in the process of feature creation, i.e. purely unsupervised feature creation, feature selection, and model selection.
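For reference, the cross-validated AUC criterion itself is straightforward to compute with sklearn's cross_val_score. The minimal sketch below assumes the data_train/y_train arrays that are constructed later in this post and uses a small placeholder forest; it is an illustration, not part of the naïve model.

import sklearn.ensemble as ens
import sklearn.cross_validation as cv

# Placeholder model; any classifier with predict_proba can be plugged in here.
clf = ens.ExtraTreesClassifier(n_estimators=100, random_state=1, n_jobs=-1)

# 5-fold cross-validated AUC on the training split; the model with the
# highest mean score is the one to keep.
scores = cv.cross_val_score(clf, data_train, y_train, scoring='roc_auc', cv=5, n_jobs=-1)
print('CV AUC: %0.5f +/- %0.5f' % (scores.mean(), scores.std()))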
import numpy as np
import pandas as pd
seed = 1
train = pd.read_csv('/home/sergey/R_Analyttics_Edge_MITx/kaggle/NYTimesBlogTrain.csv')
test = pd.read_csv('/home/sergey/R_Analyttics_Edge_MITx/kaggle/NYTimesBlogTest.csv')
# combine train and test so that features are built on the full data set; drop the target column
combo = pd.merge(train, test, how='outer')
combo = combo.drop('Popular', axis=1)
y = train['Popular']
print(train.shape)
print(y.shape)
print(test.shape)
print(combo.shape)
(6532, 10)
(6532,)
(1870, 9)
(8402, 9)
The first rows of the combined set look like this:

| | NewsDesk | SectionName | SubsectionName | Headline | Snippet | Abstract | WordCount | PubDate | UniqueID |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Business | Crosswords/Games | NaN | More School Daze | A puzzle from Ethan Cooper that reminds me tha… | A puzzle from Ethan Cooper that reminds me tha… | 508 | 2014-09-01 22:00:09 | 1 |
| 1 | Culture | Arts | NaN | New 96-Page Murakami Work Coming in December | The Strange Library will arrive just three and… | The Strange Library will arrive just three and… | 285 | 2014-09-01 21:14:07 | 2 |
| 2 | Business | Business Day | Dealbook | Public Pension Funds Stay Mum on Corporate Expats | Public pension funds have major stakes in Amer… | Public pension funds have major stakes in Amer… | 1211 | 2014-09-01 21:05:36 | 3 |
combo_clean = combo.fillna('empty')
WordCount needs to be log-transformed because the distribution of counts is heavily right-skewed.
combo_clean['LogWC'] = np.log(.1 + combo_clean['WordCount'])  # small offset keeps the log defined for zero-word articles
%matplotlib inline
import matplotlib.pyplot as plt

plt.subplot(1, 2, 1)
combo_clean['WordCount'].plot(kind='kde', title='Wordcount', xlim=[-10, 3000])
plt.subplot(1, 2, 2)
combo_clean['LogWC'].plot(kind='kde', title='Log(.1 + WordCount)')
plt.show()
import datetime as dt

# 'date' holds the day of week (e.g. 'Monday'), 'hour' the hour of publication
combo_clean['date'] = combo_clean['PubDate'].apply(
    lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S').strftime('%A'))
combo_clean['hour'] = combo_clean['PubDate'].apply(
    lambda x: int(dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S').strftime('%H')))
combo_clean[['PubDate', 'date', 'hour']][1:5]
bins = np.linspace(0, 24, 5)
labels = ['night', 'morning', 'afternoon', 'evening']
# include_lowest=True keeps hour 0 (midnight) in the 'night' bin instead of turning it into NaN
combo_clean['hour_bins'] = pd.cut(combo_clean['hour'], bins=bins, labels=labels, include_lowest=True)
combo_clean.drop(['hour', 'WordCount', 'PubDate'], axis=1, inplace=True)
After cleaning and transformation, the first rows look like this:

| | NewsDesk | SectionName | SubsectionName | Headline | Snippet | Abstract | UniqueID | LogWC | date | hour_bins |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Business | Crosswords/Games | empty | More School Daze | A puzzle from Ethan Cooper that reminds me tha… | A puzzle from Ethan Cooper that reminds me tha… | 1 | 6.230678 | Monday | evening |
| 1 | Culture | Arts | empty | New 96-Page Murakami Work Coming in December | The Strange Library will arrive just three and… | The Strange Library will arrive just three and… | 2 | 5.652840 | Monday | evening |
| 2 | Business | Business Day | Dealbook | Public Pension Funds Stay Mum on Corporate Expats | Public pension funds have major stakes in Amer… | Public pension funds have major stakes in Amer… | 3 | 7.099284 | Monday | evening |
categorical_dummies = pd.get_dummies(combo_clean[['NewsDesk', 'SectionName', 'SubsectionName', 'date', 'hour_bins']])
categorical_dummies.shape
import sklearn.feature_extraction.text as txt
tfidf = txt.TfidfVectorizer(stop_words='english', min_df=10, ngram_range=(1, 2))
tfidf
TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=10,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
bwHeadline = tfidf.fit_transform(combo_clean['Headline']).todense()
bwHeadline_names = np.asarray(tfidf.get_feature_names())
bwHeadline_names = np.core.defchararray.add('H_', bwHeadline_names)
print(bwHeadline.shape)
print(bwHeadline_names.shape)
(8402, 943)
(943,)
bwAbstract = tfidf.fit_transform(combo_clean['Abstract']).todense()
bwAbstract_names = np.asarray(tfidf.get_feature_names())
bwAbstract_names = np.core.defchararray.add('A_', bwAbstract_names)
print(bwAbstract.shape)
print(bwAbstract_names.shape)
(8402, 2480)
(2480,)
Put everything together:
data = np.concatenate((categorical_dummies, bwHeadline, bwAbstract), axis=1)
data.shape

feature_names = np.concatenate((categorical_dummies.columns, bwHeadline_names, bwAbstract_names), axis=0)
feature_names.shape
Make 5 sets:
- data_test: where predictions are made and submitted
- data_train: where the hyperparameters are tuned via CV (80% of the original train set); a tuning sketch follows the split below
- y_train: labels for data_train
- data_val: data unseen by the trained model, used to check its generalization ability (the remaining 20%)
- y_val: labels for data_val
data_train = data[:train.shape[0], :]
data_test = data[train.shape[0]:, :]
import sklearn.cross_validation as cv
data_train, data_val, y_train, y_val = cv.train_test_split(data_train, y, test_size=.2, random_state=seed)
print(data_train.shape)
print(data_val.shape)
print(y_val.shape)
(5225, 3472)
(1307, 3472)
(1307,)
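As noted above, data_train is also where hyperparameters would be tuned via cross-validation. The naïve model in this post deliberately skips tuning, but a later iteration could run a grid search along these lines; the parameter grid below is purely illustrative, not a validated recommendation.

import sklearn.grid_search as gs
import sklearn.ensemble as ens

# Illustrative grid only; none of these values have been validated.
param_grid = {'max_features': ['auto', 'log2', 0.1],
              'min_samples_leaf': [1, 5, 10]}

grid = gs.GridSearchCV(ens.ExtraTreesClassifier(n_estimators=300, random_state=seed, n_jobs=-1),
                       param_grid=param_grid, scoring='roc_auc', cv=5)
grid.fit(data_train, y_train)
print(grid.best_params_)
print(grid.best_score_)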
The idea is to grow many trees and average feature importance over those trees. To further guard against overfitting, let's use sklearn's ExtraTreesClassifier, a modified version of Random Forest: in addition to randomly selecting candidate features at each split, it also picks the split thresholds at random rather than searching for the optimal cut points, and with bootstrap=True each tree is trained on a random bootstrap sample of the data.
import sklearn.ensemble as ens

rf = ens.ExtraTreesClassifier(n_estimators=3000, random_state=seed, n_jobs=-1,
                              bootstrap=True, class_weight='auto')
rf.fit(data_train, y_train)
ExtraTreesClassifier(bootstrap=True, class_weight='auto', criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=3000, n_jobs=-1, oob_score=False, random_state=1, verbose=0, warm_start=False)
importance = np.mean([tree.feature_importances_ for tree in rf.estimators_], axis=0)
std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)
ind = np.argsort(importance)[-30:]  # indices of the 30 most important features
plt.figure(figsize=(15, 15))
plt.title("Feature Importances for 'Randomized' Random Forest (3'000 trees)")
plt.barh(range(30), importance[ind], color="r", xerr=std[ind], alpha=0.4, align="center")
plt.yticks(range(30), feature_names[ind])
plt.ylim([-1, len(ind)])
plt.show()
The conclusions so far are not that reassuring: feature importances averaged over the 3'000 trees show that there are few strong, consistent predictors. From this graph, the only substantial predictors relate to when the post is published, which makes sense.
Learning curves make it possible to distinguish between models dominated by bias and models dominated by variance, and thus lay out a roadmap for whether the model should be relaxed (made more flexible) or, on the contrary, regularized (made more rigid).
import sklearn.learning_curve as lc
train_size, train_scores, test_scores = lc.learning_curve(
    rf, data_train, y_train, train_sizes=np.linspace(0.1, 1.0, 5),
    cv=10, scoring='roc_auc', exploit_incremental_learning=False, n_jobs=-1)
array([ 470, 1528, 2586, 3644, 4702])
mean_train = train_scores.mean(axis=1)
sd_train = train_scores.std(axis=1)
upper_train = mean_train + sd_train
lower_train = mean_train - sd_train

mean_test = test_scores.mean(axis=1)
sd_test = test_scores.std(axis=1)
upper_test = mean_test + sd_test
lower_test = mean_test - sd_test
plt.plot(train_size, mean_train, 'ro-', label='Training')
plt.fill_between(train_size, upper_train, lower_train, alpha=0.1, color='r')
plt.plot(train_size, mean_test, 'bo-', label='Cross-validation')
plt.fill_between(train_size, upper_test, lower_test, alpha=0.1, color='b')
plt.title("Learning curve on RF with 3'000 trees")
plt.grid()
plt.xlabel('sample size')
plt.ylabel('AUC')
plt.legend(loc='lower right', numpoints=1)
plt.show()
The behavior shown in the graph:
- near-perfect 'memorizing' of the train set
- quickly rising performance on the cross-validation folds
is characteristic of a high-variance model, which is not surprising for ~5'200 cases trained on ~3'500 features. Most probably such a model fits noise, and generalization may be sub-par.
Thus, the plan for model improvement includes:
- reducing the number of features, either by filtering for the most important variables (see the sketch below) or by penalizing flexibility (l1 or l2 penalties)
- further feature engineering
- trying other models, e.g. Logistic Regression or Gradient Boosted Trees, which are less sensitive to noise
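As one concrete illustration of the first point, the per-feature importances already averaged above can serve as a crude filter: keep only the features whose importance clears a threshold and refit on the reduced matrix. The snippet below is only a sketch of the idea; the cut-off at the mean importance is arbitrary, not a validated choice.

# Arbitrary cut-off: keep features with above-average importance
keep = np.where(importance > importance.mean())[0]
print('%d of %d features kept' % (len(keep), len(feature_names)))

# The same column filter must be applied to every split, including data_test before submission
data_train_small = data_train[:, keep]
data_val_small = data_val[:, keep]

rf_small = ens.ExtraTreesClassifier(n_estimators=3000, random_state=seed, n_jobs=-1,
                                    bootstrap=True, class_weight='auto')
rf_small.fit(data_train_small, y_train)
# Re-check the cross-validated AUC and the held-out AUC on data_val_small to see
# whether the train/CV gap (variance) actually shrinks.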
import sklearn.metrics as mts
print('AUC on held-out sample is %0.5f' % mts.roc_auc_score(y_val, rf.predict_proba(data_val)[:,1]))
AUC on held-out sample is 0.92616
The resulting AUC of .92 on the held-out sample is very good, provided the model generalizes equally well to the test set. Let's see what model improvements are possible.