Stochastic Gradient Descent (SGD) is a method suited both to online learning, that is, training in real time as data arrives incrementally, and to Big Data. The method's versatility comes from the fact that SGD does not need the full dataset to solve classification or regression problems: the model is updated incrementally, as…
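The incremental update can be sketched as follows; this is a minimal illustration on synthetic data, not the post's own code:

```python
import numpy as np

# Minimal SGD sketch for linear regression: the weights are updated one
# sample at a time, so the full dataset never has to be held in memory.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
lr = 0.01  # learning rate
for epoch in range(5):
    for i in rng.permutation(len(X)):
        grad = (X[i] @ w - y[i]) * X[i]  # gradient of squared error on one sample
        w -= lr * grad

print(w)  # approaches true_w
```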

import math as mt
import numpy as np
import collections as cl

Suppose we have a list of customers whose interests are known to us.

user_items = [
    ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
    ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
    ["R", "Python", "statistics", "regression", "probability"],
    ["machine learning", "regression", "decision trees", "libsvm"],…
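If the next step is to count how popular each interest is across customers (an assumption, since the excerpt is cut off), collections.Counter does this directly:

```python
from collections import Counter

user_items = [
    ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
    ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
]
# Flatten the per-user lists and count occurrences of each interest
popularity = Counter(item for items in user_items for item in items)
print(popularity.most_common(2))  # [('Cassandra', 2), ('HBase', 2)]
```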

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from math import log

def logloss(p, y):
    epsilon = 1e-12
    if p == 0:
        p += epsilon
    if p == 1:
        p -= epsilon
    if y == 1:
        return -log(p)
    if y == 0:
        return -log(1 - p)

def evaluate_logloss(p, labels):
    return sum(map(lambda x: logloss(p, x), labels)) / len(labels)
…

The simplest way to print several variables:

a = 12
b = 123.45
print(a, b, a * b)
12 123.45 1481.4

Note that IPython offers an even simpler way to display values, via a tuple:

a, b, a * b
(12, 123.45, 1481.4)

If we use the first approach, we can specify the desired separator,…

import pandas as pd
from functools import reduce

Map in bare Python

a = [1, 2, 3, 4, 5]

list(map(lambda x: x**2, a))
[1, 4, 9, 16, 25]

list(filter(lambda x: x >= 4, a))
[4, 5]

reduce(lambda x, y: x * y, a)
120

b = range(1000)

%timeit list(map(lambda x: x**2, b))
1000 loops, best of 3: 579 µs per loop

%timeit…
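The same results can be expressed with comprehensions (and, for the product, math.prod, available since Python 3.8):

```python
from functools import reduce
import math

a = [1, 2, 3, 4, 5]
# comprehension equivalents of map and filter; reduce has no comprehension form
assert [x**2 for x in a] == list(map(lambda x: x**2, a))
assert [x for x in a if x >= 4] == list(filter(lambda x: x >= 4, a))
assert reduce(lambda x, y: x * y, a) == math.prod(a) == 120
```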

c1 = ['Russia', 'US', 'Germany']
c2 = ['007', '001', '049']

a = dict(Russia='007', US='001', Germany='049')
a
{'Germany': '049', 'Russia': '007', 'US': '001'}

b = {'Russia': '007', 'US': '001', 'Germany': '049'}
b
{'Germany': '049', 'Russia': '007', 'US': '001'}

c = dict(zip(c1, c2))
c
{'Germany': '049',…

Difference between bytes and strings

When working with data inputs in Python, processing text or doing statistical analysis, we are working with strings.

In [7]: type('café')
Out[7]: str

When reading files from disk into Python we decode binary data into strings, and when saving text to disk we encode strings to binary. The str.encode() method is…
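The round trip looks like this:

```python
s = 'café'
b = s.encode('utf-8')          # str -> bytes
assert type(b) is bytes
assert b == b'caf\xc3\xa9'     # 'é' (U+00E9) becomes two bytes in UTF-8
assert b.decode('utf-8') == s  # bytes -> str restores the original
```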

List comprehensions are an easy way to make a list out of another list. For example:

list1 = [1, 2, 3, 4, 5, 6]
print(list1)
list2 = [x**2 for x in list1]
print(list2)
[1, 2, 3, 4, 5, 6]
[1, 4, 9, 16, 25, 36]

A similar construct exists for dictionaries and is…
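For reference, the dictionary analogue uses braces and a key: value pair:

```python
list1 = [1, 2, 3, 4, 5, 6]
squares = {x: x**2 for x in list1}  # dict comprehension
print(squares[4])  # 16
```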

This is the fourth post in a series of exercises to predict the popularity of a blog post on the NYTimes website. Previous posts are located here: Naïve Random Forest Classifier. This post is about fitting a plain vanilla Random Forest on readily available features, such as the section where a post was published, date, time, and a bag of words extracted…
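Fitting a plain vanilla Random Forest on such features might look like the sketch below; the data here is synthetic and the feature layout is an assumption, not the post's actual dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: the real posts use NYTimes sections, dates, times, and
# bag-of-words columns; here X is random and the "popular" label is made up.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
acc = clf.score(X, y)  # training accuracy; evaluate on a held-out set in practice
print(acc)
```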

This is the fourth post in the series on predicting the popularity of a blog post on NYTimes. The first three are: Naïve Random Forest Classifier; Fitting models on low signal-to-noise data; Feature selection 1: Univariate. In this post I am going to compare in-model feature selection in plain vanilla LogisticRegression vs. that plus fitting Random…
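One common form of in-model feature selection with LogisticRegression is L1 regularisation, which drives the coefficients of uninformative features toward zero. A sketch on synthetic data (an assumed setup, not the post's exact code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 1] > 0).astype(int)  # only the first two features matter

# liblinear supports the 'l1' penalty; small C means strong regularisation
clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
clf.fit(X, y)
print(clf.coef_)  # near-zero weights mark candidate features to drop
```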

© 2014 In R we trust.