Prepare data

import time
import numpy as np
import matplotlib.pyplot as plt
from collections import OrderedDict
from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.regression import LabeledPoint
%matplotlib inline
sc = SparkContext()

clean.csv is the original file with each NaN filled with the mean of its column.

features = sc.textFile('/home/sergey/MachineLearning/biline/clean.csv')
features = features.map(lambda…
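The mean-fill preprocessing described above (NaNs replaced by the mean of their column) can be sketched with NumPy; the toy matrix here is an illustrative assumption, not the blog's actual data:

```python
import numpy as np

# toy matrix with missing values; the real data would come from clean.csv
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])

# column means computed over the non-NaN entries only
col_means = np.nanmean(X, axis=0)

# replace each NaN with the mean of its column
rows, cols = np.where(np.isnan(X))
X[rows, cols] = col_means[cols]

print(X)  # no NaNs remain: [[1. 6.] [3. 4.] [2. 8.]]
```

After this step every column's mean is unchanged, which is why mean imputation is a common default before training.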

Spark Joins

SparkSQL and Spark DataFrames support join() with inner, outer, left outer, right outer, and semi join types.

Spark pair RDDs: x.join(y) returns key-value pairs [(k, (v₁, v₂)), …] where k is a key common to x and y, and (v₁, v₂) are the matching values from x and y. Related methods: leftOuterJoin(), rightOuterJoin(), fullOuterJoin().

x = sc.parallelize([('a', 2), ('b', 3)])
y = sc.parallelize([('a', 3), ('a', 2), ('a', 5)])
x.join(y).collect()
[('a', (2, 3)), ('a', (2,…
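The pair-RDD join semantics above can be sketched without a cluster; this pure-Python stand-in (an illustration of the semantics, not PySpark's implementation) reproduces the inner- and left-outer-join behaviour on the same toy pairs:

```python
def inner_join(x, y):
    # emit (k, (vx, vy)) for every pairing of values that share a key,
    # mirroring PairRDD.join()
    return [(k, (vx, vy)) for k, vx in x for k2, vy in y if k == k2]

def left_outer_join(x, y):
    # mirror leftOuterJoin(): keep every left-side key,
    # padding unmatched right-side values with None
    out = []
    for k, vx in x:
        matches = [vy for k2, vy in y if k2 == k]
        if matches:
            out.extend((k, (vx, vy)) for vy in matches)
        else:
            out.append((k, (vx, None)))
    return out

x = [('a', 2), ('b', 3)]
y = [('a', 3), ('a', 2), ('a', 5)]
print(inner_join(x, y))       # [('a', (2, 3)), ('a', (2, 2)), ('a', (2, 5))]
print(left_outer_join(x, y))  # also keeps ('b', (3, None))
```

Note how 'b' disappears from the inner join but survives the left outer join with a None placeholder; fullOuterJoin() would additionally keep right-only keys.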

Apache Spark RDD

pyspark.SparkContext provides sc.parallelize(data, num) and sc.textFile(file, num); dir(sc) lists the rest of its API.

from pyspark import SparkContext
sc = SparkContext()
# in Python 2, xrange() is more memory-efficient than range() because it is lazy
data = range(1, 11)
rdd = sc.parallelize(data, 2)
type(sc)
pyspark.context.SparkContext
sc.version
'1.4.0'
type(rdd)
pyspark.rdd.PipelinedRDD
print('The id of rdd is {0}'.format(rdd.id()))
The id of rdd is 1
rdd.setName('My first RDD')
My first…
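How sc.parallelize(data, num) splits a local collection into partitions can be illustrated with a small pure-Python partitioner; this is a sketch of the contiguous-slicing idea, not Spark's actual code:

```python
def partition(data, num):
    # split data into num contiguous slices, the way
    # sc.parallelize(data, num) distributes a local collection
    data = list(data)
    n = len(data)
    return [data[n * i // num : n * (i + 1) // num] for i in range(num)]

print(partition(range(1, 11), 2))  # [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
```

With num=2, the ten elements above land in two partitions of five, which is why operations on rdd run as two parallel tasks.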

© 2014 In R we trust.