Apache Spark RDD

pyspark.SparkContext:
sc.parallelize(data, num)
sc.textFile(file, num)
dir(sc)

from pyspark import SparkContext
sc = SparkContext()
# In Python 2, xrange() is more memory-efficient than range() because it produces values lazily instead of building a full list
data = range(1, 11)
rdd = sc.parallelize(data, 2)

type(sc)
pyspark.context.SparkContext

sc.version
'1.4.0'

type(rdd)
pyspark.rdd.PipelinedRDD

print('The id of rdd is {0}'.format(rdd.id()))
The id of rdd is 1

rdd.setName('My first RDD')
My first…
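The second argument to sc.parallelize(data, num) sets how many partitions the data is divided into. As a rough illustration only, the pure-Python sketch below cuts a sequence into contiguous, roughly equal chunks. The helper name split_into_partitions is hypothetical and is not part of PySpark; this is a simplified model, and Spark's actual partition boundaries may differ.

```python
def split_into_partitions(data, num_slices):
    """Hypothetical helper (not a PySpark API): cut data into
    num_slices contiguous chunks of roughly equal size."""
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    items = list(data)
    n = len(items)
    # Chunk i covers the index range [i*n // num_slices, (i+1)*n // num_slices)
    return [
        items[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]


# Same data and partition count as the session above
partitions = split_into_partitions(range(1, 11), 2)
print(partitions)  # [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
```

On a real RDD, rdd.glom().collect() returns the elements grouped by partition, which is one way to check how Spark actually distributed the data.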

© 2014 In R we trust.