Spark Joins

Spark SQL and the Spark DataFrame join() method support these join types:

  • inner
  • outer
  • left outer
  • right outer
  • left semi (semijoin)
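
The semijoin is probably the least familiar entry in this list. As a rough illustration of its semantics only, here is a plain-Python sketch on lists of tuples. It is not Spark code, and the names `inner_join`, `left_semi_join`, `orders` and `customers` are made up for this example. A left semi join keeps each left row that has at least one match on the right, returns only the left columns, and never duplicates a left row:

```python
def inner_join(left, right):
    """All (key, (left_value, right_value)) combinations for matching keys."""
    return [(k, (lv, rv)) for k, lv in left for rk, rv in right if k == rk]


def left_semi_join(left, right):
    """Left rows whose key appears in right; right values are not returned."""
    right_keys = {k for k, _ in right}
    return [(k, lv) for k, lv in left if k in right_keys]


# Hypothetical sample data: "alice" matches twice on the right side.
orders = [("alice", 10), ("bob", 20), ("carol", 30)]
customers = [("alice", "NY"), ("alice", "LA"), ("bob", "SF")]

inner_rows = inner_join(orders, customers)
semi_rows = left_semi_join(orders, customers)

print(inner_rows)  # "alice" appears once per matching customer row
print(semi_rows)   # "alice" appears once; "carol" is dropped
```

With DataFrames the join type is chosen through the `how` argument of join(), for example `df1.join(df2, on="key", how="left_semi")`, where `df1`, `df2` and `key` are placeholder names.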

Spark PairRDD:

  • x.join(y):

    returns key-value pairs [(k, (v₁, v₂)), ...] where:

    • k is a key present in both x and y
    • v₁ is a value for k in x and v₂ is a value for k in y, with one pair for every such combination
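  • x.leftOuterJoin(y): keeps every key of x; v₂ is None where y has no matching key
  • x.rightOuterJoin(y): keeps every key of y; v₁ is None where x has no matching key
  • x.fullOuterJoin(y): keeps every key of either RDD, with None on the missing side

For example, in PySpark (sc is the SparkContext):

x = sc.parallelize([('a',2), ('b',3)])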
y = sc.parallelize([('a',3), ('a',2), ('a',5)])

x.join(y).collect()
[('a', (2, 3)), ('a', (2, 2)), ('a', (2, 5))]
x.leftOuterJoin(y).collect()
[('a', (2, 3)), ('a', (2, 2)), ('a', (2, 5)), ('b', (3, None))]
x.rightOuterJoin(y).collect()
[('a', (2, 3)), ('a', (2, 2)), ('a', (2, 5))]
x.fullOuterJoin(y).collect()
[('a', (2, 3)), ('a', (2, 2)), ('a', (2, 5)), ('b', (3, None))]
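
To see where these results come from, the four joins can be imitated in plain Python on ordinary lists. This is only a sketch of the semantics, not how Spark implements joins; `group_by_key`, `join`, `keep_left` and `keep_right` are names invented for the illustration. Note also that Spark does not guarantee the order of the elements returned by collect(), whereas this sketch is deterministic.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Map each key to the list of its values, keeping insertion order."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups


def join(x, y, keep_left=False, keep_right=False):
    """Join two lists of (key, value) pairs.

    keep_left / keep_right decide whether keys without a match on the
    other side are kept and padded with None.
    """
    gx, gy = group_by_key(x), group_by_key(y)
    out = []
    # Visit the keys of x first, then the keys that occur only in y.
    for k in list(gx) + [k for k in gy if k not in gx]:
        xs, ys = gx.get(k, []), gy.get(k, [])
        if not xs and keep_right:
            xs = [None]
        if not ys and keep_left:
            ys = [None]
        out.extend((k, (a, b)) for a in xs for b in ys)
    return out


x = [('a', 2), ('b', 3)]
y = [('a', 3), ('a', 2), ('a', 5)]

inner = join(x, y)                                   # like x.join(y)
left = join(x, y, keep_left=True)                    # like x.leftOuterJoin(y)
right = join(x, y, keep_right=True)                  # like x.rightOuterJoin(y)
full = join(x, y, keep_left=True, keep_right=True)   # like x.fullOuterJoin(y)

for name, rows in [("inner", inner), ("left", left), ("right", right), ("full", full)]:
    print(name, rows)
```

The right outer join equals the inner join here because every key of y ('a') also occurs in x, and the full outer join equals the left outer join for the same reason.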

Union of two RDDs

x.union(y).collect()
[('a', 2), ('b', 3), ('a', 3), ('a', 2), ('a', 5)]
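
As the output shows, union() keeps duplicates: ('a', 2) appears twice because it occurs in both RDDs. A plain-Python analogy (a sketch on lists, with the variable names `union_all` and `deduplicated` chosen for this example) is list concatenation, with a set used when duplicates are unwanted:

```python
x = [('a', 2), ('b', 3)]
y = [('a', 3), ('a', 2), ('a', 5)]

# Concatenation keeps every element of both inputs, duplicates included.
union_all = x + y

# Removing duplicates; sorted() only makes the printed order predictable.
deduplicated = sorted(set(union_all))

print(union_all)
print(deduplicated)
```

On RDDs the corresponding step would be x.union(y).distinct().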

Cartesian product of two RDDs

a = x.cartesian(y).collect()
a
[(('a', 2), ('a', 3)),
 (('a', 2), ('a', 2)),
 (('a', 2), ('a', 5)),
 (('b', 3), ('a', 3)),
 (('b', 3), ('a', 2)),
 (('b', 3), ('a', 5))]
a[0]
(('a', 2), ('a', 3))
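
The Cartesian product pairs every element of x with every element of y, so the result has len(x) × len(y) elements (2 × 3 = 6 above). A plain-Python sketch with itertools.product shows the same pairs and also how an inner join can be read as the Cartesian product filtered on equal keys. The names `pairs` and `joined` are chosen for this illustration, and Spark does not necessarily compute joins this way.

```python
from itertools import product

x = [('a', 2), ('b', 3)]
y = [('a', 3), ('a', 2), ('a', 5)]

# Every element of x combined with every element of y.
pairs = list(product(x, y))

# Keep only the combinations whose keys agree, reshaped as (k, (v1, v2)).
joined = [(kx, (vx, vy)) for (kx, vx), (ky, vy) in pairs if kx == ky]

print(len(pairs))
print(pairs[0])
print(joined)
```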

© 2014 In R we trust.