Tuesday, September 13, 2016

How to use join in Spark

Let's illustrate with an example. Start from two RDDs of key-value pairs:
val a = sc.parallelize(List((1, "a"), (3, "c"), (5, "e"), (6, "f")))
val b = sc.parallelize(List((1, "a"), (2, "b"), (3, "c"), (4, "d")))

a.join(b).collect
    Array[(Int, (String, String))] = Array((1,(a,a)), (3,(c,c)))

a.leftOuterJoin(b).collect
    Array[(Int, (String, Option[String]))] = Array((1,(a,Some(a))), (6,(f,None)), (3,(c,Some(c))), (5,(e,None)))

a.rightOuterJoin(b).collect
    Array[(Int, (Option[String], String))] = Array((4,(None,d)), (1,(Some(a),a)), (3,(Some(c),c)), (2,(None,b)))

a.fullOuterJoin(b).collect
    Array[(Int, (Option[String], Option[String]))] = Array((4,(None,Some(d))), (1,(Some(a),Some(a))), (6,(Some(f),None)), (3,(Some(c),Some(c))), (5,(Some(e),None)), (2,(None,Some(b))))

a.subtractByKey(b).collect
    Array[(Int, String)] = Array((5,e), (6,f))

If a or b contains duplicate keys, the join produces one output row for every matching pair of values, i.e. the Cartesian product of that key's values from both sides.
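To make the duplicate-key behavior concrete, here is a small sketch in the same spark-shell style as the examples above (the RDDs x and y are made up for illustration; the ordering of the collected rows is not guaranteed):

```scala
// Key 1 appears twice in both RDDs; join emits every pairing of matching values.
val x = sc.parallelize(List((1, "a1"), (1, "a2"), (2, "b")))
val y = sc.parallelize(List((1, "x1"), (1, "x2")))

x.join(y).collect
// Key 1 yields 2 × 2 = 4 rows; key 2 has no match in y, so it is dropped:
//   Array((1,(a1,x1)), (1,(a1,x2)), (1,(a2,x1)), (1,(a2,x2)))
```

The same multiplication happens in the outer-join variants, so a skewed key with many duplicates on both sides can blow up the result size.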
