scala - How to select rows from an RDD only if the value of the first field is present in the second field of the RDD
I have an RDD with 3 fields, as shown below.
1,2,6
2,4,6
1,4,9
3,4,7
2,3,8
From the above RDD, I want to get the following RDD.
2,4,6
3,4,7
2,3,8
The resulting RDD does not have the rows starting with 1, because 1 never appears in the second field of the input RDD.
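OK, if I understood correctly what you want to do, there are 2 ways:
Split the RDD in two, where the first RDD contains the unique values of the "second field" and the second RDD has the "first value" as its key, then join the two RDDs together. The drawback of this approach is that distinct and join are slow operations.
    import org.apache.spark.rdd.RDD

    // sc is the SparkContext (predefined in spark-shell)
    val r: RDD[(String, String, Int)] = sc.parallelize(Seq(
      ("1", "2", 6), ("2", "4", 6), ("1", "4", 9), ("3", "4", 7), ("2", "3", 8)
    ))
    // unique second-field values, keyed so they can be joined
    val uniqueValues: RDD[(String, Unit)] = r.map(x => x._2 -> ()).distinct()
    // every row keyed by its first field
    val r1: RDD[(String, (String, String, Int))] = r.map(x => x._1 -> x)
    // the inner join keeps only rows whose key is among the unique values
    val result: RDD[(String, String, Int)] = r1.join(uniqueValues).map { case (_, (x, _)) => x }
    result.collect().foreach(println)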
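One detail worth spelling out about the join approach: the distinct call is not optional. A pair-RDD join emits one record per matching pair of keys, so a value that occurs several times in the second field would otherwise duplicate the matching rows. The sketch below imitates that behaviour with plain Scala collections; the data and the joinKeys helper are hypothetical stand-ins made up for illustration, not Spark API.

```scala
object JoinDuplicates {
  type Row = (String, String, Int)

  // Hypothetical data: "2" occurs twice as a second field.
  val rows: Seq[Row] = Seq(("1", "2", 6), ("3", "2", 7), ("2", "4", 6))

  // Local stand-in for an inner join on keys:
  // one output row per matching (left key, right key) pair.
  def joinKeys(left: Seq[(String, Row)], right: Seq[String]): Seq[Row] =
    for {
      (key, row) <- left
      other <- right
      if key == other
    } yield row

  // Rows keyed by their first field.
  val keyed: Seq[(String, Row)] = rows.map(r => r._1 -> r)
  // Second-field values, duplicates included: "2", "2", "4".
  val seconds: Seq[String] = rows.map(_._2)

  // Without distinct, the row starting with "2" is emitted twice.
  val withoutDistinct: Seq[Row] = joinKeys(keyed, seconds)
  // With distinct, it is emitted once.
  val withDistinct: Seq[Row] = joinKeys(keyed, seconds.distinct)

  def main(args: Array[String]): Unit = {
    println(withoutDistinct)
    println(withDistinct)
  }
}
```

So the distinct step costs a shuffle, but dropping it would change the result whenever a second-field value repeats and also occurs as a first field.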
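If the RDD is relatively small and the Set of second values can fit completely in memory on all the nodes, you can create that in-memory set as a first step, broadcast it to all nodes, and then just filter the RDD:
    import org.apache.spark.rdd.RDD

    // sc is the SparkContext (predefined in spark-shell)
    val r: RDD[(String, String, Int)] = sc.parallelize(Seq(
      ("1", "2", 6), ("2", "4", 6), ("1", "4", 9), ("3", "4", 7), ("2", "3", 8)
    ))
    // collect the unique second-field values to the driver and broadcast them as a Set
    val uniqueValues = sc.broadcast(r.map(x => x._2).distinct().collect().toSet)
    // keep only rows whose first field is in the broadcast Set
    val result: RDD[(String, String, Int)] = r.filter(x => uniqueValues.value.contains(x._1))
    result.collect().foreach(println)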
Both examples output the following rows (the order may differ):
(2,4,6)
(2,3,8)
(3,4,7)
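As a quick sanity check that needs no Spark cluster, the selection rule itself can be sketched on a local Scala collection. This is only an illustration: the object name FilterByPresence and the value names are made up here, and it uses a plain Seq and Set rather than RDDs.

```scala
object FilterByPresence {
  // The sample rows from the question.
  val rows: Seq[(String, String, Int)] = Seq(
    ("1", "2", 6), ("2", "4", 6), ("1", "4", 9), ("3", "4", 7), ("2", "3", 8)
  )

  // Every value that occurs in the second field.
  val secondValues: Set[String] = rows.map(_._2).toSet

  // Keep a row only when its first field is one of those values.
  val result: Seq[(String, String, Int)] =
    rows.filter(row => secondValues.contains(row._1))

  def main(args: Array[String]): Unit = result.foreach(println)
}
```

On a local Seq the filter keeps the input order, so the rows come out as (2,4,6), (3,4,7), (2,3,8), which is the RDD asked for in the question.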