scala - How to select rows from an RDD only if the first-field value is present in the second field of the RDD


I have an RDD with 3 fields, as shown below.

    1,2,6
    2,4,6
    1,4,9
    3,4,7
    2,3,8

From the RDD above, I want to get the following RDD.

    2,4,6
    3,4,7
    2,3,8

The resulting RDD does not have the rows starting with 1, because 1 does not appear anywhere in the second field of the input RDD.

OK, if I understood correctly what you want to do, there are two ways:

  1. Split the RDD in two: the first RDD contains the unique values of the second field, and the second RDD is keyed by the first field. Then join the two RDDs. The drawback of this approach is that `distinct` and `join` are slow operations.

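    import org.apache.spark.rdd.RDD

    // sc is the SparkContext (predefined in spark-shell)
    val r: RDD[(String, String, Int)] = sc.parallelize(Seq(
      ("1", "2", 6),
      ("2", "4", 6),
      ("1", "4", 9),
      ("3", "4", 7),
      ("2", "3", 8)
    ))

    // Distinct second-field values as keys; distinct ensures each row matches at most once
    val uniqueValues: RDD[(String, Unit)] = r.map(x => (x._2, ())).distinct
    // Every row keyed by its first field
    val r1: RDD[(String, (String, String, Int))] = r.map(x => x._1 -> x)

    // The inner join keeps only rows whose first field occurs as a second field
    val result: RDD[(String, String, Int)] = r1.join(uniqueValues).map { case (_, (x, _)) => x }

    result.collect.foreach(println)
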
  2. If the RDD is relatively small and the set of second-field values fits in memory on every node, you can build an in-memory set as a first step, broadcast it to all nodes, and then filter the RDD:

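    import org.apache.spark.rdd.RDD

    // sc is the SparkContext (predefined in spark-shell)
    val r: RDD[(String, String, Int)] = sc.parallelize(Seq(
      ("1", "2", 6),
      ("2", "4", 6),
      ("1", "4", 9),
      ("3", "4", 7),
      ("2", "3", 8)
    ))

    // Collect the distinct second-field values to the driver and broadcast them as a Set
    val uniqueValues = sc.broadcast(r.map(x => x._2).distinct.collect.toSet)

    // Keep only rows whose first field is in the broadcast set
    val result: RDD[(String, String, Int)] = r.filter(x => uniqueValues.value.contains(x._1))

    result.collect.foreach(println)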

Both examples print the following rows (the order may differ between runs and between the two approaches):

    (2,4,6)
    (2,3,8)
    (3,4,7)
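
To check which rows should survive without a cluster, the same filtering rule can be sketched on plain Scala collections. This is only an illustration under stated assumptions: the object name `SecondFieldFilterSketch` and the helper `keepIfFirstInSecond` are made up for this example and are not part of Spark.

```scala
// Plain Scala collections only; no Spark dependency is assumed here.
// SecondFieldFilterSketch and keepIfFirstInSecond are hypothetical names.
object SecondFieldFilterSketch {
  type Row = (String, String, Int)

  val rows: Seq[Row] = Seq(
    ("1", "2", 6),
    ("2", "4", 6),
    ("1", "4", 9),
    ("3", "4", 7),
    ("2", "3", 8)
  )

  // Keep a row only if its first field occurs as the second field of some row.
  def keepIfFirstInSecond(input: Seq[Row]): Seq[Row] = {
    val secondValues: Set[String] = input.map(_._2).toSet
    input.filter(row => secondValues.contains(row._1))
  }

  def main(args: Array[String]): Unit =
    keepIfFirstInSecond(rows).foreach(println)
}
```

Because a local `Seq` keeps its order, this version returns the surviving rows in input order, which makes it convenient as a reference when comparing against the unordered result of either Spark approach.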
