rdd - Returning keys from key-value pairs based on maximum value in Apache Spark


I'm new to Apache Spark and need advice. I have an RDD of type [String, Int], with values like:

  • ("a,x",3)
  • ("a,y",4)
  • ("a,z",1)
  • ("b,y",2)
  • ("c,w",5)
  • ("c,y",2)
  • ("e,x",1)
  • ("e,z",3)

What I want is an RDD of (String, String):

  • ("a","y") //among keys containing "a", (a,y) has the max value
  • ("b","y") //among keys containing "b", (b,y) has the max value
  • ("c","w") //among keys containing "c", (c,w) has the max value
  • ("e","z") //among keys containing "e", (e,z) has the max value

I tried a loop concept (using a counter) inside flatMap, but it doesn't work. Is there an easy way to do this?

Just reshape and reduceByKey:

val pattern = "^(.*?),(.*?)$".r

rdd
  // split the key into its two parts
  .flatMap { case (pattern(x, y), z) => Some((x, (y, z))) }
  // reduce by the first part of the key, keeping the pair with the larger value
  .reduceByKey((a, b) => if (a._2 > b._2) a else b)
  // keep only the winning second part, giving an RDD[(String, String)]
  .map { case (x, (y, _)) => (x, y) }
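If you want to sanity-check the reduction logic without a SparkContext, the same flatMap / reduce-per-key / final-map pipeline can be sketched on plain Scala collections; reduceByKey behaves like grouping by key and then applying the same pairwise reduce to each group's values. The object and method names below are made up for illustration:

```scala
// Local sketch of the Spark pipeline using plain Scala collections
// (hypothetical names; no Spark required).
object MaxPerKey {
  def maxPerKey(pairs: Seq[(String, Int)]): Map[String, String] =
    pairs
      // split each "x,y" key into its two parts, like the flatMap step
      .flatMap { case (k, v) =>
        k.split(",", 2) match {
          case Array(x, y) => Some((x, (y, v)))
          case _           => None
        }
      }
      // group by the first key part, like reduceByKey's shuffle
      .groupBy(_._1)
      .map { case (x, grouped) =>
        // same associative reduce that reduceByKey applies per key
        val (y, _) = grouped.map(_._2).reduce((a, b) => if (a._2 > b._2) a else b)
        (x, y)
      }
}
```

Running it on the sample data from the question yields Map("a" -> "y", "b" -> "y", "c" -> "w", "e" -> "z"), matching the expected output.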
