Returning keys from key-value pairs based on maximum value in Apache Spark
I'm new to Apache Spark and need advice. I have an RDD of (String, Int) type. Its values look like this:
- ("a,x",3)
- ("a,y",4)
- ("a,z",1)
- ("b,y",2)
- ("c,w",5)
- ("c,y",2)
- ("e,x",1)
- ("e,z",3)
What I want is an RDD of (String, String):
- ("a","y") // among keys containing a, (a,y) has the max value
- ("b","y") // among keys containing b, (b,y) has the max value
- ("c","w") // among keys containing c, (c,w) has the max value
- ("e","z") // among keys containing e, (e,z) has the max value
I tried a loop-based approach (using a counter) inside flatMap, but it doesn't work. Is there an easy way to do this?
Just reshape and `reduceByKey`:

```scala
val pattern = "^(.*?),(.*?)$".r

rdd
  // split the key into its two parts
  .flatMap { case (pattern(x, y), z) => Some((x, (y, z))) }
  // for each first part of the key, keep the pair with the largest value
  .reduceByKey((a, b) => if (a._2 > b._2) a else b)
  // go back to the original shape
  .map { case (x, (y, z)) => (s"$x,$y", z) }
```