hadoop - How to correlate all combination of arrays in an RDD?

- September 15, 2012

I have an RDD, returned by model.productFeatures(), in the form (id, array("d", (...))). Example:

(1, array("d", (0, 1, 2)))
(2, array("d", (4, 3, 2)))
(3, array("d", (5, 3, 0)))
...

I need to calculate the pairwise correlation between each pair of arrays and return, for each id, the id of the array with which it has the highest correlation.

The first thing you need is all pairs of elements, except for the "diagonal", where both elements are the same.

>>> rdd.cartesian(rdd).filter(lambda (x, y): x != y).collect()
[((1, array('d', [0.0, 1.0, 2.0])), (2, array('d', [4.0, 3.0, 2.0]))),
 ((1, array('d', [0.0, 1.0, 2.0])), (3, array('d', [5.0, 3.0, 0.0]))),
 ((2, array('d', [4.0, 3.0, 2.0])), (1, array('d', [0.0, 1.0, 2.0]))),
 ((3, array('d', [5.0, 3.0, 0.0])), (1, array('d', [0.0, 1.0, 2.0]))),
 ((2, array('d', [4.0, 3.0, 2.0])), (3, array('d', [5.0, 3.0, 0.0]))),
 ((3, array('d', [5.0, 3.0, 0.0])), (2, array('d', [4.0, 3.0, 2.0])))]
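The tuple-unpacking lambda used above (`lambda (x, y): ...`) is Python 2 syntax and is a SyntaxError in Python 3. As a minimal sketch, assuming the same three sample records, the pairing step can be checked locally without Spark. Here `itertools.product` stands in for `rdd.cartesian(rdd)`, and `records` and `not_same` are hypothetical names introduced for illustration, not part of the original post:

```python
from array import array
from itertools import product

# Sample records in the (id, array) form shown in the question.
records = [
    (1, array("d", [0, 1, 2])),
    (2, array("d", [4, 3, 2])),
    (3, array("d", [5, 3, 0])),
]

def not_same(pair):
    """Python 3 friendly predicate: drop the "diagonal" (x, x) pairs."""
    x, y = pair
    return x != y

# Local stand-in for rdd.cartesian(rdd).filter(not_same).collect()
pairs = [p for p in product(records, records) if not_same(p)]

# Show only the ids of each remaining ordered pair.
for (id1, _), (id2, _) in pairs:
    print(id1, id2)
```

With n records this leaves n * (n - 1) ordered pairs, so 6 for the sample above. With Spark, the same predicate could presumably be passed as `rdd.cartesian(rdd).filter(not_same)`.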

Then write a function to calculate the correlation and rearrange the result to prepare it for the last step. Let's assume that "correlation" means whatever numpy.correlate does.

import numpy as np

# Python 2 syntax: the nested tuple argument is unpacked in the signature.
def corr_pair(((id1, a1), (id2, a2))):
    return id1, (id2, np.correlate(a1, a2)[0])

>>> rdd.cartesian(rdd).filter(lambda (p1, p2): p1 != p2).map(corr_pair).collect()
[(1, (2, 7.0)), (1, (3, 3.0)), (2, (1, 7.0)), (3, (1, 3.0)), (2, (3, 29.0)), (3, (2, 29.0))]
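For two 1-D arrays of equal length, `np.correlate` in its default mode ("valid") returns a one-element array holding the sum of the element-wise products, which is why `[0]` is taken above. The following sketch reproduces those numbers in plain Python, with no NumPy. The `dot` helper is a hypothetical name introduced only for this illustration:

```python
from array import array

def dot(a1, a2):
    """Sum of element-wise products of two equal-length sequences."""
    if len(a1) != len(a2):
        raise ValueError("sequences must have the same length")
    return sum(x * y for x, y in zip(a1, a2))

a = array("d", [0, 1, 2])
b = array("d", [4, 3, 2])
c = array("d", [5, 3, 0])

print(dot(a, b))  # 7.0
print(dot(a, c))  # 3.0
print(dot(b, c))  # 29.0
```

Note that this is an unnormalized score, not a correlation coefficient in the range [-1, 1], so arrays with larger values tend to score higher.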

To get the 2nd id with the maximum correlation for each 1st id, you can use reduceByKey and keep the bigger one:

# Python 2 syntax: each (id, correlation) argument is unpacked in the signature.
def keep_higher((id1, c1), (id2, c2)):
    if c1 > c2:
        return id1, c1
    else:
        return id2, c2

>>> rdd.cartesian(rdd).filter(lambda (x, y): x != y).map(corr_pair).reduceByKey(keep_higher).collect()
[(1, (2, 7.0)), (2, (3, 29.0)), (3, (2, 29.0))]
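To tie the steps together, here is a sketch of the same logic written without Python 2 tuple parameters, so it also runs under Python 3. It works locally: a dictionary loop emulates what `reduceByKey` does per key, and a sum of products stands in for `np.correlate(a1, a2)[0]` on equal-length arrays. The `records` list and the `reduce_by_key` helper are assumptions made for illustration, not part of the original post:

```python
from array import array
from itertools import product

records = [
    (1, array("d", [0, 1, 2])),
    (2, array("d", [4, 3, 2])),
    (3, array("d", [5, 3, 0])),
]

def corr_pair(pair):
    """Map ((id1, a1), (id2, a2)) to (id1, (id2, score))."""
    (id1, a1), (id2, a2) = pair
    # Stand-in for np.correlate(a1, a2)[0] on equal-length arrays.
    score = sum(x * y for x, y in zip(a1, a2))
    return id1, (id2, score)

def keep_higher(left, right):
    """Keep whichever (id, score) tuple has the larger score."""
    return left if left[1] > right[1] else right

def reduce_by_key(func, keyed_values):
    """Local emulation of reduceByKey: fold values that share a key."""
    result = {}
    for key, value in keyed_values:
        result[key] = func(result[key], value) if key in result else value
    return result

# All ordered pairs except the "diagonal", then score and reduce per id.
pairs = (p for p in product(records, records) if p[0] != p[1])
best = reduce_by_key(keep_higher, map(corr_pair, pairs))
print(sorted(best.items()))
# [(1, (2, 7.0)), (2, (3, 29.0)), (3, (2, 29.0))]
```

With Spark, the two functions should plug into the chain as `rdd.cartesian(rdd).filter(lambda p: p[0] != p[1]).map(corr_pair).reduceByKey(keep_higher)`, with `np.correlate` restored inside `corr_pair` if preferred. Keep in mind that the cartesian product grows quadratically with the number of records.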
