hadoop - Why does this example result in NaN?


I'm looking at the documentation for Statistics.corr in PySpark: https://spark.apache.org/docs/1.1.0/api/python/pyspark.mllib.stat.statistics-class.html#corr.

Why does the correlation here result in NaN?

>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.mllib.stat import Statistics
>>> rdd = sc.parallelize([Vectors.dense([1, 0, 0, -2]), Vectors.dense([4, 5, 0, 3]),
...                       Vectors.dense([6, 7, 0,  8]), Vectors.dense([9, 0, 0, 1])])
>>> pearsonCorr = Statistics.corr(rdd)
>>> print str(pearsonCorr).replace('nan', 'NaN')
[[ 1.          0.05564149         NaN  0.40047142]
 [ 0.05564149  1.                 NaN  0.91359586]
 [        NaN         NaN  1.                 NaN]
 [ 0.40047142  0.91359586         NaN  1.        ]]

It's pretty simple. The Pearson correlation coefficient is defined as follows:

ρ(X, Y) = cov(X, Y) / (σ_X σ_Y)

Since the standard deviation of the third column ([0, 0, 0, 0]) is equal to 0, the whole equation results in NaN.
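
You can reproduce the same effect outside Spark with plain NumPy (a minimal sketch; np.corrcoef is used here only as a stand-in for Spark's Pearson computation):

import numpy as np

# Same data as the Spark example; the third column is constant.
data = np.array([[1, 0, 0, -2],
                 [4, 5, 0, 3],
                 [6, 7, 0, 8],
                 [9, 0, 0, 1]], dtype=float)

# The third column's standard deviation is 0, so every correlation
# involving it divides by zero and comes out as NaN.
print data.std(axis=0)              # third entry is 0.0
print np.corrcoef(data, rowvar=0)   # row/column 2 is all NaN

Note that NumPy also reports NaN on that column's own diagonal entry (0/0), whereas Spark puts a 1 there; the off-diagonal entries match Spark's output.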

