Why does this example result in NaN?
I'm looking at the documentation for `Statistics.corr` in PySpark: https://spark.apache.org/docs/1.1.0/api/python/pyspark.mllib.stat.statistics-class.html#corr.
Why does the correlation here result in NaN?
>>> rdd = sc.parallelize([Vectors.dense([1, 0, 0, -2]),
...                       Vectors.dense([4, 5, 0, 3]),
...                       Vectors.dense([6, 7, 0, 8]),
...                       Vectors.dense([9, 0, 0, 1])])
>>> pearsonCorr = Statistics.corr(rdd)
>>> print str(pearsonCorr).replace('nan', 'NaN')
[[ 1.          0.05564149         NaN  0.40047142]
 [ 0.05564149  1.                 NaN  0.91359586]
 [        NaN         NaN  1.                 NaN]
 [ 0.40047142  0.91359586         NaN  1.        ]]
It's pretty simple. The Pearson correlation coefficient is defined as follows:

corr(X, Y) = cov(X, Y) / (stddev(X) * stddev(Y))

Since the standard deviation of the third column ([0, 0, 0, 0]) is equal to 0, the whole equation results in NaN: the denominator is zero, so every correlation involving that column is 0/0.
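You can reproduce this outside Spark. The following is a sketch using plain NumPy (not Spark's implementation) that computes the Pearson coefficient directly from the definition above, on the same four vectors from the question:

```python
import numpy as np

# Same four vectors as in the question, stacked as rows.
data = np.array([
    [1, 0, 0, -2],
    [4, 5, 0, 3],
    [6, 7, 0, 8],
    [9, 0, 0, 1],
], dtype=float)

def pearson(x, y):
    """Pearson correlation: cov(X, Y) / (stddev(X) * stddev(Y))."""
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    # If either column is constant, its stddev is 0 and this is 0/0 -> NaN.
    return cov / (x.std() * y.std())

# Column index 2 is all zeros, so any correlation with it is NaN.
with np.errstate(invalid='ignore'):
    print(pearson(data[:, 0], data[:, 1]))  # finite value, matches Spark's 0.05564149
    print(pearson(data[:, 0], data[:, 2]))  # nan
```

Note that whether you use the population or sample standard deviation doesn't matter here: the 1/n (or 1/(n-1)) factors cancel between numerator and denominator, so the result matches Spark's matrix entry for the non-constant columns, and is NaN for the constant one either way.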