Correlation

Calculating the correlation between two series of data is a common operation in statistics. Correlation measures whether two variables (or two feature columns) tend to move together, either in the same direction or in opposite directions. The idea is to detect whether one variable or feature column can be predicted from another.

spark.ml provides methods for computing Pearson’s and Spearman’s correlation.

Pearson correlation checks for a linear relationship between two continuous variables (or feature columns).

Spearman correlation checks for a relationship between two ordinal variables.

Spearman correlation evaluates the monotonic relationship between two continuous or ordinal variables. In a monotonic relationship, the variables tend to change together, but not necessarily at a constant rate. The Spearman correlation coefficient is based on the ranked values for each variable rather than the raw data.
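To make the "ranked values" idea concrete, here is a minimal plain-Scala sketch (no Spark required) that computes the Spearman coefficient as the Pearson coefficient of the ranks. The object name `SpearmanSketch` and its helper functions are hypothetical names chosen for illustration, and averaging the ranks of tied values is an assumption of this sketch; it is not presented as Spark's implementation.

```scala
object SpearmanSketch {
  // Pearson correlation: covariance divided by the product of the standard deviations.
  // A constant series has zero variance, so the result is 0.0 / 0.0 = NaN.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.length == ys.length && xs.nonEmpty, "series must be non-empty and of equal length")
    val mx = xs.sum / xs.length
    val my = ys.sum / ys.length
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    val vx = xs.map(x => (x - mx) * (x - mx)).sum
    val vy = ys.map(y => (y - my) * (y - my)).sum
    cov / math.sqrt(vx * vy)
  }

  // 1-based ranks; tied values all receive the average rank of their group (assumption).
  def ranks(xs: Seq[Double]): Seq[Double] = {
    val sorted = xs.zipWithIndex.sortBy(_._1)
    val out = Array.fill(xs.length)(0.0)
    var i = 0
    while (i < sorted.length) {
      var j = i
      while (j + 1 < sorted.length && sorted(j + 1)._1 == sorted(i)._1) j += 1
      val avgRank = (i + j) / 2.0 + 1.0
      for (k <- i to j) out(sorted(k)._2) = avgRank
      i = j + 1
    }
    out.toSeq
  }

  // Spearman correlation: Pearson correlation applied to the ranks instead of the raw values.
  def spearman(xs: Seq[Double], ys: Seq[Double]): Double =
    pearson(ranks(xs), ranks(ys))

  def main(args: Array[String]): Unit = {
    val first = Seq(1.0, 4.0, 6.0, 9.0)   // first feature column of the example data
    val fourth = Seq(-2.0, 3.0, 8.0, 1.0) // fourth feature column of the example data
    println(s"Pearson:  ${pearson(first, fourth)}")
    println(s"Spearman: ${spearman(first, fourth)}")
  }
}
```

For the first and fourth feature columns of the example data, this gives a Pearson coefficient of roughly 0.4005 and a Spearman coefficient of roughly 0.4. An all-zero column yields NaN because its variance is zero.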

import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
// Needed for toDF; assumes a SparkSession named `spark`, as in spark-shell.
import spark.implicits._

// Each vector is one row; each position within a vector is one feature column.
val data = Seq(
  Vectors.dense(1.0, 0.0, 0.0, -2.0),
  Vectors.dense(4.0, 5.0, 0.0, 3.0),
  Vectors.dense(6.0, 7.0, 0.0, 8.0),
  Vectors.dense(9.0, 0.0, 0.0, 1.0)
)
val df = data.map(Tuple1.apply).toDF("features")

// Correlation.corr returns a one-row DataFrame holding the correlation matrix.
// Pearson is the default method.
val coeff1 = Correlation.corr(df, "features").head match {
  case Row(m: Matrix) => m
}
println(s"Pearson correlation matrix:\n$coeff1\n")

// Pass "spearman" as the third argument to compute Spearman's correlation instead.
val coeff2 = Correlation.corr(df, "features", "spearman").head match {
  case Row(m: Matrix) => m
}
println(s"Spearman correlation matrix:\n$coeff2\n")

/*
Output (the third column is constant, so its correlations with the other columns are NaN):

Pearson correlation matrix:
1.0                   0.055641488407465814  NaN  0.4004714203168137
0.055641488407465814  1.0                   NaN  0.9135958615342522
NaN                   NaN                   1.0  NaN
0.4004714203168137    0.9135958615342522    NaN  1.0

Spearman correlation matrix:
1.0                  0.10540925533894532  NaN  0.40000000000000174
0.10540925533894532  1.0                  NaN  0.9486832980505141
NaN                  NaN                  1.0  NaN
0.40000000000000174  0.9486832980505141   NaN  1.0
*/

Hypothesis testing

Hypothesis testing in statistics determines whether a result is statistically significant, that is, whether it is unlikely to have occurred by chance alone. spark.ml currently supports Pearson’s Chi-squared tests for independence.

ChiSquareTest conducts Pearson’s independence test for every feature against the label. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the Chi-squared statistic is computed. All label and feature values must be categorical.

The Chi-squared statistic is commonly used for testing relationships between categorical variables or feature columns. The null hypothesis of the Chi-squared test is that no relationship exists between the categorical variables in the population; that is, they are independent.
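The contingency-table computation described above can be sketched in plain Scala (no Spark required). The object name `ChiSquareSketch` and the function `chiSquare` are hypothetical names used only for illustration. The sketch computes the statistic and the degrees of freedom for a single feature; it does not compute the p-value, and it is not presented as Spark's implementation.

```scala
object ChiSquareSketch {
  // Takes (featureValue, label) pairs for ONE feature and returns
  // (Chi-squared statistic, degrees of freedom).
  def chiSquare(pairs: Seq[(Double, Double)]): (Double, Int) = {
    require(pairs.nonEmpty, "at least one (feature, label) pair is required")
    val n = pairs.length.toDouble

    // Observed count for every (feature value, label) cell of the contingency table.
    val observed = pairs.groupBy(identity).map { case (cell, v) => cell -> v.length.toDouble }
    // Marginal totals: one per distinct feature value (rows) and per distinct label (columns).
    val rowTotals = pairs.groupBy(_._1).map { case (f, v) => f -> v.length.toDouble }
    val colTotals = pairs.groupBy(_._2).map { case (l, v) => l -> v.length.toDouble }

    // Under independence, the expected count of a cell is rowTotal * colTotal / n.
    val statistic = (for {
      (f, rowTotal) <- rowTotals.toSeq
      (l, colTotal) <- colTotals.toSeq
    } yield {
      val expected = rowTotal * colTotal / n
      val obs = observed.getOrElse((f, l), 0.0)
      (obs - expected) * (obs - expected) / expected
    }).sum

    val degreesOfFreedom = (rowTotals.size - 1) * (colTotals.size - 1)
    (statistic, degreesOfFreedom)
  }

  def main(args: Array[String]): Unit = {
    // First feature of the Spark example, paired with its label.
    val firstFeature = Seq(
      (0.5, 0.0), (1.5, 0.0), (1.5, 1.0),
      (3.5, 0.0), (3.5, 0.0), (3.5, 1.0)
    )
    val (statistic, dof) = chiSquare(firstFeature)
    println(s"statistic = $statistic, degreesOfFreedom = $dof")
  }
}
```

For the first feature of the Spark example, the feature takes three distinct values and the label two, so the table is 3 by 2 with (3 - 1) * (2 - 1) = 2 degrees of freedom, and the statistic comes out at approximately 0.75.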

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.ChiSquareTest
// Needed for toDF; assumes a SparkSession named `spark`, as in spark-shell.
import spark.implicits._

// (label, features) pairs; every distinct value is treated as a category.
val data = Seq(
  (0.0, Vectors.dense(0.5, 10.0)),
  (0.0, Vectors.dense(1.5, 20.0)),
  (1.0, Vectors.dense(1.5, 30.0)),
  (0.0, Vectors.dense(3.5, 30.0)),
  (0.0, Vectors.dense(3.5, 40.0)),
  (1.0, Vectors.dense(3.5, 40.0))
)
val df = data.toDF("label", "features")

// The result is a single row: pValues, degreesOfFreedom, statistics (one entry per feature).
val chi = ChiSquareTest.test(df, "features", "label").head
println(s"pValues = ${chi.getAs[Vector](0)}")
println("degreesOfFreedom = " + chi.getSeq[Int](1).mkString("[", ",", "]"))
println(s"statistics ${chi.getAs[Vector](2)}")

/*
Output:
pValues = [0.6872892787909721,0.6822703303362126]
degreesOfFreedom = [2,3]
statistics [0.75,1.5]
*/
