Correlation

Computing the correlation between two series of data is a common operation in statistics. Correlation measures whether two variables (or two feature columns) tend to move together, either in the same direction or in opposite directions. The idea is to detect whether one variable or feature column can be predicted from another.
spark.ml provides correlation methods for Pearson’s and Spearman’s correlation:
Pearson correlation measures the linear relationship between two continuous variables (or feature columns).
Spearman correlation measures the monotonic relationship between two continuous or ordinal variables.
In a monotonic relationship, the variables tend to change together, but not necessarily at a constant rate. The Spearman correlation coefficient is based on the ranked values of each variable rather than on the raw data.
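To make the "ranked values" point concrete, here is a minimal sketch in plain Scala (no Spark) that computes Pearson on the raw values and Spearman as Pearson applied to the ranks. It uses the first and fourth feature columns from this page's example data. The object and function names (`SpearmanSketch`, `pearson`, `ranks`, `spearman`) are illustrative assumptions and are not part of spark.ml.

```scala
object SpearmanSketch {
  // Pearson correlation of two equally sized samples.
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.size == y.size && x.nonEmpty)
    val mx = x.sum / x.size
    val my = y.sum / y.size
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
    val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
    cov / (sx * sy) // 0.0 / 0.0 = NaN when either sample is constant
  }

  // 1-based ranks; tied values share the average of their positions.
  def ranks(x: Seq[Double]): Seq[Double] = {
    val sorted = x.sorted
    x.map { v =>
      val first = sorted.indexOf(v) + 1
      val last = sorted.lastIndexOf(v) + 1
      (first + last) / 2.0
    }
  }

  // Spearman correlation is Pearson correlation computed on the ranks.
  def spearman(x: Seq[Double], y: Seq[Double]): Double =
    pearson(ranks(x), ranks(y))

  def main(args: Array[String]): Unit = {
    val x = Seq(1.0, 4.0, 6.0, 9.0)  // first feature column
    val y = Seq(-2.0, 3.0, 8.0, 1.0) // fourth feature column
    println(f"Pearson  = ${pearson(x, y)}%.4f")  // approximately 0.4005
    println(f"Spearman = ${spearman(x, y)}%.4f") // approximately 0.4000
    // A constant column has zero variance, so its correlation is undefined.
    println(pearson(x, Seq(0.0, 0.0, 0.0, 0.0)).isNaN) // true
  }
}
```

The last line shows why a feature column that never changes (such as the all-zero third column in the example data) yields NaN entries in a correlation matrix.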
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
// toDF needs the session implicits. This assumes a SparkSession named `spark`,
// as in spark-shell, where the import is already done for you.
import spark.implicits._

// Four observations of a 4-dimensional feature vector.
// The first and last rows could equally be written as sparse vectors:
//   Vectors.sparse(4, Seq((0, 1.0), (3, -2.0)))
//   Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
val data = Seq(
  Vectors.dense(1.0, 0.0, 0.0, -2.0),
  Vectors.dense(4.0, 5.0, 0.0, 3.0),
  Vectors.dense(6.0, 7.0, 0.0, 8.0),
  Vectors.dense(9.0, 0.0, 0.0, 1.0)
)
val df = data.map(Tuple1.apply).toDF("features")

// Pearson is the default method. corr returns a one-row DataFrame holding the matrix.
val coeff1 = Correlation.corr(df, "features").head match {
  case Row(coeff1: Matrix) => coeff1
}
println(s"Pearson correlation matrix:\n $coeff1\n")

// Pass "spearman" as the third argument for rank-based correlation.
val coeff2 = Correlation.corr(df, "features", "spearman").head match {
  case Row(coeff2: Matrix) => coeff2
}
println(s"Spearman correlation matrix:\n $coeff2\n")

/*
The third column is constant (all zeros), so its correlation with the
other columns is undefined and shows up as NaN.

Pearson correlation matrix:
 1.0 0.055641488407465814 NaN 0.4004714203168137
0.055641488407465814 1.0 NaN 0.9135958615342522
NaN NaN 1.0 NaN
0.4004714203168137 0.9135958615342522 NaN 1.0

Spearman correlation matrix:
 1.0 0.10540925533894532 NaN 0.40000000000000174
0.10540925533894532 1.0 NaN 0.9486832980505141
NaN NaN 1.0 NaN
0.40000000000000174 0.9486832980505141 NaN 1.0
*/

Hypothesis testing

Hypothesis testing determines whether a result is statistically significant, that is, whether it is unlikely to have occurred by chance. spark.ml currently supports Pearson’s Chi-squared tests for independence.
ChiSquareTest conducts Pearson’s independence test for every feature against the label. For each feature, the (feature, label) pairs are converted into a contingency matrix, and the Chi-squared statistic is computed from that matrix. All label and feature values must be categorical.
The Chi-squared statistic is commonly used to test for relationships between categorical variables or feature columns. The null hypothesis of the Chi-squared test is that no relationship exists between the categorical variables in the population, that is, they are independent.
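The contingency-matrix step can be sketched in plain Scala (no Spark) as follows. This is only an illustration of the standard Pearson Chi-squared calculation, not the spark.ml implementation. It uses the (label, features) data from this page, split into one sequence per feature, and the names `ChiSquareSketch` and `chiSquare` are hypothetical.

```scala
object ChiSquareSketch {
  // Returns (Chi-squared statistic, degrees of freedom) for one categorical
  // feature tested against a categorical label.
  def chiSquare(feature: Seq[Double], label: Seq[Double]): (Double, Int) = {
    require(feature.size == label.size && feature.nonEmpty)
    val n = feature.size.toDouble
    // Observed count for each (feature value, label value) cell.
    val observed = feature.zip(label).groupBy(identity)
      .map { case (cell, hits) => cell -> hits.size.toDouble }
    // Marginal totals of the contingency matrix.
    val rowTotals = feature.groupBy(identity)
      .map { case (value, hits) => value -> hits.size.toDouble }
    val colTotals = label.groupBy(identity)
      .map { case (value, hits) => value -> hits.size.toDouble }
    // Sum (observed - expected)^2 / expected over every cell, where the
    // expected count under independence is rowTotal * colTotal / n.
    val statistic = (for {
      (f, rowTotal) <- rowTotals.toSeq
      (l, colTotal) <- colTotals.toSeq
    } yield {
      val expected = rowTotal * colTotal / n
      val obs = observed.getOrElse((f, l), 0.0)
      (obs - expected) * (obs - expected) / expected
    }).sum
    (statistic, (rowTotals.size - 1) * (colTotals.size - 1))
  }

  def main(args: Array[String]): Unit = {
    val label    = Seq(0.0, 0.0, 1.0, 0.0, 0.0, 1.0)
    val feature0 = Seq(0.5, 1.5, 1.5, 3.5, 3.5, 3.5)
    val feature1 = Seq(10.0, 20.0, 30.0, 30.0, 40.0, 40.0)
    for ((feature, i) <- Seq(feature0, feature1).zipWithIndex) {
      val (statistic, dof) = chiSquare(feature, label)
      println(f"feature $i: statistic = $statistic%.4f, degreesOfFreedom = $dof")
    }
  }
}
```

For this data the sketch gives a statistic of about 0.75 with 2 degrees of freedom for the first feature and about 1.5 with 3 degrees of freedom for the second, which agrees with the ChiSquareTest output listed on this page. Turning the statistic into a p-value additionally requires the Chi-squared distribution function, which is left out here.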
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.ChiSquareTest
// toDF needs the session implicits. This assumes a SparkSession named `spark`,
// as in spark-shell, where the import is already done for you.
import spark.implicits._

// Each row is (label, features); every distinct value is treated as a category.
val data = Seq(
  (0.0, Vectors.dense(0.5, 10.0)),
  (0.0, Vectors.dense(1.5, 20.0)),
  (1.0, Vectors.dense(1.5, 30.0)),
  (0.0, Vectors.dense(3.5, 30.0)),
  (0.0, Vectors.dense(3.5, 40.0)),
  (1.0, Vectors.dense(3.5, 40.0))
)
val df = data.toDF("label", "features")

// The result is a single row: pValues, degreesOfFreedom, statistics,
// each with one entry per feature.
val chi = ChiSquareTest.test(df, "features", "label").head
println(s"pValues = ${chi.getAs[Vector](0)}")
println("degreesOfFreedom = " + chi.getSeq[Int](1).mkString("[", ",", "]"))
println(s"statistics ${chi.getAs[Vector](2)}")

/*
Output:
pValues = [0.6872892787909721,0.6822703303362126]
degreesOfFreedom = [2,3]
statistics [0.75,1.5]
*/