PCA
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The PCA class trains a model that projects vectors into a lower-dimensional space. The example below projects 5-dimensional feature vectors onto 3 principal components.
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors

// Three 5-dimensional feature vectors (one sparse, two dense).
val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
// `spark` is the active SparkSession (predefined in spark-shell and notebooks).
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
df.show(false)

// Fit a PCA model that keeps the top 3 principal components.
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)
  .fit(df)
// Project each input vector onto the 3 components.
val result = pca.transform(df).select("pcaFeatures")
result.show(false)

/*

Output:
+---------------------+
|features             |
+---------------------+
|(5,[1,3],[1.0,7.0])  |
|[2.0,0.0,3.0,4.0,5.0]|
|[4.0,0.0,0.0,6.0,7.0]|
+---------------------+

+-----------------------------------------------------------+
|pcaFeatures                                                |
+-----------------------------------------------------------+
|[1.6485728230883807,-4.013282700516296,-5.524543751369388] |
|[-4.645104331781534,-1.1167972663619026,-5.524543751369387]|
|[-6.428880535676489,-5.337951427775355,-5.524543751369389] |
+-----------------------------------------------------------+

*/
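To see what the fitted model contains, the sketch below inspects it and reproduces the projection by hand. It uses the same data as the example, the `pc` and `explainedVariance` members of the fitted model, and a local SparkSession. The session settings and the names `model`, `manual`, and `projected` are illustrative assumptions, not part of the original example. It also assumes that `transform` multiplies each vector by the transpose of `pc` without mean-centering, which is consistent with the constant third column in the output above.

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("PCAInspect")
  .getOrCreate()

val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val model = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)
  .fit(df)

// 5 x 3 matrix: one column per principal component.
println(model.pc)
// Proportion of variance explained by each of the 3 components.
println(model.explainedVariance)

// Project each input vector by hand: pc^T * x (no centering, per the assumption above).
val manual = data.map(v => model.pc.transpose.multiply(v))

// Projection computed by the model itself, collected to the driver for comparison.
val projected = model.transform(df)
  .select("pcaFeatures")
  .collect()
  .map(_.getAs[Vector](0))

manual.zip(projected).foreach { case (m, p) => println(s"manual=$m model=$p") }
```

If the assumption holds, each `manual` vector matches the corresponding `model` vector up to floating-point rounding. The values in `explainedVariance` can help when choosing a smaller `k`.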
Last modified 1yr ago