Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The PCA class is an estimator: fitting it to a dataset trains a model that projects vectors onto a lower-dimensional space. The example below shows how to project 5-dimensional feature vectors onto 3 principal components.

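import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors

val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

// PCA is an estimator: configure it, then fit it to obtain a model.
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3) // number of principal components to keep
  .fit(df)

val result = pca.transform(df).select("pcaFeatures")
result.show(false) // print the full, untruncated vectors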


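Sample rows from the input and output DataFrames:

+---------------------+
|features             |
+---------------------+
|(5,[1,3],[1.0,7.0])  |
+---------------------+

+-----------------------------------------------------------+
|pcaFeatures                                                |
+-----------------------------------------------------------+
|[1.6485728230883807,-4.013282700516296,-5.524543751369388] |
|[-6.428880535676489,-5.337951427775355,-5.524543751369389] |
+-----------------------------------------------------------+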
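
Beyond projecting vectors, it can help to inspect what the fitted model learned. The following is a minimal sketch, not part of the original example, that refits the same data and reads two members of the fitted PCAModel: pc (the principal components matrix) and explainedVariance (the proportion of variance captured by each component). The local SparkSession settings (master "local[*]" and the app name "PCAInspect") are assumptions chosen for illustration; in spark-shell the existing session would be reused.

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; adjust master/appName as needed.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("PCAInspect")
  .getOrCreate()

// Same three 5-dimensional vectors as in the example above.
val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val model = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)
  .fit(df)

// pc has one row per input dimension and one column per principal component.
println(s"pc dimensions: ${model.pc.numRows} x ${model.pc.numCols}")

// explainedVariance has one entry per retained component.
println(s"explained variance: ${model.explainedVariance}")
```

One possible use of explainedVariance is to compare a few values of k and keep the smallest one whose components together capture enough of the variance for the task at hand; what counts as enough depends on the application.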

