StandardScaler

StandardScaler transforms a dataset of Vector rows, normalizing each feature to have unit standard deviation and/or zero mean. It takes two parameters:
withStd: True by default. Scales the data to unit standard deviation.
withMean: False by default. Centers the data with the mean before scaling. This builds a dense output, so take care when applying to sparse input (a centering variant is sketched after the example below).
StandardScaler is an Estimator which can be fit on a dataset to produce a StandardScalerModel; this amounts to computing summary statistics. The model can then transform a Vector column in a dataset to have unit standard deviation and/or zero mean features.
import org.apache.spark.ml.feature.StandardScaler

val dataFrame = spark.read.format("libsvm").load("file:///opt/spark/data/mllib/sample_libsvm_data.txt")

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithStd(true)
  .setWithMean(false)

// Compute summary statistics by fitting the StandardScaler.
val scalerModel = scaler.fit(dataFrame)

// Normalize each feature to have unit standard deviation.
val scaledData = scalerModel.transform(dataFrame)
scaledData.show(3)

/*
Output:
+-----+--------------------+--------------------+
|label|            features|      scaledFeatures|
+-----+--------------------+--------------------+
|  0.0|(692,[127,128,129...|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|(692,[158,159,160...|
|  1.0|(692,[124,125,126...|(692,[124,125,126...|
+-----+--------------------+--------------------+
*/
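
Because setWithMean(true) subtracts the per-feature mean, the transformed vectors can no longer be stored sparsely and every row is densified. The following sketch, which reuses the dataFrame loaded above, is one way to illustrate that caveat; the names centeringScaler, centeringModel, and centeredFeatures are illustrative, and the mean/std accessors on the fitted model are assumed to be available in recent Spark releases.

import org.apache.spark.ml.feature.StandardScaler

// Variant: scale to unit standard deviation and also center each feature to zero mean.
// Centering produces dense vectors, so avoid it on very wide sparse data.
val centeringScaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("centeredFeatures")
  .setWithStd(true)
  .setWithMean(true)

// Fitting computes the per-feature summary statistics used for the transform.
val centeringModel = centeringScaler.fit(dataFrame)

// The fitted model keeps those statistics (assumed accessors: mean, std).
println(centeringModel.mean)  // per-feature means
println(centeringModel.std)   // per-feature standard deviations

// Each row of centeredFeatures is now a dense vector.
centeringModel.transform(dataFrame).select("centeredFeatures").show(3)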