Binarizer
Binarization is the process of thresholding numerical features to binary (0/1) features.
Binarizer takes the common parameters inputCol and outputCol, as well as the threshold for binarization. Feature values greater than the threshold are binarized to 1.0; values equal to or less than the threshold are binarized to 0.0. Both Vector and Double types are supported for inputCol.
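The thresholding rule described above can be sketched in plain Scala, without Spark. This is only an illustration of the comparison, not Spark's implementation; the object `ThresholdSketch` and the helper `binarize` are hypothetical names introduced for this example.

```scala
object ThresholdSketch {
  // Hypothetical helper (not part of Spark): strictly greater than the
  // threshold maps to 1.0; equal to or less than the threshold maps to 0.0.
  def binarize(value: Double, threshold: Double): Double =
    if (value > threshold) 1.0 else 0.0

  def main(args: Array[String]): Unit = {
    val threshold = 0.5
    // 0.5 is included to show the boundary case (equal to the threshold).
    val features = Seq(0.1, 0.8, 0.2, 0.5)
    features.foreach { v =>
      println(s"$v -> ${binarize(v, threshold)}")
    }
  }
}
```

Note the boundary case: a value exactly equal to the threshold is binarized to 0.0, because the comparison is strict.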
import org.apache.spark.ml.feature.Binarizer

// `spark` is an existing SparkSession (predefined in spark-shell).
val data = Array((0, 0.1), (1, 0.8), (2, 0.2))
val dataFrame = spark.createDataFrame(data).toDF("id", "feature")

// Values strictly greater than 0.5 become 1.0; all others become 0.0.
val binarizer: Binarizer = new Binarizer()
  .setInputCol("feature")
  .setOutputCol("binarized_feature")
  .setThreshold(0.5)

val binarizedDataFrame = binarizer.transform(dataFrame)

println(s"Binarizer output with Threshold = ${binarizer.getThreshold}")
binarizedDataFrame.show()

/*
Binarizer output with Threshold = 0.5
+---+-------+-----------------+
| id|feature|binarized_feature|
+---+-------+-----------------+
|  0|    0.1|              0.0|
|  1|    0.8|              1.0|
|  2|    0.2|              0.0|
+---+-------+-----------------+
*/
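Since inputCol also accepts Vector columns, the same transformer can be applied to a whole feature vector at once. The following is a minimal sketch under stated assumptions: Spark is on the classpath, the session runs in local mode, and the names `BinarizerVectorSketch`, `binarizeVectors`, `features`, and `binarized_features` are chosen for this example only.

```scala
import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession

object BinarizerVectorSketch {
  // Hypothetical helper: binarizes a small Vector column and returns
  // each output vector as a plain array for easy inspection.
  def binarizeVectors(spark: SparkSession): Array[Array[Double]] = {
    val data = Seq(
      (0, Vectors.dense(0.1, 0.8, 0.2)),
      (1, Vectors.dense(0.9, 0.5, 0.6))
    )
    val dataFrame = spark.createDataFrame(data).toDF("id", "features")

    // The threshold is applied to every element of each vector.
    val binarizer = new Binarizer()
      .setInputCol("features")
      .setOutputCol("binarized_features")
      .setThreshold(0.5)

    binarizer.transform(dataFrame)
      .orderBy("id")
      .collect()
      .map(_.getAs[Vector]("binarized_features").toArray)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for this illustration.
    val spark = SparkSession.builder()
      .appName("BinarizerVectorSketch")
      .master("local[*]")
      .getOrCreate()
    try {
      binarizeVectors(spark).foreach(row => println(row.mkString("[", ", ", "]")))
    } finally {
      spark.stop()
    }
  }
}
```

Converting each result with `toArray` avoids depending on whether Spark stores the output vector in dense or sparse form.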