VectorSlicer
VectorSlicer is a transformer that takes a feature vector and outputs a new feature vector with a sub-array of the original features. It is useful for extracting features from a vector column.
VectorSlicer accepts a vector column with specified indices, then outputs a new vector column whose values are selected via those indices. There are two types of indices:
Integer indices, set with setIndices(), that represent positions in the vector.
String indices, set with setNames(), that represent the names of features in the vector. This requires the vector column to have an AttributeGroup, since the implementation matches on the name field of an Attribute.
Features can be specified by integer index, by string name, or by both at the same time. At least one feature must be selected. Duplicate features are not allowed, so there can be no overlap between selected indices and names. Note that if features are selected by name, an exception is thrown when the input column has empty attributes.
The output vector will order features with the selected indices first (in the order given), followed by the selected names (in the order given).
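The selection rules above can be sketched in plain Scala without Spark. This is only an illustration of the described behavior, not Spark's implementation: SliceSketch, resolve, and slice are hypothetical names, and the feature names are passed in as a plain sequence instead of being read from an AttributeGroup.

```scala
// Hypothetical helper, not part of Spark: models how selected indices and
// names could be combined into one ordered list of vector positions.
object SliceSketch {
  def resolve(
      indices: Seq[Int],
      names: Seq[String],
      attrNames: Seq[String]): Seq[Int] = {
    // Map each requested name to its position in the input vector.
    val nameIndices = names.map { n =>
      val i = attrNames.indexOf(n)
      require(i >= 0, s"Feature name '$n' not found in input attributes")
      i
    }
    // Indices come first, then names, each in the order given.
    val selected = indices ++ nameIndices
    require(selected.nonEmpty, "At least one feature must be selected")
    require(selected.distinct.length == selected.length,
      "Selected indices and names must not overlap")
    selected
  }

  // Pick the values at the resolved positions, preserving their order.
  def slice(values: Array[Double], selected: Seq[Int]): Array[Double] =
    selected.map(values(_)).toArray
}

val attrNames = Seq("f1", "f2", "f3")
val picked = SliceSketch.resolve(Seq(1), Seq("f3"), attrNames)
val sliced = SliceSketch.slice(Array(0.0, 10.0, 0.5), picked)
println(sliced.mkString("[", ", ", "]"))  // prints [10.0, 0.5]
```

In this sketch, selecting index 1 together with the name "f2" fails the overlap check, because both refer to the same position.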
Examples
Suppose that we have a DataFrame with the column userFeatures:

userFeatures
----------------
[0.0, 10.0, 0.5]

userFeatures is a vector column that contains three user features. Assume that the first column of userFeatures is all zeros, so we want to remove it and select only the last two columns. The VectorSlicer selects the last two elements with setIndices(1, 2) and produces a new vector column named features:

userFeatures     | features
-----------------|------------
[0.0, 10.0, 0.5] | [10.0, 0.5]

Suppose also that userFeatures has input attributes, i.e. ["f1", "f2", "f3"]. Then we can use setNames("f2", "f3") to select the same two features.

userFeatures       | features
-------------------|-------------
[0.0, 10.0, 0.5]   | [10.0, 0.5]
["f1", "f2", "f3"] | ["f2", "f3"]

import java.util.Arrays

import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.StructType

val data = Arrays.asList(
  Row(Vectors.sparse(3, Seq((0, -2.0), (1, 2.3)))),
  Row(Vectors.dense(-2.0, 2.3, 0.0))
)

val defaultAttr = NumericAttribute.defaultAttr
val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName)
val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])

val dataset = spark.createDataFrame(data, StructType(Array(attrGroup.toStructField())))

val slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")

slicer.setIndices(Array(1)).setNames(Array("f3"))
// or slicer.setIndices(Array(1, 2)), or slicer.setNames(Array("f2", "f3"))

val output = slicer.transform(dataset)
output.show(false)

/*
Output:
+--------------------+-------------+
|userFeatures        |features     |
+--------------------+-------------+
|(3,[0,1],[-2.0,2.3])|(2,[0],[2.3])|
|[-2.0,2.3,0.0]      |[2.3,0.0]    |
+--------------------+-------------+
*/