StopWordsRemover
Stop words are words which should be excluded from the input, typically because they appear frequently and don't carry much meaning.
StopWordsRemover takes as input a sequence of strings (e.g. the output of a Tokenizer) and drops all the stop words from the input sequences. The list of stop words is specified by the stopWords parameter. Default stop words for some languages are accessible by calling StopWordsRemover.loadDefaultStopWords(language), for which available options are "danish", "dutch", "english", "finnish", "french", "german", "hungarian", "italian", "norwegian", "portuguese", "russian", "spanish", "swedish" and "turkish". A boolean parameter caseSensitive indicates whether the matches should be case sensitive (false by default).
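These parameters can be combined when configuring the transformer. A minimal sketch (assuming Spark is on the classpath) of loading a default list and enabling case-sensitive matching:

```scala
import org.apache.spark.ml.feature.StopWordsRemover

// Load the built-in English stop-word list.
val englishStopWords = StopWordsRemover.loadDefaultStopWords("english")

// With caseSensitive = true, "The" would no longer match the stop word "the".
val remover = new StopWordsRemover()
  .setInputCol("raw")
  .setOutputCol("filtered")
  .setStopWords(englishStopWords)
  .setCaseSensitive(true)
```

Passing an explicit array to setStopWords also allows supplying a fully custom list instead of a built-in one.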
Examples
Assume that we have the following DataFrame with columns id and raw:
| id | raw |
|----|-----|
| 0 | [I, saw, the, red, balloon] |
| 1 | [Mary, had, a, little, lamb] |
Applying StopWordsRemover with raw as the input column and filtered as the output column, we should get the following:
| id | raw | filtered |
|----|-----|----------|
| 0 | [I, saw, the, red, balloon] | [saw, red, balloon] |
| 1 | [Mary, had, a, little, lamb] | [Mary, little, lamb] |
```scala
import org.apache.spark.ml.feature.StopWordsRemover

val remover = new StopWordsRemover()
  .setInputCol("raw")
  .setOutputCol("filtered")

val dataSet = spark.createDataFrame(Seq(
  (0, Seq("I", "saw", "the", "red", "balloon")),
  (1, Seq("Mary", "had", "a", "little", "lamb"))
)).toDF("id", "raw")

remover.transform(dataSet).show(false)

/*
Output:
+---+----------------------------+--------------------+
|id |raw                         |filtered            |
+---+----------------------------+--------------------+
|0  |[I, saw, the, red, balloon] |[saw, red, balloon] |
|1  |[Mary, had, a, little, lamb]|[Mary, little, lamb]|
+---+----------------------------+--------------------+

The words "I", "the", "had", and "a" are removed because they are stop words.
*/
```