StopWordsRemover
Stop words are words that should be excluded from the input, typically because they appear frequently and carry little meaning.
StopWordsRemover takes as input a sequence of strings (e.g. the output of a Tokenizer) and drops all the stop words from the input sequences. The list of stop words is specified by the stopWords parameter. Default stop words for some languages are accessible by calling StopWordsRemover.loadDefaultStopWords(language), for which the available options are "danish", "dutch", "english", "finnish", "french", "german", "hungarian", "italian", "norwegian", "portuguese", "russian", "spanish", "swedish" and "turkish". A boolean parameter caseSensitive indicates whether matching should be case sensitive (false by default).
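The effect of the caseSensitive parameter can be illustrated with a minimal plain-Python sketch. This is not Spark's implementation: the function remove_stop_words and the three-word stop list are hypothetical names and data chosen for illustration, and the sketch assumes that case-insensitive matching amounts to comparing lowercased forms.

```python
def remove_stop_words(tokens, stop_words, case_sensitive=False):
    """Return tokens with every stop word dropped, preserving order."""
    if case_sensitive:
        blocked = set(stop_words)
        return [t for t in tokens if t not in blocked]
    # Case-insensitive: compare lowercased forms, but keep the original token text.
    blocked = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in blocked]


# Hypothetical three-word stop list, used only for this illustration.
stop_words = ["the", "a", "is"]
tokens = ["The", "sky", "is", "a", "deep", "blue"]

insensitive = remove_stop_words(tokens, stop_words)
sensitive = remove_stop_words(tokens, stop_words, case_sensitive=True)
print(insensitive)  # ['sky', 'deep', 'blue']
print(sensitive)    # ['The', 'sky', 'deep', 'blue']
```

With the default (case-insensitive) behaviour, the capitalized "The" is treated as a stop word; with case-sensitive matching it survives because only the lowercase form is in the list.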
Examples
Assume that we have the following DataFrame with columns id and raw:
id | raw |
0 | [I, saw, the, red, balloon] |
1 | [Mary, had, a, little, lamb] |
Applying StopWordsRemover with raw as the input column and filtered as the output column, we should get the following:
id | raw | filtered |
0 | [I, saw, the, red, balloon] | [saw, red, balloon] |
1 | [Mary, had, a, little, lamb] | [Mary, little, lamb] |
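The table above can be reproduced with a small plain-Python sketch that stands in for the Spark transformer. The rows are modelled as dictionaries rather than a DataFrame, and STOP_WORDS is a hypothetical four-word subset of an English stop-word list (the real default list is much longer); add_filtered is likewise an illustrative helper, not part of any library.

```python
# Hypothetical subset of an English stop-word list; the real default list is much longer.
STOP_WORDS = {"i", "the", "had", "a"}

# The two example rows, modelled as plain dictionaries instead of a DataFrame.
rows = [
    {"id": 0, "raw": ["I", "saw", "the", "red", "balloon"]},
    {"id": 1, "raw": ["Mary", "had", "a", "little", "lamb"]},
]


def add_filtered(row, stop_words=STOP_WORDS):
    """Return a copy of row with a 'filtered' column (case-insensitive match)."""
    kept = [t for t in row["raw"] if t.lower() not in stop_words]
    return {**row, "filtered": kept}


result = [add_filtered(r) for r in rows]
for r in result:
    print(r["id"], r["raw"], r["filtered"])
```

Note that "I" is dropped from the first row even though the stop list holds only the lowercase "i", which matches the default case-insensitive behaviour described earlier.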