n-gram
An n-gram is a sequence of n tokens (typically words) for some integer n. The NGram class can be used to transform input features into n-grams.
NGram takes as input a sequence of strings (e.g. the output of a Tokenizer). The parameter n determines the number of terms in each n-gram. The output consists of a sequence of n-grams, where each n-gram is a space-delimited string of n consecutive words. If the input sequence contains fewer than n strings, no n-grams are produced for it (the output array for that row is empty).
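The sliding-window behavior described above can be sketched in plain Scala, without Spark. This is a minimal illustration, not Spark's own code: the helper name `ngrams` is hypothetical and exists only for this example.

```scala
// Hypothetical helper (not part of Spark): builds space-delimited n-grams
// from a token sequence with a sliding window of size n.
def ngrams(tokens: Seq[String], n: Int): List[String] = {
  require(n >= 1, "n must be at least 1")
  tokens.iterator
    .sliding(n)          // windows of n consecutive tokens
    .withPartial(false)  // drop a trailing window that has fewer than n tokens
    .map(_.mkString(" "))
    .toList
}

val bigrams = ngrams(Seq("Hi", "I", "heard", "about", "Spark"), 2)
// bigrams: List("Hi I", "I heard", "heard about", "about Spark")

val tooShort = ngrams(Seq("Hi"), 2)
// tooShort: List() because the input has fewer than n tokens
```

An input of k tokens yields k - n + 1 n-grams when k >= n, and none otherwise.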
Concept: a combination of neighboring words (for example, "logistic regression") can carry more meaning than the individual words.
import org.apache.spark.ml.feature.NGram

// Assumes a SparkSession named `spark` is in scope, as in spark-shell.
val wordDataFrame = spark.createDataFrame(Seq(
  (0, Array("Hi", "I", "heard", "about", "Spark")),
  (1, Array("I", "wish", "Java", "could", "use", "case", "classes")),
  (2, Array("Logistic", "regression", "models", "are", "neat"))
)).toDF("id", "words")

// setN(2) means n = 2: each n-gram holds 2 neighboring words.
val ngram = new NGram().setN(2).setInputCol("words").setOutputCol("ngrams")

val ngramDataFrame = ngram.transform(wordDataFrame)
ngramDataFrame.select("ngrams").show(false)

/*
+------------------------------------------------------------------+
|ngrams                                                            |
+------------------------------------------------------------------+
|[Hi I, I heard, heard about, about Spark]                         |
|[I wish, wish Java, Java could, could use, use case, case classes]|
|[Logistic regression, regression models, models are, are neat]    |
+------------------------------------------------------------------+
*/
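Since NGram expects a sequence of strings, it is commonly fed by a Tokenizer. The following is a minimal sketch of that chaining with n = 3, assuming a local Spark installation; the column names, sample sentences, and application name are made up for illustration. The second row is deliberately shorter than n to show the empty result.

```scala
import org.apache.spark.ml.feature.{NGram, Tokenizer}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; in spark-shell,
// getOrCreate() returns the session that already exists.
val spark = SparkSession.builder()
  .appName("NGramSketch")
  .master("local[*]")
  .getOrCreate()

// Illustrative data: raw sentences instead of pre-split word arrays.
val sentences = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (1, "Spark")
)).toDF("id", "sentence")

// Tokenizer lowercases the text and splits it on whitespace.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")

// n = 3: each n-gram holds 3 consecutive words.
val trigram = new NGram().setN(3).setInputCol("words").setOutputCol("ngrams")

val result = trigram.transform(tokenizer.transform(sentences))
result.select("id", "ngrams").orderBy("id").show(false)
```

Row 0 should contain three trigrams built from the lowercased tokens, while row 1 has only one token and therefore an empty array.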