CountVectorizer
CountVectorizer converts a collection of text documents into vectors of term counts.
IDF is an Estimator that is fit on a dataset and produces an IDFModel. The IDFModel takes feature vectors (generally created by HashingTF or CountVectorizer) and rescales each feature. Intuitively, it down-weights features that appear frequently across the corpus.
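To make the two ideas concrete, here is a minimal plain-Scala sketch (no Spark required) that computes term counts and IDF weights by hand. The toy corpus, the object name `TermCountIdfSketch`, and the alphabetical vocabulary order are assumptions made for illustration; Spark's CountVectorizer picks its own vocabulary order. The IDF formula is the smoothed form given in the Spark MLlib documentation, ln((m + 1) / (df + 1)), where m is the number of documents.

```scala
// Plain-Scala sketch of term counts and IDF weighting.
// The corpus and all names here are illustrative, not part of Spark.
object TermCountIdfSketch {
  // Three tiny, already tokenized "documents"
  val docs: Seq[Seq[String]] = Seq(
    Seq("spark", "is", "fast"),
    Seq("spark", "is", "fun"),
    Seq("scala", "is", "fun")
  )

  // Vocabulary: distinct terms, sorted alphabetically for a stable order
  val vocab: Seq[String] = docs.flatten.distinct.sorted

  // Term counts: one slot per vocabulary term
  def termCounts(doc: Seq[String]): Seq[Int] =
    vocab.map(term => doc.count(_ == term))

  // Document frequency: number of documents containing the term
  def docFreq(term: String): Int = docs.count(_.contains(term))

  // Smoothed IDF: ln((m + 1) / (df + 1)), m = number of documents
  def idf(term: String): Double =
    math.log((docs.size + 1.0) / (docFreq(term) + 1.0))

  // TF-IDF vector: each count scaled by the IDF weight of its term
  def tfIdf(doc: Seq[String]): Seq[Double] =
    termCounts(doc).zip(vocab.map(idf)).map { case (tf, w) => tf * w }

  def main(args: Array[String]): Unit = {
    println(vocab.mkString(", "))
    docs.foreach { d =>
      println(tfIdf(d).map(v => f"$v%.3f").mkString("[", ", ", "]"))
    }
  }
}
```

In this toy corpus the word "is" occurs in every document, so its IDF weight is ln(4/4) = 0 and it drops out of the weighted vectors, which is the down-weighting effect described above.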
import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}

// Split each message into lowercase, whitespace-separated words
val tokenizer = new Tokenizer()
  .setInputCol("message")
  .setOutputCol("words")

val wordsData = tokenizer.transform(df_select)
wordsData.show(3, false)

import org.apache.spark.ml.feature.CountVectorizer

// Learn a vocabulary from the tokenized words and count terms per document
val count = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("rawFeatures")

val model = count.fit(wordsData)

val featurizedData = model.transform(wordsData)
featurizedData.show(3, false)
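The IDF step described at the top is not shown in the code above. A minimal, self-contained sketch of how it could be applied to the CountVectorizer output follows. The object name `IdfSketch`, the three sample messages (a stand-in for `df_select`), the local master setting, and the output column name "features" are assumptions made for illustration.

```scala
import org.apache.spark.ml.feature.{CountVectorizer, IDF, Tokenizer}
import org.apache.spark.sql.{DataFrame, SparkSession}

object IdfSketch {
  // Builds a tiny corpus, counts terms, then rescales the counts with IDF
  def rescale(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Hypothetical stand-in for df_select: one "message" column
    val df = Seq("spark is fast", "spark is fun", "scala is fun").toDF("message")

    val wordsData = new Tokenizer()
      .setInputCol("message")
      .setOutputCol("words")
      .transform(df)

    val featurizedData = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("rawFeatures")
      .fit(wordsData)
      .transform(wordsData)

    // IDF is an Estimator: fit() scans the corpus and returns an IDFModel
    val idfModel = new IDF()
      .setInputCol("rawFeatures")
      .setOutputCol("features")
      .fit(featurizedData)

    // The IDFModel scales each term count by its IDF weight
    idfModel.transform(featurizedData)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("IdfSketch")
      .master("local[*]") // assumption: run locally for the demo
      .getOrCreate()
    rescale(spark).select("words", "features").show(false)
    spark.stop()
  }
}
```

In a notebook where `featurizedData` already exists, only the `new IDF()` step and the final `transform` call would be needed.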