TF-IDF

Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus.
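As a reference for the numbers in the example further down, the weighting can be written out explicitly. This is a sketch that assumes the smoothed form given in the Spark MLlib documentation; the symbols t (term), d (document), D (corpus), TF (count of t in d) and DF (number of documents containing t) are notation introduced here, not part of the original text.

```latex
\mathrm{IDF}(t, D) = \log \frac{|D| + 1}{\mathrm{DF}(t, D) + 1},
\qquad
\mathrm{TFIDF}(t, d, D) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t, D)
```

Under this form, a term that appears in every document gets an IDF of log(1) = 0, so it contributes nothing to the feature vector.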
TF: Both HashingTF and CountVectorizer can be used to generate the term frequency vectors.
HashingTF:
HashingTF is a Transformer which takes sets of terms and converts those sets into fixed-length feature vectors. In text processing, a “set of terms” might be a bag of words. HashingTF utilizes the hashing trick.
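To make the hashing trick concrete, here is a minimal plain-Scala sketch with no Spark dependency. The names `HashingTrickSketch` and `termFrequencies` are hypothetical, and the sketch uses `String.hashCode` for brevity rather than the hash function Spark uses, so the bucket indices it produces will not match Spark's output. It only illustrates the idea: each term is mapped to a bucket index modulo the vector length, and the counts per bucket form the term frequency vector.

```scala
// Illustrative only: NOT Spark's HashingTF implementation.
object HashingTrickSketch {

  // Map each term to a bucket in [0, numFeatures) and count hits per bucket.
  // Assumption: String.hashCode stands in for the real hash function.
  def termFrequencies(terms: Seq[String], numFeatures: Int): Map[Int, Double] =
    terms
      .map(term => Math.floorMod(term.hashCode, numFeatures)) // non-negative index
      .groupBy(identity)
      .map { case (index, hits) => index -> hits.size.toDouble }

  def main(args: Array[String]): Unit = {
    val words = "hi i heard about spark".split(" ").toSeq
    // Sparse representation: bucket index -> term count.
    println(termFrequencies(words, numFeatures = 20).toSeq.sortBy(_._1))
  }
}
```

Because distinct terms can land in the same bucket, a small `numFeatures` such as 20 makes collisions likely; a larger vector length reduces them at the cost of dimensionality.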
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// Assumes `spark` is an existing SparkSession (as in spark-shell).
// $example on$
val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

// Split each sentence into lowercase, whitespace-separated words.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

// Hash each word into one of 20 buckets and count occurrences per bucket.
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)

val featurizedData = hashingTF.transform(wordsData)
// alternatively, CountVectorizer can also be used to get term frequency vectors

// Fit IDF weights on the corpus, then rescale the raw term frequencies.
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)

val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show(false)
// $example off$

/*
+-----+----------------------------------------------------------------------------------------------------------------------+
|label|features                                                                                                              |
+-----+----------------------------------------------------------------------------------------------------------------------+
|0.0  |(20,[0,5,9,17],[0.6931471805599453,0.6931471805599453,0.28768207245178085,1.3862943611198906])                        |
|0.0  |(20,[2,7,9,13,15],[0.6931471805599453,0.6931471805599453,0.8630462173553426,0.28768207245178085,0.28768207245178085]) |
|1.0  |(20,[4,6,13,15,18],[0.6931471805599453,0.6931471805599453,0.28768207245178085,0.28768207245178085,0.6931471805599453])|
+-----+----------------------------------------------------------------------------------------------------------------------+
*/
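The values in the output above can be checked by hand. The sketch below assumes the smoothed IDF form from the Spark MLlib documentation, log((|D| + 1) / (DF + 1)), where |D| is the number of documents (3 here) and DF is the number of documents in which a bucket is non-zero. The names `IdfSketch` and `idf` are hypothetical helpers for illustration, not Spark API.

```scala
// Illustrative arithmetic check, independent of Spark.
object IdfSketch {

  // Assumed smoothed IDF: log((numDocs + 1) / (docFreq + 1)).
  def idf(numDocs: Long, docFreq: Long): Double =
    math.log((numDocs + 1.0) / (docFreq + 1.0))

  def main(args: Array[String]): Unit = {
    val numDocs = 3L
    println(idf(numDocs, docFreq = 1)) // bucket present in 1 of 3 documents, about 0.693
    println(idf(numDocs, docFreq = 2)) // bucket present in 2 of 3 documents, about 0.288
    println(idf(numDocs, docFreq = 3)) // bucket present in all documents, 0.0
  }
}
```

Each feature value is the bucket's term count multiplied by its IDF. For instance, 1.3862943611198906 in the first row is 2 × 0.693..., which indicates a count of 2 in bucket 17, and 0.8630462173553426 in the second row is 3 × 0.2877..., a count of 3 in bucket 9.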