CountVectorizer

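CountVectorizer converts a collection of text documents to vectors of term counts. It is an Estimator: fitting it on tokenized documents extracts a vocabulary and produces a CountVectorizerModel, which maps each document to a sparse vector of counts over that vocabulary.

IDF is an Estimator that is fit on a dataset and produces an IDFModel. The IDFModel takes feature vectors (generally created by HashingTF or CountVectorizer) and rescales each feature. Intuitively, it down-weights features that appear frequently across the corpus.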
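To make the two ideas concrete, here is a minimal plain-Scala sketch (no Spark required) of what term counting and IDF weighting compute. The toy documents and the names `vocabulary`, `countVector`, and `idfWeight` are illustrative assumptions, not part of the Spark API. The formula log((N + 1) / (df + 1)) is the smoothed IDF described in the Spark MLlib documentation. Breaking frequency ties alphabetically is a choice made here only to keep the result deterministic.

```scala
// Toy corpus, already tokenized (hypothetical data for illustration only).
val docs: Seq[Seq[String]] = Seq(
  Seq("spark", "is", "fast"),
  Seq("spark", "is", "fun", "fun"),
  Seq("scala", "is", "fun")
)

// Vocabulary: distinct terms, most frequent in the corpus first
// (ties broken alphabetically, an assumption of this sketch).
val vocabulary: Vector[String] = docs.flatten
  .groupBy(identity)
  .map { case (term, occurrences) => (term, occurrences.size) }
  .toVector
  .sortBy { case (term, n) => (-n, term) }
  .map(_._1)

// Term-count vector of one document over the vocabulary.
def countVector(doc: Seq[String]): Vector[Int] =
  vocabulary.map(term => doc.count(_ == term))

// Smoothed inverse document frequency: log((N + 1) / (df + 1)),
// where N is the number of documents and df the number containing the term.
def idfWeight(term: String): Double = {
  val df = docs.count(_.contains(term))
  math.log((docs.size + 1.0) / (df + 1.0))
}

val counts = docs.map(countVector)
val weights = vocabulary.map(idfWeight)
```

In this toy corpus "is" occurs in every document, so its weight is log(4 / 4) = 0, while "fast" occurs in only one document and gets log(4 / 2). That is the down-weighting of frequent terms described above.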

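import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}

// Tokenizer lowercases the "message" column and splits it on whitespace.
val tokenizer = new Tokenizer()
  .setInputCol("message")
  .setOutputCol("words")

// df_select is assumed to be defined earlier and to contain a string column "message".
val wordsData = tokenizer.transform(df_select)
wordsData.show(3, false) // first 3 rows, without truncating long cells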
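The import above also brings in RegexTokenizer, which the snippet does not use. As a hedged sketch, it could stand in for Tokenizer when messages contain punctuation. The pattern `\\W+` and the names `regexTokenizer`, `regexWords`, and `regexWordsData` are assumptions chosen for illustration; `df_select` is the same DataFrame as above.

```scala
import org.apache.spark.ml.feature.RegexTokenizer

// Split on runs of non-word characters instead of whitespace only.
// The pattern is an illustrative choice, not something the original specifies.
val regexTokenizer = new RegexTokenizer()
  .setInputCol("message")
  .setOutputCol("regexWords")
  .setPattern("\\W+")

val regexWordsData = regexTokenizer.transform(df_select)
regexWordsData.show(3, false)
```

If this variant is used, the CountVectorizer input column would need to be set to "regexWords" instead of "words".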



import org.apache.spark.ml.feature.CountVectorizer

// CountVectorizer is an Estimator: it learns a vocabulary from the "words" column.
val count = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("rawFeatures")


// Fit on the tokenized data to build the vocabulary; the result is a CountVectorizerModel.
val model = count.fit(wordsData)
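Once the model is fitted, the learned vocabulary can be inspected, and the vocabulary itself can be constrained when building the estimator. The following is a sketch only: the name `limitedCount` and the values 1000 and 2.0 are illustrative assumptions, not settings taken from this page.

```scala
import org.apache.spark.ml.feature.CountVectorizer

// The learned vocabulary: index i in each count vector refers to vocabulary(i).
model.vocabulary.take(10).foreach(println)

// Optional limits on the vocabulary (values are illustrative assumptions).
val limitedCount = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setVocabSize(1000) // keep at most the 1000 most frequent terms
  .setMinDF(2.0)      // ignore terms that appear in fewer than 2 documents
```

Calling `limitedCount.fit(wordsData)` would then produce a model with the restricted vocabulary.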


// Map each document to a sparse vector of term counts over the learned vocabulary.
val featurizedData = model.transform(wordsData)
featurizedData.show(3, false)
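The introduction mentions that an IDFModel rescales vectors produced by CountVectorizer. A minimal sketch of that follow-up step is shown here, assuming `featurizedData` from above; the output column name "features" and the names `idf`, `idfModel`, and `rescaledData` are assumptions for illustration.

```scala
import org.apache.spark.ml.feature.IDF

// IDF is an Estimator: fit it on the count vectors to learn per-term weights.
val idf = new IDF()
  .setInputCol("rawFeatures")
  .setOutputCol("features")

val idfModel = idf.fit(featurizedData)

// Rescale each count by its IDF weight; "features" then holds TF-IDF vectors.
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("words", "features").show(3, false)
```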
