CountVectorizer converts a collection of text documents into vectors of term counts. IDF is an Estimator which is fit on a dataset and produces an IDFModel; the IDFModel takes feature vectors (generally created by HashingTF or CountVectorizer) and scales each feature. Intuitively, it down-weights features that appear frequently across the corpus.
import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}

// Split each message into lowercase words
val tokenizer = new Tokenizer()
  .setInputCol("message")
  .setOutputCol("words")
val wordsData = tokenizer.transform(df_select)
wordsData.show(3, false)
import org.apache.spark.ml.feature.CountVectorizer

// Fit a CountVectorizer to build the vocabulary and produce term-count vectors
val count = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
val model = count.fit(wordsData)
val featurizedData = model.transform(wordsData)
featurizedData.show(3, false)
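Building on the IDF description above, a minimal sketch of rescaling the term counts might look like the following; it assumes the featurizedData produced in the previous step, and the output column name "features" is illustrative.

import org.apache.spark.ml.feature.IDF

// Fit IDF on the raw term counts; the resulting IDFModel rescales each feature,
// down-weighting terms that appear in many documents
val idf = new IDF()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("message", "features").show(3, false)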