CountVectorizer converts text documents to vectors of term counts. IDF: IDF is an Estimator which is fit on a dataset and produces an IDFModel. The IDFModel takes feature vectors (generally created from HashingTF or CountVectorizer) and scales each feature. Intuitively, it down-weights features which appear frequently in a corpus.
import{RegexTokenizer, Tokenizer}
val tokenizer = new Tokenizer().setInputCol("message")
val wordsData = tokenizer.transform(df_select), false)
val count = new CountVectorizer().setInputCol("words")
val model =
val featurizedData = model.transform(wordsData),false)