Latent Dirichlet allocation (LDA)

Latent Dirichlet allocation (LDA) is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and generates an LDAModel as the base model.
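The choice between the two optimizers can be sketched as follows. This is a minimal illustration, not the canonical usage: it assumes a local SparkSession (the application name is arbitrary) and reuses the sample data path from the example in this section. In spark.ml the optimizer is selected by name with setOptimizer, where "online" is the default and "em" selects the EM optimizer.

```scala
import org.apache.spark.ml.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .appName("LDAOptimizerSketch") // arbitrary name chosen for this sketch
  .master("local[*]")
  .getOrCreate()

// Same sample file as the example in this section.
val dataset = spark.read.format("libsvm")
  .load("file:///opt/spark/data/mllib/sample_libsvm_data.txt")

// Online variational Bayes (the default optimizer) yields a local model.
val onlineModel = new LDA().setK(10).setMaxIter(10).setOptimizer("online").fit(dataset)

// Expectation-maximization yields a distributed model.
val emModel = new LDA().setK(10).setMaxIter(10).setOptimizer("em").fit(dataset)

println(s"online model is distributed: ${onlineModel.isDistributed}")
println(s"em model is distributed: ${emModel.isDistributed}")

// A model trained with EM can be cast to DistributedLDAModel and, if needed,
// converted to a local model.
emModel match {
  case distributed: DistributedLDAModel =>
    val local = distributed.toLocal
    println(s"converted model is distributed: ${local.isDistributed}")
  case _ =>
    println("model is already local")
}
```

Both calls return an LDAModel, so code that only needs describeTopics or transform does not have to care which optimizer produced the model.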

import org.apache.spark.ml.clustering.LDA

// Load the data.
val dataset = spark.read.format("libsvm")
  .load("file:///opt/spark/data/mllib/sample_libsvm_data.txt")

// Train an LDA model with 10 topics.
val lda = new LDA().setK(10).setMaxIter(10)
val model = lda.fit(dataset)

// Evaluate the model on the training corpus.
val ll = model.logLikelihood(dataset)
val lp = model.logPerplexity(dataset)
println(s"The lower bound on the log likelihood of the entire corpus: $ll")
println(s"The upper bound on perplexity: $lp")

// Describe each topic by its 3 top-weighted terms.
val topics = model.describeTopics(3)
println("The topics described by their top-weighted terms:")
topics.show(false)

/*
Output:

The lower bound on the log likelihood of the entire corpus: -1.2892413081774298E7
The upper bound on perplexity: 5.296387332859924
The topics described by their top-weighted terms:
+-----+---------------+--------------------------------------------------------------------+
|topic|termIndices    |termWeights                                                         |
+-----+---------------+--------------------------------------------------------------------+
|0    |[569, 597, 598]|[0.01048331300016006, 0.010116199318706288, 0.009445101538413367]   |
|1    |[233, 261, 205]|[0.022233380425792586, 0.01683403182240119, 0.014646518135972245]   |
|2    |[342, 343, 553]|[0.01101987068888023, 0.010051896006494202, 0.009974954658255184]   |
|3    |[125, 124, 331]|[0.010249053484287323, 0.008001789628260321, 0.007856951022221307]  |
|4    |[406, 434, 378]|[0.016726468396808178, 0.016551662166306314, 0.016312669466501947]  |
|5    |[301, 272, 538]|[0.011187985574975348, 0.01026802560070681, 0.009910574054908557]   |
|6    |[265, 237, 181]|[0.016311101183295176, 0.01450491494274881, 0.013849316888254096]   |
|7    |[542, 514, 682]|[0.0426212232584461, 0.040669536800267865, 0.04004669879586029]     |
|8    |[48, 420, 421] |[0.001968951371791888, 0.0018823651925661982, 0.0018553426747176778]|
|9    |[664, 637, 465]|[0.04727237035523583, 0.04361701605039732, 0.03568842133530933]     |
+-----+---------------+--------------------------------------------------------------------+


*/
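Beyond describing topics, a fitted model can also score documents. The sketch below is an illustrative follow-up rather than part of the original example: it assumes a local SparkSession and the same sample file, and relies on the default output column name topicDistribution that transform adds to the dataset.

```scala
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .appName("LDATransformSketch") // arbitrary name chosen for this sketch
  .master("local[*]")
  .getOrCreate()

val dataset = spark.read.format("libsvm")
  .load("file:///opt/spark/data/mllib/sample_libsvm_data.txt")

val model = new LDA().setK(10).setMaxIter(10).fit(dataset)

// transform appends one vector per document holding its mixture over the 10 topics.
val transformed = model.transform(dataset)
transformed.select("label", "topicDistribution").show(5, false)
```

Each topicDistribution vector has one entry per topic, so its length matches the value passed to setK.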
