Latent Dirichlet allocation or LDA
LDA is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and generates an LDAModel as the base model.
import org.apache.spark.ml.clustering.LDA
// Loads data.
val dataset = spark.read.format("libsvm")
  .load("file:///opt/spark/data/mllib/sample_libsvm_data.txt")
// Trains an LDA model with 10 topics and at most 10 iterations.
val lda = new LDA().setK(10).setMaxIter(10)
val model = lda.fit(dataset)
val ll = model.logLikelihood(dataset)
val lp = model.logPerplexity(dataset)
println(s"The lower bound on the log likelihood of the entire corpus: $ll")
println(s"The upper bound on perplexity: $lp")
// Describe topics.
val topics = model.describeTopics(3)
println("The topics described by their top-weighted terms:")
topics.show(false)
/*
Output:
The lower bound on the log likelihood of the entire corpus: -1.2892413081774298E7
The upper bound on perplexity: 5.296387332859924
The topics described by their top-weighted terms:
+-----+---------------+--------------------------------------------------------------------+
|topic|termIndices    |termWeights                                                         |
+-----+---------------+--------------------------------------------------------------------+
|0    |[569, 597, 598]|[0.01048331300016006, 0.010116199318706288, 0.009445101538413367]   |
|1    |[233, 261, 205]|[0.022233380425792586, 0.01683403182240119, 0.014646518135972245]   |
|2    |[342, 343, 553]|[0.01101987068888023, 0.010051896006494202, 0.009974954658255184]   |
|3    |[125, 124, 331]|[0.010249053484287323, 0.008001789628260321, 0.007856951022221307]  |
|4    |[406, 434, 378]|[0.016726468396808178, 0.016551662166306314, 0.016312669466501947]  |
|5    |[301, 272, 538]|[0.011187985574975348, 0.01026802560070681, 0.009910574054908557]   |
|6    |[265, 237, 181]|[0.016311101183295176, 0.01450491494274881, 0.013849316888254096]   |
|7    |[542, 514, 682]|[0.0426212232584461, 0.040669536800267865, 0.04004669879586029]     |
|8    |[48, 420, 421] |[0.001968951371791888, 0.0018823651925661982, 0.0018553426747176778]|
|9    |[664, 637, 465]|[0.04727237035523583, 0.04361701605039732, 0.03568842133530933]     |
+-----+---------------+--------------------------------------------------------------------+
*/
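The example above leaves the optimizer at its default. As a minimal sketch of choosing between the two optimizers named in the description, the snippet below sets the optimizer by name through setOptimizer, which accepts "online" and "em". The helper asDistributed is a hypothetical name introduced only for illustration and is not part of Spark. It relies on the assumption that a model fitted with the EM optimizer is a DistributedLDAModel, while the online optimizer yields a local model.

```scala
import org.apache.spark.ml.clustering.{DistributedLDAModel, LDA, LDAModel}

// Select the optimizer by name; "online" and "em" are the supported values.
val onlineLda = new LDA().setK(10).setMaxIter(10).setOptimizer("online")
val emLda = new LDA().setK(10).setMaxIter(10).setOptimizer("em")

// Hypothetical helper (not a Spark API): returns the model as a
// DistributedLDAModel when it is one, and None otherwise.
def asDistributed(model: LDAModel): Option[DistributedLDAModel] = model match {
  case distributed: DistributedLDAModel => Some(distributed)
  case _ => None
}
```

With the dataset loaded in the example above, asDistributed(emLda.fit(dataset)) would be expected to return a defined value, and asDistributed(onlineLda.fit(dataset)) would be expected to return None.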