PrefixSpan

PrefixSpan is a sequential pattern mining algorithm described in Pei et al., Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach. We refer the reader to that paper for a formal definition of the sequential pattern mining problem.

spark.ml's PrefixSpan implementation takes the following parameters:

minSupport: the minimum support a sequential pattern must have to be considered frequent, expressed as a fraction of the input sequences that contain the pattern.

maxPatternLength: the maximum length of a frequent sequential pattern. Any frequent pattern exceeding this length will not be included in the results.

maxLocalProjDBSize: the maximum number of items allowed in a prefix-projected database before local iterative processing of the projected database begins. This parameter should be tuned with respect to the size of your executors.

sequenceCol: the name of the sequence column in the dataset (default "sequence"). Rows with nulls in this column are ignored.
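To make the notion of support concrete, the following is a minimal plain-Scala sketch (no Spark required) of how the support of a single candidate pattern could be counted. It is not Spark's implementation: the names `SupportSketch`, `contains`, `support`, and `minCount` are hypothetical, and the rounding rule in `minCount` (ceiling of `minSupport` times the number of sequences) is an assumption about how a fractional threshold maps to a count.

```scala
object SupportSketch {
  type Itemset = Set[Int]
  type Sequence = Seq[Itemset]

  // True if `pattern` occurs in `sequence`: each pattern itemset must be a
  // subset of some sequence itemset, and the matched itemsets must appear
  // in the same order. Greedy earliest matching is sufficient here.
  def contains(sequence: Sequence, pattern: Sequence): Boolean = {
    var pos = 0
    pattern.forall { p =>
      val idx = sequence.indexWhere(s => p.subsetOf(s), pos)
      if (idx >= 0) { pos = idx + 1; true } else false
    }
  }

  // Support as an absolute count: the number of sequences containing the pattern.
  def support(db: Seq[Sequence], pattern: Sequence): Int =
    db.count(contains(_, pattern))

  // Assumed conversion from a fractional minSupport to a minimum count.
  def minCount(minSupport: Double, numSequences: Long): Long =
    math.ceil(minSupport * numSequences).toLong

  // A small hypothetical database of four sequences.
  val db: Seq[Sequence] = Seq(
    Seq(Set(1, 2), Set(3)),
    Seq(Set(1), Set(3, 2), Set(1, 2)),
    Seq(Set(1, 2), Set(5)),
    Seq(Set(6)))
}
```

Under these assumptions, `SupportSketch.support(SupportSketch.db, Seq(Set(1), Set(3)))` evaluates to 2 and `SupportSketch.minCount(0.5, 4)` evaluates to 2, so the pattern [[1], [3]] would count as frequent at a minSupport of 0.5. The pattern [[2], [3]] is contained in only one sequence and would not.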

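import org.apache.spark.ml.fpm.PrefixSpan
// Needed for toDF; assumes an active SparkSession named `spark`, as in spark-shell.
import spark.implicits._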

val smallTestData = Seq(
  Seq(Seq(1, 2), Seq(3)),
  Seq(Seq(1), Seq(3, 2), Seq(1, 2)),
  Seq(Seq(1, 2), Seq(5)),
  Seq(Seq(6)))

val df = smallTestData.toDF("sequence")

// Mine the frequent sequential patterns; the result is a DataFrame
// with the columns "sequence" and "freq".
val result = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  .setMaxLocalProjDBSize(32000000)
  .findFrequentSequentialPatterns(df)

result.show()

/*
Output:
+----------+----+
|  sequence|freq|
+----------+----+
|     [[3]]|   2|
|     [[2]]|   3|
|     [[1]]|   3|
|  [[1, 2]]|   3|
|[[1], [3]]|   2|
+----------+----+
*/
