PrefixSpan

PrefixSpan is a sequential pattern mining algorithm described in Pei et al., Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach. We refer the reader to the referenced paper for a formal definition of the sequential pattern mining problem.
spark.ml's PrefixSpan implementation takes the following parameters:
minSupport: the minimum support required for a sequential pattern to be considered frequent, given as a fraction of the input sequences (0.5 in the example below).
maxPatternLength: the maximum length of a frequent sequential pattern. Any frequent pattern exceeding this length will not be included in the results.
maxLocalProjDBSize: the maximum number of items allowed in a prefix-projected database before local iterative processing of the projected database begins. This parameter should be tuned with respect to the size of your executors.
sequenceCol: the name of the sequence column in the dataset (default "sequence"). Rows with nulls in this column are ignored.
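To make the minSupport setting concrete, here is a minimal sketch of how a fractional support might translate into an absolute number of sequences. The helper minCount is hypothetical and not part of the Spark API, and the rounding rule (fraction times number of sequences, rounded up) is an assumption about how the threshold is applied.

```scala
// Hypothetical helper, not part of the Spark API.
// Assumption: a pattern is frequent when it occurs in at least
// ceil(minSupport * numSequences) input sequences.
def minCount(minSupport: Double, numSequences: Long): Long = {
  require(minSupport >= 0.0 && minSupport <= 1.0, "minSupport must be in [0, 1]")
  math.ceil(minSupport * numSequences).toLong
}

// With 4 input sequences and minSupport = 0.5, a pattern needs 2 occurrences.
val threshold = minCount(0.5, 4L)
```

Under this reading, lowering minSupport admits rarer patterns and generally increases both the result size and the amount of work done per projected database.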
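import org.apache.spark.ml.fpm.PrefixSpan
// Needed for toDF; assumes a SparkSession named `spark` is in scope (as in spark-shell).
import spark.implicits._

// Each row is one sequence: an ordered list of itemsets.
val smallTestData = Seq(
  Seq(Seq(1, 2), Seq(3)),
  Seq(Seq(1), Seq(3, 2), Seq(1, 2)),
  Seq(Seq(1, 2), Seq(5)),
  Seq(Seq(6)))

val df = smallTestData.toDF("sequence")

// Keep the resulting DataFrame in `result`; show() itself returns Unit.
val result = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  .setMaxLocalProjDBSize(32000000)
  .findFrequentSequentialPatterns(df)

result.show()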
/*
Output:
+----------+----+
|  sequence|freq|
+----------+----+
|     [[3]]|   2|
|     [[2]]|   3|
|     [[1]]|   3|
|  [[1, 2]]|   3|
|[[1], [3]]|   2|
+----------+----+
*/
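The frequencies in the output can be checked by hand. The following is a brute-force sketch in plain Scala (no Spark) that counts how many of the example sequences contain a given pattern. The names Sequence, containsPattern, and support are hypothetical helpers introduced for illustration. This is not how PrefixSpan itself works, since the algorithm grows patterns from prefix-projected databases rather than testing candidates one by one.

```scala
// A sequence is an ordered list of itemsets, as in the "sequence" column above.
type Sequence = Seq[Set[Int]]

// True if `pattern` is a subsequence of `sequence`: each pattern itemset must be
// contained in a distinct itemset of the sequence, in the same order.
def containsPattern(sequence: Sequence, pattern: Sequence): Boolean =
  pattern.isEmpty || {
    // Greedily match the first pattern itemset at its earliest possible position.
    val idx = sequence.indexWhere(itemset => pattern.head.subsetOf(itemset))
    idx >= 0 && containsPattern(sequence.drop(idx + 1), pattern.tail)
  }

// Number of sequences in `data` that contain `pattern`.
def support(data: Seq[Sequence], pattern: Sequence): Int =
  data.count(containsPattern(_, pattern))

// Same data as smallTestData, with itemsets represented as sets.
val data: Seq[Sequence] = Seq(
  Seq(Set(1, 2), Set(3)),
  Seq(Set(1), Set(3, 2), Set(1, 2)),
  Seq(Set(1, 2), Set(5)),
  Seq(Set(6)))

val oneThenThree = support(data, Seq(Set(1), Set(3))) // 2: matches [[1], [3]] in the output
val oneThenTwo = support(data, Seq(Set(1), Set(2)))   // 1: below the threshold, so not reported
```

With four input sequences and a minSupport of 0.5, a pattern must appear in at least two sequences, which is why [[1], [3]] is listed while [[1], [2]] is not.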