PrefixSpan is a sequential pattern mining algorithm described in Pei et al., Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach. We refer the reader to that paper for a formal definition of the sequential pattern mining problem.
The PrefixSpan implementation takes the following parameters:
minSupport: the minimum support required for a sequence to be considered a frequent sequential pattern, given as a fraction of the total number of input sequences.
maxPatternLength: the maximum length of a frequent sequential pattern. Any frequent pattern exceeding this length will not be included in the results.
maxLocalProjDBSize: the maximum number of items allowed in a prefix-projected database before local iterative processing of the projected database begins. This parameter should be tuned with respect to the size of your executors.
sequenceCol: the name of the sequence column in the dataset (default "sequence"). Rows with nulls in this column are ignored.
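As a rough illustration of how minSupport could translate into a concrete number of sequences, the sketch below assumes that the value is a fraction of the input size and that the resulting threshold is rounded up. Both are assumptions made for the example, and the names MinSupportSketch and minCount are hypothetical, not part of the library.

```scala
object MinSupportSketch {
  // Assumption: minSupport is a fraction in [0, 1] and the threshold is
  // rounded up to a whole number of sequences.
  def minCount(minSupport: Double, numSequences: Long): Long =
    math.ceil(minSupport * numSequences).toLong

  def main(args: Array[String]): Unit = {
    // With 3 input sequences and minSupport = 0.5, a pattern would need to
    // occur in at least 2 of them under the assumption above.
    println(minCount(0.5, 3L))
  }
}
```

Under these assumptions, lowering minSupport admits rarer patterns and typically enlarges both the result and the projected databases that have to be processed.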
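import org.apache.spark.ml.fpm.PrefixSpan
import spark.implicits._ // enables toDF; assumes a SparkSession named spark
val smallTestData = Seq(
  Seq(Seq(1, 2), Seq(3)),
  Seq(Seq(1), Seq(3, 2), Seq(1, 2)),
  Seq(Seq(1, 2), Seq(5)))
val df = smallTestData.toDF("sequence")
// Parameter values here are assumed; they are consistent with the output shown.
val result = new PrefixSpan().setMinSupport(0.5).setMaxPatternLength(5)
  .findFrequentSequentialPatterns(df)
result.show()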
+----------+----+
|  sequence|freq|
+----------+----+
|     [[3]]|   2|
|     [[2]]|   3|
|     [[1]]|   3|
|  [[1, 2]]|   3|
|[[1], [3]]|   2|
+----------+----+
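
The freq values above can be checked by hand. The following is a minimal sketch in plain Scala (no Spark) that counts how many input sequences contain a given pattern, where a pattern matches if each of its itemsets is a subset of a distinct itemset of the sequence, in the same order. The object SupportCheck and its helpers are hypothetical names introduced for illustration; this is a brute-force check, not the PrefixSpan algorithm itself.

```scala
object SupportCheck {
  type Itemset = Set[Int]

  // Greedy earliest match: walk the sequence once and consume the next
  // pattern itemset whenever it is a subset of the current sequence itemset.
  def containsPattern(sequence: Seq[Itemset], pattern: Seq[Itemset]): Boolean =
    sequence.foldLeft(pattern) { (remaining, itemset) =>
      if (remaining.nonEmpty && remaining.head.subsetOf(itemset)) remaining.tail
      else remaining
    }.isEmpty

  // Number of sequences in the database that contain the pattern.
  def support(db: Seq[Seq[Itemset]], pattern: Seq[Itemset]): Int =
    db.count(containsPattern(_, pattern))

  // The same three sequences as in the example data.
  val db: Seq[Seq[Itemset]] = Seq(
    Seq(Set(1, 2), Set(3)),
    Seq(Set(1), Set(3, 2), Set(1, 2)),
    Seq(Set(1, 2), Set(5)))

  def main(args: Array[String]): Unit = {
    println(support(db, Seq(Set(1, 2))))      // 3, matches [[1, 2]]
    println(support(db, Seq(Set(1), Set(3)))) // 2, matches [[1], [3]]
    println(support(db, Seq(Set(2), Set(3)))) // 1, so it is absent from the table
  }
}
```

The last call shows why a pattern such as [[2], [3]] does not appear in the output: only the first sequence has an itemset containing 3 strictly after an itemset containing 2.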