PrefixSpan
PrefixSpan is a sequential pattern mining algorithm described in Pei et al., Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach. See the paper for a formal definition of the sequential pattern mining problem.
spark.ml's PrefixSpan implementation takes the following parameters:
minSupport: the minimum support a sequential pattern must have to be considered frequent.
maxPatternLength: the maximum length of a frequent sequential pattern. Any frequent pattern exceeding this length will not be included in the results.
maxLocalProjDBSize: the maximum number of items allowed in a prefix-projected database before local iterative processing of the projected database begins. This parameter should be tuned with respect to the size of your executors.
sequenceCol: the name of the sequence column in the dataset (default "sequence"). Rows with nulls in this column are ignored.
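To make the first two parameters concrete, here is a minimal sketch of how they could act as filters on a candidate pattern. It is not Spark's code: `ThresholdSketch` and its methods are hypothetical names, and it rests on two assumptions that should be checked against the Spark API documentation: that a fractional minSupport is turned into an absolute count by rounding up, and that pattern length is the total number of items across all itemsets.

```scala
// Hypothetical helpers for illustration only; none of these names exist in spark.ml.
object ThresholdSketch {
  type Pattern = Seq[Seq[Int]]

  // Assumption: the fractional minSupport maps to an absolute count by rounding up.
  def minCount(minSupport: Double, numSequences: Long): Long =
    math.ceil(minSupport * numSequences).toLong

  // Assumption: pattern length is the total number of items over all itemsets.
  def patternLength(p: Pattern): Int = p.map(_.size).sum

  // A pattern is reported only if it is frequent enough and short enough.
  def keep(p: Pattern, freq: Long, numSequences: Long,
           minSupport: Double, maxPatternLength: Int): Boolean =
    freq >= minCount(minSupport, numSequences) &&
      patternLength(p) <= maxPatternLength
}
```

Under these assumptions, with four input sequences and minSupport 0.5, a pattern must occur in at least two sequences to be kept, and a pattern such as [[1, 2], [3]] counts as length 3.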
import org.apache.spark.ml.fpm.PrefixSpan

// Assumes a SparkSession named `spark` is in scope (as in spark-shell);
// its implicits provide toDF on local collections.
import spark.implicits._

val smallTestData = Seq(
  Seq(Seq(1, 2), Seq(3)),
  Seq(Seq(1), Seq(3, 2), Seq(1, 2)),
  Seq(Seq(1, 2), Seq(5)),
  Seq(Seq(6)))

val df = smallTestData.toDF("sequence")

// Mine the frequent sequential patterns and keep the resulting DataFrame.
val result = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  .setMaxLocalProjDBSize(32000000)
  .findFrequentSequentialPatterns(df)

// show() returns Unit, so call it separately rather than assigning its result.
result.show()

/*
Output:
+----------+----+
|  sequence|freq|
+----------+----+
|     [[3]]|   2|
|     [[2]]|   3|
|     [[1]]|   3|
|  [[1, 2]]|   3|
|[[1], [3]]|   2|
+----------+----+
*/
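The freq values in the output can be checked by hand with a brute-force count. The sketch below is plain Scala with no Spark dependency, and `SupportCheck` and its members are hypothetical names. It is not how PrefixSpan works internally (the algorithm grows patterns over prefix-projected databases); it only restates what "a sequence contains a pattern" means: each itemset of the pattern must be a subset of some itemset of the sequence, in the same order.

```scala
// Hypothetical brute-force check; not part of spark.ml and not the PrefixSpan algorithm.
object SupportCheck {
  type Itemset = Set[Int]

  // True if `pattern` occurs in `sequence`: every pattern itemset is a subset of
  // a later and later sequence itemset. Matching the earliest fit is sufficient.
  def contains(sequence: Seq[Itemset], pattern: Seq[Itemset]): Boolean =
    pattern match {
      case Seq() => true
      case head +: tail =>
        val i = sequence.indexWhere(s => head.subsetOf(s))
        i >= 0 && contains(sequence.drop(i + 1), tail)
    }

  // Number of sequences in the database that contain the pattern.
  def support(db: Seq[Seq[Itemset]], pattern: Seq[Itemset]): Int =
    db.count(contains(_, pattern))

  // The same four sequences as smallTestData, with itemsets written as sets.
  val db: Seq[Seq[Itemset]] = Seq(
    Seq(Set(1, 2), Set(3)),
    Seq(Set(1), Set(3, 2), Set(1, 2)),
    Seq(Set(1, 2), Set(5)),
    Seq(Set(6)))
}
```

For example, `SupportCheck.support(SupportCheck.db, Seq(Set(1), Set(3)))` evaluates to 2, which matches the [[1], [3]] row. The pattern [[2], [3]] occurs only in the first sequence, so its support is 1 and it does not appear in the output for minSupport 0.5 over four sequences.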