MinHash for Jaccard Distance
MinHash is an LSH family for Jaccard distance where input features are sets of natural numbers. The Jaccard distance of two sets is defined by the cardinality of their intersection and union:

d(A, B) = 1 - |A ∩ B| / |A ∪ B|
MinHash applies a random hash function g to each element in the set and takes the minimum of all hashed values:

h(A) = min over a ∈ A of g(a)
The input sets for MinHash are represented as binary vectors, where the vector indices represent the elements themselves and the non-zero values in the vector represent the presence of those elements in the set. While both dense and sparse vectors are supported, sparse vectors are typically recommended for efficiency. For example, Vectors.sparse(10, Seq((2, 1.0), (3, 1.0), (5, 1.0))) means there are 10 elements in the space; this set contains elements 2, 3 and 5. All non-zero values are treated as binary "1" values.
Note: Empty sets cannot be transformed by MinHash, which means any input vector must have at least 1 non-zero entry.
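The two definitions above can be sketched in plain Scala, independent of Spark. The hash family g(x) = (a·(x + 1) + b) mod p with hard-coded coefficients below is an illustrative assumption for reproducibility, not Spark's exact implementation; the empty-set restriction also falls out naturally, since the minimum over an empty set is undefined.

```scala
object MinHashSketch extends App {
  // Two example sets of element indices.
  val A = Set(0, 1, 2)
  val B = Set(2, 3, 4)

  // Exact Jaccard distance: d(A, B) = 1 - |A ∩ B| / |A ∪ B|.
  val d = 1.0 - A.intersect(B).size.toDouble / A.union(B).size
  println(f"Jaccard distance: $d%.2f") // |A ∩ B| = 1, |A ∪ B| = 5, so d = 0.80

  // A few hash functions g(x) = (a * (x + 1) + b) mod p. The "+ 1" keeps
  // element 0 from hashing to a constant; the coefficients are fixed here
  // only so the output is reproducible (normally they would be random).
  val p = 2038074743L
  val coeffs = Seq((787L, 331L), (907L, 593L), (557L, 941L))

  // The MinHash signature: for each g, the minimum hashed value over the set.
  // A minimum over an empty set throws, hence the non-empty input requirement.
  def signature(s: Set[Int]): Seq[Long] =
    coeffs.map { case (a, b) => s.map(x => (a * (x + 1) + b) % p).min }

  println(s"Signature of A: ${signature(A)}")
  println(s"Signature of B: ${signature(B)}")
}
```

Sets that share more elements agree on more signature positions, which is what lets Spark bucket similar sets together across its hash tables.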
import org.apache.spark.ml.feature.MinHashLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val dfA = spark.createDataFrame(Seq(
  (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
  (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))),
  (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
)).toDF("id", "features")

val dfB = spark.createDataFrame(Seq(
  (3, Vectors.sparse(6, Seq((1, 1.0), (3, 1.0), (5, 1.0)))),
  (4, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (5, 1.0)))),
  (5, Vectors.sparse(6, Seq((1, 1.0), (2, 1.0), (4, 1.0))))
)).toDF("id", "features")

val key = Vectors.sparse(6, Seq((1, 1.0), (3, 1.0)))

val mh = new MinHashLSH()
  .setNumHashTables(5)
  .setInputCol("features")
  .setOutputCol("hashes")

val model = mh.fit(dfA)

// Feature transformation
println("The hashed dataset where hashed values are stored in the column 'hashes':")
model.transform(dfA).show()

// Compute the locality sensitive hashes for the input rows, then perform approximate
// similarity join.
// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
// `model.approxSimilarityJoin(transformedA, transformedB, 0.6)`
println("Approximately joining dfA and dfB on Jaccard distance smaller than 0.6:")
model.approxSimilarityJoin(dfA, dfB, 0.6, "JaccardDistance")
  .select(col("datasetA.id").alias("idA"),
    col("datasetB.id").alias("idB"),
    col("JaccardDistance")).show()

// Compute the locality sensitive hashes for the input rows, then perform approximate
// nearest neighbor search.
// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
// `model.approxNearestNeighbors(transformedA, key, 2)`
// It may return fewer than 2 rows when not enough approximate near-neighbor candidates
// are found.
println("Approximately searching dfA for 2 nearest neighbors of the key:")
model.approxNearestNeighbors(dfA, key, 2).show()

/*
Output:
The hashed dataset where hashed values are stored in the column 'hashes':
+---+--------------------+--------------------+
| id|            features|              hashes|
+---+--------------------+--------------------+
|  0|(6,[0,1,2],[1.0,1...|[[2.25592966E8], ...|
|  1|(6,[2,3,4],[1.0,1...|[[2.25592966E8], ...|
|  2|(6,[0,2,4],[1.0,1...|[[2.25592966E8], ...|
+---+--------------------+--------------------+

Approximately joining dfA and dfB on Jaccard distance smaller than 0.6:
+---+---+---------------+
|idA|idB|JaccardDistance|
+---+---+---------------+
|  1|  4|            0.5|
|  0|  5|            0.5|
|  1|  5|            0.5|
|  2|  5|            0.5|
+---+---+---------------+

Approximately searching dfA for 2 nearest neighbors of the key:
+---+--------------------+--------------------+-------+
| id|            features|              hashes|distCol|
+---+--------------------+--------------------+-------+
|  1|(6,[2,3,4],[1.0,1...|[[2.25592966E8], ...|   0.75|
+---+--------------------+--------------------+-------+
*/