Locality Sensitive Hashing (LSH) is an important class of hashing techniques, which is commonly used in clustering, approximate nearest neighbor search and outlier detection with large datasets.

The general idea of LSH is to use a family of functions ("LSH families") to hash data points into buckets, so that the data points which are close to each other are in the same buckets with high probability, while data points that are far away from each other are very likely in different buckets.

In Spark, different LSH families are implemented in separate classes (e.g., MinHash), and APIs for feature transformation, approximate similarity join and approximate nearest neighbor are provided in each class.

In LSH, we define a false positive as a pair of distant input features (with d(p,q) ≥ r2) which are hashed into the same bucket, and we define a false negative as a pair of nearby features (with d(p,q) ≤ r1) which are hashed into different buckets.
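The two definitions above can be expressed as a small check. This is a minimal sketch in plain Scala, not part of the Spark API: the object name `LshPairOutcome`, the function `classify`, and the assumption that r1 < r2 are all introduced here for illustration.

```scala
// Illustrative only: classifies one pair of points according to the
// false positive / false negative definitions given in the text.
object LshPairOutcome {
  sealed trait Outcome
  case object FalsePositive extends Outcome
  case object FalseNegative extends Outcome
  case object Neither extends Outcome

  // distance:   true distance d(p, q) between the two points
  // sameBucket: whether the two points were hashed into the same bucket
  // r1, r2:     the "near" and "far" thresholds (assumed r1 < r2)
  def classify(distance: Double, sameBucket: Boolean, r1: Double, r2: Double): Outcome = {
    require(r1 < r2, "this sketch assumes r1 < r2")
    if (sameBucket && distance >= r2) FalsePositive        // distant pair, same bucket
    else if (!sameBucket && distance <= r1) FalseNegative  // nearby pair, different buckets
    else Neither
  }
}
```

For example, with r1 = 1.0 and r2 = 3.0, a pair at distance 5.0 that shares a bucket is a false positive, and a pair at distance 0.5 that lands in different buckets is a false negative.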

LSH Operations

We describe the major types of operations which LSH can be used for. A fitted LSH model has methods for each of these operations.

Feature Transformation

Feature transformation is the basic functionality to add hashed values as a new column. This can be useful for dimensionality reduction. Users can specify input and output column names by setting inputCol and outputCol.

LSH also supports multiple LSH hash tables. Users can specify the number of hash tables by setting numHashTables. This is also used for OR-amplification in approximate similarity join and approximate nearest neighbor. Increasing the number of hash tables will increase the accuracy but will also increase communication cost and running time.
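The effect of OR-amplification can be illustrated with a short calculation. This is a sketch under a simplifying assumption that is not stated in the text: each hash table is treated as independent, and a given pair collides in any single table with the same probability p. The names `OrAmplification` and `collisionProbability` are hypothetical.

```scala
// Illustrative only: probability that a pair shares a bucket in at least one
// of numHashTables tables, assuming independent tables that each collide
// with probability p.
object OrAmplification {
  def collisionProbability(p: Double, numHashTables: Int): Double = {
    require(p >= 0.0 && p <= 1.0, "p must be a probability")
    require(numHashTables >= 1, "need at least one hash table")
    // The pair is missed only if it fails to collide in every table.
    1.0 - math.pow(1.0 - p, numHashTables)
  }
}
```

Under these assumptions, a nearby pair with p = 0.5 is found with probability 0.875 when three tables are used, so fewer true neighbors are missed. A distant pair with p = 0.1 collides with probability of about 0.271, so more candidate pairs must be compared, which is consistent with the higher communication cost and running time noted above.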

The type of outputCol is Seq[Vector] where the dimension of the array equals numHashTables, and the dimensions of the vectors are currently set to 1. In future releases, we will implement AND-amplification so that users can specify the dimensions of these vectors.

Approximate Similarity Join

Approximate similarity join takes two datasets and approximately returns pairs of rows in the datasets whose distance is smaller than a user-defined threshold. Approximate similarity join supports both joining two different datasets and self-joining. Self-joining will produce some duplicate pairs.

Approximate similarity join accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it will be transformed automatically. In this case, the hash signature will be created as outputCol.

In the joined dataset, the rows of the original datasets can be queried under the datasetA and datasetB columns. A distance column will be added to the output dataset to show the true distance between each pair of rows returned.

Approximate Nearest Neighbor Search

Approximate nearest neighbor search takes a dataset (of feature vectors) and a key (a single feature vector), and it approximately returns a specified number of rows in the dataset that are closest to the vector.

Approximate nearest neighbor search accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it will be transformed automatically. In this case, the hash signature will be created as outputCol.

A distance column will be added to the output dataset to show the true distance between each output row and the searched key.

Note: Approximate nearest neighbor search will return fewer than k rows when there are not enough candidates in the hash bucket.
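To see why fewer than k rows can come back, consider the following conceptual sketch in plain Scala. It is not Spark's implementation: it assumes that only rows sharing at least one hash value with the key are considered as candidates, and that candidates are then ranked by true Euclidean distance. The names `AnnSketch`, `approxNearest`, and the hash functions passed in are hypothetical.

```scala
// Conceptual sketch only: candidate filtering by shared bucket, followed by
// exact ranking of the surviving candidates.
object AnnSketch {
  type Point = Vector[Double]

  def euclidean(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (p, q) => (p - q) * (p - q) }.sum)

  // hashFunctions: one hash function per hash table (supplied by the caller)
  def approxNearest(
      data: Seq[Point],
      key: Point,
      k: Int,
      hashFunctions: Seq[Point => Long]): Seq[(Point, Double)] = {
    val keyHashes = hashFunctions.map(h => h(key))
    data
      // Keep rows that share a bucket with the key in at least one table.
      .filter(x => hashFunctions.zip(keyHashes).exists { case (h, kh) => h(x) == kh })
      // Compute the true distance for the remaining candidates only.
      .map(x => (x, euclidean(x, key)))
      .sortBy(_._2)
      // take(k) returns fewer than k rows if there are fewer candidates.
      .take(k)
  }
}
```

With a single toy hash function that buckets points by the floor of their first coordinate, the data (0.2, 0.0), (0.8, 0.1), (5.0, 5.0), (9.0, 9.0), and the key (0.5, 0.0), a request for k = 3 returns only the two rows in the key's bucket.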

LSH Algorithms

Bucketed Random Projection for Euclidean Distance

Bucketed Random Projection is an LSH family for Euclidean distance. The Euclidean distance is defined as follows:

\[ d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2} \]

Its LSH family projects feature vectors x onto a random unit vector v and partitions the projected results into hash buckets:

\[ h(\mathbf{x}) = \Big\lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \Big\rfloor \]

where r is a user-defined bucket length. The bucket length can be used to control the average size of hash buckets (and thus the number of buckets). A larger bucket length (i.e., fewer buckets) increases the probability of features being hashed to the same bucket (increasing the numbers of true and false positives).
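The effect of the bucket length can be checked by hand with a small sketch. It assumes the hash takes the form floor((x · v) / r), which matches the description of projecting onto a unit vector v and bucketing with length r. The object `BucketHashSketch` and the fixed vector used in the example are hypothetical, since in practice v is drawn at random.

```scala
// Illustrative only: a single bucketed random projection hash with a
// caller-supplied projection vector.
object BucketHashSketch {
  // x: feature vector, v: unit projection vector, r: bucket length
  def hash(x: Seq[Double], v: Seq[Double], r: Double): Long = {
    require(x.length == v.length, "x and v must have the same dimension")
    require(r > 0.0, "bucket length must be positive")
    val projection = x.zip(v).map { case (a, b) => a * b }.sum // dot product x . v
    math.floor(projection / r).toLong                          // bucket index
  }
}
```

With the unit vector v = (0.6, 0.8), the points (1.0, 1.0) and (1.0, 0.0) fall into buckets 2 and 1 when r = 0.5, but both fall into bucket 0 when r = 2.0. The larger bucket length makes the pair collide.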

Bucketed Random Projection accepts arbitrary vectors as input features, and supports both sparse and dense vectors.

```scala
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val dfA = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 1.0)),
  (1, Vectors.dense(1.0, -1.0)),
  (2, Vectors.dense(-1.0, -1.0)),
  (3, Vectors.dense(-1.0, 1.0))
)).toDF("id", "features")

val dfB = spark.createDataFrame(Seq(
  (4, Vectors.dense(1.0, 0.0)),
  (5, Vectors.dense(-1.0, 0.0)),
  (6, Vectors.dense(0.0, 1.0)),
  (7, Vectors.dense(0.0, -1.0))
)).toDF("id", "features")

val key = Vectors.dense(1.0, 0.0)

val brp = new BucketedRandomProjectionLSH()
  .setBucketLength(2.0)
  .setNumHashTables(3)
  .setInputCol("features")
  .setOutputCol("hashes")

val model = brp.fit(dfA)

// Feature Transformation
println("The hashed dataset where hashed values are stored in the column 'hashes':")
model.transform(dfA).show()

// Compute the locality sensitive hashes for the input rows, then perform approximate
// similarity join.
// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
// `model.approxSimilarityJoin(transformedA, transformedB, 1.5)`
println("Approximately joining dfA and dfB on Euclidean distance smaller than 1.5:")
model.approxSimilarityJoin(dfA, dfB, 1.5, "EuclideanDistance")
  .select(col("datasetA.id").alias("idA"),
    col("datasetB.id").alias("idB"),
    col("EuclideanDistance")).show()

// Compute the locality sensitive hashes for the input rows, then perform approximate nearest
// neighbor search.
// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
// `model.approxNearestNeighbors(transformedA, key, 2)`
println("Approximately searching dfA for 2 nearest neighbors of the key:")
model.approxNearestNeighbors(dfA, key, 2).show()

/*
Output:
The hashed dataset where hashed values are stored in the column 'hashes':
+---+-----------+--------------------+
| id|   features|              hashes|
+---+-----------+--------------------+
|  0|  [1.0,1.0]|[[0.0], [0.0], [-...|
|  1| [1.0,-1.0]|[[-1.0], [-1.0], ...|
|  2|[-1.0,-1.0]|[[-1.0], [-1.0], ...|
|  3| [-1.0,1.0]|[[0.0], [0.0], [-...|
+---+-----------+--------------------+

Approximately joining dfA and dfB on Euclidean distance smaller than 1.5:
+---+---+-----------------+
|idA|idB|EuclideanDistance|
+---+---+-----------------+
|  1|  4|              1.0|
|  0|  6|              1.0|
|  1|  7|              1.0|
|  3|  5|              1.0|
|  0|  4|              1.0|
|  3|  6|              1.0|
|  2|  7|              1.0|
|  2|  5|              1.0|
+---+---+-----------------+

Approximately searching dfA for 2 nearest neighbors of the key:
+---+----------+--------------------+-------+
| id|  features|              hashes|distCol|
+---+----------+--------------------+-------+
|  0| [1.0,1.0]|[[0.0], [0.0], [-...|    1.0|
|  1|[1.0,-1.0]|[[-1.0], [-1.0], ...|    1.0|
+---+----------+--------------------+-------+
*/
```