# FeatureHasher

Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). This is done using the hashing trick to map features to indices in the feature vector.

The FeatureHasher transformer operates on multiple columns. Each column may contain either numeric or categorical features. Behavior and handling of column data types is as follows:

Numeric columns: For numeric features, the hash value of the column name is used to map the feature value to its index in the feature vector. By default, numeric features are not treated as categorical (even when they are integers). To treat them as categorical, specify the relevant columns using the categoricalCols parameter. String columns: For categorical features, the hash value of the string “column\_name=value” is used to map to the vector index, with an indicator value of 1.0. Thus, categorical features are “one-hot” encoded (similarly to using OneHotEncoder with dropLast=false).

Boolean columns: Boolean values are treated in the same way as string columns. That is, boolean features are represented as “column\_name=true” or “column\_name=false”, with an indicator value of 1.0.

Null (missing) values are ignored (implicitly zero in the resulting feature vector).

The hash function used here is also the MurmurHash 3 used in HashingTF. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the vector indices.

Examples

Assume that we have a DataFrame with 4 input columns real, bool, stringNum, and string. These different data types as input will illustrate the behavior of the transform to produce a column of feature vectors.

| real | bool  | stringNum | string |
| ---- | ----- | --------- | ------ |
| 2.2  | true  | 1         | foo    |
| 3.3  | false | 2         | bar    |
| 4.4  | false | 3         | baz    |
| 5.5  | false | 4         | foo    |

Then the output of FeatureHasher.transform on this DataFrame is:

| real | bool  | stringNum | string | features                                                  |
| ---- | ----- | --------- | ------ | --------------------------------------------------------- |
| 2.2  | true  | 1         | foo    | (262144,\[51871, 63643,174475,253195],\[1.0,1.0,2.2,1.0]) |
| 3.3  | false | 2         | bar    | (262144,\[6031,  80619,140467,174475],\[1.0,1.0,1.0,3.3]) |
| 4.4  | false | 3         | baz    | (262144,\[24279,140467,174475,196810],\[1.0,1.0,4.4,1.0]) |
| 5.5  | false | 4         | foo    | (262144,\[63643,140467,168512,174475],\[1.0,1.0,1.0,5.5]) |

The resulting feature vectors could then be passed to a learning algorithm.

```
import org.apache.spark.ml.feature.FeatureHasher

val dataset = spark.createDataFrame(Seq(
  (2.2, true, "1", "foo"),
  (3.3, false, "2", "bar"),
  (4.4, false, "3", "baz"),
  (5.5, false, "4", "foo")
)).toDF("real", "bool", "stringNum", "string")

val hasher = new FeatureHasher()
  .setInputCols("real", "bool", "stringNum", "string")
  .setOutputCol("features")

val featurized = hasher.transform(dataset)
featurized.show(false)

/*
Output:
+----+-----+---------+------+--------------------------------------------------------+
|real|bool |stringNum|string|features                                                |
+----+-----+---------+------+--------------------------------------------------------+
|2.2 |true |1        |foo   |(262144,[174475,247670,257907,262126],[2.2,1.0,1.0,1.0])|
|3.3 |false|2        |bar   |(262144,[70644,89673,173866,174475],[1.0,1.0,1.0,3.3])  |
|4.4 |false|3        |baz   |(262144,[22406,70644,174475,187923],[1.0,1.0,4.4,1.0])  |
|5.5 |false|4        |foo   |(262144,[70644,101499,174475,257907],[1.0,1.0,5.5,1.0]) |
+----+-----+---------+------+--------------------------------------------------------+

*/
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://george-jen.gitbook.io/data-science-and-apache-spark/featurehasher.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
