# ChiSqSelector

ChiSqSelector stands for Chi-Squared feature selection. It operates on labeled data with categorical features, using the chi-squared test of independence between each feature and the label to decide which features to keep. It supports five selection methods: numTopFeatures, percentile, fpr, fdr, and fwe:

- numTopFeatures chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
- percentile is similar to numTopFeatures but chooses a fraction of all features instead of a fixed number.
- fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
- fdr uses the Benjamini-Hochberg procedure to choose all features whose false discovery rate is below a threshold.
- fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.

By default, the selection method is numTopFeatures, with the default number of top features set to 50. The user can choose a selection method using setSelectorType.

## Examples
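
As a minimal sketch of switching selection methods with setSelectorType, the snippet below configures one selector per non-default method. It assumes Spark's org.apache.spark.ml.feature.ChiSqSelector is on the classpath; the variable names and the threshold values are illustrative choices, not recommendations.

```scala
import org.apache.spark.ml.feature.ChiSqSelector

// Keep the best-ranked half of the features (0.5 is an illustrative fraction).
val byPercentile = new ChiSqSelector()
  .setSelectorType("percentile")
  .setPercentile(0.5)

// Keep every feature whose p-value is below the threshold.
val byFpr = new ChiSqSelector()
  .setSelectorType("fpr")
  .setFpr(0.05)

// Keep features chosen by the Benjamini-Hochberg procedure at this level.
val byFdr = new ChiSqSelector()
  .setSelectorType("fdr")
  .setFdr(0.05)

// Keep features whose p-value is below the threshold scaled by 1/numFeatures.
val byFwe = new ChiSqSelector()
  .setSelectorType("fwe")
  .setFwe(0.05)
```

Each selector still needs its feature, label, and output columns set before calling fit, exactly as in the numTopFeatures example on this page.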

Assume that we have a DataFrame with the columns id, features, and clicked, which is used as our target to be predicted:

| id | features               | clicked |
| -- | ---------------------- | ------- |
| 7  | \[0.0, 0.0, 18.0, 1.0] | 1.0     |
| 8  | \[0.0, 1.0, 12.0, 0.0] | 0.0     |
| 9  | \[1.0, 0.0, 15.0, 0.1] | 0.0     |

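If we use ChiSqSelector with numTopFeatures = 1, a single feature is kept according to our label clicked. The table below shows the last column of features being chosen. Note that the third and last columns score identically on these three rows (each takes three distinct values), so the choice between them comes down to tie-breaking, and the run recorded in the code listing further down keeps the third column instead: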

| id | features               | clicked | selectedFeatures |
| -- | ---------------------- | ------- | ---------------- |
| 7  | \[0.0, 0.0, 18.0, 1.0] | 1.0     | \[1.0]           |
| 8  | \[0.0, 1.0, 12.0, 0.0] | 0.0     | \[0.0]           |
| 9  | \[1.0, 0.0, 15.0, 0.1] | 0.0     | \[0.1]           |
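
To make the scoring behind this choice concrete, here is a minimal sketch in plain Scala (no Spark) of the Pearson chi-squared statistic for each feature column of the sample data against clicked. The helper chiSquared is a hypothetical name written for this illustration and treats every distinct value as a category; it is not Spark's implementation.

```scala
// Pearson chi-squared statistic and degrees of freedom for one categorical
// feature against the label, built from the observed contingency counts.
def chiSquared(feature: Seq[Double], label: Seq[Double]): (Double, Int) = {
  require(feature.length == label.length, "feature and label must align")
  val n = feature.length.toDouble
  val featureCounts = feature.groupBy(identity).map { case (v, xs) => v -> xs.size.toDouble }
  val labelCounts = label.groupBy(identity).map { case (v, xs) => v -> xs.size.toDouble }
  val observed = feature.zip(label).groupBy(identity).map { case (k, xs) => k -> xs.size.toDouble }

  // Sum (observed - expected)^2 / expected over every cell of the table.
  val statistic = (for {
    (f, fCount) <- featureCounts.toSeq
    (l, lCount) <- labelCounts.toSeq
  } yield {
    val expected = fCount * lCount / n
    val obs = observed.getOrElse((f, l), 0.0)
    math.pow(obs - expected, 2) / expected
  }).sum

  val degreesOfFreedom = (featureCounts.size - 1) * (labelCounts.size - 1)
  (statistic, degreesOfFreedom)
}

// The three sample rows, stored column by column.
val clicked = Seq(1.0, 0.0, 0.0)
val columns = Seq(
  Seq(0.0, 0.0, 1.0),
  Seq(0.0, 1.0, 0.0),
  Seq(18.0, 12.0, 15.0),
  Seq(1.0, 0.0, 0.1)
)

val scores = columns.map(chiSquared(_, clicked))
scores.zipWithIndex.foreach { case ((stat, dof), i) =>
  println(f"feature $i: statistic = $stat%.2f, degrees of freedom = $dof")
}
// feature 0: statistic = 0.75, degrees of freedom = 1
// feature 1: statistic = 0.75, degrees of freedom = 1
// feature 2: statistic = 3.00, degrees of freedom = 2
// feature 3: statistic = 3.00, degrees of freedom = 2
```

The third and fourth columns get the same statistic and degrees of freedom, and therefore the same p-value, so with only three rows which of the two ends up selected depends on how the implementation breaks the tie.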

```scala
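import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors

// Assumes an active SparkSession named `spark`, as provided by spark-shell.
// The implicits supply the encoders that createDataset needs.
import spark.implicits._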

val data = Seq(
  (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
  (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
  (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
)

val df = spark.createDataset(data).toDF("id", "features", "clicked")

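// Keep only the single best-ranked feature, using `clicked` as the label.
val selector = new ChiSqSelector()
  .setNumTopFeatures(1)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")

// fit runs the chi-squared tests; transform appends the selectedFeatures column.
val result = selector.fit(df).transform(df)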

println(s"ChiSqSelector output with top ${selector.getNumTopFeatures} features selected")
result.show()

/*
Output:
ChiSqSelector output with top 1 features selected
+---+------------------+-------+----------------+
| id|          features|clicked|selectedFeatures|
+---+------------------+-------+----------------+
|  7|[0.0,0.0,18.0,1.0]|    1.0|          [18.0]|
|  8|[0.0,1.0,12.0,0.0]|    0.0|          [12.0]|
|  9|[1.0,0.0,15.0,0.1]|    0.0|          [15.0]|
+---+------------------+-------+----------------+

*/
```
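
The five selection rules described at the top of this page can also be sketched in plain Scala, applied to a list of p-values. Everything here is an assumption made for illustration: the p-values are invented, the names are hypothetical, and details such as rounding the percentile count down or using strict comparisons are choices of this sketch rather than confirmed Spark behavior.

```scala
// Hypothetical p-values, one per feature index (not produced by Spark).
val pValues = Vector(0.200, 0.001, 0.700, 0.035, 0.012)
val m = pValues.length
val alpha = 0.05

// (pValue, featureIndex) pairs, most significant first.
val ranked = pValues.zipWithIndex.sortBy(_._1)

// Report the chosen feature indices in ascending order.
def indices(chosen: Seq[(Double, Int)]): Seq[Int] = chosen.map(_._2).sorted

// numTopFeatures: a fixed number of best-ranked features.
val topThree = indices(ranked.take(3))

// percentile: a fraction of all features (count rounded down here).
val topHalf = indices(ranked.take((m * 0.5).toInt))

// fpr: every feature whose p-value is below the threshold.
val byFpr = indices(ranked.filter(_._1 < alpha))

// fwe: same test, with the threshold scaled by 1/numFeatures.
val byFwe = indices(ranked.filter(_._1 < alpha / m))

// fdr (Benjamini-Hochberg): find the last rank k with p_(k) <= alpha * k / m
// and keep every feature up to and including that rank.
val lastPassing = ranked.zipWithIndex.lastIndexWhere {
  case ((p, _), rank) => p <= alpha * (rank + 1) / m
}
val byFdr = indices(ranked.take(lastPassing + 1))

println(s"numTopFeatures=3: $topThree") // Vector(1, 3, 4)
println(s"percentile=0.5:   $topHalf")  // Vector(1, 4)
println(s"fpr=0.05:         $byFpr")    // Vector(1, 3, 4)
println(s"fwe=0.05:         $byFwe")    // Vector(1)
println(s"fdr=0.05:         $byFdr")    // Vector(1, 4)
```

On these numbers the three threshold-based rules grow stricter in the order fpr, fdr, fwe: feature 3 passes the plain 0.05 cutoff but fails the Benjamini-Hochberg step, and only feature 1 survives the threshold scaled by 1/numFeatures.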

