# ChiSqSelector

ChiSqSelector stands for Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the Chi-Squared test of independence to decide which features to choose. It supports five selection methods: numTopFeatures, percentile, fpr, fdr, fwe:

numTopFeatures chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power. percentile is similar to numTopFeatures but chooses a fraction of all features instead of a fixed number. fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection. fdr uses the Benjamini-Hochberg procedure to choose all features whose false discovery rate is below a threshold. fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection. By default, the selection method is numTopFeatures, with the default number of top features set to 50. The user can choose a selection method using setSelectorType. Examples

Assume that we have a DataFrame with the columns id, features, and clicked, which is used as our target to be predicted:

| id | features               | clicked |
| -- | ---------------------- | ------- |
| 7  | \[0.0, 0.0, 18.0, 1.0] | 1.0     |
| 8  | \[0.0, 1.0, 12.0, 0.0] | 0.0     |
| 9  | \[1.0, 0.0, 15.0, 0.1] | 0.0     |

If we use ChiSqSelector with numTopFeatures = 1, then according to our label clicked the last column in our features is chosen as the most useful feature:

| id | features               | clicked | selectedFeatures |
| -- | ---------------------- | ------- | ---------------- |
| 7  | \[0.0, 0.0, 18.0, 1.0] | 1.0     | \[1.0]           |
| 8  | \[0.0, 1.0, 12.0, 0.0] | 0.0     | \[0.0]           |
| 9  | \[1.0, 0.0, 15.0, 0.1] | 0.0     | \[0.1]           |

```
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors

val data = Seq(
  (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
  (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
  (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
)

val df = spark.createDataset(data).toDF("id", "features", "clicked")

val selector = new ChiSqSelector()
  .setNumTopFeatures(1)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")

val result = selector.fit(df).transform(df)

println(s"ChiSqSelector output with top ${selector.getNumTopFeatures} features selected")
result.show()

/*
Output:
ChiSqSelector output with top 1 features selected
+---+------------------+-------+----------------+
| id|          features|clicked|selectedFeatures|
+---+------------------+-------+----------------+
|  7|[0.0,0.0,18.0,1.0]|    1.0|          [18.0]|
|  8|[0.0,1.0,12.0,0.0]|    0.0|          [12.0]|
|  9|[1.0,0.0,15.0,0.1]|    0.0|          [15.0]|
+---+------------------+-------+----------------+

*/
```
