ChiSqSelector
ChiSqSelector stands for Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the Chi-Squared test of independence to decide which features to choose. It supports five selection methods: numTopFeatures, percentile, fpr, fdr, fwe:
numTopFeatures chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
percentile is similar to numTopFeatures but chooses a fraction of all features instead of a fixed number.
fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
fdr uses the Benjamini-Hochberg procedure to choose all features whose false discovery rate is below a threshold.
fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
By default, the selection method is numTopFeatures, with the default number of top features set to 50. The user can choose a selection method using setSelectorType.
Examples
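As a rough illustration of how the five selection rules described above differ, the following plain-Python sketch applies each rule to a list of per-feature p-values. It is not the library's implementation: the function select_features, its parameter names (num_top_features, percentile, threshold), and the illustrative defaults for percentile and threshold are assumptions made for this example. Only the default of 50 top features comes from the text above.

```python
def select_features(p_values, selector_type="numTopFeatures",
                    num_top_features=50, percentile=0.1, threshold=0.05):
    """Return the sorted indices of the features kept by one selection rule.

    Hypothetical helper for illustration only; names and the percentile and
    threshold defaults are assumptions, not the library's API.
    """
    n = len(p_values)
    # Rank feature indices from smallest to largest p-value.
    # sorted() is stable, so ties keep their original index order.
    ranked = sorted(range(n), key=lambda i: p_values[i])

    if selector_type == "numTopFeatures":
        # Keep a fixed number of the best-ranked features.
        chosen = ranked[:num_top_features]
    elif selector_type == "percentile":
        # Keep a fraction of all features instead of a fixed number.
        chosen = ranked[:int(n * percentile)]
    elif selector_type == "fpr":
        # Keep every feature whose p-value is below the threshold.
        chosen = [i for i in ranked if p_values[i] < threshold]
    elif selector_type == "fdr":
        # Benjamini-Hochberg step-up: find the largest 1-based rank k with
        # p_(k) <= threshold * k / n, then keep the k smallest p-values.
        k = 0
        for rank, i in enumerate(ranked, start=1):
            if p_values[i] <= threshold * rank / n:
                k = rank
        chosen = ranked[:k]
    elif selector_type == "fwe":
        # Same as fpr, but with the threshold scaled by 1 / numFeatures.
        chosen = [i for i in ranked if p_values[i] < threshold / n]
    else:
        raise ValueError("unknown selector type: " + selector_type)
    return sorted(chosen)


# Made-up p-values for five features, used only to compare the rules.
example_p_values = [0.001, 0.045, 0.025, 0.2, 0.012]
for method in ["numTopFeatures", "percentile", "fpr", "fdr", "fwe"]:
    kept = select_features(example_p_values, method,
                           num_top_features=2, percentile=0.5)
    print(method, kept)
```

On these made-up p-values, fpr keeps the most features, fdr is stricter, and fwe is the strictest, which matches the order in which the three thresholds tighten.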
Assume that we have a DataFrame with the columns id, features, and clicked, which is used as our target to be predicted:
id | features | clicked |
7 | [0.0, 0.0, 18.0, 1.0] | 1.0 |
8 | [0.0, 1.0, 12.0, 0.0] | 0.0 |
9 | [1.0, 0.0, 15.0, 0.1] | 0.0 |
If we use ChiSqSelector with numTopFeatures = 1, then according to our label clicked the last column in our features is chosen as the most useful feature:
id | features | clicked | selectedFeatures |
7 | [0.0, 0.0, 18.0, 1.0] | 1.0 | [1.0] |
8 | [0.0, 1.0, 12.0, 0.0] | 0.0 | [0.0] |
9 | [1.0, 0.0, 15.0, 0.1] | 0.0 | [0.1] |
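To make the underlying test more concrete, the sketch below computes a Pearson chi-squared statistic and its degrees of freedom for each feature column of the three rows above, treating every distinct feature value as its own category. This is a minimal standard-library illustration, not the library's code: the helper chi_squared is a hypothetical name, and turning the statistic into a p-value is left out because it needs the chi-squared distribution function, which the standard library does not provide.

```python
from collections import Counter


def chi_squared(feature_values, labels):
    """Pearson chi-squared statistic and degrees of freedom for one feature.

    Hypothetical helper for illustration: builds the contingency table of
    feature value versus label and compares observed with expected counts.
    """
    n = len(labels)
    observed = Counter(zip(feature_values, labels))  # joint counts
    value_totals = Counter(feature_values)           # row totals
    label_totals = Counter(labels)                   # column totals

    statistic = 0.0
    for value, value_total in value_totals.items():
        for label, label_total in label_totals.items():
            # Count expected in this cell if feature and label were independent.
            expected = value_total * label_total / n
            statistic += (observed[(value, label)] - expected) ** 2 / expected

    dof = (len(value_totals) - 1) * (len(label_totals) - 1)
    return statistic, dof


# The three rows from the table above.
rows = [
    [0.0, 0.0, 18.0, 1.0],
    [0.0, 1.0, 12.0, 0.0],
    [1.0, 0.0, 15.0, 0.1],
]
clicked = [1.0, 0.0, 0.0]

# Transpose the rows so that each entry is one feature column.
columns = list(zip(*rows))
results = [chi_squared(column, clicked) for column in columns]
for index, (statistic, dof) in enumerate(results):
    print(index, round(statistic, 4), dof)
```

With only three rows, several features can end up with the same statistic under this calculation, so the ranking on a table this small can depend on how ties are broken.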