
scikit-learn (models used relatively often in engineering): 1.13. Feature selection


Tags: machine learning, scikit-learn, feature selection, engineering practice

Reference: http://scikit-learn.org/stable/modules/feature_selection.html


The classes in the sklearn.feature_selection module can be used for feature selection/dimensionality reduction on sample sets, either to improve estimators’ accuracy scores or to boost their performance on very high-dimensional datasets.



1. Removing features with low variance

VarianceThreshold is a simple baseline approach to feature selection: it removes all features whose variance does not meet a given threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.

Suppose we want to remove every (boolean) feature that is either zero or one in more than 80% of the samples. Boolean features are Bernoulli random variables with variance Var[X] = p(1 - p), so we can use the threshold 0.8 * (1 - 0.8):

>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
>>> sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
>>> sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
The first column was removed: it is zero in 5 of the 6 samples, so p = 5/6 > 0.8 and its variance 5/6 * 1/6 ≈ 0.14 falls below the 0.16 threshold.



2. Univariate feature selection (I use this one a great deal)

Univariate feature selection selects features based on univariate statistical tests. The available strategies are:

  • SelectKBest removes all but the k highest scoring features

  • SelectPercentile removes all but a user-specified highest scoring percentage of features

  • using common univariate statistical tests for each feature: false positive rate SelectFpr, false discovery rate SelectFdr, or family wise error SelectFwe.

  • GenericUnivariateSelect allows performing univariate feature selection with a configurable strategy. This makes it possible to select the best univariate selection strategy with a hyper-parameter search estimator.
For example, we can run a chi-squared (chi2) test on the samples and keep only the two best-scoring features:

>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)

A few points to note:

1) These objects take as input a scoring function that returns univariate p-values: for regression, f_regression; for classification, chi2 or f_classif. Beware not to use a regression scoring function with a classification problem; you will get useless results (see the sketch after these notes).

2) Feature selection with sparse data: if you use sparse data (i.e. data represented as sparse matrices), only chi2 will deal with the data without making it dense.
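
As a further illustration of point 1), here is a minimal sketch of my own (not taken from the referenced page; the choice of SelectPercentile and percentile=50 is arbitrary), pairing a classification target with a classification scoring function:

>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectPercentile, f_classif
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> # f_classif (the ANOVA F-value) is a classification scoring function,
>>> # matching the discrete class labels in y; keep the top 50% of features
>>> X_new = SelectPercentile(f_classif, percentile=50).fit_transform(X, y)
>>> X_new.shape
(150, 2)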

Example: Univariate Feature Selection



3. Recursive feature elimination (RFE)

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features: starting from the full feature set, the features whose absolute weights are the smallest are pruned from the current set at each step, until the desired number of features is reached.


RFECV performs RFE in a cross-validation loop to find the optimal number of features.
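
The post gives no code for this section, so here is a minimal sketch of my own (the LinearSVC estimator, n_features_to_select=2 and step=1 are illustrative assumptions, not taken from the referenced page):

>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import RFE
>>> from sklearn.svm import LinearSVC
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> # at each iteration the feature with the smallest absolute weight
>>> # in the fitted LinearSVC is pruned, until only 2 features remain
>>> selector = RFE(LinearSVC(dual=False), n_features_to_select=2, step=1)
>>> X_new = selector.fit_transform(X, y)
>>> X_new.shape
(150, 2)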


4. L1-based feature selection

The sparsity-inducing effect of the L1 penalty is well known, and it can be used directly for feature selection:

>>> from sklearn.svm import LinearSVC
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = LinearSVC(C=0.01, penalty="l1", dual=False).fit_transform(X, y)
>>> X_new.shape
(150, 3)

With SVMs and logistic regression, the parameter C controls the sparsity: the smaller C, the fewer features selected. With Lasso, the higher the alpha parameter, the fewer features selected.
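
Note that the snippet above relies on the transform method that estimators exposed when this post was written; in more recent scikit-learn releases the same selection is usually expressed with SelectFromModel. A sketch, assuming a version that ships that class:

>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> from sklearn.svm import LinearSVC
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> # the L1 penalty drives some coefficients to exactly zero;
>>> # SelectFromModel keeps only the features with non-zero coefficients
>>> lsvc = LinearSVC(C=0.01, penalty="l1", dual=False)
>>> X_new = SelectFromModel(lsvc).fit_transform(X, y)
>>> X_new.shape
(150, 3)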





5. Tree-based feature selection (this one is also used quite a lot)

Tree-based estimators (see the sklearn.tree module and forest of trees in the sklearn.ensemble module) can be used to compute feature importances, which in turn can be used to discard irrelevant features:

>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> clf = ExtraTreesClassifier()
>>> X_new = clf.fit(X, y).transform(X)
>>> clf.feature_importances_  
array([ 0.04...,  0.05...,  0.4...,  0.4...])
>>> X_new.shape               
(150, 2)
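
As with the L1 example, calling transform on the fitted classifier is the old-style API; here is a hedged sketch of the equivalent with SelectFromModel (again assuming a release that provides it; n_estimators and random_state are arbitrary choices):

>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> clf = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)
>>> # prefit=True reuses the already-fitted forest; by default, features
>>> # whose importance falls below the mean importance are discarded
>>> model = SelectFromModel(clf, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 2)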



6. Feature selection as part of a pipeline

Feature selection is usually used as a pre-processing step before doing the actual learning. The recommended way to do this in scikit-learn is to use a sklearn.pipeline.Pipeline:

clf = Pipeline([
  ('feature_selection', LinearSVC(penalty="l1", dual=False)),
  ('classification', RandomForestClassifier())
])
clf.fit(X, y)

In this snippet we make use of a sklearn.svm.LinearSVC to evaluate feature importances and select the most relevant features. A sklearn.ensemble.RandomForestClassifier is then trained on the transformed output, i.e. using only the relevant features. You can perform similar operations with the other feature selection methods, and of course also with classifiers that provide a way to evaluate feature importances. See the sklearn.pipeline.Pipeline examples for more details.
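
For reference, a self-contained variant of the snippet above, written as a sketch under the assumption of a current scikit-learn release (SelectFromModel wraps the L1 LinearSVC instead of using the SVC directly as a transformer):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

clf = Pipeline([
    # the L1-penalised LinearSVC scores the features; SelectFromModel
    # keeps only those with non-zero coefficients
    ("feature_selection", SelectFromModel(LinearSVC(penalty="l1", dual=False))),
    # the classifier is then trained on the selected features only
    ("classification", RandomForestClassifier()),
])
clf.fit(X, y)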



Original article: http://blog.csdn.net/mmc2015/article/details/47333579
