
scikit-learn: dataset preprocessing (clean, reduce, expand, and generate feature representations)

Published: 2015-07-17 10:09:01


Reference: http://scikit-learn.org/stable/data_transforms.html


This post covers data preprocessing, in four parts:

data cleaning, dimensionality reduction (PCA-style), dimensionality expansion (kernel-style), and custom feature extraction. Honestly, preprocessing is where the attention pays off.

The key sentence, left in the original English: scikit-learn provides a library of transformers, which may clean (see Preprocessing data), reduce (see Unsupervised dimensionality reduction), expand (see Kernel Approximation) or generate (see Feature extraction) feature representations.
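To make the four kinds of transformer concrete, here is a minimal sketch (not from the original post) showing one representative of each: StandardScaler to clean, PCA to reduce, Nystroem to expand, and CountVectorizer to generate features from raw text. The toy arrays and documents are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler            # clean: scale features
from sklearn.decomposition import PCA                       # reduce: dimensionality reduction
from sklearn.kernel_approximation import Nystroem           # expand: approximate kernel feature map
from sklearn.feature_extraction.text import CountVectorizer # generate: bag-of-words from text

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])

X_clean = StandardScaler().fit_transform(X)             # zero mean, unit variance per column
X_reduced = PCA(n_components=1).fit_transform(X)        # 2 features -> 1 principal component
X_expanded = Nystroem(n_components=3).fit_transform(X)  # 2 features -> 3 kernel features

docs = ["hello world", "hello scikit-learn"]
X_text = CountVectorizer().fit_transform(docs)          # sparse token-count matrix

print(X_clean.shape, X_reduced.shape, X_expanded.shape, X_text.shape)
```

All four follow the same fit/transform interface, which is what lets them be chained freely in a Pipeline.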


The difference between fit, transform, and fit_transform:

fit: learns the model's parameters from the training set (e.g., mean, variance, median; for text, possibly a vocabulary).

transform: maps training/test data into the representation defined by the parameters learned in fit (e.g., scaling the test set with the training set's statistics, or vectorizing the test set against the vocabulary learned by fit).

fit_transform: performs fit and transform in a single step.

Like other estimators, these are represented by classes with fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method which applies this transformation model to unseen data. fit_transform may be more convenient and efficient for modelling and transforming the training data simultaneously.
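The distinction above can be sketched with StandardScaler: the mean and standard deviation are learned from the training data only, and the same transformation is then applied to unseen test data (the numbers below are made up for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0], [2.0], [4.0]])  # training mean 2.0
X_test = np.array([[2.0], [6.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform in one call
X_test_scaled = scaler.transform(X_test)        # reuse the *training* mean/std

print(scaler.mean_)           # learned from X_train only
print(X_test_scaled.ravel())  # test data centered with the training statistics
```

Calling fit_transform on the test set would be a mistake: it would re-learn the statistics from the test data instead of reusing those of the training set.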


The eight main sections (translations to be updated gradually):

4.1. Pipeline and FeatureUnion: combining estimators

4.1.1. Pipeline: chaining estimators

4.1.2. FeatureUnion: composite feature spaces

4.2. Feature extraction

4.2.1. Loading features from dicts

4.2.2. Feature hashing

4.2.3. Text feature extraction

4.2.4. Image feature extraction

4.3. Preprocessing data

4.3.1. Standardization, or mean removal and variance scaling

4.3.2. Normalization

4.3.3. Binarization

4.3.4. Encoding categorical features

4.3.5. Imputation of missing values

4.4. Unsupervised dimensionality reduction

4.4.1. PCA: principal component analysis

4.4.2. Random projections

4.4.3. Feature agglomeration

4.5. Random Projection

4.5.1. The Johnson-Lindenstrauss lemma

4.5.2. Gaussian random projection

4.5.3. Sparse random projection

4.6. Kernel Approximation

4.6.1. Nystroem Method for Kernel Approximation

4.6.2. Radial Basis Function Kernel

4.6.3. Additive Chi Squared Kernel

4.6.4. Skewed Chi Squared Kernel

4.6.5. Mathematical Details

4.7. Pairwise metrics, Affinities and Kernels

4.7.1. Cosine similarity

4.7.2. Linear kernel

4.7.3. Polynomial kernel

4.7.4. Sigmoid kernel

4.7.5. RBF kernel

4.7.6. Chi-squared kernel

4.8. Transforming the prediction target (y)

4.8.1. Label binarization

4.8.2. Label encoding
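As a taste of section 4.1, here is a hedged sketch (random toy data, not from the original post) of a Pipeline chaining a preprocessing step (4.3) with a dimensionality-reduction step (4.4) into a single estimator:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(10, 5)  # 10 samples, 5 features

pipe = Pipeline([
    ("scale", StandardScaler()),   # 4.3 Preprocessing data
    ("pca", PCA(n_components=2)),  # 4.4 Unsupervised dimensionality reduction
])

X_reduced = pipe.fit_transform(X)  # fits both steps in order, then transforms
print(X_reduced.shape)             # (10, 2)
```

Because the pipeline itself exposes fit/transform, it can be dropped anywhere a single transformer is expected, such as in cross-validation.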




Copyright notice: this is an original article by the author; reproduction without the author's permission is prohibited.


Original article: http://blog.csdn.net/mmc2015/article/details/46917287
