scikit-learn: 4. Dataset transformations (clean data, reduce dimensionality, expand dimensionality, generate features by feature extraction)

Date: 2017-07-24 10:07:54

Reference for this post: http://scikit-learn.org/stable/data_transforms.html

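This post covers data preprocessing, in four parts: data cleaning, dimensionality reduction (the PCA family), dimensionality expansion (the kernel family), and extraction of custom features.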

Haha. Paying attention to preprocessing is still the more dependable bet.


The key passage, quoted as-is from the documentation: scikit-learn provides a library of transformers, which may clean (see Preprocessing data), reduce (see Unsupervised dimensionality reduction), expand (see Kernel Approximation) or generate (see Feature extraction) feature representations.
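To make the four verbs concrete, here is a minimal sketch with one transformer per category. The specific classes picked (StandardScaler, PCA, RBFSampler, CountVectorizer) are my own choice of representatives, not something the quoted passage names, and the data and documents are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.kernel_approximation import RBFSampler
from sklearn.feature_extraction.text import CountVectorizer

# Made-up numeric data: 20 samples, 5 features.
rng = np.random.RandomState(0)
X = rng.rand(20, 5)

# clean: standardize each feature; the number of columns is unchanged.
X_clean = StandardScaler().fit_transform(X)

# reduce: project onto 2 principal components.
X_reduced = PCA(n_components=2).fit_transform(X)

# expand: approximate an RBF kernel feature map with 50 random features.
X_expanded = RBFSampler(n_components=50, random_state=0).fit_transform(X)

# generate: build numeric features from raw text (made-up documents).
docs = ["the cat sat", "the dog sat", "the cat ran"]
X_generated = CountVectorizer().fit_transform(docs)

print(X_clean.shape, X_reduced.shape, X_expanded.shape, X_generated.shape)
```

The printed shapes show the effect of each category: same width, narrower, wider, and a new feature matrix with one column per vocabulary word.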

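The difference between fit, transform, and fit_transform:

fit: learns the model's parameters from the training set (for example the variance or the median; it may also be a vocabulary).

transform: converts training-set or test-set data into the representation defined by the parameters learned in fit (for example, scaling the test set with the variance and median learned in fit, or turning test-set documents into vectors over the vocabulary obtained by fit).

fit_transform: performs fit and transform in a single step.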

Like other estimators, these are represented by classes with fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method which applies this transformation model to unseen data. fit_transform may be more convenient and efficient for modelling and transforming the training data simultaneously.
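The distinction above can be sketched as follows. This is a minimal illustration, assuming StandardScaler as the example for learned statistics and CountVectorizer as the example for a learned vocabulary; the toy arrays and sentences are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer

# Made-up data: one feature, separate training and test sets.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()

# fit: learn the mean and standard deviation from the training set only.
scaler.fit(X_train)

# transform: apply the statistics learned in fit, also to unseen data.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# fit_transform: fit and transform the training data in one call.
X_train_scaled_2 = StandardScaler().fit_transform(X_train)

# The same pattern with a vocabulary instead of statistics.
vectorizer = CountVectorizer()
vectorizer.fit(["the cat sat", "the dog sat"])   # learns: cat, dog, sat, the
# "bird" was not seen during fit, so transform ignores it.
X_new = vectorizer.transform(["the bird sat"]).toarray()

print(scaler.mean_, X_test_scaled, X_new)
```

Note that the test set is never passed to fit: it is scaled with the training mean and standard deviation, which is why its transformed value does not come out centred on zero.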

There are eight major sections. Translations will be added gradually:

4.1. Pipeline and FeatureUnion: combining estimators

4.1.1. Pipeline: chaining estimators

4.1.2. FeatureUnion: composite feature spaces

For the translated article, see: http://blog.csdn.net/mmc2015/article/details/46991465

4.2. Feature extraction

4.2.3. Text feature extraction

For the translated article, see: http://blog.csdn.net/mmc2015/article/details/46997379

4.2.4. Image feature extraction

For the translated article, see: http://blog.csdn.net/mmc2015/article/details/46992105

4.3. Preprocessing data

For the translated article, see: http://blog.csdn.net/mmc2015/article/details/47016313