
scikit-learn: dataset preprocessing (clean, reduce, expand, and generate feature representations)

Published: 2015-07-17 10:09:01


Reference: http://scikit-learn.org/stable/data_transforms.html


This post covers data preprocessing, in four parts:

data cleaning, dimensionality reduction (PCA-style), dimensionality expansion (kernel-style), and custom feature extraction. Honestly, preprocessing is where the attention pays off.

The key sentence, left in the original English: scikit-learn provides a library of transformers, which may clean (see Preprocessing data), reduce (see Unsupervised dimensionality reduction), expand (see Kernel Approximation) or generate (see Feature extraction) feature representations.
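To make the four kinds of transformer concrete, here is a minimal sketch (not from the original post) showing one representative of each: StandardScaler to clean, PCA to reduce, Nystroem to expand, and CountVectorizer to generate features from raw text. The toy arrays and documents are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler            # clean: scale features
from sklearn.decomposition import PCA                       # reduce: dimensionality reduction
from sklearn.kernel_approximation import Nystroem           # expand: approximate kernel feature map
from sklearn.feature_extraction.text import CountVectorizer # generate: bag-of-words from text

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])

X_clean = StandardScaler().fit_transform(X)             # zero mean, unit variance per column
X_reduced = PCA(n_components=1).fit_transform(X)        # 2 features -> 1 principal component
X_expanded = Nystroem(n_components=3).fit_transform(X)  # 2 features -> 3 kernel features

docs = ["hello world", "hello scikit-learn"]
X_text = CountVectorizer().fit_transform(docs)          # sparse token-count matrix

print(X_clean.shape, X_reduced.shape, X_expanded.shape, X_text.shape)
```

All four follow the same fit/transform interface, which is what lets them be chained freely in a Pipeline.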


The difference between fit, transform, and fit_transform:

fit: learns the model's parameters from the training set (e.g., mean, variance, median; for text, possibly a vocabulary).

transform: maps training/test data into the representation defined by the parameters learned in fit (e.g., scaling the test set with the training set's statistics, or vectorizing the test set against the vocabulary learned by fit).

fit_transform: performs fit and transform in a single step.

Like other estimators, these are represented by classes with fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method which applies this transformation model to unseen data. fit_transform may be more convenient and efficient for modelling and transforming the training data simultaneously.
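The distinction above can be sketched with StandardScaler: the mean and standard deviation are learned from the training data only, and the same transformation is then applied to unseen test data (the numbers below are made up for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0], [2.0], [4.0]])  # training mean 2.0
X_test = np.array([[2.0], [6.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform in one call
X_test_scaled = scaler.transform(X_test)        # reuse the *training* mean/std

print(scaler.mean_)           # learned from X_train only
print(X_test_scaled.ravel())  # test data centered with the training statistics
```

Calling fit_transform on the test set would be a mistake: it would re-learn the statistics from the test data instead of reusing those of the training set.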


The eight main sections (translations to be updated gradually):

4.1. Pipeline and FeatureUnion: combining estimators

4.1.1. Pipeline: chaining estimators

4.1.2. FeatureUnion: composite feature spaces

4.2. Feature extraction

4.2.1. Loading features from dicts

4.2.2. Feature hashing

4.2.3. Text feature extraction

4.2.4. Image feature extraction

4.3. Preprocessing data

4.3.1. Standardization, or mean removal and variance scaling

4.3.2. Normalization

4.3.3. Binarization

4.3.4. Encoding categorical features

4.3.5. Imputation of missing values

4.4. Unsupervised dimensionality reduction

4.4.1. PCA: principal component analysis

4.4.2. Random projections

4.4.3. Feature agglomeration

4.5. Random Projection

4.5.1. The Johnson-Lindenstrauss lemma

4.5.2. Gaussian random projection

4.5.3. Sparse random projection

4.6. Kernel Approximation

4.6.1. Nystroem Method for Kernel Approximation

4.6.2. Radial Basis Function Kernel

4.6.3. Additive Chi Squared Kernel

4.6.4. Skewed Chi Squared Kernel

4.6.5. Mathematical Details

4.7. Pairwise metrics, Affinities and Kernels

4.7.1. Cosine similarity

4.7.2. Linear kernel

4.7.3. Polynomial kernel

4.7.4. Sigmoid kernel

4.7.5. RBF kernel

4.7.6. Chi-squared kernel

4.8. Transforming the prediction target (y)

4.8.1. Label binarization

4.8.2. Label encoding
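As a taste of section 4.1, here is a hedged sketch (random toy data, not from the original post) of a Pipeline chaining a preprocessing step (4.3) with a dimensionality-reduction step (4.4) into a single estimator:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(10, 5)  # 10 samples, 5 features

pipe = Pipeline([
    ("scale", StandardScaler()),   # 4.3 Preprocessing data
    ("pca", PCA(n_components=2)),  # 4.4 Unsupervised dimensionality reduction
])

X_reduced = pipe.fit_transform(X)  # fits both steps in order, then transforms
print(X_reduced.shape)             # (10, 2)
```

Because the pipeline itself exposes fit/transform, it can be dropped anywhere a single transformer is expected, such as in cross-validation.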




Copyright notice: this is an original article by the author; reproduction without the author's permission is prohibited.


Original article: http://blog.csdn.net/mmc2015/article/details/46917287
