Tags: data science, overfitting
Main content:
————————————————————————————————————
An overfitted model generally fails to generalize to new data (generalization).
Fitting a model involves a trade-off between two directions: underfitting and overfitting.
Note that if the same data is used for both training and testing, the evaluation is meaningless — a pure lookup "Table Model" that memorizes the training data would score perfectly.
In practice we split the data into training and testing (holdout) sets.
Note: in Python this can be done with scikit-learn (the old sklearn.cross_validation module has been removed; use sklearn.model_selection instead):
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)
How to detect overfitting
Overfitting discussed in terms of two model types
In two dimensions, any two points can be fit exactly by a line.
In three dimensions, any three points can be fit exactly by a plane.
And so on in higher dimensions.
With enough attributes a model can fit the training data exactly, so overfitting comes easily; reducing the number of attributes is one way to guard against it.
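A minimal NumPy sketch of the point above (the data values are made up for illustration): with as many free parameters as training points, a linear model hits every point exactly, so zero training error says nothing about generalization.

```python
import numpy as np

# A line has 2 free parameters (slope, intercept), so it can pass
# exactly through any 2 points -- zero training error by construction.
pts_x = np.array([1.0, 3.0])
pts_y = np.array([2.0, 8.0])

# Solve [x 1] @ [slope, intercept] = y exactly.
A = np.vstack([pts_x, np.ones_like(pts_x)]).T
slope, intercept = np.linalg.solve(A, pts_y)

# Training error is (numerically) zero, yet nothing was learned
# about points outside the training set.
train_error = np.abs(A @ np.array([slope, intercept]) - pts_y).max()
print(slope, intercept, train_error)
```

The same argument scales up: d + 1 parameters fit any d + 1 generic points, which is why adding attributes without restraint invites overfitting.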
Comparing SVM and logistic regression
An SVM is less sensitive to individual examples than logistic regression.
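One way to see why, sketched below: the SVM's hinge loss is exactly zero for any example beyond the margin, while logistic regression's log loss is positive everywhere, so every example exerts at least some pull on the logistic fit. (The margin values here are made up for illustration.)

```python
import numpy as np

# Per-example loss as a function of the margin y * f(x).
margins = np.array([-1.0, 0.0, 1.0, 3.0, 10.0])

# Hinge loss (SVM): flat at zero once an example clears the margin.
hinge = np.maximum(0.0, 1.0 - margins)

# Log loss (logistic regression): strictly positive for every margin.
log_loss = np.log1p(np.exp(-margins))

print(hinge)     # [2. 1. 0. 0. 0.]
print(log_loss)  # positive even at margin 10
```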
The drawbacks of overfitting
Further analysis of detecting overfitting
Further idea: building a modeling "laboratory"
Learning curves
Learning curves of logistic regression and decision trees
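A sketch of computing such learning curves with scikit-learn's learning_curve; the breast-cancer dataset and the train-size grid are arbitrary choices for illustration, not from the original post:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Holdout accuracy as a function of training-set size for the two
# model families discussed above.
X, y = load_breast_cancer(return_X_y=True)
models = [make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
          DecisionTreeClassifier(random_state=0)]
for model in models:
    sizes, train_scores, test_scores = learning_curve(
        model, X, y, train_sizes=np.linspace(0.2, 1.0, 4), cv=5)
    print(type(model).__name__, test_scores.mean(axis=1).round(3))
```

Plotting the mean test score against `sizes` gives the familiar learning-curve picture: logistic regression tends to do better on small samples, trees catch up as data grows.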
Ways to avoid overfitting in tree induction
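For tree induction specifically, scikit-learn exposes pre-pruning controls such as max_depth and min_samples_leaf; a sketch on an arbitrary dataset (the hyperparameter values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unconstrained tree: grows until it memorizes the training set.
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Pre-pruned tree: limit depth and require a minimum leaf size.
pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                                random_state=0).fit(X_tr, y_tr)

print("full  :", full.score(X_tr, y_tr), full.score(X_te, y_te))
print("pruned:", pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))
```

The unconstrained tree scores perfectly on its own training data, the hallmark of memorization; the pruned tree trades a little training accuracy for a simpler, more general model.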
Nested cross-validation
Tr: training data; Te: testing data
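A sketch of nested cross-validation in scikit-learn, assuming a decision tree whose max_depth is tuned; the dataset and parameter grid are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: pick max_depth using only the training folds (Tr).
inner = GridSearchCV(DecisionTreeClassifier(random_state=0),
                     {"max_depth": [2, 4, 8]}, cv=3)

# Outer loop: estimate generalization of the *whole* tuning procedure
# on folds (Te) that the inner search never saw.
scores = cross_val_score(inner, X, y, cv=5)
print(scores.mean().round(3))
```

The key point is that hyperparameters are never chosen using the outer test folds, so the outer score is an honest estimate of generalization.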
Regularization
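A sketch of L2 regularization strength in scikit-learn's LogisticRegression (smaller C means a stronger penalty and smaller coefficients); the dataset choice is illustrative:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Fit the same model with weak (C=100) and strong (C=0.01) L2 penalty
# and compare the total size of the learned coefficients.
norms = {}
for C in (100.0, 0.01):
    clf = make_pipeline(StandardScaler(),
                        LogisticRegression(C=C, max_iter=5000)).fit(X, y)
    norms[C] = float(np.abs(clf[-1].coef_).sum())
print(norms)
```

Shrinking the coefficients constrains model complexity, which is exactly the overfitting-avoidance mechanism this heading refers to.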
The problem of multiple comparisons is the underlying reason for overfitting
Sidebar: Beware of “multiple comparisons”

Consider the following scenario. You run an investment firm. Five years ago, you wanted to have some marketable small-cap mutual fund products to sell, but your analysts had been awful at picking small-cap stocks. So you undertook the following procedure. You started 1,000 different mutual funds, each including a small set of stocks randomly chosen from those that make up the Russell 2000 index (the main index for small-cap stocks). Your firm invested in all 1,000 of these funds, but told no one about them.

Now, five years later, you look at their performance. Since they have different stocks in them, they will have had different returns. Some will be about the same as the index, some will be worse, and some will be better. The best one might be a lot better. Now, you liquidate all the funds but the best few, and you present these to the public. You can “honestly” claim that their 5-year return is substantially better than the return of the Russell 2000 index.

So, what’s the problem? The problem is that you randomly chose the stocks! You have no idea whether the stocks in these “best” funds performed better because they indeed are fundamentally better, or because you cherry-picked the best from a large set that simply varied in performance. If you flip 1,000 fair coins many times each, one of them will have come up heads much more than 50% of the time. However, choosing that coin as the “best” of the coins for later flipping obviously is silly.

These are instances of “the problem of multiple comparisons,” a very important statistical phenomenon that business analysts and data scientists should always keep in mind. Beware whenever someone does many tests and then picks the results that look good. Statistics books will warn against running multiple statistical hypothesis tests, and then looking at the ones that give “significant” results. These usually violate the assumptions behind the statistical tests, and the actual significance of the results is dubious.
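The sidebar's coin-flipping version of multiple comparisons can be simulated directly; the 1,000 coins and 100 flips per coin are the only assumptions here.

```python
import numpy as np

# 1,000 fair coins, 100 flips each: draw the heads count for each coin.
rng = np.random.default_rng(0)
heads = rng.binomial(n=100, p=0.5, size=1000)

# The average coin is unremarkable, but the *best* coin looks far
# better than 50% heads purely by chance -- cherry-picking it as
# "skilled" is the multiple-comparisons mistake.
print("average heads rate:", heads.mean() / 100)
print("best coin's rate  :", heads.max() / 100)
```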
Overfitting and Its Avoidance [Summary]
Original post: http://blog.csdn.net/u014135091/article/details/48054057