Notes : <Hands-on ML with Sklearn & TF> Chapter 7

If you aggregate the predictions of a group of predictors, you will often get better predictions than with the best individual predictor.
A group of predictors is called an ensemble.
This chapter discusses the most popular ensemble methods, including bagging, boosting, stacking, and a few others.

 

Voting Classifiers

  1. voting=hard uses majority voting; voting=soft averages the predicted class probabilities of all the individual classifiers
  2. By the law of large numbers, the ensemble usually performs better than any individual classifier, and the more independent the classifiers are from one another, the better the ensemble performs
In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)  #chapter 5

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression(random_state=42)
rnd_clf = RandomForestClassifier(random_state=42)
svm_clf = SVC(probability=True, random_state=42)  # voting='soft' requires predict_proba(); set probability=True so SVC provides it

voting_clf = VotingClassifier(
        estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
        voting='soft'
    )
voting_clf.fit(X_train, y_train)  # fitting the VotingClassifier fits all of the underlying estimators

from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
 
LogisticRegression 0.864
RandomForestClassifier 0.872
SVC 0.888
VotingClassifier 0.912
 

Bagging and Pasting

You can use completely different algorithms to get a diverse set of classifiers, or you can use the same algorithm trained on different random subsets of the training set to get different predictors.

  1. The training set is resampled so that each subset stands in for the whole set. Sampling with replacement is called bagging; sampling without replacement is called pasting
  2. Each individual predictor trained on a resampled subset has a higher bias than one trained on the full training set, but aggregating their predictions reduces both bias and variance
  3. The individual predictors can be trained in parallel
 

Bagging and pasting in scikit-learn

In [2]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=True, n_jobs=-1)

bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
In [3]:
accuracy_score(y_test, y_pred)
Out[3]:
0.91200000000000003
In [4]:
past_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=False, n_jobs=-1)

past_clf.fit(X_train, y_train)
y_pred = past_clf.predict(X_test)
accuracy_score(y_test, y_pred)
Out[4]:
0.91200000000000003
 

Bagging ends up with a slightly higher bias than pasting because bootstrapping introduces a bit more diversity in the subsets each predictor is trained on; the predictors are therefore less correlated, so the ensemble's variance is reduced.

 

Out-of-Bag Evaluation

  1. With bagging, only about 63% of the training instances are sampled on average for each predictor; the remaining roughly $\frac{1}{e}\approx 0.368$ are never sampled and are called out-of-bag (oob) instances
  2. These oob instances can be used to evaluate the ensemble without a separate validation set
In [5]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=True, n_jobs=-1, oob_score=True)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_
Out[5]:
0.92266666666666663
In [6]:
y_pred=bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)
Out[6]:
0.91200000000000003
In [7]:
bag_clf.oob_decision_function_[2]
Out[7]:
array([ 0.99744898,  0.00255102])
 

Random Patches and Random Subspaces

  1. When the input has many feature dimensions, the features can be sampled as well, controlled by max_features and bootstrap_features, which work the same way as max_samples and bootstrap
  2. Sampling both training instances and features is called Random Patches
  3. Keeping all training instances but sampling features is called Random Subspaces
  4. Feature sampling trades a bit more bias for a lower variance (a short sketch of these hyperparameters follows this list)
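
A minimal sketch of those hyperparameters (not from the book's notebook), reusing the moons X_train/y_train from above; with only two input features the feature sampling here is purely illustrative, and the classifier names patches_clf / subspaces_clf are my own.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Random Patches: sample both training instances and features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=0.7, bootstrap=True,            # instance sampling
    max_features=0.5, bootstrap_features=True,  # feature sampling
    n_jobs=-1)
patches_clf.fit(X_train, y_train)

# Random Subspaces: keep all instances, sample only the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=1.0, bootstrap=False,           # all training instances
    max_features=0.5, bootstrap_features=True,  # feature sampling only
    n_jobs=-1)
subspaces_clf.fit(X_train, y_train)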
 

Random Forest

  1. use RandomForestClassifier
In [9]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf=RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)
In [10]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(splitter='random', max_leaf_nodes=16), n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)  # roughly equivalent to the random forest above
bag_clf.fit(X_train, y_train)
In [11]:
accuracy_score(y_test, y_pred_rf)
Out[11]:
0.91200000000000003
In [12]:
accuracy_score(y_test, bag_clf.predict(X_test))
Out[12]:
0.91200000000000003
 

Extra-Trees

  1. when growing a tree in a random forest, at each node only a random subset of the features is considered for splitting.
  2. it is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds.
  3. use the ExtraTreesClassifier class to create an Extra-Trees classifier; it has the same API as RandomForestClassifier.
  4. It is hard to tell in advance which of the two will perform better; for a given problem, compare them with cross-validation (a minimal sketch follows this list).
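
A minimal sketch, assuming the moons X_train/X_test split and the rnd_clf forest from above; the names ext_clf / y_pred_ext and the hyperparameter values are my own choices, not the book's.

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

# same API as RandomForestClassifier
ext_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42)
ext_clf.fit(X_train, y_train)
y_pred_ext = ext_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_ext))

# item 4: compare the two with cross-validation rather than guessing
print(cross_val_score(ext_clf, X_train, y_train, cv=5).mean())
print(cross_val_score(rnd_clf, X_train, y_train, cv=5).mean())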
 

Feature Importance

  1. important features are likely to appear closer to the root of a tree
  2. the feature_importances_ attribute gives each feature's importance, based on the average depth at which it appears across all trees in the forest
In [13]:
from sklearn.datasets import load_iris
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris['data'], iris['target'])
for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
    print(name, score)
 
sepal length (cm) 0.102324356672
sepal width (cm) 0.0257240474133
petal length (cm) 0.439143949318
petal width (cm) 0.432807646597
In [14]:
from six.moves import urllib
from sklearn.datasets import fetch_mldata
try:
    mnist = fetch_mldata('MNIST original')
except urllib.error.HTTPError as ex:
    print("Could not download MNIST data from mldata.org, trying alternative...")

    # Alternative method to load MNIST, if mldata.org is down
    from scipy.io import loadmat
    mnist_alternative_url = "https://github.com/amplab/datascience-sp14/raw/master/lab7/mldata/mnist-original.mat"
    mnist_path = "./mnist-original.mat"
    response = urllib.request.urlopen(mnist_alternative_url)
    with open(mnist_path, "wb") as f:
        content = response.read()
        f.write(content)
    mnist_raw = loadmat(mnist_path)
    mnist = {
        "data": mnist_raw["data"].T,
        "target": mnist_raw["label"][0],
        "COL_NAMES": ["label", "data"],
        "DESCR": "mldata.org dataset: mnist-original",
    }
    print("Success!")
In [15]:
rnd_clf = RandomForestClassifier(random_state=42)
rnd_clf.fit(mnist["data"], mnist["target"])
Out[15]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False)
In [20]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap = matplotlib.cm.hot,
               interpolation="nearest")
    plt.axis("off")
In [21]:
plot_digit(rnd_clf.feature_importances_)

cbar = plt.colorbar(ticks=[rnd_clf.feature_importances_.min(), rnd_clf.feature_importances_.max()])
cbar.ax.set_yticklabels(['Not important', 'Very important'])

plt.show()
 
[Figure: MNIST pixel importances (rnd_clf.feature_importances_) plotted as a 28x28 heat map with a 'Not important' to 'Very important' colorbar]
 

Boosting

  1. combine several weak learners into a strong learner
  2. predictors are trained sequentially, each trying to correct its predecessor
 

Adaptive Boosting

  1. each new predictor pays more attention to the training instances that its predecessor underfitted.
  2. For example, a second classifier is trained using the updated weights from the first, it again makes predictions on the training set, the weights are updated again, and so on.
In [24]:
import numpy as np
m = len(X_train)

from matplotlib.colors import ListedColormap

def plot_decision_boundary(clf, X, y, axes=[-1.5, 2.5, -1, 1.5], alpha=0.5, contour=True):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap, linewidth=10)
    if contour:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", alpha=alpha)
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", alpha=alpha)
    plt.axis(axes)
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.ylabel(r"$x_2$", fontsize=18, rotation=0)

plt.figure(figsize=(11, 4))
for subplot, learning_rate in ((121, 1), (122, 0.5)):
    sample_weights = np.ones(m)
    for i in range(5):
        plt.subplot(subplot)
        svm_clf = SVC(kernel="rbf", C=0.05)
        svm_clf.fit(X_train, y_train, sample_weight=sample_weights)
        y_pred = svm_clf.predict(X_train)
        sample_weights[y_pred != y_train] *= (1 + learning_rate)
        plot_decision_boundary(svm_clf, X, y, alpha=0.2)
        plt.title("learning_rate = {}".format(learning_rate - 1), fontsize=16)

plt.subplot(121)
plt.text(-0.7, -0.65, "1", fontsize=14)
plt.text(-0.6, -0.10, "2", fontsize=14)
plt.text(-0.5,  0.10, "3", fontsize=14)
plt.text(-0.4,  0.55, "4", fontsize=14)
plt.text(-0.3,  0.90, "5", fontsize=14)
plt.show()
 
[Figure: decision boundaries of five consecutively trained SVC classifiers with boosted sample weights, for learning_rate = 1 (left) and learning_rate = 0.5 (right)]
 
$$ Weighted\ error\ rate\ of\ the\ j^{th}\ predictor: \\ r_j=\frac{\sum_{i=1,\ \widehat{y}_{j}^{(i)}\neq y^{(i)}}^{m}w^{(i)}}{\sum_{i=1}^{m}w^{(i)}} \\ where\ \widehat{y}_{j}^{(i)}\ is\ the\ j^{th}\ predictor's\ prediction\ for\ the\ i^{th}\ instance. \\ Predictor\ weight:\ \alpha_j=\eta\,\log\frac{1-r_j}{r_j} \\ Weight\ update\ rule:\ for\ i=1,2,...,m: \\ w^{(i)}\leftarrow \left\{\begin{matrix} w^{(i)} & if\ \widehat{y}_{j}^{(i)}=y^{(i)}\\ w^{(i)}e^{\alpha_j} & if\ \widehat{y}_{j}^{(i)}\neq y^{(i)} \end{matrix}\right. $$
 
  1. initially each instance weight is $w^{(i)}=\frac{1}{m}$
  2. train the first predictor and compute its $r_j$ and $\alpha_j$
  3. update the weights $w^{(i)}$ and normalize them (i.e. divide each by $\sum_{i=1}^{m}w^{(i)}$)
  4. train the next predictor with the updated weights, and repeat the process
 
$$ AdaBoost\ predictions: \\ \widehat y(x) = \underset{k}{argmax} \sum_{j=1,\widehat y_j(x)=k}^{N} \alpha_j \\ where\ N\ is\ the\ number\ of\ predictors. $$
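
To make the formulas above concrete, here is a rough NumPy sketch of a few boosting rounds with decision stumps, assuming $\eta=1$ and the binary moons split (X_train, y_train, X_test, y_test) from earlier; it is my own illustration, not the book's code, and the names stumps, alphas and adaboost_predict are hypothetical.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

m = len(X_train)
w = np.ones(m) / m                       # step 1: w_i = 1/m
alphas, stumps = [], []

for j in range(5):                       # train a few stumps sequentially
    stump = DecisionTreeClassifier(max_depth=1, random_state=j)
    stump.fit(X_train, y_train, sample_weight=w)
    y_pred = stump.predict(X_train)

    r = w[y_pred != y_train].sum() / w.sum()   # step 2: weighted error rate r_j
    alpha = np.log((1 - r) / r)                # predictor weight alpha_j (eta = 1)

    w[y_pred != y_train] *= np.exp(alpha)      # step 3: boost misclassified instances
    w /= w.sum()                               # normalize

    alphas.append(alpha)                       # step 4: repeat with the next predictor
    stumps.append(stump)

def adaboost_predict(X):
    # predict the class with the largest sum of alpha_j over the predictors voting for it
    votes = np.zeros((len(X), 2))              # two classes: 0 and 1
    for alpha, stump in zip(alphas, stumps):
        votes[np.arange(len(X)), stump.predict(X).astype(int)] += alpha
    return votes.argmax(axis=1)

print(accuracy_score(y_test, adaboost_predict(X_test)))

With only five stumps the accuracy is modest; the point is just to mirror steps 1-4 and the prediction formula above.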
  1. SAMME (Stagewise Additive Modeling using a Multiclass Exponential loss function) is the multiclass version of AdaBoost; Scikit-Learn's SAMME.R variant relies on class probabilities instead of predictions and generally performs better
In [29]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=500, algorithm='SAMME.R', learning_rate=0.5)
ada_clf.fit(X_train, y_train)
Out[29]:
AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
          learning_rate=0.5, n_estimators=500, random_state=None)
In [30]:
accuracy_score(y_test, ada_clf.predict(X_test))
Out[30]:
0.88
 

Gradient Boosting

  1. Like AdaBoost, each new predictor corrects its predecessor, but instead of tweaking instance weights it is fit to the residual errors made by the previous predictor
In [39]:
from sklearn.tree import DecisionTreeRegressor
import numpy.random as rnd

rnd.seed(42)
X = rnd.rand(100, 1) - 0.5
y = 3*X[:, 0]**2 + 0.05 * rnd.randn(100)

tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)

y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg2.fit(X, y2)

y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg3.fit(X, y3)

X_new = np.array([[0.8]])
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
print(y_pred)
 
[ 0.75026781]
In [40]:
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X,y)
Out[40]:
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=1.0, loss='ls', max_depth=2, max_features=None,
             max_leaf_nodes=None, min_impurity_split=1e-07,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=3, presort='auto',
             random_state=None, subsample=1.0, verbose=0, warm_start=False)
In [41]:
def plot_predictions(regressors, X, y, axes, label=None, style="r-", data_style="b.", data_label=None):
    x1 = np.linspace(axes[0], axes[1], 500)
    y_pred = sum(regressor.predict(x1.reshape(-1, 1)) for regressor in regressors)
    plt.plot(X[:, 0], y, data_style, label=data_label)
    plt.plot(x1, y_pred, style, linewidth=2, label=label)
    if label or data_label:
        plt.legend(loc="upper center", fontsize=16)
    plt.axis(axes)

plt.figure(figsize=(11,11))

plt.subplot(321)
plot_predictions([tree_reg1], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="$h_1(x_1)$", style="g-", data_label="Training set")
plt.ylabel("$y$", fontsize=16, rotation=0)
plt.title("Residuals and tree predictions", fontsize=16)

plt.subplot(322)
plot_predictions([tree_reg1], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="$h(x_1) = h_1(x_1)$", data_label="Training set")
plt.ylabel("$y$", fontsize=16, rotation=0)
plt.title("Ensemble predictions", fontsize=16)

plt.subplot(323)
plot_predictions([tree_reg2], X, y2, axes=[-0.5, 0.5, -0.5, 0.5], label="$h_2(x_1)$", style="g-", data_style="k+", data_label="Residuals")
plt.ylabel("$y - h_1(x_1)$", fontsize=16)

plt.subplot(324)
plot_predictions([tree_reg1, tree_reg2], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="$h(x_1) = h_1(x_1) + h_2(x_1)$")
plt.ylabel("$y$", fontsize=16, rotation=0)

plt.subplot(325)
plot_predictions([tree_reg3], X, y3, axes=[-0.5, 0.5, -0.5, 0.5], label="$h_3(x_1)$", style="g-", data_style="k+")
plt.ylabel("$y - h_1(x_1) - h_2(x_1)$", fontsize=16)
plt.xlabel("$x_1$", fontsize=16)

plt.subplot(326)
plot_predictions([tree_reg1, tree_reg2, tree_reg3], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="$h(x_1) = h_1(x_1) + h_2(x_1) + h_3(x_1)$")
plt.xlabel("$x_1$", fontsize=16)
plt.ylabel("$y$", fontsize=16, rotation=0)

plt.show()
 
[Figure: left column shows each tree's predictions against its targets/residuals (h1, h2, h3); right column shows the growing ensemble's predictions h1, h1+h2, h1+h2+h3]
In [42]:
# the learning_rate hyperparameter scales the contribution of each tree
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=0.1, random_state=42)
gbrt.fit(X, y)

gbrt_slow = GradientBoostingRegressor(max_depth=2, n_estimators=200, learning_rate=0.1, random_state=42)
gbrt_slow.fit(X, y)

plt.figure(figsize=(11,4))

plt.subplot(121)
plot_predictions([gbrt], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="Ensemble predictions")
plt.title("learning_rate={}, n_estimators={}".format(gbrt.learning_rate, gbrt.n_estimators), fontsize=14)

plt.subplot(122)
plot_predictions([gbrt_slow], X, y, axes=[-0.5, 0.5, -0.1, 0.8])
plt.title("learning_rate={}, n_estimators={}".format(gbrt_slow.learning_rate, gbrt_slow.n_estimators), fontsize=14)

plt.show()
 
[Figure: ensemble predictions with learning_rate=0.1, n_estimators=3 (left) vs learning_rate=0.1, n_estimators=200 (right)]
 

To find the optimal number of trees, you can use early stopping. A simple way to implement it is with the staged_predict() method, which returns an iterator over the predictions made by the ensemble at each stage of training (with 1 tree, 2 trees, etc.).

In [44]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors)

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)
Out[44]:
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=2, max_features=None,
             max_leaf_nodes=None, min_impurity_split=1e-07,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=67, presort='auto',
             random_state=None, subsample=1.0, verbose=0, warm_start=False)
In [47]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break
 

With warm_start=True, Scikit-Learn keeps the trees already trained when fit() is called again, allowing incremental training; the loop above stops as soon as the validation error has failed to improve for five consecutive iterations.

  1. the subsample hyperparameter specifies the fraction of training instances to be used for training each tree (this is called Stochastic Gradient Boosting)
  2. it is also possible to use Gradient Boosting with another cost function, controlled by the loss hyperparameter (a minimal sketch of both options follows)
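
A minimal sketch of both options, reusing the regression X_train/y_train split from above; the hyperparameter values and the names sgbrt / gbrt_huber are arbitrary choices of mine, not from the book.

from sklearn.ensemble import GradientBoostingRegressor

# Stochastic Gradient Boosting: each tree is trained on a random 25% of the training instances
sgbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120, subsample=0.25, random_state=42)
sgbrt.fit(X_train, y_train)

# a more outlier-robust cost function, selected via the loss hyperparameter
gbrt_huber = GradientBoostingRegressor(max_depth=2, n_estimators=120, loss="huber", random_state=42)
gbrt_huber.fit(X_train, y_train)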
 

Stacking

  1. short for stacked generalization
  2. split the training set into three subsets
  3. the first one is used to train the predictors of the first layer
  4. the second one is used to create the training set for the second layer (using the first-layer predictors' predictions on it as input features)
  5. the third one is used to create the training set for the third layer (using the second-layer predictors' predictions on it); a rough two-layer sketch follows below  [Figure: a multilayer stacking ensemble]
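
Scikit-Learn (at least the version used when these notes were written) has no stacking estimator, so here is a rough hand-rolled sketch of the hold-out idea with two layers (a blender on top of two base classifiers); the data split, classifier choices, and all names are my own, not the book's.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, random_state=42)
# subset 1 trains the first layer, subset 2 is held out to train the blender
X_sub1, X_sub2, y_sub1, y_sub2 = train_test_split(X_train_full, y_train_full, random_state=42)

layer1 = [RandomForestClassifier(random_state=42),
          SVC(probability=True, random_state=42)]
for clf in layer1:
    clf.fit(X_sub1, y_sub1)

# build the blender's training set from the layer-1 predictions on the hold-out subset
X_blend = np.column_stack([clf.predict_proba(X_sub2)[:, 1] for clf in layer1])
blender = LogisticRegression(random_state=42)
blender.fit(X_blend, y_sub2)

# at prediction time, run the first layer, then feed its outputs to the blender
X_blend_test = np.column_stack([clf.predict_proba(X_test)[:, 1] for clf in layer1])
print(accuracy_score(y_test, blender.predict(X_blend_test)))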

Source: http://www.cnblogs.com/yaoz/p/6973973.html