Notes : <Hands-on ML with Sklearn & TF> Chapter 7

If you aggregate the predictions of a group of predictors, you will often get better predictions than with the best individual predictor.
A group of predictors is called an ensemble.
This chapter discusses the most popular ensemble methods, including bagging, boosting, stacking, and a few others.

 

Voting Classifiers

  1. voting=hard uses majority voting; voting=soft averages the predicted class probabilities of all the individual classifiers
  2. By the law of large numbers, the ensemble usually performs better than any individual classifier, and the more independent the classifiers are from one another, the better the ensemble performs
In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)  #chapter 5

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression(random_state=42)
rnd_clf = RandomForestClassifier(random_state=42)
svm_clf = SVC(probability=True, random_state=42)  # voting='soft' requires predict_proba(); set probability=True so SVC provides it

voting_clf = VotingClassifier(
        estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
        voting='soft'
    )
voting_clf.fit(X_train, y_train)  # fitting the VotingClassifier fits all of the underlying estimators

from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
 
LogisticRegression 0.864
RandomForestClassifier 0.872
SVC 0.888
VotingClassifier 0.912
 

Bagging and Pasting

You can use completely different algorithms to get a diverse set of classifiers, or you can use the same algorithm trained on different random subsets of the training set to get different predictors.

  1. The training set is resampled so that each subset stands in for the whole set. Sampling with replacement is called bagging; sampling without replacement is called pasting
  2. Each individual predictor trained on a resampled subset has a higher bias than one trained on the full training set, but aggregating their predictions reduces both bias and variance
  3. The individual predictors can be trained in parallel
 

Bagging and pasting in scikit-learn

In [2]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=True, n_jobs=-1)

bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
In [3]:
accuracy_score(y_test, y_pred)
Out[3]:
0.91200000000000003
In [4]:
past_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=False, n_jobs=-1)

past_clf.fit(X_train, y_train)
y_pred = past_clf.predict(X_test)
accuracy_score(y_test, y_pred)
Out[4]:
0.91200000000000003
 

Bagging ends up with a slightly higher bias than pasting because bootstrapping introduces a bit more diversity in the subsets each predictor is trained on; the predictors are therefore less correlated, so the ensemble's variance is reduced.

 

Out-of-Bag Evaluation

  1. With bagging, only about 63% of the training instances are sampled on average for each predictor; the remaining roughly $\frac{1}{e}\approx 0.368$ are never sampled and are called out-of-bag (oob) instances
  2. These oob instances can be used to evaluate the ensemble without a separate validation set
In [5]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=True, n_jobs=-1, oob_score=True)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_
Out[5]:
0.92266666666666663
In [6]:
y_pred=bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)
Out[6]:
0.91200000000000003
In [7]:
bag_clf.oob_decision_function_[2]
Out[7]:
array([ 0.99744898,  0.00255102])
 

Random Patches and Random Subspaces

  1. When the input has many feature dimensions, the features can be sampled as well, controlled by max_features and bootstrap_features, which work the same way as max_samples and bootstrap
  2. Sampling both training instances and features is called Random Patches
  3. Keeping all training instances but sampling features is called Random Subspaces
  4. Feature sampling trades a bit more bias for a lower variance (a short sketch of these hyperparameters follows this list)
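
A minimal sketch of those hyperparameters (not from the book's notebook), reusing the moons X_train/y_train from above; with only two input features the feature sampling here is purely illustrative, and the classifier names patches_clf / subspaces_clf are my own.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Random Patches: sample both training instances and features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=0.7, bootstrap=True,            # instance sampling
    max_features=0.5, bootstrap_features=True,  # feature sampling
    n_jobs=-1)
patches_clf.fit(X_train, y_train)

# Random Subspaces: keep all instances, sample only the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=1.0, bootstrap=False,           # all training instances
    max_features=0.5, bootstrap_features=True,  # feature sampling only
    n_jobs=-1)
subspaces_clf.fit(X_train, y_train)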
 

Random Forest

  1. use RandomForestClassifier
In [9]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf=RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)
In [10]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(splitter='random', max_leaf_nodes=16), n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)  # roughly equivalent to the random forest above
bag_clf.fit(X_train, y_train)
In [11]:
accuracy_score(y_test, y_pred_rf)
Out[11]:
0.91200000000000003
In [12]:
accuracy_score(y_test, bag_clf.predict(X_test))
Out[12]:
0.91200000000000003
 

Extra-Trees

  1. when growing a tree in a random forest, at each node only a random subset of the features is considered for splitting.
  2. it is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds.
  3. use the ExtraTreesClassifier class to create an Extra-Trees classifier; it has the same API as RandomForestClassifier.
  4. It is hard to tell in advance which of the two will perform better; for a given problem, compare them with cross-validation (a minimal sketch follows this list).
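
A minimal sketch, assuming the moons X_train/X_test split and the rnd_clf forest from above; the names ext_clf / y_pred_ext and the hyperparameter values are my own choices, not the book's.

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

# same API as RandomForestClassifier
ext_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42)
ext_clf.fit(X_train, y_train)
y_pred_ext = ext_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_ext))

# item 4: compare the two with cross-validation rather than guessing
print(cross_val_score(ext_clf, X_train, y_train, cv=5).mean())
print(cross_val_score(rnd_clf, X_train, y_train, cv=5).mean())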
 

Feature Importance

  1. important features are likely to appear closer to the root of a tree
  2. the feature_importances_ attribute gives each feature's importance, based on the average depth at which it appears across all trees in the forest
In [13]:
from sklearn.datasets import load_iris
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris['data'], iris['target'])
for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
    print(name, score)
 
sepal length (cm) 0.102324356672
sepal width (cm) 0.0257240474133
petal length (cm) 0.439143949318
petal width (cm) 0.432807646597
In [14]:
from six.moves import urllib
from sklearn.datasets import fetch_mldata
try:
    mnist = fetch_mldata('MNIST original')
except urllib.error.HTTPError as ex:
    print("Could not download MNIST data from mldata.org, trying alternative...")

    # Alternative method to load MNIST, if mldata.org is down
    from scipy.io import loadmat
    mnist_alternative_url = "https://github.com/amplab/datascience-sp14/raw/master/lab7/mldata/mnist-original.mat"
    mnist_path = "./mnist-original.mat"
    response = urllib.request.urlopen(mnist_alternative_url)
    with open(mnist_path, "wb") as f:
        content = response.read()
        f.write(content)
    mnist_raw = loadmat(mnist_path)
    mnist = {
        "data": mnist_raw["data"].T,
        "target": mnist_raw["label"][0],
        "COL_NAMES": ["label", "data"],
        "DESCR": "mldata.org dataset: mnist-original",
    }
    print("Success!")
In [15]:
rnd_clf = RandomForestClassifier(random_state=42)
rnd_clf.fit(mnist["data"], mnist["target"])
Out[15]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False)
In [20]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap = matplotlib.cm.hot,
               interpolation="nearest")
    plt.axis("off")
In [21]:
plot_digit(rnd_clf.feature_importances_)

cbar = plt.colorbar(ticks=[rnd_clf.feature_importances_.min(), rnd_clf.feature_importances_.max()])
cbar.ax.set_yticklabels(['Not important', 'Very important'])

plt.show()
 
[Figure: MNIST pixel importances (rnd_clf.feature_importances_) plotted as a 28x28 heat map with a 'Not important' to 'Very important' colorbar]
 

Boosting

  1. combine several weak learners into a strong learner
  2. predictors are trained sequentially, each trying to correct its predecessor
 

Adaptive Boosting

  1. each new predictor pays more attention to the training instances that its predecessor underfitted.
  2. For example, a second classifier is trained using the updated weights from the first, it again makes predictions on the training set, the weights are updated again, and so on.
In [24]:
import numpy as np
m = len(X_train)

from matplotlib.colors import ListedColormap

def plot_decision_boundary(clf, X, y, axes=[-1.5, 2.5, -1, 1.5], alpha=0.5, contour=True):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap, linewidth=10)
    if contour:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", alpha=alpha)
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", alpha=alpha)
    plt.axis(axes)
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.ylabel(r"$x_2$", fontsize=18, rotation=0)

plt.figure(figsize=(11, 4))
for subplot, learning_rate in ((121, 1), (122, 0.5)):
    sample_weights = np.ones(m)
    for i in range(5):
        plt.subplot(subplot)
        svm_clf = SVC(kernel="rbf", C=0.05)
        svm_clf.fit(X_train, y_train, sample_weight=sample_weights)
        y_pred = svm_clf.predict(X_train)
        sample_weights[y_pred != y_train] *= (1 + learning_rate)
        plot_decision_boundary(svm_clf, X, y, alpha=0.2)
        plt.title("learning_rate = {}".format(learning_rate - 1), fontsize=16)

plt.subplot(121)
plt.text(-0.7, -0.65, "1", fontsize=14)
plt.text(-0.6, -0.10, "2", fontsize=14)
plt.text(-0.5,  0.10, "3", fontsize=14)
plt.text(-0.4,  0.55, "4", fontsize=14)
plt.text(-0.3,  0.90, "5", fontsize=14)
plt.show()
 
[Figure: decision boundaries of five consecutively trained SVC classifiers with boosted sample weights, for learning_rate = 1 (left) and learning_rate = 0.5 (right)]
 
$$ Weighted\ error\ rate\ of\ the\ j^{th}\ predictor: \\ r_j=\frac{\sum_{i=1,\ \widehat{y}_{j}^{(i)}\neq y^{(i)}}^{m}w^{(i)}}{\sum_{i=1}^{m}w^{(i)}} \\ where\ \widehat{y}_{j}^{(i)}\ is\ the\ j^{th}\ predictor's\ prediction\ for\ the\ i^{th}\ instance. \\ Predictor\ weight:\ \alpha_j=\eta\,\log\frac{1-r_j}{r_j} \\ Weight\ update\ rule:\ for\ i=1,2,...,m: \\ w^{(i)}\leftarrow \left\{\begin{matrix} w^{(i)} & if\ \widehat{y}_{j}^{(i)}=y^{(i)}\\ w^{(i)}e^{\alpha_j} & if\ \widehat{y}_{j}^{(i)}\neq y^{(i)} \end{matrix}\right. $$
 
  1. initially each instance weight is $w^{(i)}=\frac{1}{m}$
  2. train the first predictor and compute its $r_j$ and $\alpha_j$
  3. update the weights $w^{(i)}$ and normalize them (i.e. divide each by $\sum_{i=1}^{m}w^{(i)}$)
  4. train the next predictor with the updated weights, and repeat the process
 
$$ AdaBoost\ predictions: \\ \widehat y(x) = \underset{k}{argmax} \sum_{j=1,\widehat y_j(x)=k}^{N} \alpha_j \\ where\ N\ is\ the\ number\ of\ predictors. $$
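
To make the formulas above concrete, here is a rough NumPy sketch of a few boosting rounds with decision stumps, assuming $\eta=1$ and the binary moons split (X_train, y_train, X_test, y_test) from earlier; it is my own illustration, not the book's code, and the names stumps, alphas and adaboost_predict are hypothetical.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

m = len(X_train)
w = np.ones(m) / m                       # step 1: w_i = 1/m
alphas, stumps = [], []

for j in range(5):                       # train a few stumps sequentially
    stump = DecisionTreeClassifier(max_depth=1, random_state=j)
    stump.fit(X_train, y_train, sample_weight=w)
    y_pred = stump.predict(X_train)

    r = w[y_pred != y_train].sum() / w.sum()   # step 2: weighted error rate r_j
    alpha = np.log((1 - r) / r)                # predictor weight alpha_j (eta = 1)

    w[y_pred != y_train] *= np.exp(alpha)      # step 3: boost misclassified instances
    w /= w.sum()                               # normalize

    alphas.append(alpha)                       # step 4: repeat with the next predictor
    stumps.append(stump)

def adaboost_predict(X):
    # predict the class with the largest sum of alpha_j over the predictors voting for it
    votes = np.zeros((len(X), 2))              # two classes: 0 and 1
    for alpha, stump in zip(alphas, stumps):
        votes[np.arange(len(X)), stump.predict(X).astype(int)] += alpha
    return votes.argmax(axis=1)

print(accuracy_score(y_test, adaboost_predict(X_test)))

With only five stumps the accuracy is modest; the point is just to mirror steps 1-4 and the prediction formula above.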
  1. SAMME (Stagewise Additive Modeling using a Multiclass Exponential loss function) is the multiclass version of AdaBoost; Scikit-Learn's SAMME.R variant relies on class probabilities instead of predictions and generally performs better
In [29]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=500, algorithm='SAMME.R', learning_rate=0.5)
ada_clf.fit(X_train, y_train)
Out[29]:
AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
          learning_rate=0.5, n_estimators=500, random_state=None)
In [30]:
accuracy_score(y_test, ada_clf.predict(X_test))
Out[30]:
0.88
 

Gradient Boosting

  1. Like AdaBoost, each new predictor corrects its predecessor, but instead of tweaking instance weights it is fit to the residual errors made by the previous predictor
In [39]:
from sklearn.tree import DecisionTreeRegressor
import numpy.random as rnd

rnd.seed(42)
X = rnd.rand(100, 1) - 0.5
y = 3*X[:, 0]**2 + 0.05 * rnd.randn(100)

tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)

y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg2.fit(X, y2)

y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg3.fit(X, y3)

X_new = np.array([[0.8]])
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
print(y_pred)
 
[ 0.75026781]
In [40]:
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X,y)
Out[40]:
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=1.0, loss='ls', max_depth=2, max_features=None,
             max_leaf_nodes=None, min_impurity_split=1e-07,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=3, presort='auto',
             random_state=None, subsample=1.0, verbose=0, warm_start=False)
In [41]:
def plot_predictions(regressors, X, y, axes, label=None, style="r-", data_style="b.", data_label=None):
    x1 = np.linspace(axes[0], axes[1], 500)
    y_pred = sum(regressor.predict(x1.reshape(-1, 1)) for regressor in regressors)
    plt.plot(X[:, 0], y, data_style, label=data_label)
    plt.plot(x1, y_pred, style, linewidth=2, label=label)
    if label or data_label:
        plt.legend(loc="upper center", fontsize=16)
    plt.axis(axes)

plt.figure(figsize=(11,11))

plt.subplot(321)
plot_predictions([tree_reg1], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="$h_1(x_1)$", style="g-", data_label="Training set")
plt.ylabel("$y$", fontsize=16, rotation=0)
plt.title("Residuals and tree predictions", fontsize=16)

plt.subplot(322)
plot_predictions([tree_reg1], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="$h(x_1) = h_1(x_1)$", data_label="Training set")
plt.ylabel("$y$", fontsize=16, rotation=0)
plt.title("Ensemble predictions", fontsize=16)

plt.subplot(323)
plot_predictions([tree_reg2], X, y2, axes=[-0.5, 0.5, -0.5, 0.5], label="$h_2(x_1)$", style="g-", data_style="k+", data_label="Residuals")
plt.ylabel("$y - h_1(x_1)$", fontsize=16)

plt.subplot(324)
plot_predictions([tree_reg1, tree_reg2], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="$h(x_1) = h_1(x_1) + h_2(x_1)$")
plt.ylabel("$y$", fontsize=16, rotation=0)

plt.subplot(325)
plot_predictions([tree_reg3], X, y3, axes=[-0.5, 0.5, -0.5, 0.5], label="$h_3(x_1)$", style="g-", data_style="k+")
plt.ylabel("$y - h_1(x_1) - h_2(x_1)$", fontsize=16)
plt.xlabel("$x_1$", fontsize=16)

plt.subplot(326)
plot_predictions([tree_reg1, tree_reg2, tree_reg3], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="$h(x_1) = h_1(x_1) + h_2(x_1) + h_3(x_1)$")
plt.xlabel("$x_1$", fontsize=16)
plt.ylabel("$y$", fontsize=16, rotation=0)

plt.show()
 
[Figure: left column shows each tree's predictions against its targets/residuals (h1, h2, h3); right column shows the growing ensemble's predictions h1, h1+h2, h1+h2+h3]
In [42]:
# the learning_rate hyperparameter scales the contribution of each tree
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=0.1, random_state=42)
gbrt.fit(X, y)

gbrt_slow = GradientBoostingRegressor(max_depth=2, n_estimators=200, learning_rate=0.1, random_state=42)
gbrt_slow.fit(X, y)

plt.figure(figsize=(11,4))

plt.subplot(121)
plot_predictions([gbrt], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="Ensemble predictions")
plt.title("learning_rate={}, n_estimators={}".format(gbrt.learning_rate, gbrt.n_estimators), fontsize=14)

plt.subplot(122)
plot_predictions([gbrt_slow], X, y, axes=[-0.5, 0.5, -0.1, 0.8])
plt.title("learning_rate={}, n_estimators={}".format(gbrt_slow.learning_rate, gbrt_slow.n_estimators), fontsize=14)

plt.show()
 
[Figure: ensemble predictions with learning_rate=0.1, n_estimators=3 (left) vs learning_rate=0.1, n_estimators=200 (right)]
 

To find the optimal number of trees, you can use early stopping. A simple way to implement it is with the staged_predict() method, which returns an iterator over the predictions made by the ensemble at each stage of training (with 1 tree, 2 trees, etc.).

In [44]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors)

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)
Out[44]:
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=2, max_features=None,
             max_leaf_nodes=None, min_impurity_split=1e-07,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=67, presort='auto',
             random_state=None, subsample=1.0, verbose=0, warm_start=False)
In [47]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break
 

With warm_start=True, Scikit-Learn keeps the trees already trained when fit() is called again, allowing incremental training; the loop above stops as soon as the validation error has failed to improve for five consecutive iterations.

  1. the subsample hyperparameter specifies the fraction of training instances to be used for training each tree (this is called Stochastic Gradient Boosting)
  2. it is also possible to use Gradient Boosting with another cost function, controlled by the loss hyperparameter (a minimal sketch of both options follows)
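
A minimal sketch of both options, reusing the regression X_train/y_train split from above; the hyperparameter values and the names sgbrt / gbrt_huber are arbitrary choices of mine, not from the book.

from sklearn.ensemble import GradientBoostingRegressor

# Stochastic Gradient Boosting: each tree is trained on a random 25% of the training instances
sgbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120, subsample=0.25, random_state=42)
sgbrt.fit(X_train, y_train)

# a more outlier-robust cost function, selected via the loss hyperparameter
gbrt_huber = GradientBoostingRegressor(max_depth=2, n_estimators=120, loss="huber", random_state=42)
gbrt_huber.fit(X_train, y_train)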
 

Stacking

  1. short for stacked generalization
  2. split the training set into three subsets
  3. the first one is used to train the predictors of the first layer
  4. the second one is used to create the training set for the second layer (using the first-layer predictors' predictions on it as input features)
  5. the third one is used to create the training set for the third layer (using the second-layer predictors' predictions on it); a rough two-layer sketch follows below  [Figure: a multilayer stacking ensemble]
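
Scikit-Learn (at least the version used when these notes were written) has no stacking estimator, so here is a rough hand-rolled sketch of the hold-out idea with two layers (a blender on top of two base classifiers); the data split, classifier choices, and all names are my own, not the book's.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, random_state=42)
# subset 1 trains the first layer, subset 2 is held out to train the blender
X_sub1, X_sub2, y_sub1, y_sub2 = train_test_split(X_train_full, y_train_full, random_state=42)

layer1 = [RandomForestClassifier(random_state=42),
          SVC(probability=True, random_state=42)]
for clf in layer1:
    clf.fit(X_sub1, y_sub1)

# build the blender's training set from the layer-1 predictions on the hold-out subset
X_blend = np.column_stack([clf.predict_proba(X_sub2)[:, 1] for clf in layer1])
blender = LogisticRegression(random_state=42)
blender.fit(X_blend, y_sub2)

# at prediction time, run the first layer, then feed its outputs to the blender
X_blend_test = np.column_stack([clf.predict_proba(X_test)[:, 1] for clf in layer1])
print(accuracy_score(y_test, blender.predict(X_blend_test)))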

Source: http://www.cnblogs.com/yaoz/p/6973973.html