The dataset for this article is the same US personal income data used in the previous post. That post ended by mentioning the random forest algorithm, which is the topic here.
A random forest is an ensemble model (Ensemble Models): an ensemble combines several models to produce a single model with higher accuracy than its individual members.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# "train" and "test" are the income-data splits prepared in the previous article.
columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]

clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=75)
clf.fit(train[columns], train["high_income"])

clf2 = DecisionTreeClassifier(random_state=1, max_depth=6)
clf2.fit(train[columns], train["high_income"])

predictions = clf.predict(test[columns])
print(roc_auc_score(predictions, test["high_income"]))

predictions = clf2.predict(test[columns])
print(roc_auc_score(predictions, test["high_income"]))
'''
0.784853009097
0.771031199892
'''
When we have several classifiers, we can arrange their predictions as a matrix, with one column per classifier and one row per sample.
We can then produce the final prediction by majority voting (there are of course many other ways to combine predictions). Majority voting needs more than two classifiers, and preferably an odd number, because with an even number we would also have to write a rule to break ties.
Since we only built two decision trees above, we combine them a different way: instead of predict, we use predict_proba to get the probability that each sample belongs to class 0 or class 1, and average the two probability estimates:
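As an illustration (this sketch is not part of the original code), majority voting over such a prediction matrix can be done with numpy, here with three hypothetical classifiers:

import numpy

# Hypothetical binary predictions from three classifiers, one row per sample.
votes = numpy.array([
    [0, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
])

# Majority vote per sample: predict 1 when more than half of the classifiers say 1.
majority = (votes.sum(axis=1) > votes.shape[1] / 2).astype(int)
print(majority)  # [0 1 0 1]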
import numpy

# Probability of class 1 from each tree, averaged across the two trees and rounded to 0/1.
predictions = clf.predict_proba(test[columns])[:,1]
predictions2 = clf2.predict_proba(test[columns])[:,1]
combined = (predictions + predictions2) / 2
rounded = numpy.round(combined)
print(roc_auc_score(rounded, test["high_income"]))
'''
0.789959895266
'''
The more "diverse", or dissimilar, the models used to construct an ensemble, the stronger the combined predictions will be (assuming that all models have about the same accuracy).
If we make no changes, the trees will all be identical, so combining them gives no improvement. A random forest is an ensemble model like the one above, and the "random" hints at where the changes come in. Random forests introduce two modifications: bagging and random feature subsets. A random forest is a classifier made up of many decision trees, and its output class is the mode (majority vote) of the classes output by the individual trees.
Let N be the number of training examples and M the number of features. Choose a number m of features, with m much smaller than M, to be considered when splitting a node. To build each tree, draw N samples from the N training examples with replacement (a bootstrap sample), and use the examples that were not drawn (the out-of-bag samples) to estimate that tree's error. At each node, randomly select m features and compute the best split using only those features. Each tree is grown to full size without pruning (pruning might otherwise be applied after building an ordinary tree classifier).
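A minimal sketch (not part of the original code) of what one bootstrap sample and its out-of-bag rows look like, assuming `train` is the income DataFrame used above:

# Draw a bootstrap sample the same size as the training set (sampling with replacement).
bootstrap = train.sample(frac=1, replace=True, random_state=1)

# Out-of-bag rows: training rows that never appeared in this bootstrap sample
# (roughly 37% of the rows on average); they can act as a built-in validation set.
out_of_bag = train.loc[~train.index.isin(bootstrap.index)]
print(len(bootstrap), len(out_of_bag))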
tree_count = 10

# Each "bag" will have 60% of the number of original rows.
bag_proportion = .6

predictions = []
for i in range(tree_count):
    # We select 60% of the rows from train, sampling with replacement.
    # We set a random state to ensure we'll be able to replicate our results.
    # We set it to i instead of a fixed value so we don't get the same sample every loop.
    # That would make all of our trees the same.
    bag = train.sample(frac=bag_proportion, replace=True, random_state=i)

    # Fit a decision tree model to the "bag".
    clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=75)
    clf.fit(bag[columns], bag["high_income"])

    # Using the model, make predictions on the test data.
    predictions.append(clf.predict_proba(test[columns])[:,1])

# Average the 10 trees' probability estimates, then round to get 0/1 predictions.
combined = numpy.sum(predictions, axis=0) / 10
rounded = numpy.round(combined)
print(roc_auc_score(rounded, test["high_income"]))
'''
0.785415640465
'''
import pandas

# Create the dataset that we used 2 missions ago.
data = pandas.DataFrame([
    [0,4,20,0],
    [0,4,60,2],
    [0,5,40,1],
    [1,4,25,1],
    [1,5,35,2],
    [1,5,55,1]
])
data.columns = ["high_income", "employment", "age", "marital_status"]
# Set a random seed to make results reproducible.
numpy.random.seed(1)

# The dictionary to store our tree.
tree = {}
nodes = []

# The function to find the column to split on.
# calc_information_gain() is the information-gain helper from the earlier decision tree article.
def find_best_column(data, target_name, columns):
    information_gains = []
    for col in columns:
        information_gain = calc_information_gain(data, col, "high_income")
        information_gains.append(information_gain)
    # Find the name of the column with the highest gain.
    highest_gain_index = information_gains.index(max(information_gains))
    highest_gain = columns[highest_gain_index]
    return highest_gain

# The function to construct an id3 decision tree.
def id3(data, target, columns, tree):
    unique_targets = pandas.unique(data[target])
    nodes.append(len(nodes) + 1)
    tree["number"] = nodes[-1]

    if len(unique_targets) == 1:
        if 0 in unique_targets:
            tree["label"] = 0
        elif 1 in unique_targets:
            tree["label"] = 1
        return

    best_column = find_best_column(data, target, columns)
    column_median = data[best_column].median()

    tree["column"] = best_column
    tree["median"] = column_median

    left_split = data[data[best_column] <= column_median]
    right_split = data[data[best_column] > column_median]
    split_dict = [["left", left_split], ["right", right_split]]
    for name, split in split_dict:
        tree[name] = {}
        id3(split, target, columns, tree[name])

# Run the id3 algorithm on our dataset and print the resulting tree.
id3(data, "high_income", ["employment", "age", "marital_status"], tree)
print(tree)
# Redefine find_best_column so that each split only considers a random subset of the features.
def find_best_column(data, target_name, columns):
    information_gains = []

    # Select two columns randomly (numpy.random.choice samples with replacement by default,
    # so the same column can occasionally be drawn twice).
    cols = numpy.random.choice(columns, 2)

    for col in cols:
        information_gain = calc_information_gain(data, col, "high_income")
        information_gains.append(information_gain)

    highest_gain_index = information_gains.index(max(information_gains))
    # Get the highest gain by indexing cols.
    highest_gain = cols[highest_gain_index]
    return highest_gain

id3(data, "high_income", ["employment", "age", "marital_status"], tree)
print(tree)
# We'll build 10 trees
tree_count = 10

# Each "bag" will have 70% of the number of original rows.
bag_proportion = .7

predictions = []
for i in range(tree_count):
    # We select 70% of the rows from train, sampling with replacement.
    # We set a random state to ensure we'll be able to replicate our results.
    # We set it to i instead of a fixed value so we don't get the same sample every time.
    bag = train.sample(frac=bag_proportion, replace=True, random_state=i)

    # Fit a decision tree model to the "bag".
    # max_features="sqrt" limits each split to a random subset of the features, and
    # splitter="random" picks split points randomly. (The original code used
    # max_features="auto", which meant "sqrt" for classification trees and has been
    # removed in newer scikit-learn versions.)
    clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=75, splitter="random", max_features="sqrt")
    clf.fit(bag[columns], bag["high_income"])

    # Using the model, make predictions on the test data.
    predictions.append(clf.predict_proba(test[columns])[:,1])

combined = numpy.sum(predictions, axis=0) / 10
rounded = numpy.round(combined)
print(roc_auc_score(rounded, test["high_income"]))
'''
0.789767997764
'''
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10, random_state=1, min_samples_leaf=75)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
print(roc_auc_score(predictions, test["high_income"]))
'''
0.791634978035
'''
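As a side note (not in the original article), RandomForestClassifier can also use the out-of-bag samples mentioned earlier to estimate accuracy without touching the test set. A minimal sketch, assuming the same train and columns as above:

from sklearn.ensemble import RandomForestClassifier

# oob_score=True asks scikit-learn to score each row using only the trees
# that did not see that row in their bootstrap sample.
clf_oob = RandomForestClassifier(n_estimators=150, random_state=1,
                                 min_samples_leaf=75, oob_score=True)
clf_oob.fit(train[columns], train["high_income"])
print(clf_oob.oob_score_)  # mean out-of-bag accuracy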
Like a single decision tree, a random forest can reach higher accuracy through parameter tuning. The first four parameters are the same as for decision trees; the fifth, n_estimators, sets the number of trees to build, and increasing it usually helps up to a point, at the cost of training time:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=150, random_state=1, min_samples_leaf=75)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
print(roc_auc_score(predictions, test["high_income"]))
'''
0.793788646293
'''
# Check for overfitting: compare AUC on the training data with AUC on the test data.
clf = RandomForestClassifier(n_estimators=150, random_state=1, min_samples_leaf=75)
clf.fit(train[columns], train["high_income"])

predictions = clf.predict(train[columns])
print(roc_auc_score(predictions, train["high_income"]))

predictions = clf.predict(test[columns])
print(roc_auc_score(predictions, test["high_income"]))
'''
0.794137608735
0.793788646293
'''
The training and test AUC are nearly identical here, which shows that the random forest barely overfits: averaging many trees smooths out the quirks each tree picks up from its own sample. Random forests, neural networks, and gradient boosted trees are among the best-performing algorithms.
Disadvantages of random forests: they are much harder to interpret than a single tree, and training and prediction are slower because many trees are involved.
Weighing these strengths and weaknesses, a random forest is a good choice when accuracy is critical and the decisions do not need to be explained; when speed matters and interpretability is required, a single decision tree is more appropriate.
Original article: http://blog.csdn.net/zm714981790/article/details/51262103