The two most important parameters, which most often need tuning to improve the algorithm's performance, are numTrees and maxDepth.
We include a few guidelines for using random forests by discussing the various parameters. We omit some decision tree parameters since those are covered in the decision tree guide.
The first two parameters we mention are the most important, and tuning them can often improve performance:
(1) numTrees: Number of trees in the forest.
Increasing the number of trees will decrease the variance in predictions, improving the model’s test-time accuracy.
Training time increases roughly linearly in the number of trees.
(2) maxDepth: Maximum depth of each tree in the forest.
Increasing the depth makes the model more expressive and powerful. However, deep trees take longer to train and are also more prone to overfitting.
In general, it is acceptable to train deeper trees when using random forests than when using a single decision tree. One tree is more likely to overfit than a random forest (because of the variance reduction from averaging multiple trees in the forest).
The next two parameters generally do not require tuning. However, they can be tuned to speed up training.
(3) subsamplingRate: This parameter specifies the size of the dataset used for training each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training.
(4) featureSubsetStrategy: Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low.
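To make the tuning advice above concrete, here is a minimal grid-search sketch (not from the original post) that evaluates a few numTrees/maxDepth combinations on a held-out validation split. The grid values are arbitrary, and it assumes the same sample LIBSVM file used in the full example below. Note that the RDD-based Python API used in this post exposes numTrees, maxDepth, and featureSubsetStrategy as trainClassifier arguments, but does not expose subsamplingRate there (that one is configurable in the DataFrame-based spark.ml RandomForestClassifier).

from __future__ import print_function

from pyspark import SparkContext
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName="RandomForestTuningSketch")
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
(train, validation) = data.randomSplit([0.8, 0.2])

def validation_error(model):
    # Fraction of validation points the model misclassifies.
    predictions = model.predict(validation.map(lambda x: x.features))
    pairs = validation.map(lambda lp: lp.label).zip(predictions)
    return pairs.filter(lambda vp: vp[0] != vp[1]).count() / float(validation.count())

best = None
for numTrees in [10, 50, 100]:      # arbitrary illustrative grid
    for maxDepth in [4, 8, 12]:
        model = RandomForest.trainClassifier(
            train, numClasses=2, categoricalFeaturesInfo={},
            numTrees=numTrees, featureSubsetStrategy="auto",
            impurity='gini', maxDepth=maxDepth, maxBins=32)
        err = validation_error(model)
        if best is None or err < best[0]:
            best = (err, numTrees, maxDepth)

print('Best (error, numTrees, maxDepth):', best)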
""" Random Forest Classification Example. """ from __future__ import print_function from pyspark import SparkContext # $example on$ from pyspark.mllib.tree import RandomForest, RandomForestModel from pyspark.mllib.util import MLUtils # $example off$ if __name__ == "__main__": sc = SparkContext(appName="PythonRandomForestClassificationExample") # $example on$ # Load and parse the data file into an RDD of LabeledPoint. data = MLUtils.loadLibSVMFile(sc, ‘data/mllib/sample_libsvm_data.txt‘) # Split the data into training and test sets (30% held out for testing) (trainingData, testData) = data.randomSplit([0.7, 0.3]) # Train a RandomForest model. # Empty categoricalFeaturesInfo indicates all features are continuous. # Note: Use larger numTrees in practice. # Setting featureSubsetStrategy="auto" lets the algorithm choose. model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={}, numTrees=3, featureSubsetStrategy="auto", impurity=‘gini‘, maxDepth=4, maxBins=32) # Evaluate model on test instances and compute test error predictions = model.predict(testData.map(lambda x: x.features)) labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions) testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count()) print(‘Test Error = ‘ + str(testErr)) print(‘Learned classification forest model:‘) print(model.toDebugString()) # Save and load model model.save(sc, "target/tmp/myRandomForestClassificationModel") sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestClassificationModel") # $example off$
The learned model looks like this:
TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 511 <= 0.0)
     If (feature 434 <= 0.0)
      Predict: 0.0
     Else (feature 434 > 0.0)
      Predict: 1.0
    Else (feature 511 > 0.0)
     Predict: 0.0
  Tree 1:
    If (feature 490 <= 31.0)
     Predict: 0.0
    Else (feature 490 > 31.0)
     Predict: 1.0
  Tree 2:
    If (feature 302 <= 0.0)
     If (feature 461 <= 0.0)
      If (feature 208 <= 107.0)
       Predict: 1.0
      Else (feature 208 > 107.0)
       Predict: 0.0
     Else (feature 461 > 0.0)
      Predict: 1.0
    Else (feature 302 > 0.0)
     Predict: 0.0
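As a quick sanity check on the reloaded model, predict can also be called on a single vector (TreeEnsembleModel.predict accepts either one data point or an RDD of points). The sparse vector below is purely illustrative: 692 is the feature dimension of sample_libsvm_data.txt, and the nonzero index was picked to exercise the split in Tree 1 above.

from pyspark.mllib.linalg import Vectors

# Illustrative input only: a 692-dimensional sparse vector whose single
# nonzero feature (490) falls on the "> 31.0" branch of Tree 1 above.
point = Vectors.sparse(692, {490: 45.0})
print(sameModel.predict(point))  # single-vector predict returns a float class label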
Original article: http://www.cnblogs.com/bonelee/p/7204096.html