"Spark MLlib Machine Learning": Detailed Explanations (continuously updated)

Date: 2016-04-22 19:26:13

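1. P220

An explanation of the following passage:

After obtaining the maximum number of bins, compute the maximum number of splits. For an unordered feature, splits = number of bins / 2; for an ordered feature, splits = number of bins - 1.

A reader asked where the rule "splits = number of bins / 2" for unordered features comes from. The explanation is as follows:

1) First, compute numBins: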

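// If the number of categories of the current feature does not exceed the
// threshold m (maxCategoriesForUnorderedFeature), treat the feature as unordered
if (numCategories <= maxCategoriesForUnorderedFeature) { // unordered case
  unorderedFeatures.add(featureIndex)
  numBins(featureIndex) = numUnorderedBins(numCategories)
} else { // ordered case
  numBins(featureIndex) = numCategories
}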

From the code above, in the unordered case numBins = numUnorderedBins(numCategories).

The numUnorderedBins function is as follows:

/**
 * Given the arity of a categorical feature (arity = number of categories),
 * return the number of bins for the feature if it is to be treated as an unordered feature.
 * There is 1 split for every partitioning of categories into 2 disjoint, non-empty sets;
 * there are math.pow(2, arity - 1) - 1 such splits.
 * Each split has 2 corresponding bins.
 * Explanation: every split produces 2 bins, just as one cut through a watermelon yields 2 pieces.
 */
def numUnorderedBins(arity: Int): Int = 2 * ((1 << arity - 1) - 1)

From this formula: numBins = 2 * (math.pow(2, arity - 1) - 1)
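The count math.pow(2, arity - 1) - 1 can be checked by brute force for small arities. The following is a minimal sketch, not Spark code: `enumerateSplits` and the `UnorderedSplitCount` object are hypothetical names introduced only for this illustration, while `numUnorderedBins` repeats the one-line formula quoted above. It lists every way to divide the categories into two disjoint, non-empty sets and compares the count with the formula.

```scala
object UnorderedSplitCount {
  // Same formula as the numUnorderedBins function quoted above.
  def numUnorderedBins(arity: Int): Int = 2 * ((1 << arity - 1) - 1)

  // Hypothetical helper (not part of Spark): list every way to divide the
  // categories 0 until arity into two disjoint, non-empty sets. Category 0 is
  // pinned to the left set so that {A | B} and {B | A} are counted only once.
  def enumerateSplits(arity: Int): Seq[(Set[Int], Set[Int])] = {
    val all = (0 until arity).toSet
    val rest = (1 until arity).toSet
    rest.subsets().toSeq
      .map(extra => extra + 0)      // the left set always contains category 0
      .filter(left => left != all)  // the right set must not be empty
      .map(left => (left, all -- left))
  }

  def main(args: Array[String]): Unit = {
    for (arity <- 2 to 5) {
      val splits = enumerateSplits(arity)
      // The brute-force count matches 2^(arity - 1) - 1 ...
      assert(splits.size == (1 << (arity - 1)) - 1)
      // ... and the number of bins is twice the number of splits.
      assert(numUnorderedBins(arity) == 2 * splits.size)
      println(s"arity=$arity splits=${splits.size} bins=${numUnorderedBins(arity)}")
    }
  }
}
```

For example, a feature with 3 categories {0, 1, 2} has the 3 splits {0}|{1,2}, {0,1}|{2} and {0,2}|{1}, and therefore 6 bins.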

2) Then compute numSplits from numBins:

def numSplits(featureIndex: Int): Int = if (isUnordered(featureIndex)) {
  numBins(featureIndex) >> 1
} else {
  numBins(featureIndex) - 1
}

From this formula, for an unordered feature: numSplits = numBins / 2 = math.pow(2, arity - 1) - 1
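To see both branches side by side, here is a small standalone sketch. It is an assumption-laden simplification rather than the Spark implementation: the hypothetical `numSplits` below takes the unordered flag and the bin count as plain arguments instead of looking them up by feature index, and `SplitCountSketch` is a name made up for the example.

```scala
object SplitCountSketch {
  // Same formula as the numUnorderedBins function quoted earlier.
  def numUnorderedBins(arity: Int): Int = 2 * ((1 << arity - 1) - 1)

  // Simplified stand-in for numSplits (hypothetical signature): the flag and
  // the bin count are passed in directly rather than read from metadata.
  def numSplits(isUnordered: Boolean, numBins: Int): Int =
    if (isUnordered) numBins >> 1 else numBins - 1

  def main(args: Array[String]): Unit = {
    // Unordered feature with 3 categories: 6 bins, so 6 >> 1 = 3 splits.
    val unorderedBins = numUnorderedBins(3)
    println(s"unordered: bins=$unorderedBins splits=${numSplits(isUnordered = true, unorderedBins)}")

    // Ordered feature with 32 bins: 32 - 1 = 31 splits.
    println(s"ordered: bins=32 splits=${numSplits(isUnordered = false, 32)}")
  }
}
```

The right shift by 1 is an integer division by 2, which is exact here because the number of unordered bins is always even.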

Original article: http://blog.csdn.net/sunbow0/article/details/51211636
