
Detailed Explanations for "Spark MLlib Machine Learning" (《Spark MLlib 机器学习》), continuously updated

Date: 2016-04-22 19:26:13


1. Page 220

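Explanation of the following passage:

After obtaining the maximum number of bins, compute the maximum number of splits. For an unordered feature, splits = number of bins / 2; for an ordered feature, splits = number of bins - 1.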

 

A reader asked where the rule "for an unordered feature, splits = number of bins / 2" comes from. The explanation is as follows:

 

1) First, compute numBins:

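        // If the number of categories does not exceed the threshold
        // maxCategoriesForUnorderedFeature, treat the feature as unordered
        if (numCategories <= maxCategoriesForUnorderedFeature) { // unordered case
          unorderedFeatures.add(featureIndex)
          numBins(featureIndex) = numUnorderedBins(numCategories)
        } else { // ordered case
          numBins(featureIndex) = numCategories
        }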

From the above, in the unordered case numBins = numUnorderedBins(numCategories).

The numUnorderedBins function is as follows:

  /**
   * Given the arity of a categorical feature (arity = number of categories),
   * return the number of bins for the feature if it is to be treated as an unordered feature.
   * There is 1 split for every partitioning of categories into 2 disjoint, non-empty sets;
   * there are math.pow(2, arity - 1) - 1 such splits.
   * Each split has 2 corresponding bins.
   * Note: each split produces 2 bins, much like cutting a watermelon: one cut yields two pieces.
   */
  def numUnorderedBins(arity: Int): Int = 2 * ((1 << arity - 1) - 1)

 

By this formula: numBins = 2 * (math.pow(2, arity - 1) - 1)
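To see where the count math.pow(2, arity - 1) - 1 in the doc comment comes from, the partitions can be enumerated directly. The following is a minimal sketch, not code from Spark: the object name UnorderedSplits and the helper enumerateSplits are hypothetical names introduced only for illustration, and the sketch assumes arity >= 1. Category 0 is pinned to the right-hand set so that a partition and its mirror image are counted once.

```scala
object UnorderedSplits {
  // Same formula as the numUnorderedBins function quoted above.
  def numUnorderedBins(arity: Int): Int = 2 * ((1 << arity - 1) - 1)

  // Hypothetical helper: list every way to divide the categories
  // 0 until arity into two disjoint, non-empty sets.
  // Category 0 always stays in the right-hand set, so the left-hand set
  // ranges over the non-empty subsets of the remaining arity - 1 categories.
  def enumerateSplits(arity: Int): Seq[(Set[Int], Set[Int])] = {
    require(arity >= 1, "arity must be at least 1")
    val all = (0 until arity).toSet
    (1 until (1 << (arity - 1))).map { mask =>
      // Bit (c - 1) of mask decides whether category c goes to the left.
      val left = (1 until arity).filter(c => (mask & (1 << (c - 1))) != 0).toSet
      (left, all -- left)
    }
  }

  def main(args: Array[String]): Unit = {
    val arity = 3
    val splits = enumerateSplits(arity)
    splits.foreach { case (left, right) =>
      println(left.mkString("{", ",", "}") + " | " + right.mkString("{", ",", "}"))
    }
    println("splits = " + splits.size + ", bins = " + numUnorderedBins(arity))
  }
}
```

For arity = 3 the enumeration yields 3 partitions, and numUnorderedBins(3) returns 6, that is, 2 bins for each split.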

 

2) Then compute numSplits from numBins:

 

  def numSplits(featureIndex: Int): Int = if (isUnordered(featureIndex)) {
    numBins(featureIndex) >> 1
  } else {
    numBins(featureIndex) - 1
  }

 

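By this formula: numSplits = numBins / 2 = math.pow(2, arity - 1) - 1. In other words, an unordered feature has half as many splits as bins because each split owns 2 bins.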
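The relationship between the two counts can be checked numerically. This is a sketch under stated assumptions: the object name BinSplitCounts is hypothetical, and numSplits here is a standalone variant that takes numBins and isUnordered as parameters, whereas the function quoted above looks them up by featureIndex.

```scala
object BinSplitCounts {
  // Same formula as the numUnorderedBins function quoted earlier.
  def numUnorderedBins(arity: Int): Int = 2 * ((1 << arity - 1) - 1)

  // Hypothetical standalone variant of numSplits: the bin count and the
  // unordered flag are passed in instead of being looked up by featureIndex.
  def numSplits(numBins: Int, isUnordered: Boolean): Int =
    if (isUnordered) numBins >> 1 else numBins - 1

  def main(args: Array[String]): Unit = {
    // Unordered case: bins come from numUnorderedBins, splits are bins / 2.
    for (arity <- 2 to 5) {
      val bins = numUnorderedBins(arity)
      val splits = numSplits(bins, isUnordered = true)
      println("unordered arity=" + arity + " bins=" + bins + " splits=" + splits)
    }
    // Ordered case: one bin per category, splits are bins - 1.
    val numCategories = 5
    println("ordered categories=" + numCategories +
      " splits=" + numSplits(numCategories, isUnordered = false))
  }
}
```

With arity = 4, for example, the unordered case gives 14 bins and 7 splits, while an ordered feature with 5 categories gives 5 bins and 4 splits.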


Original article: http://blog.csdn.net/sunbow0/article/details/51211636
