Spark MLlib Concepts 2: Stratified Sampling

Date: 2015-02-01 17:25:53

Definition:
In statistical surveys, when subpopulations within an overall population vary, it is advantageous to sample each subpopulation (stratum) independently. Stratification is the process of dividing members of the population into homogeneous subgroups before sampling. The strata should be mutually exclusive: every element in the population must be assigned to only one stratum.
In short: partition the dataset into subsets whose members share the same label, then sample independently within each subset.
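The two steps in the definition (partition by label, then sample each stratum independently) can be sketched in plain Python. This is only an illustration, not Spark's implementation: the function name `stratified_sample`, its parameters, and the toy `records` data are all assumptions made for this example. Spark's pair-RDD API offers a comparable per-key operation, `sampleByKey`, which takes a sampling fraction for each key.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Hypothetical helper: keep `fraction` of every stratum, sampled independently."""
    rng = random.Random(seed)

    # Step 1: partition. Each record is assigned to exactly one stratum,
    # so the strata are mutually exclusive by construction.
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)

    # Step 2: draw a simple random sample (without replacement) inside each stratum.
    sample = []
    for label in sorted(strata):
        members = strata[label]
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample

# Toy data (assumed): 80 records labelled "a" and 20 labelled "b".
records = [("a", i) for i in range(80)] + [("b", i) for i in range(20)]
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.1)

# Number of sampled records per label.
counts = {label: sum(1 for r in sample if r[0] == label) for label in ("a", "b")}
```

With a fraction of 0.1, the sample keeps the 80/20 label ratio of the toy data: 8 records from "a" and 2 from "b".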

Advantages

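In other words: even when the probability density varies sharply across the sample space, stratified sampling still samples every stratum (every density level) with an accurate probability.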

If population density varies greatly within a region, stratified sampling will ensure that estimates can be made with equal accuracy in different parts of the region, and that comparisons of sub-regions can be made with equal statistical power.

Randomized stratification can also be used to improve population representativeness in a study.
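A small sketch may help show why this matters when density is very uneven. It assumes a made-up population of 990 "urban" and 10 "rural" records and a hypothetical helper `equal_allocation_sample`. A simple random sample leaves the number of rural records to chance, while a stratified draw fixes it in advance.

```python
import random
from collections import Counter, defaultdict

rng = random.Random(42)

# Assumed toy population: one dense stratum and one sparse stratum.
population = [("urban", i) for i in range(990)] + [("rural", i) for i in range(10)]

# Simple random sample: how many rural records appear depends on the draw.
srs = rng.sample(population, 20)
srs_counts = Counter(region for region, _ in srs)

def equal_allocation_sample(population, per_stratum, rng):
    """Hypothetical helper: draw the same number of records from every stratum."""
    strata = defaultdict(list)
    for rec in population:
        strata[rec[0]].append(rec)
    sample = []
    for region in sorted(strata):
        sample.extend(rng.sample(strata[region], per_stratum))
    return sample

# Stratified sample: each region contributes exactly 10 records.
strat = equal_allocation_sample(population, per_stratum=10, rng=rng)
strat_counts = Counter(region for region, _ in strat)
```

Because each region contributes the same number of records, estimates for the sparse region rest on as many observations as estimates for the dense one. When the strata are later combined into a population-wide estimate, each stratum has to be weighted by its share of the population.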

Disadvantages

Stratified sampling is not useful when the population cannot be exhaustively partitioned into disjoint subgroups. It would be a misapplication of the technique to make subgroups' sample sizes proportional to the amount of data available from the subgroups, rather than scaling sample sizes to subgroup sizes (or to their variances, if known to vary significantly).
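The two scaling rules mentioned above can be written out as a short sketch. Proportional allocation scales each stratum's sample size to the stratum's size. Neyman allocation, one standard way to take variability into account, scales it to the stratum's size times its standard deviation; it is named here as an illustration and does not come from the original text. The function names, stratum labels, and numbers are all assumptions.

```python
def proportional_allocation(sizes, n):
    """Split a total sample of n across strata in proportion to stratum size N_h."""
    total = sum(sizes.values())
    # Rounding can make the parts sum to slightly more or less than n.
    return {h: round(n * size / total) for h, size in sizes.items()}

def neyman_allocation(sizes, stddevs, n):
    """Split n in proportion to N_h * S_h (stratum size times standard deviation)."""
    weights = {h: sizes[h] * stddevs[h] for h in sizes}
    total = sum(weights.values())
    return {h: round(n * w / total) for h, w in weights.items()}

# Assumed stratum sizes and within-stratum standard deviations.
sizes = {"north": 600, "south": 300, "east": 100}
stddevs = {"north": 1.0, "south": 2.0, "east": 3.0}

prop = proportional_allocation(sizes, 100)      # {'north': 60, 'south': 30, 'east': 10}
ney = neyman_allocation(sizes, stddevs, 100)    # {'north': 40, 'south': 40, 'east': 20}
```

In this toy case the small but highly variable "east" stratum receives twice as many samples under Neyman allocation as under proportional allocation.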


 





Original article: http://www.cnblogs.com/zwCHAN/p/4265743.html
