Reservoir Sampling

Time: 2017-11-08 13:11:45


Reservoir sampling solves the following kind of problem: randomly choose $k$ items from a stream of $n$ elements, where $n$ could be very large or unknown in advance, such that all elements in the stream are equally likely to be selected, each with probability $k/n$.

The algorithm works as follows.

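Let's first take a look at a simple example with $k = 1$. When a new item $a_i$ comes, we either keep $a_i$ with probability $1/i$ or keep the old selected item with probability $1 - 1/i$. We repeat this process till the end of the stream, i.e., until all elements $a_1, a_2, \dots, a_n$ have been visited. The element $a_i$ is chosen in the end only if it is kept at step $i$ and never replaced afterwards, so the probability that $a_i$ is chosen in the end is

$$\frac{1}{i} \cdot \left(1 - \frac{1}{i+1}\right) \cdot \left(1 - \frac{1}{i+2}\right) \cdots \left(1 - \frac{1}{n}\right) = \frac{1}{i} \cdot \frac{i}{i+1} \cdot \frac{i+1}{i+2} \cdots \frac{n-1}{n} = \frac{1}{n}$$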
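As a quick sanity check of the argument above, here is the same computation written out for a stream of only three elements ($n = 3$, $k = 1$). The choice of $n = 3$ is purely illustrative.

```latex
\begin{align*}
P(a_1) &= 1 \cdot \left(1 - \tfrac{1}{2}\right) \cdot \left(1 - \tfrac{1}{3}\right)
        = \tfrac{1}{2} \cdot \tfrac{2}{3} = \tfrac{1}{3} \\
P(a_2) &= \tfrac{1}{2} \cdot \left(1 - \tfrac{1}{3}\right)
        = \tfrac{1}{2} \cdot \tfrac{2}{3} = \tfrac{1}{3} \\
P(a_3) &= \tfrac{1}{3}
\end{align*}
```

Each element ends up selected with probability $1/3$, and the three probabilities sum to 1.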

This proves that the algorithm selects every element with equal probability. A Java implementation of this algorithm could look like this:

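import java.util.Random;

// Returns a 1-based index chosen uniformly at random from 1..n (the k = 1 case).
int random(int n) {
    Random rnd = new Random();
    int ret = 0;
    for (int i = 1; i <= n; i++)
        // nextInt(i) is uniform on [0, i), so it equals 0 with probability 1/i
        if (rnd.nextInt(i) == 0)
            ret = i;
    return ret;
}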
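The method above takes $n$ as a parameter, whereas the problem statement allows $n$ to be unknown in advance. The following is a minimal sketch of the same $k = 1$ idea applied to a stream read through an `Iterator`. The class name `StreamSampler`, the method name `pickOne`, and the choice to pass the `Random` instance in as a parameter are assumptions made for illustration and are not part of the original article.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

class StreamSampler {
    // Picks one element uniformly at random from a stream of unknown length.
    // Returns null for an empty stream. (Hypothetical helper, for illustration.)
    static <T> T pickOne(Iterator<T> stream, Random rnd) {
        T chosen = null;
        int seen = 0; // number of elements visited so far
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            // Replace the current choice with probability 1/seen.
            // For the first element nextInt(1) is always 0, so it is always kept.
            if (rnd.nextInt(seen) == 0) {
                chosen = item;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "c", "d");
        // Prints one of the four strings, each with probability 1/4.
        System.out.println(pickOne(words.iterator(), new Random()));
    }
}
```

Only the current choice and a counter are stored, so the memory use does not depend on the length of the stream.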

The case $k > 1$ is a little tricky. One straightforward way is to simply run the previous algorithm $k$ times. However, this requires multiple passes over the stream. Here we discuss another approach that selects $k$ elements at random in a single pass.

For item $a_i$, there are two cases to handle:

  1. When $i \le k$, we just blindly keep $a_i$
  2. When $i > k$, we keep $a_i$ with probability $k/i$

A simple implementation requires the memory space to store the $k$ selected elements, say $s_1, s_2, \dots, s_k$. For every $a_i$ with $i > k$ we first draw a random integer $j$ uniformly from $\{1, 2, \dots, i\}$ and keep $a_i$ when $j \le k$, i.e., we set $s_j = a_i$. Otherwise $a_i$ is discarded. This guarantees the $k/i$ probability in the second scenario.

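The proof is similar to the previous one. For $i > k$, the probability that $a_i$ is chosen in the end is

$$\frac{k}{i} \cdot \prod_{j=i+1}^{n} \left(1 - \frac{1}{j}\right) = \frac{k}{i} \cdot \frac{i}{n} = \frac{k}{n}$$

Here $\frac{k}{j} \cdot \frac{1}{k} = \frac{1}{j}$ is the probability that $a_i$ is replaced by $a_j$ at a later step $j$: $a_j$ is kept with probability $k/j$ and then overwrites one of the $k$ stored elements chosen uniformly. For $i \le k$ the leading factor is 1 and the product starts at $j = k + 1$, which again gives $k/n$.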
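To make the proof concrete, the computation can be written out for a small case. The values $k = 2$ and $n = 4$ are chosen only for illustration, and the derivation assumes that a kept element overwrites each of the $k$ stored elements with equal probability $1/k$.

```latex
\begin{align*}
P(a_1) = P(a_2) &= 1 \cdot \left(1 - \tfrac{1}{3}\right) \cdot \left(1 - \tfrac{1}{4}\right)
                 = \tfrac{2}{3} \cdot \tfrac{3}{4} = \tfrac{1}{2} \\
P(a_3) &= \tfrac{2}{3} \cdot \left(1 - \tfrac{1}{4}\right)
        = \tfrac{2}{3} \cdot \tfrac{3}{4} = \tfrac{1}{2} \\
P(a_4) &= \tfrac{2}{4} = \tfrac{1}{2}
\end{align*}
```

Every element is selected with probability $k/n = 2/4$, and the four probabilities sum to $k = 2$, the number of stored elements.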

Below is a sample implementation in Java:

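import java.util.Random;

// Returns k elements of a sampled uniformly at random; assumes a.length >= k.
int[] random(int[] a, int k) {
    int[] s = new int[k];
    Random rnd = new Random();
    // The first k elements fill the reservoir unconditionally.
    for (int i = 0; i < k; i++)
        s[i] = a[i];
    // i is the 1-based position of the current element, so the element is a[i - 1].
    for (int i = k + 1; i <= a.length; i++) {
        int j = rnd.nextInt(i);        // uniform on [0, i)
        if (j < k) s[j] = a[i - 1];    // happens with probability k/i
    }
    return s;
}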
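The array version above knows the input length up front. As a hedged sketch of how the same two cases might be applied to a stream of unknown length, the variant below reads from an `Iterator` and keeps the selected elements in a list. The class name `ReservoirSampler`, the method name `sample`, and the decision to return all elements when the stream is shorter than $k$ are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

class ReservoirSampler {
    // Returns up to k elements sampled uniformly from a stream of unknown length.
    // If the stream has fewer than k elements, all of them are returned.
    // (Hypothetical helper, for illustration.)
    static <T> List<T> sample(Iterator<T> stream, int k, Random rnd) {
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0; // 1-based position of the current element
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                // Case 1: the first k elements are kept unconditionally.
                reservoir.add(item);
            } else {
                // Case 2: keep the element with probability k/seen.
                int j = rnd.nextInt(seen); // uniform on [0, seen)
                if (j < k) {
                    reservoir.set(j, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        // Prints 3 of the 10 values; the result varies from run to run.
        System.out.println(sample(data.iterator(), 3, new Random()));
    }
}
```

Drawing a single index `j` serves two purposes: the test `j < k` decides whether the element is kept, and when it is kept, `j` is also a uniformly distributed slot to overwrite.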

Original article: http://www.cnblogs.com/wtsb/p/7803396.html
