How to select the 100,000 smallest values from a very large data set (say, 100 million)? Part 2

Date: 2015-04-15 22:48:56

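For data that can be read into memory and processed in one pass, I plan to select the K smallest values using several methods. The candidates are: insertion sort, binary-search insertion sort, linked-list sort, and heap sort.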

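Sequential search with direct insertion:

Approach:

1. Take each value of the full data set in turn and offer it to a queue (an ascending array that holds at most n values).

2. Find the position of the candidate value in the queue by sequential search, then insert it at that position.

3. When the queue is full and the candidate is not smaller than the largest value in the queue, discard it immediately.

In addition, the code compares the values selected by this program with the result of the C++ standard sort, to confirm that the selection is correct.

The code is as follows: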

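#include <vcl.h>                                                // C++Builder VCL: String, ShowMessage
#include <stdlib.h>                                             // random() as provided by C++Builder
#include <algorithm>                                            // std::sort

// Keep the n smallest values of Data[0..m) in Out, in ascending order.
// Out needs room for n+ 1 values: the shift below briefly uses Out[n].
void sort1(int Data[], int m, int n, int Out[])
{
  for(int count= 0, i= 0; i< m; i++)
    if(count== n && Data[i]>= Out[n- 1])
      continue;                                                 // queue full and value too large: discard
    else
      for(int j= 0; ; j++)                                      // sequential search
        if(j== count || Data[i]<= Out[j])                       // insertion point found
        {
          for(int k= count; k> j; k--)                          // shift right to open the slot
            Out[k]= Out[k- 1];
          Out[j]= Data[i];
          if(count< n)
            count++;                                            // queue grows until it holds n values
          break;
        }
}
// Show the seconds elapsed since tms, then restart the timer.
void ShowTime(String caption, double &tms)
{
  ShowMessage(String(caption+ "   "+ (GetTickCount()- tms)/1000)+ " s");
  tms= GetTickCount();
}
void control(int n)
{
  int m= n* 1000;                                               // total number of values
  double tms= GetTickCount();
  int *Data= new int[m], *DataSort= new int[m];
  int *Out=  new int[n+ 1];                                     // one spare slot for the shift
  for(int i= 0; i< m; i++)                                      // fill with random numbers
    DataSort[i]= Data[i]= random(m);
  ShowTime("Random data generation time", tms);
  std::sort(DataSort, DataSort+ m);
  ShowTime("Standard sort time", tms);

  sort1(Data, m, n, Out);
  ShowTime("Selection time", tms);
  for(int lim= n, i= 0; i<= lim; i++)                           // compare with the first n sorted values
    if(i== lim || DataSort[i]!= Out[i])
      ShowMessage(i== lim? "Selection correct":"Selection wrong"), i= lim;

  delete []DataSort;
  delete []Data;
  delete []Out;
}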

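When control(int n) is called with n = 100,000 (so m = 100,000,000 values in total), the timings are:

Generating random data: 8.346 s;

C++ reference sort: 75.864 s;

Selection by this program: 1121.289 s;

These timings have no absolute meaning. Once more methods have been written, however, their timings can be compared with one another.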
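The listing above depends on C++Builder (ShowMessage, GetTickCount, random). For readers on another compiler, the same check can be sketched in standard C++. This is only a sketch under stated assumptions: the names insertSelect, CheckResult and checkSelection are hypothetical and do not appear in the original program, and std::vector replaces the raw arrays.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <random>
#include <vector>

// Hypothetical stand-in for sort1: keeps the n smallest values of data,
// in ascending order, using sequential search and direct insertion.
std::vector<int> insertSelect(const std::vector<int>& data, int n)
{
  assert(n >= 1);
  std::vector<int> out;
  out.reserve(n + 1);                            // one spare slot, as in Out[n + 1]
  for (int value : data) {
    if (static_cast<int>(out.size()) == n && value >= out.back())
      continue;                                  // queue full and value too large
    std::size_t j = 0;
    while (j < out.size() && value > out[j])     // sequential search
      ++j;
    out.insert(out.begin() + j, value);          // shift right and insert
    if (static_cast<int>(out.size()) > n)
      out.pop_back();                            // drop the displaced maximum
  }
  return out;
}

// Hypothetical result type: correctness flag plus selection time in seconds.
struct CheckResult { bool correct; double seconds; };

// Builds m pseudo-random values, times the selection of the n smallest,
// and compares them with the first n entries of a fully sorted copy.
CheckResult checkSelection(int m, int n, unsigned seed = 12345)
{
  assert(n >= 1 && n <= m);
  std::mt19937 gen(seed);
  std::uniform_int_distribution<int> dist(0, m - 1);
  std::vector<int> data(m);
  for (int& v : data)
    v = dist(gen);

  std::vector<int> sorted(data);
  std::sort(sorted.begin(), sorted.end());       // reference result

  auto start = std::chrono::steady_clock::now();
  std::vector<int> out = insertSelect(data, n);
  std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;

  bool same = static_cast<int>(out.size()) == n &&
              std::equal(out.begin(), out.end(), sorted.begin());
  return CheckResult{same, elapsed.count()};
}
```

Calling checkSelection(m, n) with small values first (for example m = 100000, n = 100) is a cheap way to confirm correctness before trying the full 100 million values, which by the figures above takes a long time with this method.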

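Binary search insertion:

// Same contract as sort1, but the insertion point is found by binary search.
void sort2(int Data[], int m, int n, int Out[])
{
  for(int count= 0, i= 0; i< m; i++)
    if(count== n && Data[i]>= Out[n- 1])
      continue;                                                 // queue full and value too large: discard
    else
    {
      int low= 0, high= count- 1;
      while(low<= high)                                         // binary search
      {
        int mid= low+ (high- low)/ 2;
        if(Data[i]>= Out[mid]) low= mid+ 1; else high= mid- 1;
      }
      for(int k= count; k> low; k--)                            // shift right to open the slot
        Out[k]= Out[k- 1];
      Out[low]= Data[i];
      if(count< n)
        count++;                                                // queue grows until it holds n values
    }
}

Timings with binary search:

Standard sort: 53.8 s;

Selection: 278.508 s.

Judging by the standard sort (53.8 s here against 75.864 s in the first run), the program as a whole seems to have run faster this time.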
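The binary search step can also be delegated to the standard library. The following is a minimal sketch, assuming std::vector in place of the raw arrays; the name binaryInsertSelect is hypothetical. std::upper_bound returns the first position whose element is greater than the value, which is the same insertion point the hand-written loop settles on. Note that only the search becomes logarithmic: each insertion still shifts up to n elements, so the amount of data moved in the worst case remains proportional to m times n.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical counterpart of sort2: keeps the n smallest values of data,
// in ascending order, locating each insertion point by binary search.
std::vector<int> binaryInsertSelect(const std::vector<int>& data, int n)
{
  assert(n >= 1);
  std::vector<int> out;
  out.reserve(n + 1);                            // one spare slot for the insert
  for (int value : data) {
    if (static_cast<int>(out.size()) == n && value >= out.back())
      continue;                                  // queue full and value too large
    // Binary search: first element strictly greater than value.
    std::vector<int>::iterator pos =
        std::upper_bound(out.begin(), out.end(), value);
    out.insert(pos, value);                      // shift right and insert
    if (static_cast<int>(out.size()) > n)
      out.pop_back();                            // drop the displaced maximum
  }
  return out;
}
```

For example, binaryInsertSelect({5, 3, 9, 1, 7, 3}, 3) keeps 1, 3 and 3.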


Original article: http://www.cnblogs.com/oldtab/p/4430191.html
