Introduction to the HBase Bloom Filter (BloomFilter)

Date: 2017-05-05 14:13:26


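1. Main purpose

Improve random-read performance.

2. Storage overhead

The Bloom filter data is stored in the meta section of the StoreFile and cannot be updated once written, because a StoreFile is immutable. The Bloom filter is a column family (CF) level configuration property. If you enable a Bloom filter on a table, HBase includes a block of Bloom filter data, called a MetaBlock, whenever it generates a StoreFile. MetaBlocks are maintained by the LRUBlockCache together with DataBlocks (the actual KeyValue data). Enabling the Bloom filter therefore carries some storage and in-memory cache overhead.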
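To get a feel for the size of that overhead, the standard Bloom filter sizing formulas can be evaluated for a single StoreFile. The sketch below is a back-of-the-envelope estimate only: the key count and the 1% target false-positive rate are assumed values, the class and method names are made up for illustration, and the real on-disk layout that HBase writes will differ.

```java
final class BloomSizeEstimate {
    /** Optimal number of bits for n keys at false-positive rate p: m = -n * ln(p) / (ln 2)^2. */
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** Optimal number of hash functions: k = (m / n) * ln 2, rounded to the nearest integer. */
    static int optimalHashCount(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long keys = 1_000_000L;   // hypothetical number of keys in one StoreFile
        double errorRate = 0.01;  // assumed target false-positive rate
        long bits = optimalBits(keys, errorRate);
        System.out.printf("bits=%d (~%d KiB), hashes=%d%n",
                bits, bits / 8 / 1024, optimalHashCount(keys, bits));
    }
}
```

At a 1% error rate this works out to a little under 10 bits per key, so the filter is small compared with the data but not free, especially once it also occupies block cache.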

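3. Filtering granularity

a) ROW

StoreFiles are filtered by the row of each KeyValue.

Example: suppose there are two StoreFiles, sf1 and sf2.

sf1 contains kv1 (r1 cf:q1 v) and kv2 (r2 cf:q1 v)

sf2 contains kv3 (r3 cf:q1 v) and kv4 (r4 cf:q1 v)

If the bloomfilter attribute of the CF is set to ROW, get(r1) filters out sf2 and get(r3) filters out sf1.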

b) ROWCOL

StoreFiles are filtered by the row + qualifier of each KeyValue.

Example: suppose there are two StoreFiles, sf1 and sf2.

sf1 contains kv1 (r1 cf:q1 v) and kv2 (r2 cf:q1 v)

sf2 contains kv3 (r1 cf:q2 v) and kv4 (r2 cf:q2 v)

If the bloomfilter attribute of the CF is set to ROW, both get(r1,q1) and get(r1,q2) read sf1 and sf2. If it is set to ROWCOL, get(r1,q1) filters out sf2 and get(r1,q2) filters out sf1.
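The difference between the two granularities can be reproduced with a toy model of the ROWCOL example above. This is only a sketch: ToyStoreFile and the other names are hypothetical, and an exact HashSet stands in for the Bloom filter, so there are no false positives here as there would be with a real filter.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class BloomGranularityDemo {
    enum BloomType { ROW, ROWCOL }

    /** A toy StoreFile; a HashSet stands in for its Bloom filter. */
    static final class ToyStoreFile {
        final String name;
        final BloomType type;
        final Set<String> filter = new HashSet<>();

        ToyStoreFile(String name, BloomType type) {
            this.name = name;
            this.type = type;
        }

        ToyStoreFile add(String row, String qualifier) {
            filter.add(key(row, qualifier));
            return this;
        }

        // ROW keys the filter on the row only; ROWCOL keys it on row + qualifier.
        private String key(String row, String qualifier) {
            return type == BloomType.ROW ? row : row + "/" + qualifier;
        }

        boolean mightContain(String row, String qualifier) {
            return filter.contains(key(row, qualifier));
        }
    }

    /** Names of the StoreFiles that get(row, qualifier) still has to read. */
    static List<String> filesToRead(List<ToyStoreFile> files, String row, String qualifier) {
        List<String> result = new ArrayList<>();
        for (ToyStoreFile f : files) {
            if (f.mightContain(row, qualifier)) result.add(f.name);
        }
        return result;
    }

    /** Builds sf1 and sf2 from the ROWCOL example with the given filter type. */
    static List<ToyStoreFile> build(BloomType type) {
        return List.of(
            new ToyStoreFile("sf1", type).add("r1", "q1").add("r2", "q1"),
            new ToyStoreFile("sf2", type).add("r1", "q2").add("r2", "q2"));
    }

    public static void main(String[] args) {
        System.out.println("ROW:    " + filesToRead(build(BloomType.ROW), "r1", "q1"));
        System.out.println("ROWCOL: " + filesToRead(build(BloomType.ROWCOL), "r1", "q1"));
    }
}
```

With ROW, both files survive the check because r1 appears in each; with ROWCOL, only the file that holds the requested row + qualifier pair survives.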

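4. Common use cases

1) Random reads by key, where filtering happens at the StoreFile level.

2) Reads that query a large number of nonexistent keys; the filter can also be used to determine efficiently whether a key exists.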

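5. Worked example

Suppose three keys x, y, and z exist in the table, and W does not.

A Bloom filter helps reduce the number of scan operations performed just to determine whether a key exists.

Step 1) Apply the hash functions to x, y, and z to obtain their bit masks, and write them into the Bloom filter structure.

Step 2) Apply the hash functions to W and look up the resulting bit mask in the Bloom filter.

If the lookup misses: at least one of the three bits is 0, so W definitely does not exist (a Bloom filter never produces a false negative).

If the lookup hits: all three bits are 1, so the request is routed to the Region responsible for W and a scan is run to confirm whether the key really exists (a Bloom filter has a small probability of a false positive).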
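The two steps above can be sketched with a very small Bloom filter. This is an illustration under assumptions, not the HBase implementation: the 64-bit size, the three hash probes, and the simple polynomial hashes are all chosen for readability, and TinyBloomFilter is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

final class TinyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    TinyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // i-th bit position for a key, derived from two toy hashes: h1 + i * h2.
    private int position(String key, int i) {
        int h1 = 0;
        int h2 = 0;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h1 = 31 * h1 + b;
            h2 = 131 * h2 + b;
        }
        h2 |= 1; // keep the step nonzero so the probes differ
        return Math.floorMod(h1 + i * h2, size);
    }

    /** Step 1: set the key's bits in the filter. */
    void add(String key) {
        for (int i = 0; i < hashCount; i++) bits.set(position(key, i));
    }

    /** Step 2: test the key's bits. */
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) return false; // a 0 bit: definitely absent
        }
        return true; // all bits are 1: possibly present, must be confirmed by a real read
    }

    public static void main(String[] args) {
        TinyBloomFilter filter = new TinyBloomFilter(64, 3);
        for (String key : new String[] {"x", "y", "z"}) filter.add(key);
        for (String key : new String[] {"x", "y", "z", "w"}) {
            System.out.println(key + " -> " + filter.mightContain(key));
        }
    }
}
```

A key that was added always tests positive. A key that was never added usually tests negative, but with a filter this small a false positive becomes likely as more keys are added, which is why a hit still has to be confirmed against the data.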

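6. Source code walkthrough

1) A get operation enables the Bloom filter to help eliminate StoreFiles that will not be needed.

When a scan is initialized (a get is wrapped as a scan), shouldSeek is checked for every StoreFile. If it returns false, the StoreFile does not contain the requested data and is skipped.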

if (memOnly == false  
            && ((StoreFileScanner) kvs).shouldSeek(scan, columns)) {  
          scanners.add(kvs);  
}

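The shouldSeek method: if the request is a range scan rather than a get, it returns true immediately, meaning the StoreFile cannot be skipped. Otherwise it checks according to the Bloom filter type.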

if (!scan.isGetScan()) {  
        return true;  
}  
byte[] row = scan.getStartRow();  
switch (this.bloomFilterType) {  
  case ROW:  
    return passesBloomFilter(row, 0, row.length, null, 0, 0);  
 
  case ROWCOL:  
    if (columns != null && columns.size() == 1) {  
      byte[] column = columns.first();  
      return passesBloomFilter(row, 0, row.length, column, 0, column.length);  
    }  
    // For multi-column queries the Bloom filter is checked from the  
    // seekExact operation.  
    return true;  
 
  default:  
    return true;
}
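The branches of shouldSeek can be restated as a small standalone decision function, which makes it easier to see which request shapes ever reach the Bloom filter. This is a simplified model of the snippet above, not HBase code: the Outcome enum and the columnCount parameter are hypothetical, and columnCount 0 stands for "no columns specified".

```java
final class ShouldSeekDecision {
    enum BloomType { NONE, ROW, ROWCOL }
    enum Outcome { MUST_SEEK, CHECK_ROW_BLOOM, CHECK_ROWCOL_BLOOM }

    /**
     * @param isGet       true for a single-row get, false for a range scan
     * @param type        Bloom filter type configured on the column family
     * @param columnCount number of qualifiers requested (0 if none were specified)
     */
    static Outcome decide(boolean isGet, BloomType type, int columnCount) {
        if (!isGet) return Outcome.MUST_SEEK; // range scan: the file cannot be skipped here
        switch (type) {
            case ROW:
                return Outcome.CHECK_ROW_BLOOM;
            case ROWCOL:
                // Only a single-column get is checked at this point; multi-column
                // queries are left to the later exact-seek path.
                return columnCount == 1 ? Outcome.CHECK_ROWCOL_BLOOM : Outcome.MUST_SEEK;
            default:
                return Outcome.MUST_SEEK;
        }
    }

    public static void main(String[] args) {
        for (BloomType type : BloomType.values()) {
            for (boolean isGet : new boolean[] {true, false}) {
                for (int cols = 0; cols <= 2; cols++) {
                    System.out.printf("type=%s get=%b cols=%d -> %s%n",
                            type, isGet, cols, decide(isGet, type, cols));
                }
            }
        }
    }
}
```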

2) With ROWCOL configured, a scan that specifies a qualifier eliminates StoreFiles that will not be used.

A scan or get that specifies a qualifier is checked in seekExactly:

// Seek all scanners to the start of the Row (or if the exact matching row  
// key does not exist, then to the start of the next matching Row).  
if (matcher.isExactColumnQuery()) {
  for (KeyValueScanner scanner : scanners)
    scanner.seekExactly(matcher.getStartKey(), false);
} else {
  for (KeyValueScanner scanner : scanners)
    scanner.seek(matcher.getStartKey());
}

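If the Bloom filter misses, a fake KeyValue that sorts last on that row/column is created, indicating that this StoreFile does not need to be scanned for real.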

public boolean seekExactly(KeyValue kv, boolean forward)  
      throws IOException {  
    if (reader.getBloomFilterType() != StoreFile.BloomType.ROWCOL ||  
        kv.getRowLength() == 0 || kv.getQualifierLength() == 0) {  
      return forward ? reseek(kv) : seek(kv);  
    }  
  
    boolean isInBloom = reader.passesBloomFilter(kv.getBuffer(),  
        kv.getRowOffset(), kv.getRowLength(), kv.getBuffer(),  
        kv.getQualifierOffset(), kv.getQualifierLength());  
    if (isInBloom) {  
      // This row/column might be in this store file. Do a normal seek.  
      return forward ? reseek(kv) : seek(kv);  
    }  
  
    // Create a fake key/value, so that this scanner only bubbles up to the top  
    // of the KeyValueHeap in StoreScanner after we scanned this row/column in  
    // all other store files. The query matcher will then just skip this fake  
    // key/value and the store scanner will progress to the next column.  
    cur = kv.createLastOnRowCol();  
    return true;  
}

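Why can only ROWCOL eliminate a StoreFile here? The reason is simple: a scan covers a range. If a ROW Bloom filter misses, that only shows that the row key is not in this StoreFile; the next row key might still be there. A ROWCOL Bloom filter is different: a miss shows that the requested row + qualifier is not in this StoreFile, so the scan does not need to read this StoreFile for that row and column.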
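The effect of the fake KeyValue on the scanner heap can be shown with a toy priority queue. This sketch assumes the usual HBase cell ordering (row, then qualifier, then newest timestamp first) and models the fake key as one with the smallest possible timestamp. Key, lastOnRowCol, and surfaceOrder are hypothetical names, not the HBase classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class FakeKeyHeapDemo {
    /** A toy cell key: row, qualifier, timestamp, plus the StoreFile it came from. */
    static final class Key {
        final String row;
        final String qualifier;
        final long timestamp;
        final String source;

        Key(String row, String qualifier, long timestamp, String source) {
            this.row = row;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.source = source;
        }

        /** Placeholder that sorts after every real cell on the same row/column. */
        static Key lastOnRowCol(String row, String qualifier, String source) {
            return new Key(row, qualifier, Long.MIN_VALUE, source + ":fake");
        }
    }

    // Order by row, then qualifier, then newest timestamp first.
    static final Comparator<Key> ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c == 0) c = a.qualifier.compareTo(b.qualifier);
        if (c == 0) c = Long.compare(b.timestamp, a.timestamp);
        return c;
    };

    /** Drains the heap and returns the sources in the order the scanners surface. */
    static List<String> surfaceOrder(List<Key> scannerHeads) {
        PriorityQueue<Key> heap = new PriorityQueue<>(ORDER);
        heap.addAll(scannerHeads);
        List<String> order = new ArrayList<>();
        while (!heap.isEmpty()) order.add(heap.poll().source);
        return order;
    }

    static List<String> demo() {
        return surfaceOrder(List.of(
            Key.lastOnRowCol("r1", "q1", "sf2"), // Bloom filter miss: no real seek
            new Key("r1", "q1", 5L, "sf1"),
            new Key("r1", "q1", 9L, "sf3")));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The scanner that missed in the Bloom filter only reaches the top of the heap after the real cells for that row and column have been consumed, so the StoreFile behind it is never read for them.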

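7. Summary

1) The Bloom filter optimization takes effect for any type of get (by row key or by row + col). The key point is that the type of get must match the type of Bloom filter.

2) A row-based scan cannot use the Bloom filter.

This is because a Bloom filter needs to know the item to test in advance, and a sequential scan has no way of knowing the row keys ahead of time.

A get specifies the row key, so it can use the Bloom filter; the same reasoning applies to a scan that specifies a column.

3) A scan that specifies row + col + qualifier can exclude StoreFiles that do not contain that qualifier, which is a worthwhile optimization, and specifying the qualifier also reduces the amount of data transferred. Scans should therefore specify a qualifier whenever possible.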


Original article: http://www.cnblogs.com/yutingliuyl/p/6812531.html
