Getting Started with Bucketed Tables in Hive (Useful for Sampling Queries)

Posted: 2017-10-10 17:47:51

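1. Basic concepts

(1) A bucketed table hashes the values of one column to scatter the rows, then stores them in separate files, one per bucket.

(2) In a Hive partitioned table, bucketing is recommended when the amount of data inside a single partition becomes too large.

(3) When bucketing, Hive computes a hash of the bucketing column's value and takes that hash modulo the number of buckets; the remainder decides which bucket the row goes to. Rows with the same value always land in the same bucket, but the buckets do not necessarily hold the same number of rows, and a bucket can end up empty if no value hashes to it.

The hash function used depends on the data type of the bucketing column.

(4) For sampling and for joins on the bucketing column, queries against a bucketed table can be more efficient than queries against a table that is only partitioned, because Hive can read just the bucket files it needs.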
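As a rough way to see the hash-and-modulo rule from (3), the query below previews which bucket each id would fall into. It is only a sketch under assumptions: a table named btest2 with an int column id exists (such as the staging table created in section 4), the target table has 4 buckets, and the hash values are non-negative. hash() and pmod() are built-in Hive functions, but the hash Hive uses internally for bucketing can differ between versions, so treat the result as an approximation.

```sql
-- Preview the bucket assignment for a 4-bucket table (illustrative only).
-- For a single int column, hash(id) is typically the value itself.
select id,
       pmod(hash(id), 4) as expected_bucket  -- remainder in the range 0..3
from btest2;
```

With ids 1 through 8, for example, ids 4 and 8 would be expected to share a bucket, which is why bucket sizes depend on how the values are distributed.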

2. Creating a bucketed table

create table btable1 (id int) clustered by(id) into 4 buckets;

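This creates a bucketed table with a single column (id), bucketed on id into 4 buckets. The number of buckets equals the number of reducers used when data is actually inserted.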
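Bucketing is often combined with partitioning, as mentioned in section 1. The sketch below shows one way that might look; the table name page_views and its columns are hypothetical and not part of the original example.

```sql
-- Hypothetical table: partitioned by day, bucketed by id within each partition.
create table page_views (id int, url string)
partitioned by (dt string)
clustered by (id) into 4 buckets;

-- Inspect the table metadata; the output should list the number of buckets
-- and the bucketing column.
describe formatted page_views;
```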

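3. Loading data indirectly

A bucketed table should not be populated directly with load, because load only moves files into the table directory without hashing the rows into buckets. Data has to be inserted from another table instead.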

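4. Worked example

(1) Configure the environment so that Hive enforces bucketing

The following setting is required:
vim ~/.hiverc
Add: set hive.enforce.bucketing = true;
Note: this applies to Hive 1.x. In Hive 2.0 and later the property was removed and bucketing is always enforced.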
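If editing ~/.hiverc is not convenient, the same property can presumably be set for a single session from the Hive CLI. A minimal sketch, assuming a Hive 1.x installation where the property still exists:

```sql
-- Enable bucketing enforcement for the current session only.
set hive.enforce.bucketing = true;

-- Print the current value to confirm the setting took effect.
set hive.enforce.bucketing;
```

A session-level setting is lost when the CLI exits, which is why the ~/.hiverc approach is the more durable choice.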

(2) Create the bucketed table

create table btable1 (id int) clustered by(id) into 4 buckets;

(3) Create an intermediate staging table and load data into it

create table btest2(id int);
load data local inpath 'btest2' into table btest2;

(4) Insert data into the bucketed table

insert into table btable1 select * from btest2;
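After the insert it can be useful to check how the rows were spread across the bucket files. One possible check, assuming the bucketed table is btable1 from step (2), uses Hive's INPUT__FILE__NAME virtual column:

```sql
-- Count rows per underlying file. With 4 buckets there should be at most
-- 4 distinct files, and the counts need not be equal.
select INPUT__FILE__NAME as bucket_file, count(*) as row_count
from btable1
group by INPUT__FILE__NAME;
```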

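(5) Change the number of buckets of a bucketed table

alter table btest3 clustered by(name,age) sorted by(age) into 10 buckets;

Here btest3 is a separate table that has name and age columns. The statement only changes the table metadata; data already in the table is not redistributed and has to be re-inserted to match the new bucket layout.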

(6) Sampling queries in Hive

select * from btest tablesample(bucket 2 out of 5 on id);
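In tablesample(bucket x out of y on col), the rows are divided into y groups by hashing col, and group x is returned. When col is the table's bucketing column and y is a multiple or a factor of the bucket count, Hive can read only the matching bucket files; otherwise it most likely has to scan the whole table and filter. The sketch below assumes the 4-bucket table btable1 from step (2).

```sql
-- btable1 is clustered by id into 4 buckets.

-- y = 4 matches the bucket count: returns the rows of the 1st bucket.
select * from btable1 tablesample(bucket 1 out of 4 on id);

-- y = 2 is a factor of 4: returns roughly half the table
-- (the 1st and 3rd buckets).
select * from btable1 tablesample(bucket 1 out of 2 on id);
```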

Original article: http://www.cnblogs.com/MrFee/p/hive_bucket.html
