Hive Syntax-Level Optimization, Part 4: Data Skew Caused by count(distinct)

Date: 2014-07-20 22:18:35

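When the column being counted contains a large number of records whose value is NULL or empty, data skew is likely to occur.

Approach:

For count distinct, handle the empty values separately. If count distinct is the only thing being computed, no special processing is needed: filter those records out directly and add 1 to the final result.

If other computations are involved and a group by is required, first process the records with empty values on their own, then union that result with the results of the other computations.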
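The group by case can be sketched roughly as follows. This is a minimal illustration rather than the author's own query: the per-user count pv and the alias t are hypothetical, end_user_id is assumed to be a string column, and NULL and empty keys are assumed to be acceptable as a single '' bucket in the output. The union all is wrapped in a subquery because older Hive releases only accept union all inside a subquery.

```sql
-- Hypothetical sketch: count rows per end_user_id while keeping the
-- NULL/empty keys out of the skew-prone group by.
select t.end_user_id, t.pv
from (
  -- Normal keys: grouped as usual.
  select end_user_id, count(*) as pv
  from trackinfo
  where end_user_id is not null and end_user_id <> ''
  group by end_user_id

  union all

  -- NULL/empty keys: a global aggregate with no group by key.
  -- Note: this branch emits one row even when no such records exist (pv = 0).
  select '' as end_user_id, count(*) as pv
  from trackinfo
  where end_user_id is null or end_user_id = ''
) t;
```

Since the second branch has no group by key, the NULL/empty records are aggregated as a plain global count instead of all being shuffled to one reducer under a single key.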

Example:

select count(distinct  end_user_id) as user_num  from trackinfo;

Rewritten as:

select cast(count(distinct end_user_id) + 1 as bigint) as user_num from trackinfo where end_user_id is not null and end_user_id <> '';

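Analysis: records whose end_user_id is NULL or empty are filtered out before the distinct count, and 1 is added to the total to stand for the empty value. Because count(distinct) already ignores NULL, the added 1 represents the empty string, so the result matches the original query only when at least one empty-string record exists.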
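If it is not known in advance whether empty-string records exist, the 1 can be added conditionally instead. The following is a hedged sketch, not part of the original article: the aliases d and e and the column has_empty are hypothetical, and it costs a second, filter-only scan of trackinfo.

```sql
-- Hypothetical sketch: add 1 only when an empty-string end_user_id exists.
select cast(d.user_num + e.has_empty as bigint) as user_num
from (
  -- Distinct count over the non-empty keys only (no skewed NULL/'' key).
  select count(distinct end_user_id) as user_num
  from trackinfo
  where end_user_id is not null and end_user_id <> ''
) d
cross join (
  -- 1 if at least one empty-string record exists, otherwise 0.
  select if(count(*) > 0, 1, 0) as has_empty
  from trackinfo
  where end_user_id = ''
) e;
```

Both subqueries return exactly one row, so the cross join yields a single row. On clusters that run Hive in strict mode, Cartesian products may be rejected, in which case this sketch would need adjusting.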

Multi-Count Distinct

select pid, count(distinct acookie), count(distinct ip), count(distinct wangwangid) from ods_p4ppv_ad_d where dt=20140305 group by pid;

The following parameter must be set: set hive.groupby.skewindata=true


Original article: http://www.cnblogs.com/luogankun/p/3856574.html
