Hive SQL optimization: data skew

Date: 2015-05-13 17:09:42

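This script runs slowly, mainly because of data skew on the reduce side. The table dw.fct_traffic_navpage_path_detl collects user click data, so only a tiny fraction of clicks end in an add-to-cart or an order. As a result, the table holds a huge number of rows where ordr_code is empty or cart_prod_id is NULL, as shown below: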

select ordr_code, count(*) as a from dw.fct_traffic_navpage_path_detl where ds = '2015-05-10' group by ordr_code having a > 10000;

151722135

select cart_prod_id, count(*) as a from dw.fct_traffic_navpage_path_detl where ds = '2015-05-10' group by cart_prod_id having a > 10000;

NULL 127233335
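To put the two counts above in proportion, a single pass over the partition can report the empty or NULL rows next to the total. This is only a sketch: it assumes ordr_code is a string column whose "empty" value is either NULL or '', and the output column names are made up for illustration.

```sql
-- Sketch: share of rows whose join keys are empty or NULL for one day.
-- Assumes ordr_code is a string; the aliases are illustrative only.
select count(*) as total_rows,
       sum(case when ordr_code is null or ordr_code = '' then 1 else 0 end) as empty_ordr_code_rows,
       sum(case when cart_prod_id is null then 1 else 0 end) as null_cart_prod_id_rows
from dw.fct_traffic_navpage_path_detl
where ds = '2015-05-10';
```

If most rows fall into the empty or NULL buckets, any shuffle join on these keys sends them all to the same reducer, which is the skew described here.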

For the create table tmp_lifan_trfc_tpa as statement, the BI team added the following settings:

-- dw.univ_parnt_tranx_comb_detl never exceeds 120 MB. On Hive on Tez, use hive.auto.convert.join.noconditionaltask.size instead, so that Tez generates a BROADCAST edge.
set hive.mapjoin.smalltable.filesize=120000000;

set hive.auto.convert.join=true;
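For Hive on Tez, the equivalent settings might look like the sketch below. The 120000000 threshold is simply carried over from the MapReduce setting above. hive.auto.convert.join.noconditionaltask is a standard Hive property that the post itself does not mention, so setting it explicitly here is an assumption.

```sql
-- Sketch for Hive on Tez; threshold copied from the MapReduce example above.
set hive.auto.convert.join=true;
-- Assumption: set explicitly in case the cluster configuration has turned it off.
set hive.auto.convert.join.noconditionaltask=true;
-- Small tables whose combined size is below this many bytes are broadcast to the map side.
set hive.auto.convert.join.noconditionaltask.size=120000000;
```

Running explain on the statement afterwards is one way to check that the join against dw.univ_parnt_tranx_comb_detl was actually converted to a map join.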

The SQL was also modified as follows:

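from dw.fct_traffic_navpage_path_detl t
-- map join against the small table avoids the skewed shuffle
left outer join dw.univ_parnt_tranx_comb_detl o
on t.ordr_code = o.parnt_ordr_code
and t.cart_prod_id = o.comb_prod_id
and o.ds = '2015-05-10'
-- after the first join, end_user_id is skewed toward NULL; replacing NULL with a random number spreads those rows across reducers (the case expression is the modified part, highlighted in red in the original post)
left outer join bic.cust_first_ordr_tranx f
on case when o.end_user_id is null then cast(rand(9)*100 as bigint) else o.end_user_id end = f.end_user_id
and f.first_ordr_date_id = '2015-05-10'
where t.ds = '2015-05-10';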

With these changes, the SQL completes within an acceptable time.
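One caveat with the random replacement: cast(rand(9)*100 as bigint) yields values from 0 to 99, so a NULL row would match any real end_user_id that falls in that range. A possible variant, sketched below, maps NULLs to negative numbers instead. It assumes every real end_user_id is a positive integer, which the post does not state, and the select list is purely illustrative because the original one is not shown.

```sql
-- Sketch only: assumes all real end_user_id values are positive integers.
-- The select list is hypothetical; the original statement's columns are not shown.
select t.ordr_code, t.cart_prod_id, f.end_user_id
from dw.fct_traffic_navpage_path_detl t
left outer join dw.univ_parnt_tranx_comb_detl o
  on t.ordr_code = o.parnt_ordr_code
 and t.cart_prod_id = o.comb_prod_id
 and o.ds = '2015-05-10'
left outer join bic.cust_first_ordr_tranx f
  -- NULL keys become -1 .. -100, which spreads them over reducers
  -- and cannot equal a positive end_user_id
  on case when o.end_user_id is null
          then -(cast(rand(9) * 100 as bigint) + 1)
          else o.end_user_id end = f.end_user_id
 and f.first_ordr_date_id = '2015-05-10'
where t.ds = '2015-05-10';
```

The NULL rows still come out of the left outer join with f columns set to NULL, as before; only their distribution across reducers changes.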

Tags: hive, sql, optimization, tez

Original article: http://tangjj.blog.51cto.com/1848040/1650926
