码迷,mamicode.com
首页 > 其他好文 > 详细

Hive架构层面优化之二合理利用中间结果集(单Job)

时间:2014-07-18 20:29:09      阅读:384      评论:0      收藏:0      [点我收藏+]

标签:style   blog   color   使用   strong   os   

是针对单个作业,针对本job再怎么优化也不会影响到其他job

 

Hadoop的负载主要有两部分:CPU负载和IO负载;

问题:机器io开销很大,但是机器的cpu开销较小,另外map输出文件也较大,怎么办?

解决办法:通过设置map的中间输出进行压缩就可以了,这个不会影响最终reduce的输出。

 

集群中的机器一旦选定了,那么CPU就没的改变了,所以集群的最主要的负载还是IO负载

压缩技术虽然可以降低IO负载,但是同时也加重了CPU负载,治标不治本,CPU加重了,整体性能还是上不去;如果当前CPU负荷不高的话,可以考虑使用压缩,但是如果CPU负荷本来就很高了,那就没必要再使用压缩了;随着Job的增多,IO资源通常是最大的瓶颈(文件读写方式造成瓶颈)

 

充分利用中间结果集的目的:降低IO负载

 

90%的job按天执行,每天执行一次;5%的job按小时执行,每小时执行一次;

 

案例:

drop table tmp_brand_pv_dtl ;
create table tmp_brand_pv_dtl as
          select ‘{$label}‘ tdate, a.brand_id, b.province_id, max(a.pv) pv, a.site_id
            from
          (select getBrandId(url) brand_id, session_id, getSiteId(url) site_id,
                  min(track_time) track_time, url_page_id page_id, count(1) pv
            from trackinfo
           where ds=‘{$label}‘
   and (button_postion is null or length(button_postion)=0 or button_postion = ‘null‘)
   and url_page_id=14
           group by getBrandId(url), 
                  session_id, 
                  url_page_id,
                  getSiteId(url)) a 
          join
          (select getBrandId(url) brand_id, session_id, track_time,
                 ext_field2 province_id, city_id, getSiteId(url) site_id
            from trackinfo
     where ds=‘{$label}‘
      and (button_postion is null or length(button_postion)=0 or button_postion = ‘null‘)
   and url_page_id = in(14,15)) b
          on (a.brand_id = b.brand_id and a.session_id = b.session_id and a.site_id = b.site_id and a.track_time = b.track_time)
          group by a.brand_id, 
                 b.province_id,a.site_id; 

 

分析得出:

trackinfo表抽取了两次,数据用完就丢了

 

如何优化:

1、  将抽取的两次的数据存放到一个临时表中,对于trackinfo表来说只查一次;

两个select字段的字段对比:

getBrandId(url) brand_id getBrandId(url) brand_id
session_id session_id
getSiteId(url) site_id getSiteId(url) site_id
min(track_time) track_time track_time
url_page_id page_id  
count(1) pv  
  ext_field2 province_id
  city_id

 

 

 

 

 

 

两个select的where对比

‘{$label}‘ and (button_postion is null or length(button_postion)=0 or button_postion = ‘null‘) and url_page_id=14

 
‘{$label}‘ and (button_postion is null or length(button_postion)=0 or button_postion = ‘null‘) and url_page_id = in(14,15)

 

drop table tmp_brand_pv_0 ;
create table tmp_brand_pv_0 as
select getBrandId(url) brand_id, session_id,  getSiteId(url) site_id,
     track_time, url_page_id , ext_field2, city_id
from trackInfo
where 
    ds=‘{$label}‘
    and (button_position is null or length(button_position)=0 or button_position = ‘null‘)
and url_page_id in (14,15)  ;

 

2、  在join的时候直接从临时表中获取数据即可,临时表做为两个子查询的数据源;

drop table tmp_brand_pv_dtl ;
create table tmp_brand_pv_dtl as
select ‘{$label}‘ tdate,a.brand_id,b.province_id,max(a.pv) pv,a.site_id     
from 
(select brand_id,session_id,site_id,min(track_time) track_time,url_page_id  page_id,count(1) pv
from tmp_brand_pv_0 where url_page_id=14
 group by brand_id,session_id,site_id,url_page_id) a 
join
select brand_id, session_id, track_time, ext_field2 province_id, city_id, site_id
from tmp_brand_pv_0
where url_page_id in(14,15)) b
on (a.brand_id = b.brand_id and a.session_id = b.session_id and a.site_id = b.site_id and a.track_time = b.track_time)
 group by a.brand_id, b.province_id,a.site_id;

Hive架构层面优化之二合理利用中间结果集(单Job),布布扣,bubuko.com

Hive架构层面优化之二合理利用中间结果集(单Job)

标签:style   blog   color   使用   strong   os   

原文地址:http://www.cnblogs.com/luogankun/p/3850388.html

(0)
(0)
   
举报
评论 一句话评论(0
登录后才能评论!
© 2014 mamicode.com 版权所有  联系我们:gaon5@hotmail.com
迷上了代码!