Hive语法层面优化之五分析执行计划追踪导致数据倾斜的原因

时间：2014-07-20 22:16:56 阅读：347 评论：0 收藏：0 [点我收藏+]

count(distinct key)案例

explain select count(distinct session_id) from trackinfo where ds=‘ 2013-07-21‘ ;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        trackinfo 
          TableScan
            alias: trackinfo
            Filter Operator
              predicate:
                  expr: (ds = ‘ 2013-07-21‘)
                  type: boolean
              Filter Operator
                predicate:
                    expr: (ds = ‘ 2013-07-21‘)
                    type: boolean
                Select Operator
                  expressions:
                        expr: session_id
                        type: string
                  outputColumnNames: session_id
                  Group By Operator
                    aggregations:
                          expr: count(DISTINCT session_id)
                    bucketGroup: true
                    keys:
                          expr: session_id
                          type: string
                    mode: hash
                    outputColumnNames: _col0, _col1
                    Reduce Output Operator
                      key expressions:
                            expr: _col0
                            type: string
                      sort order: +
                      tag: -1
                      value expressions:
                            expr: _col1
                            type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(DISTINCT KEY._col0:0._col0)
          bucketGroup: false
          mode: mergepartial
          outputColumnNames: _col0
          Select Operator
            expressions:
                  expr: _col0
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

group by案例

explain select max(session_id) from trackinfo where ds=‘2013-07-21‘ group by city_id;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        trackinfo 
          TableScan
            alias: trackinfo
            Filter Operator
              predicate:
                  expr: (ds = ‘2013-07-21‘)
                  type: boolean
              Select Operator
                expressions:
                      expr: city_id
                      type: string
                      expr: session_id
                      type: string
                outputColumnNames: city_id, session_id
                Group By Operator
                  aggregations:
                        expr: max(session_id)
                  bucketGroup: false
                  keys:
                        expr: city_id
                        type: string
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                    sort order: +
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                    tag: -1
                    value expressions:
                          expr: _col1
                          type: string
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: max(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col1
                  type: string
            outputColumnNames: _col0
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

count(distinct key)联合group by案例

explain select count(distinct session_id) from trackinfo where ds=‘2013-07-21‘ group by city_id;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        trackinfo 
          TableScan
            alias: trackinfo
            Filter Operator
              predicate:
                  expr: (ds = ‘2013-07-21‘)
                  type: boolean
              Select Operator
                expressions:
                      expr: city_id
                      type: string
                      expr: session_id
                      type: string
                outputColumnNames: city_id, session_id
                Group By Operator
                  aggregations:
                        expr: count(DISTINCT session_id)
                  bucketGroup: false
                  keys:
                        expr: city_id
                        type: string
                        expr: session_id
                        type: string
                  mode: hash
                  outputColumnNames: _col0, _col1, _col2
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                          expr: _col1
                          type: string
                    sort order: ++
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                    tag: -1
                    value expressions:
                          expr: _col2
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(DISTINCT KEY._col1:0._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

max(key)联合group by案例

explain select count(session_id) from trackinfo where ds=‘2013-07-21‘ group by city_id;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        trackinfo 
          TableScan
            alias: trackinfo
            Filter Operator
              predicate:
                  expr: (ds = ‘2013-07-21‘)
                  type: boolean
              Select Operator
                expressions:
                      expr: city_id
                      type: string
                      expr: session_id
                      type: string
                outputColumnNames: city_id, session_id
                Group By Operator
                  aggregations:
                        expr: count(session_id)
                  bucketGroup: false
                  keys:
                        expr: city_id
                        type: string
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                    sort order: +
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                    tag: -1
                    value expressions:
                          expr: _col1
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

执行计划总结

select count(distinct session_id) from trackinfo where ds=‘2013-11-01‘ ;

分发的是：session_id

select max(session_id) from trackinfo where ds=‘2013-11-01‘ group by city_id;

分发的是：city_id

select count(distinct session_id) from trackinfo where ds=‘2013-11-01‘ group by city_id;

分发的是：session_id和city_id

select count(session_id) from trackinfo where ds=‘2013-11-01‘ group by city_id;

分发的是：city_id

得出数据倾斜的结论：

join、group by、 count(distinct key)容易出现数据倾斜；

max、count等聚合函数并不会导致数据倾斜。

案例中的trackinfo建表语句

create table trackinfo (
id      bigint                          ,
url     string                          ,
referer string                          ,
keyword string                          ,
type    int                             ,
gu_id   string                          ,
page_id string                          ,
module_id       string                  ,
link_id string                          ,
attached_info   string                  ,
session_id      string                  ,
tracker_u       string                  ,
tracker_type    int                     ,
ip      string                          ,
tracker_src     string                  ,
cookie  string                          ,
order_code      string                  ,
track_time      string                  ,
end_user_id     bigint                  ,
first_link      string                  ,
session_view_no int                     ,
product_id      string                  ,
merchant_id     bigint                  ,
province_id     string                  ,
city_id string                          ,
fee     string                          ,
edm_activity    string                  ,
edm_email       string                  ,
edm_jobid       string                  ,
ie_version      string                  ,
platform        string                  ,
internal_keyword        string          ,
result_sum      string                  ,
currentpage     string                  ,
link_position   string                  ,
button_position string                  ,
ext_field1      string                  ,
ext_field2      string                  ,
ext_field3      string                  ,
ext_field4      string                  ,
ext_field5      string                  ,
adgroupkeywordid        string          ,
ext_field6      string                  ,
ext_field7      string                  ,
ext_field8      string                  ,
ext_field9      string                  ,
ext_field10     string                  ,
url_page_id     int                     ,
url_page_value  string                  ,
refer_page_id   int                     ,
refer_page_value        string ) partitioned by(ds      string);

Hive语法层面优化之五分析执行计划追踪导致数据倾斜的原因,布布扣,bubuko.com

Hive语法层面优化之五分析执行计划追踪导致数据倾斜的原因

标签：style blog color strong os 数据

原文地址：http://www.cnblogs.com/luogankun/p/3856581.html

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年07月29日 (22)
2021年07月28日 (40)
2021年07月27日 (32)
2021年07月26日 (79)
2021年07月23日 (29)
2021年07月22日 (30)
2021年07月21日 (42)
2021年07月20日 (16)
2021年07月19日 (90)
2021年07月16日 (35)

周排行