Apache Hive fails to collect stats

Date: 2015-05-10 01:08:54

Tags: hive; stats

Environment:

hive: apache-hive-1.1.0

hadoop: hadoop-2.5.0-cdh5.3.2

Hive metadata and stats are both stored in MySQL.

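The Hive stats parameters are as follows:

hive.stats.autogather: collects statistics automatically during INSERT OVERWRITE commands; default is true; set to true

hive.stats.dbclass: the database that stores Hive's temporary statistics; the default is jdbc:derby (Hive 0.7 to 0.12; fs from Hive 0.13 onward); set to jdbc:mysql

hive.stats.jdbcdriver: the JDBC driver for the database that stores the temporary Hive statistics; set to com.mysql.jdbc.Driver (the class name is case-sensitive)

hive.stats.dbconnectionstring: connection string for the temporary statistics database; default is jdbc:derby:databaseName=TempStatsStore;create=true; set to jdbc:mysql://[ip:port]/[dbname]?user=[username]&password=[password]

hive.stats.default.publisher: the class used as the default publisher when dbclass is neither jdbc nor hbase; it must implement the StatsPublisher interface; default is empty; left at the default

hive.stats.default.aggregator: the class used for aggregation when dbclass is neither jdbc nor hbase; it must implement the StatsAggregator interface; default is empty; left at the default

hive.stats.jdbc.timeout: JDBC connection timeout; default is 30 seconds; left at the default

hive.stats.retries.max: maximum number of retries when the stats publisher or aggregator hits an exception while updating the database; default is 0 (no retries); left at the default

hive.stats.retries.wait: wait window between retries; default is 3000 milliseconds; left at the default

hive.client.stats.publishers: comma-separated list of statistics publisher classes invoked on the counters of each job; default is empty; each class must implement the org.apache.hadoop.hive.ql.stats.ClientStatsPublisher interface; left at the default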

Symptom:

After running INSERT OVERWRITE TABLE, numRows and rawDataSize are not reported correctly; the result looks like this:

[numFiles=1, numRows=0, totalSize=59, rawDataSize=0]

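No stats rows were inserted into the Hive stats MySQL database either.

This pointed to a problem in Hive stats, but the console printed too little information to pinpoint it. I therefore started Hive with

hive --hiveconf hive.root.logger=INFO,console

to print detailed logs, which showed the following: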

[Error 30001]: StatsPublisher cannot be initialized. There was a error in the initialization
of StatsPublisher, and retrying might help. If you dont want the query to fail because accurate
statistics could not be collected, set hive.stats.reliable=false

Specified key was too long; max key length is 767 bytes

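This problem is fairly simple: in Hive 1.1.0 the ID column length defaults to 4000, and ID is also declared as the primary key, so the key exceeds MySQL's 767-byte limit and the table creation fails.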

org.apache.hadoop.hive.ql.stats.jdbc.JDBCStatsSetupConstants

  // MySQL - 65535, SQL Server - 8000, Oracle - 4000, Derby - 32762, Postgres - large.
  public static final int ID_COLUMN_VARCHAR_SIZE = 4000;

org.apache.hadoop.hive.ql.stats.jdbc.JDBCStatsPublisher: public boolean init(Configuration hconf)

              if (colSize < JDBCStatsSetupConstants.ID_COLUMN_VARCHAR_SIZE) {
                String alterTable = JDBCStatsUtils.getAlterIdColumn();
                stmt.executeUpdate(alterTable);
              }

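This code shows that if the table's ID column size is smaller than 4000, it is automatically altered to 4000. Shrinking the column in MySQL by hand therefore does not help, and the only option is to change the constant in the source from 4000 to 255. MySQL uses utf8 encoding here, where one character takes up to 3 bytes, so 255*3=765<767. A 255-character ID is also long enough for the current cluster.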

  public static final int ID_COLUMN_VARCHAR_SIZE = 255;
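The arithmetic behind the choice of 255 can be checked with a small sketch. It is not part of Hive: the class and method names are hypothetical, and it assumes the 767-byte limit quoted in the error message and at most 3 bytes per utf8 character, as stated above.

```java
// Sketch only: worst-case key size of a VARCHAR(n) primary key in MySQL.
// Assumptions: 767-byte key limit (from the error message) and
// 3 bytes per character (MySQL utf8). Names here are hypothetical.
final class KeyLengthCheck {
    static final int MAX_KEY_BYTES = 767;
    static final int UTF8_BYTES_PER_CHAR = 3;

    // Worst-case number of bytes the key column can occupy.
    static int worstCaseKeyBytes(int varcharSize) {
        return varcharSize * UTF8_BYTES_PER_CHAR;
    }

    // True when the column still fits under the key length limit.
    static boolean fitsInKey(int varcharSize) {
        return worstCaseKeyBytes(varcharSize) <= MAX_KEY_BYTES;
    }

    public static void main(String[] args) {
        for (int size : new int[] {4000, 255}) {
            System.out.println("VARCHAR(" + size + "): "
                + worstCaseKeyBytes(size) + " bytes, fits=" + fitsInKey(size));
        }
    }
}
```

Under these assumptions, 4000 characters need up to 12000 bytes and are rejected, while 255 is the largest size that still fits (256*3=768 is already over the limit).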

I rebuilt and packaged Hive and pushed it to the test environment, but testing showed that the problem was still there.

[numFiles=1, numRows=0, totalSize=59, rawDataSize=0]

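I ran Hive again with

hive --hiveconf hive.root.logger=INFO,console

to print detailed logs, but this time no exception showed up.

To track the problem down, I ran set hive.stats.reliable=true;

With that setting the same command failed. The job's error output showed that the problem was in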

org.apache.hadoop.hive.ql.stats.jdbc.JDBCStatsAggregator

    try {
      Class.forName(driver).newInstance();
    } catch (Exception e) {
      LOG.error("Error during instantiating JDBC driver " + driver + ". ", e);
      return false;
    }

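This code runs on YARN, where the class com.mysql.jdbc.Driver could not be found. I put the MySQL driver jar into the yarn/lib/ directory, pushed it to every node in the cluster, and re-ran the test script. The problem was solved.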
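A quick way to confirm this kind of failure before touching the cluster is to test whether the driver class can be loaded from a given classpath. The following is a minimal sketch, not Hive code; the class name DriverClasspathCheck and the method isLoadable are hypothetical.

```java
// Sketch only: report whether a JDBC driver class can be loaded
// from the classpath this program is started with.
final class DriverClasspathCheck {
    // Returns true if the named class can be loaded, false otherwise.
    static boolean isLoadable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Class missing, or present but failing to link or initialize.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the driver class named in this post.
        String driver = args.length > 0 ? args[0] : "com.mysql.jdbc.Driver";
        System.out.println(driver + " loadable: " + isLoadable(driver));
    }
}
```

One way to use it, assuming the compiled class sits in the current directory, is to run it on a worker node with the classpath that YARN reports, for example java -cp ".:$(yarn classpath)" DriverClasspathCheck. If it prints false there, the driver jar is not visible to jobs on that node.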

This article is from the "SuperMagi" blog; please keep this source link: http://supermagi.blog.51cto.com/10191319/1649905
