Hive Compression

Date: 2014-08-15 19:40:59

Compression settings:

Map/reduce output compression (output is usually stored as sequence files)

set hive.exec.compress.output=true;

set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

set mapred.output.compression.type=BLOCK;

Intermediate compression within a job

set hive.exec.compress.intermediate=true;

-- SnappyCodec is the commonly used choice here
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

set hive.intermediate.compression.type=BLOCK;

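1. Whether to use file compression

Hadoop jobs are more often limited by I/O than by CPU. When that is the case, compressing files improves performance, because it spends CPU cycles to reduce the number of bytes read from and written to disk and sent over the network. If a job is CPU-bound instead, compression may be a poor fit, because compressing and decompressing the data takes a significant amount of time. The most reliable way to find the best configuration for a given cluster is to run experiments and measure the results.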
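
One way to run such an experiment is to execute the same query twice, once with compression off and once with it on, then compare the "Time taken" that the Hive CLI reports after each statement and the size of the two outputs. The following is only a sketch: the table `web_logs`, its `status` column, and the `/tmp/compress_test` paths are hypothetical placeholders, and a query that runs as more than one map-reduce job (here an aggregation followed by a sort) is assumed so that intermediate compression has something to act on.

```sql
-- Sketch only: web_logs and the /tmp/compress_test paths are hypothetical.

-- Run 1: compression off
set hive.exec.compress.intermediate=false;
set hive.exec.compress.output=false;
INSERT OVERWRITE DIRECTORY '/tmp/compress_test/off'
SELECT status, COUNT(*) AS n FROM web_logs GROUP BY status ORDER BY n DESC;

-- Run 2: the same query with compression on
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
INSERT OVERWRITE DIRECTORY '/tmp/compress_test/on'
SELECT status, COUNT(*) AS n FROM web_logs GROUP BY status ORDER BY n DESC;

-- Compare the output sizes from inside the Hive CLI
dfs -du /tmp/compress_test;
```

A single pair of runs is noisy on a shared cluster, so repeating each run a few times gives a more trustworthy comparison.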

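2. Compression formats

GZip and BZip2 are supported by all recent Hadoop versions, and native Linux libraries for compressing and decompressing both formats are available.

Snappy is a more recent addition. On installations that do not already include it, it can be added separately.

LZO is another frequently used format.

GZip and BZip2 produce the smallest compressed files but take the most time. Snappy and LZO compress and decompress quickly but produce larger files. Which format to choose therefore depends on whether the workload is constrained by I/O or by CPU.

BZip2 files can be split. LZO files can be split once an index has been built for them.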
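
As a rough illustration of how these formats map onto settings, the sketch below lists the codec class for each one. The three `org.apache.hadoop.io.compress` classes ship with Hadoop. The LZO class comes from the separately installed hadoop-lzo package, so its availability is an assumption about the cluster. Only one codec is active at a time, so the alternatives are left commented out.

```sql
-- Print the codecs registered on this cluster (a Hadoop property)
set io.compression.codecs;

-- GZip: small files, slower, not splittable
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- BZip2: small files, slowest, splittable
-- set mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;

-- Snappy: larger files, fast (needs the Snappy native library)
-- set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- LZO: larger files, fast (assumes hadoop-lzo is installed)
-- set mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
```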

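3. Intermediate compression

Intermediate compression applies to the data passed between the map tasks and the reduce tasks of a job. For this data it is best to choose a codec with a low CPU cost.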

<property>

<name>hive.exec.compress.intermediate</name>

<value>true</value>

<description> This controls whether intermediate files produced by Hive between

multiple map-reduce jobs are compressed. The compression codec and other options

are determined from hadoop config variables mapred.output.compress* </description>

</property>

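Hadoop uses a default codec (DefaultCodec) unless told otherwise. To use a different one, change the mapred.map.output.compression.codec property, which can be set in either mapred-site.xml or hive-site.xml. SnappyCodec is a good choice because of its low CPU overhead.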
<property>

<name>mapred.map.output.compression.codec</name>

<value>org.apache.hadoop.io.compress.SnappyCodec</value>

<description>If the map outputs are compressed, how should they be compressed?</description>

</property>
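
The same two properties can also be overridden for a single session instead of editing the XML files. This is a minimal sketch, assuming the Snappy native library is available on the cluster nodes. Running `set` with a property name and no value prints the current setting, which is a quick way to confirm the override.

```sql
-- Session-level override; affects only the current script or CLI session
set hive.exec.compress.intermediate=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Print the effective values to confirm them
set hive.exec.compress.intermediate;
set mapred.map.output.compression.codec;
```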

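4. Final output compression

The final output of a job can be compressed as well. The hive.exec.compress.output property controls this. If compressed output is needed for only one job, set the property directly in the script rather than changing the configuration files.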

<property>

<name>hive.exec.compress.output</name>

<value>false</value>

<description> This controls whether the final outputs of a query

(to a local/hdfs file or a Hive table) is compressed. The compression

codec and other options are determined from hadoop config variables

mapred.output.compress* </description>

</property>

When hive.exec.compress.output is set to true, GZip is a reasonable codec choice: it compresses well and so reduces I/O. Keep in mind, however, that GZip-compressed files cannot be split.

<property>

<name>mapred.output.compression.codec</name>

<value>org.apache.hadoop.io.compress.GzipCodec</value>

<description>If the job outputs are compressed, how should they be compressed?

</description>

</property>
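
A per-script version of this setup might look like the sketch below, which turns on GZip output for one statement and then switches it off again. The tables `daily_report` and `events` and the column `event_date` are hypothetical names used only for illustration.

```sql
-- Hypothetical tables: daily_report (target) and events (source).

-- Enable GZip-compressed output for the next statement
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

INSERT OVERWRITE TABLE daily_report
SELECT event_date, COUNT(*) FROM events GROUP BY event_date;

-- Switch compression off again for later statements in the script
set hive.exec.compress.output=false;
```

If `daily_report` is a plain text table, Hive reads the resulting .gz files transparently, but each file is handled by a single map task because GZip cannot be split.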

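5. Sequence files

Sequence files let Hadoop split a file into blocks, and they remain splittable when compressed.

In Hive, a table is stored as a sequence file with a statement such as:

-- The column list is a placeholder; substitute the real schema.
CREATE TABLE a_sequence_file_table (line STRING) STORED AS SEQUENCEFILE;

Sequence files have three compression types: NONE, RECORD, and BLOCK.

RECORD is the default.

BLOCK compression is usually more effective and still allows the file to be split. Like the other properties discussed here, this one is not specific to Hive. It can be set in Hadoop's mapred-site.xml, in Hive's hive-site.xml, or per script or interactive query session.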

<property>

<name>mapred.output.compression.type</name>

<value>BLOCK</value>

<description>If the job outputs are to be compressed as SequenceFiles,

how should they be compressed? Should be one of NONE, RECORD or BLOCK.

</description>

</property>
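
Putting the pieces together, a session that writes a block-compressed sequence file table could be sketched as follows. The table names `page_views_seq` and `page_views_text` and their columns are hypothetical, and GZip is used only as an example codec.

```sql
-- Hypothetical tables: page_views_text (existing source), page_views_seq (new).

-- Compress final output, block by block, with GZip
set hive.exec.compress.output=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

CREATE TABLE page_views_seq (user_id STRING, url STRING)
STORED AS SEQUENCEFILE;

-- Rows written here are stored as a block-compressed sequence file
INSERT OVERWRITE TABLE page_views_seq
SELECT user_id, url FROM page_views_text;
```

Because compression is applied inside the sequence file container, the output stays splittable even though GZip on its own is not.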


Tags: hadoop hive mapreduce

Original post: http://blog.csdn.net/jyl1798/article/details/38589099
