Hive Study Notes: Data Operations

Date: 2015-08-21 21:34:01

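Hive data operations

Hive command-line options
hive -d, --define <key=value>   Define a key-value variable that can be used in Hive commands
hive --database <databasename>   Specify the database to use
hive -e "hql"   Run an HQL statement without entering the CLI; useful in scripts
hive -f fileName   Run the HQL statements stored in a file
hive -h hostname   Connect to a Hive server on a remote host, given the host's address
hive -H, --help   Print help information
hive --hiveconf <property=value>   Set the value of a configuration property for this session
hive --hivevar <key=value>   Define a substitution variable, referenced in HQL as ${hivevar:key}
hive -i filename   Initialization file; put session setup here, for example the add jar statements needed by user-defined functions
hive -S, --silent   Silent mode in the shell; suppresses informational output
hive -v, --verbose   Verbose mode; prints execution details such as the SQL statements being run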

Example: query the test table and write the printed result to /home/data/select_result.txt
hive -S -e "select * from test" > /home/data/select_result.txt
Same, with the executed SQL statement included in the output
hive -v -e "select * from test" > /home/data/select_result.txt
Run SQL stored in a file
hive -f /home/colin/hive-1.2.1/select_test
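The options above are mostly useful from scripts. The following is a minimal sketch, not a definitive recipe: the file names select_test.hql and select_result.txt are hypothetical stand-ins for the paths used in the notes, and the script only calls hive when it is found on PATH.

```shell
#!/bin/sh
# Hypothetical file names, kept in the current directory for illustration.
QUERY_FILE=select_test.hql
RESULT_FILE=select_result.txt

# Put the statement into a file so it can be run with -f.
printf '%s\n' 'select * from test;' > "$QUERY_FILE"

if command -v hive >/dev/null 2>&1; then
  # -S keeps informational messages out of the redirected result.
  if hive -S -f "$QUERY_FILE" > "$RESULT_FILE"; then
    echo "wrote $(grep -c '' "$RESULT_FILE") rows to $RESULT_FILE"
  else
    echo "query failed" >&2
  fi
else
  echo "hive is not on PATH; only $QUERY_FILE was written" >&2
fi
```

Checking the exit status of hive matters in scripts, because the redirect creates the result file even when the query fails.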

In the Hive CLI, list shows the file|jar|archive resources in the distributed cache (for example, jars added with add jar can be viewed with list jar).
In the Hive CLI, source runs a file at a given path, for example a SQL file: source /home/colin/hive-1.2.1/select_test

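Hive variables
Configuration variables
set val_Name=val_Value;
${hiveconf:val_Name}
Linux environment variables
${env:VARIABLE_NAME}; run env in the Linux shell to list all environment variables
Example: define the variable val_test, set it to yang, and use it as a query condition
set val_test=yang;
select * from test2 where name='${hiveconf:val_test}';
Show the HIVE_HOME environment variable
select '${env:HIVE_HOME}' from test;
Note: the path is printed once for every row in the test table.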
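The same variables can be supplied from the command line instead of with set. The sketch below assumes the test and test2 tables from the notes exist; the file name select_by_name.hql is hypothetical. The heredoc is quoted so that the shell leaves ${hiveconf:...} and ${env:...} for Hive to substitute.

```shell
#!/bin/sh
# Quoted heredoc ('EOF'): the shell must not expand ${...} itself.
cat > select_by_name.hql <<'EOF'
select * from test2 where name = '${hiveconf:val_test}';
-- limit 1 avoids printing the path once per row of test.
select '${env:HIVE_HOME}' from test limit 1;
EOF

if command -v hive >/dev/null 2>&1; then
  # --hiveconf plays the role of "set val_test=yang;" inside the CLI.
  hive -S --hiveconf val_test=yang -f select_by_name.hql
else
  echo "hive is not on PATH; only select_by_name.hql was written" >&2
fi
```

Passing the query with -e inside double quotes would let the shell try to expand ${hiveconf:val_test} first, which is why the statements live in a file here.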

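Loading data into Hive
Loading data into internal (managed) tables
    Load when creating the table
    create table tableName as select col_1,col_2... from tableName2;
    Specify the data location when creating the table
    create table tableName(col_name type comment ...) location 'path';
    (path is an HDFS path and must be the directory that contains the data files, not a file itself; every file under that directory becomes data of the table. With this approach no directory is created for the table under hive/warehouse, because the given HDFS path is used as the table directory.)
       Note: for an internal table this hands ownership of the data to the table. Dropping the table also deletes the data, together with that directory.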
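    Load local data
    load data local inpath 'localpath' [overwrite] into table tableName;
    Load data from HDFS
    load data inpath 'hdfspath' [overwrite] into table tableName;
         Note: this moves the data from the given HDFS location into the table's directory; it does not copy it.
    Copy data to the target location with Hadoop commands (from the Hive shell or from the Linux shell)
    hdfs dfs -copyFromLocal /home/data /data
    In the Hive shell: dfs -copyFromLocal /home/data data (Hadoop dfs commands can be run directly in Hive; Hive can also run Linux commands, which must be prefixed with !)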

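    Load data from a query
    insert [overwrite|into] table tableName select col1,col2... from tableName2 where ...
    from tableName2 insert [overwrite|into] table tableName select col1,col2... where ...
    select col1,col2.. from tableName2 where ... insert [overwrite|into] table tableName; (this select-first form appears in the original notes but is not part of Hive's documented insert syntax; prefer the two forms above)
    Note: the selected column names may differ from those of the target table, because columns are matched by position. Hive does not validate field names or types when data is loaded; checks happen only when the data is queried.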
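The loading paths above can be combined in one small script. This is a sketch under assumptions: the table names people_demo and people_copy, the file people.tsv and its two rows are all made up for illustration, and hive is only invoked when it is on PATH.

```shell
#!/bin/sh
# Hypothetical sample data: two tab-separated rows.
printf 'yang\t2015-08-20\ncolin\t2015-08-21\n' > people.tsv

# Unquoted heredoc so $PWD expands to an absolute local path;
# '\t' stays literal and is interpreted by Hive.
cat > load_people.hql <<EOF
create table if not exists people_demo (name string, dt string)
  row format delimited fields terminated by '\t';
load data local inpath '$PWD/people.tsv' overwrite into table people_demo;
create table if not exists people_copy as select name, dt from people_demo;
insert into table people_copy select name, dt from people_demo where name = 'yang';
EOF

if command -v hive >/dev/null 2>&1; then
  hive -S -f load_people.hql
else
  echo "hive is not on PATH; only load_people.hql was written" >&2
fi
```

Using local keeps the source file in place, whereas load data inpath on an HDFS path would move it into the table directory.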
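Loading data into external tables
    Specify the data location when creating the table (an external table does not own its data)
    create external table tableName (col_name type comment ...) location 'path';
    With insert statements, the same as for internal tables
    With Hadoop commands, the same as for internal tables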
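Loading data into partitioned tables
    Internal partitioned tables are loaded like internal tables
    External partitioned tables are loaded like external tables
The difference is that a partition must be specified. For an external partitioned table, the directory layout of the data must match the table's partitions. If a partition has not been added to the table, its data cannot be queried even when the directory contains files. Once the directory structure matches, add the partition: alter table tableName add partition (dt=20150820);
    load data local inpath 'path' [overwrite] into table tableName partition(pName='..');
    insert [overwrite|into] table tableName partition(pName='..') select col1,col2.. from tableName2 where ...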
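When an external partitioned table already has data laid out as one directory per partition, the add partition statements can be generated rather than typed. A sketch, assuming a hypothetical table logs_ext whose data lives under the hypothetical HDFS directory /data/logs_ext in dt=... subdirectories:

```shell
#!/bin/sh
# Hypothetical table name and HDFS base directory.
TABLE=logs_ext
BASE=/data/logs_ext

: > add_partitions.hql
for dt in 20150820 20150821; do
  # One statement per partition directory, e.g. /data/logs_ext/dt=20150820
  printf "alter table %s add if not exists partition (dt='%s') location '%s/dt=%s';\n" \
    "$TABLE" "$dt" "$BASE" "$dt" >> add_partitions.hql
done

cat add_partitions.hql

if command -v hive >/dev/null 2>&1; then
  hive -S -f add_partitions.hql
fi
```

The if not exists clause makes the script safe to rerun after new date directories appear.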

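Notes: if row format specifies a multi-character delimiter, only the first character takes effect
      when loaded data is read, a field whose value cannot be converted to the column type comes back as NULL
      likewise, when inserting, a selected value that cannot be converted to the target column type is inserted as NULL
      in the data files on HDFS, NULL is written as \N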
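Because NULL is written as \N in text data files, a local file can be prepared so that missing values are read as NULL rather than as empty strings. A minimal sketch with made-up file names (raw.tsv, loadable.tsv) and made-up rows, assuming tab-separated fields:

```shell
#!/bin/sh
# Hypothetical input: the first row lacks the second field,
# the second row lacks the first field.
printf 'yang\t\n\t2015-08-21\n' > raw.tsv

# Replace every empty tab-separated field with the two characters \N.
awk 'BEGIN { FS = OFS = "\t" }
     { for (i = 1; i <= NF; i++) if ($i == "") $i = "\\N"; print }' \
  raw.tsv > loadable.tsv

cat loadable.tsv
```

The resulting file can then be loaded with load data local inpath as described above.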

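Exporting data from Hive
Export methods:
   Hadoop commands
          get
          text
   insert ... directory
       insert overwrite [local] directory 'path' [row format delimited fields terminated by '\t' lines terminated by '\n'] select col1,col2.. from tableName
            Note: with local the data is exported to the local file system, otherwise to HDFS. row format used to apply only to local exports (in Hive 1.2.1 it can also be used for exports to HDFS).
   Shell commands with pipes
   Third-party tools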
Examples:
hdfs dfs -get /user/hive/warehouse/test4/* ./data/newdata
hdfs dfs -text /user/hive/warehouse/test5/* > ./data/newdata (-text can output several formats as text, such as compressed files and sequence files)
hive -S -e "select * from test4" | grep yang > ./data/newdata
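When no row format is given, files written by insert overwrite directory use Ctrl-A (\001) as the field delimiter, which is awkward for other tools. The sketch below converts such a file to tab-separated text; the file name exported_000000_0 and its single row are stand-ins created here so that the example runs without a cluster.

```shell
#!/bin/sh
# Simulate one exported row with the default \001 field delimiter.
SOH=$(printf '\001')
printf 'yang%s2015-08-21\n' "$SOH" > exported_000000_0

# Turn every \001 into a tab.
tr '\001' '\t' < exported_000000_0 > newdata.tsv

cat newdata.tsv
```

On Hive versions where row format is accepted for the export target, specifying fields terminated by '\t' in the insert statement makes this step unnecessary.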

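Hive dynamic partitions
The partition values are not known in advance and are taken from the query result, so there is no need to add each partition with alter table.
Parameters to configure for dynamic partitioning:
-- enable dynamic partitioning
set hive.exec.dynamic.partition=true;
-- two modes: strict requires at least one static partition, placed first; nonstrict has no such restriction
set hive.exec.dynamic.partition.mode=nonstrict;
-- maximum number of dynamic partitions created per node
set hive.exec.max.dynamic.partitions.pernode=10000;
-- maximum number of dynamic partitions created in total
set hive.exec.max.dynamic.partitions=100000;
-- maximum number of files one job may create
set hive.exec.max.created.files=150000;
-- limits how many files can be open at once (xcievers is the property's actual spelling)
set dfs.datanode.max.xcievers=8192;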

insert overwrite table test7 partition(dt) select name,time as dt from test6;
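Put together, a dynamic-partition load can be kept in one HQL file and run non-interactively. This sketch reuses the test6 and test7 tables from the notes and assumes test7 is already partitioned by dt; the file name dynamic_partition.hql is hypothetical.

```shell
#!/bin/sh
cat > dynamic_partition.hql <<'EOF'
-- Session settings needed for a fully dynamic partition insert.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- The partition column (dt) must be the last column in the select list.
insert overwrite table test7 partition (dt)
select name, time as dt from test6;
-- List the partitions that were created.
show partitions test7;
EOF

if command -v hive >/dev/null 2>&1; then
  hive -S -f dynamic_partition.hql
else
  echo "hive is not on PATH; only dynamic_partition.hql was written" >&2
fi
```

Hive maps the trailing select columns to the partition columns by position, so the alias dt is for readability only.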

Table property operations
Rename a table:
alter table tableName rename to newName;
Rename a column:
alter table tableName change column old_col new_col newType comment '....' after colName; (to make it the first column, replace after colName with first)
Add columns:
alter table tableName add columns (c1 type comment '..', c2 type comment '...');
Modify table properties
View the table properties
desc formatted tableName;
These are the table properties that can be modified:
Table Parameters:
   COLUMN_STATS_ACCURATE   false
   last_modified_by       colin
   last_modified_time     1440154819
   numFiles               0
   numRows                -1
   rawDataSize            -1
   totalSize              0
   transient_lastDdlTime   1440154819
Modify a property:
alter table tableName set tblproperties('propertyName'='.....');
For example, change the comment
alter table tableName set tblproperties("comment"="xxxxx");
Modify serialization (SerDe) properties:
Table without partitions
alter table tableName set serdeproperties('field.delim'='\t');
Table with partitions
alter table tableName partition(dt='xxxx') set serdeproperties('field.delim'='\t');
Modify the location:
alter table tableName [partition(..)] set location 'path';
Convert between internal and external tables:

alter table tableName set tblproperties ('EXTERNAL'='TRUE|FALSE'); EXTERNAL must be uppercase

For more property operations see: https://cwiki.apache.org/confluence/display/Hive/Home
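A property change is easiest to trust when it is checked right away with desc formatted. The sketch below converts the test table from the notes to an external table and prints the Table Type line; the file name to_external.hql is hypothetical, and the check only runs when hive is on PATH.

```shell
#!/bin/sh
cat > to_external.hql <<'EOF'
-- EXTERNAL must be uppercase.
alter table test set tblproperties ('EXTERNAL'='TRUE');
desc formatted test;
EOF

if command -v hive >/dev/null 2>&1; then
  # Keep only the line that reports the table type.
  hive -S -f to_external.hql | grep 'Table Type' \
    || echo "Table Type line not found" >&2
else
  echo "hive is not on PATH; only to_external.hql was written" >&2
fi
```

After the conversion, dropping the table removes only the metadata and leaves the data files in place.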

Tags: hadoop hive data operations

Original article: http://blog.csdn.net/colin_yjz/article/details/47839531
