Hive Notes: Querying Data

Date: 2016-05-06 15:27:42

1. Sorting and Aggregating

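Hive's order by produces a fully sorted result, as expected, but it does so with a single reduce task, which is impractical for large datasets.

For that reason Hive also provides sort by, which produces one sorted file per reducer.

distribute by controls which reducer a particular row is sent to, usually so that a later aggregation can be performed.

For example: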

from record2

select year,temperature

distribute by year

sort by year asc,temperature desc;

When sort by and distribute by use the same columns, cluster by can be used as shorthand for both.
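The effect of distribute by followed by sort by can be sketched outside Hive. The Python below is only an illustration: the sample rows are hypothetical, and a simple modulo on the year stands in for Hive's real hash partitioning, which is an assumption made here for readability.

```python
from collections import defaultdict


def distribute_and_sort(rows, num_reducers):
    """Simulate: distribute by year sort by year asc, temperature desc."""
    partitions = defaultdict(list)
    for year, temperature in rows:
        # Assumption: modulo stands in for Hive's hash partitioning.
        partitions[year % num_reducers].append((year, temperature))
    # Each reducer sorts only its own rows, so there is no global order.
    return {
        reducer: sorted(part, key=lambda row: (row[0], -row[1]))
        for reducer, part in partitions.items()
    }


# Hypothetical sample rows of (year, temperature).
rows = [(1950, 0), (1949, 111), (1950, 22), (1949, 78), (1950, -11)]
result = distribute_and_sort(rows, 2)
for reducer in sorted(result):
    print(reducer, result[reducer])
```

All rows for one year land in the same partition, and each partition is sorted independently, which is what makes a per-year aggregation possible afterwards.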

2. MapReduce Scripts

The transform, map, and reduce clauses make it possible to invoke external scripts from Hive.

add file /user/tom/book-workspace/hadoop-book/ch12/src/main/python/is_good_quality.py;

from record2

select transform(year,temperature,quality)

using 'is_good_quality.py'

as year,temperature;

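The query passes the year, temperature, and quality fields to the script as tab-separated lines, then parses the script's tab-separated output into the year and temperature fields that form the query result.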
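The notes do not show is_good_quality.py itself, so the following is only a guess at what such a filter might look like. It assumes the same validity rule used in the queries of section 4 (temperature not 9999, quality code one of 0, 1, 4, 5, 9); the function names are hypothetical.

```python
import re
import sys


def is_good(temperature, quality):
    # Assumption: 9999 marks a missing reading and the quality codes
    # 0, 1, 4, 5, 9 are acceptable, mirroring the where clause in section 4.
    return temperature != "9999" and re.fullmatch("[01459]", quality) is not None


def filter_lines(lines):
    """Yield 'year<TAB>temperature' for every good tab-separated input line."""
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) != 3:
            continue  # skip malformed lines
        year, temperature, quality = fields
        if is_good(temperature, quality):
            yield f"{year}\t{temperature}"


def main():
    # When Hive runs the file as a script, rows arrive on standard input.
    for out in filter_lines(sys.stdin):
        print(out)


# Small in-memory demonstration with hypothetical rows.
sample = ["1950\t0\t1\n", "1950\t9999\t1\n", "1949\t111\t2\n"]
filtered = list(filter_lines(sample))
print(filtered)
```

To use this as the actual script, main() would be called at the bottom of the file instead of the in-memory demonstration.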

from (

from record2

map year,temperature,quality

using 'is_good_quality.py'

as year,temperature) map_output

reduce year,temperature

using 'max_temperature_reduce.py'

as year,temperature;

In this example, replacing map and reduce with select transform would have the same effect.
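The reduce script is not shown in the notes either. A minimal sketch of what max_temperature_reduce.py could do is given below; the function names are hypothetical, and it assumes that every row for a given year reaches the same script instance.

```python
import sys


def max_by_year(lines):
    """Return 'year<TAB>max temperature' lines from tab-separated input."""
    maxima = {}
    for line in lines:
        year, temperature = line.strip().split("\t")
        temperature = int(temperature)
        # Keep the largest temperature seen so far for this year.
        if year not in maxima or temperature > maxima[year]:
            maxima[year] = temperature
    return [f"{year}\t{maxima[year]}" for year in sorted(maxima)]


def main():
    # When Hive runs the file as a script, rows arrive on standard input.
    for out in max_by_year(sys.stdin):
        print(out)


# Small in-memory demonstration with hypothetical rows.
sample = ["1950\t0\n", "1949\t111\n", "1950\t22\n", "1949\t78\n", "1950\t-11\n"]
maxima_lines = max_by_year(sample)
print(maxima_lines)
```

Keeping the running maxima in a dictionary means the sketch does not depend on the input being sorted by year, at the cost of holding one entry per year in memory.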

3. Joins

Inner joins

select sales.*,things.* from sales join things on (sales.id=things.id);

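Older versions of Hive (before 0.13) allow only one table in the from clause.

The implicit join form is therefore not supported there: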

select sales.*,things.* from sales,things where sales.id=things.id;

explain

select sales.*,things.* from sales join things on (sales.id=things.id);

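Running the query under explain shows how many MapReduce jobs Hive will use for it.

The order of the tables in the join clause matters: it is generally best to put the largest table last.

Outer joins

Left outer join: returns every row of the left table, including rows that have no match (the right table's columns are then NULL).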

select sales.*,things.* from sales left outer join things on (sales.id=things.id);

Right outer join: returns every row of the right table, including rows that have no match.

select sales.*,things.* from sales right outer join things on (sales.id=things.id);

Full outer join: returns every row of both tables.

select sales.*,things.* from sales full outer join things on (sales.id=things.id);
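The difference between the join types is easiest to see on a tiny data set. The Python sketch below imitates the four queries above in memory; the sample rows and the join helper are hypothetical, and the row order is not meant to match what Hive would print.

```python
def join(sales, things, how="inner"):
    """sales rows are (name, id); things rows are (id, name)."""
    rows = []
    matched_things = set()
    for sale in sales:
        matches = [thing for thing in things if thing[0] == sale[1]]
        for thing in matches:
            rows.append(sale + thing)
            matched_things.add(thing)
        # Left and full joins keep sales rows that found no partner.
        if not matches and how in ("left", "full"):
            rows.append(sale + (None, None))
    # Right and full joins keep things rows that found no partner.
    if how in ("right", "full"):
        for thing in things:
            if thing not in matched_things:
                rows.append((None, None) + thing)
    return rows


# Hypothetical sample tables.
sales = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]
things = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]

inner_rows = join(sales, things, "inner")
left_rows = join(sales, things, "left")
right_rows = join(sales, things, "right")
full_rows = join(sales, things, "full")
for name, result in [("inner", inner_rows), ("left", left_rows),
                     ("right", right_rows), ("full", full_rows)]:
    print(name, len(result))
```

Here Ali (id 0) has no matching thing and Scarf (id 1) has no matching sale, so they appear only in the outer joins that preserve their side.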

Semi joins

Older versions of Hive (before 0.13) do not support in subqueries such as:

select * from things where things.id in (select id from sales);

It can be rewritten as:

select * from things left semi join sales on (sales.id=things.id);

In a left semi join, the right table (sales) may appear only in the on clause; for example, it cannot be referenced in the select expression.
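The semantics of the semi join can be sketched in a few lines of Python. The sample rows and the function name are hypothetical; the point is that each things row is returned at most once, however many sales rows match it, and that no sales column reaches the output.

```python
def left_semi_join(things, sales):
    """things rows are (id, name); sales rows are (name, id)."""
    # Only the set of sales ids matters, not how often each id occurs.
    sales_ids = {sale_id for _, sale_id in sales}
    return [thing for thing in things if thing[0] in sales_ids]


# Hypothetical sample tables; id 2 occurs twice in sales.
things = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
sales = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]

semi_rows = left_semi_join(things, sales)
print(semi_rows)
```

An inner join on the same data would return the Tie row twice, which is why it is not a drop-in replacement for the in subquery.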

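Map joins

If one table is small enough to fit in memory, Hive can load it into the memory of every mapper and perform the join there.

To request a map join, add a hint written as a C-style comment to the SQL: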

select /*+ mapjoin(things) */ sales.*,things.* from sales join things on (sales.id=things.id);

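This query runs without reducers, so the technique does not work for right or full outer joins: detecting which rows have no match requires an aggregating (reduce) step over all of the input.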
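The idea behind a map join can be illustrated with a plain hash lookup. This is a sketch under simplifying assumptions, not Hive's implementation: the small table is turned into an in-memory dictionary, the large table is streamed past it row by row, and the sample data and function name are hypothetical.

```python
from collections import defaultdict


def map_join(sales_stream, things):
    """Join streamed sales rows (name, id) against in-memory things rows (id, name)."""
    # Build the lookup table once; this is the part that must fit in memory.
    lookup = defaultdict(list)
    for thing in things:
        lookup[thing[0]].append(thing)
    # Each streamed row is joined on its own, with no reduce step.
    for sale in sales_stream:
        for thing in lookup.get(sale[1], []):
            yield sale + thing


# Hypothetical sample tables.
sales = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]
things = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]

joined = list(map_join(iter(sales), things))
print(joined)
```

Because every mapper sees only its own slice of sales, none of them can tell that the Scarf row was never matched anywhere, which is the reason given above for ruling out right and full outer joins.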

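Map joins can take advantage of bucketed tables, because a mapper working on one bucket of the left table only needs to load the corresponding buckets of the right table to perform the join. The syntax is the same as for the in-memory join above.

The optimization has to be enabled first:

set hive.optimize.bucketmapjoin=true;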

4. Subqueries

Older versions of Hive (before 0.13) allow subqueries only in the from clause:

select station,year,AVG(max_temperature)

from(

select station,year,MAX(temperature) as max_temperature

from record2

where temperature !=9999

and(quality=0 or quality=1 or quality=4 or quality=5 or quality=9)

group by station,year

) mt

group by station,year;

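The subquery's result must be given an alias (mt here), and its computed column needs an alias too (max_temperature).

5. Views

Views in Hive are read-only, so you cannot load or insert data into the base tables through a view.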

create view valid_records as

select *

from record2

where temperature !=9999

and(quality=0 or quality=1 or quality=4 or quality=5 or quality=9);

create view max_temperatures(station,year,max_temperature) as

select station,year,MAX(temperature) from valid_records

group by station,year;

select station,year,AVG(max_temperature)

from max_temperatures

group by station,year;

These views achieve the same result as the subquery in section 4, and Hive uses the same number of MapReduce jobs.

Original source: http://blog.csdn.net/iwantknowwhat/article/details/51325821
