The examples below use two tables, sales and things, with the following contents:

    hive> SELECT * FROM sales;
    Joe     2
    Hank    4
    Ali     0
    Eve     3
    Hank    2
    hive> SELECT * FROM things;
    2       Tie
    4       Coat
    3       Hat
    1       Scarf
1. Inner joins
    hive> SELECT sales.*, things.*
        > FROM sales JOIN things ON (sales.id = things.id);
    Joe     2       2       Tie
    Hank    2       2       Tie
    Eve     3       3       Hat
    Hank    4       4       Coat
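In set-theoretic terms, an inner join corresponds to the intersection of the two tables on the join key.
In other words, a row appears in the final result only if it has a match in both tables.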
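To make the intersection idea concrete, here is a minimal Python sketch that mimics the inner join on the sample rows. It only illustrates the semantics and is not how Hive executes the query; the function name `inner_join` is made up for this example.

```python
# Sample rows from the article: sales(name, id) and things(id, item).
sales = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]
things = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]


def inner_join(left, right):
    """Pair each sales row with every things row that has the same id."""
    return [
        (name, sid, tid, item)
        for name, sid in left
        for tid, item in right
        if sid == tid
    ]


rows = inner_join(sales, things)

# The ids that survive are exactly the intersection of both id sets.
matched_ids = {sid for _, sid in sales} & {tid for tid, _ in things}

for row in rows:
    print(*row)
print(sorted(matched_ids))  # [2, 3, 4]
```

Ali (id 0) and Scarf (id 1) drop out because their ids fall outside the intersection.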
2. LEFT OUTER JOIN
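    hive> SELECT sales.*, things.*
        > FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
    Ali     0       NULL    NULL
    Joe     2       2       Tie
    Hank    2       2       Tie
    Eve     3       3       Hat
    Hank    4       4       Coat

In set-theoretic terms, the result contains every row of sales (the left table); where things has no matching id, its columns are filled with NULL.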
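3. RIGHT OUTER JOIN

    hive> SELECT sales.*, things.*
        > FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);
    NULL    NULL    1       Scarf
    Joe     2       2       Tie
    Hank    2       2       Tie
    Eve     3       3       Hat
    Hank    4       4       Coat

In set-theoretic terms, the result contains every row of things (the right table); where sales has no matching id, its columns are filled with NULL.
A RIGHT OUTER JOIN is simply a LEFT OUTER JOIN with the roles of the two tables swapped.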
4. FULL OUTER JOIN
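    hive> SELECT sales.*, things.*
        > FROM sales FULL OUTER JOIN things ON (sales.id = things.id);
    Ali     0       NULL    NULL
    NULL    NULL    1       Scarf
    Joe     2       2       Tie
    Hank    2       2       Tie
    Eve     3       3       Hat
    Hank    4       4       Coat

In set-theoretic terms, the result contains every row of both tables, matched where possible and padded with NULL otherwise.
A FULL OUTER JOIN can be understood as the LEFT OUTER JOIN result combined with the RIGHT OUTER JOIN result, with rows that match on both sides appearing only once.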
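The relationship between the three outer joins can be checked with a small Python sketch on the sample rows. This is an illustration of the semantics only, assuming no NULL ids in either table; the helper names are invented for this example, and `None` stands in for NULL.

```python
# Sample rows from the article: sales(name, id) and things(id, item).
sales = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]
things = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]


def left_outer(left, right):
    """Keep every sales row; pad with None when things has no match."""
    rows = []
    for name, sid in left:
        matches = [(tid, item) for tid, item in right if tid == sid]
        if matches:
            rows.extend((name, sid, tid, item) for tid, item in matches)
        else:
            rows.append((name, sid, None, None))
    return rows


def right_outer(left, right):
    """Keep every things row; pad with None when sales has no match."""
    rows = []
    for tid, item in right:
        matches = [(name, sid) for name, sid in left if sid == tid]
        if matches:
            rows.extend((name, sid, tid, item) for name, sid in matches)
        else:
            rows.append((None, None, tid, item))
    return rows


def full_outer(left, right):
    """Left outer rows plus the right-side rows that found no partner."""
    unmatched_right = [r for r in right_outer(left, right) if r[1] is None]
    return left_outer(left, right) + unmatched_right


left_rows = left_outer(sales, things)
right_rows = right_outer(sales, things)
full_rows = full_outer(sales, things)

print(len(left_rows), len(right_rows), len(full_rows))  # 5 5 6
```

On this data the full outer join equals the set union of the left and right results: the four matched rows are shared, and Ali and Scarf each contribute one padded row.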
5. LEFT SEMI JOIN
Older versions of Hive (before 0.13) don’t support IN subqueries, but you can use a LEFT SEMI JOIN to do the same thing. Consider this IN subquery, which finds all the items in the things table that are in the sales table:

    SELECT *
    FROM things
    WHERE things.id IN (SELECT id FROM sales);
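We can rewrite it as follows:

    hive> SELECT *
        > FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
    2       Tie
    3       Hat
    4       Coat

By the definition of LEFT SEMI JOIN, a row of things appears in the final result only if its id occurs in sales. Because the semantics are those of IN, the ids coming from IN (SELECT id FROM sales) are effectively deduplicated: id 2 occurs twice in sales, yet Tie is returned only once.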
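The difference from an inner join shows up exactly on the duplicated id. The following Python sketch (illustrative only; `left_semi_join` is a made-up helper) treats the right table as a set of keys, which is why each left row is emitted at most once.

```python
# Sample rows from the article: sales(name, id) and things(id, item).
sales = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]
things = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]


def left_semi_join(left, right_keys):
    """Return each left row once if its id occurs among the right keys."""
    keys = set(right_keys)  # duplicates on the right side collapse here
    return [row for row in left if row[0] in keys]


semi = left_semi_join(things, [sid for _, sid in sales])

# For comparison: an inner join repeats a things row per matching sales row.
inner = [
    (tid, item)
    for tid, item in things
    for _, sid in sales
    if sid == tid
]

print(semi)        # [(2, 'Tie'), (4, 'Coat'), (3, 'Hat')]
print(len(inner))  # 4, because (2, 'Tie') appears twice
```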
6. Map joins
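    hive> SELECT /*+ MAPJOIN(things) */ sales.*, things.*
        > FROM sales JOIN things ON (sales.id = things.id);
    Joe     2       2       Tie
    Hank    4       4       Coat
    Eve     3       3       Hat
    Hank    2       2       Tie

A map join is primarily an optimization: by avoiding the reduce phase, it shortens the job's running time. In practice, if one of the tables is only a few tens of megabytes in size, the join can usually be sped up with a map join.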
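A map join works roughly as follows. The small table is shipped to the cluster through the distributed cache. Each mapper loads the small table into its own memory, in a hash-table-like structure keyed on the join key. For every record it reads from the large table, the mapper checks whether that record's key is in the hash table and, if so, emits the joined row. This avoids the reduce phase altogether.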
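The mechanism described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Hive's implementation: the split of sales across two mappers is hypothetical, and `build_hash_table` and `map_side_join` are names invented for this example.

```python
from collections import defaultdict

# Sample rows from the article: sales is treated as the "big" table,
# things as the small table that fits in memory.
sales = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]
things = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]


def build_hash_table(small_table):
    """Index the small table by join key (done once per mapper)."""
    table = defaultdict(list)
    for tid, item in small_table:
        table[tid].append((tid, item))
    return table


def map_side_join(split, hash_table):
    """Join one input split of the big table without any reduce step."""
    for name, sid in split:
        # Probe the in-memory table; rows with no match are dropped (inner join).
        for tid, item in hash_table.get(sid, []):
            yield (name, sid, tid, item)


# Hypothetical split of the big table across two mappers.
splits = [sales[:3], sales[3:]]

joined = []
for split in splits:
    local_table = build_hash_table(things)  # each mapper holds its own copy
    joined.extend(map_side_join(split, local_table))

for row in joined:
    print(*row)
```

Because every mapper can answer the lookup locally, no records need to be shuffled by key to a reducer, which is where the speedup comes from. The price is that the small table must fit in each mapper's memory.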
Here is a more complex Hive example:

    SELECT * FROM (
        SELECT
            uid, sessionid, stepid, time, position, source, action, request, response, cellphone, other
        FROM log_route_car_filtered
        WHERE dt='20140517'
        UNION ALL
        SELECT
            client.uid, client.sessionid, client.stepid, client.time, client.position, client.source, client.action,
            client.request, client.response, client.cellphone, client.other
        FROM log_client_filtered client
        WHERE (dt='20140517' AND other['date']='20140517')
           OR (dt='20140518' AND other['date']='20140517')
        LEFT SEMI JOIN (
            SELECT DISTINCT uid, sessionid
            FROM log_route_car_filtered
            WHERE dt='20140517'
        ) unique_route
        ON (client.uid=unique_route.uid AND client.sessionid=unique_route.sessionid)
    ) merge
    WHERE uid RLIKE '^[\\w-]+$'
    DISTRIBUTE BY uid
    SORT BY uid, sessionid, CAST(stepid AS INT), time;
The main problem here is the SELECT statement after UNION ALL: it causes Hive to fail, and the error message is very vague. So let's pull that part out and analyze it on its own:

    SELECT
        client.uid, client.sessionid, client.stepid, client.time, client.position, client.source, client.action,
        client.request, client.response, client.cellphone, client.other
    FROM log_client_filtered client
    WHERE (dt='20140517' AND other['date']='20140517')
       OR (dt='20140518' AND other['date']='20140517')
    LEFT SEMI JOIN (
        SELECT DISTINCT uid, sessionid
        FROM log_route_car_filtered
        WHERE dt='20140517'
    ) unique_route
    ON (client.uid=unique_route.uid AND client.sessionid=unique_route.sessionid)
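The basic syntax of LEFT SEMI JOIN is:

    SELECT *
    FROM table1 LEFT SEMI JOIN table2 ON (table1.id = table2.id);

There is no WHERE clause between the table and the join: the join is part of the FROM clause and has to come before any WHERE, so a WHERE cannot sit between the table and the LEFT SEMI JOIN as it does in the query above. Here, however, we still want a WHERE clause to restrict the results to the given dates. What to do? One option is to wrap the WHERE clause in a subquery, as follows:

    SELECT * FROM (
        SELECT
            uid, sessionid, stepid, time, position, source, action, request, response, cellphone, other
        FROM log_client_filtered
        WHERE (dt='20140517' AND other['date']='20140517')
           OR (dt='20140518' AND other['date']='20140517')
    ) client
    LEFT SEMI JOIN (
        SELECT DISTINCT uid, sessionid
        FROM log_route_car_filtered
        WHERE dt='20140517'
    ) unique_route
    ON (client.uid=unique_route.uid AND client.sessionid=unique_route.sessionid)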
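The order of operations in the subquery rewrite (filter first, then semi-join on the pair of keys) can be traced with a short Python sketch. All records below are hypothetical and reduced to the columns that matter; only the column names and the date conditions come from the query.

```python
# Hypothetical rows standing in for log_client_filtered.
client_log = [
    {"uid": "u1", "sessionid": "s1", "dt": "20140517", "other": {"date": "20140517"}},
    {"uid": "u1", "sessionid": "s1", "dt": "20140518", "other": {"date": "20140517"}},
    {"uid": "u1", "sessionid": "s1", "dt": "20140518", "other": {"date": "20140518"}},
    {"uid": "u2", "sessionid": "s9", "dt": "20140517", "other": {"date": "20140517"}},
]

# Hypothetical rows standing in for log_route_car_filtered.
route_log = [
    {"uid": "u1", "sessionid": "s1", "dt": "20140517"},
    {"uid": "u1", "sessionid": "s1", "dt": "20140517"},
    {"uid": "u3", "sessionid": "s4", "dt": "20140516"},
]

# Step 1: the "client" subquery applies the WHERE filter on its own.
client = [
    r for r in client_log
    if (r["dt"] == "20140517" and r["other"]["date"] == "20140517")
    or (r["dt"] == "20140518" and r["other"]["date"] == "20140517")
]

# Step 2: the "unique_route" subquery, SELECT DISTINCT uid, sessionid.
unique_route = {
    (r["uid"], r["sessionid"]) for r in route_log if r["dt"] == "20140517"
}

# Step 3: LEFT SEMI JOIN keeps client rows whose key pair occurs in unique_route.
result = [r for r in client if (r["uid"], r["sessionid"]) in unique_route]

print(len(client), len(result))  # 3 2
```

The filter removes the row whose other['date'] is 20140518, and the semi-join then removes the u2 row, which has no matching route session.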