Hadoop environment:
master node: node1
slave nodes: node2, node3, node4
remote server (Python client connecting to Hive): node29
Requirement: use Hive to find the top 10 most-frequently requested URLs in the CDN log within a given time window.
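Before turning to Hive, the computation itself is easy to state: group log lines by URL, count them, sort descending, and keep the first 10. A minimal pure-Python sketch of that same top-N logic (the sample lines and their "<time> <url>" layout are made up for illustration; they are not the real cdnlog schema):

```python
# Pure-Python illustration of what the Hive query computes:
# count requests per URL, then take the N most frequent.
from collections import Counter

def top_urls(log_lines, n=10):
    """Return the n most common URLs with their counts, most frequent first."""
    counts = Counter(line.split()[1] for line in log_lines if line.strip())
    return counts.most_common(n)

# Hypothetical log lines of the form "<time> <url>":
sample = [
    "[27/Oct/2014:10:40:01] /index.html",
    "[27/Oct/2014:10:40:02] /img/logo.png",
    "[27/Oct/2014:10:40:03] /index.html",
]
print(top_urls(sample, n=2))  # -> [('/index.html', 2), ('/img/logo.png', 1)]
```

Hive performs exactly this grouping and ordering, but distributed across the cluster via MapReduce.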
PS: for the same query implemented with Pig, see:
http://shineforever.blog.51cto.com/1429204/1571124
Note: operating Hive remotely from Python requires the Thrift interface.
The Hive source package ships with the Python Thrift bindings:
[root@node1 shell]# ls -l /usr/local/hive-0.8.1/lib/py
total 28
drwxr-xr-x 2 hadoop hadoop 4096 Nov 5 15:29 fb303
drwxr-xr-x 2 hadoop hadoop 4096 Oct 15 10:30 fb303_scripts
drwxr-xr-x 2 hadoop hadoop 4096 Nov 5 15:29 hive_metastore
drwxr-xr-x 2 hadoop hadoop 4096 Oct 15 10:30 hive_serde
drwxr-xr-x 2 hadoop hadoop 4096 Nov 5 15:29 hive_service
drwxr-xr-x 2 hadoop hadoop 4096 Nov 5 15:20 queryplan
drwxr-xr-x 6 hadoop hadoop 4096 Nov 5 15:20 thrift
1) scp the relevant files to the corresponding directory on the remote node29:
scp -r /usr/local/hive-0.8.1/lib/py/* 172.16.41.29:/usr/local/hive_py/.
2) Start the Hive Thrift server on node1:
[hadoop@node1 py]$ hive --service hiveserver
Starting Hive Thrift Server
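Before running the client from node29, it is worth verifying that the Thrift service is actually listening (by default on port 10000). A small generic TCP check, not part of the Hive bindings:

```python
# Check whether a TCP port is accepting connections (e.g. the Hive Thrift
# server on port 10000).
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return True
    except socket.error:
        return False
    finally:
        s.close()

# Example with the server address used in this article:
#   port_open('172.16.41.151', 10000)
```

If this returns False, check that `hive --service hiveserver` is still running on node1 and that no firewall blocks port 10000.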
3) Write the query script on node29:
#!/usr/bin/env python
#coding:utf-8
# Find the top 10 most-accessed URLs in the CDN log for a given time window.
import sys

# Load the Hive Python client libraries copied over from the Hive source package.
sys.path.append('/usr/local/hive_py')

from hive_service import ThriftHive
from hive_service.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

dbname = "default"
hsql = ("select request, count(request) as counts from cdnlog "
        "where time >= '[27/Oct/2014:10:40:00 +0800]' "
        "and time <= '[27/Oct/2014:10:49:59 +0800]' "
        "group by request order by counts desc limit 10")

def hiveExe(hsql, dbname):
    try:
        transport = TSocket.TSocket('172.16.41.151', 10000)
        transport = TTransport.TBufferedTransport(transport)
        protocol = TBinaryProtocol.TBinaryProtocol(transport)
        client = ThriftHive.Client(protocol)
        transport.open()
        # Load the contrib jar, required for the regex SerDe. Note: this path
        # is on the remote Hive server, not local to this script!
        client.execute('add jar /usr/local/hive-0.8.1/lib/hive_contrib.jar')
        # client.execute("use " + dbname)
        # row = client.fetchOne()
        client.execute(hsql)
        results = client.fetchAll()  # fetch all result rows
        transport.close()            # close the connection before returning
        return results
    except Thrift.TException, tx:
        print '%s' % (tx.message)

if __name__ == '__main__':
    results = hiveExe(hsql, dbname)
    num = len(results)
    for i in range(num):
        print results[i]  # print each row of the top-10 result
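fetchAll() returns each result row as a plain string; with this two-column query, each row should be the request and its count separated by a tab (the tab delimiter is an assumption here, since HiveServer renders rows with the table's field separator). A small helper for turning those rows into (url, count) pairs:

```python
# Split Hive result rows of the assumed form "<request>\t<count>"
# into (url, count) tuples with the count as an integer.
def parse_rows(rows):
    parsed = []
    for row in rows:
        url, count = row.rsplit('\t', 1)  # split on the last tab only
        parsed.append((url, int(count)))
    return parsed

# Example with rows shaped like the query's output:
print(parse_rows(['/index.html\t120', '/img/logo.png\t87']))
# -> [('/index.html', 120), ('/img/logo.png', 87)]
```

Using rsplit on the last tab keeps the helper working even if a URL itself contained a tab character.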
Running the script on node29 produces the following output:
Meanwhile, the Hive computation on node1 proceeds as follows:
This post is from the blog "shine_forever"; please keep this attribution: http://shineforever.blog.51cto.com/1429204/1573439