1,首先需要安装hive,参考http://lqding.blog.51cto.com/9123978/1750967
2,在spark的配置目录下添加配置文件,让Spark可以访问hive的metastore。
root@spark-master:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf# vi hive-site.xml <configuration> <property> <name>hive.metastore.uris</name> <value>thrift://spark-master:9083</value> <description>Thrift uri for the remote metastore. Used by metastore client to connect to remote metastore.</description> </property> </configuration>
3,将MySQL jdbc驱动copy到spark的lib目录下
root@spark-master:/usr/local/hive/apache-hive-1.2.1/lib# cp mysql-connector-java-5.1.36-bin.jar /usr/local/spark/spark-1.6.0-bin-hadoop2.6/lib/
4,启动Hive的metastore服务
root@spark-master:/usr/local/hive/apache-hive-1.2.1/bin# ./hive --service metastore & [1] 20518 root@spark-master:/usr/local/hive/apache-hive-1.2.1/bin# SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/usr/local/hadoop/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] Starting Hive Metastore Server SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/usr/local/hadoop/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
5,启动spark-shell
root@spark-master:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin# ./spark-shell --master spark://spark-master:7077
生成hiveContext
scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc);
执行sql
scala> hc.sql("show tables").collect.foreach(println) [sougou,false] [t1,false] scala> hc.sql("select count(*) from sougou").collect.foreach(println) 16/03/14 23:15:58 INFO parse.ParseDriver: Parsing command: select count(*) from sougou 16/03/14 23:16:00 INFO parse.ParseDriver: Parse Completed 16/03/14 23:16:01 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps 16/03/14 23:16:02 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 474.9 KB, free 474.9 KB) 16/03/14 23:16:02 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 41.6 KB, free 516.4 KB) 16/03/14 23:16:02 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.199.100:41635 (size: 41.6 KB, free: 517.4 MB) 16/03/14 23:16:02 INFO spark.SparkContext: Created broadcast 0 from collect at <console>:30 16/03/14 23:16:03 INFO mapred.FileInputFormat: Total input paths to process : 1 16/03/14 23:16:03 INFO spark.SparkContext: Starting job: collect at <console>:30 16/03/14 23:16:03 INFO scheduler.DAGScheduler: Registering RDD 5 (collect at <console>:30) 16/03/14 23:16:03 INFO scheduler.DAGScheduler: Got job 0 (collect at <console>:30) with 1 output partitions 16/03/14 23:16:03 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (collect at <console>:30) 16/03/14 23:16:03 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 0) 16/03/14 23:16:04 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 0) 16/03/14 23:16:04 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[5] at collect at <console>:30), which has no missing parents 16/03/14 23:16:04 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 13.8 KB, free 530.2 KB) 16/03/14 23:16:04 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 6.9 KB, free 537.1 KB) 16/03/14 23:16:04 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.199.100:41635 (size: 6.9 KB, free: 517.4 MB) 16/03/14 23:16:04 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006 16/03/14 23:16:04 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[5] at collect at <console>:30) 16/03/14 23:16:04 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks 16/03/14 23:16:04 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, spark-worker2, partition 0,NODE_LOCAL, 2152 bytes) 16/03/14 23:16:04 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, spark-worker1, partition 1,NODE_LOCAL, 2152 bytes) 16/03/14 23:16:05 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on spark-worker2:55899 (size: 6.9 KB, free: 146.2 MB) 16/03/14 23:16:05 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on spark-worker1:38231 (size: 6.9 KB, free: 146.2 MB) 16/03/14 23:16:09 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on spark-worker1:38231 (size: 41.6 KB, free: 146.2 MB) 16/03/14 23:16:10 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on spark-worker2:55899 (size: 41.6 KB, free: 146.2 MB) 16/03/14 23:16:16 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 12015 ms on spark-worker1 (1/2) 16/03/14 23:16:16 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (collect at <console>:30) finished in 12.351 s 16/03/14 23:16:16 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 12341 ms on spark-worker2 (2/2) 16/03/14 23:16:16 INFO scheduler.DAGScheduler: looking for newly runnable stages 16/03/14 23:16:16 INFO scheduler.DAGScheduler: running: Set() 16/03/14 23:16:16 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 1) 16/03/14 23:16:16 INFO scheduler.DAGScheduler: failed: Set() 16/03/14 23:16:16 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[8] at collect at <console>:30), which has no missing parents 16/03/14 23:16:16 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 16/03/14 23:16:16 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 12.9 KB, free 550.1 KB) 16/03/14 23:16:16 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 6.4 KB, free 556.5 KB) 16/03/14 23:16:16 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.199.100:41635 (size: 6.4 KB, free: 517.4 MB) 16/03/14 23:16:16 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006 16/03/14 23:16:16 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[8] at collect at <console>:30) 16/03/14 23:16:16 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks 16/03/14 23:16:16 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, spark-worker1, partition 0,NODE_LOCAL, 1999 bytes) 16/03/14 23:16:16 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on spark-worker1:38231 (size: 6.4 KB, free: 146.1 MB) 16/03/14 23:16:17 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to spark-worker1:43568 16/03/14 23:16:17 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 158 bytes 16/03/14 23:16:18 INFO scheduler.DAGScheduler: ResultStage 1 (collect at <console>:30) finished in 1.288 s 16/03/14 23:16:18 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 1279 ms on spark-worker1 (1/1) 16/03/14 23:16:18 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 16/03/14 23:16:18 INFO scheduler.DAGScheduler: Job 0 finished: collect at <console>:30, took 14.285673 s [1000000]
跟Hive相比,速度是有所提升的。如果是复杂的语句,相比hive速度将更加的快。
scala> hc.sql("select word,count(*) cnt from sougou group by word order by cnt desc limit 5").collect.foreach(println) .... 16/03/14 23:19:16 INFO scheduler.DAGScheduler: ResultStage 3 (collect at <console>:30) finished in 11.900 s 16/03/14 23:19:16 INFO scheduler.DAGScheduler: Job 1 finished: collect at <console>:30, took 17.925094 s 16/03/14 23:19:16 INFO scheduler.TaskSetManager: Finished task 195.0 in stage 3.0 (TID 200) in 696 ms on spark-worker2 (200/200) 16/03/14 23:19:16 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool [百度,7564] [baidu,3652] [人体艺术,2786] [馆陶县县长闫宁的父亲,2388] [4399小游戏,2119]
之前使用Hive需要跑将近110s,而使用Spark SQL仅需17s
本文出自 “叮咚” 博客,请务必保留此出处http://lqding.blog.51cto.com/9123978/1751100
原文地址:http://lqding.blog.51cto.com/9123978/1751100