scala> val count = file.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_) scala> val count = file.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_) 15/07/12 21:38:43 INFO FileInputFormat: Total input paths to process : 1 count: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:23 scala> scala> count.collect() 15/07/12 21:39:25 INFO SparkContext: Starting job: collect at <console>:26 15/07/12 21:39:25 INFO DAGScheduler: Registering RDD 7 (map at <console>:23) 15/07/12 21:39:25 INFO DAGScheduler: Got job 0 (collect at <console>:26) with 3 output partitions (allowLocal=false) 15/07/12 21:39:25 INFO DAGScheduler: Final stage: ResultStage 1(collect at <console>:26) 15/07/12 21:39:25 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0) 15/07/12 21:39:25 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0) 15/07/12 21:39:25 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[7] at map at <console>:23), which has no missing parents 15/07/12 21:39:25 INFO MemoryStore: ensureFreeSpace(4128) called with curMem=297554, maxMem=278302556 15/07/12 21:39:25 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 4.0 KB, free 265.1 MB) 15/07/12 21:39:25 INFO MemoryStore: ensureFreeSpace(2305) called with curMem=301682, maxMem=278302556 15/07/12 21:39:25 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.3 KB, free 265.1 MB) 15/07/12 21:39:25 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:60268 (size: 2.3 KB, free: 265.4 MB) 15/07/12 21:39:25 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:874 15/07/12 21:39:25 INFO DAGScheduler: Submitting 3 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[7] at map at <console>:23) 15/07/12 21:39:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks 15/07/12 21:39:25 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, ANY, 1406 bytes) 15/07/12 21:39:25 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, ANY, 1406 bytes) 15/07/12 21:39:25 INFO Executor: Running task 1.0 in stage 0.0 (TID 1) 15/07/12 21:39:25 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 15/07/12 21:39:25 INFO HadoopRDD: Input split: hdfs://9.125.73.217:9000/hbase/hbase.version:0+3 15/07/12 21:39:25 INFO HadoopRDD: Input split: hdfs://9.125.73.217:9000/hbase/hbase.version:3+3 15/07/12 21:39:25 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 15/07/12 21:39:25 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 15/07/12 21:39:25 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 15/07/12 21:39:25 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 15/07/12 21:39:25 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 15/07/12 21:39:25 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2003 bytes result sent to driver 15/07/12 21:39:25 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2003 bytes result sent to driver 15/07/12 21:39:25 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, ANY, 1406 bytes) 15/07/12 21:39:25 INFO Executor: Running task 2.0 in stage 0.0 (TID 2) 15/07/12 21:39:25 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 162 ms on localhost (1/3) 15/07/12 21:39:25 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 179 ms on localhost (2/3) 15/07/12 21:39:25 INFO HadoopRDD: Input split: hdfs://9.125.73.217:9000/hbase/hbase.version:6+1 15/07/12 21:39:25 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 2003 bytes result sent to driver 15/07/12 21:39:25 INFO DAGScheduler: ShuffleMapStage 0 (map at <console>:23) finished in 0.205 s 15/07/12 21:39:25 INFO DAGScheduler: looking for newly runnable stages 15/07/12 21:39:25 INFO DAGScheduler: running: Set() 15/07/12 21:39:25 INFO DAGScheduler: waiting: Set(ResultStage 1) 15/07/12 21:39:25 INFO DAGScheduler: failed: Set() 15/07/12 21:39:25 INFO DAGScheduler: Missing parents for ResultStage 1: List() 15/07/12 21:39:25 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[8] at reduceByKey at <console>:23), which is now runnable 15/07/12 21:39:25 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 25 ms on localhost (3/3) 15/07/12 21:39:25 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 15/07/12 21:39:25 INFO MemoryStore: ensureFreeSpace(2288) called with curMem=303987, maxMem=278302556 15/07/12 21:39:25 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.2 KB, free 265.1 MB) 15/07/12 21:39:25 INFO MemoryStore: ensureFreeSpace(1377) called with curMem=306275, maxMem=278302556 15/07/12 21:39:25 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1377.0 B, free 265.1 MB) 15/07/12 21:39:25 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:60268 (size: 1377.0 B, free: 265.4 MB) 15/07/12 21:39:25 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:874 15/07/12 21:39:25 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 1 (ShuffledRDD[8] at reduceByKey at <console>:23) 15/07/12 21:39:25 INFO TaskSchedulerImpl: Adding task set 1.0 with 3 tasks 15/07/12 21:39:25 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 3, localhost, PROCESS_LOCAL, 1165 bytes) 15/07/12 21:39:25 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 4, localhost, PROCESS_LOCAL, 1165 bytes) 15/07/12 21:39:25 INFO Executor: Running task 0.0 in stage 1.0 (TID 3) 15/07/12 21:39:25 INFO Executor: Running task 1.0 in stage 1.0 (TID 4) 15/07/12 21:39:25 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 3 blocks 15/07/12 21:39:25 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 3 blocks 15/07/12 21:39:25 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 7 ms 15/07/12 21:39:25 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 8 ms 15/07/12 21:39:25 INFO Executor: Finished task 1.0 in stage 1.0 (TID 4). 1031 bytes result sent to driver 15/07/12 21:39:25 INFO Executor: Finished task 0.0 in stage 1.0 (TID 3). 1029 bytes result sent to driver 15/07/12 21:39:25 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 5, localhost, PROCESS_LOCAL, 1165 bytes) 15/07/12 21:39:25 INFO Executor: Running task 2.0 in stage 1.0 (TID 5) 15/07/12 21:39:25 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 3) in 47 ms on localhost (1/3) 15/07/12 21:39:25 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 3 blocks 15/07/12 21:39:25 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms 15/07/12 21:39:25 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 4) in 46 ms on localhost (2/3) 15/07/12 21:39:25 INFO Executor: Finished task 2.0 in stage 1.0 (TID 5). 882 bytes result sent to driver 15/07/12 21:39:25 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 5) in 6 ms on localhost (3/3) 15/07/12 21:39:25 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 15/07/12 21:39:25 INFO DAGScheduler: ResultStage 1 (collect at <console>:26) finished in 0.043 s 15/07/12 21:39:25 INFO DAGScheduler: Job 0 finished: collect at <console>:26, took 0.352074 s res1: Array[(String, Int)] = Array((?8,1), (PBUF,1)) scala> |