3. Getting Started with Spark: Finding the 5 Most Common Words in a Text, Excluding Common Stop Words

Date: 2016-08-03 00:05:29

package com.yl.wordcount

import java.io.File

import org.apache.spark.{SparkConf, SparkContext}

import scala.io.Source

/**
 * Word count sorted by frequency, with stop words excluded.
 */
object WordCountStopWords {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("spark://localhost:7077").setAppName("wordcount")
    val sc = new SparkContext(conf)

    val outFile = "/Users/admin/spark/sparkoutput"
    val stopWordsFile = new File("/Users/admin/src" + "/tingyongci.txt")

    // Load the stop words (one per line) into a Set for fast lookups.
    // If the file is missing, fall back to an empty set instead of a null reference.
    val stopWordSet: Set[String] =
      if (stopWordsFile.exists()) {
        val source = Source.fromFile(stopWordsFile)
        try source.getLines().toSet finally source.close()
      } else Set.empty[String]

    val textFile = sc.textFile("/Users/admin/spark/spark-1.5.1-bin-hadoop2.4/README.md")
    val result = textFile
      .flatMap(_.split(" "))                       // split each line into words
      .filter(_.nonEmpty)                          // drop empty tokens
      .filter(!stopWordSet.contains(_))            // drop stop words
      .map((_, 1))
      .reduceByKey(_ + _)                          // count occurrences per word
      .map { case (word, count) => (count, word) } // swap so the count becomes the key
      .sortByKey(false)                            // sort by count, descending

    result.saveAsTextFile(outFile)
    sc.stop()
  }

}
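
The listing above writes every (count, word) pair to the output directory in descending order; it does not cut the result down to five entries. On the sorted RDD, a call such as result.take(5) should return just the first five pairs. The sketch below shows the same filter, count, sort, and take steps on plain Scala collections so it can be run without a cluster. The names TopWordsSketch and topWords are hypothetical and do not appear in the original post, and breaking ties alphabetically is an assumption made here to keep the output deterministic.

```scala
object TopWordsSketch {

  // Hypothetical helper: returns the n most frequent words in `lines`,
  // skipping anything listed in `stopWords`.
  def topWords(lines: Seq[String], stopWords: Set[String], n: Int): Seq[(String, Int)] =
    lines
      .flatMap(_.split(" "))                 // split each line into words
      .filter(_.nonEmpty)                    // drop empty tokens from repeated spaces
      .filterNot(stopWords.contains)         // drop stop words
      .groupBy(identity)                     // group identical words together
      .map { case (word, occurrences) => (word, occurrences.size) }
      .toSeq
      .sortBy { case (word, count) => (-count, word) } // descending count, then alphabetical
      .take(n)                               // keep only the first n entries
}
```

For example, TopWordsSketch.topWords(Seq("the quick fox", "the lazy dog and the fox"), Set("the", "and"), 2) yields Seq(("fox", 2), ("dog", 1)).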


Original source: http://www.cnblogs.com/ylcoder/p/5730947.html
