A word count implemented with nothing but the basic APIs of a few languages commonly used for data processing.
Read a file, count how often each word (converted to uppercase) appears, sort the counts, and take the Top K. First, Scala:
import scala.io.{BufferedSource, Source}

def main(args: Array[String]): Unit = {
  // read the file
  val source: BufferedSource = Source.fromFile("dir/wordcount.txt")
  /* file contents:
  hadoop Spark hive
  Spark Flink hadoop
  java scala hadoop
  Spark Hadoop Java
  */
  val text: String = source.mkString
  // split the string into an array of words
  val strings: Array[String] = text.split("\\W+")
  // process the data: uppercase -> pair with 1 -> group by word -> count -> sort -> reverse -> print
  strings.map(_.toUpperCase)
    .map((_, 1))
    .groupBy(_._1)
    .map(kv => (kv._1, kv._2.length))
    .toArray
    .sortBy(_._2)
    .reverse
    .foreach(println)
  /* output:
  (HADOOP,4)
  (SPARK,3)
  (JAVA,2)
  (HIVE,1)
  (SCALA,1)
  (FLINK,1)
  */
  source.close()
}
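The chain above prints the whole sorted list; the Top K part only needs a take(k) at the end of the sorted array. A minimal sketch, assuming the same `text` is in scope and choosing k = 3 purely for illustration:

// Top-K sketch: sortBy(-count) also removes the separate reverse step
val k = 3
text.split("\\W+")
  .map(_.toUpperCase)
  .groupBy(identity)
  .map { case (word, occ) => (word, occ.length) }
  .toArray
  .sortBy(-_._2)
  .take(k)
  .foreach(println)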
In Java, a collection has to be converted to a Stream before higher-order functions can be used on it.
import java.io.File;
import java.io.FileReader;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

FileReader reader = new FileReader(new File("dir/wordcount.txt"));
char[] chars = new char[1024];
int len = reader.read(chars);
// split the text that was actually read into words and turn the array into a Stream
Stream<String> stream = Stream.of(new String(chars, 0, len).split("\\W+"));
// uppercase -> group by word -> count
Map<String, Long> collect = stream.map(String::toUpperCase)
        .collect(Collectors.groupingBy(s -> s, Collectors.counting()));
// word count done
System.out.println(collect); // {JAVA=3, HIVE=1, HADOOP=4, SCALA=1, HDFS=1, SPARK=3, HBASE=1, YARN=1, FLINK=1}
reader.close();
// sort the entries by count (ascending), then reverse to get descending order
List<Map.Entry<String, Long>> entryList = collect.entrySet().stream()
        .sorted((e1, e2) -> Long.compare(e1.getValue(), e2.getValue()))
        .collect(Collectors.toList());
Collections.reverse(entryList);
// take the Top 5 entries: Top K done
entryList.stream().limit(5).forEach(System.out::println);
/*
HADOOP=4
SPARK=3
JAVA=3
FLINK=1
YARN=1
*/
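As an aside, the explicit reverse step can be folded into the comparator. A short sketch over the same `collect` map (not from the original post), using Map.Entry.comparingByValue with reversed() and limit(5):

// descending sort and Top 5 in one stream pipeline
collect.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .limit(5)
        .forEach(System.out::println);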
Python also supports a bit of functional programming, but doing this by hand is still quite a struggle.
import re
import copy

# read the file
with open('wordcount.txt', 'r') as file:
    text = file.readlines()
# no explicit close() needed: the with-block already closed the file

# join the lines into a single uppercase string
lines = ''.join(text).upper()
""" lines:
HADOOP SPARK HIVE YARN HDFS
SPARK FLINK HADOOP JAVA
JAVA SCALA HADOOP HBASE
SPARK HADOOP JAVA
"""
# split the string into words
word = re.split(r"\s+", lines)
"""
['HADOOP', 'SPARK', 'HIVE', 'YARN', 'HDFS', 'SPARK', 'FLINK', 'HADOOP', 'JAVA', 'JAVA', 'SCALA', 'HADOOP', 'HBASE', 'SPARK', 'HADOOP', 'JAVA']
"""
data = list(map(lambda x: (x, 1), word))
""" data, as (word, 1) tuples:
[('HADOOP', 1), ('SPARK', 1), ('HIVE', 1), ('YARN', 1), ('HDFS', 1), ('SPARK', 1), ('FLINK', 1), ('HADOOP', 1),
 ('JAVA', 1), ('JAVA', 1), ('SCALA', 1), ('HADOOP', 1), ('HBASE', 1), ('SPARK', 1), ('HADOOP', 1), ('JAVA', 1)]
"""
# copy the list (a shallow copy is enough here) and build a dict from it to de-duplicate the words
new_data = copy.copy(data)
new_list = list(dict(new_data))
""" the distinct words:
['HADOOP', 'SPARK', 'HIVE', 'YARN', 'HDFS', 'FLINK', 'JAVA', 'SCALA', 'HBASE']
"""
topK = []  # (count, word) pairs, not yet sorted
c = 0      # counter
# count the occurrences of each distinct word
for i in new_list:
    for j in data:
        if i == j[0]:
            c += 1
    topK.append((c, i))
    c = 0
# sort descending
topK.sort(reverse=True)
# take the Top 5
for word in topK[0:5]:
    print(word)
"""
(4, 'HADOOP')
(3, 'SPARK')
(3, 'JAVA')
(1, 'YARN')
(1, 'SCALA')
"""
Original article: https://www.cnblogs.com/cgl-dong/p/14142966.html