Task: Writing a WordCount Program in Python
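Program: WordCount
Input: a text file containing a large number of words
Output: every word in the file with its number of occurrences (frequency), sorted alphabetically by word, one word and its frequency per line, with whitespace between the word and the frequency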
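The specification above can be checked locally before any Hadoop job is written. The following is only a minimal sketch, not taken from the original post: it assumes words are separated by whitespace and uses a tab as the separator in the output, and the function name `word_count` is hypothetical.

```python
from collections import Counter


def word_count(text):
    """Return (word, count) pairs sorted alphabetically by word."""
    counts = Counter(text.split())  # split() breaks on any run of whitespace
    return sorted(counts.items())   # tuples sort by word first


if __name__ == "__main__":
    sample = "the quick the lazy the dog"
    for word, count in word_count(sample):
        # one word and its frequency per line, separated by a tab
        print("%s\t%d" % (word, count))
```

This single-process version is useful as a reference result to compare against the output of the mapper and reducer scripts.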
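create 'Student', 'S_No', 'S_Name', 'S_Sex', 'S_Age'
put 'Student', 's001', 'S_No', '2015001'
put 'Student', 's001', 'S_Name', 'Zhangsan'
put 'Student', 's001', 'S_Sex', 'male'
put 'Student', 's001', 'S_Age', '23'
put 'Student', 's002', 'S_No', '2015003'
put 'Student', 's002', 'S_Name', 'Mary'
put 'Student', 's002', 'S_Sex', 'female'
put 'Student', 's002', 'S_Age', '22'
put 'Student', 's003', 'S_Name', 'Lisi'
put 'Student', 's003', 'S_Age', '24'
scan 'Student'
alter 'Student', 'NAME' => 'course'
put 'Student', '3', 'course:Math', '85'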
drop
count 's1'
truncate
cd /home/hadoop/wc
sudo gedit mapper.py
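#!/usr/bin/env python3
# map function: emit "word<TAB>1" for every word read from standard input
import sys

for i in sys.stdin:
    i = i.strip()
    words = i.split()
    for word in words:
        print('%s\t%s' % (word, 1))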
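#!/usr/bin/env python3
# reduce function: sum the counts of each word; the input must be sorted by word
import sys

current_word = None
current_count = 0
word = None

for i in sys.stdin:
    i = i.strip()
    word, count = i.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # skip lines whose count is not a number
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# emit the last word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))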
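The reducer relies on its input arriving sorted by word, so that all counts for one word are adjacent. The same idea can be written with `itertools.groupby`, which groups consecutive items that share a key. This is an alternative sketch, not the code from the post; the helper names `parse` and `reduce_counts` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def parse(lines):
    """Yield (word, count) from 'word<TAB>count' lines, skipping malformed ones."""
    for line in lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # no tab on this line
        try:
            value = int(count)
        except ValueError:
            continue  # count was not a number
        yield word, value


def reduce_counts(lines):
    """Yield (word, total) pairs; the input must already be sorted by word."""
    for word, group in groupby(parse(lines), key=itemgetter(0)):
        yield word, sum(count for _, count in group)
```

Inside a streaming reducer this could be driven with `for word, total in reduce_counts(sys.stdin)`, printing one tab-separated line per pair.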
chmod a+x mapper.py reducer.py
echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py
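The local test above chains three stages: map, sort, and reduce. As a rough illustration of what each stage contributes, the sketch below reproduces the pipeline in a single Python process. It is an assumption-laden stand-in rather than the post's code: `mapper`, `reducer`, and `run_pipeline` are hypothetical names, and `sorted` with a key on the first field stands in for `sort -k1,1`.

```python
def mapper(lines):
    """Emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.strip().split():
            yield "%s\t%s" % (word, 1)


def reducer(sorted_lines):
    """Sum counts of adjacent identical words; input must be sorted by word."""
    current_word, current_count = None, 0
    for line in sorted_lines:
        word, count = line.split("\t", 1)
        count = int(count)
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield "%s\t%s" % (current_word, current_count)
            current_word, current_count = word, count
    if current_word is not None:
        yield "%s\t%s" % (current_word, current_count)


def run_pipeline(text):
    mapped = mapper(text.splitlines())
    # sort by the word field only, standing in for `sort -k1,1`
    shuffled = sorted(mapped, key=lambda line: line.split("\t", 1)[0])
    return list(reducer(shuffled))


if __name__ == "__main__":
    for line in run_pipeline("foo foo quux labs foo bar quux"):
        print(line)
```

For the sample sentence this prints `bar 1`, `foo 3`, `labs 1`, and `quux 2`, one tab-separated pair per line.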
wget http://www.gutenberg.org/files/5000/5000-8.txt
wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt
usr
hdfs dfs -put .../gutenberg/*.txt /user/.../input
Understanding the MapReduce computing architecture
Original article: https://www.cnblogs.com/wenjian1027/p/9026761.html