
Comprehensive Exercise: Word Frequency Count

Posted: 2018-03-28 12:19:18


1. English word frequency count

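# Read the lyrics from a text file
with open('lyric.txt', 'r') as f:
    lyric = f.read()

# Punctuation to strip, and stop words to exclude from the count.
# Apostrophes are stripped first, so "don't" becomes "dont" and "I'll" becomes "ill".
punctuation = ''',.?/:;'"'''
a = {'in', 'on', 'with', 'by', 'for', 'at', 'about', 'under', 'of', 'i', 'a',
     'is', 'its', 'so', 'and', 'dont', 'it', 'to', 'ill', 'the'}
for i in punctuation:
    lyric = lyric.replace(i, '')
result = lyric.lower().strip()
tempwords = result.split()
print(tempwords)
count = {}
words = list(set(tempwords) - a)

print(words)
print(result)

# Count whole tokens; result.count(word) would also match substrings,
# for example "he" inside "the"
for i in range(0, len(words)):
    count[words[i]] = tempwords.count(words[i])
    print('Word ' + words[i] + ' occurs ' + str(count[words[i]]) + ' times')

for i in count:
    print(i)
    print(count[i])

# Sort by frequency, highest first
countList = list(count.items())
countList.sort(key=lambda x: x[1], reverse=True)
print(countList)

# Append the 20 most frequent words (or fewer, if the text is short) to the output file
with open('lyricCount.txt', 'a') as f:
    for word, n in countList[:20]:
        f.write(word + ':' + str(n) + '\n')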
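
The same count can also be done in a single pass with `collections.Counter` from the standard library. The following is only a minimal sketch, assuming the text is already in memory; the function name `top_words`, the shortened stop-word set, and the sample sentence are made up for illustration and do not come from the original post.

```python
import re
from collections import Counter

# Hypothetical, shortened stop-word set used only for this illustration
STOP_WORDS = {"in", "on", "the", "a", "is", "and", "to", "of"}


def top_words(text, stop_words=STOP_WORDS, n=20):
    """Return the n most frequent words in text, ignoring stop words."""
    # Lowercase and drop apostrophes so "don't" becomes "dont"
    cleaned = text.lower().replace("'", "")
    # Keep runs of ASCII letters as tokens; everything else acts as a separator
    tokens = re.findall(r"[a-z]+", cleaned)
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(n)


sample = "The cat sat on the mat. The cat is happy, and the mat is flat."
for word, freq in top_words(sample, n=3):
    print(word + ":" + str(freq))
```

`Counter` tallies every token in one scan, so the cost does not grow with the number of distinct words the way a per-word `list.count` loop does.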

 

2. Chinese word frequency count

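from collections import Counter

import jieba

# Read the novel text
with open('sanguoyanyi.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Register character names so jieba keeps each one as a single token
jieba.add_word('曹操')
jieba.add_word('诸葛亮')
jieba.add_word('孔明')

# Full-width Chinese punctuation to strip
punctuation = ''',。‘’“”:;()!?、 '''
# Tokens to exclude; the original set listed further stop words
# that did not survive in this copy
a = {'\n', '\u3000'}
for i in punctuation:
    text = text.replace(i, '')

# Segment once and reuse the result
tempwords = list(jieba.cut(text))
print(tempwords)
words = list(set(tempwords) - a)
print(words)

# Count segmented tokens in one pass; text.count(word) would also match
# substrings of longer words and rescans the whole novel for every word
tokenCounts = Counter(tempwords)
count = {}
for i in range(0, len(words)):
    count[words[i]] = tokenCounts[words[i]]

# Sort by frequency, highest first
countList = list(count.items())
countList.sort(key=lambda x: x[1], reverse=True)
print(countList)

# Append the 20 most frequent words to the output file
with open('zzzCount.txt', 'a', encoding='utf-8') as f:
    for word, n in countList[:20]:
        f.write(word + ':' + str(n) + '\n')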
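
The segmentation and counting steps can be separated into two small helpers, which makes the counting part easy to check without loading a whole novel. This is a sketch under stated assumptions: `segment` wraps `jieba.lcut` and imports jieba lazily, while `count_tokens`, its default stop-token set, and the sample tokens are hypothetical names and data introduced here for illustration.

```python
from collections import Counter


def segment(text, names=("曹操", "诸葛亮", "孔明")):
    """Segment Chinese text with jieba, keeping the given names as single tokens."""
    import jieba  # third-party; imported here so count_tokens works without it
    for name in names:
        jieba.add_word(name)
    return jieba.lcut(text)


def count_tokens(tokens, stop_tokens=frozenset({"", "\n", "\u3000"}), n=20):
    """Return the n most frequent tokens that are neither stop tokens nor whitespace."""
    counts = Counter(
        t for t in tokens if t not in stop_tokens and not t.isspace()
    )
    return counts.most_common(n)


# Pre-segmented sample tokens, standing in for the output of segment(text)
tokens = ["曹操", "曰", "孔明", "曹操", "\n", "孔明", "曹操"]
for word, freq in count_tokens(tokens, n=2):
    print(word + ":" + str(freq))
```

With the real file, the call would look like `count_tokens(segment(text))`, where `text` is the content of `sanguoyanyi.txt`.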

 


Original article: https://www.cnblogs.com/605-mk/p/8662682.html
