
[Python] Word Frequency Count for Romance of the Three Kingdoms

Date: 2018-05-03 15:24:07


import jieba

# Convert the txt file to UTF-8 encoding beforehand.
with open('C:/Users/eternal/Desktop/threekingdoms.txt', 'r', encoding='UTF-8') as f:
    txt = f.read()

# Frequent words that are not character names.
excludes = {'将军', '却说', '荆州', '二人', '不可', '不能', '如此'}

words = jieba.lcut(txt)  # segment the text into a list of words
print(words)

counts = {}
for word in words:
    if len(word) == 1:
        # Skip single-character tokens (punctuation, particles).
        continue
    elif word == '诸葛亮' or word == '孔明曰':
        rword = '孔明'
    elif word == '关公' or word == '云长':
        rword = '关羽'
    elif word == '玄德' or word == '玄德曰':
        rword = '刘备'
    elif word == '孟德' or word == '丞相':
        rword = '曹操'
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1

for word in excludes:
    # pop() with a default avoids a KeyError if the word never appeared.
    counts.pop(word, None)

# Sort by count in descending order and print the top 10.
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
print(items)
for i in range(min(10, len(items))):
    word, count = items[i]
    print('{0:<10}{1:>5}'.format(word, count))
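
The alias merging and exclusion steps above can also be written with a lookup table and `collections.Counter`. The following is only a sketch under a few assumptions: the names `ALIASES`, `EXCLUDES`, and `count_names` are made up for illustration and are not part of the original script, and the function takes an already segmented word list (for example the output of `jieba.lcut`) so that it can be tried on a small hand-written sample without the novel's text file.

```python
from collections import Counter

# Hypothetical alias table: maps variant tokens to one canonical name.
ALIASES = {
    '诸葛亮': '孔明', '孔明曰': '孔明',
    '关公': '关羽', '云长': '关羽',
    '玄德': '刘备', '玄德曰': '刘备',
    '孟德': '曹操', '丞相': '曹操',
}

# Frequent words that are not character names (same set as the script above).
EXCLUDES = {'将军', '却说', '荆州', '二人', '不可', '不能', '如此'}


def count_names(words, top_n=10):
    """Count tokens longer than one character, merging aliases and
    skipping excluded words; return the top_n (name, count) pairs."""
    counts = Counter()
    for word in words:
        if len(word) == 1:
            continue
        name = ALIASES.get(word, word)
        if name not in EXCLUDES:
            counts[name] += 1
    return counts.most_common(top_n)


# Small hand-written sample standing in for jieba.lcut(txt).
sample = ['玄德', '曰', '孔明', '诸葛亮', '将军', '关公', '云长', '孔明曰']
result = count_names(sample)
for name, count in result:
    print('{0:<10}{1:>5}'.format(name, count))
```

With the real text, `count_names(jieba.lcut(txt))` would take the place of the sample call. Keeping the aliases in a dictionary means a new variant name can be added as one entry instead of another `elif` branch.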

Original article: https://www.cnblogs.com/naraka/p/8985134.html
