
Python: Word Frequency Counting for Chinese Text Files + Chinese Word Cloud

Date: 2019-09-30 12:31:26


1. Word frequency counting:

import jieba

# Read the whole novel; the file is expected to be UTF-8 encoded.
with open("threekingdoms3.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)  # segment the text into a list of words
counts = {}
for word in words:
    if len(word) == 1:  # skip single characters and punctuation
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # most frequent first
for word, count in items[:15]:
    print("{0:<10}{1:>5}".format(word, count))

Output:

曹操 946
孔明 737
将军 622
玄德 585
却说 534
关公 509
荆州 413
二人 410
丞相 405
玄德曰 390
不可 387
孔明曰 374
张飞 358
如此 320
不能 318
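
The counting loop above can also be written with collections.Counter from the standard library. The following is a minimal sketch, not the author's code: the helper names top_words and format_row are made up for illustration, and top_words takes an already segmented word list (for example the result of jieba.lcut(txt)) so that it runs without the novel file. Padding with the full-width space U+3000 is an optional extra that tends to keep the columns aligned when the words are Chinese.

```python
from collections import Counter


def top_words(words, n=15):
    """Return the n most frequent words, skipping single-character tokens."""
    counts = Counter(w for w in words if len(w) != 1)
    return counts.most_common(n)


def format_row(word, count, width=10):
    """Left-align the word with full-width spaces, right-align the count."""
    # chr(12288) is U+3000, the full-width (ideographic) space.
    return "{0:{fill}<{width}}{1:>5}".format(
        word, count, fill=chr(12288), width=width
    )


if __name__ == "__main__":
    # Hypothetical token list standing in for jieba.lcut(txt).
    sample = ["曹操", "曰", "孔明", "曹操", ",", "将军", "孔明", "曹操"]
    for word, count in top_words(sample, 3):
        print(format_row(word, count))
```

Counter.most_common(n) does the sorting and slicing in one call, and it simply returns fewer rows when the text contains fewer than n distinct words.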

As a further refinement, I only want to count how often each character appears. The code is as follows:

import jieba

with open("threekingdoms3.txt", "r", encoding="utf-8") as f:
    txt = f.read()
# Characters whose appearances should be reported.
names = {"曹操", "孔明", "刘备", "关羽", "张飞", "吕布", "赵云", "孙权", "周瑜", "袁绍", "黄忠", "魏延"}
words = jieba.lcut(txt)
counts = {}
# Merge aliases and titles into one canonical name per character.
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
# Only the 40 most frequent words are checked, so rarer names are not printed.
for word, count in items[:40]:
    if word in names:
        print("{0:<10}{1:>5}".format(word, count))

Output:

曹操 1358
孔明 1265
刘备 1251
关羽 783
张飞 358
吕布 300
赵云 278
孙权 257
周瑜 217
袁绍 191
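
The chain of elif branches can be replaced by a lookup table that maps each alias to a canonical name. The sketch below illustrates that idea and is not the author's code: ALIASES and count_characters are hypothetical names, and the alias pairs are the ones used in the listing above. Unlike the loop over the top 40 items, it reports every entry of names that occurs at all, so its output can include characters that the original listing leaves out.

```python
from collections import Counter

# Alias -> canonical name, taken from the elif branches in the listing above.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}


def count_characters(words, names, aliases=ALIASES):
    """Count appearances of each name, after mapping aliases to one name."""
    counts = Counter()
    for word in words:
        canonical = aliases.get(word, word)  # unknown words map to themselves
        if canonical in names:
            counts[canonical] += 1
    return counts.most_common()  # (name, count) pairs, most frequent first
```

Adding another alias then means adding one dictionary entry instead of another branch.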

Going one step further, generate a word cloud:

import jieba
import os
import wordcloud


def getText(file):
    # Read the file and join the jieba tokens with spaces so that
    # wordcloud can see the word boundaries.
    with open(file, "r", encoding="utf-8") as f:
        txt = f.read()
    return " ".join(jieba.lcut(txt))


filename = input()  # base name of the text file, without ".txt"
txt = getText(filename + ".txt")
wordclouds = wordcloud.WordCloud(width=1000, height=800, margin=2).generate(txt)
wordclouds.to_file("{}.png".format(filename))

# Open the image with the default viewer (this form works on Windows).
os.system("{}.png".format(filename))

[Image: the generated word cloud]

The names can be normalized further; see the second code listing.

By default, the wordcloud library renders Chinese text as garbled characters. For a fix, see https://blog.csdn.net/Dick633/article/details/80261233
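
One common remedy, which may or may not be what the linked post describes, is to pass a font that contains Chinese glyphs through the font_path parameter of WordCloud. The sketch below works under that assumption. The font locations in CANDIDATE_FONTS are guesses that vary by system, and find_font and make_cloud are hypothetical helper names.

```python
import os

# Assumed font locations; adjust them for your own system.
CANDIDATE_FONTS = [
    "C:/Windows/Fonts/msyh.ttc",             # Microsoft YaHei (Windows)
    "C:/Windows/Fonts/simhei.ttf",           # SimHei (Windows)
    "/System/Library/Fonts/PingFang.ttc",    # PingFang (macOS)
    "/usr/share/fonts/truetype/wqy/wqy-microhei.ttc",  # WenQuanYi (some Linux)
]


def find_font(candidates=None):
    """Return the first candidate path that exists, or None if none does."""
    if candidates is None:
        candidates = CANDIDATE_FONTS
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def make_cloud(text, out_file, font_path):
    """Segment text with jieba and write a word cloud image to out_file."""
    # Imported here so that find_font works without these packages installed.
    import jieba
    import wordcloud

    segmented = " ".join(jieba.lcut(text))
    cloud = wordcloud.WordCloud(
        font_path=font_path, width=1000, height=800, margin=2
    )
    cloud.generate(segmented)
    cloud.to_file(out_file)
```

With a font found, a call such as make_cloud(txt, "threekingdoms3.png", find_font()) would write the image. If find_font() returns None, font_path has to be set by hand.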

 

Reference: https://blog.csdn.net/weixin_44521703/article/details/93058003


Original article: https://www.cnblogs.com/116970u/p/11611821.html
