Chinese Word Frequency Statistics and Word Cloud Generation

Posted: 2017-09-25 13:29:27


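1. I hope the instructor can cover some applications of Python in data mining and data analysis, ideally with concrete examples or a hands-on walkthrough.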

2. Chinese word segmentation

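Download a full-length Chinese novel and convert it to UTF-8 encoding.
Use the jieba library to count Chinese word frequencies and print the TOP20 words with their occurrence counts.
**Exclude meaningless words and merge different forms of the same word.
**Use the wordcloud library to draw a word cloud.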
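
The first step, converting the downloaded novel to UTF-8, can be done in a few lines of Python. This is only a minimal sketch: the helper name convert_to_utf8 is made up for illustration, and the default source encoding GB18030 is an assumption (it is common for Chinese text files, but the actual encoding of your download has to be checked).

```python
from pathlib import Path


def convert_to_utf8(src, dst, src_encoding="gb18030"):
    """Re-encode the text file `src` as UTF-8 and write it to `dst`.

    `src_encoding` is an assumption: GB18030 is a superset of GBK and
    GB2312 and is common for Chinese text files, but the real encoding
    of the downloaded file must be verified.
    """
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return len(text)  # number of characters converted
```

A call such as convert_to_utf8("活着_gbk.txt", "活着.txt") would then produce the UTF-8 file; the input file name here is hypothetical. If the source encoding is wrong, read_text raises UnicodeDecodeError instead of silently writing garbage.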

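import jieba

book = "活着.txt"
# Read the whole novel; the file must already be UTF-8 encoded.
with open(book, "r", encoding='utf-8') as f:
    txt = f.read()

# Words to exclude from the ranking.
ex = {'有庆', '我们', '知道', '看到', '自己', '起来'}

ls = []
words = jieba.lcut(txt)  # segment the text into a list of words
counts = {}
for word in words:
    ls.append(word)
    if len(word) == 1:
        # Skip single-character tokens (mostly particles and punctuation).
        continue
    else:
        counts[word] = counts.get(word, 0) + 1

for word in ex:
    # pop() with a default avoids a KeyError if the word never occurred.
    counts.pop(word, None)

items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
# The exercise asks for the top 20; this run prints the top 10.
for i in range(min(10, len(items))):
    word, count = items[i]
    print("{:<10}{:>5}".format(word, count))

# Save the full segmentation result; the with-block closes the file.
with open('ms.txt', 'w', encoding='utf-8') as wz:
    wz.write(str(ls))

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# WordCloud's default font has no Chinese glyphs; pass font_path=... to render Chinese.
wzhz = WordCloud().generate(txt)
plt.imshow(wzhz)
plt.show()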

Output:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\ADMINI~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.723 seconds.
Prefix dict has been built succesfully.
家珍          575
凤霞          413
二喜          175
队长          166
什么          151
他们          148
一个          145
看着          115
孩子          114
没有          113
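
The top 10 above still contains general-purpose words such as 什么, 他们, 一个 and 没有, and the script only excludes words without merging variants of the same word. One possible way to handle both is sketched below with collections.Counter. The function name count_words, the sample token list and the alias table are all hypothetical and are not taken from the novel's actual segmentation.

```python
from collections import Counter


def count_words(words, exclude=(), merge=None, min_len=2):
    """Count tokens, merging aliases and dropping excluded words."""
    merge = merge or {}
    exclude = set(exclude)
    counts = Counter()
    for word in words:
        word = merge.get(word, word)  # map an alias to its canonical form
        if len(word) < min_len or word in exclude:
            continue  # drop short tokens and excluded words
        counts[word] += 1
    return counts


# Hypothetical tokens standing in for the result of jieba.lcut(txt).
tokens = ["家珍", "的", "孩子", "孩子们", "我们", "家珍", "凤霞"]
# Hypothetical alias table: count "孩子们" together with "孩子".
counts = count_words(tokens, exclude={"我们"}, merge={"孩子们": "孩子"})

# most_common(20) returns at most 20 (word, count) pairs, highest first.
for word, count in counts.most_common(20):
    print("{:<10}{:>5}".format(word, count))
```

Because most_common(20) simply returns fewer pairs when fewer words exist, this also avoids the IndexError that a fixed range() loop can raise on short texts.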

Word cloud result:

[Image: word cloud]
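
For Chinese text, WordCloud generally needs two things that the script does not provide: input that is already split into space-separated words, and a font that contains Chinese glyphs. The following is a rough sketch of that variant, not the code that produced the image above. The helper names to_cloud_text and draw_cloud are made up for illustration, and the font path must be supplied by the reader.

```python
def to_cloud_text(words, exclude=(), min_len=2):
    """Join segmented words with spaces so WordCloud can split them."""
    exclude = set(exclude)
    return " ".join(w for w in words if len(w) >= min_len and w not in exclude)


def draw_cloud(cloud_text, font_path):
    """Render the cloud; requires the wordcloud and matplotlib packages.

    `font_path` must point to a font file with Chinese glyphs; which
    file that is depends on the reader's system.
    """
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    cloud = WordCloud(font_path=font_path, background_color="white",
                      width=800, height=600).generate(cloud_text)
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")  # hide the axes around the image
    plt.show()
```

With the variables from the script, a call could look like draw_cloud(to_cloud_text(words, ex), font_path="C:/Windows/Fonts/simhei.ttf"), where the font location is an assumption about a Windows installation with SimHei present.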

Original article: http://www.cnblogs.com/xypbk/p/7591109.html
