码迷,mamicode.com
首页 > 其他好文 > 详细

综合练习:词频统计

时间:2018-03-27 23:57:20      阅读:222      评论:0      收藏:0      [点我收藏+]

标签:png   idt   second   hand   gone   walk   out   class   gpo   

词频统计预处理

下载一首英文的歌词或文章

news = ‘‘‘
She hangs out every day near by the beach  .  Havin’ a HEINEKEN fallin’ asleep She looks so sexy when she’s walking the sand    Nobody ever put a ring on her hand    She is the story the story is she   She sings to the moon and the stars in the sky   Shining from high above you shouldn’t ask why  She is the one that you never forget  She is the heaven-sent angel you met   Oh, she must be the reason why God made a girl She is so pretty all over the world  She puts the rhythm, the beat in the drum  She comes in the morning and the evening she’s gone  Every little hour every second you live   Trust in eternity that’s what she gives   She looks like Marilyn, walks like Suzanne Suzanne  She talks like Monica and Marianne  Monica Marianne  She wins in everything that she might do   And she will respect you forever just you  She is the one that you never forget She is the heaven-sent angel you met    Oh, she must be the reason why God made a girl  She is so pretty all over the world  She is so pretty all over the world  She is so pretty  She is like you and me    Like them like we She is in you and me   She is the one that you never forget   She is the heaven-sent angel you met   Oh, she must be the reason why God made a girl   She is so pretty all over the world  (She is the one) She is the one   (That you never forget) That you never forget  She is the heaven-sent angel you met  She’s the reason (oh she must be the reason) why God made a girl    She is so pretty all over the world (oh...)   Na na na na na ….
‘‘‘

  

将所有,.?!’:等分隔符全部替换为空格

sep = ‘‘‘.,?!‘‘‘
for c in sep:
	news = news.replace(c,‘‘)

  

将所有大写转换为小写

生成单词列表  

wordList = news.lower().split()
for w in wordList:
	print(w)  

技术分享图片

 

生成词频统计

wordDict = {}
wordSet = set(wordList)
for w in wordSet:
	wordDict[w] = wordList.count(w)

  

排序

dictList = list(wordDict.items())
dictList.sort(key = lambda x: x[1], reverse=True)

  

排除语法型词汇,代词、冠词、连词

exclude ={‘she‘,‘the‘,‘you‘,‘a‘,‘and‘,‘why‘,‘that‘,‘one‘,‘she’s‘,‘me‘,‘by‘,‘when‘}
wordSet = set(wordList)-exclude
for w in wordSet:
	wordDict[w] = wordList.count(w)

技术分享图片

 

  

输出词频最大TOP20

for i in range(20):
	print(dictList[i])

  

将分析对象存为utf-8编码的文件,通过文件读取的方式获得词频分析内容。

f = open(‘news.txt‘,‘r‘,encoding=‘utf-8‘)
news = f.read()
f.close()
print(news)

f = open(‘newscount.txt‘,‘a‘)
for i in range(20):
    f.write(dictList[i][0]+‘‘+str(dictList[i][1])+‘\n‘)
f.close()

2.中文词频统计

下载一长篇中文文章。

从文件读取待分析文本。

news = open(‘gzccnews.txt‘,‘r‘,encoding = ‘utf-8‘)

安装与使用jieba进行中文分词。

pip install jieba

import jieba

list(jieba.lcut(news))

生成词频统计

排序

排除语法型词汇,代词、冠词、连词

输出词频最大TOP20(或把结果存放到文件里)

import jieba
f = open(‘princekin.txt‘,‘r‘,encoding=‘utf-8‘)
princekin = f.read()
f.close()
str1 = ‘‘‘“”,。?:、()’!\n   …‘‘‘
exclude = {‘那‘,‘你‘,‘的‘,‘我‘,‘一‘,‘它‘,‘他‘,‘这‘,‘人‘,‘了‘,‘啊‘,‘呢‘,‘唉‘}
jieba.add_word(‘小王子‘)
for i in str1:
    princekin = princekin.replace(i,‘‘)
result = list(jieba.cut(princekin))
wordDict = {}
words = list(set(result)-exclude)
for i in words:
    wordDict[i]= result.count(i)
wordList = list(wordDict.items())
wordList.sort(key = lambda x: x[1], reverse=True)
print(wordList)
f = open(‘saveall.txt‘,‘a‘,encoding=‘utf-8‘)
for i in range(20):
    f.write(wordList[i][0] + ‘‘ + str(wordList[i][1]) + ‘\n‘)
f.close()

  

  技术分享图片

 

 

综合练习:词频统计

标签:png   idt   second   hand   gone   walk   out   class   gpo   

原文地址:https://www.cnblogs.com/yjxblog/p/8658815.html

(0)
(0)
   
举报
评论 一句话评论(0
登录后才能评论!
© 2014 mamicode.com 版权所有  联系我们:gaon5@hotmail.com
迷上了代码!