
Comprehensive Exercise: Word Frequency Count

Date: 2018-03-27 18:38:00


Download the lyrics of an English song or an English article.

Generate the word frequency count.

news = '''At the same time, the market of TV dramas has also maintained rapid development. In 2017, the production volume of TV dramas in China reaches 310 and 13,000 sets, and continues to be the no.1 in the world. The "national treasure", "national treasure", "if the national treasure can talk" and other TV variety shows, documentaries, vividly spread the excellent Chinese traditional culture.
With modern technology, traditional culture is rejuvenated. Hangzhou songcheng group with new technology to interpret ancient Chinese traditional story, the Qingdao publishing group is using virtual reality, 3 d printing technology, the audience can feel the charm of traditional culture anytime and anywhere.
In recent years, China's cultural industry has been growing rapidly, and the pace of "going out" has been accelerating. As of last year, China's publishing enterprises set up more than 400 branches overseas and established cooperative partnership with over 500 publishing institutions in over 70 countries. People's day boat publishing co., LTD. Was set up in less than two years, has published "Chinese traditional festival" (in Arabic), "in a pocket of father" (French version) and so on more than 40 foreign language books.
 '''
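# Punctuation treated as word separators. Parentheses are added here so that
# tokens such as "(in" and "arabic)" are not counted as separate words.
sep = ''',.;:'"()'''
for c in sep:
    news = news.replace(c, ' ')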

wordlist = news.lower().split()

wordDict = {}
for w in wordlist:
    wordDict[w] = wordDict.get(w, 0) + 1
# Alternative: count each unique word with list.count() (simpler, but slower on long texts).
# wordSet = set(wordlist)
# for w in wordSet:
#     wordDict[w] = wordlist.count(w)
for w in wordDict:
    print(w, wordDict[w])
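
The counting loop above can also be written with `collections.Counter` from the standard library. This is a minimal sketch rather than the original author's code; the helper name `count_words` and the sample sentence are assumptions made for illustration.

```python
from collections import Counter


def count_words(text, separators=",.;:'\""):
    """Return a Counter mapping each lowercased word to its frequency."""
    # Replace each separator with a space, as in the loop above.
    for c in separators:
        text = text.replace(c, " ")
    return Counter(text.lower().split())


# Hypothetical sample text, used only to show the behaviour.
sample = "The market of TV dramas: the market grows, and the pace grows."
counts = count_words(sample)
print(counts["the"])     # 3
print(counts["market"])  # 2
```

A `Counter` behaves like a dict, so the `for w in wordDict` printing loop works on it unchanged.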

  

Sorting

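# Recount with list.count() over the unique words. This gives the same result
# as the wordDict.get() loop, but scans the whole list once per word.
wordSet = set(wordlist)
for w in wordSet:
    wordDict[w] = wordlist.count(w)
# Convert to a list of (word, count) pairs and sort by count, highest first.
dictList = list(wordDict.items())
dictList.sort(key=lambda x: x[1], reverse=True)

print(dictList)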
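
Sorting by count alone leaves words with equal counts in arbitrary order. One possible refinement, sketched here under the assumption that alphabetical order is an acceptable tie-breaker, is to sort on a `(-count, word)` key. The function name `sort_by_frequency` and the example dictionary are hypothetical.

```python
def sort_by_frequency(word_counts):
    """Return (word, count) pairs, highest count first, ties in alphabetical order."""
    return sorted(word_counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical counts, used only to show the ordering.
example = {"tv": 2, "culture": 3, "china": 2, "traditional": 3}
ranked = sort_by_frequency(example)
print(ranked)
# [('culture', 3), ('traditional', 3), ('china', 2), ('tv', 2)]
```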

  

Exclude grammatical words: pronouns, articles, and conjunctions.

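exclude = {'the', 'a', 'an', 'and', 'of', 'with', 'to', 'by', 'am', 'are', 'is', 'which', 'on'}
wordSet = set(wordlist) - exclude
# Start from an empty dict; otherwise the excluded words counted earlier would remain in it.
wordDict = {}
for w in wordSet:
    wordDict[w] = wordlist.count(w)
dictList = list(wordDict.items())
dictList.sort(key=lambda x: x[1], reverse=True)

print(dictList)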
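
The exclusion step can also be done while counting, which avoids building a set difference first. The sketch below is one way to do that with `collections.Counter`; the names `STOP_WORDS` and `count_without_stop_words` and the sample word list are assumptions made for illustration.

```python
from collections import Counter

# Same words as the exclude set above; extend as needed.
STOP_WORDS = {"the", "a", "an", "and", "of", "with", "to", "by",
              "am", "are", "is", "which", "on"}


def count_without_stop_words(words, stop_words=STOP_WORDS):
    """Count only the words that are not in stop_words."""
    return Counter(w for w in words if w not in stop_words)


# Hypothetical sample, already lowercased and split.
words = "the charm of traditional culture and the traditional story".split()
filtered = count_without_stop_words(words)
print(filtered["traditional"])  # 2
print("the" in filtered)        # False
```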

 

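Output the 20 most frequent words, save the text to be analyzed as a UTF-8 encoded file, and obtain the content for word frequency analysis by reading it from that file.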

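# Print the 20 most frequent words. Slicing avoids an IndexError
# when the text contains fewer than 20 distinct words.
for item in dictList[:20]:
    print(item)


print('author:xujinpei')
# Read the text to analyze from a UTF-8 encoded file.
with open('news.txt', 'r', encoding='utf-8') as f:
    news = f.read()
print(news)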
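
The listing above only reads `news.txt`; it does not show the file being written or the analysis being run on what was read. A minimal end-to-end sketch follows, assuming a temporary directory and a short stand-in text so that it runs without any existing file. The path and the sample text are hypothetical.

```python
import os
import tempfile
from collections import Counter

# Stand-in text; in the exercise this would be the downloaded article.
text = "Traditional culture is rejuvenated. Traditional culture spreads."

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "news.txt")

    # Save the text with an explicit UTF-8 encoding.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Read it back with the same encoding before analysing it.
    with open(path, "r", encoding="utf-8") as f:
        news = f.read()

# Same separator handling as in the exercise, then take the top 20.
for c in ",.;:'\"":
    news = news.replace(c, " ")
top = Counter(news.lower().split()).most_common(20)
for word, count in top:
    print(word, count)
```

Passing `encoding="utf-8"` on both the write and the read keeps the result independent of the platform's default encoding.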

  


Original article: https://www.cnblogs.com/zhongchengzhe/p/8658569.html
