码迷,mamicode.com
首页 > 其他好文 > 详细

综合练习:词频统计

时间:2018-03-28 00:02:55      阅读:209      评论:0      收藏:0      [点我收藏+]

标签:down   items   文件   off   ttl   after   add   more   back   

一、英文词频统计

1.下载一首英文的歌词或文章

I said,"dad, you might leave now.” But he looked out of the window and said,” I’m going to buy you some tangerines. You just stay here. Don‘t move around.” I caught sight of several vendors waiting for customers outside the railings beyond a platform. But to reach that platform would require crossing the railway track and doing some climbing up and down. That would be a strenuous job for father, who was fat. I wanted to do all that myself, but he stopped me, so I could do nothing but let him go. I watched him hobble towards the railway track in his black skullcap, black cloth mandarin jacket and dark blue cotton-padded cloth ling gown. He had little trouble climbing down the railway track, but it was a lot more difficult for him to climb up that platform after crossing the railway track. His hands held onto the upper part of the platform, his legs huddled up and his corpulent body tipped slightly towards the left, obviously making an enormous exertion. While I was watching him from behind, tears gushed from my eyes. I quickly wiped them away lest he or others should catch me crying. The next moment when I looked out of the window again, father was already on the way back, holding bright red tangerines in both hands. In crossing the railway track, he first put the tangerines on the ground, climbed down slowly and then picked them up again. When he came near the train, I hurried out to help him by the hand. After boarding the train with me, he laid all the tangerines on my overcoat, and patting the dirt off his clothes, he looked somewhat relieved and said after a while,” I must be going now. Don’t forget to write me from Beijing!” I gazed after his back retreating out of the carriage. After a few steps, he looked back at me and said, "Go back to your seat. Don’t leave your things alone." I, however, did not go back to my seat until his figure was lost among crowds of people hurrying to and fro and no longer visible. My eyes were again wet with tears.

2.将所有,.?!’:等分隔符全部替换为空格

sep = ‘‘‘:.,?!‘‘‘
for i in sep:
    article = article.replace(i,‘ ‘);

 3.将所有大写转换为小写

article = article.lower();

  4.生成单词列表

article_list = article.split();
print(article_list);

 5.生成词频统计

article_dict={}
for w in article_list:
    article_dict[w] = article_dict.get(w,0)+1
for w in exclude:
    del (article_dict[w]);
 
for w in article_dict:
    print(w,article_dict[w])  

  6.排序

dictList = list(article_dict.items())
dictList.sort(key=lambda x:x[1],reverse=True);  

  7.排除语法型词汇,代词、冠词、连词

exclude = {‘the‘,‘to‘,‘is‘,‘and‘}
for w in exclude:
    del (article_dict[w]); 

  8.输出词频最大TOP20

for i in range(20):
     print(dictList[i])  

  9.将分析对象存为utf-8编码的文件,通过文件读取的方式获得词频分析内容。

file =  open("test.txt", "r",encoding=‘utf-8‘);
article = file.read();
file.close()

  

 

 

二、中文词频统计,下载一长篇中文文章。

import jieba
 
#打开文件
file = open("zgsjtl.txt",‘r‘,encoding="utf-8")
notes = file.read();
file.close();
 
#替换标点符号
sep = ‘‘‘:。,?!;∶ ...“”‘‘‘
for i in sep:
    notes = notes.replace(i,‘ ‘);
 
notes_list = list(jieba.cut(notes));
 
 

exclude =[‘ ‘,‘\n‘,‘你‘,‘嗯‘,‘他‘,‘和‘,‘但‘,‘啊‘,‘的‘,‘来‘,‘是‘,‘去‘,‘在‘,‘上‘,‘走‘]
 
notes_dict={}
for w in notes_list:
    notes_dict[w] = notes_dict.get(w,0)+1
 
for w in exclude:
    del (notes_dict[w]);
 
for w in notes_dict:
    print(w,notes_dict[w])
 
 

dictList = list(notes_dict.items())
dictList.sort(key=lambda x:x[1],reverse=True);
print(dictList)
 

for i in range(20):
    print(dictList[i])

outfile = open("top20.txt","a")
for i in range(20):
    outfile.write(dictList[i][0]+" "+str(dictList[i][1])+"\n")
outfile.close();

  

 

综合练习:词频统计

标签:down   items   文件   off   ttl   after   add   more   back   

原文地址:https://www.cnblogs.com/jiesheng/p/8660763.html

(0)
(0)
   
举报
评论 一句话评论(0
登录后才能评论!
© 2014 mamicode.com 版权所有  联系我们:gaon5@hotmail.com
迷上了代码!