
Comprehensive Exercise: Word Frequency Statistics

Date: 2018-03-28 00:05:28


1. English word frequency statistics

Download the lyrics of an English song, or an English article.

article = '''An empty street
An empty house
A hole inside my heart
I'm all alone
The rooms are getting smaller
I wonder how
I wonder why
I wonder where they are
The days we had
The songs we sang together
Oh yeah
And oh my love
I'm holding on forever
Reaching for a love that seems so far
So i say a little prayer
And hope my dreams will take me there
Where the skies are blue to see you once again, my love
Over seas and coast to coast
To find a place i love the most
Where the fields are green to see you once again, my love
I try to read
I go to work
I'm laughing with my friends
But i can't stop to keep myself from thinking
Oh no I wonder how
I wonder why
I wonder where they are
The days we had
The songs we sang together
Oh yeah And oh my love
I'm holding on forever
Reaching for a love that seems so far Mark:
To hold you in my arms
To promise you my love
To tell you from the heart
You're all i'm thinking of
I'm reaching for a love that seems so far
So i say a little prayer
And hope my dreams will take me there
Where the skies are blue to see you once again, my love
Over seas and coast to coast
To find a place i love the most
Where the fields are green to see you once again,my love
say a little prayer
dreams will take me there
Where the skies are blue to see you once again '''

  

Replace all separators such as , . ? ! ' : with spaces.

sep = ''':.,?!'''
for i in sep:
    article = article.replace(i, ' ')
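The same replacement can be done in a single pass with `str.translate`. This is a minimal sketch, assuming the same separator set as `sep`; the names `SEP`, `TABLE` and `strip_separators`, and the sample sentence, are illustrative and not part of the original post.

```python
# Same separator characters as `sep` in the post
SEP = ":.,?!"

# Translation table that maps every separator to a single space
TABLE = str.maketrans({ch: " " for ch in SEP})


def strip_separators(text):
    """Return text with each separator character replaced by a space."""
    return text.translate(TABLE)


# Illustrative sample sentence, not taken from the lyrics
sample = "Oh yeah, my love! Why?"
cleaned = strip_separators(sample)
print(cleaned.lower().split())
```

Building the table once avoids rescanning the whole text for every separator, which may matter for longer articles.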

  

Convert all uppercase letters to lowercase.

	
article = article.lower()

  

Generate the word list.

article_list = article.split()
print(article_list)

  

Generate the word frequency counts.

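# Words to exclude; defined here because both methods below use it
exclude = {'the', 'to', 'is', 'and'}

# Method 1: count by iterating over the set of unique words
# article_dict = {}
# article_set = set(article_list) - exclude  # drop duplicates and excluded words
# for w in article_set:
#     article_dict[w] = article_list.count(w)
# # Iterate over the dictionary
# for w in article_dict:
#     print(w, article_dict[w])

# Method 2: count by iterating over the list
article_dict = {}
for w in article_list:
    article_dict[w] = article_dict.get(w, 0) + 1
# Remove unwanted words; pop() with a default avoids a KeyError
# when an excluded word (such as 'is') never occurs in the text
for w in exclude:
    article_dict.pop(w, None)

for w in article_dict:
    print(w, article_dict[w])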
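The counting step can also be written with `collections.Counter` from the standard library. The sketch below is one possible alternative, not the method used in the post; the function name `count_words` and the short word list are made up for illustration.

```python
from collections import Counter


def count_words(words, exclude=frozenset()):
    """Count each word in `words`, skipping anything in `exclude`."""
    return Counter(w for w in words if w not in exclude)


# Illustrative word list, shaped like the output of article.split()
words = "i wonder how i wonder why the days we had".split()
counts = count_words(words, exclude={"the", "to", "is", "and"})

# A Counter returns 0 for words it has never seen
print(counts["wonder"], counts["the"])
```

Filtering while counting means no excluded word ever enters the result, so there is nothing to delete afterwards.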

  

 

Sort by frequency, highest first.

dictList = list(article_dict.items())
dictList.sort(key=lambda x: x[1], reverse=True)

  

Exclude grammatical words: pronouns, articles, and conjunctions.

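exclude = {'the', 'to', 'is', 'and'}
for w in exclude:
    # pop() with a default does not fail if the word is absent or already removed
    article_dict.pop(w, None)
# The sorted list was built before this step, so filter it as well
dictList = [item for item in dictList if item[0] not in exclude]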

  

Output the 20 most frequent words.

# Slicing is safe even when there are fewer than 20 distinct words
for item in dictList[:20]:
    print(item)
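Sorting by count alone leaves words with equal counts in arbitrary relative order. The sketch below shows one way to get a reproducible top-N list by breaking ties alphabetically; the helper name `top_n` and the frequency values are made-up illustrations, not results from the lyrics.

```python
def top_n(freq, n=20):
    """Return up to n (word, count) pairs, highest count first.

    Ties are broken alphabetically so the output is reproducible.
    """
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


# Made-up frequencies for illustration only
freq = {"love": 9, "i": 12, "my": 9, "wonder": 6}
for word, count in top_n(freq, n=3):
    print(word, count)
```

Because the result is a slice, asking for 20 entries from a smaller dictionary simply returns everything available.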

  

Save the text to be analyzed as a UTF-8 encoded file, and obtain the content for the word frequency analysis by reading that file.

# The with statement closes the file automatically
with open("test.txt", "r", encoding='utf-8') as file:
    article = file.read()
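Once the text comes from a file, the earlier steps can be wrapped into one function. The following is a rough sketch under a few assumptions: the function name `word_frequencies` and its parameters are hypothetical, and the demo writes a throwaway file instead of relying on `test.txt` being present.

```python
import os
import tempfile


def word_frequencies(path, sep=":.,?!", exclude=("the", "to", "is", "and")):
    """Read a UTF-8 text file and return a {word: count} dictionary."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    # Replace separators with spaces, as in the steps above
    for ch in sep:
        text = text.replace(ch, " ")
    freq = {}
    for w in text.lower().split():
        if w not in exclude:
            freq[w] = freq.get(w, 0) + 1
    return freq


# Demo on a temporary file so the sketch runs without test.txt
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("I wonder how, I wonder why. To the end!")
    result = word_frequencies(demo_path)

print(result)
```

In the exercise itself the call would presumably be `word_frequencies("test.txt")`.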

  

 

2. Chinese word frequency statistics

Download a long Chinese article.

Read the text to be analyzed from a file.

# Read the file contents into a string; jieba needs text, not a file object
with open('gzccnews.txt', 'r', encoding='utf-8') as f:
    news = f.read()

Install jieba and use it for Chinese word segmentation.

pip install jieba

import jieba

# lcut() already returns a list, so no extra list() call is needed
words = jieba.lcut(news)

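Generate the word frequency counts.

Sort by frequency.

Exclude grammatical words: pronouns, articles, and conjunctions.

Output the 20 most frequent words (or save the result to a file).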
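The post lists these four steps for the Chinese text without code. The sketch below is one possible way to fill them in, not the author's solution. It takes an already segmented token list, such as the one `jieba.lcut` returns, so it runs without jieba installed. The function names, the sample tokens, the tab-separated output format, and the choice to drop single-character tokens are all assumptions made for illustration.

```python
import os
import tempfile
from collections import Counter


def top_chinese_words(tokens, exclude=frozenset(), n=20):
    """Count tokens, skipping excluded words, whitespace and single characters."""
    counts = Counter(
        t for t in tokens
        if len(t.strip()) > 1 and t not in exclude
    )
    # most_common() sorts by count, highest first
    return counts.most_common(n)


def save_top_words(pairs, path):
    """Write one 'word<TAB>count' line per pair, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        for word, count in pairs:
            f.write(f"{word}\t{count}\n")


# Hand-written stand-in for the output of jieba.lcut(news)
tokens = ["学校", "的", "学生", "和", "老师", ",", "学生", "我们", "学生", "老师"]
top = top_chinese_words(tokens, exclude={"我们"}, n=2)
print(top)

# Save the result to a throwaway file and read it back
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "top_words.txt")
    save_top_words(top, out_path)
    with open(out_path, encoding="utf-8") as f:
        saved = f.read()
print(saved)
```

Dropping single-character tokens removes most punctuation and particles cheaply, but it also discards meaningful one-character words, so an explicit stop-word set may be the better choice for a real analysis.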

 

Post the code and screenshots of the results on your blog.


Original article: https://www.cnblogs.com/qq412158152/p/8660824.html
