Comprehensive Exercise: Word Frequency Statistics

Posted: 2018-03-28 23:53:17


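Exercise requirements

Download the lyrics of an English song, or an English article.

Save the lyrics to a file, then read them back.

Replace all separators such as , . ? ! ' : with spaces.

Convert all uppercase letters to lowercase.

Generate the word list.

Generate the word frequency counts.

Sort.

Exclude grammatical words: pronouns, articles, conjunctions.

Output the 20 most frequent words (TOP 20).

Save the text to be analyzed as a UTF-8 encoded file, and obtain the content for word frequency analysis by reading that file.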
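The save-then-read step can be sketched as follows. The file name `lyrics.txt` and the sample text are placeholders chosen for illustration; they are not from the original exercise.

```python
# Hypothetical file name and sample text, used only for illustration.
path = "lyrics.txt"
lyrics = "Row, row, row your boat,\nGently down the stream."

# Write the text with an explicit UTF-8 encoding ...
with open(path, "w", encoding="utf-8") as f:
    f.write(lyrics)

# ... and read it back with the same encoding.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

print(text == lyrics)
```

Passing `encoding="utf-8"` on both sides avoids depending on the platform's default encoding.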

# Read the text to analyze from a UTF-8 encoded file.
with open("test.txt", "r", encoding="utf-8") as io:
    news = io.read()

# Separators to replace with spaces, and function words to exclude.
str1 = ",.?!:'？！’"
strList = {'is', 'the', 'to', 'it', 'and', 'oh', 'in'}

# Replace every separator with a space, then lowercase and split into words.
for item in str1:
    news = news.replace(item, " ")
news2 = news.lower().split()

# Count each distinct word once, skipping the excluded words.
wordDict = {}
wordSet = set(news2) - strList
for w in wordSet:
    wordDict[w] = news2.count(w)

# Sort by count in descending order and print the top 20.
wordList = list(wordDict.items())
wordList.sort(key=lambda x: x[1], reverse=True)
newWordList = wordList[:20]
for i in newWordList:
    print(i)
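As an alternative to calling `count` once per word, the same pipeline can be sketched with `collections.Counter`, which tallies all words in a single pass. This is a minimal sketch and not the original author's method: the function name `top_words`, the stop-word set, and the sample sentence are assumptions made for illustration. The regular expression treats every non-letter as a separator, so a contraction such as "don't" is split into "don" and "t".

```python
import re
from collections import Counter

# Hypothetical stop-word set; extend it with the pronouns, articles,
# and conjunctions you want to exclude.
STOP_WORDS = {"is", "the", "to", "it", "and", "oh", "in"}

def top_words(text, stop_words=STOP_WORDS, n=20):
    """Return the n most frequent words in text, excluding stop_words."""
    # Lowercase, then keep runs of ASCII letters; everything else
    # (punctuation, digits, whitespace) acts as a separator.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)

# Made-up sample sentence, used only to exercise the function.
sample = "The cat sat. The cat ran! And the dog sat, too."
for word, count in top_words(sample, n=3):
    print(word, count)
```

`most_common(n)` returns `(word, count)` pairs already sorted by count, so no separate sort step is needed.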

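2. Chinese word frequency statistics

Download a long Chinese article.

Read the text to analyze from a file.

news = open('gzccnews.txt', 'r', encoding='utf-8').read()

Install jieba and use it for Chinese word segmentation.

pip install jieba

import jieba

jieba.lcut(news)

Generate the word frequency counts.

Sort.

Exclude grammatical words: pronouns, articles, conjunctions.

Output the 20 most frequent words (TOP 20), or save the results to a file.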

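#!/usr/bin/python
# -*- coding: UTF-8 -*-
import jieba

# Read the text to analyze from a UTF-8 encoded file.
with open("test2.txt", "r", encoding="UTF-8") as io:
    strList = io.read()

# Segment the text into a list of words with jieba.
wordList = jieba.lcut(strList)

# Count each word. Skipping whitespace and single-character tokens is a
# simple stand-in for excluding punctuation and function words (an
# assumption; the original listing applied no filter).
wordDict = {}
for w in wordList:
    if len(w.strip()) > 1:
        wordDict[w] = wordDict.get(w, 0) + 1

# Sort by count in descending order and print the top 20.
wordList = list(wordDict.items())
wordList.sort(key=lambda x: x[1], reverse=True)
newWordList = wordList[:20]
for i in newWordList:
    print(i)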
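The requirements also allow saving the result to a file instead of printing it. A minimal sketch of that step follows, assuming the text has already been segmented into tokens. The function name `save_top_words`, the output file name `top_words.txt`, and the hand-written token list are all hypothetical; in practice the tokens would come from `jieba.lcut`.

```python
from collections import Counter

def save_top_words(tokens, path, stop_words=frozenset(), n=20):
    """Count tokens, drop blanks and stop words, write the top n to path."""
    counts = Counter(
        t for t in tokens
        if t.strip() and t not in stop_words
    )
    top = counts.most_common(n)
    # One "word<TAB>count" pair per line, UTF-8 encoded.
    with open(path, "w", encoding="utf-8") as f:
        for word, count in top:
            f.write(f"{word}\t{count}\n")
    return top

# Tokens written by hand for illustration, in the shape a segmenter returns.
tokens = ["我", "喜欢", "学习", "，", "我", "喜欢", "编程", "。"]
stop_words = {"我", "，", "。"}
print(save_top_words(tokens, "top_words.txt", stop_words))
```

Keeping the stop words in an explicit set makes it straightforward to add pronouns and conjunctions as they show up in the top results.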


Original article: https://www.cnblogs.com/crx234/p/8658941.html
