Chinese Word Frequency Statistics

Time: 2017-09-29 23:08:55

Chinese Word Segmentation

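  1. Download a long Chinese novel and convert it to UTF-8 encoding.
  2. Use the jieba library to count Chinese word frequencies, and output the top 20 words with their counts.
  3. Exclude some meaningless words and merge variants of the same word.
  4. Give a brief interpretation of the word-frequency results.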
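
Step 1 (converting the novel to UTF-8) is not shown in the post. A minimal sketch of one way to do it follows. The helper name `convert_to_utf8` and the default source encoding `gb18030` are assumptions made for illustration; the post does not say how the original file was encoded.

```python
from pathlib import Path


def convert_to_utf8(src, dst, src_encoding="gb18030"):
    """Re-encode a text file as UTF-8 and return the number of characters written."""
    # Assumption: the source file is GB18030/GBK-encoded, which is common for
    # Chinese text files; pass a different src_encoding if that is not the case.
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return len(text)
```

A call such as `convert_to_utf8('raw.txt', 'xiaoshuo.txt')` (both file names hypothetical) would then produce a file that can be opened with `encoding='utf-8'`.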
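
    import jieba

    # Read the text to analyze (the file must be UTF-8 encoded)
    with open('D:\\xiaoshuo.txt', 'r', encoding='utf-8') as book:
        text = book.read()

    # Strip punctuation and whitespace before segmentation
    for ch in ',。!、 \n“”;:':
        text = text.replace(ch, '')

    # Segment the text and keep the distinct words
    words = set(jieba.cut(text))

    # Count dictionary: occurrences of each word longer than one character.
    # Note: text.count() counts substring matches, not segmented tokens.
    counts = {}
    for w in words:
        if len(w) > 1:
            counts[w] = text.count(w)

    # Sort by count, descending, and print the top 20 (or fewer)
    items = sorted(counts.items(), key=lambda x: x[1], reverse=True)
    for item in items[:20]:
        print(item)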
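
The script above only filters out single-character words; step 3 (excluding meaningless words and merging variants of the same word) is not implemented in it. One possible approach is sketched below using `collections.Counter`, which counts segmented tokens rather than substring matches. The function `top_words`, its parameters, and the sample token list, stop word, and merge mapping are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def top_words(tokens, stop_words=(), merge_map=None, n=20):
    """Count tokens, skipping stop words and merging variants of the same word."""
    merge_map = merge_map or {}
    stop = set(stop_words)
    counts = Counter()
    for tok in tokens:
        tok = tok.strip()
        tok = merge_map.get(tok, tok)  # map a variant onto its canonical form
        if len(tok) < 2 or tok in stop:
            continue  # drop single characters, empty strings, and stop words
        counts[tok] += 1
    return counts.most_common(n)


# Hypothetical pre-segmented tokens standing in for the output of jieba.cut(text)
tokens = ["父亲", "的", "背影", "爸爸", "终于", "父亲", "背影", "终于"]
result = top_words(tokens, stop_words={"终于"}, merge_map={"爸爸": "父亲"}, n=2)
print(result)  # [('父亲', 3), ('背影', 2)]
```

With jieba installed, the same function could be called as `top_words(jieba.cut(text), stop_words=..., merge_map=...)`, since `jieba.cut` yields the tokens one by one.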

    Python 3.6.2 (v3.6.2:5fd33b5, Jul 8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)] on win32
    Type "copyright", "credits" or "license()" for more information.
    >>>
    ============================= RESTART: D:/daa.py =============================
    Building prefix dict from the default dictionary ...
    Dumping model to file cache C:\Users\asus\AppData\Local\Temp\jieba.cache
    Loading model cost 1.306 seconds.
    Prefix dict has been built succesfully.
    ('父亲', 10)
    ('背影', 4)
    ('丧事', 3)
    ('北京', 3)
    ('散文', 3)
    ('茶房', 3)
    ('那年', 2)
    ('父母', 2)
    ('踌躇', 2)
    ('朱自清', 2)
    ('要紧', 2)
    ('终于', 2)
    ('日子', 2)
    ('一会', 2)
    ('一半', 2)
    ('子女', 2)
    ('描写', 2)
    ('回家', 2)
    ('不必', 2)
    ('为了', 2)
    >>>

Original article: http://www.cnblogs.com/xiepingjian/p/7612830.html
