Python Show-Me-the-Code Problem 0006: The Most Important Word

Date: 2015-04-21 18:07:15

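Problem 0006: You have a directory holding a month of your diary entries, all .txt files. To sidestep word-segmentation issues, assume the content is entirely in English. For each diary entry, report the word you consider most important.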

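Approach: switch to the target directory, iterate over its .txt files, use a regular expression to match words and numbers, and let Counter compute the word frequencies. The most frequent word left after excluding stop words is taken to be the most important one. Note that the code sums the counts over all files, so it reports a single word for the whole directory rather than one per entry.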

Note: stop words are high-frequency words such as a/an/and/are/then. They heavily distort any scoring formula based on word frequency, so they need to be filtered out.
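To make the effect of stop words concrete, here is a small Python 3 sketch on a made-up sentence. The sample text and the names `sample` and `stop_words` are illustrative assumptions and do not come from the original solution; the stop-word set is deliberately tiny and is not a standard list.

```python
import re
from collections import Counter

# Hypothetical sample text, made up for this illustration
sample = "The cat sat on the mat and the cat ate the fish in the garden"

# A small illustrative stop-word set (an assumption, not a standard list)
stop_words = {"the", "in", "of", "and", "to", "a", "an", "on"}

# Lowercase first so that "The" and "the" count as the same word
words = re.findall(r"[a-z]+", sample.lower())

raw_counts = Counter(words)
filtered_counts = Counter(w for w in words if w not in stop_words)

top_raw = raw_counts.most_common(1)[0][0]
top_filtered = filtered_counts.most_common(1)[0][0]

print(top_raw)       # the stop word "the" dominates the raw counts
print(top_filtered)  # the content word "cat" surfaces after filtering
```

Without filtering, the top-ranked word says nothing about the text; with filtering, the ranking reflects what the sentence is about.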

Part of the code is reused from the word-counting solution to Show-Me-the-Code Problem 4.

0006.最重要的词.py

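#!/usr/bin/env python3
# coding: utf-8
import os
import re
from collections import Counter

# Directory that contains the target files
FILE_PATH = '/home/bill/Desktop'

def getCounter(articlefilesource):
    '''Count how often each word occurs in an English plain-text file.'''
    # Words, or numbers with an optional leading $ or trailing %.
    # The original pattern ended in "$", which restricted number matches
    # to the very end of the file.
    pattern = r'''[A-Za-z]+|\$?\d+%?'''
    with open(articlefilesource) as f:
        # Lowercase first so that "The" and "the" are counted as one word
        r = re.findall(pattern, f.read().lower())
        return Counter(r)

# Stop words to filter out
stop_word = ['the', 'in', 'of', 'and', 'to', 'has', 'that', 's', 'is', 'are', 'a', 'with', 'as', 'an']

def run(FILE_PATH):
    # Switch to the directory that contains the target files
    os.chdir(FILE_PATH)
    # Walk the .txt files in that directory and sum their word counts
    total_counter = Counter()
    for i in os.listdir(os.getcwd()):
        if os.path.splitext(i)[1] == '.txt':
            total_counter += getCounter(i)
    # Remove the stop words (deleting a missing key from a Counter is safe)
    for i in stop_word:
        del total_counter[i]
    # Print the most frequent remaining word, if there is one
    if total_counter:
        print(total_counter.most_common(1)[0][0])

if __name__ == '__main__':
    run(FILE_PATH)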

I tested the script on a few news articles picked at random from the BBC's China section.

Output: [screenshot of the result, not preserved]
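The problem statement asks for the most important word of each diary entry, whereas run() reports one word for the whole directory. A per-file variant could look roughly like the Python 3 sketch below. The helper names `top_word` and `top_word_per_file` and the constant `STOP_WORDS` are hypothetical, introduced only for this illustration; the sketch matches words only (no numbers) and assumes the files are UTF-8 encoded.

```python
import os
import re
from collections import Counter

# Same stop words as in the script above, stored as a set for fast lookup
STOP_WORDS = {"the", "in", "of", "and", "to", "has", "that", "s",
              "is", "are", "a", "with", "as", "an"}

def top_word(text, stop_words=STOP_WORDS):
    """Return the most frequent non-stop word in text, or None if there is none."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(1)[0][0] if counts else None

def top_word_per_file(directory):
    """Map each .txt file name in directory to its most frequent non-stop word."""
    result = {}
    for name in sorted(os.listdir(directory)):
        if os.path.splitext(name)[1] == ".txt":
            # Build the full path instead of changing the working directory
            with open(os.path.join(directory, name), encoding="utf-8") as f:
                result[name] = top_word(f.read())
    return result
```

Calling `top_word_per_file(FILE_PATH)` returns a dictionary that can be printed one entry at a time. Building paths with `os.path.join` avoids the `os.chdir` side effect, which is a design preference rather than a requirement.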

Tags: python, regular expressions, text

Original post: http://blog.csdn.net/huangxiongbiao/article/details/45154445
