Removing HTML Tags, Chinese and English Punctuation, Digits, and English Words from Text

Posted: 2017-04-22 00:04:27

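Before running Chinese word segmentation and frequency statistics, you usually need to filter out the tags, punctuation marks, English letters, and similar noise contained in the crawled text. This process is called data cleaning.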

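#coding=utf-8
# Written for Python 3: patterns are plain str, so no .decode() calls are needed.
import re

# Match from '<' up to the next '>', skipping every character that is not '>'.
re_html = re.compile(r'<[^>]+>')
# Whitespace plus common ASCII and Chinese punctuation.
re_punc = re.compile(r'[\s+.!/_,$%^*("\']+|[+—！，。？、~@#￥%…&*“”《》：（）]+')
# Digits, ASCII letters and underscores. A bare \w would also match Chinese
# characters in Python 3, so the character class is spelled out.
re_digits_letter = re.compile(r'[A-Za-z0-9_]+')

def strs_filter(file):
    # Read the source file and append the cleaned text to result.txt.
    with open(file, "r", encoding="utf8") as f, open("result.txt", "a", encoding="utf8") as c:
        for line in f:
            line = re_html.sub('', line)
            # \s also removes the trailing newline, so the output is one continuous run of text.
            line = re_punc.sub('', line)
            line = re_digits_letter.sub('', line)
            c.write(line)

strs_filter("strip.txt")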
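
One detail in the digits-and-letters pattern deserves a closer look. The sketch below assumes Python 3, where `\w` on a `str` pattern is Unicode-aware by default and therefore also matches Chinese characters; the `re.ASCII` flag (or an explicit `[A-Za-z0-9_]` class) limits it to ASCII. The sample string is made up for illustration.

```python
import re

sample = "Python3 中文分词 2017"  # hypothetical input, for illustration only

# Default behaviour for str patterns: \w matches any Unicode word character,
# so the Chinese text is removed together with the letters and digits.
unicode_word = re.compile(r"\w+")

# With re.ASCII, \w is limited to [A-Za-z0-9_], so the Chinese text survives.
ascii_word = re.compile(r"\w+", re.ASCII)

removed_all = unicode_word.sub("", sample)
removed_ascii = ascii_word.sub("", sample)

print(repr(removed_all))    # only the two spaces remain
print(repr(removed_ascii))  # the Chinese words remain, with surrounding spaces
```

If the cleaned file comes out empty or nearly empty, an unrestricted `\w` is a likely cause.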

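The code above strips out the content that is irrelevant to Chinese word segmentation and frequency statistics. The original post showed the result as a screenshot:

[Screenshot of the cleaned output]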
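
As a rough illustration of the effect, here is a minimal in-memory sketch of the same three-step cleaning. It assumes Python 3; the helper name `clean_text` and the sample sentence are hypothetical and not part of the original post, and the patterns simply mirror the ones used in the file-based function.

```python
import re

# Same three patterns as the file-based version, compiled once.
RE_HTML = re.compile(r"<[^>]+>")
RE_PUNC = re.compile(r"""[\s+.!/_,$%^*("']+|[+—！，。？、~@#￥%…&*“”《》：（）]+""")
RE_ASCII_WORD = re.compile(r"[A-Za-z0-9_]+")

def clean_text(text):
    """Strip HTML tags, then punctuation and whitespace, then ASCII letters and digits."""
    text = RE_HTML.sub("", text)
    text = RE_PUNC.sub("", text)
    return RE_ASCII_WORD.sub("", text)

sample = "<p>今天是2017年，Hello world！数据清洗很重要。</p>"  # made-up input
cleaned = clean_text(sample)
print(cleaned)  # 今天是年数据清洗很重要
```

Note that the punctuation list is not exhaustive: characters such as `)`, `-`, `?`, `;` and `:` in their ASCII forms are not covered and would pass through unchanged.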


Original post: http://www.cnblogs.com/lovealways/p/6550249.html
