Common Text Preprocessing Operations

Date: 2018-05-29 12:33:14


This post walks through some operations commonly used in text preprocessing:

1. Convert English text to lowercase

    text = text.lower()
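
As a quick illustration (the sample string is made up), `str.lower()` only affects cased letters, so Chinese characters, digits, and punctuation pass through unchanged:

```python
# Hypothetical sample mixing English and Chinese.
text = "Hello 世界, iPhone 手机"

lowered = text.lower()
print(lowered)  # hello 世界, iphone 手机
```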

2. Word segmentation

    import jieba

    def cut(text):
        # Lowercase the text, then segment it with jieba; an empty string gives an empty list.
        return list(jieba.cut(text.lower())) if text != "" else []
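
A minimal sketch of how the `cut` wrapper might be made easier to test. The `segment` parameter is a hypothetical addition, not part of the original code: it defaults to `jieba.lcut` (which returns a list rather than a generator) but lets any callable that maps a string to a list of tokens be swapped in.

```python
def cut(text, segment=None):
    """Lowercase `text` and split it into tokens.

    `segment` is a hypothetical hook for illustration; when omitted,
    jieba.lcut is used, which requires the third-party jieba package.
    """
    if text == "":
        return []
    if segment is None:
        import jieba  # imported lazily so the stand-in below works without jieba
        segment = jieba.lcut
    return segment(text.lower())


# Stand-in tokenizer for demonstration only: whitespace splitting is NOT
# a substitute for real Chinese word segmentation.
tokens = cut("Hello World", segment=str.split)
print(tokens)  # ['hello', 'world']
```

With jieba installed, calling `cut(text)` without the second argument would run `jieba.lcut` on the lowercased string.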

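3. Noise removal

There are two approaches.

(1) Remove stop words

Stop words include Chinese and English punctuation marks as well as noise words; see Appendix [1].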

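    import codecs
    import jieba
    # Strip only the line ending: strip() would turn the two space entries of the list into "".
    stopwords = set(line.rstrip("\r\n") for line in codecs.open("data/stopwords.txt", "r", encoding="utf-8"))
    def cut_and_remove_stopwords(text):
        return [item for item in jieba.cut(text.lower()) if item not in stopwords] if text != "" else []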
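
The sketch below illustrates one pitfall when loading such a list: the appendix contains lines that are spaces, and `str.strip()` would reduce each of them to the empty string, so stripping only the line terminator keeps them. The function names and the in-memory sample list are made up for the example; a real list would come from a file such as `data/stopwords.txt`.

```python
import io


def load_stopwords(lines):
    """Build a stop word set from an iterable of lines.

    Only the trailing newline characters are removed, so entries that
    consist of a half-width or full-width space are preserved.
    """
    words = set(line.rstrip("\r\n") for line in lines)
    words.discard("")  # ignore blank lines
    return words


def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stop word set."""
    return [tok for tok in tokens if tok not in stopwords]


# Hypothetical miniature stop word file: a comma, a half-width space,
# a full-width space (U+3000), and two noise words.
sample_file = io.StringIO(",\n \n\u3000\n嗯\n的\n")
stopwords = load_stopwords(sample_file)

# Tokens as a segmenter might return them (hand-written here).
tokens = ["嗯", " ", "我", "的", "\u3000", "手机", ","]
print(remove_stopwords(tokens, stopwords))  # ['我', '手机']
```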

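(2) Keep only words that appear in a given vocabulary

This vocabulary is strongly task-dependent; it usually consists of the feature words that the current task cares about most.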

    def cut_and_in_vocabulary(text):
        # Keep only tokens found in the task vocabulary (the `vocabulary` set built from data/vocabulary.txt).
        return [item for item in jieba.cut(text.lower()) if item in vocabulary] if text != "" else []
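
To contrast the two approaches, here is a small sketch on a hand-written token list (no segmenter is called; the tokens, stop words, and vocabulary are all invented for illustration). Stop word removal acts as a blacklist and keeps anything not listed, whereas vocabulary filtering acts as a whitelist and drops anything not listed:

```python
# Invented example data, not taken from the original post.
tokens = ["我想", "查", "一下", "话费", "余额", "谢谢"]
stopwords = {"我想", "谢谢"}
vocabulary = {"话费", "余额", "流量"}

# Blacklist: unknown words such as "查" and "一下" survive.
without_stopwords = [tok for tok in tokens if tok not in stopwords]

# Whitelist: only task-specific feature words survive.
in_vocabulary = [tok for tok in tokens if tok in vocabulary]

print(without_stopwords)  # ['查', '一下', '话费', '余额']
print(in_vocabulary)      # ['话费', '余额']
```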

To make sure segmentation produces the tokens we want, the jieba dictionary usually needs to be adjusted:

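    # Add the task vocabulary to jieba so that these words are kept as single tokens.
    file_vocabulary = "data/vocabulary.txt"
    jieba.load_userdict(file_vocabulary)
    vocabulary = set(line.strip() for line in codecs.open(file_vocabulary, "r", encoding="utf-8"))

    # Delete words from jieba's dictionary that should not come out as single tokens.
    file_jieba_delete_dict = "data/jieba_delete_dict.txt"
    for wd in (line.strip() for line in codecs.open(file_jieba_delete_dict, "r", encoding="utf-8")):
        jieba.del_word(wd)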
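
One detail worth checking when the same file feeds both `jieba.load_userdict` and the `vocabulary` set: jieba's user dictionary format allows an optional frequency and part-of-speech tag after each word, separated by spaces. If the file uses those columns, taking the whole stripped line as the vocabulary entry would never match a token. The following sketch keeps only the first field; `parse_vocabulary` and the sample content are hypothetical, and it assumes that the words themselves contain no spaces.

```python
import io


def parse_vocabulary(lines):
    """Collect the word column from lines in jieba user-dict format.

    Each non-empty line is expected to look like "word [freq] [tag]".
    Assumption for this sketch: words do not contain spaces.
    """
    vocabulary = set()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            vocabulary.add(fields[0])
    return vocabulary


# Hypothetical file content: word only, word + freq, word + freq + tag.
sample_file = io.StringIO("话费\n流量包 5\n云计算 10 n\n\n")
vocabulary = parse_vocabulary(sample_file)
print(sorted(vocabulary))
```

If the vocabulary file only ever lists one bare word per line, the simpler `line.strip()` version gives the same result.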

For details, see fxsjy/jieba: 结巴中文分词 (the jieba Chinese word segmentation repository).

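Appendix [1]: Stop word list (two of its lines are the half-width (English) space and the full-width (Chinese) space, respectively)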

,
.
?
!
　
，
。
？
！
不好意思
抱歉
谢谢
这边
那边
那个
这个
那样
这种
那种
我想
这儿
这样
还
也
额
呃
嗯
噢
那
哎
先
后
啊
哦
吧
呀
啦
哈
诶
咯
恩
阿
呢
吗
的
了

To be continued.

Original article: https://www.cnblogs.com/bymo/p/9104282.html
