Python Data Mining: Chinese Word Segmentation

Date: 2018-10-01 22:12:43

Chinese word segmentation splits a sequence of Chinese characters into individual words.

Install the segmentation module: pip install jieba

In specialized domains the default dictionary may not contain the terms you need. Call add_word() to add such a term to jieba's dictionary.

A more efficient approach: load an entire word list saved in a txt file into the user dictionary in one call.

import jieba
jieba.load_userdict("D:\\Python\\Python数据挖掘\\Python数据挖掘实战课程课件\\2.2\\金庸武功招式.txt")

1. Build the corpus

import os
import codecs

import pandas

filePaths=[]
fileContents=[]
# Walk the sample directory and read every file as UTF-8
for root,dirs,files in os.walk("D:\\Python\\Python数据挖掘\\Python数据挖掘实战课程课件\\2.2\\SogouC.mini\\Sample"):
    for name in files:
        filePath=os.path.join(root,name)
        filePaths.append(filePath)
        # The with block closes the file even if read() raises
        with codecs.open(filePath,"r","utf-8") as f:
            fileContent=f.read()
        fileContents.append(fileContent)

# One row per file: its path and its full text
corpos=pandas.DataFrame({
                         "filePath":filePaths,
                         "fileContent":fileContents})
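The same step can be tried without the course data. The sketch below wraps the directory walk in a helper and runs it on a throwaway directory; build_corpus, the folder names, and the sample texts are hypothetical and only serve the demonstration.

```python
import os
import tempfile

import pandas


def build_corpus(root_dir, encoding="utf-8"):
    """Read every file under root_dir into a DataFrame, one row per file."""
    file_paths = []
    file_contents = []
    for root, dirs, files in os.walk(root_dir):
        for name in files:
            file_path = os.path.join(root, name)
            # The with block closes the file even if reading fails.
            with open(file_path, "r", encoding=encoding) as f:
                file_contents.append(f.read())
            file_paths.append(file_path)
    return pandas.DataFrame(
        {"filePath": file_paths, "fileContent": file_contents}
    )


# Demo on a temporary directory holding two hypothetical sample files.
samples = [("C1", "a.txt", "今天天气很好"), ("C2", "b.txt", "我喜欢数据挖掘")]
with tempfile.TemporaryDirectory() as tmp:
    for sub, name, text in samples:
        os.makedirs(os.path.join(tmp, sub), exist_ok=True)
        with open(os.path.join(tmp, sub, name), "w", encoding="utf-8") as f:
            f.write(text)
    demo_corpus = build_corpus(tmp)

print(demo_corpus.shape)
```

Checking the shape is a quick way to confirm that every file produced exactly one row.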

2. Record which article each segment comes from

import jieba

segments=[]
filePaths=[]
# iterrows() yields (index, row); row is a pandas Series indexed by column name, much like a dict
for index,row in corpos.iterrows():
    filePath=row["filePath"]
    fileContent=row["fileContent"]
    segs=jieba.cut(fileContent)   # call cut() to segment the file content
    for seg in segs:
        segments.append(seg)
        filePaths.append(filePath)
        
segmentDataFrame=pandas.DataFrame({
                                   "segment":segments,
                                   "filepath":filePaths})

The DataFrame's iteration method returns each row of the corpus in turn, with the column names serving as keys.

I looked up how iterrows() works:

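iterrows() yields tuples of the form (index, row).
In the code above, the for loop declares two variables, index and row, so each yielded tuple is unpacked into them: index receives the row label and row receives the row data.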
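The unpacking can be seen directly by pulling a single item out of iterrows(). This is a small sketch on a made-up DataFrame; the index labels 10 and 20 are chosen only to show that index is the row label, not a position.

```python
import pandas

# A tiny stand-in for the corpus DataFrame (hypothetical contents).
df = pandas.DataFrame(
    {"filePath": ["a.txt", "b.txt"],
     "fileContent": ["first text", "second text"]},
    index=[10, 20],
)

# Each item yielded by iterrows() is a 2-tuple.
first_item = next(df.iterrows())
index, row = first_item  # same unpacking as in the for loop header

print(type(first_item).__name__)  # tuple
print(index)                      # 10
print(type(row).__name__)         # Series
print(row["filePath"])            # a.txt
```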

Original post: https://www.cnblogs.com/U940634/p/9735869.html
