Data Mining: Building a Corpus

Date: 2018-10-01 21:05:15

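Corpus: the collection of all the documents we want to analyze.

This example uses the corpus provided by Sogou Labs. It includes a classlist file that lists the file numbers and their category names.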

1. Import the modules

import os   
import os.path

filePaths=[]    # empty list that will hold the paths of the corpus files
for root,dirs,files in os.walk(     
    "D:\\Python\\Python数据挖掘\\2.1\\SogouC.mini\\Sample"):
    for name in files:
        filePaths.append(os.path.join(root,name))

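os.walk takes this directory as its argument and traverses every file under it. For each directory it visits, it yields a tuple of three values: the first, root, is the directory currently being visited; the second, named dirs here, is the list of subdirectories inside root; the third, named files here, is the list of files inside root.

Join the directory and the file name into a full path (os.path.join takes care of the different path separators on different operating systems):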
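To make the three values concrete, here is a minimal sketch that runs os.walk on a small temporary directory tree. The folder and file names (C000007, 10.txt, and so on) are made up for illustration and are not taken from the Sogou corpus.

```python
import os
import tempfile

def list_files(top):
    """Return the sorted full paths of every file under top."""
    paths = []
    # os.walk yields one (root, dirs, files) tuple per directory it visits
    for root, dirs, files in os.walk(top):
        for name in files:
            paths.append(os.path.join(root, name))
    return sorted(paths)

# Build a throwaway tree: two hypothetical category folders with one file each
with tempfile.TemporaryDirectory() as top:
    for folder, name in [("C000007", "10.txt"), ("C000008", "11.txt")]:
        os.makedirs(os.path.join(top, folder))
        with open(os.path.join(top, folder, name), "w", encoding="utf-8") as f:
            f.write("sample text")
    found = list_files(top)
    # paths relative to the top directory, easier to read than the full paths
    relative = [os.path.relpath(p, top) for p in found]
```

Sorting the result is optional; it is done here only because the order in which os.walk lists the entries of a directory is not guaranteed.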

os.path.join(root,name)
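As a small illustration of why os.path.join is preferable to gluing strings together by hand, the sketch below joins the same parts both ways and compares the results. The folder and file names are hypothetical.

```python
import os

root = os.path.join("SogouC.mini", "Sample", "C000007")  # hypothetical folder
name = "10.txt"                                          # hypothetical file

joined = os.path.join(root, name)

# os.path.join inserts the separator of the current operating system,
# so the result matches a manual join on os.sep
manual = os.sep.join(["SogouC.mini", "Sample", "C000007", "10.txt"])

# the last component of the joined path is still the file name
base = os.path.basename(joined)
```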

2. Read the contents of the files found in step 1 into memory

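import os
import codecs

filePaths=[]
fileContents=[]
filenames=[]
for root,dirs,files in os.walk(
    "D:\\Python\\Python数据挖掘\\2.1\\SogouC.mini\\Sample"):
    for name in files:
        filePath=os.path.join(root,name)
        filePaths.append(filePath)
        # the with block closes the file once its content has been read
        with codecs.open(filePath,"r",encoding="utf-8") as f:
            fileContents.append(f.read())
        # assumption: the original code never fills filenames; the file name
        # is stored here so that all three lists have the same length in step 3
        filenames.append(name)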

codecs.open(filePath,mode,encoding) opens the file, and the file object's read() method then returns its whole content as one string.
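A minimal round-trip sketch of reading with codecs.open: it writes a short UTF-8 file into a temporary directory and reads it back. The file name and the text are made up for illustration.

```python
import codecs
import os
import tempfile

def read_text(path, encoding="utf-8"):
    """Read a whole file into one string and close it afterwards."""
    with codecs.open(path, "r", encoding=encoding) as f:
        return f.read()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")   # hypothetical file name
    with codecs.open(path, "w", encoding="utf-8") as f:
        f.write("语料库 corpus")
    content = read_text(path)
```

In Python 3 the built-in open(path, "r", encoding="utf-8") can be used in much the same way.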

3. Turn the contents that were read into a data frame

import pandas
corpus=pandas.DataFrame({
        "filePath":filePaths,
        "fileContent":fileContents,
        "class":filenames})
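The sketch below builds the same kind of data frame from short hand-written lists. As one possible way to fill the class column, it takes the name of each file's parent folder. The sample paths and texts are invented, and using the parent folder name as the class is an assumption, not something the original article states.

```python
import os
import pandas

# invented sample data standing in for the lists built in step 2
filePaths = [
    os.path.join("Sample", "C000007", "10.txt"),
    os.path.join("Sample", "C000008", "11.txt"),
]
fileContents = ["text of the first file", "text of the second file"]

# assumption: the name of the parent folder serves as the class label
classes = [os.path.basename(os.path.dirname(p)) for p in filePaths]

# every column must have the same number of entries,
# otherwise pandas raises a ValueError
corpus = pandas.DataFrame({
    "filePath": filePaths,
    "fileContent": fileContents,
    "class": classes,
})
```

Each row of the data frame then describes one document: where it is stored, what it contains, and which class it belongs to.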


Original article: https://www.cnblogs.com/U940634/p/9735681.html
