Python Data Mining: Word Frequency Counting (Implementation)

Date: 2018-10-01 22:36:15


Term frequency: the number of times a given word appears in a document.
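As a minimal illustration of the definition, the sketch below counts words in a document that has already been split into tokens. The token list is made up for the example, and `collections.Counter` stands in for the pandas-based counting used later in this post.

```python
from collections import Counter

# A document that has already been split into words (tokens invented for illustration).
tokens = ["data", "mining", "with", "python", "data", "frames", "data"]

# Counter maps each distinct word to the number of times it occurs.
term_freq = Counter(tokens)

print(term_freq["data"])          # frequency of a single word
print(term_freq.most_common(1))   # the most frequent word and its count
```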

1. Building the corpus

import jieba
jieba.load_userdict("D:\\Python\\Python数据挖掘\\Python数据挖掘实战课程课件\\2.2\\金庸武功招式.txt")

import os
import os.path
import codecs

filePaths=[]
fileContents=[]
for root,dirs,files in os.walk("D:\\Python\\Python数据挖掘\\Python数据挖掘实战课程课件\\2.2\\SogouC.mini\\Sample"):
    for name in files:
        filePath=os.path.join(root,name)
        filePaths.append(filePath)
        f=codecs.open(filePath,"r","utf-8")
        fileContent=f.read()
        f.close()
        fileContents.append(fileContent)
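The file-reading loop above can also be written with a `with` block, so that each file is closed even if reading fails. This is only a sketch: the helper name `read_corpus` and the throwaway directory are assumptions made for the example, not part of the original course material.

```python
import os
import tempfile

def read_corpus(root_dir):
    """Return (paths, contents) for every file under root_dir, read as UTF-8."""
    paths, contents = [], []
    for root, dirs, files in os.walk(root_dir):
        for name in files:
            path = os.path.join(root, name)
            # The with block closes the file automatically.
            with open(path, "r", encoding="utf-8") as f:
                contents.append(f.read())
            paths.append(path)
    return paths, contents

# Build a small temporary directory tree to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sports"))
    with open(os.path.join(tmp, "sports", "1.txt"), "w", encoding="utf-8") as f:
        f.write("first document")
    with open(os.path.join(tmp, "2.txt"), "w", encoding="utf-8") as f:
        f.write("second document")
    demo_paths, demo_contents = read_corpus(tmp)

print(len(demo_paths), sorted(demo_contents))
```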
        
import pandas
corpos=pandas.DataFrame({
                         "filePath":filePaths,
                         "fileContent":fileContents})

# Record which document each segment came from
import jieba

segments=[]
filePaths=[]
for index,row in corpos.iterrows():
    filePath=row["filePath"]
    fileContent=row["fileContent"]
    segs=jieba.cut(fileContent)
    for seg in segs:
        segments.append(seg)
        filePaths.append(filePath)
        
segmentDataFrame=pandas.DataFrame({
                                   "segment":segments,
                                   "filePath":filePaths})

2. Counting word frequencies

import numpy
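# Count how often each segment occurs.
# by= names the column to group on; the [] that follows selects the column to aggregate.
# Dict-style renaming in agg() and DataFrame.sort() no longer exist in current pandas,
# so named aggregation and sort_values() are used instead (pandas 0.25 or later).
segStat=segmentDataFrame.groupby(
            by="segment"
            )["segment"].agg(
            计数="size"                      # "计数" means "count"
            ).reset_index().sort_values(    # reset the index, then sort by count in descending order
            by="计数",
            ascending=False)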

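by=["column name"] specifies the column to group on; rows are grouped according to the values in that column.

The second [] selects the column to aggregate within each group. It can be the grouping column itself.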
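The grouping pattern described above can be tried on a small table. The sketch below uses made-up tokens and file names, and an English column name `count` chosen for the example; it assumes a pandas version that supports named aggregation (0.25 or later).

```python
import pandas

# Toy segment table: one row per token, plus the file it came from (all values invented).
toy = pandas.DataFrame({
    "segment": ["sword", "palm", "sword", "saber", "sword", "palm"],
    "filePath": ["a.txt", "a.txt", "a.txt", "b.txt", "b.txt", "b.txt"],
})

# Group on "segment", aggregate that same column, and name the result column "count".
toy_stat = (
    toy.groupby(by="segment")["segment"]
       .agg(count="size")
       .reset_index()
       .sort_values(by="count", ascending=False)
)

print(toy_stat)
```

After `reset_index()`, `segment` is an ordinary column again, which is what lets the later stop-word filter refer to it by name.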

3. Removing stop words. Many of the counted words are of no interest to us, so they need to be removed.

stopwords=pandas.read_csv(
    "D:\\Python\\Python数据挖掘\\Python数据挖掘实战课程课件\\2.3\\StopwordsCN.txt",    # this file contains the stop words
    encoding="utf-8",
    index_col=False)

fSegStat=segStat[
        ~segStat.segment.isin(stopwords.stopword)]

This uses the isin() method and then negates the result with ~.
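To see what `isin()` and `~` do separately, here is a small sketch on invented data. The frames `seg_stat` and `stop` are stand-ins for `segStat` and `stopwords`, and the English tokens are assumptions made for readability.

```python
import pandas

# Stand-ins for the frequency table and the stop-word list (values invented).
seg_stat = pandas.DataFrame({
    "segment": ["the", "sword", "of", "palm"],
    "count": [50, 7, 30, 4],
})
stop = pandas.DataFrame({"stopword": ["the", "of"]})

# isin() yields a boolean Series: True where the segment is a stop word.
mask = seg_stat["segment"].isin(stop["stopword"])

# ~ flips the booleans, so indexing keeps only the rows that are NOT stop words.
kept = seg_stat[~mask]

print(mask.tolist())
print(kept)
```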

A second approach to segmentation:

import jieba

segments=[]
filePaths=[]

for index,row in corpos.iterrows():
    filePath=row["filePath"]
    fileContent=row["fileContent"]
    segs=jieba.cut(fileContent)
    for seg in segs:
        if seg not in stopwords.stopword.values and len(seg.strip())>0:
            segments.append(seg)
            filePaths.append(filePath)

segmentDataFrame=pandas.DataFrame({
        "segment":segments,
        "filePath":filePaths})

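# agg({...}) renaming and .sort() no longer exist in current pandas;
# named aggregation and sort_values() are used instead (pandas 0.25 or later).
segStat=segmentDataFrame.groupby(
                    by="segment"
                    )["segment"].agg(
                    计数="size"          # "计数" means "count"
                    ).reset_index().sort_values(
                        by="计数",
                        ascending=False)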

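In the second approach, after jieba has segmented the text, an if test keeps only the segments that are not in stopwords and are not blank. The surviving segments are then put into a DataFrame and counted.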
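A possible refinement of this filter, sketched below, is to convert the stop-word column to a Python `set` once, since testing membership in a set does not scan the whole list for every token. The pre-split token lists in `docs` are an assumption that replaces the output of `jieba.cut` so the example runs without jieba or the course files.

```python
import pandas

# Stand-in stop-word table with the same "stopword" column as above (values invented).
stopwords = pandas.DataFrame({"stopword": ["the", "of"]})
stopword_set = set(stopwords["stopword"])   # built once, reused for every token

# Hypothetical pre-segmented documents, standing in for jieba.cut(fileContent).
docs = {
    "a.txt": ["the", "sword", " ", "of", "palm"],
    "b.txt": ["sword", "the", "saber"],
}

segments = []
filePaths = []
for filePath, segs in docs.items():
    for seg in segs:
        # Keep the token only if it is not a stop word and not blank.
        if seg not in stopword_set and len(seg.strip()) > 0:
            segments.append(seg)
            filePaths.append(filePath)

segmentDataFrame = pandas.DataFrame({
    "segment": segments,
    "filePath": filePaths})

print(segmentDataFrame)
```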


Original article: https://www.cnblogs.com/U940634/p/9735946.html
