
Email tokenization and stopword removal



!pip install nltk


# Read the text to be preprocessed (an email excerpt)
text = "Be assured that individual statistics are not disclosed and this is for internal use only. I am pleased to inform you that you have been accepted to join the workshop scheduled for 22-24 Nov, 2008."
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Preprocessing: tokenize, lowercase, drop stopwords and short tokens, lemmatize
def preprocessing(text):
    # split into sentences, then into word tokens
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    # lowercase and drop tokens shorter than 3 characters
    tokens = [token.lower() for token in tokens if len(token) >= 3]
    # remove English stopwords (the stopword list is lowercase, so lowercase first)
    stops = stopwords.words('english')
    tokens = [token for token in tokens if token not in stops]
    # lemmatize each remaining token
    lmtzr = WordNetLemmatizer()
    tokens = [lmtzr.lemmatize(token) for token in tokens]
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text

preprocessing(text)
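Calling preprocessing(text) returns the cleaned, lemmatized string. As a minimal sketch, assuming the raw emails sit in a plain-text file emails.txt (a hypothetical filename) with one message per line, the same function can be applied to each of them:

# Minimal sketch: preprocess every email in emails.txt (hypothetical file, one message per line)
with open("emails.txt", encoding="utf-8") as f:
    emails = [line.strip() for line in f if line.strip()]
preprocessed_emails = [preprocessing(email) for email in emails]
print(preprocessed_emails[:3])  # inspect the first few cleaned emails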


# Split the dataset
from sklearn.model_selection import train_test_split
# Generate 100 samples: 100 two-dimensional feature vectors with 100 matching labels
x = [["feature ", "one "]] * 50 + [["feature ", "two "]] * 50
y = [1] * 50 + [2] * 50
# Randomly hold out 30% of the data as the test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
print("train:", len(x_train), "test:", len(x_test))
# Inspect the test split
for i in range(len(x_test)):
    print("".join(x_test[i]), y_test[i])
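The same split can then be applied to the preprocessed email texts and their spam/ham labels. A minimal sketch, using a small placeholder list of cleaned emails and labels (both hypothetical), with stratify keeping the class ratio the same in the train and test sets:

# Minimal sketch: split cleaned emails and labels 70/30, keeping class balance with stratify
from sklearn.model_selection import train_test_split
emails = ["win cash prize today", "meeting moved noon", "free lottery ticket", "please review attached report"]  # placeholder cleaned emails
labels = [1, 0, 1, 0]  # placeholder labels: 1 = spam, 0 = ham
x_train, x_test, y_train, y_test = train_test_split(emails, labels, test_size=0.3, random_state=0, stratify=labels)
print("train:", len(x_train), "test:", len(x_test))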


 


Original article: https://www.cnblogs.com/fanfanfan/p/10036777.html
