Natural Language Processing: Stemmers

Date: 2017-06-08 22:25:16

This post introduces two ready-made stemmers that ship with nltk: Porter and Lancaster.

1. Porter

>>> import nltk
>>> porter = nltk.PorterStemmer()
>>> raw = '''Listen, strange women lying in ponds distributing swords is no basis
... for a system of government. Supreme executive power derives from a mandate from
... the masses, not from some farcical aquatic'''
>>> tokens = nltk.word_tokenize(raw)
>>> [porter.stem(t) for t in tokens]
['listen', ',', 'strang', 'women', 'lie', 'in', 'pond', 'distribut', 'sword',
'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem',
'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from', 'the', 'mass', ',',
'not', 'from', 'some', 'farcic', 'aquat']
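The session above relies on nltk.word_tokenize, which depends on tokenizer data that has to be downloaded separately. As a minimal sketch, assuming a plain regular expression is an acceptable stand-in for the tokenizer on this kind of text, the stemmer can be tried without any extra data. The helper name simple_tokenize is hypothetical and not part of nltk.

```python
import re

import nltk


def simple_tokenize(text):
    """Hypothetical helper: split text into word runs and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


raw = ("Listen, strange women lying in ponds distributing swords "
       "is no basis for a system of government.")

porter = nltk.PorterStemmer()

# Stemming works token by token; punctuation passes through unchanged.
stems = [porter.stem(t) for t in simple_tokenize(raw)]
print(stems)
```

The regular expression is only a rough approximation of word_tokenize; it is used here so that the example needs nothing beyond the nltk package itself.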

2. Lancaster

>>> lancaster = nltk.LancasterStemmer()
>>> [lancaster.stem(t) for t in tokens]
['list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut', 'sword',
'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem',
'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not',
'from', 'som', 'farc', 'aqu']

3. Lemmatizer: removes an affix only when the resulting word is in its dictionary; a commonly used one is WordNetLemmatizer

>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(t) for t in tokens]
['Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond', 'distributing',
'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government', '.',
'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from',
'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic']
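One reason verb forms such as lying and derives come back unchanged is that lemmatize() treats every word as a noun unless a pos argument is given. The following is a minimal sketch of passing a part of speech per token. The mapping table, the function lemmatize_tagged, and the hand-written tags in demo() are assumptions made for illustration; they do not come from the original post.

```python
# Map the first letter of a Penn Treebank tag to the one-letter part-of-speech
# codes accepted by WordNetLemmatizer.lemmatize(word, pos=...).
PENN_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def lemmatize_tagged(lemmatizer, tagged_tokens):
    """Lemmatize (word, penn_tag) pairs, falling back to noun for other tags."""
    lemmas = []
    for word, tag in tagged_tokens:
        pos = PENN_TO_WORDNET.get(tag[:1], "n")
        lemmas.append(lemmatizer.lemmatize(word, pos=pos))
    return lemmas


def demo():
    # Needs the WordNet corpus to be installed, e.g. via nltk.download('wordnet').
    import nltk

    wnl = nltk.WordNetLemmatizer()
    # Tags written by hand for illustration (assumption, not tagger output).
    tagged = [("women", "NNS"), ("lying", "VBG"), ("derives", "VBZ")]
    print(lemmatize_tagged(wnl, tagged))
```

Calling demo() prints the lemmas for the three hand-tagged tokens. In a real pipeline the tags would normally come from a part-of-speech tagger instead of being typed in.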

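As the results above show, the Porter stemmer does better on this text: it maps lying to lie, whereas Lancaster leaves lying unchanged and truncates women, power, and derives to wom, pow, and der.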
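To make the comparison easier to read, the outputs recorded in the three sessions can be lined up for a few tokens where the tools disagree. This is a small sketch that only tabulates values copied from the results above; the dictionary recorded and the helper changed_by are hypothetical names, and nothing here calls nltk.

```python
# Outputs copied from the three sessions above for a handful of tokens.
recorded = {
    "women":   {"porter": "women", "lancaster": "wom",   "wordnet": "woman"},
    "lying":   {"porter": "lie",   "lancaster": "lying", "wordnet": "lying"},
    "basis":   {"porter": "basi",  "lancaster": "bas",   "wordnet": "basis"},
    "power":   {"porter": "power", "lancaster": "pow",   "wordnet": "power"},
    "derives": {"porter": "deriv", "lancaster": "der",   "wordnet": "derives"},
}


def changed_by(tool):
    """Return the tokens whose recorded output differs from the input token."""
    return [tok for tok, out in recorded.items() if out[tool] != tok]


# Print one row per token: input, Porter, Lancaster, WordNet lemmatizer.
for token, out in recorded.items():
    print(f"{token:<8} {out['porter']:<8} {out['lancaster']:<8} {out['wordnet']}")
```

Among these five tokens, Lancaster alters four and the lemmatizer alters only women, which matches the general picture of Lancaster being the most aggressive of the three.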

Original article: http://www.cnblogs.com/no-tears-girl/p/6964910.html
