
Python nltk English Detection

Date: 2016-06-19 13:00:44

http://blog.alejandronolla.com/2013/05/15/detecting-text-language-with-python-and-nltk/

 

>>> from nltk import wordpunct_tokenize
>>> wordpunct_tokenize("That's thirty minutes away. I'll be there in ten.")
['That', "'", 's', 'thirty', 'minutes', 'away', '.', 'I', "'", 'll', 'be', 'there', 'in', 'ten', '.']
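The split shown above (words kept whole, punctuation broken out as separate tokens) can be approximated without NLTK. The sketch below assumes the pattern `\w+|[^\w\s]+`, which matches the behaviour in the transcript; the function name `simple_wordpunct_tokenize` is made up for this example and is not part of NLTK.

```python
import re

# Hypothetical stand-in for nltk.wordpunct_tokenize, for illustration only:
# runs of word characters, or runs of characters that are neither word
# characters nor whitespace, each become one token.
_WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

def simple_wordpunct_tokenize(text):
    """Split text into alphanumeric tokens and punctuation tokens."""
    return _WORDPUNCT.findall(text)

tokens = simple_wordpunct_tokenize("That's thirty minutes away. I'll be there in ten.")
print(tokens)
```

Because contractions are split at the apostrophe, fragments such as `s` and `ll` become tokens of their own, which matters later when tokens are compared against stopword lists.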

 

>>> from nltk.corpus import stopwords
>>> stopwords.fileids()
['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish']
>>>
>>> stopwords.words('english')[0:10]
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']

 

>>> languages_ratios = {}
>>>
>>> # `text` is the string to classify; it is assumed to be defined earlier in the session
>>> tokens = wordpunct_tokenize(text)
>>> words = [word.lower() for word in tokens]
>>> for language in stopwords.fileids():
...     stopwords_set = set(stopwords.words(language))
...     words_set = set(words)
...     common_elements = words_set.intersection(stopwords_set)
...
...     languages_ratios[language] = len(common_elements)  # language "score"
...
>>>

>>> languages_ratios
{'swedish': 1, 'danish': 1, 'hungarian': 2, 'finnish': 0, 'portuguese': 0, 'german': 1, 'dutch': 1, 'french': 1, 'spanish': 0, 'norwegian': 1, 'english': 6, 'russian': 0, 'turkish': 0, 'italian': 2}
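To see how far ahead the winning language is, the scores can be ranked rather than only maximised. This is a minimal sketch that reuses the score dictionary printed above as literal data; the helper name `top_languages` is an assumption made for this example.

```python
# Scores copied from the transcript above.
languages_ratios = {
    'swedish': 1, 'danish': 1, 'hungarian': 2, 'finnish': 0,
    'portuguese': 0, 'german': 1, 'dutch': 1, 'french': 1,
    'spanish': 0, 'norwegian': 1, 'english': 6, 'russian': 0,
    'turkish': 0, 'italian': 2,
}

def top_languages(scores, n=3):
    """Return the n highest-scoring (language, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]

print(top_languages(languages_ratios))
```

A large gap between first and second place (here 6 against 2) suggests a confident guess; a near tie would be a reason to treat the result with caution.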

 

>>> most_rated_language = max(languages_ratios, key=languages_ratios.get)
>>> most_rated_language
'english'
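The steps above (tokenize, lowercase, intersect with each stopword list, take the maximum) can be gathered into one reusable function. The sketch below is self-contained so it runs without the NLTK corpus download: `SAMPLE_STOPWORDS` is a tiny hand-picked sample invented for this example, and the regex tokenizer stands in for `wordpunct_tokenize`. In real use one would presumably pass in the full lists from `stopwords.words(language)` instead.

```python
import re

# Illustrative sample only; NOT the NLTK stopword lists.
SAMPLE_STOPWORDS = {
    "english": {"the", "is", "and", "of", "in", "to", "it", "you"},
    "spanish": {"el", "la", "es", "y", "de", "en", "que", "los"},
    "french": {"le", "la", "est", "et", "de", "en", "que", "les"},
}

def score_languages(text, stopwords_by_language):
    """Count, per language, how many distinct stopwords occur in text."""
    # Regex stand-in for wordpunct_tokenize (an assumption of this sketch).
    words = {token.lower() for token in re.findall(r"\w+|[^\w\s]+", text)}
    return {
        language: len(words & set(stopword_list))
        for language, stopword_list in stopwords_by_language.items()
    }

def detect_language(text, stopwords_by_language=SAMPLE_STOPWORDS):
    """Return the language whose stopwords overlap most with text."""
    scores = score_languages(text, stopwords_by_language)
    return max(scores, key=scores.get)

print(detect_language("The cat is in the garden and it is happy"))
print(detect_language("El gato es de la casa y los perros"))
```

Note that `max` returns the first key with the highest score, so on a tie (including text with no stopwords at all) the result depends on dictionary order rather than on the text.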

 


Original post: http://www.cnblogs.com/turtle920/p/5597829.html
