<NLP with Python> Notes: Part 3

Date: 2016-06-24 00:04:19

Accessing Text Corpora and Lexical Resources

　　Commonly used text corpora and lexical resources, and how to access them from Python.

2.1 Accessing Text Corpora

　　Corpus: a large body of text.

　　Three interfaces for accessing a corpus: raw(fileids) / sents(fileids) / words(fileids)

Gutenberg Corpora

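　　nltk.corpus.gutenberg: access the text through raw(fileids) / sents(fileids) / words(fileids).

　　nltk.corpus.gutenberg.words(fileids=None): returns the words of the given file(s). The result can be wrapped in an nltk.text.Text object, which makes the Text methods available: concordance, collocations, count, and so on.

　　nltk.corpus.gutenberg.sents(fileids=None): returns the sentences of the given file(s), each sentence being a list of words.

　　nltk.corpus.gutenberg.raw(fileids=None): returns the raw file contents as a single string.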
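
　　A minimal sketch of how the three access methods fit together. The names corpus_stats and demo are hypothetical helpers, not part of NLTK; demo assumes NLTK is installed and the data was fetched with nltk.download("gutenberg").

```python
def corpus_stats(reader, fileid):
    """Return basic size statistics for one file of a corpus reader."""
    num_chars = len(reader.raw(fileid))    # raw(): the file as one string
    num_words = len(reader.words(fileid))  # words(): sequence of word tokens
    num_sents = len(reader.sents(fileid))  # sents(): sequence of word lists
    return {
        "chars": num_chars,
        "words": num_words,
        "sents": num_sents,
        "avg_word_len": num_chars / num_words,
        "avg_sent_len": num_words / num_sents,
    }


def demo():
    # Assumption: nltk is installed and nltk.download("gutenberg") was run.
    from nltk.corpus import gutenberg
    from nltk.text import Text

    for fileid in gutenberg.fileids():
        print(fileid, corpus_stats(gutenberg, fileid))

    # Wrapping the word list in Text gives access to concordance() and friends.
    emma = Text(gutenberg.words("austen-emma.txt"))
    emma.concordance("surprize")
```

　　Calling demo() prints one statistics line per Gutenberg file, followed by a concordance view for one word in Emma.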

Web and Chat Text

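　　Text taken from web forums, conversations, and similar sources; the language is comparatively informal.

　　nltk.corpus.webtext: forum and other web text.

　　nltk.corpus.nps_chat: chat-session text.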
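
　　A small sketch for peeking at these informal corpora. preview and demo are hypothetical helper names; demo assumes nltk.download("webtext") and nltk.download("nps_chat") have been run.

```python
def preview(reader, n_chars=60):
    """Map each file id to the first n_chars characters of its raw text."""
    return {fileid: reader.raw(fileid)[:n_chars] for fileid in reader.fileids()}


def demo():
    # Assumption: the webtext and nps_chat data packages are installed.
    from nltk.corpus import nps_chat, webtext

    for fileid, head in preview(webtext).items():
        print(fileid, head, "...")

    # The chat corpus is organised as posts; posts() returns tokenised posts.
    chatroom = nps_chat.posts("10-19-20s_706posts.xml")
    print(chatroom[123])
```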

Brown Corpus

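　　The first million-word electronic corpus of English.

　　nltk.corpus.brown: words(fileids) / sents(fileids) / raw(fileids)

　　Commonly used to study differences between genres.

　　Conditional frequency distribution: nltk.probability.ConditionalFreqDist(cond_samples)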
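
　　One way to compare genres, sketched under assumptions: genre_word_pairs and demo are hypothetical names, the lowercasing step is my own choice, and demo needs nltk.download("brown"). The generator produces the (condition, sample) pairs that ConditionalFreqDist consumes.

```python
def genre_word_pairs(reader, genres):
    """Yield (genre, lowercased word) pairs for the requested genres."""
    for genre in genres:
        for word in reader.words(categories=genre):
            yield genre, word.lower()


def demo():
    # Assumption: nltk is installed and the brown data package is available.
    import nltk
    from nltk.corpus import brown

    genres = ["news", "religion", "hobbies", "science_fiction", "romance", "humor"]
    modals = ["can", "could", "may", "might", "must", "will"]

    cfd = nltk.ConditionalFreqDist(genre_word_pairs(brown, genres))
    # One row per genre, one column per modal verb.
    cfd.tabulate(conditions=genres, samples=modals)
```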

Reuters Corpus

　　nltk.corpus.reuters: an annotated text corpus (news documents labeled with topic categories).
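
　　Because the documents carry category labels, the reader can be queried in both directions (files for a category, categories for a file). A rough sketch; files_per_category and demo are hypothetical names, and demo assumes nltk.download("reuters").

```python
def files_per_category(reader):
    """Count how many files are labeled with each category."""
    return {
        category: len(reader.fileids(categories=category))
        for category in reader.categories()
    }


def demo():
    # Assumption: nltk is installed and the reuters data package is available.
    from nltk.corpus import reuters

    counts = files_per_category(reuters)
    for category in sorted(counts, key=counts.get, reverse=True)[:5]:
        print(category, counts[category])

    # The reverse lookup: which categories does one document belong to?
    print(reuters.categories("training/9865"))
```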

Corpora in Other Languages

2.2 Conditional Frequency Distributions

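　　nltk.probability.ConditionalFreqDist(cond_samples): builds a conditional frequency distribution from samples. Each sample is a (condition, event) pair, unlike nltk.probability.FreqDist, whose samples are plain events.

　　Very useful in many NLP tasks.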
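
　　To make the difference concrete, here is a tiny example on made-up data (the pairs list is invented for illustration). It needs NLTK installed but no corpus downloads.

```python
from nltk.probability import ConditionalFreqDist, FreqDist

# Invented (condition, event) pairs: the condition is a genre, the event a word.
pairs = [
    ("news", "the"),
    ("news", "could"),
    ("news", "the"),
    ("romance", "could"),
    ("romance", "could"),
]

cfd = ConditionalFreqDist(pairs)          # one FreqDist per condition
fd = FreqDist(word for _, word in pairs)  # a single count over all events

print(sorted(cfd.conditions()))       # ['news', 'romance']
print(cfd["news"]["the"])             # 2
print(cfd["romance"].most_common(1))  # [('could', 2)]
print(fd["could"])                    # 3
```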

2.3 More Python: Reusing Code

2.4 Lexical Resources

　　Lexical resource: a collection of words or phrases together with associated information, such as part of speech (POS).

Wordlist Corpora

　　A word list taken from the Unix /usr/dict/words file, used mainly for spell checking.

　　nltk.corpus.stopwords
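
　　A typical use of the stopword list is filtering high-frequency function words out of a text. A minimal sketch: content_fraction and demo are hypothetical names, the sample sentence is invented, and demo assumes nltk.download("stopwords").

```python
def content_fraction(words, stopword_list):
    """Return the fraction of words that are not stopwords."""
    stop = set(stopword_list)  # set lookup is faster than scanning a list
    content = [w for w in words if w.lower() not in stop]
    return len(content) / len(words)


def demo():
    # Assumption: nltk is installed and the stopwords data package is available.
    from nltk.corpus import stopwords

    english_stopwords = stopwords.words("english")
    sample = "The cat sat on the mat and looked at the birds".split()
    print(content_fraction(sample, english_stopwords))
```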

Pronouncing Dictionary

　　Words and their corresponding pronunciations.
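
　　The note does not name a specific reader; the sketch below assumes NLTK's CMU Pronouncing Dictionary (nltk.corpus.cmudict, fetched with nltk.download("cmudict")), where each pronunciation is a list of phone codes and vowels carry a stress digit. stress_pattern and demo are hypothetical names.

```python
def stress_pattern(pronunciation):
    """Extract the stress digits (0, 1, 2) from a list of phone codes."""
    return [char for phone in pronunciation for char in phone if char.isdigit()]


def demo():
    # Assumption: nltk is installed and the cmudict data package is available.
    from nltk.corpus import cmudict

    pronunciations = cmudict.dict()  # word -> list of pronunciations
    for pron in pronunciations["fire"]:
        print(pron, stress_pattern(pron))
```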

2.5 WordNet(MATTER)

　　A semantically oriented dictionary of English.

　　nltk.corpus.wordnet
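
　　WordNet groups words into synonym sets (synsets). A rough sketch of looking them up, assuming the NLTK 3 interface in which name(), definition() and lemma_names() are methods, and assuming nltk.download("wordnet") has been run. describe_senses and demo are hypothetical names.

```python
def describe_senses(synsets):
    """Summarise each synset as (name, definition, lemma names)."""
    return [(s.name(), s.definition(), s.lemma_names()) for s in synsets]


def demo():
    # Assumption: nltk is installed and the wordnet data package is available.
    from nltk.corpus import wordnet as wn

    for name, gloss, lemmas in describe_senses(wn.synsets("motorcar")):
        print(name, "|", gloss, "|", lemmas)

    # Synsets are linked by semantic relations, e.g. more general concepts.
    car = wn.synset("car.n.01")
    print(car.hypernyms())
```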

2.6 Summary

Original article: http://www.cnblogs.com/Mscer/p/5598840.html
