
Natural Language Processing (2): Text Corpora


1. Accessing Text Corpora

This chapter starts from a concrete corpus instance, nltk.corpus.gutenberg, and uses it to explore text corpora. We first inspect its type with help():

>>> import nltk
>>> help(nltk.corpus.gutenberg)
Help on PlaintextCorpusReader in module nltk.corpus.reader.plaintext object:

class PlaintextCorpusReader(nltk.corpus.reader.api.CorpusReader)
 |  Reader for corpora that consist of plaintext documents.  Paragraphs
 |  are assumed to be split using blank lines.  Sentences and words can
 |  be tokenized using the default tokenizers, or by custom tokenizers
 |  specified as parameters to the constructor.
 |  
 |  This corpus reader can be customized (e.g., to skip preface
 |  sections of specific document formats) by creating a subclass and
 |  overriding the ``CorpusView`` class variable.
 |  
 |  Method resolution order:
 |      PlaintextCorpusReader
 |      nltk.corpus.reader.api.CorpusReader
 |      __builtin__.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, discard_empty=True, flags=56), sent_tokenizer=<nltk.tokenize.punkt.PunktSentenceTokenizer object>, para_block_reader=<function read_blankline_block>, encoding=None)
 |      Construct a new plaintext corpus reader for a set of documents
 |      located at the given root directory.  Example usage:
 |      
 |          >>> root = '/usr/local/share/nltk_data/corpora/webtext/'
 |          >>> reader = PlaintextCorpusReader(root, '.*\.txt')
 |      
 |      :param root: The root directory for this corpus.
 |      :param fileids: A list or regexp specifying the fileids in this corpus.
 |      :param word_tokenizer: Tokenizer for breaking sentences or
 |          paragraphs into words.
 |      :param sent_tokenizer: Tokenizer for breaking paragraphs
 |          into sentences.
 |      :param para_block_reader: The block reader used to divide the
 |          corpus into paragraph blocks.
 |  
 |  paras(self, fileids=None, sourced=False)
 |      :return: the given file(s) as a list of
 |          paragraphs, each encoded as a list of sentences, which are
 |          in turn encoded as lists of word strings.
 |      :rtype: list(list(list(str)))
 |  
 |  raw(self, fileids=None, sourced=False)
 |      :return: the given file(s) as a single string.
 |      :rtype: str
 |  
 |  sents(self, fileids=None, sourced=False)
 |      :return: the given file(s) as a list of
 |          sentences or utterances, each encoded as a list of word
 |          strings.
 |      :rtype: list(list(str))
 |  
 |  words(self, fileids=None, sourced=False)
 |      :return: the given file(s) as a list of words
 |          and punctuation symbols.
 |      :rtype: list(str)
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  CorpusView = <class 'nltk.corpus.reader.util.StreamBackedCorpusView'>
 |      A view of a corpus file, which acts like a sequence of tokens:
 |      it can be accessed by index, iterated over, etc.  However, the
 |      tokens are only constructed as-needed -- the entire corpus is
 |      never stored in memory at once.
 |      
 |      The constructor to ``StreamBackedCorpusView`` takes two arguments:
 |      a corpus fileid (specified as a string or as a ``PathPointer``);
 |      and a block reader.  A "block reader" is a function that reads
 |      zero or more tokens from a stream, and returns them as a list.  A
 |      very simple example of a block reader is:
 |      
 |          >>> def simple_block_reader(stream):
 |          ...     return stream.readline().split()
 |      
 |      This simple block reader reads a single line at a time, and
 |      returns a single token (consisting of a string) for each
 |      whitespace-separated substring on the line.
 |      
 |      When deciding how to define the block reader for a given
 |      corpus, careful consideration should be given to the size of
 |      blocks handled by the block reader.  Smaller block sizes will
 |      increase the memory requirements of the corpus view's internal
 |      data structures (by 2 integers per block).  On the other hand,
 |      larger block sizes may decrease performance for random access to
 |      the corpus.  (But note that larger block sizes will *not*
 |      decrease performance for iteration.)
 |      
 |      Internally, ``CorpusView`` maintains a partial mapping from token
 |      index to file position, with one entry per block.  When a token
 |      with a given index *i* is requested, the ``CorpusView`` constructs
 |      it as follows:
 |      
 |        1. First, it searches the toknum/filepos mapping for the token
 |           index closest to (but less than or equal to) *i*.
 |      
 |        2. Then, starting at the file position corresponding to that
 |           index, it reads one block at a time using the block reader
 |           until it reaches the requested token.
 |      
 |      The toknum/filepos mapping is created lazily: it is initially
 |      empty, but every time a new block is read, the block's
 |      initial token is added to the mapping.  (Thus, the toknum/filepos
 |      map has one entry per block.)
 |      
 |      In order to increase efficiency for random access patterns that
 |      have high degrees of locality, the corpus view may cache one or
 |      more blocks.
 |      
 |      :note: Each ``CorpusView`` object internally maintains an open file
 |          object for its underlying corpus file.  This file should be
 |          automatically closed when the ``CorpusView`` is garbage collected,
 |          but if you wish to close it manually, use the ``close()``
 |          method.  If you access a ``CorpusView``'s items after it has been
 |          closed, the file object will be automatically re-opened.
 |      
 |      :warning: If the contents of the file are modified during the
 |          lifetime of the ``CorpusView``, then the ``CorpusView``'s behavior
 |          is undefined.
 |      
 |      :warning: If a unicode encoding is specified when constructing a
 |          ``CorpusView``, then the block reader may only call
 |          ``stream.seek()`` with offsets that have been returned by
 |          ``stream.tell()``; in particular, calling ``stream.seek()`` with
 |          relative offsets, or with offsets based on string lengths, may
 |          lead to incorrect behavior.
 |      
 |      :ivar _block_reader: The function used to read
 |          a single block from the underlying file stream.
 |      :ivar _toknum: A list containing the token index of each block
 |          that has been processed.  In particular, ``_toknum[i]`` is the
 |          token index of the first token in block ``i``.  Together
 |          with ``_filepos``, this forms a partial mapping between token
 |          indices and file positions.
 |      :ivar _filepos: A list containing the file position of each block
 |          that has been processed.  In particular, ``_filepos[i]`` is the
 |          file position of the first character in block ``i``.  Together
 |          with ``_toknum``, this forms a partial mapping between token
 |          indices and file positions.
 |      :ivar _stream: The stream used to access the underlying corpus file.
 |      :ivar _len: The total number of tokens in the corpus, if known;
 |          or None, if the number of tokens is not yet known.
 |      :ivar _eofpos: The character position of the last character in the
 |          file.  This is calculated when the corpus view is initialized,
 |          and is used to decide when the end of file has been reached.
 |      :ivar _cache: A cache of the most recently read block.  It
 |         is encoded as a tuple (start_toknum, end_toknum, tokens), where
 |         start_toknum is the token index of the first token in the block;
 |         end_toknum is the token index of the first token not in the
 |         block; and tokens is a list of the tokens in the block.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from nltk.corpus.reader.api.CorpusReader:
 |  
 |  __repr__(self)
 |  
 |  abspath(self, fileid)
 |      Return the absolute path for the given file.
 |      
 |      :type file: str
 |      :param file: The file identifier for the file whose path
 |          should be returned.
 |      :rtype: PathPointer
 |  
 |  abspaths(self, fileids=None, include_encoding=False, include_fileid=False)
 |      Return a list of the absolute paths for all fileids in this corpus;
 |      or for the given list of fileids, if specified.
 |      
 |      :type fileids: None or str or list
 |      :param fileids: Specifies the set of fileids for which paths should
 |          be returned.  Can be None, for all fileids; a list of
 |          file identifiers, for a specified set of fileids; or a single
 |          file identifier, for a single file.  Note that the return
 |          value is always a list of paths, even if ``fileids`` is a
 |          single file identifier.
 |      
 |      :param include_encoding: If true, then return a list of
 |          ``(path_pointer, encoding)`` tuples.
 |      
 |      :rtype: list(PathPointer)
 |  
 |  encoding(self, file)
 |      Return the unicode encoding for the given corpus file, if known.
 |      If the encoding is unknown, or if the given file should be
 |      processed using byte strings (str), then return None.
 |  
 |  fileids(self)
 |      Return a list of file identifiers for the fileids that make up
 |      this corpus.
 |  
 |  open(self, file, sourced=False)
 |      Return an open stream that can be used to read the given file.
 |      If the file's encoding is not None, then the stream will
 |      automatically decode the file's contents into unicode.
 |      
 |      :param file: The file identifier of the file to read.
 |  
 |  readme(self)
 |      Return the contents of the corpus README file, if it exists.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from nltk.corpus.reader.api.CorpusReader:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  root
 |      The directory where this corpus is stored.
 |      
 |      :type: PathPointer
The PlaintextCorpusReader help lists many of the methods used in this chapter's examples, such as fileids() and words().
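Before moving on, the nested return types documented above can be verified directly; a minimal sketch using paras() (outputs omitted):

from nltk.corpus import gutenberg

# paras() returns list(list(list(str))):
# paragraphs -> sentences -> word strings.
para = gutenberg.paras('blake-poems.txt')[1]
print(len(para))     # number of sentences in this paragraph
print(para[0][:5])   # first five words of its first sentence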

1.1 fileids() returns the corpus file identifiers

>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

1.2 words() returns the list of words in a file

>>> from nltk.corpus import gutenberg
>>> gutenberg.words('austen-emma.txt')
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]
>>> len(gutenberg.words('austen-emma.txt'))
192427

Use concordance() to search for a word in the text (note that the first, misspelled query finds nothing):

>>> emma = nltk.Text(gutenberg.words('austen-emma.txt'))
>>> emma
<Text: Emma by Jane Austen 1816>
>>> emma.concordance('surperize')
Building index...
No matches
>>> emma.concordance('surprize')
Displaying 25 of 37 matches:
er father , was sometimes taken by surprize at his being still able to pity ` 
hem do the other any good ." " You surprize me ! Emma must do Harriet good : a
Knightley actually looked red with surprize and displeasure , as he stood up ,
r . Elton , and found to his great surprize , that Mr . Elton was actually on 
d aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great ,
father was quite taken up with the surprize of so sudden a journey , and his f
y , in all the favouring warmth of surprize and conjecture . She was , moreove
he appeared , to have her share of surprize , introduction , and pleasure . Th
ir plans ; and it was an agreeable surprize to her , therefore , to perceive t
talking aunt had taken me quite by surprize , it must have been the death of m
f all the dialogue which ensued of surprize , and inquiry , and congratulation
 the present . They might chuse to surprize her ." Mrs . Cole had many to agre
the mode of it , the mystery , the surprize , is more like a young woman ' s s
 to her song took her agreeably by surprize -- a second , slightly but correct
" " Oh ! no -- there is nothing to surprize one at all .-- A pretty fortune ; 
t to be considered . Emma ' s only surprize was that Jane Fairfax should accep
of your admiration may take you by surprize some day or other ." Mr . Knightle
ation for her will ever take me by surprize .-- I never had a thought of her i
 expected by the best judges , for surprize -- but there was great joy . Mr . 
 sound of at first , without great surprize . " So unreasonably early !" she w
d Frank Churchill , with a look of surprize and displeasure .-- " That is easy
; and Emma could imagine with what surprize and mortification she must be retu
tled that Jane should go . Quite a surprize to me ! I had not the least idea !
 . It is impossible to express our surprize . He came to speak to his father o
g engaged !" Emma even jumped with surprize ;-- and , horror - struck , exclai

This uses the nltk.Text class. Inspecting it with help() again, the list of methods shows just how useful this class is:

class Text(__builtin__.object)
 |  A wrapper around a sequence of simple (string) tokens, which is
 |  intended to support initial exploration of texts (via the
 |  interactive console).  Its methods perform a variety of analyses
 |  on the text's contexts (e.g., counting, concordancing, collocation
 |  discovery), and display the results.  If you wish to write a
 |  program which makes use of these analyses, then you should bypass
 |  the ``Text`` class, and use the appropriate analysis function or
 |  class directly instead.
 |  
 |  A ``Text`` is typically initialized from a given document or
 |  corpus.  E.g.:
 |  
 |  >>> import nltk.corpus
 |  >>> from nltk.text import Text
 |  >>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
 |  
 |  Methods defined here:
 |  
 |  __getitem__(self, i)
 |  
 |  __init__(self, tokens, name=None)
 |      Create a Text object.
 |      
 |      :param tokens: The source text.
 |      :type tokens: sequence of str
 |  
 |  __len__(self)
 |  
 |  __repr__(self)
 |      :return: A string representation of this Text.
 |      :rtype: string
 |  
 |  collocations(self, num=20, window_size=2)
 |      Print collocations derived from the text, ignoring stopwords.
 |      
 |      :seealso: find_collocations
 |      :param num: The maximum number of collocations to print.
 |      :type num: int
 |      :param window_size: The number of tokens spanned by a collocation (default=2)
 |      :type window_size: int
 |  
 |  common_contexts(self, words, num=20)
 |      Find contexts where the specified words appear; list
 |      most frequent common contexts first.
 |      
 |      :param words: The words used to seed the similarity search
 |      :type words: str
 |      :param num: The number of words to generate (default=20)
 |      :type num: int
 |      :seealso: ContextIndex.common_contexts()
 |  
 |  concordance(self, word, width=79, lines=25)
 |      Print a concordance for ``word`` with the specified context window.
 |      Word matching is not case-sensitive.
 |      :seealso: ``ConcordanceIndex``
 |  
 |  count(self, word)
 |      Count the number of times this word appears in the text.
 |  
 |  dispersion_plot(self, words)
 |      Produce a plot showing the distribution of the words through the text.
 |      Requires pylab to be installed.
 |      
 |      :param words: The words to be plotted
 |      :type words: list(str)
 |      :seealso: nltk.draw.dispersion_plot()
 |  
 |  findall(self, regexp)
 |      Find instances of the regular expression in the text.
 |      The text is a list of tokens, and a regexp pattern to match
 |      a single token must be surrounded by angle brackets.  E.g.
 |      
 |      >>> from nltk.book import text1, text5, text9
 |      >>> text5.findall("<.*><.*><bro>")
 |      you rule bro; telling you bro; u twizted bro
 |      >>> text1.findall("<a>(<.*>)<man>")
 |      monied; nervous; dangerous; white; white; white; pious; queer; good;
 |      mature; white; Cape; great; wise; wise; butterless; white; fiendish;
 |      pale; furious; better; certain; complete; dismasted; younger; brave;
 |      brave; brave; brave
 |      >>> text9.findall("<th.*>{3,}")
 |      thread through those; the thought that; that the thing; the thing
 |      that; that that thing; through these than through; them that the;
 |      through the thick; them that they; thought that the
 |      
 |      :param regexp: A regular expression
 |      :type regexp: str
 |  
 |  generate(self, length=100)
 |      Print random text, generated using a trigram language model.
 |      
 |      :param length: The length of text to generate (default=100)
 |      :type length: int
 |      :seealso: NgramModel
 |  
 |  index(self, word)
 |      Find the index of the first occurrence of the word in the text.
 |  
 |  plot(self, *args)
 |      See documentation for FreqDist.plot()
 |      :seealso: nltk.prob.FreqDist.plot()
 |  
 |  readability(self, method)
 |  
 |  similar(self, word, num=20)
 |      Distributional similarity: find other words which appear in the
 |      same contexts as the specified word; list most similar words first.
 |      
 |      :param word: The word used to seed the similarity search
 |      :type word: str
 |      :param num: The number of words to generate (default=20)
 |      :type num: int
 |      :seealso: ContextIndex.similar_words()
 |  
 |  vocab(self)
 |      :seealso: nltk.prob.FreqDist
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
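To make this concrete, here is a minimal sketch exercising a few of the methods listed above on the Emma text built earlier (printed results omitted):

import nltk
from nltk.corpus import gutenberg

emma = nltk.Text(gutenberg.words('austen-emma.txt'))
emma.similar('surprize')    # words appearing in similar contexts
emma.collocations()         # frequent word pairs, ignoring stopwords
print(emma.count('Emma'))   # how many times this token occurs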

1.3 The difference between raw(), sents(), and words()

The following example shows the difference between raw(), sents(), and words():

#!/usr/bin/env python
from nltk.corpus import gutenberg
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))                              # number of characters
    num_words = len(gutenberg.words(fileid))                            # number of words
    num_sents = len(gutenberg.sents(fileid))                            # number of sentences
    num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))  # number of distinct words
    print int(num_chars/num_words), int(num_words/num_sents), int(num_words/num_vocab), fileid
  
4 21 26 austen-emma.txt   # avg word length, avg words per sentence, avg uses per distinct word
4 23 16 austen-persuasion.txt
4 23 22 austen-sense.txt
4 33 79 bible-kjv.txt
4 18 5 blake-poems.txt
4 17 14 bryant-stories.txt
4 17 12 burgess-busterbrown.txt
4 16 12 carroll-alice.txt
4 17 11 chesterton-ball.txt
4 19 11 chesterton-brown.txt
4 16 10 chesterton-thursday.txt
4 17 24 edgeworth-parents.txt
4 24 15 melville-moby_dick.txt
4 52 10 milton-paradise.txt
4 11 8 shakespeare-caesar.txt
4 12 7 shakespeare-hamlet.txt
4 12 6 shakespeare-macbeth.txt
4 35 12 whitman-leaves.txt
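These statistics follow directly from what each method returns; a minimal sketch of the three return types (outputs omitted):

from nltk.corpus import gutenberg

fileid = 'blake-poems.txt'
print(repr(gutenberg.raw(fileid)[:30]))  # str: the text itself, character by character
print(gutenberg.words(fileid)[:8])       # list(str): word and punctuation tokens
print(gutenberg.sents(fileid)[1])        # list(list(str)): sentences as token lists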

Retrieve and inspect the longest sentence in shakespeare-macbeth.txt:

#!/usr/bin/env python
from nltk.corpus import gutenberg
macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')   # the list of sentences
print macbeth_sentences
print macbeth_sentences[1037]
longest_len = max([len(s) for s in macbeth_sentences])           # length of the longest sentence
print [s for s in macbeth_sentences if len(s) == longest_len]    # the longest sentence(s)

[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...]

['Good', 'night', ',', 'and', 'better', 'health', 'Attend', 'his', 'Maiesty']

[['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that', 'doe', 'cling', 'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The', 'mercilesse', 'Macdonwald', '(', 'Worthie', 'to', 'be', 'a', 'Rebell', ',', 'for', 'to', 'that', 'The', 'multiplying', 'Villanies', 'of', 'Nature', 'Doe', 'swarme', 'vpon', 'him', ')', 'from', 'the', 'Westerne', 'Isles', 'Of', 'Kernes', 'and', 'Gallowgrosses', 'is', 'supply', "'", 'd', ',', 'And', 'Fortune', 'on', 'his', 'damned', 'Quarry', 'smiling', ',', 'Shew', "'", 'd', 'like', 'a', 'Rebells', 'Whore', ':', 'but', 'all', "'", 's', 'too', 'weake', ':', 'For', 'braue', 'Macbeth', '(', 'well', 'hee', 'deserues', 'that', 'Name', ')', 'Disdayning', 'Fortune', ',', 'with', 'his', 'brandisht', 'Steele', ',', 'Which', 'smoak', "'", 'd', 'with', 'bloody', 'execution', '(', 'Like', 'Valours', 'Minion', ')', 'caru', "'", 'd', 'out', 'his', 'passage', ',', 'Till', 'hee', 'fac', "'", 'd', 'the', 'Slaue', ':', 'Which', 'neu', "'", 'r', 'shooke', 'hands', ',', 'nor', 'bad', 'farwell', 'to', 'him', ',', 'Till', 'he', 'vnseam', "'", 'd', 'him', 'from', 'the', 'Naue', 'toth', "'", 'Chops', ',', 'And', 'fix', "'", 'd', 'his', 'Head', 'vpon', 'our', 'Battlements']]

1.4 The NPSChatCorpusReader class

Next we study a new reader class through another instance NLTK provides, nltk.corpus.nps_chat, again using help() to inspect it. The output shows that this class reads XML-format files.

nps_chat = class NPSChatCorpusReader(nltk.corpus.reader.xmldocs.XMLCorpusReader)
 |  Method resolution order:
 |      NPSChatCorpusReader
 |      nltk.corpus.reader.xmldocs.XMLCorpusReader
 |      nltk.corpus.reader.api.CorpusReader
 |      __builtin__.object
 |  
 |  Methods defined here:
...
>>> from nltk.corpus import nps_chat
>>> nps_chat.fileids()
['10-19-20s_706posts.xml', '10-19-30s_705posts.xml', '10-19-40s_686posts.xml', '10-19-adults_706posts.xml', '10-24-40s_706posts.xml', '10-26-teens_706posts.xml', '11-06-adults_706posts.xml', '11-08-20s_705posts.xml', '11-08-40s_706posts.xml', '11-08-adults_705posts.xml', '11-08-teens_706posts.xml', '11-09-20s_706posts.xml', '11-09-40s_706posts.xml', '11-09-adults_706posts.xml', '11-09-teens_706posts.xml']
>>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')
>>> chatroom[123]
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']
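posts() returns the tokenized posts; the reader also exposes the underlying XML via xml_posts(), where each post element carries its dialogue-act label in a 'class' attribute. A minimal sketch counting those labels for one file (output omitted):

import nltk
from nltk.corpus import nps_chat

# Each post is an XML element; post.get('class') is its
# dialogue-act label and post.text its raw content.
posts = nps_chat.xml_posts('10-19-20s_706posts.xml')
fd = nltk.FreqDist(post.get('class') for post in posts)
for label in fd:
    print('%s: %d' % (label, fd[label]))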

1.5 The CategorizedTaggedCorpusReader class

This section introduces the CategorizedTaggedCorpusReader class using the brown corpus as the example instance.

>>> from nltk.corpus import brown
>>> help(brown)
class CategorizedTaggedCorpusReader(nltk.corpus.reader.api.CategorizedCorpusReader, TaggedCorpusReader)
 |  A reader for part-of-speech tagged corpora whose documents are
 |  divided into categories based on their file identifiers.
 |  
 |  Method resolution order:
 |      CategorizedTaggedCorpusReader
 |      nltk.corpus.reader.api.CategorizedCorpusReader
 |      TaggedCorpusReader
 |      nltk.corpus.reader.api.CorpusReader
 |      __builtin__.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, *args, **kwargs)
 |      Initialize the corpus reader.  Categorization arguments
 |      (``cat_pattern``, ``cat_map``, and ``cat_file``) are passed to
 |      the ``CategorizedCorpusReader`` constructor.  The remaining arguments
 |      are passed to the ``TaggedCorpusReader``.
 |  
 |  paras(self, fileids=None, categories=None)
 |  
 |  raw(self, fileids=None, categories=None)
 |  
 |  sents(self, fileids=None, categories=None)
 |  
 |  tagged_paras(self, fileids=None, categories=None, simplify_tags=False)
 |  
 |  tagged_sents(self, fileids=None, categories=None, simplify_tags=False)
 |  
 |  tagged_words(self, fileids=None, categories=None, simplify_tags=False)
 |  
 |  words(self, fileids=None, categories=None)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from nltk.corpus.reader.api.CategorizedCorpusReader:
 |  
 |  categories(self, fileids=None)
 |      Return a list of the categories that are defined for this corpus,
 |      or for the file(s) if it is given.
 |  
 |  fileids(self, categories=None)
 |      Return a list of file identifiers for the files that make up
 |      this corpus, or that make up the given category(s) if specified.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from nltk.corpus.reader.api.CategorizedCorpusReader:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from nltk.corpus.reader.api.CorpusReader:
 |  
 |  __repr__(self)
 |  
 |  abspath(self, fileid)
 |      Return the absolute path for the given file.
 |      
 |      :type file: str
 |      :param file: The file identifier for the file whose path
 |          should be returned.
 |      :rtype: PathPointer
 |  
 |  abspaths(self, fileids=None, include_encoding=False, include_fileid=False)
 |      Return a list of the absolute paths for all fileids in this corpus;
 |      or for the given list of fileids, if specified.
 |      
 |      :type fileids: None or str or list
 |      :param fileids: Specifies the set of fileids for which paths should
 |          be returned.  Can be None, for all fileids; a list of
 |          file identifiers, for a specified set of fileids; or a single
 |          file identifier, for a single file.  Note that the return
 |          value is always a list of paths, even if ``fileids`` is a
 |          single file identifier.
 |      
 |      :param include_encoding: If true, then return a list of
 |          ``(path_pointer, encoding)`` tuples.
 |      
 |      :rtype: list(PathPointer)
 |  
 |  encoding(self, file)
 |      Return the unicode encoding for the given corpus file, if known.
 |      If the encoding is unknown, or if the given file should be
 |      processed using byte strings (str), then return None.
 |  
 |  open(self, file, sourced=False)
 |      Return an open stream that can be used to read the given file.
 |      If the file's encoding is not None, then the stream will
 |      automatically decode the file's contents into unicode.
 |      
 |      :param file: The file identifier of the file to read.
 |  
 |  readme(self)
 |      Return the contents of the corpus README file, if it exists.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from nltk.corpus.reader.api.CorpusReader:
 |  
 |  root
 |      The directory where this corpus is stored.
 |      
 |      :type: PathPointer

Now look at brown's contents and how to get the corpus's categories and files:

>>> from nltk.corpus import brown
>>> brown.categories()                  # the categories (topics) in the brown corpus
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.fileids()[1:10]               # files in the brown corpus
['ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', 'ca09', 'ca10']
>>> brown.words(categories='news')      # the 'news' category, split into words
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])       # the file 'cg22', split into words
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])   # several categories, split into sentences
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

Count particular words within one genre of brown:

import nltk
from nltk.corpus import brown

news_text = brown.words(categories='news')             # the 'news' category, split into words
fdist = nltk.FreqDist([w.lower() for w in news_text])  # frequency distribution over 'news'
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print m + ':', fdist[m],                           # count of each modal

Output:

  can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

Tabulate the same counts across several categories at once:

import nltk
from nltk.corpus import brown

cfd = nltk.ConditionalFreqDist(
        (genre, word)
        for genre in brown.categories()
        for word in brown.words(categories=genre))
genres = ['new', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)

                 can could  may might must will
            new    0    0    0    0    0    0
       religion   82   59   78   12   54   71
        hobbies  268   58  131   22   83  264
science_fiction   16   49    4   12    8   16
        romance   74  193   11   51   45   43
          humor   16   30    8    8    9   13

(The 'new' row is all zeros because 'new' is a typo for the actual category name 'news'; an unknown condition simply tabulates as zero counts.)
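tabulate() also has a graphical counterpart. A hedged sketch, reusing cfd, genres, and modals from the script above, and assuming plot() accepts the same conditions/samples keywords as tabulate():

# Draws the modal counts per genre as line plots.
# Assumes pylab/matplotlib is installed.
cfd.plot(conditions=genres, samples=modals)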

1.6 The CategorizedPlaintextCorpusReader class

Compared with brown (a CategorizedTaggedCorpusReader), reuters (a CategorizedPlaintextCorpusReader) lets you look up the topics covered by one or more documents, as well as the documents contained in one or more categories.

>>> from nltk.corpus import reuters
>>> reuters.fileids()[1:10]
['test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839', 'test/14840', 'test/14841', 'test/14842', 'test/14843']
>>> reuters.categories()
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']
>>> reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']
>>> reuters.categories(['training/9865', 'training/9880'])
['barley', 'corn', 'grain', 'money-fx', 'wheat']
>>> reuters.categories('training/9880')
['money-fx']
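Like categories(), the access methods accept either fileids or categories, singly or in lists. A minimal sketch (outputs omitted):

from nltk.corpus import reuters

print(reuters.words('training/9865')[:10])                 # words of one document
print(reuters.words(categories=['barley', 'corn'])[:10])   # words across two topics
print(reuters.fileids('barley')[:5])                       # documents in one topic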

Compare with brown:

>>> from nltk.corpus import brown
>>> brown.categories(['news', 'reviews'])   # cannot look up across multiple topics
[]
>>> brown.fileids(['cr05', 'cr06'])
[]

1.7 Basic corpus functions

Example                        Description
fileids()                      The files of the corpus
fileids([categories])          The files of the corpus corresponding to these categories
categories()                   The categories of the corpus
categories([fileids])          The categories of the corpus corresponding to these files
raw()                          The raw content of the corpus
raw(fileids=[f1,f2,f3])        The raw content of the specified files
raw(categories=[c1,c2])        The raw content of the specified categories
words()                        The words of the whole corpus
words(fileids=[f1,f2,f3])      The words of the specified files
words(categories=[c1,c2])      The words of the specified categories
sents()                        The sentences of the whole corpus
sents(fileids=[f1,f2,f3])      The sentences of the specified files
sents(categories=[c1,c2])      The sentences of the specified categories
abspath(fileid)                The location of the given file on disk
encoding(fileid)               The encoding of the file (if known)
open(fileid)                   Open a stream for reading the given corpus file
root                           The path to the root of the locally installed corpus
readme()                       The contents of the corpus README file

1.8 Loading your own corpus

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/Users/rcf/workspace/python/python_test/NLP_WITH_PYTHON/chapter_2'
>>> wordlist = PlaintextCorpusReader(corpus_root, '.*')   # corpus_root: corpus path; '.*' matches all files
>>> wordlist.fileids()
['1.py', '2.py', '3.py', '4.py']
>>> wordlist.words('3.py')
['from', 'nltk', '.', 'corpus', 'import', 'brown', ...]
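As the PlaintextCorpusReader docstring quoted in section 1 notes, fileids can be an explicit list instead of a regexp. A minimal sketch, reusing the same (author-specific) corpus_root and file names:

from nltk.corpus import PlaintextCorpusReader

corpus_root = '/Users/rcf/workspace/python/python_test/NLP_WITH_PYTHON/chapter_2'
wordlist = PlaintextCorpusReader(corpus_root, ['1.py', '2.py'])  # explicit fileid list
print(wordlist.fileids())       # only the listed files are in the corpus
print(wordlist.sents('1.py'))   # sentence access works the same way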

Original article: http://www.cnblogs.com/rcfeng/p/3930464.html