http://scikit-learn.org/stable/modules/feature_extraction.html
Section 4.2 covers a lot of material, so text feature extraction gets its own post.
1. The bag of words representation
To turn raw data into fixed-length numerical feature vectors, scikit-learn provides three operations:
tokenizing: assign an integer index (id) to each token (characters or words; the granularity is up to you);
counting: count the occurrences of each token in each document;
normalizing: normalize/weight the importance of each token according to how often it appears across the samples/documents.
To re-understand what a feature is and what a sample is: vectorization is the general process (tokenization, counting and normalization) of turning a collection of text documents into numerical feature vectors, while completely ignoring the relative position information of the words in the document. Each individual token occurrence frequency is treated as a feature, and the vector of all the token frequencies for a given document is treated as a sample.
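A minimal sketch of the first two steps (the toy corpus is invented for illustration; in scikit-learn before 1.0, get_feature_names_out() is get_feature_names()):

from sklearn.feature_extraction.text import CountVectorizer

# a toy corpus, invented for illustration
corpus = ['the cat sat on the mat', 'the dog sat']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # tokenize + count in one step

# each column is one token (a feature), each row is one document (a sample)
print(vectorizer.get_feature_names_out())  # ['cat' 'dog' 'mat' 'on' 'sat' 'the']
print(X.toarray())
# [[1 0 1 1 1 2]
#  [0 1 0 0 1 1]]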
2. Sparsity
The words in any single document are only a tiny fraction of all the words in the corpus, which makes the feature vectors sparse (most values are 0). To keep storage and computation manageable, scikit-learn uses Python's scipy.sparse package.
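A quick way to see this (corpus invented for illustration): the matrix returned by fit_transform is a scipy.sparse matrix, not a dense array.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the quick brown fox', 'the lazy dog', 'brown dog, lazy fox']
X = CountVectorizer().fit_transform(corpus)

print(type(X))   # a scipy.sparse CSR matrix
print(X.nnz)     # number of stored (non-zero) entries
print(X.shape)   # (n_documents, vocabulary_size)
# only the non-zeros are stored; X.toarray() would materialize the dense version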
3. Common vectorizer usage
CountVectorizer implements both tokenizing and counting.
It has many parameters, but the defaults are already quite sensible and suit most situations; for details see: http://blog.csdn.net/mmc2015/article/details/46866537
The examples here illustrate its usage:
http://blog.csdn.net/mmc2015/article/details/46857887
including fit_transform, transform, get_feature_names(), ngram_range=(min, max), vocabulary_.get(), and so on; see the sketch below.
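A minimal sketch exercising those calls (corpus invented for illustration; get_feature_names() was renamed get_feature_names_out() in scikit-learn 1.0):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['this is the first document', 'this document is the second document']

# unigrams and bigrams via ngram_range=(min, max)
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)           # learn the vocabulary, then count

print(vectorizer.get_feature_names_out())      # one entry per learned n-gram
print(vectorizer.vocabulary_.get('document'))  # column index of a given token

# transform() reuses the fitted vocabulary; unseen tokens are ignored
X_new = vectorizer.transform(['a brand new document'])
print(X_new.toarray())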
4. tf-idf term weighting
This addresses the problem that certain words (e.g. "the", "a", "is" in English) appear very frequently yet are not the words we actually care about.
The text.TfidfTransformer class implements this normalization:
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer()
>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf
<6x3 sparse matrix of type '<... 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse ... format>
>>> tfidf.toarray()
array([[ 0.85...,  0.  ...,  0.52...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.55...,  0.83...,  0.  ...],
       [ 0.63...,  0.  ...,  0.77...]])
>>> transformer.idf_  # idf_ stores the fitted idf weights
array([ 1. ...,  2.25...,  1.84...])
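Those idf_ values can be checked by hand. With the default smooth_idf=True, scikit-learn uses idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. A small sketch reproducing the numbers above:

import numpy as np

n = 6                      # number of documents in counts
df = np.array([6, 1, 2])   # documents containing each of the 3 terms

idf = np.log((1 + n) / (1 + df)) + 1   # smooth_idf=True (the default)
print(idf)   # [1.         2.25276297 1.84729786] -- matches transformer.idf_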
For binary-occurrence features, it is better to set CountVectorizer's binary parameter to True, and Bernoulli Naive Bayes is also a better-suited estimator; a sketch follows.
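A minimal sketch of that combination (the toy corpus and labels are invented for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# toy spam/ham corpus, invented for illustration
docs = ['win money now', 'meeting at noon', 'win win win', 'lunch meeting today']
labels = [1, 0, 1, 0]

# binary=True records presence/absence instead of counts
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

clf = BernoulliNB().fit(X, labels)   # Bernoulli NB models binary features
print(clf.predict(vectorizer.transform(['win a meeting'])))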
5. Decoding text files
Text is made of characters, but files are made of bytes, so for scikit-learn to work you must first tell it the file's encoding, and CountVectorizer will then decode it automatically. The default encoding is UTF-8, and the decoded character set is called Unicode. If the file you load is not UTF-8 encoded and you have not set the encoding parameter, a UnicodeDecodeError is raised.
If you hit decoding errors, try, for example:
The following snippet uses chardet (not shipped with scikit-learn; it must be installed separately) to figure out the encoding of three texts. It then vectorizes the texts and prints the learned vocabulary. (Depending on the version of chardet, it might get the first one wrong.)
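The blog does not include the snippet itself; here is a minimal sketch of the same idea, with byte strings invented for illustration (the originals in the scikit-learn docs are German phrases in three different encodings), plus a utf-8 fallback in case chardet cannot guess:

import chardet
from sklearn.feature_extraction.text import CountVectorizer

# the same text rendered in three different encodings (invented example)
text1 = 'Ein naïver Käufer sitzt im Café'.encode('utf-8')
text2 = 'Ein naïver Käufer sitzt im Café'.encode('latin-1')
text3 = 'Ein naïver Käufer sitzt im Café'.encode('utf-16')

decoded = []
for b in (text1, text2, text3):
    enc = chardet.detect(b)['encoding'] or 'utf-8'  # fall back if detection fails
    decoded.append(b.decode(enc))

v = CountVectorizer().fit(decoded)
print(v.vocabulary_)  # the learned vocabulary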
6. Applications and examples
The third example below is especially recommended.
In particular, in a supervised setting it can be successfully combined with fast and scalable linear models to train document classifiers.
In an unsupervised setting it can be used to group similar documents together by applying clustering algorithms such as K-means.
Finally, it is possible to discover the main topics of a corpus by relaxing the hard assignment constraint of clustering, for instance by using Non-negative matrix factorization (NMF or NNMF); a sketch of the clustering case follows.
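A minimal sketch of the unsupervised case (corpus invented for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# tiny corpus with two obvious topics, invented for illustration
docs = ['the cat sat on the mat', 'cats and kittens purr',
        'stock markets fell today', 'investors sold their shares']

# TfidfVectorizer combines CountVectorizer and TfidfTransformer in one step
X = TfidfVectorizer().fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # documents about the same topic should share a label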
7. Shortcomings of bag of words
Misspellings, word derivations, and word-order dependence: misspelled variants (word / wprd / wrod), derived word forms (word/words, arrive/arriving), and the order of and dependencies between words are all lost.
Use N-grams rather than only unigrams. In addition, you can use the stemming approach mentioned here: http://blog.csdn.net/mmc2015/article/details/46730289
Here is an example, using the char_wb analyzer:
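A minimal sketch following the character-bigram example in the scikit-learn docs: even for the misspelled pair 'words' / 'wprds', most character bigrams are preserved.

from sklearn.feature_extraction.text import CountVectorizer

# char_wb builds character n-grams only from text inside word boundaries,
# padding each word with a space on both sides
ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2))
counts = ngram_vectorizer.fit_transform(['words', 'wprds'])

print(ngram_vectorizer.get_feature_names_out())
# [' w' 'ds' 'or' 'pr' 'rd' 's ' 'wo' 'wp']
# 4 of the 8 bigram features occur in both spellings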
The remaining three parts will be written when there is time...
8. Vectorizing a large text corpus with the hashing trick
9. Performing out-of-core scaling with HashingVectorizer
10. Customizing the vectorizer classes
scikit-learn:4.2.3. Text feature extraction
Original article: http://blog.csdn.net/mmc2015/article/details/46997379