scikit-learn: What CountVectorizer does when extracting tf

Posted: 2015-07-13 22:35:59

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer

class sklearn.feature_extraction.text.CountVectorizer(input=u'content', encoding=u'utf-8', decode_error=u'strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=u'(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer=u'word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<type 'numpy.int64'>)

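Purpose: Convert a collection of text documents to a matrix of token counts (that is, count how often each term occurs, the tf). The result is stored as a sparse matrix; the documentation at the time of writing names scipy.sparse.coo_matrix, while recent scikit-learn documentation names scipy.sparse.csr_matrix.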

The parameters show what CountVectorizer does when extracting tf:

strip_accents : {'ascii', 'unicode', None}: whether to strip accents (diacritical marks, as in é becoming e) during preprocessing. Not sure what an accent is? See: http://textmechanic.com/?reqp=1&reqr=nzcdYz9hqaSbYaOvrt==
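To make "accents" concrete, here is a minimal standard-library sketch of what stripping them means. It is not scikit-learn's code: the helper name `strip_accents_sketch` is made up for illustration, and it only approximates the idea behind the 'unicode' option by decomposing characters and dropping the combining marks.

```python
import unicodedata

def strip_accents_sketch(text):
    """Hypothetical helper: remove accents via NFKD decomposition."""
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents_sketch("café naïve"))  # cafe naive
```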

lowercase : boolean, True by default: convert all characters to lowercase before counting tf. This is usually left as True.

preprocessor : callable or None (default): overrides the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. You can supply your own function here.

tokenizer : callable or None (default): overrides the string tokenization step while preserving the preprocessing and n-grams generation steps. You can supply your own function here.
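As an illustration of "writing your own", the sketch below defines two plain Python callables of the kind these parameters accept. The names `my_preprocessor` and `my_tokenizer` and their behaviour are invented for this example; in principle they would be passed as `CountVectorizer(preprocessor=my_preprocessor, tokenizer=my_tokenizer)`, but here they are simply chained by hand to show the order of the two stages.

```python
import re

def my_preprocessor(text):
    """Hypothetical preprocessor: lowercase, then replace digit runs with NUM."""
    return re.sub(r"\d+", "NUM", text.lower())

def my_tokenizer(text):
    """Hypothetical tokenizer: split on whitespace only."""
    return text.split()

# Preprocessing runs first, tokenization second.
tokens = my_tokenizer(my_preprocessor("Call 911 NOW"))
print(tokens)  # ['call', 'NUM', 'now']
```

Note that a custom preprocessor takes over the whole string-transformation stage, so it has to do its own lowercasing if that is still wanted.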

stop_words : string {'english'}, list, or None (default): if 'english', a built-in stop word list for English is used. If a list, every stop word in that list is removed from the resulting tokens. If None, no stop words are removed; in that case max_df can be set to a value in the range [0.7, 1.0) to detect and filter stop words automatically, based on the intra-corpus document frequency (df) of terms. Tune this to your needs.
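The list case amounts to a simple membership filter on the tokens. A minimal sketch, with an illustrative stop word list and token sequence that are not taken from scikit-learn:

```python
# Illustrative stop word list (not scikit-learn's built-in English list).
stop_words = {"the", "is", "a"}

tokens = ["the", "cat", "is", "on", "a", "mat"]

# Drop every token that appears in the stop word list.
kept = [t for t in tokens if t not in stop_words]
print(kept)  # ['cat', 'on', 'mat']
```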

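token_pattern : string: a regular expression describing what counts as a token. The default selects tokens of 2 or more alphanumeric characters (punctuation is ignored and treated as a token separator). Only used when analyzer is set to 'word'.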
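The effect of the default pattern can be checked directly with Python's `re` module. This sketch applies the pattern from the signature above to a made-up sentence; it shows only the regular expression, not the rest of the vectorizer.

```python
import re

# Default token_pattern from the CountVectorizer signature.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

tokens = token_pattern.findall("I am a fan of scikit-learn 0.16")
print(tokens)  # ['am', 'fan', 'of', 'scikit', 'learn', '16']
```

Single-character tokens such as "I", "a" and "0" are dropped, and the hyphen and the dot act as separators.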

ngram_range : tuple (min_n, max_n): the lower and upper bounds on n. The default is ngram_range=(1, 1); every n-gram with min_n <= n <= max_n is extracted as a feature. Tune this to your needs.
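What "every n-gram in the range" means can be sketched in a few lines. The function `word_ngrams` below is a hypothetical helper written for this note, not the library's implementation; it joins neighbouring tokens with a space for each n between min_n and max_n.

```python
def word_ngrams(tokens, ngram_range=(1, 1)):
    """Hypothetical helper: all word n-grams with min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the token list.
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

print(word_ngrams(["new", "york", "city"], ngram_range=(1, 2)))
# ['new', 'york', 'city', 'new york', 'york city']
```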

analyzer : string, {'word', 'char', 'char_wb'} or callable: whether features are built from word n-grams or character n-grams. If a callable is passed, it is your own function for extracting features from the raw, unprocessed input.

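max_df : float in range [0.0, 1.0] or int, default=1.0

min_df : float in range [0.0, 1.0] or int, default=1: together these drop word tokens whose document frequency (df) is above max_df or below min_df. A float is read as a proportion of documents, an int as an absolute document count. Both only take effect when vocabulary is None.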
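The df filter can be sketched with a `Counter` over the set of terms in each document. The corpus and thresholds below are made up, and the code is a simplified illustration of the rule described above rather than scikit-learn's own code: a float threshold is scaled by the number of documents, an int is used as is.

```python
from collections import Counter

# Toy corpus, already tokenized (illustrative data).
docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "sat"], ["the", "bird"]]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))

max_df = 0.75  # float: at most 75% of the documents
min_df = 2     # int: at least 2 documents

max_doc_count = max_df * n_docs if isinstance(max_df, float) else max_df
min_doc_count = min_df * n_docs if isinstance(min_df, float) else min_df

kept = sorted(t for t, c in df.items() if min_doc_count <= c <= max_doc_count)
print(kept)  # ['cat']
```

Here "the" occurs in all 4 documents and exceeds the max_df limit of 3, while "dog", "sat" and "bird" occur in only 1 document and fall below min_df.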

max_features : int or None, default=None: keep only the top max_features features, ordered by term frequency across the corpus. Only takes effect when vocabulary is None.
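A rough sketch of the selection, assuming a toy tokenized corpus: count each term over the whole corpus and keep the `max_features` most frequent ones. How ties are broken inside scikit-learn is not covered here.

```python
from collections import Counter

# Toy corpus, already tokenized (illustrative data).
docs = [["apple", "apple", "banana"], ["apple", "banana", "cherry"]]

# Term frequency summed over the whole corpus.
tf = Counter(term for doc in docs for term in doc)

max_features = 2
top_terms = [term for term, _ in tf.most_common(max_features)]
print(top_terms)  # ['apple', 'banana']
```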

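vocabulary : Mapping or iterable, optional: a user-defined set of word tokens to use as features. If it is not None, tf is computed only for the terms in vocabulary. Leaving it as None is usually the safer choice.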
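With a fixed vocabulary, counting reduces to looking each token up in the mapping and ignoring everything else. A minimal sketch with an invented term-to-column mapping:

```python
# Hypothetical fixed vocabulary: term -> column index.
vocabulary = {"cat": 0, "dog": 1}

tokens = ["the", "cat", "chased", "the", "cat"]

# One row of the count matrix, one column per vocabulary term.
row = [0] * len(vocabulary)
for token in tokens:
    index = vocabulary.get(token)
    if index is not None:  # tokens outside the vocabulary are ignored
        row[index] += 1

print(row)  # [2, 0]
```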

binary : boolean, default=False: if True, all non-zero counts are set to 1, so tf only records presence (1) or absence (0). Useful for discrete probabilistic models that model binary events rather than integer counts.

dtype : type, optional: type of the matrix returned by fit_transform() or transform().

Conclusion:

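So, when extracting tf, CountVectorizer does the following: strips accents (if strip_accents is set), converts to lowercase, removes stop words (if stop_words is set), extracts every feature within ngram_range on a word basis (rather than a character basis, though that is configurable), and drops the tf of any feature filtered out by max_df, min_df, or max_features. You can also choose to make tf binary.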

With that, you can judge with some confidence whether CountVectorizer's output is what you actually want. Haha.

Finally, a look at two methods:

`fit`(raw_documents[, y])	Learn a vocabulary dictionary of all tokens in the raw documents.
`fit_transform`(raw_documents[, y])	Learn the vocabulary dictionary and return term-document matrix.

fit(raw_documents, y=None)

Learn a vocabulary dictionary of all tokens in the raw documents.

Parameters:	raw_documents : iterable. An iterable which yields either str, unicode or file objects.
Returns:	self

fit_transform(raw_documents, y=None)

Learn the vocabulary dictionary and return term-document matrix.

This is equivalent to fit followed by transform, but more efficiently implemented.

Parameters:	raw_documents : iterable. An iterable which yields either str, unicode or file objects.
Returns:	X : array, [n_samples, n_features]. Document-term matrix.
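To tie the two methods together, here is a small usage sketch with default parameters. It assumes scikit-learn is installed; the two example sentences are made up. The learned mapping is read from the fitted `vocabulary_` attribute, and `transform` then reuses that vocabulary on new text.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat on the mat"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # learn the vocabulary and count in one pass

# Terms ordered by their column index in the document-term matrix.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(terms)        # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(X.toarray())  # [[1 0 0 0 1 1]
                    #  [0 1 1 1 1 2]]

# transform() reuses the learned vocabulary; the unseen word "bird" is ignored.
X_new = vectorizer.transform(["the bird sat"])
print(X_new.toarray())  # [[0 0 0 0 1 1]]
```

The matrix is sparse, so `toarray()` is only used here to make the small result readable.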


Original article: http://blog.csdn.net/mmc2015/article/details/46866537
