http://blog.csdn.net/pipisorry/article/details/51525308
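This post shows how to classify (cluster) documents with Gibbs sampling. For a more involved application, see how Gibbs sampling is used to sample the topic distributions of LDA [Topic Model: Latent Dirichlet Allocation (LDA)].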
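Most introductions to Gibbs sampling, such as [Random Sampling and Stochastic Simulation: Gibbs Sampling], stop at a detailed description of the sampler (the why) and never explain how it is actually implemented (the how).
Concretely, how do we sample from the following conditional distribution?
$$P(Z_i \mid z_1^{(t+1)}, \ldots, z_{i-1}^{(t+1)}, z_{i+1}^{(t)}, \ldots, z_r^{(t)}) = \frac{P(z_1^{(t+1)}, \ldots, z_{i-1}^{(t+1)}, z_i, z_{i+1}^{(t)}, \ldots, z_r^{(t)})}{P(z_1^{(t+1)}, \ldots, z_{i-1}^{(t+1)}, z_{i+1}^{(t)}, \ldots, z_r^{(t)})} \tag{14}$$
How do we handle continuous parameters in the model?
How do we produce the expected value of the quantity we actually care about, rather than merely taking T steps of a random walk?
Below we build a Gibbs sampler for Naïve Bayes [Probabilistic Graphical Models: Bayesian Networks and Naive Bayes Networks]. Two main questions arise: how do we exploit conjugate priors, and how do we actually draw samples from the conditional in equation (14)?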
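We use the naive Bayes framework and Gibbs sampling to classify documents, both without and with supervision. The features are the words in each document, and the quantity to predict is a document-level class label (for example a sentiment label) that takes the value 0 or 1.
We first run the sampler on unlabeled data; the simplification for labeled data is explained later.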
Following Pedersen [T. Pedersen. Knowledge lean word sense disambiguation. In AAAI/IAAI, page 814, 1997; T. Pedersen. Learning Probabilistic Models of Word Sense Disambiguation. PhD thesis, Southern Methodist University, 1998. http://arxiv.org/abs/0707.3972],
we're going to describe the Gibbs sampler in a completely unsupervised setting where no labels at all are provided as training data.
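Plate diagram of the naive Bayes model (Figure 4).
The variables have the following meanings:
• $N$: the number of documents; $V$: the vocabulary size
• $L_j$: the binary label of document $j$
• $W_j$: the bag of words of document $j$ (observed)
• $\pi$: the probability that a document has label 1
• $\theta_0$, $\theta_1$: the word distributions of the documents labeled 0 and 1
• $\gamma_\pi = \langle \gamma_{\pi 1}, \gamma_{\pi 0} \rangle$: the hyperparameters of the Beta prior on $\pi$
• $\gamma_\theta$: the $V$-dimensional hyperparameter vector of the Dirichlet prior on $\theta_0$ and $\theta_1$
• $C_x$: the set of documents with label $x$ (the same symbol is also used for its size)
Given a document $W_j$, we want to choose the label $L$ that maximizes the following probability:
$$L = \arg\max_{L} P(L \mid W_j) = \arg\max_{L} \frac{P(W_j \mid L)\,P(L)}{P(W_j)} = \arg\max_{L} P(W_j \mid L)\,P(L)$$
Here $P(L)$ is a Bernoulli distribution whose parameter $\pi$ is the probability of label 1.
Where does $\pi$ come from?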
Hyperparameters are the parameters of a prior, which is itself used to pick the parameters of the model.
Our generative story assumes that, before this whole process began, we also picked $\pi$ randomly. Specifically, we assume that $\pi$ is sampled from a Beta distribution with parameters $\gamma_{\pi 1}$ and $\gamma_{\pi 0}$.
In Figure 4 these two hyperparameters are represented as a single two-dimensional vector $\gamma_\pi = \langle \gamma_{\pi 1}, \gamma_{\pi 0} \rangle$. When $\gamma_{\pi 1} = \gamma_{\pi 0} = 1$, Beta($\gamma_{\pi 1}, \gamma_{\pi 0}$) is just a uniform distribution, which means that any value for $\pi$ is equally likely. For this reason we call Beta(1, 1) an "uninformed prior".
Where do $\theta_0$ and $\theta_1$ come from?
Let $\gamma_\theta$ be a $V$-dimensional vector in which every dimension equals 1. If $\theta_0$ is sampled from Dirichlet($\gamma_\theta$), every probability distribution over words is equally likely. Similarly, we assume $\theta_1$ is sampled from Dirichlet($\gamma_\theta$).
Note: $\theta_0$ is the word distribution of documents with label 0, and $\theta_1$ is the word distribution of documents with label 1. $\theta_0$ and $\theta_1$ are sampled separately; there is no assumption that they are related to each other at all.
State space
The variables that define the state space of the naive Bayes model are:
• one scalar-valued variable $\pi$ (the probability that a document's label is 1)
• two vector-valued variables, $\theta_0$ and $\theta_1$
• binary label variables $L$, one for each of the $N$ documents
We also have one vector variable $W_j$ for each of the $N$ documents, but these are observed variables, i.e. their values are already known (which is why $W_{jk}$ is shaded in Figure 4).
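Initialization
Pick a value $\pi$ by sampling from the Beta($\gamma_{\pi 1}, \gamma_{\pi 0}$) distribution. This gives the probability that a document's label is 1, and therefore the Bernoulli distribution $(\pi, 1 - \pi)$ over each document's label.
Then, for each $j$, flip a coin with success probability $\pi$ and assign the label $L_j^{(0)}$, that is, the label of document $j$ at the 0th iteration, based on the outcome of the coin flip. In other words, each initial label is a draw from the Bernoulli distribution obtained in the previous step.
Similarly, you also need to initialize $\theta_0$ and $\theta_1$ by sampling from Dirichlet($\gamma_\theta$).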
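The initialization step can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `initialize`, its arguments, and the choice to store $\theta_0$ and $\theta_1$ as the two rows of one array are assumptions made here, not part of the original description.

```python
import numpy as np

def initialize(num_docs, vocab_size, gamma_pi=(1.0, 1.0), gamma_theta=None, rng=None):
    """Draw an initial state (labels, theta) from the priors.

    gamma_pi is (gamma_pi1, gamma_pi0); gamma_theta defaults to all ones,
    i.e. the uninformed Dirichlet prior. All names are illustrative.
    """
    rng = np.random.default_rng() if rng is None else rng
    if gamma_theta is None:
        gamma_theta = np.ones(vocab_size)
    gamma_pi1, gamma_pi0 = gamma_pi
    pi = rng.beta(gamma_pi1, gamma_pi0)                # P(label = 1)
    labels = (rng.random(num_docs) < pi).astype(int)   # one coin flip per document
    theta = rng.dirichlet(gamma_theta, size=2)         # row x holds theta_x
    return labels, theta

labels, theta = initialize(num_docs=6, vocab_size=4, rng=np.random.default_rng(0))
```

Note that $\pi$ is only needed to draw the initial labels; it does not have to be kept as part of the state once it is integrated out.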
For each iteration $t = 1 \ldots T$ of sampling, we update every variable defining the state space by sampling from its conditional distribution given the other variables, as described in equation (14).
The derivation proceeds as follows:
• We define the joint distribution of all the variables, corresponding to the numerator in (14).
• We simplify our expression for the joint distribution.
• We use our final expression of the joint distribution to define how to sample from the conditional distribution in (14).
• We give the final form of the sampler as pseudocode.
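The joint distribution of the model over the whole document collection is
$$P(C, L, \pi, \theta_0, \theta_1; \gamma_{\pi 1}, \gamma_{\pi 0}, \gamma_\theta)$$
Note: everything to the right of the semicolon is a parameter of the joint distribution; in other words, the variables on the left are conditioned on the hyperparameters on the right.
Using the graphical model, the joint distribution factors as:
$$P(\pi \mid \gamma_{\pi 1}, \gamma_{\pi 0})\; P(L \mid \pi)\; P(\theta_0 \mid \gamma_\theta)\; P(\theta_1 \mid \gamma_\theta)\; P(C_0 \mid \theta_0, L)\; P(C_1 \mid \theta_1, L)$$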
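Factor 1 (the Beta prior on $\pi$):
$$P(\pi \mid \gamma_{\pi 1}, \gamma_{\pi 0}) = \frac{\Gamma(\gamma_{\pi 1} + \gamma_{\pi 0})}{\Gamma(\gamma_{\pi 1})\,\Gamma(\gamma_{\pi 0})}\, \pi^{\gamma_{\pi 1} - 1} (1 - \pi)^{\gamma_{\pi 0} - 1} \tag{19}$$
Factor 2 (the labels given $\pi$, where $C_1$ and $C_0$ are the numbers of documents labeled 1 and 0):
$$P(L \mid \pi) = \prod_{n=1}^{N} \pi^{L_n} (1 - \pi)^{1 - L_n} = \pi^{C_1} (1 - \pi)^{C_0} \tag{21}$$
Factor 3 (the Dirichlet prior on the word distributions):
$$P(\theta \mid \gamma_\theta) = \frac{\Gamma\!\left(\sum_{i=1}^{V} \gamma_{\theta i}\right)}{\prod_{i=1}^{V} \Gamma(\gamma_{\theta i})} \prod_{i=1}^{V} \theta_i^{\gamma_{\theta i} - 1} \propto \prod_{i=1}^{V} \theta_i^{\gamma_{\theta i} - 1} \tag{24}$$
Factor 4:
$P(C_0 \mid \theta_0, L)$ and $P(C_1 \mid \theta_1, L)$: the probabilities of generating the contents of the bags of words in each of the two document classes.
For a single document $W_n$, let $\theta = \theta_{L_n}$:
$$P(W_n \mid L, \theta_{L_n}) = \prod_{i=1}^{V} \theta_i^{W_{ni}} \tag{25}$$
$W_{ni}$: the frequency of word $i$ in $W_n$.
Documents are independent of one another, so the combined probability of all documents in the same class is:
$$P(C_x \mid L, \theta_x) = \prod_{n \in C_x} \prod_{i=1}^{V} \theta_{x,i}^{W_{ni}} = \prod_{i=1}^{V} \theta_{x,i}^{N_{C_x}(i)} \tag{27}$$
Note: $N_{C_x}(i)$ is the count of word $i$ in the documents with class label $x$.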
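Using equations (19) and (21):
$$P(\pi \mid L; \gamma_{\pi 1}, \gamma_{\pi 0}) \propto P(L \mid \pi)\, P(\pi \mid \gamma_{\pi 1}, \gamma_{\pi 0}) \propto \pi^{C_1 + \gamma_{\pi 1} - 1} (1 - \pi)^{C_0 + \gamma_{\pi 0} - 1} \tag{30}$$
Using equations (24) and (25), for a single document $W_n$:
$$P(\theta \mid W_n; \gamma_\theta) \propto \prod_{i=1}^{V} \theta_i^{W_{ni} + \gamma_{\theta i} - 1}$$
Using the words of all documents in a class (that is, equations (24) and (27)):
$$P(\theta_x \mid C_x; \gamma_\theta) \propto \prod_{i=1}^{V} \theta_{x,i}^{N_{C_x}(i) + \gamma_{\theta i} - 1} \tag{32}$$
The posterior (30) is therefore an unnormalized Beta distribution with parameters $C_1 + \gamma_{\pi 1}$ and $C_0 + \gamma_{\pi 0}$, and (32) is an unnormalized Dirichlet distribution with parameter vector $N_{C_x}(i) + \gamma_{\theta i}$ for $1 \le i \le V$.
In other words, the prior and the posterior have the same form: the Beta distribution is the conjugate prior of the binomial (and Bernoulli) distribution, and the Dirichlet distribution is the conjugate prior of the multinomial distribution.
The hyperparameters behave like observed evidence: they act as pseudocounts.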
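The pseudocount reading of conjugacy can be made concrete with a small sketch: the posterior parameters are simply observed counts plus hyperparameters. The helper names and the toy data below are hypothetical and serve only as illustration.

```python
import numpy as np

def beta_posterior_params(labels, gamma_pi1=1.0, gamma_pi0=1.0):
    """Parameters (C_1 + gamma_pi1, C_0 + gamma_pi0) of the Beta posterior on pi."""
    labels = np.asarray(labels)
    c1 = int(labels.sum())          # number of documents labeled 1
    c0 = labels.size - c1           # number of documents labeled 0
    return c1 + gamma_pi1, c0 + gamma_pi0

def dirichlet_posterior_params(word_counts, labels, x, gamma_theta):
    """Parameter vector N_{C_x}(i) + gamma_theta_i of the Dirichlet posterior on theta_x."""
    word_counts = np.asarray(word_counts)
    labels = np.asarray(labels)
    class_counts = word_counts[labels == x].sum(axis=0)  # N_{C_x}(i) for every word i
    return class_counts + gamma_theta

# Toy corpus: 4 documents over a 3-word vocabulary (rows are word-count vectors).
labels = np.array([1, 0, 1, 1])
word_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 0], [0, 0, 2]])
beta_params = beta_posterior_params(labels)
dir_params_1 = dirichlet_posterior_params(word_counts, labels, 1, np.ones(3))
```

With the uninformed priors, every hyperparameter contributes one pseudo-observation to each count.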
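Combining these results, the joint distribution over the whole document collection can be written as:
$$P(C, L, \pi, \theta_0, \theta_1; \gamma_\pi, \gamma_\theta) \propto \frac{\Gamma(\gamma_{\pi 1} + \gamma_{\pi 0})}{\Gamma(\gamma_{\pi 1})\,\Gamma(\gamma_{\pi 0})}\, \pi^{C_1 + \gamma_{\pi 1} - 1} (1 - \pi)^{C_0 + \gamma_{\pi 0} - 1} \prod_{i=1}^{V} \theta_{0,i}^{N_{C_0}(i) + \gamma_{\theta i} - 1}\, \theta_{1,i}^{N_{C_1}(i) + \gamma_{\theta i} - 1}$$
Why integrate out $\pi$: doing so reduces the effective number of parameters in the model. This has the effect of taking all possible values of $\pi$ into account in our sampler, without representing it as a variable explicitly and having to sample it at every iteration. Intuitively, "integrating out" a variable is an application of precisely the same principle as computing the marginal probability for a discrete distribution. As a result, $\pi$ is "there" conceptually, in terms of our understanding of the model, but we don't need to deal with manipulating it explicitly as a parameter.
Note: "integrating out" $\pi$ means
$$P(C, L, \theta_0, \theta_1; \gamma_\pi, \gamma_\theta) = \int_{\pi} P(C, L, \pi, \theta_0, \theta_1; \gamma_\pi, \gamma_\theta)\, d\pi$$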
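The marginal of the joint distribution is therefore:
$$P(C, L, \theta_0, \theta_1; \gamma_\pi, \gamma_\theta) \propto \frac{\Gamma(\gamma_{\pi 1} + \gamma_{\pi 0})}{\Gamma(\gamma_{\pi 1})\,\Gamma(\gamma_{\pi 0})} \prod_{i=1}^{V} \theta_{0,i}^{N_{C_0}(i) + \gamma_{\theta i} - 1}\, \theta_{1,i}^{N_{C_1}(i) + \gamma_{\theta i} - 1} \int_{\pi} \pi^{C_1 + \gamma_{\pi 1} - 1} (1 - \pi)^{C_0 + \gamma_{\pi 0} - 1}\, d\pi \tag{38}$$
Consider only the integral. The integrand in (38) is an unnormalized Beta distribution with parameters $C_1 + \gamma_{\pi 1}$ and $C_0 + \gamma_{\pi 0}$, so its integral is the normalizing constant of Beta($C_1 + \gamma_{\pi 1}$, $C_0 + \gamma_{\pi 0}$):
$$\int_{\pi} \pi^{C_1 + \gamma_{\pi 1} - 1} (1 - \pi)^{C_0 + \gamma_{\pi 0} - 1}\, d\pi = \frac{\Gamma(C_1 + \gamma_{\pi 1})\,\Gamma(C_0 + \gamma_{\pi 0})}{\Gamma(C_0 + C_1 + \gamma_{\pi 1} + \gamma_{\pi 0})}$$
Let $N = C_0 + C_1$. Substituting into (38), the joint distribution over the whole document collection becomes a product of three factors:
$$P(C, L, \theta_0, \theta_1; \gamma_\pi, \gamma_\theta) \propto \frac{\Gamma(\gamma_{\pi 1} + \gamma_{\pi 0})}{\Gamma(\gamma_{\pi 1})\,\Gamma(\gamma_{\pi 0})} \cdot \frac{\Gamma(C_1 + \gamma_{\pi 1})\,\Gamma(C_0 + \gamma_{\pi 0})}{\Gamma(N + \gamma_{\pi 1} + \gamma_{\pi 0})} \cdot \prod_{i=1}^{V} \theta_{0,i}^{N_{C_0}(i) + \gamma_{\theta i} - 1}\, \theta_{1,i}^{N_{C_1}(i) + \gamma_{\theta i} - 1} \tag{40}$$
where $N = C_0 + C_1$.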
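Gibbs sampling assigns a new value to each variable $Z_i$ by sampling from its conditional distribution given the current values of all the other variables.
For example, to assign a new value to a document label, say $L_1^{(t+1)}$, we need the conditional distribution
$$P(L_1 \mid L_2^{(t)}, \ldots, L_N^{(t)}, C, \theta_0^{(t)}, \theta_1^{(t)}; \gamma_\pi, \gamma_\theta)$$
Note: There's no superscript on the bags of words $C$ because they're fully observed and don't change from iteration to iteration.
To assign a new value to $\theta_0$, we need the conditional distribution
$$P(\theta_0 \mid L_1^{(t+1)}, \ldots, L_N^{(t+1)}, C, \theta_1^{(t)}; \gamma_\pi, \gamma_\theta)$$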
Intuitively, at the start of each iteration $t$ we have the following information available:
the word counts of every document, the number of documents labeled 0, the number of documents labeled 1, the current label of every document, the current values of $\theta_0$ and $\theta_1$, and so on.
Sampling a label: When we want to sample the new label for document $j$, we temporarily remove all information (i.e. word counts and label information) about this document from that collection of information. Then we look at the conditional probability that $L_j = 0$ given all the remaining information, and the conditional probability that $L_j = 1$ given the same information, and we sample the new label $L_j^{(t+1)}$ by choosing randomly according to the relative weight of those two conditional probabilities.
Sampling $\theta$: Sampling to get the new values of $\theta_0$ and $\theta_1$ operates according to the same principle.
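Define the conditional probability
$$P(L_j \mid L^{(-j)}, C^{(-j)}, \theta_0, \theta_1; \gamma_\pi, \gamma_\theta) = \frac{P(L_j, W_j, L^{(-j)}, C^{(-j)}, \theta_0, \theta_1; \gamma_\pi, \gamma_\theta)}{P(L^{(-j)}, C^{(-j)}, \theta_0, \theta_1; \gamma_\pi, \gamma_\theta)} \tag{42}$$
$L^{(-j)}$ are all the document labels except $L_j$, and $C^{(-j)}$ is the set of all documents except $W_j$.
The numerator is the full joint distribution and the denominator is the same expression with the information about $W_j$ removed, so we only need to consider the three factors of equation (40).
All we really have to do is work out what changes when $W_j$ is removed.
Factor 1 depends only on the hyperparameters, so it is identical in the numerator and the denominator and cancels. That leaves only factors 2 and 3 of equation (40).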
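The denominator of factor 2 in (42) depends on the value $L_j$ had in the previous iteration.
In every case, however, the corpus size goes from $N$ to $N - 1$, and the count of exactly one document class decreases by 1. For example, if $L_j = 0$ then $C_0^{(-j)} = C_0 - 1$ and $C_1^{(-j)} = C_1$, so only one of the $C_x$ changes.
Let $x$ be the class for which $C_x^{(-j)} = C_x - 1$. Factor 2 of (42) can then be rewritten as:
$$\frac{\Gamma(C_x + \gamma_{\pi x})\,\Gamma(N + \gamma_{\pi 1} + \gamma_{\pi 0} - 1)}{\Gamma(N + \gamma_{\pi 1} + \gamma_{\pi 0})\,\Gamma(C_x + \gamma_{\pi x} - 1)}$$
Since $\Gamma(a + 1) = a\,\Gamma(a)$ for all $a$,
factor 2 of (42) simplifies to:
$$\frac{C_x + \gamma_{\pi x} - 1}{N + \gamma_{\pi 1} + \gamma_{\pi 0} - 1}$$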
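As with factor 2, the term for one of the classes never changes: in factor 3 of (42), the term for either $\theta_0$ or $\theta_1$ is identical in the numerator and the denominator and cancels, leaving only the words of $W_j$ under the remaining class.
For $x \in \{0, 1\}$, combining the factors gives the conditional distribution used to sample a document's label:
$$P(L_j = x \mid L^{(-j)}, C^{(-j)}, \theta_0, \theta_1; \gamma_\pi, \gamma_\theta) \propto \frac{C_x + \gamma_{\pi x} - 1}{N + \gamma_{\pi 1} + \gamma_{\pi 0} - 1} \prod_{i=1}^{V} \theta_{x,i}^{W_{ji}} \tag{49}$$
Here $C_x$ counts document $j$ as a member of class $x$, so $C_x - 1$ is the number of other documents currently labeled $x$.
Equation (49) shows how a document's label is chosen:
Factor 1 of (49): how likely $L_j = x$ is, considering only the distribution of the other labels.
Factor 2 of (49): acts like a word distribution "fitting room", an indication of how well the words in $W_j$ "fit" with each of the two distributions.
Note: step 3 of the algorithm normalizes the probabilities of the two labels.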
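The label update described above (remove document $j$, weigh the two classes, normalize) can be sketched as follows. The function `label_probabilities`, its argument layout, and the use of log-space arithmetic for numerical stability are assumptions of this sketch rather than details from the original text; it also assumes every entry of `theta` is strictly positive.

```python
import numpy as np

def label_probabilities(j, word_counts, labels, theta, gamma_pi=(1.0, 1.0)):
    """Normalized P(L_j = 0) and P(L_j = 1) given everything except document j.

    word_counts: (N, V) array of word counts, labels: length-N 0/1 array,
    theta: (2, V) array whose row x is theta_x, gamma_pi: (gamma_pi1, gamma_pi0).
    """
    gamma_pi1, gamma_pi0 = gamma_pi
    n_docs = len(labels)
    others = np.delete(labels, j)                       # labels with document j removed
    c_minus = np.array([np.sum(others == 0), np.sum(others == 1)])
    prior = np.array([gamma_pi0, gamma_pi1])            # index x -> gamma_pi_x
    # Factor 1: (C_x^{(-j)} + gamma_pi_x) / (N - 1 + gamma_pi1 + gamma_pi0)
    log_p = np.log(c_minus + prior) - np.log(n_docs - 1 + gamma_pi1 + gamma_pi0)
    # Factor 2: prod_i theta_{x,i}^{W_ji}, computed in log space
    log_p = log_p + word_counts[j] @ np.log(theta).T
    p = np.exp(log_p - log_p.max())                     # subtract max to avoid underflow
    return p / p.sum()                                  # normalization step

word_counts = np.array([[3, 0], [0, 3], [2, 1]])
labels = np.array([0, 1, 0])
theta = np.array([[0.9, 0.1], [0.1, 0.9]])
p = label_probabilities(0, word_counts, labels, theta)
new_label = np.random.default_rng(0).choice(2, p=p)     # sample L_j from the two weights
```

In this toy example the other two documents are split evenly between the classes, so the first factor is the same for both labels and the result is driven entirely by how well the words of document 0 fit each word distribution.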
Using labeled documents: just don't sample $L_j$ for those documents! Always keep $L_j$ equal to the observed label.
The labeled documents will effectively serve as "ground truth" evidence for the distributions that created them. Since we never sample their labels, they will always contribute to the counts in (49) and (51) and will never be subtracted out.
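Since the distributions of $\theta_0$ and $\theta_1$ are estimated independently, we drop the subscript on $\theta$ here.
Clearly,
$$P(\theta \mid C, L; \gamma_\theta) \propto P(C \mid \theta, L)\, P(\theta \mid \gamma_\theta) \propto \prod_{i=1}^{V} \theta_i^{N_{C_x}(i) + \gamma_{\theta i} - 1}$$
Since we used conjugate priors, this posterior, like the prior, works out to be a Dirichlet distribution. We actually derived the full expression, but we don't need it here. All we need to do to sample a new distribution is to make another draw from a Dirichlet distribution, but this time with parameters $N_{C_x}(i) + \gamma_{\theta i}$ for each $i$ in $V$.
Define the $V$-dimensional vector $t$ such that each $t_i = N_{C_x}(i) + \gamma_{\theta i}$.
The sampling formula for the new $\theta$ is then:
$$\theta \sim \text{Dirichlet}(t)$$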
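To sample a random vector $a = \langle a_1, \ldots, a_V \rangle$ from the $V$-dimensional Dirichlet distribution with parameters $\langle \alpha_1, \ldots, \alpha_V \rangle$,
a fast approach is to draw $V$ independent samples $y_1, \ldots, y_V$ from gamma distributions, each with density
$$\text{Gamma}(\alpha_i, 1) = \frac{y_i^{\alpha_i - 1} e^{-y_i}}{\Gamma(\alpha_i)}$$
and then normalize the gamma samples:
$$a_i = \frac{y_i}{\sum_{j=1}^{V} y_j}$$
[http://en.wikipedia.org/wiki/Dirichlet_distribution]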
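The gamma-normalization recipe translates almost directly into NumPy. This is a minimal sketch; the function name `sample_dirichlet` is made up for illustration, and in practice NumPy's built-in `Generator.dirichlet` does the same job.

```python
import numpy as np

def sample_dirichlet(alpha, rng=None):
    """Draw one sample from Dirichlet(alpha) by normalizing independent gamma draws."""
    rng = np.random.default_rng() if rng is None else rng
    alpha = np.asarray(alpha, dtype=float)
    y = rng.gamma(shape=alpha, scale=1.0)   # y_i ~ Gamma(alpha_i, 1), drawn independently
    return y / y.sum()                      # a_i = y_i / sum_j y_j

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])
draws = np.array([sample_dirichlet(alpha, rng) for _ in range(5000)])
empirical_mean = draws.mean(axis=0)         # should approach alpha / alpha.sum()
```

For the Gibbs sampler above, `alpha` would be the vector $t$ with $t_i = N_{C_x}(i) + \gamma_{\theta i}$.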
Hyperparameters:
$\gamma_\pi = \langle 1, 1 \rangle$ is an uninformed prior (the uniform distribution).
Let $\gamma_\theta$ be a $V$-dimensional vector in which every dimension equals 1; this is also an uninformed prior.
Model initialization:
Pick a value $\pi$ by sampling from the Beta($\gamma_{\pi 1}, \gamma_{\pi 0}$) distribution, which fixes the Bernoulli distribution $(\pi, 1 - \pi)$ over document labels.
Then, for each $j$, flip a coin with success probability $\pi$ and assign the label $L_j^{(0)}$ (the label of document $j$ at the 0th iteration) based on the outcome. $\theta_0$ and $\theta_1$ are initialized by sampling from Dirichlet($\gamma_\theta$).
Model iteration:
2.5.1 Sampling the label of document j
Step 3 of the algorithm appears to contain a mistake: should the "not" be removed?
Note: as soon as a new label for $L_j$ is assigned, this changes the counts that will affect the labeling of the subsequent documents. This is, in fact, the whole principle behind a Gibbs sampler!
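One full iteration of the sampler can be sketched as below. This is an illustrative reading of the steps described in this post, not the pseudocode of the referenced paper: the function name `gibbs_sweep`, the `observed` mask for labeled documents, and the in-place count bookkeeping are all assumptions of this sketch. The constant denominator of the first factor is omitted because it cancels in the normalization.

```python
import numpy as np

def gibbs_sweep(word_counts, labels, theta, gamma_pi=(1.0, 1.0), gamma_theta=None,
                observed=None, rng=None):
    """Resample every unobserved label, then theta_0 and theta_1 (labels updated in place)."""
    rng = np.random.default_rng() if rng is None else rng
    n_docs, vocab_size = word_counts.shape
    gamma_theta = np.ones(vocab_size) if gamma_theta is None else gamma_theta
    observed = np.zeros(n_docs, dtype=bool) if observed is None else observed
    prior = np.array([gamma_pi[1], gamma_pi[0]])        # index x -> gamma_pi_x
    doc_counts = np.bincount(labels, minlength=2).astype(float)
    class_word_counts = np.stack(
        [word_counts[labels == x].sum(axis=0) for x in (0, 1)]).astype(float)
    log_theta = np.log(theta)                           # assumes theta > 0 everywhere
    for j in range(n_docs):
        if observed[j]:
            continue                                    # labeled document: never resampled
        # Temporarily remove document j from the counts.
        doc_counts[labels[j]] -= 1
        class_word_counts[labels[j]] -= word_counts[j]
        # Unnormalized log weights of the two labels.
        log_p = np.log(doc_counts + prior) + log_theta @ word_counts[j]
        p = np.exp(log_p - log_p.max())
        labels[j] = rng.choice(2, p=p / p.sum())
        # Add document j back under its new label; later documents see the change.
        doc_counts[labels[j]] += 1
        class_word_counts[labels[j]] += word_counts[j]
    # Resample each theta_x from Dirichlet(N_{C_x} + gamma_theta).
    theta = np.stack([rng.dirichlet(class_word_counts[x] + gamma_theta) for x in (0, 1)])
    return labels, theta

rng = np.random.default_rng(1)
word_counts = np.array([[5, 5, 0, 0], [4, 6, 0, 0], [0, 0, 5, 5], [0, 0, 6, 4]])
start_labels = np.array([0, 1, 1, 0])
start_theta = rng.dirichlet(np.ones(4), size=2)
new_labels, new_theta = gibbs_sweep(word_counts, start_labels.copy(), start_theta, rng=rng)
```

Passing `observed` as a boolean mask keeps the labeled documents fixed, which is the supervised variant: their counts always contribute and are never subtracted out.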
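Both the initialization and the sampling iterations of the Gibbs sampler produce a value for every variable (for iterations $t = 1, 2, \ldots, T$). In theory, the approximated value for any variable $Z_i$ can simply be obtained by calculating:
$$\frac{1}{T} \sum_{t=1}^{T} z_i^{(t)} \tag{59}$$
As we know, the Gibbs chain only samples from the stationary distribution once it has converged. In practice, therefore, the sum in (59) does not start at 1 but runs over $B + 1$ through $T$ (divided by $T - B$), and the samples with $t \le B$ are discarded as burn-in.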
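Averaging after burn-in takes only a few lines. The sketch below assumes the samples of one variable are stored in iteration order, with index 0 holding iteration 1; the function name `posterior_mean` and the optional `lag` (thinning) argument are illustrative additions.

```python
import numpy as np

def posterior_mean(samples, burn_in=0, lag=1):
    """Average the samples from iterations B+1..T, optionally keeping every lag-th one."""
    samples = np.asarray(samples, dtype=float)
    kept = samples[burn_in::lag]            # drop the first `burn_in` iterations
    if kept.shape[0] == 0:
        raise ValueError("burn_in discards every sample")
    return kept.mean(axis=0)

# Sampled labels of 3 documents over T = 6 iterations (one row per iteration).
label_samples = [[0, 1, 0], [1, 1, 0], [1, 0, 0], [1, 1, 0], [1, 1, 0], [1, 1, 1]]
estimate = posterior_mean(label_samples, burn_in=2)
```

For binary labels the average is the estimated probability that each document belongs to class 1, which can be thresholded to obtain a final classification.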
In this context, Jordan Boyd-Graber (personal communication) also recommends looking at Neal’s [15] discussion of likelihood as a metric of convergence.
2.6 Optional: A Note on Integrating out Continuous Parameters
In Section 3 we discuss how to actually obtain values from a Gibbs sampler, as opposed to merely watching it walk around the state space. (Which might be entertaining, but wasn’t really the point.) Our discussion includes convergence and burn-in, auto-correlation and lag, and other practical issues.
In Section 4 we conclude with pointers to other things you might find it useful to read, as well as an invitation to tell us how we could make this document more accurate or more useful.
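Finally, one open question from the author: can Gibbs sampling be used to sample from a continuous n-dimensional Gaussian distribution? If so, how would it be implemented: via the Markov blanket?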
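As a partial answer, Gibbs sampling does apply to continuous Gaussians, because each coordinate's conditional distribution given the others is itself a univariate Gaussian. The sketch below covers only the simplest case, a standard bivariate normal with correlation `rho`, where $x_1 \mid x_2 \sim \mathcal{N}(\rho x_2, 1 - \rho^2)$ and symmetrically for $x_2$; the function name and defaults are assumptions, and the general n-dimensional case needs the corresponding conditional mean and variance.

```python
import numpy as np

def gibbs_bivariate_normal(rho, num_samples, burn_in=500, rng=None):
    """Gibbs sampler for a zero-mean, unit-variance bivariate normal with correlation rho."""
    rng = np.random.default_rng() if rng is None else rng
    cond_std = np.sqrt(1.0 - rho ** 2)      # standard deviation of each conditional
    x1, x2 = 0.0, 0.0
    out = np.empty((num_samples, 2))
    for t in range(burn_in + num_samples):
        x1 = rng.normal(rho * x2, cond_std) # sample x1 | x2
        x2 = rng.normal(rho * x1, cond_std) # sample x2 | x1 (uses the new x1)
        if t >= burn_in:
            out[t - burn_in] = (x1, x2)
    return out

samples = gibbs_bivariate_normal(0.8, 20000, rng=np.random.default_rng(0))
sample_corr = np.corrcoef(samples.T)[0, 1]  # should be close to rho
```

With strongly correlated coordinates the chain mixes slowly, since consecutive samples are themselves correlated, which is one reason burn-in and lag matter in practice.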
from:
http://blog.csdn.net/pipisorry/article/details/51525308
ref: Philip Resnik and Eric Hardisty: Gibbs Sampling for the Uninitiated