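Introductory articles on Gibbs sampling tend to stop at a detailed description of the method, such as "Random Sampling and Stochastic Simulation: Gibbs Sampling", which covers the why but never shows how Gibbs sampling is actually implemented (the how).
The following shows how to build a Gibbs sampler for Naïve Bayes [Probabilistic Graphical Models: Bayesian Networks and Naïve Bayes Networks]. Two questions have to be answered along the way: how do we exploit conjugate priors, and how do we actually draw samples from the conditional distribution in equation (14)?
Within the Naïve Bayes framework, we use Gibbs sampling to classify documents (both unsupervised and supervised). Assume the features are the words in a document, and that what we want to predict is a document-level class label (a sentiment label) whose value is 0 or 1.
We’re going to describe the Gibbs sampler in a completely unsupervised setting where no labels at all are provided as training data.
Given a document, we want to choose the label L that maximizes the following probability: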
Hyperparameters: parameters of a prior, which is itself used to pick the parameters of the model.
Our generative story is going to assume that before this whole process began, we also picked $\pi$ randomly. Specifically, we’ll assume that $\pi$ is sampled from a Beta distribution with parameters $\gamma_{\pi 1}$ and $\gamma_{\pi 0}$.
In Figure 4 we represent these two hyperparameters as a single two-dimensional vector $\gamma_\pi = \langle \gamma_{\pi 1}, \gamma_{\pi 0} \rangle$. When $\gamma_{\pi 1} = \gamma_{\pi 0} = 1$, Beta($\gamma_{\pi 1}$, $\gamma_{\pi 0}$) is just a uniform distribution, which means that any value for $\pi$ is equally likely. For this reason we call Beta(1, 1) an “uninformed prior”.
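Where do $\theta_0$ and $\theta_1$ come from?
Let $\gamma_\theta$ be a V-dimensional vector where the value of every dimension equals 1. If $\theta_0$ is sampled from Dirichlet($\gamma_\theta$), every probability distribution over words will be equally likely. Similarly, we’ll assume $\theta_1$ is sampled from Dirichlet($\gamma_\theta$).
Note: $\theta_0$ is the probability distribution over words in documents with label 0, and $\theta_1$ is the probability distribution over words in documents with label 1. $\theta_0$ and $\theta_1$ are sampled separately. There’s no assumption that they are related to each other at all.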
• one scalar-valued variable $\pi$ (the probability that document j has label 1)
• two vector-valued variables, $\theta_0$ and $\theta_1$
• binary label variables $L$, one for each of the N documents
We also have one vector variable $W_j$ for each of the N documents, but these are observed variables, i.e. their values are already known (which is why $W_{jk}$ is shaded in Figure 4).
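Pick a value $\pi$ by sampling from the Beta($\gamma_{\pi 1}$, $\gamma_{\pi 0}$) distribution. This samples the probability that document j has label 1, which also fixes the Bernoulli distribution ($\pi$, $1 - \pi$) over the document's label.
Then, for each j, flip a coin with success probability $\pi$, and assign label $L_j^{(0)}$, that is, the label of document j at the 0th iteration, based on the outcome of the coin flip. In other words, sample each document's label from the Bernoulli distribution obtained in the previous step.
Similarly, you also need to initialize $\theta_0$ and $\theta_1$ by sampling from Dirichlet($\gamma_\theta$).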
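The initialization steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `initialize`, its arguments, and the default uninformed hyperparameters (all ones) are hypothetical choices made for this sketch, not something given in the source.

```python
import numpy as np

def initialize(n_docs, vocab_size, gamma_pi1=1.0, gamma_pi0=1.0, seed=0):
    """Draw an initial state (labels, theta_0, theta_1) for the sampler."""
    rng = np.random.default_rng(seed)
    # pi = probability that a document has label 1, drawn from the Beta prior
    pi = rng.beta(gamma_pi1, gamma_pi0)
    # One coin flip with success probability pi per document gives L_j^(0)
    labels = rng.binomial(1, pi, size=n_docs)
    # Assumed uninformed Dirichlet prior: every dimension of gamma_theta is 1
    gamma_theta = np.ones(vocab_size)
    # theta_0 and theta_1 are drawn separately from the same prior
    theta0 = rng.dirichlet(gamma_theta)
    theta1 = rng.dirichlet(gamma_theta)
    return labels, theta0, theta1
```

Note that `pi` is only used to produce the initial labels here; whether it is kept afterwards depends on whether it is later integrated out.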
For each iteration t = 1 ... T of sampling, we update every variable defining the state space by sampling from its conditional distribution given the other variables, as described in equation (14).
• We will define the joint distribution of all the variables, corresponding to the numerator in (14).
• We simplify our expression for the joint distribution.
• We use our final expression of the joint distribution to define how to sample from the conditional distribution in (14).
• We give the final form of the sampler as pseudocode.
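Note: the quantities to the right of the semicolon are the parameters of the joint distribution; that is, the variables to the left of the semicolon are conditioned on the hyperparameters to its right.
$P(C_0 \mid \theta_0, L)$ and $P(C_1 \mid \theta_1, L)$: the probabilities of generating the contents of the bags of words in each of the two document classes.
Let $\theta = \theta_{L_n}$:
$W_{ni}$: the frequency of word i in $W_n$.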
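As a rough illustration of how $\theta$ and the word frequencies $W_{ni}$ combine, the sketch below computes the log of $\prod_i \theta_i^{W_{ni}}$ for one document. It assumes the per-document probability takes this product form (the usual bag-of-words assumption, without a multinomial coefficient); the function name `doc_log_likelihood` is hypothetical.

```python
import numpy as np

def doc_log_likelihood(word_counts, theta):
    """Return sum_i W_ni * log(theta_i), the log of prod_i theta_i ** W_ni."""
    word_counts = np.asarray(word_counts, dtype=float)
    theta = np.asarray(theta, dtype=float)
    # Skip words that do not occur, so a zero count never multiplies log(0)
    present = word_counts > 0
    return float(np.sum(word_counts[present] * np.log(theta[present])))
```

Working in log space is a design choice of this sketch: products of many small probabilities underflow quickly for realistic document lengths.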
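Note: $N_{C_x}(i)$ is the count of word i in documents with class label x.
It follows that the posterior in equation (30) is an unnormalized Beta distribution with parameters $C_1 + \gamma_{\pi 1}$ and $C_0 + \gamma_{\pi 0}$, and that equation (32) is an unnormalized Dirichlet distribution with parameter vector $N_{C_x}(i) + \gamma_{\theta i}$ for $1 \le i \le V$.
In other words, the prior and the posterior have the same form: the Beta distribution is the conjugate prior of the binomial (and Bernoulli) distribution, and the Dirichlet distribution is the conjugate prior of the multinomial distribution.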
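To make the conjugacy concrete, here is a short worked step for the Beta case. It assumes the labels are independent Bernoulli draws given $\pi$, with $C_1$ documents labeled 1 and $C_0$ labeled 0; normalizing constants are dropped throughout.

```latex
\begin{aligned}
P(\pi \mid L; \gamma_\pi)
  &\propto P(L \mid \pi)\, P(\pi; \gamma_\pi) \\
  &\propto \pi^{C_1} (1-\pi)^{C_0} \cdot \pi^{\gamma_{\pi 1}-1} (1-\pi)^{\gamma_{\pi 0}-1} \\
  &= \pi^{C_1+\gamma_{\pi 1}-1} (1-\pi)^{C_0+\gamma_{\pi 0}-1}
\end{aligned}
```

The last line has exactly the shape of a Beta density with parameters $C_1 + \gamma_{\pi 1}$ and $C_0 + \gamma_{\pi 0}$: the observed counts are simply added to the prior's parameters.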
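Why: integrating out $\pi$ reduces the effective number of parameters in the model. This has the effect of taking all possible values of $\pi$ into account in our sampler, without representing it as a variable explicitly and having to sample it at every iteration. Intuitively, “integrating out” a variable is an application of precisely the same principle as computing the marginal probability for a discrete distribution. As a result, $\pi$ is “there” conceptually, in terms of our understanding of the model, but we don’t need to deal with manipulating it explicitly as a parameter.
Note: integrating out $\pi$ means marginalizing over it, i.e. replacing a joint distribution $P(\cdot, \pi)$ with $\int P(\cdot, \pi)\, d\pi$.
The integrand in the last term of equation (38) is an unnormalized Beta distribution with parameters $C_1 + \gamma_{\pi 1}$ and $C_0 + \gamma_{\pi 0}$, and its integral is the normalizing constant of Beta($C_1 + \gamma_{\pi 1}$, $C_0 + \gamma_{\pi 0}$): $\int_0^1 \pi^{C_1+\gamma_{\pi 1}-1} (1-\pi)^{C_0+\gamma_{\pi 0}-1}\, d\pi = \frac{\Gamma(C_1+\gamma_{\pi 1})\,\Gamma(C_0+\gamma_{\pi 0})}{\Gamma(C_0+C_1+\gamma_{\pi 1}+\gamma_{\pi 0})}$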
Let $N = C_0 + C_1$.
Note: There’s no superscript on the bags of words C because they’re fully observed and don’t change from iteration to iteration.
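To compute $\theta_0$, we need to compute its conditional distribution.
The information involved includes the word counts for each document, the number of documents labeled 0, the number of documents labeled 1, the current label of each document, the current values of $\theta_0$ and $\theta_1$, and so on.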
Sampling the labels: When we want to sample the new label for document j, we temporarily remove all information (i.e. word counts and label information) about this document from that collection of information. Then we look at the conditional probability that $L_j = 0$ given all the remaining information, and the conditional probability that $L_j = 1$ given the same information, and we sample the new label $L_j^{(t+1)}$ by choosing randomly according to the relative weight of those two conditional probabilities.
Sampling $\theta$: Sampling to get the new values operates according to the same principle.
$L^{(-j)}$ are all the document labels except $L_j$, and $C^{(-j)}$ is the set of all documents except $W_j$.
Let x be the class for which $C_x^{(-j)} = C_x - 1$. Factor 2 of equation (42) can then be rewritten as:
Also, $\Gamma(a + 1) = a\,\Gamma(a)$ for all a.
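This recurrence is what collapses ratios of Gamma functions into plain counts. As a small worked step, assume the factor contains a ratio in which the count for class x differs by exactly one between numerator and denominator (which is what $C_x^{(-j)} = C_x - 1$ suggests). Taking $a = C_x + \gamma_{\pi x} - 1$:

```latex
\frac{\Gamma(C_x + \gamma_{\pi x})}{\Gamma(C_x + \gamma_{\pi x} - 1)}
  = \frac{(C_x + \gamma_{\pi x} - 1)\,\Gamma(C_x + \gamma_{\pi x} - 1)}
         {\Gamma(C_x + \gamma_{\pi x} - 1)}
  = C_x + \gamma_{\pi x} - 1
```

For the class whose count is unaffected by document j, the corresponding Gamma terms are identical in numerator and denominator and cancel outright.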
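As with factor 2, the term for one of the classes is always unchanged; that is, in factor 3 of equation (42), one of the terms in $\theta_0$ or $\theta_1$ is identical in the numerator and the denominator.
For $x \in \{0, 1\}$, combining everything gives the conditional distribution for sampling a document's label:
Factor 1 of equation (49): $L_j = x$ considering only the distribution of the other labels.
Factor 2 of equation (49): like a word distribution “fitting room”, an indication of how well the words in $W_j$ “fit” with each of the two distributions.
Note: step 3 normalizes the probabilities of the two labels.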
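The label update described above might look roughly like the following sketch. Several things here are assumptions rather than the source's exact formulation: the function name `sample_label` is hypothetical; `class_counts[x]` holds the number of documents currently labeled x; factor 1 is taken to be proportional to (count of x among the other documents + $\gamma_{\pi x}$), with the denominator dropped because it is the same for both labels; and every entry of `theta` is assumed strictly positive.

```python
import numpy as np

def sample_label(w_j, old_label, class_counts, theta,
                 gamma_pi0=1.0, gamma_pi1=1.0, rng=None):
    """Resample the label of document j; class_counts is updated in place."""
    rng = np.random.default_rng() if rng is None else rng
    # Temporarily remove document j's label from the class counts
    class_counts[old_label] -= 1
    prior = np.array([gamma_pi0, gamma_pi1], dtype=float)
    # Factor 1: how common each label is among the other documents
    log_w = np.log(class_counts + prior)
    # Factor 2: how well the words of W_j fit theta_0 and theta_1
    log_w = log_w + np.asarray(w_j) @ np.log(theta).T
    # Normalize the two weights (subtract the max first for stability)
    log_w -= log_w.max()
    p = np.exp(log_w)
    p /= p.sum()
    new_label = int(rng.choice(2, p=p))
    # Put document j back under its newly sampled label
    class_counts[new_label] += 1
    return new_label, p
```

This sketch tracks only the class counts; the per-class word counts $N_{C_x}(i)$ would have to be updated alongside them, or recomputed from the labels when $\theta$ is resampled.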
Using labeled documents: just don’t sample $L_j$ for those documents! Always keep $L_j$ equal to the observed label.
The documents will effectively serve as “ground truth” evidence for the distributions that created them. Since we never sample for their labels, they will always contribute to the counts in (49) and (51) and will never be subtracted out.
Since the distributions $\theta_0$ and $\theta_1$ are estimated independently, we drop the subscript on $\theta$ for now.
Since we used conjugate priors, this posterior, like the prior, works out to be a Dirichlet distribution. We actually derived the full expression, but we don’t need the full expression here. All we need to do to sample a new distribution is to make another draw from a Dirichlet distribution, but this time with parameters $N_{C_x}(i) + \gamma_{\theta i}$ for each i in V.
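Define the V-dimensional vector t such that each $t_i = N_{C_x}(i) + \gamma_{\theta i}$.
The sampling formula for the new $\theta$ is then $\theta \sim \text{Dirichlet}(t)$.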
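A minimal sketch of this update, assuming the corpus is stored as an N x V matrix of word counts `W` and the current labels as a length-N array. The function name `sample_theta` and the default all-ones `gamma_theta` are hypothetical choices for illustration.

```python
import numpy as np

def sample_theta(W, labels, x, gamma_theta=None, rng=None):
    """Draw a new theta_x from Dirichlet(N_Cx + gamma_theta)."""
    rng = np.random.default_rng() if rng is None else rng
    W = np.asarray(W)
    labels = np.asarray(labels)
    if gamma_theta is None:
        gamma_theta = np.ones(W.shape[1])   # assumed uninformed prior
    # N_Cx(i): total count of word i over documents currently labeled x
    word_counts = W[labels == x].sum(axis=0)
    t = word_counts + gamma_theta
    return rng.dirichlet(t)
```

If no document currently carries label x, the counts are all zero and the draw falls back to the prior.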
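Sample a random vector $a = \langle a_1, \ldots, a_V \rangle$ from the V-dimensional Dirichlet distribution with parameters $\langle \alpha_1, \ldots, \alpha_V \rangle$.
The fastest implementation is to draw V independent samples $y_1, \ldots, y_V$ from gamma distributions, each with density $\text{Gamma}(\alpha_i, 1) = y_i^{\alpha_i - 1} e^{-y_i} / \Gamma(\alpha_i)$, and then set $a_i = y_i / \sum_{j=1}^{V} y_j$.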
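The gamma-based construction can be sketched as follows: draw each $y_i$ from a gamma distribution with shape $\alpha_i$ and unit scale, then normalize. The function name `dirichlet_via_gamma` is hypothetical, and NumPy's own `Generator.dirichlet` could be used directly instead; this version only illustrates the mechanics.

```python
import numpy as np

def dirichlet_via_gamma(alpha, rng=None):
    """Sample from Dirichlet(alpha) by normalizing independent gamma draws."""
    rng = np.random.default_rng() if rng is None else rng
    alpha = np.asarray(alpha, dtype=float)
    # y_i ~ Gamma(shape=alpha_i, scale=1), drawn independently for each i
    y = rng.gamma(shape=alpha, scale=1.0)
    # Dividing by the total yields a point on the probability simplex
    return y / y.sum()
```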
$\gamma_\pi = \langle 1, 1 \rangle$: uninformed prior (uniform distribution).
Let $\gamma_\theta$ be a V-dimensional vector where the value of every dimension equals 1: uninformed prior.
Note: as soon as a new label for $L_j$ is assigned, this changes the counts that will affect the labeling of the subsequent documents. This is, in fact, the whole principle behind a Gibbs sampler!
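Both the initialization and the sampling iterations of the Gibbs sampler produce a value for every variable (for iterations t = 1, 2, ..., T). In theory, the approximated value for any variable $Z_i$ can simply be obtained by calculating: $\frac{1}{T} \sum_{t=1}^{T} z_i^{(t)}$
As we know, the Gibbs sampler only draws from the stationary distribution once it has converged, so in practice the sum in equation (59) usually runs not from 1 but from B + 1 through T, discarding the samples from the first B iterations (t ≤ B) and dividing by T - B instead of T.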
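The burn-in averaging can be sketched as below, assuming the sampled values of a variable are stored in order so that `samples[t - 1]` is the value at iteration t. The function name `posterior_mean` and the argument name `burn_in` (playing the role of B) are hypothetical.

```python
import numpy as np

def posterior_mean(samples, burn_in):
    """Average the samples from iterations B + 1 through T."""
    samples = np.asarray(samples, dtype=float)
    if not 0 <= burn_in < len(samples):
        raise ValueError("burn_in must satisfy 0 <= burn_in < number of samples")
    # Dropping the first burn_in entries discards iterations 1..B
    return samples[burn_in:].mean(axis=0)
```

For example, `posterior_mean([10, 10, 1, 2, 3], burn_in=2)` averages only the last three values. Vector-valued variables such as $\theta$ work the same way, since the mean is taken along the iteration axis.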
In this context, Jordan Boyd-Graber (personal communication) also recommends looking at Neal’s [15] discussion of likelihood as a metric of convergence.
