
CCJ PRML Study Note - Chapter 1.6 : Information Theory


Christopher M. Bishop, PRML, Chapter 1 Introduction

1. Information h(x)

Given a random variable $x$, we ask how much information is received when we observe a specific value of this variable.

  • The amount of information can be viewed as the "degree of surprise" on learning the value of $x$.
  • information $h(x)$:
    $$h(x) = -\log_2 p(x) \tag{1.92}$$
    where the negative sign ensures that information is positive or zero. (A quick numerical illustration follows this list.)
  • the units of $h(x)$:
    • using logarithms to the base of 2: the units of $h(x)$ are bits ('binary digits').
    • using logarithms to the base of $e$, i.e., natural logarithms: the units of $h(x)$ are nats.
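
To make (1.92) concrete, here is a minimal Python sketch (my own illustration, not from the original note) computing the self-information of an event of probability 1/8 in bits and in nats:

```python
import math

def self_information(p, base=2):
    """Self-information h(x) = -log_b p(x); base 2 gives bits, base e gives nats."""
    return -math.log(p, base)

print(self_information(0.125))          # 3.0 bits
print(self_information(0.125, math.e))  # ~2.079 nats
```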

2. Entropy H(x): average amount of information

2.1 Entropy H(x)

Firstly we interpret the concept of entropy in terms of the average amount of information needed to specify the state of a random variable.

Now suppose that a sender wishes to transmit the value of a random variable to a receiver. The average amount of information that they transmit in the process is obtained by taking the expectation of (1.92) with respect to the distribution $p(x)$ and is given as

  • discrete entropy for a discrete random variable:
    $$H[x] = -\sum_x p(x)\log_2 p(x) \tag{1.93}$$
    (a numerical sketch follows this list)
  • or differential/continuous entropy for a continuous random variable:
    $$H[x] = -\int p(x)\ln p(x)\,dx$$
  • Note that $\lim_{p\to 0} p\ln p = 0$, and so we shall take $p(x)\ln p(x) = 0$ whenever we encounter a value for $x$ such that $p(x) = 0$.
  • The nonuniform distribution has a smaller entropy than the uniform one.
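
The following short Python sketch (my own, with made-up probabilities) evaluates (1.93) for a uniform and a nonuniform distribution over four states, confirming that the nonuniform one has lower entropy:

```python
import math

def entropy(p, base=2):
    """Discrete entropy H[x] = -sum_i p_i log p_i, with 0 log 0 taken as 0."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

uniform    = [1/4] * 4
nonuniform = [0.7, 0.1, 0.1, 0.1]

print(entropy(uniform))      # 2.0 bits (the maximum for 4 states)
print(entropy(nonuniform))   # ~1.357 bits, smaller than the uniform case
```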

2.2 Noiseless coding theorem (Shannon, 1948)

The noiseless coding theorem states that the entropy is a lower bound on the number of bits needed to transmit the state of a random variable.
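
As a rough numerical illustration of the theorem (a sketch in the spirit of the 8-state example PRML discusses; the particular prefix-code lengths below are my own choice), the entropy of a nonuniform 8-state distribution matches the average length of a code that assigns shorter codewords to more probable states:

```python
import math

# Eight states with a nonuniform distribution (PRML uses an example like this).
p = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
# Lengths of one possible prefix code; shorter codewords go to more probable states.
code_lengths = [1, 2, 3, 4, 6, 6, 6, 6]

entropy = -sum(pi * math.log2(pi) for pi in p)
avg_len = sum(pi * li for pi, li in zip(p, code_lengths))

print(entropy)   # 2.0 bits
print(avg_len)   # 2.0 bits: the average code length meets the entropy lower bound
```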

2.3 Alternative view of entropy H(x)

Secondly, let us introduce the concept of entropy as it arose in physics, in the context of equilibrium thermodynamics, where it was later given a deeper interpretation as a measure of disorder through developments in statistical mechanics.

Consider a set of $N$ identical objects that are to be divided amongst a set of bins, such that there are $n_i$ objects in the $i$-th bin. Consider the number of different ways of allocating the objects to the bins.

  • There are $N$ ways to choose the first object, $(N-1)$ ways to choose the second object, and so on, leading to a total of $N!$ ways to allocate all $N$ objects to the bins.
  • However, we don't wish to distinguish between rearrangements of objects within each bin. In the $i$-th bin there are $n_i!$ ways of reordering the objects, and so the total number of ways of allocating the $N$ objects to the bins is given by
    $$W = \frac{N!}{\prod_i n_i!}$$
    which is called the multiplicity.
  • The entropy is then defined as the logarithm of the multiplicity scaled by an appropriate constant:
    $$H = \frac{1}{N}\ln W = \frac{1}{N}\ln N! - \frac{1}{N}\sum_i \ln n_i!$$
  • We now consider the limit $N \to \infty$, in which the fractions $n_i/N$ are held fixed, and apply Stirling's approximation
    $$\ln N! \simeq N\ln N - N$$
  • which gives
    $$H = -\lim_{N\to\infty}\sum_i \left(\frac{n_i}{N}\right)\ln\left(\frac{n_i}{N}\right),$$
    i.e., $$H = -\sum_i p_i \ln p_i$$
  • where we have used $\sum_i n_i = N$, and $p_i = \lim_{N\to\infty}(n_i/N)$ is the probability of an object being assigned to the $i$-th bin. (A numerical check of this limit follows the list.)
  • microstate: In physics terminology, the specific arrangement of objects in the bins is called a microstate,
  • macrostate: and the overall distribution of occupation numbers, expressed through the ratios $n_i/N$, is called a macrostate.
  • The multiplicity $W$ is also known as the weight of the macrostate.
  • We can interpret the bins as the states $x_i$ of a discrete random variable $X$, where $p(X = x_i) = p_i$. The entropy of the random variable $X$ is then
    $$H[p] = -\sum_i p(x_i)\ln p(x_i)$$
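
A quick numerical check (my own sketch) that $\frac{1}{N}\ln W$ approaches $-\sum_i p_i \ln p_i$ as $N \to \infty$ with the fractions $n_i/N$ held fixed:

```python
import math

def entropy_from_multiplicity(N, fractions):
    """Compute (1/N) * ln W with W = N! / prod_i n_i!, for n_i = fraction_i * N."""
    counts = [round(f * N) for f in fractions]
    log_W = math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)
    return log_W / N

fractions = [0.5, 0.3, 0.2]
exact = -sum(p * math.log(p) for p in fractions)   # -sum_i p_i ln p_i ~ 1.0297 nats

for N in [10, 100, 10_000, 1_000_000]:
    print(N, entropy_from_multiplicity(N, fractions), exact)
# (1/N) ln W converges to the entropy as N grows
```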

2.4 Comparison between discrete entropy and continuous entropy

| H(x)            | Discrete Distribution X | Continuous Distribution X |
|-----------------|-------------------------|---------------------------|
| Maximum entropy | Uniform X               | Gaussian X                |
| Sign            | H[x] ≥ 0                | can be negative           |
  • Maximum entropy H(x):
    • In the case of discrete distributions, the maximum entropy configuration corresponds to an equal distribution of probabilities across the possible states of the variable.
    • For a continuous variable, the distribution that maximizes the differential entropy is the Gaussian [see Page 54 in PRML].
  • Is it negative or positive?
    • The discrete entropy in (1.93) is always $H[x] \geq 0$, because $0 \leq p(x_i) \leq 1$. It equals its minimum value of $0$ when one of the $p(x_i) = 1$ and all other $p(x_{j \neq i}) = 0$.
    • The differential entropy, unlike the discrete entropy, can be negative. If we evaluate the differential entropy of the Gaussian, we obtain
      $$H[x] = \frac{1}{2}\left\{1 + \ln(2\pi\sigma^2)\right\} \tag{1.110}$$
      which is negative for $\sigma^2 < 1/(2\pi e)$. (A numerical check follows.)
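
A small check (my own sketch) of the Gaussian differential entropy formula (1.110), showing it becomes negative once $\sigma^2 < 1/(2\pi e)$:

```python
import math

def gaussian_differential_entropy(sigma2):
    """Differential entropy of N(mu, sigma^2) in nats: 0.5 * (1 + ln(2*pi*sigma^2))."""
    return 0.5 * (1 + math.log(2 * math.pi * sigma2))

threshold = 1 / (2 * math.pi * math.e)            # ~0.0585
print(gaussian_differential_entropy(1.0))         # ~1.419 nats (positive)
print(gaussian_differential_entropy(threshold))   # 0.0
print(gaussian_differential_entropy(0.01))        # ~-0.884 nats (negative)
```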

2.5 Conditional entropy H(y|x)

  • Conditional entropy:
    Suppose we have a joint distribution $p(x, y)$ from which we draw pairs of values of $x$ and $y$. If a value of $x$ is already known, then the additional information needed to specify the corresponding value of $y$ is given by $-\ln p(y|x)$. Thus the average additional information needed to specify $y$ can be written as
    $$H[y|x] = -\iint p(y, x)\,\ln p(y|x)\,dy\,dx \tag{1.111}$$
    which is called the conditional entropy of $y$ given $x$.
  • It is easily seen, using the product rule, that the conditional entropy satisfies the relation
    $$H[x, y] = H[y|x] + H[x] \tag{1.112}$$
    where $H[x, y]$ is the differential entropy (i.e., continuous entropy) of $p(x, y)$, and $H[x]$ is the differential entropy of the marginal distribution $p(x)$.
  • From (1.112) we see that
    the information needed to describe $x$ and $y$ is given by the sum of the information needed to describe $x$ alone plus the additional information required to specify $y$ given $x$. (A discrete numerical check of (1.111) and (1.112) follows this list.)
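
A discrete analogue of (1.111) and (1.112) (my own sketch with a made-up 2x2 joint table), verifying numerically that $H[x, y] = H[y|x] + H[x]$:

```python
import math

# A made-up joint distribution p(x, y) over two binary variables; rows index x, columns index y.
p_xy = [[0.3, 0.2],
        [0.1, 0.4]]

def H(probs):
    """Entropy (in nats) of a flat list of probabilities, skipping zeros."""
    return -sum(p * math.log(p) for p in probs if p > 0)

H_joint = H([p for row in p_xy for p in row])
p_x     = [sum(row) for row in p_xy]          # marginal p(x)
H_x     = H(p_x)
# Conditional entropy H[y|x] = -sum_{x,y} p(x,y) ln p(y|x)
H_y_given_x = -sum(p_xy[i][j] * math.log(p_xy[i][j] / p_x[i])
                   for i in range(2) for j in range(2) if p_xy[i][j] > 0)

print(H_joint, H_y_given_x + H_x)   # the two numbers agree: H[x,y] = H[y|x] + H[x]
```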

3. Relative entropy or KL divergence

3.1 Relative entropy or KL divergence

Problem: How to relate the notion of entropy to pattern recognition?
Description: Consider some unknown distribution $p(x)$, and suppose that we have modeled this using an approximating distribution $q(x)$.

  • Relative entropy, Kullback-Leibler divergence, or KL divergence between the distributions $p(x)$ and $q(x)$: If we use $q(x)$ to construct a coding scheme for the purpose of transmitting values of $x$ to a receiver, then the average additional amount of information (in nats) required to specify the value of $x$ (assuming we choose an efficient coding scheme) as a result of using $q(x)$ instead of the true distribution $p(x)$ is given by
    $$\mathrm{KL}(p\|q) = -\int p(x)\ln q(x)\,dx - \left(-\int p(x)\ln p(x)\,dx\right) = -\int p(x)\ln\left\{\frac{q(x)}{p(x)}\right\}dx \tag{1.113}$$
    and, in discrete form,
    $$\mathrm{KL}(p\|q) = \sum_k p_k \ln\frac{p_k}{q_k}$$

  • It can be rewritten as [see Ref-1]
    $$\mathrm{KL}(p\|q) = \sum_k p_k \log p_k - \sum_k p_k \log q_k = -H(p) + H(p, q)$$

  • Cross Entropy: where $H(p, q)$ is called the cross entropy, defined as $H(p, q) = -\sum_k p_k \log q_k$.

    • Understanding of Cross Entropy: One can show that the cross entropy $H(p, q)$ is the average number of bits (or nats) needed to encode data coming from a source with distribution $p$ when we use model $q$ to define our codebook.
    • Understanding of ("Regular") Entropy: Hence the "regular" entropy $H(p) \equiv H(p, p)$ is the expected number of bits if we use the true model.
    • Understanding of Relative Entropy: So the KL divergence is the difference between these (shown in 2.111 of Ref-1). In other words, the KL divergence is the average number of extra bits (or nats) needed to encode the data, due to the fact that we used the approximating distribution $q$ to encode the data instead of the true distribution $p$.
  • Asymmetric: Note that the KL divergence is not a symmetrical quantity, that is to say $\mathrm{KL}(p\|q) \not\equiv \mathrm{KL}(q\|p)$.

  • KL divergence is a way to measure the dissimilarity of two probability distributions, $p$ and $q$ [see Ref-1]. (A short numerical sketch follows.)
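
A numerical sketch (my own, with arbitrary distributions $p$ and $q$) of the decomposition $\mathrm{KL}(p\|q) = H(p, q) - H(p)$ and of the asymmetry of the KL divergence:

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k ln q_k (in nats)."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

def kl(p, q):
    """KL(p || q) = sum_k p_k ln(p_k / q_k) = H(p, q) - H(p)."""
    return cross_entropy(p, q) - cross_entropy(p, p)

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

print(kl(p, q))   # ~0.218 nats
print(kl(q, p))   # ~0.239 nats: KL(p||q) != KL(q||p)
print(kl(p, p))   # 0.0: no extra bits when the model matches the source
```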

3.2 Information inequality [see Ref-1]

The "extra number of bits" interpretation should make it clear that $\mathrm{KL}(p\|q) \geq 0$, and that the KL divergence is equal to zero if and only if $p = q$. We now give a proof of this important result:
$$\mathrm{KL}(p\|q) \geq 0, \quad \text{with equality if and only if } p = q.$$

Proof:

  • 1) Convex functions: To do this we first introduce the concept of convex functions. A function $f(x)$ is said to be convex if it has the property that every chord lies on or above the function, as shown in Figure 1.31.
    • Convexity then implies
      $$f(\lambda a + (1-\lambda) b) \leq \lambda f(a) + (1-\lambda) f(b), \quad 0 \leq \lambda \leq 1 \tag{1.114}$$
  • 2) Jensen's inequality:
    • Using the technique of proof by induction, we can show from (1.114) that a convex function $f(x)$ satisfies
      $$f\left(\sum_{i=1}^{M}\lambda_i x_i\right) \leq \sum_{i=1}^{M}\lambda_i f(x_i) \tag{1.115}$$
      where $\lambda_i \geq 0$ and $\sum_i \lambda_i = 1$, for any set of points $\{x_i\}$. The result (1.115) is known as Jensen's inequality.
    • If we interpret the $\lambda_i$ as the probability distribution over a discrete variable $x$ taking the values $\{x_i\}$, then (1.115) can be written
      $$f(\mathbb{E}[x]) \leq \mathbb{E}[f(x)] \tag{1.116}$$
      For continuous variables, Jensen's inequality takes the form
      $$f\left(\int x\, p(x)\,dx\right) \leq \int f(x)\, p(x)\,dx \tag{1.117}$$
  • 3) Apply Jensen's inequality in the form (1.117) to the KL divergence (1.113) to give
      $$\mathrm{KL}(p\|q) = -\int p(x)\ln\left\{\frac{q(x)}{p(x)}\right\}dx \geq -\ln\int q(x)\,dx = 0 \tag{1.118}$$
      where we have used the fact that $-\ln x$ is a convex function (in fact, $-\ln x$ is a strictly convex function, so equality holds if, and only if, $q(x) = p(x)$ for all $x$), together with the normalization condition $\int q(x)\,dx = 1$.
  • 4) Similarly, let $A = \{x : p(x) > 0\}$ be the support of $p(x)$, and apply Jensen's inequality in the form (1.115) to the discrete form of the KL divergence to get [see Ref-1]
      $$-\mathrm{KL}(p\|q) = -\sum_{x\in A} p(x)\ln\frac{p(x)}{q(x)} = \sum_{x\in A} p(x)\ln\frac{q(x)}{p(x)} \leq \ln\sum_{x\in A} p(x)\frac{q(x)}{p(x)} = \ln\sum_{x\in A} q(x) \leq \ln\sum_{x} q(x) = \ln 1 = 0$$
      where the first inequality follows from Jensen's. Since $\ln x$ is a strictly concave (i.e., the negative of a convex) function, we have equality in the first inequality (Equation (2.115) in Ref-1) iff $q(x) = c\,p(x)$ for some constant $c$. We have equality in the second inequality (Equation (2.116) in Ref-1) iff $\sum_{x\in A} q(x) = \sum_x q(x) = 1$, which implies $c = 1$.
  • 5) Hence $\mathrm{KL}(p\|q) = 0$ iff $p(x) = q(x)$ for all $x$. (A numerical check follows.)
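
A small numerical check (my own sketch) of Jensen's inequality (1.115) with the strictly convex function $f(x) = -\ln x$, and of the resulting bound $\mathrm{KL}(p\|q) \geq 0$ for randomly drawn discrete distributions:

```python
import math
import random

random.seed(0)

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

def f(x):
    # f(x) = -ln(x) is a strictly convex function on (0, infinity)
    return -math.log(x)

# Jensen's inequality (1.115): f(sum_i lambda_i x_i) <= sum_i lambda_i f(x_i)
lam = normalize([random.uniform(0.01, 1.0) for _ in range(5)])
xs  = [random.uniform(0.1, 5.0) for _ in range(5)]
lhs = f(sum(l * x for l, x in zip(lam, xs)))
rhs = sum(l * f(x) for l, x in zip(lam, xs))
print(lhs <= rhs)   # True

# Consequence (1.118): KL(p || q) >= 0 for any pair of discrete distributions
for _ in range(1000):
    p = normalize([random.uniform(0.01, 1.0) for _ in range(4)])
    q = normalize([random.uniform(0.01, 1.0) for _ in range(4)])
    kl = sum(pk * math.log(pk / qk) for pk, qk in zip(p, q))
    assert kl >= -1e-12   # tiny slack for floating-point rounding
print("KL(p||q) >= 0 held in every random trial")
```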

3.3 How to use KL divergence

Note that:

  • we can interpret the KL divergence as a measure of the dissimilarity of the two distributions $p(x)$ and $q(x)$.
  • If we use a distribution that is different from the true one, then we must necessarily have a less efficient coding, and on average the additional information that must be transmitted is (at least) equal to the Kullback-Leibler divergence between the two distributions.

Problem description:

  • Suppose that data is being generated from an unknown distribution $p(x)$ that we wish to model.
  • We can try to approximate this distribution using some parametric distribution $q(x|\theta)$, governed by a set of adjustable parameters $\theta$, for example a multivariate Gaussian.
  • One way to determine $\theta$ is to minimize the KL divergence between $p(x)$ and $q(x|\theta)$ with respect to $\theta$.
  • We cannot do this directly because we don't know $p(x)$. Suppose, however, that we have observed a finite set of training points $x_n$, for $n = 1, \dots, N$, drawn from $p(x)$. Writing the KL divergence as an expectation over $p(x)$, $\mathrm{KL}(p\|q) = \mathbb{E}_{p(x)}\big[-\ln q(x|\theta) + \ln p(x)\big]$, the expectation with respect to $p(x)$ can be approximated by a finite sum over these points, using (1.35), so that
    $$\mathrm{KL}(p\|q) \simeq \frac{1}{N}\sum_{n=1}^{N}\left\{-\ln q(x_n|\theta) + \ln p(x_n)\right\} \tag{1.119}$$
  • The first term on the right-hand side is the negative log likelihood function for $\theta$ under the distribution $q(x|\theta)$, evaluated using the training set; the second term is independent of $\theta$. (A sketch follows this list.)
  • Thus we see that minimizing this KL divergence is equivalent to maximizing the likelihood function.
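
A sketch (my own, with synthetic data) of (1.119): the $\ln p(x_n)$ term does not depend on $\theta$, so minimizing the sample-average KL over $\theta$ reduces to minimizing the average negative log likelihood. Here the model $q(x|\theta)$ is a unit-variance Gaussian with unknown mean $\theta$, and the grid minimizer matches the maximum-likelihood estimate (the sample mean):

```python
import math
import random

random.seed(1)

# Synthetic training points drawn from a "true" distribution p(x), here N(2, 1),
# which the model is not allowed to see directly.
data = [random.gauss(2.0, 1.0) for _ in range(500)]

def avg_neg_log_lik(theta, xs):
    """(1/N) * sum_n -ln q(x_n | theta) for a unit-variance Gaussian model q(x|theta)."""
    return sum(0.5 * math.log(2 * math.pi) + 0.5 * (x - theta) ** 2 for x in xs) / len(xs)

# Minimizing the sample-average KL in (1.119) over theta is the same as minimizing
# this term, because the ln p(x_n) term does not depend on theta.
thetas = [i / 100 for i in range(0, 401)]                 # grid over [0, 4]
best_theta = min(thetas, key=lambda t: avg_neg_log_lik(t, data))

mle = sum(data) / len(data)                               # maximum-likelihood estimate (sample mean)
print(best_theta, mle)                                    # agree up to the grid resolution
```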

3.4 Mutual information

Now consider the joint distribution between two sets of variables $x$ and $y$ given by $p(x, y)$.

Mutual information between the variables $x$ and $y$:

  • If $x$ and $y$ are independent, $p(x, y) = p(x)p(y)$.
  • If $x$ and $y$ are not independent, we can gain some idea of whether they are "close" to being independent by considering the KL divergence between the joint distribution and the product of the marginals, given by
    $$I[x, y] \equiv \mathrm{KL}\big(p(x, y)\,\|\,p(x)p(y)\big) = -\iint p(x, y)\ln\left(\frac{p(x)\,p(y)}{p(x, y)}\right)dx\,dy \tag{1.120}$$
    which is called the mutual information between the variables $x$ and $y$.
  • Using the sum and product rules of probability, we see that the mutual information is related to the conditional entropy
    through
    $$I[x, y] = H[x] - H[x|y] = H[y] - H[y|x] \tag{1.121}$$
    (A discrete numerical check follows at the end of this section.)

Understanding of Mutual information:

  • Thus we can view the mutual information as the reduction in the uncertainty about $x$ by virtue of being told the value of $y$ (or vice versa).
  • From a Bayesian perspective, we can view $p(x)$ as the prior distribution for $x$ and $p(x|y)$ as the posterior distribution after we have observed new data $y$. The mutual information therefore represents the reduction in uncertainty about $x$ as a consequence of the new observation $y$.
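
A discrete sketch (my own, reusing the same style of made-up 2x2 joint table as in the conditional-entropy check) of (1.120) and (1.121), verifying that $I[x, y]$ computed as a KL divergence equals $H[x] - H[x|y]$:

```python
import math

p_xy = [[0.3, 0.2],     # a made-up joint distribution; rows index x, columns index y
        [0.1, 0.4]]

p_x = [sum(row) for row in p_xy]                              # marginal p(x)
p_y = [sum(p_xy[i][j] for i in range(2)) for j in range(2)]   # marginal p(y)

# Mutual information as a KL divergence between p(x,y) and p(x)p(y)
I = sum(p_xy[i][j] * math.log(p_xy[i][j] / (p_x[i] * p_y[j]))
        for i in range(2) for j in range(2) if p_xy[i][j] > 0)

# The same quantity via I[x,y] = H[x] - H[x|y]
H_x = -sum(p * math.log(p) for p in p_x if p > 0)
H_x_given_y = -sum(p_xy[i][j] * math.log(p_xy[i][j] / p_y[j])
                   for i in range(2) for j in range(2) if p_xy[i][j] > 0)

print(I, H_x - H_x_given_y)   # both ~0.086 nats; independence would give exactly 0
```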

Reference

[Ref-1]: Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press, 2012; Section 2.8.2, Page 57.

 

Original post: http://www.cnblogs.com/glory-of-family/p/5602316.html