Christopher M. Bishop, PRML, Chapter 1 Introduction
2.2 Noiseless coding theorem (Shannon, 1948)
The noiseless coding theorem states that the entropy is a lower bound on the number of bits needed to transmit the state of a random variable.
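As a quick sanity check of this bound (my own toy example, not from PRML), the snippet below computes the entropy in bits of a small distribution and compares it with the average length of a hand-picked prefix code; the symbol probabilities and code words are chosen purely for illustration.

```python
import math

# Illustrative distribution over 4 symbols and a matching prefix code
# (both chosen here for demonstration; they are not prescribed by the text).
probs = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}
code  = {"a": "0", "b": "10", "c": "110", "d": "111"}

# Entropy in bits: H = -sum_i p_i log2 p_i
entropy_bits = -sum(p * math.log2(p) for p in probs.values())

# Expected code length in bits per symbol
avg_code_len = sum(probs[s] * len(code[s]) for s in probs)

print(f"entropy         = {entropy_bits:.3f} bits")   # 1.750
print(f"avg code length = {avg_code_len:.3f} bits")   # 1.750, meeting the lower bound
```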
2.3 Alternative view of entropy H(x)
We now introduce the concept of entropy as it arose in physics, first in the context of equilibrium thermodynamics, and later given a deeper interpretation as a measure of disorder through developments in statistical mechanics.
Consider a set of $N$ identical objects that are to be divided amongst a set of bins, such that there are $n_i$ objects in the $i$-th bin. Consider the number of different ways of allocating the objects to the bins.
- There are $N$ ways to choose the first object, $(N-1)$ ways to choose the second object, and so on, leading to a total of $N!$ ways to allocate all $N$ objects to the bins.
- However, we don't wish to distinguish between rearrangements of objects within each bin. In the $i$-th bin there are $n_i!$ ways of reordering the objects, and so the total number of ways of allocating the $N$ objects to the bins is given by
$$W = \frac{N!}{\prod_i n_i!}$$
which is called the multiplicity.
- The entropy is then defined as the logarithm of the multiplicity scaled by an appropriate constant
$$H = \frac{1}{N}\ln W = \frac{1}{N}\ln N! - \frac{1}{N}\sum_i \ln n_i!$$
- We now consider the limit $N \to \infty$, in which the fractions $n_i/N$ are held fixed, and apply Stirling's approximation
$$\ln N! \simeq N\ln N - N$$
- which gives
$$H = -\lim_{N \to \infty}\sum_i \left(\frac{n_i}{N}\right)\ln\left(\frac{n_i}{N}\right) = -\sum_i p_i \ln p_i$$
- where we have used $\sum_i n_i = N$, and $p_i = \lim_{N \to \infty}(n_i/N)$ is the probability of an object being assigned to the $i$-th bin.
- microstate: In physics terminology, a specific arrangement of the objects in the bins is called a microstate.
- macrostate: the overall distribution of occupation numbers, expressed through the ratios $n_i/N$, is called a macrostate.
- The multiplicity $W$ is also known as the weight of the macrostate.
- We can interpret the bins as the states $x_i$ of a discrete random variable $X$, where $p(X = x_i) = p_i$. The entropy of the random variable $X$ is then
$$H[p] = -\sum_i p(x_i)\ln p(x_i) \tag{1.93}$$
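The limit above can be checked numerically. The following sketch (my own, not from the book) computes $(1/N)\ln W$ via `math.lgamma` for increasing $N$ with the bin fractions held fixed and compares it with $-\sum_i p_i \ln p_i$; the fractions in `p` are arbitrary illustrative values.

```python
import math

# Numerical check that (1/N) ln W -> -sum_i p_i ln p_i as N grows,
# with the bin fractions p_i = n_i / N held fixed.
p = [0.5, 0.3, 0.2]  # illustrative bin fractions

def scaled_log_multiplicity(N, p):
    """Return (1/N) ln W for W = N! / prod_i n_i!, with n_i = round(p_i * N)."""
    n = [round(pi * N) for pi in p]
    n[-1] += N - sum(n)  # force the counts to sum to N exactly
    log_W = math.lgamma(N + 1) - sum(math.lgamma(ni + 1) for ni in n)
    return log_W / N

entropy = -sum(pi * math.log(pi) for pi in p)
for N in (10, 100, 10_000, 1_000_000):
    print(N, scaled_log_multiplicity(N, p), "->", entropy)
```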
2.4 Comparison between discrete entropy and continuous entropy
| H(x) | Discrete distribution X | Continuous distribution X |
| --- | --- | --- |
| Maximum entropy achieved by | Uniform X | Gaussian X |
| Sign of H(x) | Always $\geq 0$ | Can be negative |
- Maximum entropy H(x):
- In the case of discrete distributions, the maximum entropy configuration corresponds to an equal distribution of probabilities across the possible states of the variable.
- For a continuous variable, the distribution that maximizes the differential entropy is the Gaussian [see Page 54 in PRML].
- Is H(x) negative or positive?
- The discrete entropy in (1.93) is always $H(x) \geq 0$, because $0 \leq p_i \leq 1$ and hence $-p_i \ln p_i \geq 0$. It equals its minimum value of $0$ when one of the $p_i = 1$ and all other $p_{j \neq i} = 0$.
- The differential entropy, unlike the discrete entropy, can be negative. If we evaluate the differential entropy of the Gaussian, we obtain
$$H[x] = \frac{1}{2}\left\{1 + \ln(2\pi\sigma^2)\right\} = \frac{1}{2}\ln(2\pi e\sigma^2) \tag{1.110}$$
which is negative in (1.110) for $\sigma^2 < 1/(2\pi e)$.
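A quick numerical illustration of the sign change (not from the book; the variance values are arbitrary): evaluating $H[x] = \frac{1}{2}\ln(2\pi e\sigma^2)$ shows it crossing zero at $\sigma^2 = 1/(2\pi e)$.

```python
import math

# The Gaussian differential entropy H[x] = 0.5 * ln(2*pi*e*sigma^2) from (1.110)
# changes sign at sigma^2 = 1/(2*pi*e).
def gaussian_diff_entropy(sigma2):
    return 0.5 * math.log(2 * math.pi * math.e * sigma2)

threshold = 1 / (2 * math.pi * math.e)          # ~0.0585
for sigma2 in (1.0, threshold, 0.01):
    print(f"sigma^2 = {sigma2:.4f}  H[x] = {gaussian_diff_entropy(sigma2):+.4f} nats")
# sigma^2 = 1.0 gives H > 0, sigma^2 at the threshold gives H ~ 0, sigma^2 = 0.01 gives H < 0.
```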
2.5 Conditional entropy H(y|x)
- Conditional entropy:
Suppose we have a joint distribution $p(x, y)$ from which we draw pairs of values of $x$ and $y$. If a value of $x$ is already known, then the additional information needed to specify the corresponding value of $y$ is given by $-\ln p(y|x)$. Thus the average additional information needed to specify $y$ can be written as
$$H[y|x] = -\iint p(y, x)\ln p(y|x)\,dy\,dx \tag{1.111}$$
which is called the conditional entropy of $y$ given $x$.
- It is easily seen, using the product rule, that the conditional entropy satisfies the relation
$$H[x, y] = H[y|x] + H[x] \tag{1.112}$$
where $H[x, y]$ is the differential entropy (i.e., continuous entropy) of $p(x, y)$, and $H[x]$ is the differential entropy of the marginal distribution $p(x)$.
- From (1.112) we see that the information needed to describe $x$ and $y$ is given by the sum of the information needed to describe $x$ alone plus the additional information required to specify $y$ given $x$.
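The identity (1.112) also holds for discrete entropies, which makes it easy to check numerically. The sketch below uses a small, arbitrarily chosen 2x3 joint table (illustrative numbers, not from the book) to confirm that the joint entropy equals the conditional entropy plus the marginal entropy.

```python
import numpy as np

# Arbitrary 2x3 joint distribution p(x, y) for illustration (rows: x, cols: y).
p_xy = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.05, 0.25]])

p_x = p_xy.sum(axis=1)                      # marginal p(x)
H_xy = -np.sum(p_xy * np.log(p_xy))         # joint entropy H[x, y]
H_x  = -np.sum(p_x * np.log(p_x))           # marginal entropy H[x]
# Conditional entropy H[y|x] = -sum_{x,y} p(x,y) ln p(y|x), with p(y|x) = p(x,y)/p(x)
H_y_given_x = -np.sum(p_xy * np.log(p_xy / p_x[:, None]))

print(H_xy, H_y_given_x + H_x)              # the two numbers agree, per (1.112)
```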
3. Relative entropy or KL divergence
The relative entropy or Kullback-Leibler (KL) divergence between distributions $p(x)$ and $q(x)$ is
$$\mathrm{KL}(p\|q) = -\int p(x)\ln\left\{\frac{q(x)}{p(x)}\right\}dx \tag{1.113}$$
The “extra number of bits” interpretation should make it clear that $\mathrm{KL}(p\|q) \geq 0$, and that the KL divergence equals zero if and only if $p(x) = q(x)$. We now give a proof of this important result.
Proof:
- 1) Convex functions: To do this we first introduce the concept of convex functions. A function $f(x)$ is said to be convex if it has the property that every chord lies on or above the function, as shown in Figure 1.31 of PRML.
- Convexity then implies
$$f(\lambda a + (1 - \lambda)b) \leq \lambda f(a) + (1 - \lambda) f(b), \qquad 0 \leq \lambda \leq 1 \tag{1.114}$$
- 2) Jensen's inequality:
- Using proof by induction, we can show from (1.114) that a convex function $f$ satisfies
$$f\left(\sum_{i=1}^{M}\lambda_i x_i\right) \leq \sum_{i=1}^{M}\lambda_i f(x_i) \tag{1.115}$$
where $\lambda_i \geq 0$ and $\sum_i \lambda_i = 1$, for any set of points $\{x_i\}$. The result (1.115) is known as Jensen's inequality.
- If we interpret the $\lambda_i$ as the probability distribution over a discrete variable $x$ taking the values $\{x_i\}$, then (1.115) can be written
$$f(\mathbb{E}[x]) \leq \mathbb{E}[f(x)] \tag{1.116}$$
For continuous variables, Jensen's inequality takes the form
$$f\left(\int x\,p(x)\,dx\right) \leq \int f(x)\,p(x)\,dx \tag{1.117}$$
- 3) Apply Jensen's inequality in the form (1.117) to the KL divergence (1.113) to give
$$\mathrm{KL}(p\|q) = -\int p(x)\ln\left\{\frac{q(x)}{p(x)}\right\}dx \geq -\ln\int q(x)\,dx = 0 \tag{1.118}$$
where we have used the fact that $-\ln x$ is a convex function (in fact, $-\ln x$ is strictly convex, so equality holds if, and only if, $q(x) = p(x)$ for all $x$), together with the normalization condition $\int q(x)\,dx = 1$.
- 4) Similarly, let $A = \{x : p(x) > 0\}$ be the support of $p(x)$, and apply Jensen's inequality in the form (1.115) to the discrete form of the KL divergence (Eq. 2.110 in Murphy) to get [see Ref-1]
$$-\mathrm{KL}(p\|q) = -\sum_{x \in A} p(x)\log\frac{p(x)}{q(x)} = \sum_{x \in A} p(x)\log\frac{q(x)}{p(x)} \leq \log\sum_{x \in A} p(x)\frac{q(x)}{p(x)} = \log\sum_{x \in A} q(x) \leq \log\sum_{x} q(x) = \log 1 = 0$$
where the first inequality follows from Jensen's inequality. Since $\log x$ is a strictly concave function (the negative of a convex function), we have equality in the first inequality (Eq. 2.115 in Murphy) iff $q(x) = c\,p(x)$ for some constant $c$. We have equality in the second inequality (Eq. 2.116 in Murphy) iff $\sum_{x \in A} q(x) = \sum_{x} q(x) = 1$, which implies $c = 1$.
- 5) Hence $\mathrm{KL}(p\|q) = 0$ iff $p(x) = q(x)$ for all $x$.
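As a numerical companion to the proof (my own sketch, not part of the book's argument), the snippet below evaluates the discrete KL divergence for a few hand-picked distributions and shows it is non-negative, vanishing only when the two distributions coincide.

```python
import numpy as np

# KL(p||q) >= 0 for discrete distributions, with equality only when q equals p.
def kl(p, q):
    """Discrete KL(p||q) = sum_i p_i ln(p_i / q_i), assuming supp(p) is contained in supp(q)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                              # restrict the sum to the support of p
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
print(kl(p, [0.2, 0.5, 0.3]))    # strictly positive
print(kl(p, p))                  # exactly 0, the equality case
print(kl(p, [0.25, 0.25, 0.5]))  # positive again; the KL divergence is never negative
```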
3.3 How to use KL divergence
Note that:
- We can interpret the KL divergence as a measure of the dissimilarity of the two distributions $p(x)$ and $q(x)$.
- If we use a distribution that is different from the true one, then we must necessarily have a less efficient coding, and on average the additional information that must be transmitted is (at least) equal to the Kullback-Leibler divergence between the two distributions.
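This coding interpretation can be made concrete with a toy example (my own, with arbitrary distributions): coding symbols drawn from $p$ with a code optimized for $q$ costs the cross-entropy $-\sum_i p_i \log_2 q_i$ bits per symbol on average, and the excess over the entropy of $p$ is exactly $\mathrm{KL}(p\|q)$.

```python
import numpy as np

# Coding symbols from p with a code designed for q costs, on average,
# the cross-entropy H(p, q) bits, which exceeds H(p) by exactly KL(p||q).
p = np.array([0.5, 0.25, 0.125, 0.125])   # true distribution (illustrative)
q = np.array([0.25, 0.25, 0.25, 0.25])    # mismatched model used to build the code

H_p   = -np.sum(p * np.log2(p))           # optimal bits/symbol under p
H_pq  = -np.sum(p * np.log2(q))           # bits/symbol when the code assumes q
KL_pq =  np.sum(p * np.log2(p / q))       # extra bits/symbol

print(H_p, H_pq, H_pq - H_p, KL_pq)       # the last two numbers coincide
```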
Problem description:
- Suppose that data is being generated from an unknown distribution $p(x)$ that we wish to model.
- We can try to approximate this distribution using some parametric distribution $q(x|\theta)$, governed by a set of adjustable parameters $\theta$, for example a multivariate Gaussian.
- One way to determine $\theta$ is to minimize the KL divergence between $p(x)$ and $q(x|\theta)$ with respect to $\theta$.
- We cannot do this directly because we don't know $p(x)$. Suppose, however, that we have observed a finite set of training points $x_n$, for $n = 1, \ldots, N$, drawn from $p(x)$. Then the expectation with respect to $p(x)$ can be approximated by a finite sum over these points, using (1.35), so that
$$\mathrm{KL}(p\|q) \simeq \frac{1}{N}\sum_{n=1}^{N}\left\{-\ln q(x_n|\theta) + \ln p(x_n)\right\} \tag{1.119}$$
To derive this, write $\mathrm{KL}(p\|q) = \int p(x)\{\ln p(x) - \ln q(x|\theta)\}\,dx = \mathbb{E}_{p}\left[-\ln q(x|\theta) + \ln p(x)\right]$ and apply the approximation (1.35), $\mathbb{E}[f] \simeq \frac{1}{N}\sum_{n=1}^{N} f(x_n)$, with $f(x) = -\ln q(x|\theta) + \ln p(x)$.
- The first term is the negative log likelihood function for $\theta$ under the distribution $q(x|\theta)$ evaluated using the training set; the second term is independent of $\theta$.
- Thus we see that minimizing this KL divergence is equivalent to maximizing the likelihood function.
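A minimal sketch of this equivalence, under my own assumptions (a 1-D Gaussian model $q(x|\mu, \sigma^2)$ and synthetic data standing in for samples from the unknown $p(x)$): minimizing the Monte Carlo estimate of the KL divergence over the parameters is the same as minimizing the average negative log likelihood, because the $\ln p(x_n)$ term does not depend on the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # stand-in for samples from the unknown p(x)

def avg_neg_log_lik(mu, sigma2, x):
    """Average -ln q(x_n | mu, sigma^2) over the training set (1-D Gaussian model)."""
    return 0.5 * np.mean(np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

# For the Gaussian the maximum likelihood solution is available in closed form:
mu_ml, sigma2_ml = x.mean(), x.var()
print(mu_ml, sigma2_ml, avg_neg_log_lik(mu_ml, sigma2_ml, x))
# Any other parameter setting gives a larger value, i.e. a larger estimated
# KL(p||q) up to the theta-independent ln p(x_n) term:
print(avg_neg_log_lik(0.0, 1.0, x))
```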
Reference
[1]: Section 2.8.2, Page 57, Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.