
CCJ PRML Study Note - Chapter 1.2 : Probability Theory

Christopher M. Bishop, PRML, Chapter 1: Introduction


1. Uncertainty

A key concept in the field of pattern recognition is that of uncertainty. It arises both through noise on measurements, as well as through the finite size of data sets. Probability theory provides a consistent framework for the quantification and manipulation of uncertainty and forms one of the central foundations for pattern recognition. When combined with decision theory, discussed in Section 1.5 (see PRML), it allows us to make optimal predictions given all the information available to us, even though that information may be incomplete or ambiguous.


2. Example discussed through this chapter

We will introduce the basic concepts of probability theory by considering a simple example. Imagine we have two boxes, one red and one blue, and in the red box we have 2 apples and 6 oranges, and in the blue box we have 3 apples and 1 orange. This is illustrated in Figure 1.9.

Now suppose we randomly pick one of the boxes and from that box we randomly select an item of fruit, and having observed which sort of fruit it is we put it back in the box from which it came. We could imagine repeating this process many times. Let us suppose that in so doing we pick the red box 40% of the time and we pick the blue box 60% of the time, and that when we pick an item of fruit from a box we are equally likely to select any of the pieces of fruit in the box.

In this example, the identity of the box that will be chosen is a random variable, which we shall denote by $B$. This random variable can take one of two possible values, namely $r$ (corresponding to the red box) or $b$ (corresponding to the blue box). Similarly, the identity of the fruit is also a random variable and will be denoted by $F$. It can take either of the values $a$ (for apple) or $o$ (for orange). To begin with, we shall define the probability of an event to be the fraction of times that event occurs out of the total number of trials, in the limit that the total number of trials goes to infinity. Thus the probability of selecting the red box is $4/10$, and the probability of selecting the blue box is $6/10$.
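This limiting-frequency definition can be checked numerically. The following is a minimal Python sketch (my own, not from PRML; the function name, seed, and trial count are arbitrary) that estimates $p(B=r)$ as the fraction of trials in which the red box is picked:

```python
import random

def estimate_p_red(n_trials=100_000, p_red=0.4, seed=0):
    """Estimate p(B=r) as the fraction of trials in which the red box is picked."""
    rng = random.Random(seed)
    n_red = sum(rng.random() < p_red for _ in range(n_trials))
    return n_red / n_trials

print(estimate_p_red())  # roughly 0.4, approaching 4/10 as n_trials grows
```

As the number of trials grows, the empirical fraction converges to 4/10, which is exactly the definition of probability given above.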


3. Basic Terminology

3.1 Probability densities

  • PDF, Probability Density Function: If the probability of a real-valued variable $x$ falling in the interval $(x, x+\delta x)$ is given by $p(x)\,\delta x$ for $\delta x \to 0$, then $p(x)$ is called the probability density over $x$, and the pdf $p(x)$ must satisfy the two conditions $$p(x) \ge 0, \qquad \int_{-\infty}^{\infty} p(x)\,dx = 1.$$

  • PMF, Probability Mass Function: Note that if $x$ is a discrete variable, then $p(x)$ is called a probability mass function because it can be regarded as a set of “probability masses” concentrated at the allowed values of $x$.

  • CDF, Cumulative Distribution Function: The probability that $x$ lies in the interval $(-\infty, z)$ is given by the cumulative distribution function defined by $$P(z) = \int_{-\infty}^{z} p(x)\,dx,$$ which satisfies $P'(x) = p(x)$. (A small numerical check of these relations follows this list.)
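The two pdf conditions and the pdf–cdf relation can be verified numerically. Below is a minimal sketch (my own, assuming NumPy and SciPy are available; the standard normal is just a convenient test case):

```python
import numpy as np
from scipy.stats import norm  # any distribution with a known pdf/cdf would do

x = np.linspace(-6.0, 6.0, 2001)
dx = x[1] - x[0]
pdf = norm.pdf(x)

assert (pdf >= 0).all()          # condition 1: non-negativity
print((pdf * dx).sum())          # condition 2: integrates to ~1 (Riemann sum)

cdf = norm.cdf(x)                # P(z) = integral of p(x) up to z
print(np.allclose(np.gradient(cdf, x), pdf, atol=1e-3))  # P'(x) ~= p(x) -> True
```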

3.2 Expectations and covariances

  • Expectation of $f(x)$: the average value of some function $f(x)$ under a probability distribution $p(x)$ is called the expectation of $f(x)$ and will be denoted by $\mathbb{E}[f]$, given by $\mathbb{E}[f] = \sum_x p(x)\,f(x)$ and $\mathbb{E}[f] = \int p(x)\,f(x)\,dx$ for discrete variables and continuous variables, respectively.

  • Approximating expectation using sampling methods: if we are given a finite number $N$ of points drawn from the pdf, then the expectation can be approximated as a finite sum over these points, $\mathbb{E}[f] \simeq \frac{1}{N}\sum_{n=1}^{N} f(x_n)$ (see the sketch after this list).

  • Expectations of functions of several variables: here we can use a subscript to indicate which variable is being averaged over, so that for instance $\mathbb{E}_x[f(x,y)]$ denotes the average of the function $f(x,y)$ with respect to the distribution of $x$. Note that $\mathbb{E}_x[f(x,y)]$ will be a function of $y$.

  • Variance of $f(x)$: is defined by $\operatorname{var}[f] = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big]$, and provides a measure of how much variability there is in $f(x)$ around its mean value $\mathbb{E}[f(x)]$. Expanding out the square, we get $\operatorname{var}[f] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2$.

  • Variance of the variable $x$ itself: $\operatorname{var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2$.
  • Covariance of two r.v. $x$ and $y$: is defined by $\operatorname{cov}[x, y] = \mathbb{E}_{x,y}\big[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\big] = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\,\mathbb{E}[y]$.

  • Covariance of two vectors of r.v.’s $\mathbf{x}$ and $\mathbf{y}$: is defined by $\operatorname{cov}[\mathbf{x}, \mathbf{y}] = \mathbb{E}_{\mathbf{x},\mathbf{y}}\big[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{y}^{\mathrm T} - \mathbb{E}[\mathbf{y}^{\mathrm T}])\big] = \mathbb{E}_{\mathbf{x},\mathbf{y}}[\mathbf{x}\mathbf{y}^{\mathrm T}] - \mathbb{E}[\mathbf{x}]\,\mathbb{E}[\mathbf{y}^{\mathrm T}]$.

  • Covariance of the components of a vector $\mathbf{x}$ with each other: then we use a slightly simpler notation $\operatorname{cov}[\mathbf{x}] \equiv \operatorname{cov}[\mathbf{x}, \mathbf{x}]$.
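The sampling approximation of the expectation, and the variance and covariance identities above, are easy to check with a small Monte Carlo sketch in Python/NumPy (my own illustration, not from PRML; the distribution and function $f$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)   # samples drawn from p(x)
f = lambda x: x ** 2                               # some function f(x)

E_f = f(x).mean()                                  # E[f] ~ (1/N) sum_n f(x_n)
var_f = (f(x) ** 2).mean() - E_f ** 2              # var[f] = E[f^2] - E[f]^2

y = 0.5 * x + rng.normal(size=x.size)              # a second, correlated variable
cov_xy = (x * y).mean() - x.mean() * y.mean()      # cov[x,y] = E[xy] - E[x]E[y]

print(E_f, var_f, cov_xy)
```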

3.3 Joint, Marginal, Conditional Probability

In order to derive the rules of probability, consider the following example shown in Figure 1.10 involving two random variables $X$ and $Y$. We shall suppose that $X$ can take any of the values $x_i$ where $i = 1, \dots, M$, and $Y$ can take the values $y_j$, where $j = 1, \dots, L$. Consider a total of $N$ trials in which we sample both of the variables $X$ and $Y$, and let the number of such trials in which $X = x_i$ and $Y = y_j$ be $n_{ij}$. Also, let the number of trials in which $X$ takes the value $x_i$ (irrespective of the value that $Y$ takes) be denoted by $c_i$, and similarly let the number of trials in which $Y$ takes the value $y_j$ be denoted by $r_j$.

[Figure 1.10: a grid of cells indexed by $x_i$ (columns) and $y_j$ (rows), illustrating the counts $n_{ij}$, the column totals $c_i$, and the row totals $r_j$.]

  • joint probability: $p(X = x_i, Y = y_j)$ is called the joint probability of $X = x_i$ and $Y = y_j$, and is given by
    $p(X = x_i, Y = y_j) = \dfrac{n_{ij}}{N}$. Here we are implicitly considering the limit $N \to \infty$.
  • marginal probability: $p(X = x_i) = \dfrac{c_i}{N}$ is called the marginal probability, because it is obtained by marginalizing, or summing out, the other variables (in this case $Y$), i.e., $p(X = x_i) = \sum_{j=1}^{L} p(X = x_i, Y = y_j)$, which is the sum rule.
  • conditional probability: $p(Y = y_j \mid X = x_i) = \dfrac{n_{ij}}{c_i}$ is called the conditional probability of $Y = y_j$ given $X = x_i$; from it we obtain $p(X = x_i, Y = y_j) = \dfrac{n_{ij}}{N} = \dfrac{n_{ij}}{c_i}\cdot\dfrac{c_i}{N} = p(Y = y_j \mid X = x_i)\,p(X = x_i)$, which is called the product rule of probability. (These counting definitions are illustrated numerically in the sketch after this list.)
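Below is a small Python/NumPy sketch (my own; the joint distribution used is arbitrary) that builds the table of counts $n_{ij}$ from simulated trials and recovers the joint, marginal, and conditional probabilities, verifying the product rule:

```python
import numpy as np

rng = np.random.default_rng(0)
M, L, N = 5, 3, 200_000
joint_true = rng.random((M, L)); joint_true /= joint_true.sum()

# Draw N pairs (X, Y) from the joint distribution.
idx = rng.choice(M * L, size=N, p=joint_true.ravel())
X, Y = np.unravel_index(idx, (M, L))

n = np.zeros((M, L))
np.add.at(n, (X, Y), 1)                      # n_ij: trials with X = x_i and Y = y_j

joint = n / N                                # p(X=x_i, Y=y_j) = n_ij / N
marginal_X = joint.sum(axis=1)               # sum rule: p(X=x_i) = c_i / N
cond_Y_given_X = n / n.sum(axis=1, keepdims=True)   # p(Y=y_j | X=x_i) = n_ij / c_i

# product rule check: p(X, Y) = p(Y|X) p(X)
print(np.allclose(joint, cond_Y_given_X * marginal_X[:, None]))  # True
```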

3.4 The Rules of Probability

  • Discrete Variables: the sum rule $p(X) = \sum_{Y} p(X, Y)$ and the product rule $p(X, Y) = p(Y \mid X)\,p(X)$.

  • Continuous Variables: if $x$ and $y$ are two real continuous variables, then the sum and product rules take the form $p(x) = \int p(x, y)\,dy$ and $p(x, y) = p(y \mid x)\,p(x)$.

  • Bayes’ theorem: From the product rule, together with the symmetry property $p(X, Y) = p(Y, X)$, we immediately obtain the following relationship between conditional probabilities $$p(Y \mid X) = \frac{p(X \mid Y)\,p(Y)}{p(X)}. \qquad (1.12)$$
    Using the sum rule, the denominator in Bayes’ theorem can be expressed in terms of the quantities appearing in the numerator: $$p(X) = \sum_{Y} p(X \mid Y)\,p(Y).$$
    We can view the denominator in Bayes’ theorem as being the normalization constant required to ensure that the sum of the conditional probability on the left-hand side of (1.12) over all values of $Y$ equals one.

4. An Important Interpretation of Bayes’ Theorem

Let us now return to our example involving boxes of fruit. For the moment, we shall once again be explicit about distinguishing between the random variables and their instantiations. We have seen that the probabilities of selecting either the red or the blue boxes are given by $p(B=r) = 4/10$ and $p(B=b) = 6/10$, respectively. Note that these satisfy $p(B=r) + p(B=b) = 1$.

Now suppose that we pick a box at random, and it turns out to be the blue box. Then the probability of selecting an apple is just the fraction of apples in the blue box, which is $3/4$, and so $p(F=a \mid B=b) = 3/4$. In fact, we can write out all four conditional probabilities for the type of fruit, given the selected box: $$p(F=a \mid B=r) = 1/4, \qquad (1.16)$$ $$p(F=o \mid B=r) = 3/4, \qquad (1.17)$$ $$p(F=a \mid B=b) = 3/4, \qquad (1.18)$$ $$p(F=o \mid B=b) = 1/4. \qquad (1.19)$$

We can now use the sum and product rules of probability to evaluate the overall probability of choosing an apple: $$p(F=a) = p(F=a \mid B=r)\,p(B=r) + p(F=a \mid B=b)\,p(B=b) = \frac{1}{4}\times\frac{4}{10} + \frac{3}{4}\times\frac{6}{10} = \frac{11}{20},$$
from which it follows, using the sum rule, that $p(F=o) = 1 - 11/20 = 9/20$.


Interpretation of Bayes’ Theorem (See Page 17 in PRML)

Suppose instead we are told that a piece of fruit has been selected and it is an orange, and we would like to know which box it came from. This requires that we evaluate the probability distribution over boxes conditioned on the identity of the fruit, whereas the probabilities in (1.16)–(1.19) give the probability distribution over the fruit conditioned on the identity of the box. We can solve the problem of reversing the conditional probability by using Bayes’ theorem to give $$p(B=r \mid F=o) = \frac{p(F=o \mid B=r)\,p(B=r)}{p(F=o)} = \frac{3}{4}\times\frac{4}{10}\times\frac{20}{9} = \frac{2}{3}.$$ From the sum rule, it then follows that $p(B=b \mid F=o) = 1 - 2/3 = 1/3$. (A short numerical check of these calculations follows.)
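These numbers can be reproduced directly from the sum rule, the product rule, and Bayes’ theorem. A minimal Python sketch using only the quantities stated above:

```python
# Priors and conditional probabilities from (1.16)-(1.19)
p_B = {"r": 4 / 10, "b": 6 / 10}
p_F_given_B = {("a", "r"): 1 / 4, ("o", "r"): 3 / 4,
               ("a", "b"): 3 / 4, ("o", "b"): 1 / 4}

# Sum + product rules: p(F=a) = sum_B p(F=a|B) p(B)
p_apple = sum(p_F_given_B[("a", box)] * p_B[box] for box in p_B)
p_orange = 1 - p_apple
print(p_apple, p_orange)          # 0.55, 0.45  (i.e. 11/20 and 9/20)

# Bayes' theorem: p(B=r | F=o) = p(F=o|B=r) p(B=r) / p(F=o)
posterior_red = p_F_given_B[("o", "r")] * p_B["r"] / p_orange
print(posterior_red)              # 0.666...  (i.e. 2/3)
```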

We can provide an important interpretation of Bayes’ theorem as follows.

  • Prior probability: If we had been asked which box had been chosen before being told the identity of the selected item of fruit, then the most complete information we have available is provided by the probability $p(B)$. We call this the prior probability because it is the probability available before we observe the identity of the fruit.

  • Posterior probability: Once we are told that the fruit is an orange, we can then use Bayes’ theorem to compute the probability $p(B \mid F)$, which we shall call the posterior probability because it is the probability obtained after we have observed $F$.

  • Evidence: Note that in this example, the prior probability of selecting the red box was $4/10$, so that we were more likely to select the blue box than the red one. However, once we have observed that the piece of selected fruit is an orange, we find that the posterior probability of the red box is now $2/3$, so that it is now more likely that the box we selected was in fact the red one. This result accords with our intuition, as the proportion of oranges is much higher in the red box than it is in the blue box, and so the observation that the fruit was an orange provides significant evidence favoring the red box. In fact, the evidence is sufficiently strong that it outweighs the prior and makes it more likely that the red box was chosen rather than the blue one.

  • Independent: Finally, we note that if the joint distribution of two variables factorizes into the product of the marginals, so that $p(X, Y) = p(X)\,p(Y)$, then $X$ and $Y$ are said to be independent. From the product rule, we see that $p(Y \mid X) = p(Y)$, and so the conditional distribution of $Y$ given $X$ is indeed independent of the value of $X$. For instance, in our boxes of fruit example, if each box contained the same fraction of apples and oranges, then $p(F \mid B) = p(F)$, so that the probability of selecting, say, an apple is independent of which box is chosen.


5. Bayesian Probability

5.1 Two Interpretations of Probabilities:

  • Classical or Frequentist Interpretation: we have viewed probabilities in terms of the frequencies of random, repeatable events, and have defined the probability of an event to be the fraction of times that event occurs out of the total number of trials, in the limit that the total number of trials goes to infinity.
  • Bayesian Interpretation of Probability: Introduce the “uncertainty” or “degrees of belief”. Consider an uncertain event, for example whether the Arctic ice cap will have disappeared by the end of the century. These are not events that can be repeated numerous times in order to define a notion of probability as we did earlier in the context of boxes of fruit. Nevertheless, we will generally have some idea, for example, of how quickly we think the polar ice is melting. If we now obtain fresh evidence, for instance from a new Earth observation satellite gathering novel forms of diagnostic information, we may revise our opinion on the rate of ice loss. Our assessment of such matters will affect the actions we take, for instance the extent to which we endeavour to reduce the emission of greenhouse gasses. In such circumstances, we would like to be able to quantify our expression of uncertainty and make precise revisions of uncertainty in the light of new evidence, as well as subsequently to be able to take optimal actions or decisions as a consequence. This can all be achieved through the elegant, and very general, Bayesian interpretation of probability.

5.2 Excerpted from the PRML notes [Ref-1]

  • Rather than saying that the Bayesian view is an interpretation of the concept of “probability”, it is more accurate to say that probability happens to be a suitable machinery for quantifying the Bayesian notion of “degree of belief”. The Bayesian starting point is the concept of “uncertainty”, to which a “degree of belief” is attached. The use of probability to represent uncertainty, however, is not an ad-hoc choice, but is inevitable if we are to respect common sense while making rational coherent inferences. Cox showed that if numerical values are used to represent degrees of belief, then a simple set of axioms encoding common sense properties of such beliefs leads uniquely to a set of rules for manipulating degrees of belief that are equivalent to the sum and product rules of probability. For this reason, we can use the machinery of probability theory to describe the uncertainty in model parameters.
  • The view of parameters, and the Bayesian reading of prior and posterior probabilities: for a frequentist, a model parameter $\mathbf{w}$ is a fixed quantity to be estimated with an “estimator” (the most common choice being maximum likelihood estimation). For a Bayesian, $\mathbf{w}$ is itself an uncertain quantity, whose uncertainty is expressed by a prior probability $p(\mathbf{w})$. To pin down the fixed $\mathbf{w}$, the frequentist repeats the experiment many times, obtaining different data sets $D$; for the Bayesian, there is only a single data set $D$, namely the one that is actually observed. Having observed $D$, the Bayesian revises the original belief about the parameter $\mathbf{w}$ (the prior probability), and the adjusted belief is expressed by the posterior probability $p(\mathbf{w} \mid D)$. The revision is carried out via Bayes’ theorem, the central theorem of the Bayesian approach, which converts a prior probability into a posterior probability by incorporating the evidence provided by the observed data. The conditional probability $p(D \mid \mathbf{w})$ is called the likelihood and expresses how probable the observed data set is for different settings of the parameter vector $\mathbf{w}$: $$p(\mathbf{w} \mid D) = \frac{p(D \mid \mathbf{w})\,p(\mathbf{w})}{p(D)}.$$
    The denominator $p(D)$ is only a normalization constant, ensuring that the left-hand side $p(\mathbf{w} \mid D)$ is indeed a probability; it is computed from the quantities in the numerator as $p(D) = \int p(D \mid \mathbf{w})\,p(\mathbf{w})\,d\mathbf{w}$.
  • Understanding the posterior probability: it is the revised prior. For example, suppose there are $K$ classes with prior probabilities $p(C_k)$. If we are asked to guess the class of a data point of unknown class before seeing it, we should clearly guess the class with the largest prior probability. After observing the data $x$, we compute the posterior probability $p(C_k \mid x)$, and the “prior” is thereby revised to $p(C_k \mid x)$. If another data point of unknown class now arrives, we still guess by picking the class with the largest probability, except that this probability is now the posterior $p(C_k \mid x)$. (A small sketch of this prior-to-posterior revision follows this list.)
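The class-posterior revision described above can be sketched in a few lines of Python. This is a toy example of my own (the priors, class-conditional Gaussians, and the observed value are all made up for illustration):

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])        # p(C_k) for K = 3 hypothetical classes
means = np.array([-2.0, 0.0, 2.0])       # class-conditional densities p(x|C_k):
                                         # unit-variance Gaussians with these means

def likelihood(x):
    return np.exp(-0.5 * (x - means) ** 2) / np.sqrt(2.0 * np.pi)

x_obs = 1.5
posterior = prior * likelihood(x_obs)    # Bayes' theorem, up to normalization
posterior /= posterior.sum()

print(prior.argmax())                    # best guess before seeing x: class 0
print(posterior.argmax())                # best guess after seeing x:  class 2
```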

5.3 Bayes’ theorem and Bayesian Probability

Using examples to understand Bayesian Probability and Bayes’ theorem:

  • Fruit Example: Recall that in the boxes of fruit example, the observation of the identity of the fruit provided relevant information that altered the probability that the chosen box was the red one. In that example, Bayes’ theorem was used to convert a prior probability into a posterior probability by incorporating the evidence provided by the observed data.
  • Polynomial curve fitting example: we can adopt a similar approach when making inferences about quantities such as the parameters $\mathbf{w}$ in the polynomial curve fitting example. We capture our assumptions about $\mathbf{w}$, before observing the data, in the form of a prior probability distribution $p(\mathbf{w})$. The effect of the observed data $D = \{t_1, \dots, t_N\}$ is expressed through the conditional probability $p(D \mid \mathbf{w})$, and we shall see later, in Section 1.2.5, how this can be represented explicitly.

Bayes’ theorem:

Bayes’ theorem, which takes the form
$$p(\mathbf{w} \mid D) = \frac{p(D \mid \mathbf{w})\,p(\mathbf{w})}{p(D)}, \qquad (1.43)$$
i.e.,
$$\text{posterior} \propto \text{likelihood} \times \text{prior},$$

 

then allows us to evaluate the uncertainty in $\mathbf{w}$ after we have observed $D$, in the form of the posterior probability $p(\mathbf{w} \mid D)$.

Bayes’ theorem, where all of these quantities are viewed as functions of $\mathbf{w}$, incorporates four notions:

  • Prior: $p(\mathbf{w})$.
  • Likelihood: The quantity $p(D \mid \mathbf{w})$ on the right-hand side of Bayes’ theorem is evaluated for the observed data set $D$ and can be viewed as a function of the parameter vector $\mathbf{w}$, in which case it is called the likelihood function. It expresses how probable the observed data set is for different settings of the parameter vector $\mathbf{w}$. Note that the likelihood is not a probability distribution over $\mathbf{w}$, and its integral with respect to $\mathbf{w}$ does not (necessarily) equal one.
  • Evidence: $p(D) = \int p(D \mid \mathbf{w})\,p(\mathbf{w})\,d\mathbf{w}$, the denominator in (1.43), is the normalization constant, which ensures that the posterior distribution on the left-hand side is a valid probability density and integrates to one.
  • Posterior: $p(\mathbf{w} \mid D)$.

How is the likelihood function interpreted in the Bayesian and frequentist paradigms?

  • In a frequentist setting: $\mathbf{w}$ is considered to be a fixed parameter, whose value is determined by some form of “estimator” (a widely used frequentist estimator is maximum likelihood, in which $\mathbf{w}$ is set to the value that maximizes the likelihood function $p(D \mid \mathbf{w})$; this corresponds to choosing the value of $\mathbf{w}$ for which the probability of the observed data set is maximized), and error bars on this estimate are obtained by considering the distribution of possible data sets $D$ (one approach to determining frequentist error bars is the bootstrap, in which multiple data sets are created by repeated sampling from the original data set; see the sketch after this list).
  • From the Bayesian viewpoint, there is only a single data set $D$ (namely the one that is actually observed), and the uncertainty in the parameters is expressed through a probability distribution over $\mathbf{w}$.
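As a concrete illustration of the bootstrap idea mentioned above, here is a minimal Python/NumPy sketch (my own; the data, the estimator, and the number of resamples are arbitrary) that attaches an error bar to the maximum-likelihood estimate of a mean:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=50)   # the single observed data set D

# Bootstrap: resample D with replacement many times and look at the spread of
# the maximum-likelihood estimate (here, the sample mean) across resamples.
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(2000)
])
print(data.mean(), boot_means.std())   # point estimate and its bootstrap error bar
```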

5.4 Pros (+) and Cons (-)

  • Pros(+) of Bayes over Frequentist: the inclusion of prior knowledge arises naturally. Suppose, for instance, that a fair-looking coin is tossed three times and lands heads each time. A classical maximum likelihood estimate of the probability of landing heads would give 1, implying that all future tosses will land heads! By contrast, a Bayesian approach with any reasonable prior will lead to a much less extreme conclusion.
  • Cons(-) of Bayes against Frequentist: one common criticism of the Bayesian approach is that the prior distribution is often selected on the basis of mathematical convenience rather than as a reflection of any prior beliefs. Even the subjective nature of the conclusions through their dependence on the choice of prior is seen by some as a source of difficulty. Reducing the dependence on the prior is one motivation for so-called noninformative priors. However, these lead to difficulties when comparing different models, and indeed Bayesian methods based on poor choices of prior can give poor results with high confidence. Frequentist evaluation methods offer some protection from such problems, and techniques such as cross-validation remain useful in areas such as model comparison.
  • Cons(-) of Frequentist: the over-fitting problem can be understood as a general property of maximum likelihood.

5.5 Dealing with the over-fitting problem [from Ref-1]

  • How frequentists control over-fitting:
    1. Regularization, i.e., adding a penalty term to the objective function: an L2 regularizer gives ridge regression, and an L1 regularizer gives lasso regression. Adding a penalty is also called a shrinkage method, because it can reduce the value of the coefficients.
    2. Cross-validation, i.e., holding out part of the data for validation. Cross-validation is also a method for model selection: using the held-out validation data, we can select the best of several trained models.
  • How Bayesians control over-fitting: through the prior probability.

5.6 Difficulties in Carrying through the Full Bayesian Procedure: Marginalization

The practical application of Bayesian methods was for a long time severely limited by the difficulties in carrying through the full Bayesian procedure, particularly the need to marginalize (sum or integrate) over the whole of parameter space, which, as we shall see, is required in order to make predictions or to compare different models.

(^=^ Remarks like this feel most satisfying when said in Chinese! ^=^) In short: the application of Bayesian methods was long constrained by marginalization, because a full Bayesian procedure must marginalize (sum or integrate) over the whole of parameter space in order to make predictions or to compare different models.

The door to the practical use of Bayesian techniques across an impressive range of problem domains has been opened by the following developments:

  1. the development of sampling methods, e.g., Markov Chain Monte Carlo (MCMC). Monte Carlo methods are very flexible and can be applied to a wide range of models. However, they are computationally intensive and have mainly been used for small-scale problems.
  2. Dramatic improvements in the speed (i.e. CPU) and memory capacity of computers.
  3. Highly efficient deterministic approximation schemes, such as variational Bayes and expectation propagation (discussed in Chapter 10) have been developed. These offer a complementary alternative to sampling methods and have allowed Bayesian techniques to be used in large-scale applications (Blei et al., 2003).

6. Maximum-likelihood Estimation (MLE) for a univariate Gaussian Case

6.1 Gaussian distribution:

  • 1-dimension: $$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\},$$ which satisfies $\mathcal{N}(x \mid \mu, \sigma^2) > 0$ and $\int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\,dx = 1$, with $\mathbb{E}[x] = \mu$ and $\operatorname{var}[x] = \sigma^2$.
  • D-dimension: $$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}}\,\frac{1}{|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\},$$ where the D-dimensional vector $\boldsymbol{\mu}$ is called the mean, the $D \times D$ matrix $\boldsymbol{\Sigma}$ is called the covariance, and $|\boldsymbol{\Sigma}|$ denotes the determinant of $\boldsymbol{\Sigma}$.

6.2 Sampling from a Gaussian distribution [see Ref 2]

[Slide from Ref-2 illustrating how samples from a Gaussian distribution can be generated; image not reproduced here.]
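One standard way to draw Gaussian samples (a sketch of my own, not necessarily the construction used in the Ref-2 slide) is the Box–Muller transform, which turns uniform random numbers into standard normal draws that are then shifted and scaled:

```python
import numpy as np

def sample_gaussian(mu, sigma, n, seed=0):
    """Draw n samples from N(mu, sigma^2) via the Box-Muller transform."""
    rng = np.random.default_rng(seed)
    u1 = 1.0 - rng.random(n)          # in (0, 1], avoids log(0)
    u2 = rng.random(n)
    z = np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)   # z ~ N(0, 1)
    return mu + sigma * z

s = sample_gaussian(mu=1.0, sigma=0.5, n=100_000)
print(s.mean(), s.var())              # close to 1.0 and 0.25
```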

6.3 Taking the univariate Gaussian as an example

Now suppose that we have a data set of observations $\mathbf{x} = (x_1, \dots, x_N)^{\mathrm T}$, representing $N$ observations of the scalar variable $x$. We shall suppose that the observations are drawn independently from a Gaussian distribution whose mean $\mu$ and variance $\sigma^2$ are unknown, and we would like to determine these parameters from the data set. Data points that are drawn independently from the same distribution are said to be independent and identically distributed, which is often abbreviated to i.i.d.

Because our data set $\mathbf{x}$ is i.i.d., we can therefore write the probability of the data set, given $\mu$ and $\sigma^2$, in the form $$p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2). \qquad (1.53)$$
In practice, it is more convenient to maximize the log of the likelihood function, written in the form $$\ln p(\mathbf{x} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2 - \frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi). \qquad (1.54)$$


Why take the log?

  • The logarithm is a monotonically increasing function of its argument, so maximization of the log of a function is equivalent to maximization of the function itself.
  • Taking the log simplifies the subsequent mathematical analysis.
  • Taking the log helps numerically, because the product of a large number of small probabilities can easily underflow the numerical precision of the computer; this is resolved by computing instead the sum of the log probabilities (a small numerical illustration follows this list).
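The underflow point is easy to demonstrate (a tiny sketch of my own):

```python
import numpy as np

p = np.full(2000, 1e-3)      # 2000 small probabilities
print(np.prod(p))            # 0.0 -- the product underflows double precision
print(np.sum(np.log(p)))     # about -13815.5 -- the log-likelihood is still usable
```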

When viewed as a function of $\mu$ and $\sigma^2$, this is the likelihood function for the Gaussian and is interpreted diagrammatically in Figure 1.14.

Maximizing (1.54) with respect to $\mu$, we obtain the maximum likelihood solution given by
$$\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad (1.55)$$
which is called the sample mean, i.e., the mean of the observed values $\{x_n\}$. Similarly, maximizing (1.54) with respect to $\sigma^2$, we obtain the so-called sample variance $\sigma^2_{\mathrm{ML}}$, measured with respect to the sample mean $\mu_{\mathrm{ML}}$, in the form
$$\sigma^2_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu_{\mathrm{ML}})^2. \qquad (1.56)$$
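Both maximum likelihood solutions are one-liners in NumPy (my own sketch; the true parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)   # observations from N(mu=2, sigma^2=9)

mu_ml = x.mean()                        # (1.55): (1/N) sum_n x_n
sigma2_ml = ((x - mu_ml) ** 2).mean()   # (1.56): (1/N) sum_n (x_n - mu_ML)^2
print(mu_ml, sigma2_ml)                 # close to 2 and 9
```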

 

6.4 One Limitation of the Maximum Likelihood Approach

Limitation: The maximum likelihood approach systematically underestimates the variance of the distribution. This is an example of a phenomenon called bias and is related to the problem of over-fitting encountered in the context of polynomial curve fitting.

[Figure 1.15: illustration of how bias arises when maximum likelihood is used to determine the variance of a Gaussian; averaged over data sets, the fitted mean is correct but the fitted variance is systematically underestimated.]

We first note that the maximum likelihood solutions $\mu_{\mathrm{ML}}$ and $\sigma^2_{\mathrm{ML}}$ are functions of the data set values $x_1, \dots, x_N$. Consider the expectations of these quantities with respect to the data set values, which themselves come from a Gaussian distribution with parameters $\mu$ and $\sigma^2$. It is straightforward to show that
$$\mathbb{E}[\mu_{\mathrm{ML}}] = \mu, \qquad (1.57)$$
$$\mathbb{E}[\sigma^2_{\mathrm{ML}}] = \left(\frac{N-1}{N}\right)\sigma^2. \qquad (1.58)$$
From (1.58) it follows that the following estimate for the variance parameter is unbiased: $$\widetilde{\sigma}^2 = \frac{N}{N-1}\,\sigma^2_{\mathrm{ML}} = \frac{1}{N-1}\sum_{n=1}^{N}(x_n - \mu_{\mathrm{ML}})^2. \qquad (1.59)$$ (A small simulation demonstrating this bias follows.)
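The bias in (1.58) can be seen empirically by averaging the two estimators over many small data sets (a Python/NumPy sketch of my own; $N$, $\sigma^2$, and the number of trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 5, 200_000
x = rng.normal(loc=0.0, scale=1.0, size=(trials, N))   # many data sets of size N

sigma2_ml = x.var(axis=1, ddof=0)         # maximum likelihood variance (divide by N)
sigma2_unbiased = x.var(axis=1, ddof=1)   # corrected estimator (1.59) (divide by N-1)

print(sigma2_ml.mean())        # ~ (N-1)/N * sigma^2 = 0.8, i.e. biased low
print(sigma2_unbiased.mean())  # ~ 1.0, i.e. unbiased
```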

 

In fact, as we shall see, the issue of bias in maximum likelihood lies at the root of the over-fitting problem that we encountered earlier in the context of polynomial curve fitting.


Exercise 1.12

Using the results $\mathbb{E}[x] = \mu$ and $\mathbb{E}[x^2] = \mu^2 + \sigma^2$, show that $$\mathbb{E}[x_n x_m] = \mu^2 + I_{nm}\,\sigma^2,$$ where $x_n$ and $x_m$ denote data points sampled from a Gaussian distribution with mean $\mu$ and variance $\sigma^2$, and $I_{nm} = 1$ if $n = m$ and $I_{nm} = 0$ otherwise. Hence prove the results (1.57) and (1.58).

Solution (sketch):

Because the observations are i.i.d., $\mathbb{E}[x_n x_m] = \mathbb{E}[x_n]\,\mathbb{E}[x_m] = \mu^2$ for $n \neq m$, while $\mathbb{E}[x_n^2] = \mu^2 + \sigma^2$ for $n = m$; together this gives $\mu^2 + I_{nm}\sigma^2$. Substituting this result into the definitions of $\mu_{\mathrm{ML}}$ and $\sigma^2_{\mathrm{ML}}$ and expanding the expectations term by term then yields (1.57) and (1.58).


7. Curve fitting re-visited

7.1 Purpose:

MLE (point estimate) → probabilistic model → MAP → Bayesian treatment

We have seen how the problem of polynomial curve fitting can be expressed in terms of error minimization in Section 1.1. Here we return to the curve fitting example and view it from a probabilistic perspective, thereby gaining some insights into error functions and regularization, as well as taking us towards a full Bayesian treatment.

7.2 Goal in the curve fitting problem:

The goal in the curve fitting problem is to be able to make predictions for the target variable $t$ given some new value of the input variable $x$, on the basis of a set of training data comprising $N$ input values $\mathbf{x} = (x_1, \dots, x_N)^{\mathrm T}$ and their corresponding target values $\mathbf{t} = (t_1, \dots, t_N)^{\mathrm T}$.

7.3 Uncertainty over the value of the target variable $t$

We can express our uncertainty over the value of the target variable $t$ using a probability distribution. For this purpose, we shall assume that, given the value of $x$, the corresponding value of $t$ has a Gaussian distribution with a mean equal to the value $y(x, \mathbf{w})$ of the polynomial curve given by (1.1). Thus we have $$p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\big(t \mid y(x, \mathbf{w}), \beta^{-1}\big), \qquad (1.60)$$
where, for consistency with the notation in later chapters, we have defined a precision parameter $\beta$ corresponding to the inverse variance of the distribution. This is illustrated schematically in Figure 1.16.

For the i.i.d. training data $\{\mathbf{x}, \mathbf{t}\}$, the likelihood function is given by $$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\big(t_n \mid y(x_n, \mathbf{w}), \beta^{-1}\big), \qquad (1.61)$$ and the log likelihood function takes the form $$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi). \qquad (1.62)$$

We can use maximum likelihood to determine the precision parameter $\beta$ of the Gaussian conditional distribution: $$\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{N}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}_{\mathrm{ML}}) - t_n\}^2, \qquad (1.63)$$ where $\mathbf{w}_{\mathrm{ML}}$ maximizes (1.62) with respect to $\mathbf{w}$, which is equivalent to minimizing the sum-of-squares error. (A short end-to-end sketch of this fit follows.)
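The whole maximum likelihood fit, for $\mathbf{w}_{\mathrm{ML}}$ and $\beta_{\mathrm{ML}}$, fits in a few lines of NumPy. This is a sketch of my own (synthetic sinusoidal data, arbitrary polynomial order $M$ and noise level), not code from PRML:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, beta_true = 20, 3, 25.0
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=beta_true ** -0.5, size=N)

# Design matrix with columns x^0, ..., x^M. Maximizing the Gaussian likelihood
# over w is the same as least squares on the sum-of-squares error.
Phi = np.vander(x, M + 1, increasing=True)
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# (1.63): 1/beta_ML = (1/N) sum_n { y(x_n, w_ML) - t_n }^2
beta_ml = 1.0 / np.mean((Phi @ w_ml - t) ** 2)
print(w_ml, beta_ml)
```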


7.4 The Likelihood for Linear Regression and its MLE Solution in the Point-Estimate Category [see Ref-2]

The same idea can be found in Lecture 3 of [Ref-2]. Note that there $y$ is used to represent the target variable, $\mathbf{X}$ the matrix of inputs, and $\theta$ the parameters. The maximum likelihood estimate (MLE) of $\theta$ is obtained by taking the derivative of the log-likelihood $\ln p(\mathbf{y} \mid \mathbf{X}, \theta, \sigma^2)$ with respect to $\theta$ and setting it to zero. The goal is to maximize the likelihood of seeing the training data $\{\mathbf{X}, \mathbf{y}\}$ by modifying the parameters $\theta$.

  • The MLE of $\theta$ is:

$$\hat{\theta} = (\mathbf{X}^{\mathrm T}\mathbf{X})^{-1}\mathbf{X}^{\mathrm T}\mathbf{y}$$

  • The MLE of $\sigma^2$ is:

$$\hat{\sigma}^2 = \frac{1}{N}\sum_{n=1}^{N}\big(y_n - \mathbf{x}_n^{\mathrm T}\hat{\theta}\big)^2$$


7.5 Making predictions:

Because we now have a probabilistic model, predictions are expressed in terms of the predictive distribution, which gives the probability distribution over $t$ rather than simply a point estimate, and is obtained by substituting the maximum likelihood parameters into (1.60) to give $$p(t \mid x, \mathbf{w}_{\mathrm{ML}}, \beta_{\mathrm{ML}}) = \mathcal{N}\big(t \mid y(x, \mathbf{w}_{\mathrm{ML}}),\ \beta_{\mathrm{ML}}^{-1}\big). \qquad (1.64)$$

As a special case, for linear regression the MLE plug-in prediction [Ref-2], given the training data $\{\mathbf{X}, \mathbf{y}\}$, a new input $\mathbf{x}$, and known $\sigma^2$, is $$p(y \mid \mathbf{x}, \mathbf{X}, \mathbf{y}) = \mathcal{N}\big(y \mid \mathbf{x}^{\mathrm T}\hat{\theta},\ \sigma^2\big),$$ as illustrated in the corresponding figure of the Ref-2 slides.

7.6 Taking a step towards a more Bayesian approach

  • Prior distribution over the polynomial coefficients $\mathbf{w}$: For simplicity, let us consider a Gaussian distribution of the form $$p(\mathbf{w} \mid \alpha) = \mathcal{N}\big(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}\big) = \left(\frac{\alpha}{2\pi}\right)^{(M+1)/2} \exp\left\{-\frac{\alpha}{2}\mathbf{w}^{\mathrm T}\mathbf{w}\right\}, \qquad (1.65)$$
    where $\alpha$ is the precision of the distribution, and $M+1$ is the total number of elements in the vector $\mathbf{w}$ for an $M$th-order polynomial.
  • Hyperparameters: Variables such as $\alpha$, which control the distribution of model parameters, are called hyperparameters.
  • Calculate the posterior distribution for $\mathbf{w}$: Using Bayes’ theorem, the posterior distribution for $\mathbf{w}$ is given by $$p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)\,p(\mathbf{w} \mid \alpha). \qquad (1.66)$$
  • MAP, maximum posterior: We can now determine $\mathbf{w}$ by finding the most probable value of $\mathbf{w}$ given the data, in other words by maximizing the posterior distribution. This technique is called maximum posterior, or simply MAP. Taking the negative logarithm of (1.66) and combining with (1.62) and (1.65), we find that the maximum of the posterior is given by the minimum of $$\frac{\beta}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{\alpha}{2}\mathbf{w}^{\mathrm T}\mathbf{w}. \qquad (1.67)$$
  • Equivalence between posterior maximization and the regularized sum-of-squares error function: Thus we see that maximizing the posterior distribution is equivalent to minimizing the regularized sum-of-squares error function encountered earlier in the form (1.4), with a regularization parameter given by $\lambda = \alpha/\beta$ (see the sketch after this list).
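The MAP solution of (1.67) is just regularized (ridge) least squares. A minimal sketch of my own, reusing the hypothetical design matrix Phi, targets t, and beta_ml from the earlier maximum-likelihood sketch:

```python
import numpy as np

def map_fit(Phi, t, alpha, beta):
    """Minimize beta/2 * ||Phi w - t||^2 + alpha/2 * ||w||^2,
    i.e. regularized least squares with lambda = alpha / beta."""
    lam = alpha / beta
    dim = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(dim), Phi.T @ t)

# Example (assumes Phi, t, beta_ml exist as in the previous sketch; alpha is arbitrary):
# w_map = map_fit(Phi, t, alpha=5e-3, beta=beta_ml)
```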

Note:

Although we have included a prior distribution $p(\mathbf{w} \mid \alpha)$, we are so far still making a point estimate of $\mathbf{w}$, and so this does not yet amount to a Bayesian treatment, which is discussed in the following section.


8. Bayesian Curve fitting

In a fully Bayesian approach, we should consistently apply the sum and product rules of probability, which requires, as we shall see shortly, that we integrate over (i.e., to marginalize) all values of w. Such marginalizations lie at the heart of Bayesian methods for pattern recognition.

In the curve fitting problem, we are given the training data $\mathbf{x}$ and $\mathbf{t}$, along with a new test point $x$, and our goal is to predict the value of $t$. We therefore wish to evaluate the predictive distribution $p(t \mid x, \mathbf{x}, \mathbf{t})$. Here we shall assume that the parameters $\alpha$ and $\beta$ are fixed and known in advance (in later chapters we shall discuss how such parameters can be inferred from data in a Bayesian setting).

A Bayesian treatment simply corresponds to a consistent application of the sum and product rules of probability, which allow the predictive distribution to be written in the form $$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, \mathbf{w})\,p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\,d\mathbf{w}. \qquad (1.68)$$

  • $p(t \mid x, \mathbf{w})$ in the RHS: is given by (1.60), and we have omitted the dependence on $\alpha$ and $\beta$ to simplify the notation.
  • $p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})$ in the RHS: is the posterior distribution over parameters, and can be found by normalizing the right-hand side of equation (1.66). It will be shown in Section 3.3 that this posterior distribution is a Gaussian and can be evaluated analytically.
  • LHS: the integration in (1.68) can also be performed analytically, with the result that the predictive distribution is given by a Gaussian of the form $$p(t \mid x, \mathbf{x}, \mathbf{t}) = \mathcal{N}\big(t \mid m(x), s^2(x)\big), \qquad (1.69)$$
    where the mean and variance are given by $$m(x) = \beta\,\phi(x)^{\mathrm T}\mathbf{S}\sum_{n=1}^{N}\phi(x_n)\,t_n, \qquad (1.70)$$ $$s^2(x) = \beta^{-1} + \phi(x)^{\mathrm T}\mathbf{S}\,\phi(x). \qquad (1.71)$$
    Here the matrix $\mathbf{S}$ is given by $$\mathbf{S}^{-1} = \alpha\mathbf{I} + \beta\sum_{n=1}^{N}\phi(x_n)\,\phi(x_n)^{\mathrm T}, \qquad (1.72)$$ where $\mathbf{I}$ is the unit matrix, and the vector $\phi(x)$ has elements $\phi_i(x) = x^i$ for $i = 0, \dots, M$. (A direct implementation of (1.70)–(1.72) is sketched after this list.)
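Equations (1.70)–(1.72) translate directly into NumPy. The following is a sketch of my own (the function name and the choices of $\alpha$, $\beta$ are assumptions, not values from PRML):

```python
import numpy as np

def bayes_predict(x_new, x, t, M, alpha, beta):
    """Predictive mean m(x) and variance s^2(x) from (1.70)-(1.72),
    with basis functions phi_i(x) = x^i for i = 0..M."""
    phi = lambda u: np.vander(np.atleast_1d(u), M + 1, increasing=True)
    Phi = phi(x)                                          # N x (M+1) design matrix
    S_inv = alpha * np.eye(M + 1) + beta * Phi.T @ Phi    # (1.72)
    S = np.linalg.inv(S_inv)
    phi_new = phi(x_new).ravel()
    mean = beta * phi_new @ S @ (Phi.T @ t)               # (1.70)
    var = 1.0 / beta + phi_new @ S @ phi_new              # (1.71)
    return mean, var

# Example usage on synthetic sinusoidal data (arbitrary alpha and beta):
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)
print(bayes_predict(0.5, x, t, M=3, alpha=5e-3, beta=25.0))
```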

Analysis of (1.71):

  • the first term, $\beta^{-1}$: represents the uncertainty in the predicted value of $t$ due to the noise on the target variables, and was expressed already in the maximum likelihood predictive distribution (1.64) through $\beta_{\mathrm{ML}}^{-1}$.
  • the second term, $\phi(x)^{\mathrm T}\mathbf{S}\,\phi(x)$: arises from the uncertainty in the parameters $\mathbf{w}$ and is a consequence of the Bayesian treatment.

The predictive distribution for the synthetic sinusoidal regression problem is illustrated in Figure 1.17.

9. The three approaches illustrated on curve fitting [See Ref-1]

  • 1) MLE: maximize the likelihood function directly to obtain the parameters $\mathbf{w}_{\mathrm{ML}}$. This is a point estimate.
  • 2) MAP (“poor man’s Bayes”): introduce a prior probability and maximize the posterior probability to obtain $\mathbf{w}_{\mathrm{MAP}}$. MAP then amounts to adding an L2 penalty to the MLE objective (the likelihood function). This is still a point estimate.
  • 3) Fully Bayesian approach: apply the sum rule and the product rule (since the machinery of “degree of belief” is the same as that of probability, both rules hold for degrees of belief); obtaining the predictive distribution then requires marginalizing (summing or integrating) over the whole of parameter space $\mathbf{w}$: $$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, \mathbf{w})\,p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\,d\mathbf{w},$$
    where $x$ is the point to be predicted, $\mathbf{x}$ is the observed data set, and $\mathbf{t}$ holds the corresponding labels of the data points. This is a weighted average of $p(t \mid x, \mathbf{w})$, with the posterior probability of the parameters $\mathbf{w}$ as the weights; the procedure therefore requires integrating over $\mathbf{w}$, i.e., marginalization.

10. References

[1]: PRML笔记 (Notes on Pattern Recognition and Machine Learning), Chapter 01, pages 4–6. http://www.cvrobot.net/wp-content/uploads/2015/09/PRML%E7%AC%94%E8%AE%B0-Notes-on-Pattern-Recognition-and-Machine-Learning-1.pdf

[2]: Slides of the machine learning / deep learning course at Oxford University (Nando de Freitas). https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/

 

Original post: http://www.cnblogs.com/glory-of-family/p/5602322.html
