Chapter 1.2 : Probability Theory
Christopher M. Bishop, PRML, Chapter 1 Introduction
A key concept in the field of pattern recognition is that of uncertainty. It arises both through noise on measurements, as well as through the finite size of data sets. Probability theory provides a consistent framework for the quantification and manipulation of uncertainty and forms one of the central foundations for pattern recognition. When combined with decision theory, discussed in Section 1.5 (see PRML), it allows us to make optimal predictions given all the information available to us, even though that information may be incomplete or ambiguous.
We will introduce the basic concepts of probability theory by considering a simple example. Imagine we have two boxes, one red and one blue, and in the red box we have 2 apples and 6 oranges, and in the blue box we have 3 apples and 1 orange. This is illustrated in Figure 1.9.
Now suppose we randomly pick one of the boxes, and from that box we randomly select an item of fruit; having observed which sort of fruit it is, we put it back in the box from which it came. We could imagine repeating this process many times. Let us suppose that in so doing we pick the red box 40% of the time and the blue box 60% of the time, and that when we pick an item of fruit from a box we are equally likely to select any of the pieces of fruit in the box.
In this example, the identity of the box that will be chosen is a random variable, which we shall denote by $B$. This random variable can take one of two possible values, namely $r$ (corresponding to the red box) or $b$ (corresponding to the blue box). Similarly, the identity of the fruit is also a random variable and will be denoted by $F$. It can take either of the values $a$ (for apple) or $o$ (for orange). To begin with, we shall define the probability of an event to be the fraction of times that event occurs out of the total number of trials, in the limit that the total number of trials goes to infinity. Thus the probability of selecting the red box is $p(B=r) = 4/10$.
PDF, Probability Density Function: if the probability of a real-valued variable $x$ falling in the interval $(x, x+\delta x)$ is given by $p(x)\,\delta x$ for $\delta x \to 0$, then $p(x)$ is called the probability density over $x$.
The pdf $p(x)$ must satisfy the two conditions
$$p(x) \geq 0, \qquad \int_{-\infty}^{\infty} p(x)\,dx = 1.$$
PMF, Probability Mass Function: note that if $x$ is a discrete variable, then $p(x)$ is called a probability mass function because it can be regarded as a set of “probability masses” concentrated at the allowed values of $x$.
Expectation of $f(x)$: the average value of some function $f(x)$ under a probability distribution $p(x)$ is called the expectation of $f(x)$ and will be denoted by $\mathbb{E}[f]$, given by
$$\mathbb{E}[f] = \sum_x p(x)\,f(x) \qquad \text{and} \qquad \mathbb{E}[f] = \int p(x)\,f(x)\,dx,$$
for discrete variables and continuous variables, respectively.
Approximating the expectation using sampling methods: if we are given a finite number $N$ of points drawn from the pdf, then the expectation can be approximated as a finite sum over these points:
$$\mathbb{E}[f] \simeq \frac{1}{N}\sum_{n=1}^{N} f(x_n).$$
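As a quick numerical check of this finite-sum approximation (a minimal sketch assuming numpy; the standard-normal target and sample size are arbitrary illustrative choices), we can estimate $\mathbb{E}[x^2]$ under a standard normal, whose exact value is 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw N samples from a standard normal p(x) and approximate E[f] for f(x) = x^2.
# For a standard normal, the exact value is E[x^2] = 1.
N = 100_000
x = rng.standard_normal(N)
approx = np.mean(x ** 2)          # (1/N) * sum_n f(x_n)

print(abs(approx - 1.0) < 0.05)   # the finite-sum estimate is close to the truth
```

The error of such a Monte Carlo estimate shrinks like $1/\sqrt{N}$, independent of the dimensionality of $x$.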
Expectations of functions of several variables: here we can use a subscript to indicate which variable is being averaged over, so that for instance $\mathbb{E}_x[f(x,y)]$ denotes the average of the function $f(x,y)$ with respect to the distribution of $x$. Note that $\mathbb{E}_x[f(x,y)]$ will be a function of $y$.
Variance of $f(x)$: defined by
$$\mathrm{var}[f] = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big],$$
it provides a measure of how much variability there is in $f(x)$ around its mean value $\mathbb{E}[f(x)]$. Expanding out the square, we get
$$\mathrm{var}[f] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2.$$
Covariance of two r.v.’s $x$ and $y$: defined by
$$\mathrm{cov}[x,y] = \mathbb{E}_{x,y}\big[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\big] = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\,\mathbb{E}[y].$$
Covariance of two vectors of r.v.’s $\mathbf{x}$ and $\mathbf{y}$: defined by
$$\mathrm{cov}[\mathbf{x},\mathbf{y}] = \mathbb{E}_{\mathbf{x},\mathbf{y}}\big[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{y}^{\mathrm T} - \mathbb{E}[\mathbf{y}^{\mathrm T}])\big] = \mathbb{E}_{\mathbf{x},\mathbf{y}}[\mathbf{x}\mathbf{y}^{\mathrm T}] - \mathbb{E}[\mathbf{x}]\,\mathbb{E}[\mathbf{y}^{\mathrm T}].$$
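The identity $\mathrm{cov}[x,y] = \mathbb{E}[xy] - \mathbb{E}[x]\mathbb{E}[y]$ can be verified numerically (a sketch assuming numpy; the correlated toy data are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample correlated pairs (x, y) and compare the definition
# cov[x, y] = E[xy] - E[x]E[y] against numpy's built-in estimator.
x = rng.standard_normal(10_000)
y = 0.5 * x + rng.standard_normal(10_000)   # y is correlated with x

cov_def = np.mean(x * y) - np.mean(x) * np.mean(y)
cov_np = np.cov(x, y, bias=True)[0, 1]      # bias=True divides by N, matching the definition

print(np.isclose(cov_def, cov_np))          # True: both formulas agree
```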
In order to derive the rules of probability, consider the following example shown in Figure 1.10 involving two random variables $X$ and $Y$. We shall suppose that $X$ can take any of the values $x_i$ where $i = 1, \dots, M$, and $Y$ can take the values $y_j$, where $j = 1, \dots, L$. Consider a total of $N$ trials in which we sample both of the variables $X$ and $Y$, and let the number of such trials in which $X = x_i$ and $Y = y_j$ be $n_{ij}$. Also, let the number of trials in which $X$ takes the value $x_i$ (irrespective of the value that $Y$ takes) be denoted by $c_i$, and similarly let the number of trials in which $Y$ takes the value $y_j$ be denoted by $r_j$.
Discrete Variables: the sum and product rules of probability take the form
$$\text{sum rule:} \qquad p(X) = \sum_Y p(X, Y),$$
$$\text{product rule:} \qquad p(X, Y) = p(Y \mid X)\, p(X).$$
Continuous Variables: if $x$ and $y$ are two real continuous variables, then the sum and product rules take the form
$$p(x) = \int p(x, y)\,dy, \qquad p(x, y) = p(y \mid x)\, p(x).$$
Bayes’ theorem: from the product rule, together with the symmetry property $p(X, Y) = p(Y, X)$, we immediately obtain the following relationship between conditional probabilities:
$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}.$$
Using the sum rule, the denominator in Bayes’ theorem can be expressed in terms of the quantities appearing in the numerator:
$$p(X) = \sum_Y p(X \mid Y)\, p(Y).$$
We can view the denominator in Bayes’ theorem as being the normalization constant required to ensure that the sum of the conditional probability on the left-hand side of (1.12) over all values of $Y$ equals one.
Let us now return to our example involving boxes of fruit. For the moment, we shall once again be explicit about distinguishing between the random variables and their instantiations. We have seen that the probabilities of selecting either the red or the blue boxes are given by $p(B=r) = 4/10$ and $p(B=b) = 6/10$, respectively. Note that these satisfy $p(B=r) + p(B=b) = 1$.
Now suppose that we pick a box at random, and it turns out to be the blue box. Then the probability of selecting an apple is just the fraction of apples in the blue box, which is $3/4$, and so $p(F=a \mid B=b) = 3/4$. In fact, we can write out all four conditional probabilities for the type of fruit, given the selected box:
$$p(F=a \mid B=r) = 1/4, \qquad p(F=o \mid B=r) = 3/4,$$
$$p(F=a \mid B=b) = 3/4, \qquad p(F=o \mid B=b) = 1/4.$$
We can now use the sum and product rules of probability to evaluate the overall probability of choosing an apple:
$$p(F=a) = p(F=a \mid B=r)\,p(B=r) + p(F=a \mid B=b)\,p(B=b) = \frac{1}{4}\times\frac{4}{10} + \frac{3}{4}\times\frac{6}{10} = \frac{11}{20},$$
from which it follows, using the sum rule, that $p(F=o) = 1 - 11/20 = 9/20$.
Suppose instead we are told that a piece of fruit has been selected and it is an orange, and we would like to know which box it came from. This requires that we evaluate the probability distribution over boxes conditioned on the identity of the fruit, whereas the probabilities in (1.16)–(1.19) give the probability distribution over the fruit conditioned on the identity of the box. We can solve the problem of reversing the conditional probability by using Bayes’ theorem to give
$$p(B=r \mid F=o) = \frac{p(F=o \mid B=r)\,p(B=r)}{p(F=o)} = \frac{3}{4}\times\frac{4}{10}\times\frac{20}{9} = \frac{2}{3}.$$
From the sum rule, it then follows that $p(B=b \mid F=o) = 1 - 2/3 = 1/3$.
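The whole box-and-fruit calculation can be replayed in a few lines of exact rational arithmetic (a sketch using Python's standard `fractions` module; the dictionaries simply encode the probabilities stated in the text):

```python
from fractions import Fraction as F

# Priors over boxes: red picked 4/10 of the time, blue 6/10.
p_B = {'r': F(4, 10), 'b': F(6, 10)}
# Conditional probabilities of the fruit given the box (fruit fractions per box).
p_F_given_B = {('a', 'r'): F(1, 4), ('o', 'r'): F(3, 4),
               ('a', 'b'): F(3, 4), ('o', 'b'): F(1, 4)}

# Sum rule: p(F=o) = sum_B p(F=o|B) p(B)
p_o = sum(p_F_given_B[('o', box)] * p_B[box] for box in p_B)

# Bayes' theorem: p(B=r|F=o) = p(F=o|B=r) p(B=r) / p(F=o)
p_r_given_o = p_F_given_B[('o', 'r')] * p_B['r'] / p_o

print(p_o)           # 9/20
print(p_r_given_o)   # 2/3
```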
We can provide an important interpretation of Bayes’ theorem as follows.
Prior probability: if we had been asked which box had been chosen before being told the identity of the selected item of fruit, then the most complete information we have available is provided by the probability $p(B)$. We call this the prior probability because it is the probability available before we observe the identity of the fruit.
Posterior probability: once we are told that the fruit is an orange, we can then use Bayes’ theorem to compute the probability $p(B \mid F)$, which we shall call the posterior probability because it is the probability obtained after we have observed $F$.
Evidence: note that in this example, the prior probability of selecting the red box was $4/10$, so that we were more likely to select the blue box than the red one. However, once we have observed that the piece of selected fruit is an orange, we find that the posterior probability of the red box is now $2/3$, so that it is now more likely that the box we selected was in fact the red one. This result accords with our intuition, as the proportion of oranges is much higher in the red box than it is in the blue box, and so the observation that the fruit was an orange provides significant evidence favoring the red box. In fact, the evidence is sufficiently strong that it outweighs the prior and makes it more likely that the red box was chosen rather than the blue one.
Independent: finally, we note that if the joint distribution of two variables factorizes into the product of the marginals, so that $p(x, y) = p(x)\,p(y)$, then $x$ and $y$ are said to be independent. From the product rule, we see that $p(y \mid x) = p(y)$, and so the conditional distribution of $y$ given $x$ is indeed independent of the value of $x$. For instance, in our boxes of fruit example, if each box contained the same fraction of apples and oranges, then $p(F \mid B) = p(F)$, so that the probability of selecting, say, an apple is independent of which box is chosen.
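A quick numerical illustration of independence (a sketch in Python; the equal fruit fraction of $1/2$ per box is a hypothetical choice): if both boxes contain the same fraction of apples, the joint distribution factorizes and conditioning on the box changes nothing:

```python
from fractions import Fraction as F

# Hypothetical: both boxes contain the same fraction of apples (say 1/2),
# so the joint distribution factorizes as p(F, B) = p(F) p(B).
p_B = {'r': F(4, 10), 'b': F(6, 10)}
p_F = {'a': F(1, 2), 'o': F(1, 2)}
joint = {(f, b): p_F[f] * p_B[b] for f in p_F for b in p_B}

# p(F=a|B=b) = p(F=a, B=b) / p(B=b) equals the marginal p(F=a):
# F and B are independent, so observing the box tells us nothing about the fruit.
p_a_given_b = joint[('a', 'b')] / p_B['b']
print(p_a_given_b == p_F['a'])   # True
```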
Bayes’ theorem, which takes the form
$$p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\,p(\mathbf{w})}{p(\mathcal{D})},$$
then allows us to evaluate the uncertainty in $\mathbf{w}$ after we have observed $\mathcal{D}$ in the form of the posterior probability $p(\mathbf{w} \mid \mathcal{D})$.
Bayes’ theorem, where all of these quantities are viewed as functions of $\mathbf{w}$, incorporates four notions: the posterior $p(\mathbf{w} \mid \mathcal{D})$, the likelihood $p(\mathcal{D} \mid \mathbf{w})$, the prior $p(\mathbf{w})$, and the evidence $p(\mathcal{D})$, so that posterior $\propto$ likelihood $\times$ prior.
The practical application of Bayesian methods was for a long time severely limited by the difficulties in carrying through the full Bayesian procedure, particularly the need to marginalize (sum or integrate) over the whole of parameter space, which, as we shall see, is required in order to make predictions or to compare different models.
(^=^ Reflections like this are only satisfying when written in Chinese! ^=^) The application of Bayesian methods was long constrained by marginalization: for a full Bayesian procedure to make predictions or compare different models, a necessary step is to marginalize (sum or integrate) over the whole of parameter space.
The door to the practical use of Bayesian techniques in an impressive range of problem domains has been opened by the following: the development of sampling methods such as Markov chain Monte Carlo, together with dramatic improvements in the speed and memory capacity of computers, and more recently highly efficient deterministic approximation schemes such as variational Bayes and expectation propagation.
Now suppose that we have a data set of observations $\mathbf{x} = (x_1, \dots, x_N)^{\mathrm T}$, representing $N$ observations of the scalar variable $x$. We shall suppose that the observations are drawn independently from a Gaussian distribution whose mean $\mu$ and variance $\sigma^2$ are unknown, and we would like to determine these parameters from the data set. Data points that are drawn independently from the same distribution are said to be independent and identically distributed, which is often abbreviated to i.i.d.
Because our data set $\mathbf{x}$ is i.i.d., we can therefore write the probability of the data set, given $\mu$ and $\sigma^2$, in the form
$$p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2).$$
In practice, it is more convenient to maximize the log of the likelihood function, written in the form
$$\ln p(\mathbf{x} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2 - \frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi).$$
When viewed as a function of $\mu$ and $\sigma^2$, this is the likelihood function for the Gaussian and is interpreted diagrammatically in Figure 1.14.
Maximizing (1.54) with respect to $\mu$ and $\sigma^2$, we obtain the maximum likelihood solutions given by
$$\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad \sigma^2_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu_{\mathrm{ML}})^2.$$
Limitation: The maximum likelihood approach systematically underestimates the variance of the distribution. This is an example of a phenomenon called bias and is related to the problem of over-fitting encountered in the context of polynomial curve fitting.
We first note that the maximum likelihood solutions $\mu_{\mathrm{ML}}$ and $\sigma^2_{\mathrm{ML}}$ are functions of the data set values $x_1, \dots, x_N$. Consider the expectations of these quantities with respect to the data set values, which themselves come from a Gaussian distribution with parameters $\mu$ and $\sigma^2$. It is straightforward to show that
$$\mathbb{E}[\mu_{\mathrm{ML}}] = \mu, \qquad \mathbb{E}[\sigma^2_{\mathrm{ML}}] = \left(\frac{N-1}{N}\right)\sigma^2,$$
so that on average the maximum likelihood estimate will obtain the correct mean but will underestimate the true variance by a factor $(N-1)/N$.
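This bias can be seen empirically (a sketch assuming numpy; the parameter values and number of trials are arbitrary illustrative choices): averaging the ML estimators over many small data sets recovers the true mean but only $(N-1)/N = 0.8$ of the true variance for $N = 5$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Repeatedly draw small data sets from N(mu=0, sigma^2=1) and average the ML estimates.
mu, sigma2, N, trials = 0.0, 1.0, 5, 200_000
data = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))

mu_ml = data.mean(axis=1)
sigma2_ml = data.var(axis=1)       # divides by N: the ML (biased) estimator

# E[mu_ml] = mu = 0, but E[sigma2_ml] = (N-1)/N * sigma^2 = 0.8 here.
print(mu_ml.mean())                # ~0.0
print(sigma2_ml.mean())            # ~0.8, not 1.0
```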
In fact, as we shall see, the issue of bias in maximum likelihood lies at the root of the over-fitting problem that we encountered earlier in the context of polynomial curve fitting.
We have seen how the problem of polynomial curve fitting can be expressed in terms of error minimization in Section 1.1. Here we return to the curve fitting example and view it from a probabilistic perspective, thereby gaining some insights into error functions and regularization, as well as taking us towards a full Bayesian treatment.
The goal in the curve fitting problem is to be able to make predictions for the target variable $t$ given some new value of the input variable $x$, on the basis of a set of training data comprising $N$ input values $\mathbf{x} = (x_1, \dots, x_N)^{\mathrm T}$ and their corresponding target values $\mathbf{t} = (t_1, \dots, t_N)^{\mathrm T}$.
We can express our uncertainty over the value of the target variable using a probability distribution. For this purpose, we shall assume that, given the value of $x$, the corresponding value of $t$ has a Gaussian distribution with a mean equal to the value $y(x, \mathbf{w})$ of the polynomial curve given by (1.1). Thus we have
$$p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\big(t \mid y(x, \mathbf{w}), \beta^{-1}\big),$$
where, for consistency with the notation in later chapters, we have defined a precision parameter $\beta$ corresponding to the inverse variance of the distribution. This is illustrated schematically in Figure 1.16.
For the i.i.d. training data $\{\mathbf{x}, \mathbf{t}\}$, the likelihood function is given by
$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\big(t_n \mid y(x_n, \mathbf{w}), \beta^{-1}\big),$$
and the log likelihood function takes the form
$$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2}\sum_{n=1}^{N}\big(y(x_n, \mathbf{w}) - t_n\big)^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi).$$
We can also use maximum likelihood to determine the precision parameter $\beta$ of the Gaussian conditional distribution, giving
$$\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{N}\sum_{n=1}^{N}\big(y(x_n, \mathbf{w}_{\mathrm{ML}}) - t_n\big)^2.$$
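A minimal sketch of the curve-fitting MLE (assuming numpy; the sinusoidal data, cubic polynomial degree, and noise level are illustrative choices in the spirit of PRML's example): least squares gives $\mathbf{w}_{\mathrm{ML}}$, and the mean squared residual gives $1/\beta_{\mathrm{ML}}$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic sinusoidal data in the spirit of PRML: t = sin(2*pi*x) + Gaussian noise.
beta_true = 25.0                          # noise precision (variance 1/25 = 0.04)
x = rng.uniform(0, 1, 100)
t = np.sin(2 * np.pi * x) + rng.normal(0, 1 / np.sqrt(beta_true), 100)

# Fit a cubic polynomial by least squares (the ML solution for w under Gaussian noise).
w_ml = np.polyfit(x, t, deg=3)
residuals = np.polyval(w_ml, x) - t

# 1/beta_ML = (1/N) * sum_n (y(x_n, w_ML) - t_n)^2
beta_ml = 1.0 / np.mean(residuals ** 2)
print(beta_ml)   # roughly recovers beta_true, up to model mismatch and sampling noise
```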
The same idea can be found in Lecture 3 of [Ref-2], shown below. Please note that a different symbol is used there to represent the target variable. The maximum likelihood estimate (MLE) of the parameters is obtained by setting the derivative of the log-likelihood to zero. The goal is to maximize the likelihood of seeing the training data by modifying the parameters.
Because we now have a probabilistic model, predictions are expressed in terms of the predictive distribution, which gives the probability distribution over $t$ rather than simply a point estimate, and is obtained by substituting the maximum likelihood parameters into (1.60) to give
$$p(t \mid x, \mathbf{w}_{\mathrm{ML}}, \beta_{\mathrm{ML}}) = \mathcal{N}\big(t \mid y(x, \mathbf{w}_{\mathrm{ML}}), \beta_{\mathrm{ML}}^{-1}\big).$$
As a special case, consider linear regression: the [MLE plug-in prediction] [Ref-2], given the training data, for a new input $x$ and known $\sigma^2$, is shown in the following figure.
Although we have included a prior distribution $p(\mathbf{w})$, we are so far still making a point estimate of $\mathbf{w}$, and so this does not yet amount to a Bayesian treatment, which is discussed in the following section.
In a fully Bayesian approach, we should consistently apply the sum and product rules of probability, which requires, as we shall see shortly, that we integrate over (i.e., marginalize over) all values of $\mathbf{w}$. Such marginalizations lie at the heart of Bayesian methods for pattern recognition.
In the curve fitting problem, we are given the training data $\mathbf{x}$ and $\mathbf{t}$, along with a new test point $x$, and our goal is to predict the value of $t$. We therefore wish to evaluate the predictive distribution $p(t \mid x, \mathbf{x}, \mathbf{t})$. Here we shall assume that the parameters $\alpha$ and $\beta$ are fixed and known in advance (in later chapters we shall discuss how such parameters can be inferred from data in a Bayesian setting).
A Bayesian treatment simply corresponds to a consistent application of the sum and product rules of probability, which allow the predictive distribution to be written in the form
$$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\,d\mathbf{w}.$$
Analysis of (1.71): this predictive distribution is itself Gaussian, $p(t \mid x, \mathbf{x}, \mathbf{t}) = \mathcal{N}\big(t \mid m(x), s^2(x)\big)$, whose variance $s^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^{\mathrm T}\mathbf{S}\,\boldsymbol{\phi}(x)$ contains two terms: the first represents the noise on the target variables, while the second arises from the uncertainty in the parameters $\mathbf{w}$ and is a consequence of the Bayesian treatment.
The predictive distribution for the synthetic sinusoidal regression problem is illustrated in Figure 1.17.
[1]: PRML笔记, Notes on Pattern Recognition and Machine Learning, Chapter 01, pp. 4-6. http://www.cvrobot.net/wp-content/uploads/2015/09/PRML%E7%AC%94%E8%AE%B0-Notes-on-Pattern-Recognition-and-Machine-Learning-1.pdf
[2]: Slides of the Machine Learning / Deep Learning course at Oxford University, Nando de Freitas. https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/
CCJ PRML Study Note - Chapter 1.2 : Probability Theory
Original source: http://www.cnblogs.com/glory-of-family/p/5602322.html