
CCJ PRML Study Note - Chapter 1.5 : Decision Theory


Christopher M. Bishop, PRML, Chapter 1: Introduction

1. The three theories needed for PRML:

  • Probability theory: provides us with a consistent mathematical framework for quantifying and manipulating uncertainty.
  • Decision theory: allows us to make optimal decisions in situations involving uncertainty such as those encountered in pattern recognition.
  • Information theory: provides tools for quantifying the information content of probability distributions, which will also prove useful later in the book.

Inference step & Decision step

  • The joint probability distribution $p(\mathbf{x}, t)$ provides a complete summary of the uncertainty associated with these variables. Determination of $p(\mathbf{x}, t)$ from a set of training data is an example of inference and is typically a very difficult problem whose solution forms the subject of much of this book.
  • In a practical application, however, we must often make a specific prediction for the value of $t$, or more generally take a specific action based on our understanding of the values $t$ is likely to take, and this aspect is the subject of decision theory.

2. An example

Problem Description:

Consider, for example, a medical diagnosis problem in which we have taken an X-ray image of a patient, and we wish to determine whether the patient has cancer or not.

  • Representation: choose $t$ to be a binary variable such that $t = 0$ corresponds to class $\mathcal{C}_1$ (cancer present) and $t = 1$ corresponds to class $\mathcal{C}_2$ (cancer absent).
  • Inference Step: the general inference problem then involves determining the joint distribution $p(\mathbf{x}, \mathcal{C}_k)$, or equivalently $p(\mathbf{x}, t)$, which gives us the most complete probabilistic description of the situation.
  • Decision Step: In the end we must decide either to give treatment to the patient or not, and we would like this choice to be optimal in some appropriate sense. This is the decision step, and it is the subject of decision theory to tell us how to make optimal decisions given the appropriate probabilities.

How to predict?

We are interested in the posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$ of the two classes given the X-ray image. Using Bayes' theorem, these probabilities can be expressed in the form

$$p(\mathcal{C}_k|\mathbf{x}) = \frac{p(\mathbf{x}|\mathcal{C}_k)\, p(\mathcal{C}_k)}{p(\mathbf{x})} \quad (1.77)$$

If our aim is to minimize the chance of assigning $\mathbf{x}$ to the wrong class, then intuitively we would choose the class having the higher posterior probability. We now show that this intuition is correct, and we also discuss more general criteria for making decisions.
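As a concrete illustration of this prediction step, here is a minimal sketch with made-up numbers (the prior and the likelihood values are assumptions, not taken from PRML): it computes the two posteriors for a single image via Bayes' theorem and picks the class with the higher posterior.

```python
# Hypothetical cancer-screening numbers: p(C_k) is the prior, p(x|C_k) the likelihood
# of one particular X-ray image x under each class. All values are assumed.
priors = {"cancer": 0.01, "normal": 0.99}        # p(C_k)
likelihoods = {"cancer": 0.60, "normal": 0.05}   # p(x|C_k) for this image

evidence = sum(likelihoods[c] * priors[c] for c in priors)               # p(x)
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}  # Bayes' theorem

print(posteriors)                           # {'cancer': ~0.108, 'normal': ~0.892}
print(max(posteriors, key=posteriors.get))  # minimum-misclassification choice -> 'normal'
```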

 

Our objective may be either of the following:

  • Minimizing the misclassification rate;
  • Minimizing the expected loss;

    Supplementary note: criteria for making decisions [Ref-1]
    1) Minimizing the misclassification rate.
    2) Minimizing the expected loss: the consequences of the two kinds of error may differ. For example, diagnosing a cancer case as cancer-free is far more serious than diagnosing a cancer-free case as cancer, and classifying a legitimate email as spam is worse than letting a spam email through; in such cases it matters more to reduce the former kind of error than the latter, and a loss function is needed to quantify the cost of each kind of error.
    Let $\mathcal{A}$ be the set of all possible decisions and let the decision function $\alpha(\mathbf{x})$ map each observation $\mathbf{x}$ to a decision $\alpha_i \in \mathcal{A}$; then

    $$\alpha(\mathbf{x}) = \arg\min_{\alpha_i \in \mathcal{A}} R(\alpha_i \mid \mathbf{x})$$

where $R(\alpha_i \mid \mathbf{x})$ denotes the conditional risk:
$$R(\alpha_i \mid \mathbf{x}) = \sum_{j} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x})$$
  • $\lambda(\alpha_i \mid \omega_j)$ is the loss (or risk) incurred by taking decision $\alpha_i$ when the true class of the data is $\omega_j$;
  • $P(\omega_j \mid \mathbf{x})$ is the posterior probability of class $\omega_j$ once $\mathbf{x}$ has been observed. The decision function $\alpha(\mathbf{x})$ always maps the observation to the decision with the smallest conditional risk.
  • The total risk of the decision function $\alpha(\mathbf{x})$ is therefore
    $$R = \int R(\alpha(\mathbf{x}) \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x},$$
    i.e. the total risk is the expectation of the conditional risk $R(\alpha(\mathbf{x}) \mid \mathbf{x})$ over the distribution of all possible observations (the feature space).
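As a small worked example of the conditional risk (all numbers assumed, with loss values echoing the cancer loss matrix used later in Section 4): take two classes $\omega_1$ = cancer, $\omega_2$ = normal, two decisions $\alpha_1$ = treat, $\alpha_2$ = do not treat, losses $\lambda(\alpha_1|\omega_1) = 0$, $\lambda(\alpha_1|\omega_2) = 1$, $\lambda(\alpha_2|\omega_1) = 1000$, $\lambda(\alpha_2|\omega_2) = 0$, and posteriors $P(\omega_1|\mathbf{x}) = 0.1$, $P(\omega_2|\mathbf{x}) = 0.9$. Then
$$R(\alpha_1|\mathbf{x}) = 0 \cdot 0.1 + 1 \cdot 0.9 = 0.9, \qquad R(\alpha_2|\mathbf{x}) = 1000 \cdot 0.1 + 0 \cdot 0.9 = 100,$$
so $\alpha(\mathbf{x}) = \alpha_1$: even though "normal" has the higher posterior, the minimum-risk decision is to treat.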

3. Minimizing the misclassification rate

3.1 Decision Regions & Boundaries

Suppose that our goal is simply to make as few misclassifications as possible.
  • Decision regions: to divide the input space into regions $\mathcal{R}_k$ called decision regions, one for each class, such that all points in $\mathcal{R}_k$ are assigned to class $\mathcal{C}_k$;
  • Decision boundaries or decision surfaces: The boundaries between decision regions.

3.2 Two Classes:

Consider the cancer problem for instance. A mistake occurs when an input vector belonging to class $\mathcal{C}_1$ is assigned to class $\mathcal{C}_2$ or vice versa. The probability of this occurring is given by

$$p(\text{mistake}) = p(\mathbf{x} \in \mathcal{R}_1, \mathcal{C}_2) + p(\mathbf{x} \in \mathcal{R}_2, \mathcal{C}_1) = \int_{\mathcal{R}_1} p(\mathbf{x}, \mathcal{C}_2)\, d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, \mathcal{C}_1)\, d\mathbf{x} \quad (1.78)$$

Clearly, to minimize $p(\text{mistake})$ we should arrange that each $\mathbf{x}$ is assigned to whichever class has the smaller value of the integrand in (1.78). This result is illustrated for two classes, and a single input variable $x$, in Figure 1.24.

 

3.3 K Classes:

For the more general case of $K$ classes, it is slightly easier to maximize the probability of being correct, which is given by $$p(\text{correct}) = \sum_{k=1}^{K} p(\mathbf{x} \in \mathcal{R}_k, \mathcal{C}_k) = \sum_{k=1}^{K} \int_{\mathcal{R}_k} p(\mathbf{x}, \mathcal{C}_k)\, d\mathbf{x} \quad (1.79)$$ which is maximized when the regions $\mathcal{R}_k$ are chosen such that each $\mathbf{x}$ is assigned to the class for which the joint probability $p(\mathbf{x}, \mathcal{C}_k)$, or equivalently the posterior probability $p(\mathcal{C}_k|\mathbf{x})$, is largest.
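A minimal numerical sketch of how the decision regions $\mathcal{R}_k$ arise (the three one-dimensional Gaussian class-conditional densities below are assumptions for illustration, not PRML's example): each point $x$ on a grid is assigned to the class with the largest joint probability $p(x, \mathcal{C}_k)$, and the decision boundaries are where that assignment changes.

```python
import numpy as np
from scipy.stats import norm

# Assumed 1-D example: three classes with Gaussian class-conditional densities p(x|C_k)
priors = np.array([0.3, 0.5, 0.2])                        # p(C_k)
means, stds = np.array([-2.0, 0.0, 3.0]), np.array([1.0, 0.8, 1.5])

x = np.linspace(-6.0, 8.0, 1000)
joint = np.stack([priors[k] * norm.pdf(x, means[k], stds[k]) for k in range(3)])  # p(x, C_k)

assignment = joint.argmax(axis=0)              # class chosen at each grid point -> regions R_k
boundaries = x[1:][np.diff(assignment) != 0]   # points where the chosen class changes
print(boundaries)
```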

4. Minimizing the expected loss

4.1 Problem Description:

For many applications, our objective will be more complex than simply minimizing the number of misclassifications. Let us consider again the medical diagnosis problem. We note that, if a patient who does not have cancer is incorrectly diagnosed as having cancer, the consequences may be some patient distress plus the need for further investigations. Conversely, if a patient with cancer is diagnosed as healthy, the result may be premature death due to lack of treatment. Thus the consequences of these two types of mistake can be dramatically different. It would clearly be better to make fewer mistakes of the second kind, even if this was at the expense of making more mistakes of the first kind.

4.2 How to solve the problem? Introduce a loss/cost function

  • loss function or cost function: a single, overall measure of the loss incurred in taking any of the available decisions or actions; we aim to minimize its value.
  • utility function: an equivalent formulation whose value we aim to maximize.
  • The optimal solution is the one which minimizes the loss function, or equivalently maximizes the utility function.
  • loss matrix: a matrix whose element $L_{kj}$ specifies the loss incurred when an input whose true class is $\mathcal{C}_k$ is assigned to class $\mathcal{C}_j$. For the cancer example (PRML Figure 1.25), with rows indexing the true class (cancer, normal) and columns the assigned class: $$L = \begin{pmatrix} 0 & 1000 \\ 1 & 0 \end{pmatrix}$$

4.3 Optimal Solution

The optimal solution is the one which minimizes the loss function. However, the loss function depends on the true class, which is unknown. For a given input vector $\mathbf{x}$, our uncertainty in the true class is expressed through the joint probability distribution $p(\mathbf{x}, \mathcal{C}_k)$, and so we seek instead to minimize the average loss, where the average is computed with respect to this distribution:

$$\mathbb{E}[L] = \sum_k \sum_j \int_{\mathcal{R}_j} L_{kj}\, p(\mathbf{x}, \mathcal{C}_k)\, d\mathbf{x} \quad (1.80)$$

Equivalently, for each $\mathbf{x}$ we should minimize $\sum_k L_{kj}\, p(\mathbf{x}, \mathcal{C}_k)$ when choosing the optimal regions $\mathcal{R}_j$.

Thus the decision rule that minimizes the expected loss (1.80) is the one that assigns each new $\mathbf{x}$ to the class $j$ for which the quantity

$$\sum_k L_{kj}\, p(\mathcal{C}_k|\mathbf{x}) \quad (1.81)$$

is a minimum. This is clearly trivial to do, once we know the posterior class probabilities $p(\mathcal{C}_k|\mathbf{x})$.
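A minimal sketch of decision rule (1.81), using the hypothetical loss matrix from Section 4.2 (rows index the true class $k$, columns the assigned class $j$):

```python
import numpy as np

# Loss matrix L[k, j]: loss incurred when the true class is k and we assign class j.
# Class 0 = cancer, class 1 = normal; missing a cancer (k=0, j=1) costs 1000.
L = np.array([[0.0, 1000.0],
              [1.0,    0.0]])

def decide(posterior):
    """posterior[k] = p(C_k|x); return the class j minimizing sum_k L[k, j] p(C_k|x)."""
    expected_loss = posterior @ L       # one value per candidate decision j
    return int(expected_loss.argmin()), expected_loss

print(decide(np.array([0.1, 0.9])))        # (0, [0.9, 100.0]): treat as cancer despite p = 0.1
print(decide(np.array([0.0005, 0.9995])))  # decision flips to 'normal' once p(cancer) is tiny enough
```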

5. The reject option

We have seen that classification errors arise from the regions of input space where the largest of the posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$ is significantly less than unity, or equivalently where the joint distributions $p(\mathbf{x}, \mathcal{C}_k)$ have comparable values. These are the regions where we are relatively uncertain about class membership.

  • Reject option: In some applications, it will be appropriate to avoid making decisions on the difficult cases, in anticipation of a lower error rate on those examples for which a classification decision is made. This is known as the reject option. For example, in our hypothetical medical illustration, it may be appropriate to use an automatic system to classify those X-ray images for which there is little doubt as to the correct class, while leaving a human expert to classify the more ambiguous cases.
  • We can achieve this by introducing a threshold $\theta$ and rejecting those inputs $\mathbf{x}$ for which the largest of the posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$ is less than or equal to $\theta$. This is illustrated for the case of two classes, and a single continuous input variable $x$, in Figure 1.26.

  • Note that setting $\theta = 1$ will ensure that all examples are rejected, whereas if there are $K$ classes then setting $\theta < 1/K$ will ensure that no examples are rejected. Thus the fraction of examples that get rejected is controlled by the value of $\theta$. We can easily extend the reject criterion to minimize the expected loss, when a loss matrix is given, taking account of the loss incurred when a reject decision is made.
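A minimal sketch of the reject option, assuming the posterior probabilities for a batch of inputs have already been computed:

```python
import numpy as np

def classify_with_reject(posteriors, theta):
    """posteriors: array of shape (N, K) with p(C_k|x_n); return class index, or -1 for reject."""
    best = posteriors.argmax(axis=1)
    confident = posteriors.max(axis=1) > theta   # reject whenever max_k p(C_k|x) <= theta
    return np.where(confident, best, -1)

p = np.array([[0.95, 0.05],
              [0.55, 0.45],
              [0.20, 0.80]])
print(classify_with_reject(p, theta=0.7))        # [ 0 -1  1]: the ambiguous middle case is rejected
```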

6. Three distinct approaches to solving decision problems

The three distinct approaches are given, in decreasing order of complexity, by:

  • generative models: model the inputs and classes jointly and then use Bayes' theorem to find the posterior class probabilities $p(\mathcal{C}_k|\mathbf{x})$.
  • discriminative models: model the posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$ directly.
  • discriminant functions: learn a function $f(\mathbf{x})$ that maps each input $\mathbf{x}$ directly onto a class label.

6.1 Generative Models

  • 1-1) Likelihood: first solve the inference problem of determining the class-conditional densities $p(\mathbf{x}|\mathcal{C}_k)$ for each class $\mathcal{C}_k$ individually.
  • 1-2) Prior: separately infer the prior class probabilities $p(\mathcal{C}_k)$.
  • 1-3) Posterior: use Bayes' theorem to find the posterior class probabilities $p(\mathcal{C}_k|\mathbf{x})$ in the form $$p(\mathcal{C}_k|\mathbf{x}) = \frac{p(\mathbf{x}|\mathcal{C}_k)\, p(\mathcal{C}_k)}{p(\mathbf{x})} \quad (1.82)$$ where the denominator is obtained from $$p(\mathbf{x}) = \sum_k p(\mathbf{x}|\mathcal{C}_k)\, p(\mathcal{C}_k) \quad (1.83)$$
  • 2) Equivalently, we can model the joint distribution $p(\mathbf{x}, \mathcal{C}_k)$ directly and then normalize to obtain the posterior probabilities.
  • 3) Decision stage: having found the posterior probabilities, use decision theory to determine class membership for each new input $\mathbf{x}$.

Why called generative models?

Approaches that explicitly or implicitly model the distribution of inputs as well as outputs are known as generative models, because by sampling from them it is possible to generate synthetic data points in the input space.

Pros and Cons:

  • (-) The generative approach is the most demanding because it involves finding the joint distribution over both $\mathbf{x}$ and $\mathcal{C}_k$. For many applications, $\mathbf{x}$ will have high dimensionality, and consequently we may need a large training set in order to determine the class-conditional densities to reasonable accuracy.
  • (+) It allows the marginal density of the data $p(\mathbf{x})$ to be determined from (1.83). This can be useful for outlier detection or novelty detection.
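A minimal generative-model sketch under assumed one-dimensional Gaussian class-conditional densities (the data and distributional form are assumptions for illustration): it estimates $p(\mathbf{x}|\mathcal{C}_k)$ and $p(\mathcal{C}_k)$ from labelled data, obtains posteriors via Bayes' theorem, and also returns the marginal $p(\mathbf{x})$ from (1.83), whose small values can flag possible outliers.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Assumed training data: class 0 ~ N(-2, 1), class 1 ~ N(3, 1.5)
x0, x1 = rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.5, 1500)

# Inference step: estimate priors p(C_k) and class-conditional densities p(x|C_k)
prior = np.array([len(x0), len(x1)], dtype=float) / (len(x0) + len(x1))
mu = np.array([x0.mean(), x1.mean()])
sigma = np.array([x0.std(), x1.std()])

def posterior_and_evidence(x):
    lik = np.array([norm.pdf(x, mu[k], sigma[k]) for k in range(2)])   # p(x|C_k)
    evidence = (lik * prior[:, None]).sum(axis=0)                      # p(x), Eq. (1.83)
    return lik * prior[:, None] / evidence, evidence

post, p_x = posterior_and_evidence(np.array([-2.0, 0.5, 10.0]))
print(post.argmax(axis=0))   # decision step: most probable class for each input
print(p_x)                   # p(x) is tiny at x = 10.0, flagging it as a possible outlier
```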

6.2 Discriminative Models:

  • First solve the inference problem of determining the posterior class probabilities $p(\mathcal{C}_k|\mathbf{x})$,
  • and then subsequently use decision theory to assign each new $\mathbf{x}$ to one of the classes.

Approaches that model the posterior probabilities directly are called discriminative models.
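A minimal discriminative sketch on synthetic data, using scikit-learn's logistic regression (the library choice and the data are assumptions of this note) to model $p(\mathcal{C}_k|\mathbf{x})$ directly, after which decision theory is applied to the predicted posteriors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (200, 1)), rng.normal(2.0, 1.0, (200, 1))])
y = np.array([0] * 200 + [1] * 200)

clf = LogisticRegression().fit(X, y)       # inference stage: learn p(C_k|x) directly
posteriors = clf.predict_proba([[0.3]])    # decision stage works with these posteriors
print(posteriors, posteriors.argmax(axis=1))
```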

The classification problem is usually broken down into two separate stages, as in approaches (6.1) and (6.2):

  • inference stage: to use training data to learn a model for $p(\mathcal{C}_k|\mathbf{x})$.
  • decision stage: to use these posterior probabilities to make optimal class assignments.

However, (6.3) provides us with a different approach, which combines the inference and decision stages into a single learning problem.

6.3 Discriminant Functions:

Discriminant functions solve both the inference problem and the decision problem together and simply learn a function that maps inputs $\mathbf{x}$ directly onto decisions.

In other words, a discriminant function collapses inference and decision into a single step [Ref-1].

Disadvantage: we no longer have access to the posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$.
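For contrast, a minimal discriminant-function sketch (an illustration, not PRML's example): a linear function $f(x) = wx + b$ is fit by least squares to $\pm 1$ targets and maps a new input straight to a class label, with no posterior probabilities available at any point.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (200, 1)), rng.normal(2.0, 1.0, (200, 1))])
t = np.array([-1.0] * 200 + [1.0] * 200)              # targets -1 for C_1, +1 for C_2

# Least-squares fit of f(x) = w*x + b: inference and decision rolled into one function
A = np.hstack([X, np.ones((len(X), 1))])
w, b = np.linalg.lstsq(A, t, rcond=None)[0]

x_new = 0.3
print("C_2" if w * x_new + b > 0 else "C_1")          # a decision, but no p(C_k|x) anywhere
```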

6.4 The Relative Merits of These Three Alternatives [Ref-1]

  • Drawback of generative models: if the goal is only to make classification decisions, computing the full joint distribution $p(\mathbf{x}, \mathcal{C}_k)$ is wasteful of computational resources and excessively demanding of data. In general, having the posterior probabilities, i.e. a discriminative model, is sufficient.
  • Drawback of discriminant functions: this approach does not compute the posterior probabilities, but there are many powerful reasons for wanting to compute the posterior probabilities:
    • (1) Minimizing risk: when the elements of the loss matrix may change over time (for example in a financial application), and the posterior probabilities are already available, the minimum-risk decision problem is solved simply by modifying (1.81) appropriately; with a discriminant function, which has no posteriors, any change in the loss matrix means relearning the whole model from scratch.
    • (2) Compensating for class priors: when the classes are highly imbalanced (e.g. diagnosing cancer from X-ray images: since cancer is rare, perhaps 99.9% of the examples are cancer-free), obtaining a good classifier typically requires constructing an artificially balanced data set for training. Having trained on such data to obtain a posterior probability $p(\mathcal{C}_k|\mathbf{x})$, we then need to compensate for the effects of the modification to the training data: take the posterior obtained from the artificially balanced data set, divide by the class fractions (priors) of the balanced data set, multiply by the class fractions in the population (the real data), and renormalize so that the new posteriors sum to one; this yields the posterior probabilities for the real data (see the sketch after this list). A discriminant function, having no posteriors, cannot use this procedure to cope with class imbalance.
    • (3) Combining models (model fusion): for complex applications, a problem may be decomposed into a number of smaller subproblems, each of which can be tackled by a separate module. In medical diagnosis, for example, besides the X-ray image data $\mathbf{x}_I$ we may also have blood test data $\mathbf{x}_B$. Rather than combining this heterogeneous information into one input, it is often more effective to build one system to interpret the X-ray images and a different one to interpret the blood data. That is, we assume conditional independence given the class $\mathcal{C}_k$ (the naive Bayes model) $$p(\mathbf{x}_I, \mathbf{x}_B|\mathcal{C}_k) = p(\mathbf{x}_I|\mathcal{C}_k)\, p(\mathbf{x}_B|\mathcal{C}_k) \quad (1.84)$$ The posterior probability, given both the X-ray and blood data, is then $$p(\mathcal{C}_k|\mathbf{x}_I, \mathbf{x}_B) \propto \frac{p(\mathcal{C}_k|\mathbf{x}_I)\, p(\mathcal{C}_k|\mathbf{x}_B)}{p(\mathcal{C}_k)} \quad (1.85)$$
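A minimal sketch of the prior-compensation step in point (2) above, assuming the posteriors came from a model trained on an artificially balanced data set (all numbers are made up):

```python
import numpy as np

def compensate_priors(post_balanced, prior_balanced, prior_population):
    """Rescale posteriors learned on a balanced training set to the true population priors.

    post_balanced: (N, K) array of p_balanced(C_k|x); the prior arguments have length K.
    """
    unnormalized = post_balanced / prior_balanced * prior_population
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)

# Assumed numbers: the model saw 50/50 cancer vs. normal, but cancer is rare in the population.
post = np.array([[0.70, 0.30]])     # [p(cancer|x), p(normal|x)] from the balanced model
print(compensate_priors(post, np.array([0.5, 0.5]), np.array([0.001, 0.999])))
# -> roughly [[0.0023, 0.9977]]: under the true priors, cancer is far less probable
```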

7. Loss functions for regression

So far, we have discussed decision theory in the context of classification problems. We now turn to the case of regression problems, such as the curve fitting example discussed earlier.

  • Decision stage for regression: the decision stage consists of choosing a specific estimate $y(\mathbf{x})$ of the value of $t$ for each input $\mathbf{x}$. Suppose that in doing so we incur a loss $L(t, y(\mathbf{x}))$. The average, or expected, loss is then given by $$\mathbb{E}[L] = \iint L(t, y(\mathbf{x}))\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt \quad (1.86)$$

  • The squared loss: $L(t, y(\mathbf{x})) = \{y(\mathbf{x}) - t\}^2$, which, substituted into (1.86), gives $$\mathbb{E}[L] = \iint \{y(\mathbf{x}) - t\}^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt \quad (1.87)$$

  • Solution: assuming a completely flexible function $y(\mathbf{x})$, we can minimize (1.87) formally using the calculus of variations to give $$\frac{\delta \mathbb{E}[L]}{\delta y(\mathbf{x})} = 2 \int \{y(\mathbf{x}) - t\}\, p(\mathbf{x}, t)\, dt = 0 \quad (1.88)$$ Solving for $y(\mathbf{x})$, and using the sum and product rules of probability, we obtain $$y(\mathbf{x}) = \frac{\int t\, p(\mathbf{x}, t)\, dt}{p(\mathbf{x})} = \int t\, p(t|\mathbf{x})\, dt = \mathbb{E}_t[t|\mathbf{x}] \quad (1.89)$$ which is the conditional average of $t$ conditioned on $\mathbf{x}$ and is known as the regression function. This result is illustrated in Figure 1.28.

  • we can identify three distinct approaches to solving regression problems given, in order of decreasing complexity, by:

    • (a) First solve the inference problem of determining the joint density $p(\mathbf{x}, t)$. Then normalize to find the conditional density $p(t|\mathbf{x})$, and finally marginalize to find the conditional mean given by (1.89).
    • (b) First solve the inference problem of determining the conditional density $p(t|\mathbf{x})$, and then subsequently marginalize to find the conditional mean given by (1.89).
    • (c) Find a regression function $y(\mathbf{x})$ directly from the training data.
  • The relative merits of these three approaches follow the same lines as for classification problems above.
  • Minkowski loss: a simple generalization of the squared loss, called the Minkowski loss, whose expectation is given by $$\mathbb{E}[L_q] = \iint |y(\mathbf{x}) - t|^q\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt \quad (1.91)$$ which reduces to the expected squared loss for $q = 2$ (see the numerical sketch after this list).
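A small Monte-Carlo sketch checking these results numerically (the skewed noise distribution is an assumption, chosen so that the mean and median differ): among candidate constant predictions for $t$ at a fixed $\mathbf{x}$, the sample mean minimizes the average squared loss ($q = 2$), while the sample median minimizes the average $q = 1$ Minkowski loss.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.exponential(2.0, 100_000)     # samples of t|x at one fixed x; skewed, so mean != median

candidates = np.linspace(0.0, 5.0, 2001)
sq_loss = [np.mean((y - t) ** 2) for y in candidates]    # expected loss for q = 2
abs_loss = [np.mean(np.abs(y - t)) for y in candidates]  # expected Minkowski loss for q = 1

print(candidates[np.argmin(sq_loss)], t.mean())       # q = 2 minimizer ~ conditional mean
print(candidates[np.argmin(abs_loss)], np.median(t))  # q = 1 minimizer ~ conditional median
```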

References:
[Ref-1] Page 6 of the PRML notes.
[Ref-2] Pages 7–8 of the PRML notes.

 

Original post: http://www.cnblogs.com/glory-of-family/p/5602319.html
