Chapter 1.5 : Decision Theory
Christopher M. Bishop, PRML, Chapter 1 Introduction
Inference step & Decision step
Consider, for example, a medical diagnosis problem in which we have taken an X-ray image of a patient, and we wish to determine whether the patient has cancer or not.
Using Bayes' theorem, these probabilities can be expressed in the form
$$p(\mathcal{C}_k|\mathbf{x}) = \frac{p(\mathbf{x}|\mathcal{C}_k)\,p(\mathcal{C}_k)}{p(\mathbf{x})}.$$
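A minimal sketch (not from the book) of this computation for the two-class cancer example; the prior and likelihood numbers below are invented purely for illustration:

```python
# Posterior class probabilities via Bayes' theorem for a two-class example.
# All numbers are made up for illustration.

prior = {"cancer": 0.01, "no_cancer": 0.99}      # p(C_k)
likelihood = {"cancer": 0.9, "no_cancer": 0.1}   # p(x = "positive X-ray" | C_k)

# Evidence: p(x) = sum_k p(x | C_k) p(C_k)
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Posterior: p(C_k | x) = p(x | C_k) p(C_k) / p(x)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)   # roughly {'cancer': 0.083, 'no_cancer': 0.917}
```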
Our objective may vary from one application to another; for example:
Supplement: criteria for making decisions [Ref-1]
1) Minimizing the misclassification rate.
2) Minimizing the expected loss: the two kinds of error can have very different consequences. For example, diagnosing a cancer patient as cancer-free is more serious than diagnosing a healthy patient as having cancer, and marking a legitimate e-mail as spam is more serious than letting a spam e-mail through; in such cases it matters more to reduce the former kind of error than the latter. A loss function is therefore introduced to quantify the cost of each kind of error.
Let $\mathcal{D}$ be the set of all possible decisions, and let the decision function $\delta(\mathbf{x})$ map each observation $\mathbf{x}$ to a decision in $\mathcal{D}$; the expected loss is then
$$\mathbb{E}[L] = \sum_k \int L\big(\mathcal{C}_k, \delta(\mathbf{x})\big)\, p(\mathbf{x}, \mathcal{C}_k)\,\mathrm{d}\mathbf{x}.$$
Consider the cancer problem for instance. A mistake occurs when an input vector belonging to class $\mathcal{C}_1$ is assigned to class $\mathcal{C}_2$, or vice versa. The probability of this occurring is given by
$$p(\text{mistake}) = p(\mathbf{x}\in\mathcal{R}_1, \mathcal{C}_2) + p(\mathbf{x}\in\mathcal{R}_2, \mathcal{C}_1) = \int_{\mathcal{R}_1} p(\mathbf{x}, \mathcal{C}_2)\,\mathrm{d}\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, \mathcal{C}_1)\,\mathrm{d}\mathbf{x},$$
where $\mathcal{R}_k$ denotes the decision region within which all points are assigned to class $\mathcal{C}_k$.
For the more general case of $K$ classes, it is slightly easier to maximize the probability of being correct, which is given by
$$p(\text{correct}) = \sum_{k=1}^{K} p(\mathbf{x}\in\mathcal{R}_k, \mathcal{C}_k) = \sum_{k=1}^{K} \int_{\mathcal{R}_k} p(\mathbf{x}, \mathcal{C}_k)\,\mathrm{d}\mathbf{x},$$
which is maximized when the regions $\mathcal{R}_k$ are chosen such that each $\mathbf{x}$ is assigned to the class for which the joint probability $p(\mathbf{x}, \mathcal{C}_k)$, or equivalently the posterior probability $p(\mathcal{C}_k|\mathbf{x})$, is largest.
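As a quick illustration, the following sketch applies the minimum-misclassification-rate rule, assuming the posterior probabilities have already been computed (the values below are invented, one row per input point):

```python
import numpy as np

# Minimum-misclassification-rate rule: assign each x to the class with the
# largest posterior p(C_k | x).  Illustrative 3-class posteriors, one row per x.
posteriors = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.1, 0.1, 0.8],
])

decisions = posteriors.argmax(axis=1)   # class index with the largest posterior
p_correct = posteriors.max(axis=1)      # p(correct | x) for each point
print(decisions, p_correct.mean())      # average probability of being correct over these points
```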
For many applications, our objective will be more complex than simply minimizing the number of misclassifications. Let us consider again the medical diagnosis problem. We note that, if a patient who does not have cancer is incorrectly diagnosed as having cancer, the consequences may be some patient distress plus the need for further investigations. Conversely, if a patient with cancer is diagnosed as healthy, the result may be premature death due to lack of treatment. Thus the consequences of these two types of mistake can be dramatically different. It would clearly be better to make fewer mistakes of the second kind, even if this was at the expense of making more mistakes of the first kind.
We can formalize such considerations through a loss matrix $L$, whose element $L_{kj}$ is the loss incurred when the true class is $\mathcal{C}_k$ and we assign $\mathbf{x}$ to class $\mathcal{C}_j$. The optimal solution is the one which minimizes the loss function. However, the loss function depends on the true class, which is unknown. For a given input vector $\mathbf{x}$, our uncertainty in the true class is expressed through the joint probability distribution $p(\mathbf{x}, \mathcal{C}_k)$, and so we seek instead to minimize the average loss, where the average is computed with respect to this distribution, which is given by
$$\mathbb{E}[L] = \sum_k \sum_j \int_{\mathcal{R}_j} L_{kj}\, p(\mathbf{x}, \mathcal{C}_k)\,\mathrm{d}\mathbf{x}. \tag{1.80}$$
Equivalently, for each $\mathbf{x}$ we should minimize $\sum_k L_{kj}\, p(\mathbf{x}, \mathcal{C}_k)$ in order to choose the corresponding optimal region $\mathcal{R}_j$.
Thus the decision rule that minimizes the expected loss (1.80) is the one that assigns each new $\mathbf{x}$ to the class $\mathcal{C}_j$ for which the quantity
$$\sum_k L_{kj}\, p(\mathcal{C}_k|\mathbf{x}) \tag{1.81}$$
is a minimum. This is clearly trivial to do, once we know the posterior class probabilities $p(\mathcal{C}_k|\mathbf{x})$.
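A small sketch of this rule for the cancer example, using an invented asymmetric loss matrix and invented posteriors; it shows that the minimum-expected-loss decision can differ from the maximum-posterior decision:

```python
import numpy as np

# Minimum expected loss with an asymmetric loss matrix (illustrative numbers).
# Rows index the true class C_k, columns the decision C_j, as in (1.80).
# Class 0 = cancer, class 1 = normal.
L = np.array([[0.0, 1000.0],   # true cancer:  correct = 0, missed cancer = 1000
              [1.0,    0.0]])  # true normal:  false alarm = 1, correct = 0

posterior = np.array([0.3, 0.7])   # assumed p(C_k | x) for some input x

# Expected loss of deciding class j: sum_k L[k, j] * p(C_k | x)
expected_loss = posterior @ L
decision = expected_loss.argmin()
print(expected_loss, decision)     # decides "cancer" here despite its lower posterior
```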
We have seen that classification errors arise from the regions of input space where the largest of the posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$ is significantly less than unity (i.e., well below 1), or equivalently where the joint distributions $p(\mathbf{x}, \mathcal{C}_k)$ have comparable values. These are the regions where we are relatively uncertain about class membership. In some applications, it is appropriate to avoid making decisions on such difficult cases; this is known as the reject option.
We can achieve this by introducing a threshold $\theta$ and rejecting those inputs $\mathbf{x}$ for which the largest of the posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$ is less than or equal to $\theta$. This is illustrated for the case of two classes, and a single continuous input variable $x$, in Figure 1.26.
Note that setting $\theta = 1$ will ensure that all examples are rejected, whereas if there are $K$ classes then setting $\theta < 1/K$ will ensure that no examples are rejected. Thus the fraction of examples that get rejected is controlled by the value of $\theta$. We can easily extend the reject criterion to minimize the expected loss, when a loss matrix is given, taking account of the loss incurred when a reject decision is made.
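A sketch of the basic reject option under these assumptions (illustrative posteriors and an illustrative threshold):

```python
import numpy as np

# Reject option: inputs whose largest posterior p(C_k | x) does not exceed the
# threshold theta are rejected (e.g. passed to a human expert).
theta = 0.8
posteriors = np.array([
    [0.95, 0.05],
    [0.60, 0.40],
    [0.75, 0.25],
])

best = posteriors.max(axis=1)
decisions = np.where(best > theta, posteriors.argmax(axis=1), -1)   # -1 = reject
print(decisions)   # [0, -1, -1]: only the confident case is classified
# theta = 1 -> everything rejected;  theta < 1/K (here 0.5) -> nothing rejected
```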
The three distinct approaches to solving decision problems are given, in decreasing order of complexity, by:
- (a) First solve the inference problem of determining the class-conditional densities $p(\mathbf{x}|\mathcal{C}_k)$ and the prior class probabilities $p(\mathcal{C}_k)$ (or, equivalently, model the joint distribution $p(\mathbf{x}, \mathcal{C}_k)$ directly), then use Bayes' theorem to find the posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$, and finally use decision theory to assign each new $\mathbf{x}$ to a class.
- (b) First solve the inference problem of determining the posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$ directly, and then use decision theory to assign each new $\mathbf{x}$ to a class.
- (c) Find a discriminant function $f(\mathbf{x})$ that maps each input $\mathbf{x}$ directly onto a class label, so that probabilities play no role.
Approaches that explicitly or implicitly model the distribution of inputs as well as outputs are known as generative models, because by sampling from them it is possible to generate synthetic data points in the input space.
Approaches that model the posterior probabilities directly are called discriminative models.
The classification problem is usually broken down into two separate stages, as in (6.1) and (6.2):
- inference stage: use the training data to learn a model for the posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$.
- decision stage: use these posterior probabilities to make optimal class assignments.
However, (6.3) provides us with a different approach, which combines the inference and decision stages into a single learning problem.
Discriminant functions solve the inference and decision problems together, simply learning a function that maps inputs directly onto decisions.
Thus a discriminant function solves inference and decision in a single step [Ref-1].
Disadvantage: we no longer have access to the posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$.
- Disadvantage of generative models: if all we want is to make classification decisions, computing the full joint distribution $p(\mathbf{x}, \mathcal{C}_k)$ is wasteful of computational resources and excessively demanding of data. In general the posterior probabilities alone, i.e., a discriminative model, are sufficient.
- Disadvantage of discriminant functions: this approach does not compute the posterior probabilities at all, yet there are many powerful reasons for wanting to compute the posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$:
- (1) Minimizing risk: when the elements of the loss matrix may change over time (for example, in a financial application), a system that has already computed the posterior probabilities can solve the minimum-risk decision problem simply by modifying (1.81) appropriately; with a discriminant function, which provides no posteriors, any change to the loss matrix requires relearning the model from scratch.
- (2) Compensating for class priors: when the classes are highly unbalanced (for example, in diagnosing cancer from X-ray images, cancer is rare, so perhaps 99.9% of the examples are cancer-free), a good classifier is usually trained on an artificially balanced data set. The posterior probabilities obtained from this balanced set must then be compensated for the modification to the training data: take the posterior $p(\mathcal{C}_k|\mathbf{x})$ obtained from the artificially balanced data set, divide by the class priors of the balanced set, multiply by the class priors of the real data (i.e., in the population), and renormalize (see the sketch after this list). A discriminant function, which provides no posteriors, cannot handle class imbalance in this way.
- (3) Combining models: for a complex application, a problem may be decomposed into a number of smaller subproblems, each of which can be tackled by a separate module. In medical diagnosis, for example, we may have blood-test data $\mathbf{x}_B$ in addition to the X-ray images $\mathbf{x}_I$. Rather than combining all of this heterogeneous information into one huge input space, it is more effective to build one system to interpret the X-ray images and a different one to interpret the blood data. If we assume conditional independence given the class, in the form $p(\mathbf{x}_I, \mathbf{x}_B|\mathcal{C}_k) = p(\mathbf{x}_I|\mathcal{C}_k)\,p(\mathbf{x}_B|\mathcal{C}_k)$ (the naive Bayes model), the posterior probability, given both the X-ray and blood data, is then given by
$$p(\mathcal{C}_k|\mathbf{x}_I, \mathbf{x}_B) \propto p(\mathbf{x}_I, \mathbf{x}_B|\mathcal{C}_k)\, p(\mathcal{C}_k) \propto \frac{p(\mathcal{C}_k|\mathbf{x}_I)\, p(\mathcal{C}_k|\mathbf{x}_B)}{p(\mathcal{C}_k)}.$$
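The following sketch illustrates points (2) and (3) numerically. All probabilities are invented, and the variable names (`post_bal`, `post_xray`, etc.) are mine rather than the book's:

```python
import numpy as np

# (2) Compensating for class priors: divide the posterior learned on the
#     artificially balanced set by the balanced priors, multiply by the
#     population priors, and renormalize.
post_bal   = np.array([0.6, 0.4])       # p(C_k | x) from the balanced training set
prior_bal  = np.array([0.5, 0.5])       # class priors in the balanced set
prior_true = np.array([0.001, 0.999])   # class priors in the population
post_true = post_bal / prior_bal * prior_true
post_true /= post_true.sum()
print(post_true)

# (3) Combining models (naive Bayes): with conditional independence of x_I and
#     x_B given the class, p(C_k | x_I, x_B) ~ p(C_k | x_I) p(C_k | x_B) / p(C_k).
post_xray  = np.array([0.7, 0.3])       # p(C_k | x_I)
post_blood = np.array([0.6, 0.4])       # p(C_k | x_B)
prior      = np.array([0.5, 0.5])       # p(C_k)
combined = post_xray * post_blood / prior
combined /= combined.sum()
print(combined)
```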
So far, we have discussed decision theory in the context of classification problems. We now turn to the case of regression problems, such as the curve fitting example discussed earlier.
Decision stage for regression: the decision stage consists of choosing a specific estimate $y(\mathbf{x})$ of the value of $t$ for each input $\mathbf{x}$. Suppose that in doing so, we incur a loss $L(t, y(\mathbf{x}))$. The average, or expected, loss is then given by
$$\mathbb{E}[L] = \iint L\big(t, y(\mathbf{x})\big)\, p(\mathbf{x}, t)\,\mathrm{d}\mathbf{x}\,\mathrm{d}t. \tag{1.86}$$
A common choice is the squared loss, $L(t, y(\mathbf{x})) = \{y(\mathbf{x}) - t\}^2$, which, substituted into (1.86), gives
$$\mathbb{E}[L] = \iint \{y(\mathbf{x}) - t\}^2\, p(\mathbf{x}, t)\,\mathrm{d}\mathbf{x}\,\mathrm{d}t. \tag{1.87}$$
Solution: if we assume a completely flexible function $y(\mathbf{x})$, we can do this formally using the calculus of variations to give
$$\frac{\delta \mathbb{E}[L]}{\delta y(\mathbf{x})} = 2\int \{y(\mathbf{x}) - t\}\, p(\mathbf{x}, t)\,\mathrm{d}t = 0. \tag{1.88}$$
Solving for $y(\mathbf{x})$, and using the sum and product rules of probability, we obtain
$$y(\mathbf{x}) = \frac{\int t\, p(\mathbf{x}, t)\,\mathrm{d}t}{p(\mathbf{x})} = \int t\, p(t|\mathbf{x})\,\mathrm{d}t = \mathbb{E}_t[t|\mathbf{x}], \tag{1.89}$$
which is the conditional average of $t$ conditioned on $\mathbf{x}$ and is known as the regression function. This result is illustrated in Figure 1.28.
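As a sanity check of this result, the following sketch uses an invented generative model for $t$ at a fixed input $x_0$ and confirms empirically that the expected squared loss is minimized by the conditional mean:

```python
import numpy as np

# With squared loss, the optimal prediction at x is the conditional mean E[t | x].
# Synthetic model (made up): t = sin(x) + Gaussian noise, evaluated at x0.
rng = np.random.default_rng(0)
x0 = 1.0
t = np.sin(x0) + 0.3 * rng.standard_normal(100_000)   # samples of t | x = x0

candidates = np.linspace(0.0, 2.0, 201)                # candidate predictions y(x0)
expected_sq_loss = [((y - t) ** 2).mean() for y in candidates]
best = candidates[int(np.argmin(expected_sq_loss))]
print(best, t.mean())   # both close to sin(1.0) ~= 0.841, the conditional mean
```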
We can identify three distinct approaches to solving regression problems, given, in order of decreasing complexity, by:
- (a) First solve the inference problem of determining the joint density $p(\mathbf{x}, t)$, then normalize to find the conditional density $p(t|\mathbf{x})$, and finally marginalize to find the conditional mean $\mathbb{E}_t[t|\mathbf{x}]$.
- (b) First solve the inference problem of determining the conditional density $p(t|\mathbf{x})$, and then marginalize to find the conditional mean $\mathbb{E}_t[t|\mathbf{x}]$.
- (c) Find a regression function $y(\mathbf{x})$ directly from the training data.
[Ref-1] Page 6 of PRML notes;
[Ref-2] Pages 7-8 of PRML notes;
Original post: http://www.cnblogs.com/glory-of-family/p/5602319.html