
Reading Notes for Statistical Learning Theory


Let's continue the discussion of Vapnik's book Statistical Learning Theory. At the very beginning of the book, Vapnik describes two fundamental approaches to pattern recognition: the parametric estimation approach and the non-parametric estimation approach. Before introducing the non-parametric approach, to which the support vector machine belongs, Vapnik first addresses the following three beliefs on which the philosophy of the parametric approach stands (pages 4 to 5):

  1. A function defined by a limited number of parameters provides a good approximation to the desired function;
  2. The normal (Gaussian) law governs most real-life problems;
  3. The maximum likelihood method is a good tool for estimating parameters.

In my opinion, the first condition should be required by all machine learning approaches, no matter whether they are parametric or non-parametric. The second point is based on the central limit theorem, which states that the sum (or mean) of a large set of independent random variables approximately follows a Gaussian distribution. If we first pre-process the dataset and normalise the data points so that the mean is centred at the origin (and the variance is scaled to one), the Gaussian distribution becomes the standard normal distribution, which is why the term "normal law" is used. In my opinion, it may be better to simply highlight the assumption of independence, since that assumption is more fundamental; the Gaussian distribution is only a special consequence of it. Regarding the third point, the statement seems a little too absolute, as maximum likelihood estimation is not the only estimation method available.
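As an illustrative aside of my own (not from the book), here is a minimal Python sketch of that argument: the standardised mean of many independent uniform random variables behaves approximately like a standard normal variable.

```python
# Illustrative sketch (not from the book): the central limit theorem in action.
# The mean of many independent random variables -- here uniform on [0, 1] --
# approaches a Gaussian distribution as the number of summands grows.
import numpy as np

rng = np.random.default_rng(0)
n_summands, n_trials = 100, 10000

# Each row is one trial: n_summands independent uniform draws.
samples = rng.uniform(0.0, 1.0, size=(n_trials, n_summands))
means = samples.mean(axis=1)

# Uniform(0, 1) has mean 1/2 and variance 1/12, so the mean of n draws
# has variance 1/(12 n); standardise accordingly.
standardised = (means - 0.5) / np.sqrt(1.0 / 12.0 / n_summands)

# If the CLT holds, these statistics should be close to those of N(0, 1).
print("mean ~ 0:", standardised.mean())
print("std  ~ 1:", standardised.std())
print("skew ~ 0:", ((standardised - standardised.mean()) ** 3).mean())
```

Running it should print a mean near 0, a standard deviation near 1, and a skewness near 0, in line with the Gaussian limit.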

Although we see some limitations in these statements, as we keep reading the book it becomes clear that the author simply wanted to use these assumptions, which many methods follow, to highlight the limitations of parametric approaches. Experts in the parametric learning approach may well be able to argue these points, as such debates are common in the academic world.

Vapnik then introduces the Perceptron algorithm of 1958 and the empirical risk minimisation (ERM) criterion used in machine learning. It is of interest to note that ERM measures the error on the training samples, while the real problem of machine learning is to estimate the unobserved behaviour on the test dataset. There is also the problem of overfitting, which occurs when the number of training samples is too small: the model fits the training samples but lacks generalisation, achieving poor performance on the test dataset.
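To fix notation (my own paraphrase of the standard definitions rather than a quotation from the book): for a loss function $L$, a parametrised family of functions $f(x, \alpha)$, and $\ell$ training samples $(x_i, y_i)$, the expected risk and the empirical risk are

$$
R(\alpha) = \int L\big(y, f(x,\alpha)\big)\, dP(x,y),
\qquad
R_{\mathrm{emp}}(\alpha) = \frac{1}{\ell}\sum_{i=1}^{\ell} L\big(y_i, f(x_i,\alpha)\big).
$$

ERM picks the $\alpha$ that minimises $R_{\mathrm{emp}}(\alpha)$, whereas what we actually care about is keeping $R(\alpha)$ small on data drawn from the unknown distribution $P(x, y)$.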

Exactly as I expected, the next problem the author addresses is the generalisation ability of the algorithm, and here the very important theory of the VC dimension is introduced. The basic motivation of the VC dimension relates to density estimation. We know that, by the law of large numbers, the relative frequency of an event approaches its true probability as the sample size approaches infinity. However, our training dataset is always finite in reality. This drives the author to construct a more general theory of what can be estimated from a finite training dataset, which leads to the so-called VC dimension. The motivation behind the support vector machine is that, other things being equal, the machine with the lowest VC dimension is the best.
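To make this concrete, the well-known generalisation bound from Vapnik's theory (quoted here from memory for the 0/1 loss, so take the exact constants as indicative) states that, with probability at least $1 - \eta$, every function in a class of VC dimension $h$ satisfies

$$
R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h\left(\ln\frac{2\ell}{h} + 1\right) - \ln\frac{\eta}{4}}{\ell}},
$$

so, for a fixed empirical risk and a fixed sample size $\ell$, a smaller VC dimension $h$ yields a tighter guarantee on the expected risk. This is the sense in which the machine with the lowest VC dimension is preferred.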

The author then presents the main principle of designing a learning machine from a dataset of limited size. The principle is that, for density estimation, we should directly estimate the specific density we need, rather than deriving it by first estimating the more general densities on which that specific density depends. For example, if we can estimate the conditional probability directly, we need not estimate the probability of the condition and the probability of the event under all conditions. More importantly, with limited information such as a small training dataset, we may only be able to estimate the more specific density. On the other hand, the problem we are trying to solve is to predict the class of unobserved samples, which requires the machine to produce a solution more general than its training dataset or any specific test point: the machine should be capable of evaluating every sample in the feature space.
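A small clarifying note of my own on why the direct estimate is the more economical one: the joint density already determines the conditional,

$$
p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(x \mid y)\, p(y)}{p(x)},
$$

so estimating $p(x, y)$, or $p(x \mid y)$ together with $p(y)$, and then dividing means solving a strictly more general problem than estimating $p(y \mid x)$ directly; with a small training set this detour is usually harder, which is exactly the principle of not solving a more general problem as an intermediate step.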


Original post: http://www.cnblogs.com/jingxinxu/p/5372357.html
