Some further statements on KNN:
To generate the graph of the simulated two-class example, we need a method to generate the data. First we generated 10 means m_k from a bivariate Gaussian distribution N((1, 0)^T, I) and labeled this class BLUE. Similarly, 10 more were drawn from N((0, 1)^T, I) and labeled class ORANGE. Then for each class we generated 100 observations as follows: for each observation, we picked an m_k at random with probability 1/10, and then generated a point from N(m_k, I/5), thus leading to a mixture of Gaussian clusters for each class.
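As a rough sketch of this generation procedure (the function and variable names below are my own, not from the original post), one could write:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_class(center, n_means=10, n_obs=100):
    # Draw 10 means m_k from N(center, I).
    means = rng.multivariate_normal(center, np.eye(2), size=n_means)
    # For each observation, pick an m_k uniformly and draw from N(m_k, I/5).
    idx = rng.integers(0, n_means, size=n_obs)
    noise = rng.multivariate_normal(np.zeros(2), np.eye(2) / 5, size=n_obs)
    return means[idx] + noise

X_blue = make_class(np.array([1.0, 0.0]))    # class BLUE
X_orange = make_class(np.array([0.0, 1.0]))  # class ORANGE
X = np.vstack([X_blue, X_orange])
y = np.array([0] * 100 + [1] * 100)          # 0 = BLUE, 1 = ORANGE
```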
Some expansion on KNN:
To improve on linear regression and KNN, we can consider the following approaches:
Expand the basis, which yields a form similar to the SVM (a small sketch follows this list).
Use kernels: a kernel first guarantees that the features can be mapped to a high-dimensional space, and second allows the calculation there to be simplified.
Use projection pursuit and neural network models, which consist of sums of nonlinearly transformed linear models.
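As a hedged illustration of the basis-expansion idea (a minimal sketch with invented toy data, not the original post's method): the model stays linear in its coefficients while the fitted function becomes nonlinear in x.

```python
import numpy as np

def expand_basis(x, degree=3):
    # Column j holds x**j, so the model is still linear in the coefficients
    # beta, but nonlinear in the original input x.
    return np.vstack([x**j for j in range(degree + 1)]).T

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = np.sin(x) + 0.1 * rng.standard_normal(x.shape)

H = expand_basis(x)
# Ordinary least squares on the expanded features h(x) = (1, x, x^2, x^3).
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
y_hat = H @ beta
```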
Statistical Decision Theory:
We seek a function f(X) for predicting Y given values of the input vector X. This theory requires a loss function L(Y, f(X)) for penalizing errors in prediction, and by far the most common and convenient is squared error loss: L(Y, f(X)) = (Y − f(X))^2.
Our aim is to choose f so as to minimize the expected prediction error EPE(f) = E[(Y − f(X))^2]. Conditioning on X, it suffices to minimize pointwise: for a given X = x, we should choose the constant c that is closest on average to the label Y, i.e. f(x) = argmin_c E[(Y − c)^2 | X = x]. This equation gives us the exact c, and the solution is the conditional expectation f(x) = E[Y | X = x], where x is a value of the input (for example, a point in the training set).
To apply the above theory in practice, we can use KNN: for any input x, we estimate this conditional expectation by averaging the responses of its closest k neighbors in the training set. It would seem that with a reasonably large set of training data, we could always approximate the theoretically optimal conditional expectation by k-nearest-neighbor averaging, since the local average of the neighbors' responses should approximate the conditional mean.
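A minimal sketch of this k-nearest-neighbor average (the toy data and the helper name knn_regress are my own assumptions, not from the original post):

```python
import numpy as np

def knn_regress(x0, X_train, y_train, k=10):
    # Approximate f(x0) = E[Y | X = x0] by averaging the responses of the
    # k training points closest to x0.
    dists = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

# Toy data: Y = sin(X1) + noise, just to exercise the estimator.
rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(200, 2))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.standard_normal(200)

print(knn_regress(np.array([0.5, 0.0]), X_train, y_train, k=15))
```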
Local methods in high dimensions:
KNN breaks down in high dimensions, and the phenomenon is commonly referred to as the curse of dimensionality.
Consider
the nearest-neighbor procedure for inputs uniformly distributed in a p-dimensional unit hypercube. Suppose we send out a hypercubical neighborhood about a target point to capture a fraction r of the observations. Since this corresponds to a fraction r of the unit volume (so r is a proportion less than 1), the expected edge length will be e_p(r) = r^(1/p). In ten dimensions e_10(0.01) = 0.63 and e_10(0.1) = 0.80, while the entire range for each input is only 1.0. So to capture 1% or 10% of the data to form a local average, we must cover 63% or 80% of the range of each input variable. Such neighborhoods are no longer “local.” Reducing r dramatically does not help much either, since the fewer observations we average, the higher is the variance of our fit.
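A two-line check of the edge-length formula (my own sketch, not part of the original text):

```python
def edge_length(r, p):
    # Expected edge length needed to capture a fraction r of uniformly
    # distributed data in the p-dimensional unit cube: e_p(r) = r**(1/p).
    return r ** (1.0 / p)

print(edge_length(0.01, 10), edge_length(0.1, 10))  # ~0.63 and ~0.79 (≈ 0.80)
```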
Another consequence of the sparse sampling in high dimensions is that all sample points are close to an edge of the sample. Consider N data points (training samples) uniformly distributed in a p-dimensional unit ball
centered at the origin. Suppose we consider a nearest-neighbor estimate at the origin. The median distance from the origin to the closest data point is given by the expression d(p, N) = (1 − (1/2)^(1/N))^(1/p). A more complicated expression exists for the mean distance to the closest point. For N = 500, p = 10, d(p, N) ≈ 0.52, more than halfway to the boundary. Hence most data points are closer to the boundary of the sample space than
to any other data point. The reason this presents a problem is that prediction is much more difficult near the edges of the training sample. For query points near the center of the training sample it is easy to find enough neighbors on all sides, but for points near the boundary it is not, and we must extrapolate from neighboring sample points rather than interpolate between them.
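As a quick sanity check (my own sketch, not part of the original text), the following simulation compares the median distance to the closest point with the closed-form expression quoted above:

```python
import numpy as np

def median_nearest_distance(p, N, trials=5000, seed=0):
    # Monte Carlo check of d(p, N): for a point uniform in the p-dimensional
    # unit ball, the distance from the origin is R = U**(1/p) with U uniform
    # on (0, 1), so only the radii need to be simulated.
    rng = np.random.default_rng(seed)
    radii = rng.random((trials, N)) ** (1.0 / p)
    return np.median(radii.min(axis=1))

p, N = 10, 500
formula = (1 - 0.5 ** (1.0 / N)) ** (1.0 / p)   # closed form quoted above
print(round(formula, 2), round(median_nearest_distance(p, N), 2))  # both ~0.52
```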
Original article: http://blog.csdn.net/jyl1999xxxx/article/details/51017688