PCA

Posted: 2014-09-14 16:39:57

http://deeplearning.stanford.edu/wiki/index.php/PCA

Principal Components Analysis (PCA) is a dimensionality reduction algorithm that can be used to significantly speed up your unsupervised feature learning algorithm.

Example

Suppose you are training your algorithm on images. Then the input will be somewhat redundant, because the values of adjacent pixels in an image are highly correlated. Concretely, suppose we are training on 16x16 grayscale image patches. Then the inputs $x \in \mathbb{R}^{256}$ are 256-dimensional vectors, with one feature $x_j$ corresponding to the intensity of each pixel. Because of the correlation between adjacent pixels, PCA will allow us to approximate the input with a much lower-dimensional one, while incurring very little error.

PCA will find a lower-dimensional subspace onto which to project our data. The running example uses a 2-dimensional dataset ($n = 2$). From visually examining the data, it appears that $u_1$ is the principal direction of variation of the data, and $u_2$ the secondary direction of variation. That is, the data varies much more in the direction $u_1$ than in the direction $u_2$.

To find the directions $u_1$ and $u_2$ more formally, we first compute the matrix $\Sigma$ as follows:

$\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} (x^{(i)})^T$

If $x$ has zero mean, then $\Sigma$ is exactly the covariance matrix of $x$. (The symbol "$\Sigma$", pronounced "Sigma", is the standard notation for denoting the covariance matrix. Unfortunately it looks just like the summation symbol, as in $\sum_{i=1}^{n} i$; but these are two different things.)
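The computation of $\Sigma$ can be sketched in a few lines of NumPy. This is a minimal illustration, not the tutorial's own code: the function name `sigma_matrix`, the one-example-per-column layout, and the toy data are all assumptions made for the example.

```python
import numpy as np

def sigma_matrix(X):
    """Return Sigma = (1/m) * sum_i x^(i) (x^(i))^T.

    Assumption: X stores one training example per column (shape n x m).
    """
    m = X.shape[1]
    return (X @ X.T) / m

# Hypothetical toy data: n = 2 features, m = 4 examples, each feature zero-mean.
X = np.array([[2.0, -2.0,  1.0, -1.0],
              [1.0, -1.0, -1.0,  1.0]])
sigma = sigma_matrix(X)
```

Because each feature of the toy data has zero mean, `sigma` here agrees with `np.cov(X, bias=True)`, which normalizes by $m$ rather than $m - 1$.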

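It can be shown that $u_1$, the principal direction of variation of the data, is the top (principal) eigenvector of $\Sigma$, and that $u_2$ is the second eigenvector. So let us compute the eigenvectors of $\Sigma$, and stack the eigenvectors in columns to form the matrix $U$:

$U = \begin{bmatrix} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | & | & & | \end{bmatrix}$

Here, $u_1$ is the principal eigenvector (corresponding to the largest eigenvalue), $u_2$ is the second eigenvector, and so on. Also, let $\lambda_1, \lambda_2, \ldots, \lambda_n$ be the corresponding eigenvalues.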
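One way to obtain $U$ and the eigenvalues in practice is `numpy.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order, so they need to be reversed to put the principal eigenvector first. The sketch below assumes a small hypothetical $\Sigma$; the helper name `eigen_basis` is not from the original text.

```python
import numpy as np

def eigen_basis(sigma):
    """Return (U, lam) with eigenvectors as the columns of U, ordered so that
    lam[0] >= lam[1] >= ... (principal eigenvector first)."""
    lam, U = np.linalg.eigh(sigma)   # symmetric input; eigenvalues ascending
    order = np.argsort(lam)[::-1]    # indices that sort eigenvalues descending
    return U[:, order], lam[order]

# Hypothetical 2x2 covariance matrix, used only for illustration.
sigma = np.array([[2.5, 0.5],
                  [0.5, 1.0]])
U, lam = eigen_basis(sigma)
```

The sign of each eigenvector is arbitrary, so a different solver may return $-u_j$ in place of $u_j$; the projections change sign accordingly but the subspace is the same.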

The vectors $u_1$ and $u_2$ in our example form a new basis in which we can represent the data. Concretely, let $x \in \mathbb{R}^2$ be some training example. Then $u_1^T x$ is the length (magnitude) of the projection of $x$ onto the vector $u_1$.

Similarly, $u_2^T x$ is the magnitude of $x$ projected onto the vector $u_2$.

Rotating the Data

Thus, we can represent $x$ in the $(u_1, u_2)$-basis by computing

$x_{\mathrm{rot}} = U^T x = \begin{bmatrix} u_1^T x \\ u_2^T x \end{bmatrix}$

(The subscript "rot" comes from the observation that this corresponds to a rotation (and possibly reflection) of the original data.) Let's take the entire training set, and compute $x_{\mathrm{rot}}^{(i)} = U^T x^{(i)}$ for every $i$. Plotting this transformed data $x_{\mathrm{rot}}$ shows the training set rotated into the $u_1$, $u_2$ basis. In the general case, $U^T x$ will be the training set rotated into the basis $u_1$, $u_2$, ..., $u_n$.

One of the properties of $U$ is that it is an "orthogonal" matrix, which means that it satisfies $U^T U = U U^T = I$. So if you ever need to go from the rotated vectors $x_{\mathrm{rot}}$ back to the original data $x$, you can compute

$x = U x_{\mathrm{rot}}$

because $U x_{\mathrm{rot}} = U U^T x = x$.
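The rotation and its inverse can be checked numerically. In this sketch `U` is a stand-in orthogonal matrix (a 45-degree rotation) rather than a basis computed from data, and the columns of `X` are hypothetical training examples.

```python
import numpy as np

# Stand-in orthogonal basis (assumption): a 45-degree rotation, columns u_1, u_2.
U = np.array([[1.0, -1.0],
              [1.0,  1.0]]) / np.sqrt(2.0)

# Hypothetical data, one example per column.
X = np.array([[2.0, -2.0,  1.0, -1.0],
              [1.0, -1.0, -1.0,  1.0]])

X_rot = U.T @ X      # column i is x_rot^(i) = U^T x^(i)
X_back = U @ X_rot   # undo the rotation: x = U x_rot
```

Since `U` is orthogonal, `X_back` matches `X` up to floating-point error, and each column of `X_rot` has the same Euclidean length as the corresponding column of `X`.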

Reducing the Data Dimension

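We see that the principal direction of variation of the data is the first dimension $x_{\mathrm{rot},1}$ of this rotated data. Thus, if we want to reduce this data to one dimension, we can set

$\tilde{x}^{(i)} = x_{\mathrm{rot},1}^{(i)} = u_1^T x^{(i)} \in \mathbb{R}$

More generally, if $x \in \mathbb{R}^n$ and we want to reduce it to a $k$-dimensional representation $\tilde{x} \in \mathbb{R}^k$ (where $k < n$), we would take the first $k$ components of $x_{\mathrm{rot}}$, which correspond to the top $k$ directions of variation. Equivalently, $\tilde{x}$ can be reached by approximating $x_{\mathrm{rot}}$ with a vector in which all but the first $k$ components are zero:

$\tilde{x} = \begin{bmatrix} x_{\mathrm{rot},1} \\ \vdots \\ x_{\mathrm{rot},k} \\ 0 \\ \vdots \\ 0 \end{bmatrix} \approx \begin{bmatrix} x_{\mathrm{rot},1} \\ \vdots \\ x_{\mathrm{rot},k} \\ x_{\mathrm{rot},k+1} \\ \vdots \\ x_{\mathrm{rot},n} \end{bmatrix} = x_{\mathrm{rot}}$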

In our example, with $n = 2$ and $k = 1$, plotting this $\tilde{x}$ places every point on the first axis of the rotated coordinates.

However, since the final $n - k$ components of $\tilde{x}$ as defined above would always be zero, there is no need to keep these zeros around, and so we define $\tilde{x}$ as a $k$-dimensional vector with just the first $k$ (non-zero) components.

This also explains why we wanted to express our data in the $u_1, u_2, \ldots, u_n$ basis: deciding which components to keep becomes just keeping the top $k$ components. When we do this, we also say that we are "retaining the top $k$ PCA (or principal) components."
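Putting the steps so far together, a minimal sketch of the reduction to $k$ dimensions might look like the following. The function name `pca_reduce`, the column-per-example layout, and the toy data are assumptions; the data is taken to be zero-mean already.

```python
import numpy as np

def pca_reduce(X, k):
    """Project zero-mean data X (n x m, one example per column) onto its
    top-k principal directions. Returns (X_tilde, U, lam)."""
    m = X.shape[1]
    sigma = (X @ X.T) / m               # Sigma = (1/m) sum_i x x^T
    lam, U = np.linalg.eigh(sigma)      # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]       # reorder to descending
    U, lam = U[:, order], lam[order]
    X_tilde = U[:, :k].T @ X            # first k components of x_rot
    return X_tilde, U, lam

# Hypothetical zero-mean data with n = 2, m = 4.
X = np.array([[2.0, -2.0,  1.0, -1.0],
              [1.0, -1.0, -1.0,  1.0]])
X_tilde, U, lam = pca_reduce(X, k=1)
```

Slicing `U[:, :k]` before multiplying avoids computing the components that would be discarded anyway; the result equals the first `k` rows of `U.T @ X`.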

Recovering an Approximation of the Data

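Given the compressed representation $\tilde{x} \in \mathbb{R}^k$, we can recover an approximation $\hat{x}$ to the original $x \in \mathbb{R}^n$. Recall that $x = U x_{\mathrm{rot}}$. Further, we can think of $\tilde{x}$ as an approximation to $x_{\mathrm{rot}}$, where we have set the last $n - k$ components to zeros. Thus, given $\tilde{x} \in \mathbb{R}^k$, we can pad it out with $n - k$ zeros to get our approximation to $x_{\mathrm{rot}} \in \mathbb{R}^n$. Finally, we pre-multiply by $U$ to get our approximation to $x$. Concretely, we get

$\hat{x} = U \begin{bmatrix} \tilde{x}_1 \\ \vdots \\ \tilde{x}_k \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \sum_{i=1}^{k} u_i \tilde{x}_i$

In our example with $k = 1$, we are thus using a 1-dimensional approximation to the original dataset.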
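The padding-and-multiplying step translates directly into code. In this sketch the helper name `recover` is hypothetical, and `U` is a stand-in orthogonal matrix rather than one computed from the data, since the recovery step itself only relies on the columns of `U` being orthonormal.

```python
import numpy as np

def recover(X_tilde, U):
    """Approximate x from x_tilde: pad with n - k zeros, then multiply by U."""
    k, m = X_tilde.shape
    n = U.shape[0]
    X_rot_approx = np.vstack([X_tilde, np.zeros((n - k, m))])
    return U @ X_rot_approx

# Stand-in orthogonal basis and hypothetical data (one example per column).
U = np.array([[1.0, -1.0],
              [1.0,  1.0]]) / np.sqrt(2.0)
X = np.array([[2.0, -2.0,  1.0, -1.0],
              [1.0, -1.0, -1.0,  1.0]])

X_tilde = U[:, :1].T @ X      # keep k = 1 component
X_hat = recover(X_tilde, U)   # approximation living back in R^n
```

Multiplying by the zero rows is wasted work, so `U[:, :k] @ X_tilde` gives the same result more cheaply; this is the sum $\sum_{i=1}^{k} u_i \tilde{x}_i$ written as a matrix product.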

Number of components to retain

To decide how to set $k$, we will usually look at the percentage of variance retained for different values of $k$. Concretely, if $k = n$, then we have an exact approximation to the data, and we say that 100% of the variance is retained; that is, all of the variation of the original data is retained. Conversely, if $k = 0$, then we are approximating all the data with the zero vector, and thus 0% of the variance is retained.

More generally, let $\lambda_1, \lambda_2, \ldots, \lambda_n$ be the eigenvalues of $\Sigma$ (sorted in decreasing order), so that $\lambda_j$ is the eigenvalue corresponding to the eigenvector $u_j$. Then if we retain $k$ principal components, the percentage of variance retained is given by:

$\frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{n} \lambda_j}$

In our simple 2D example above, $\lambda_1 = 7.29$, and $\lambda_2 = 0.69$. Thus, by keeping only $k = 1$ principal components, we retained $7.29 / (7.29 + 0.69) = 0.913$, or 91.3% of the variance.
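The ratio is straightforward to compute once the eigenvalues are sorted. The sketch below uses plain Python; the eigenvalues 7.29 and 0.69 are assumed values chosen to reproduce the 91.3% figure for the 2D example, and the helper `smallest_k` (pick the smallest $k$ that reaches a target fraction) is an illustrative addition rather than something the text prescribes.

```python
def variance_retained(eigenvalues, k):
    """Fraction of variance kept by the top-k components.

    `eigenvalues` must already be sorted in decreasing order.
    """
    return sum(eigenvalues[:k]) / sum(eigenvalues)

def smallest_k(eigenvalues, target):
    """Smallest k whose retained-variance fraction is at least `target`."""
    for k in range(len(eigenvalues) + 1):
        if variance_retained(eigenvalues, k) >= target:
            return k
    return len(eigenvalues)

# Assumed eigenvalues for the 2D example (sorted in decreasing order).
eigenvalues = [7.29, 0.69]
fraction = variance_retained(eigenvalues, 1)
```

With these values, `smallest_k(eigenvalues, 0.9)` selects one component, while a stricter target such as 0.99 requires keeping both.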

PCA on Images

For PCA to work, usually we want each of the features $x_1, x_2, \ldots, x_n$ to have a similar range of values to the others (and to have a mean close to zero). If you've used PCA in other applications before, you may therefore have separately pre-processed each feature to have zero mean and unit variance, by separately estimating the mean and variance of each feature $x_j$. However, this isn't the pre-processing that we will apply to most types of images. Specifically, suppose we are training our algorithm on natural images, so that $x_j$ is the value of pixel $j$. By "natural images," we informally mean the type of image that a typical animal or person might see over their lifetime.

In detail, in order for PCA to work well, informally we require that (i) the features have approximately zero mean, and (ii) the different features have similar variances to each other. With natural images, (ii) is already satisfied even without variance normalization, and so we won't perform any variance normalization.

Concretely, if $x^{(i)} \in \mathbb{R}^n$ are the (grayscale) intensity values of a 16x16 image patch ($n = 256$), we might normalize the intensity of each image $x^{(i)}$ as follows:

$\mu^{(i)} := \frac{1}{n} \sum_{j=1}^{n} x_j^{(i)}$

$x_j^{(i)} := x_j^{(i)} - \mu^{(i)}$, for all $j$
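Note that this mean is taken over the pixels of each patch, not over the training set for each pixel. A minimal NumPy sketch, assuming one patch per column and using random values as stand-in pixel intensities (the name `remove_patch_means` is hypothetical):

```python
import numpy as np

def remove_patch_means(X):
    """Subtract from each patch (a column of X) its own mean intensity."""
    mu = X.mean(axis=0, keepdims=True)   # shape (1, m): one mean per patch
    return X - mu                        # broadcasts mu down each column

# Stand-in data: 5 hypothetical 16x16 patches flattened to 256 values each.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(256, 5))
X_norm = remove_patch_means(X)
```

Using `axis=0` is what makes this a per-patch normalization under the column layout assumed here; `axis=1` would instead compute the per-pixel mean across patches, which is the per-feature pre-processing the text sets aside for natural images.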

Original post: http://www.cnblogs.com/sprint1989/p/3971156.html