Chapter 2.0: Prerequisite 2 - Singular Value Decomposition (SVD)
Christopher M. Bishop, PRML, Chapter 2 Probability Distributions
Normal Vector
A normal vector (more commonly called a normalized or unit vector) is a vector of length 1, i.e., $|\mathbf{x}| = \sqrt{\mathbf{x}^T\mathbf{x}} = 1$.
Orthonormal Vectors
Vectors of unit length that are orthogonal to each other are said to be orthonormal.
A square matrix $A$ is orthogonal if $A^TA = AA^T = I$, that is, if its columns are orthonormal.
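As a quick numerical check of the condition $Q^TQ = I$, here is a minimal sketch in plain Python. The helper names (`transpose`, `matmul`, `is_orthogonal`) and the two test matrices are illustrative choices, not part of the original notes.

```python
import math

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def is_orthogonal(q, tol=1e-12):
    """Check Q^T Q = I entry by entry for a square matrix, up to a small tolerance."""
    n = len(q)
    qtq = matmul(transpose(q), q)
    return all(abs(qtq[i][j] - (1.0 if i == j else 0.0)) <= tol
               for i in range(n) for j in range(n))

theta = math.pi / 4  # hypothetical rotation angle
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta),  math.cos(theta)]]
shear = [[1.0, 1.0], [0.0, 1.0]]

print(is_orthogonal(rotation))  # True: rotations preserve lengths and angles
print(is_orthogonal(shear))     # False: the columns are not orthonormal
```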
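An eigenvector of a square matrix $A$ is a nonzero vector $\mathbf{v}$ that satisfies the equation $A\mathbf{v} = \lambda\mathbf{v}$, where the scalar $\lambda$ is the corresponding eigenvalue.
Eigenvalues and eigenvectors are also known, respectively, as characteristic roots and characteristic vectors, or latent roots and latent vectors.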
In linear algebra, an eigenvector or characteristic vector of a linear transformation $T$ from a vector space $V$ over a field $F$ into itself is a non-zero vector that does not change its direction when that linear transformation is applied to it. In other words, if $\mathbf{v}$ is a vector in $V$ that is not the zero vector, then it is an eigenvector of a linear transformation $T$ if $T(\mathbf{v})$ is a scalar multiple of $\mathbf{v}$. This condition can be written as the mapping $T(\mathbf{v}) = \lambda\mathbf{v}$, where $\lambda$ is a scalar in $F$.
If the vector space $V$ is finite-dimensional, then the linear transformation $T$ can be represented as a square matrix $A$, and the vector $\mathbf{v}$ by a column vector, rendering the above mapping as a matrix multiplication on the left-hand side and a scaling of the column vector on the right-hand side in the equation $A\mathbf{v} = \lambda\mathbf{v}$.
There is a correspondence between $n$ by $n$ square matrices and linear transformations from an $n$-dimensional vector space to itself. For this reason, it is equivalent to define eigenvalues and eigenvectors using either the language of matrices or the language of linear transformations.
Geometrically, an eigenvector corresponding to a real, nonzero eigenvalue points in a direction that is stretched by the transformation and the eigenvalue is the factor by which it is stretched. If the eigenvalue is negative, the direction is reversed.
The following figure illustrates this: the matrix $A$ acts by stretching the vector $\mathbf{x}$ without changing its direction, so $\mathbf{x}$ is an eigenvector of $A$.
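The definition can be tested directly: a nonzero vector is an eigenvector exactly when $A\mathbf{v}$ is a scalar multiple of $\mathbf{v}$. The sketch below is a minimal illustration; the function name `eigenvalue_if_eigenvector` and the example matrix are assumptions made for this example.

```python
def matvec(a, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def eigenvalue_if_eigenvector(a, v, tol=1e-12):
    """Return lambda if A v = lambda v for the nonzero vector v, otherwise None."""
    av = matvec(a, v)
    # The Rayleigh quotient is the only possible candidate for lambda.
    lam = sum(x * y for x, y in zip(av, v)) / sum(x * x for x in v)
    if all(abs(x - lam * y) <= tol for x, y in zip(av, v)):
        return lam
    return None

A = [[2.0, 1.0], [1.0, 2.0]]  # hypothetical example matrix
print(eigenvalue_if_eigenvector(A, [1.0, 1.0]))   # 3.0
print(eigenvalue_if_eigenvector(A, [1.0, -1.0]))  # 1.0
print(eigenvalue_if_eigenvector(A, [1.0, 0.0]))   # None: A v is not parallel to v
```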
Singular value decomposition (SVD) can be looked at from three mutually compatible points of view.
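SVD is based on a theorem from linear algebra which says that a rectangular matrix $A$ can be broken down into the product of three matrices: an orthogonal matrix $U$, a diagonal matrix $\Sigma$, and the transpose of an orthogonal matrix $V$.
The theorem is usually presented something like this:
$$A_{m \times n} = U_{m \times m}\,\Sigma_{m \times n}\,V_{n \times n}^T, \qquad U^TU = I, \quad V^TV = I,$$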
assuming [see Ref-4 for this figure]:
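The columns of $U$ and the columns of $V$ are called the left-singular vectors and right-singular vectors of $A$, respectively.
The columns of $U$ are orthonormal eigenvectors of $AA^T$.
Here is a brief proof. Let $U = [\mathbf{u}_1, \dots, \mathbf{u}_m]$, where the column vector $\mathbf{u}_i \in \mathbb{R}^m$, for $i = 1, \dots, m$, with $\mathbf{u}_i^T\mathbf{u}_j = \delta_{ij}$. Since $V^TV = I$,
$$AA^T = U\Sigma V^TV\Sigma^TU^T = U\Sigma\Sigma^TU^T,$$
so $AA^TU = U\Sigma\Sigma^T$. Because $\Sigma\Sigma^T$ is diagonal with entries $\sigma_i^2$ (taking $\sigma_i = 0$ for $i > \min(m, n)$), this reads column by column as $AA^T\mathbf{u}_i = \sigma_i^2\mathbf{u}_i$.
Similarly, we can prove that the columns of $V$ are orthonormal eigenvectors of $A^TA$, since $A^TA = V\Sigma^T\Sigma V^T$ gives $A^TA\mathbf{v}_j = \sigma_j^2\mathbf{v}_j$.
$\Sigma$ is a diagonal matrix containing the square roots of the non-zero eigenvalues of both $AA^T$ and $A^TA$ (the singular values), padded with zeros. A common convention is to list the singular values in descending order. In this case, the diagonal matrix $\Sigma$ is uniquely determined by $A$ (though not the matrices $U$ and $V$).
Let $U = [\mathbf{u}_1, \dots, \mathbf{u}_m]$, where $\mathbf{u}_i \in \mathbb{R}^m$, for $i = 1, \dots, m$; and $V = [\mathbf{v}_1, \dots, \mathbf{v}_n]$, where $\mathbf{v}_j \in \mathbb{R}^n$, for $j = 1, \dots, n$. Then the decomposition can be written as a sum of rank-one terms, $A = \sum_{i=1}^{r}\sigma_i\mathbf{u}_i\mathbf{v}_i^T$, where $r$ is the number of non-zero singular values.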
To gain insight into the SVD, treat the rows of an $n \times d$ matrix $A$ as $n$ points in a $d$-dimensional space. (We write $n \times d$ here instead of $m \times n$, since this is the usual notation for $n$ points of dimension $d$.)
Consider the problem of finding the best $k$-dimensional subspace with respect to this set of points. Here "best" means minimizing the sum of the squares of the perpendicular distances of the points to the subspace. We begin with the special case where the subspace is 1-dimensional, a line through the origin. We will see later that the best-fitting $k$-dimensional subspace can be found by $k$ applications of the best-fitting-line algorithm. Finding the best-fitting line through the origin with respect to a set of points $\{\mathbf{x}_i \mid 1 \le i \le n\}$ in the plane means minimizing the sum of the squared distances of the points to the line. Here distance is measured perpendicular to the line, and the corresponding problem is called the best least squares fit. More often, distance is measured vertically, in the $y$ direction, and the corresponding problem is the ordinary least squares fit.
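Returning to the best least squares fit problem, consider projecting a point $\mathbf{x}_i = (x_{i1}, \dots, x_{id})$ onto a line through the origin. Then based on the following figure
we can get, by the Pythagorean theorem,
$$x_{i1}^2 + x_{i2}^2 + \cdots + x_{id}^2 = (\text{length of projection})^2 + (\text{distance of point to line})^2. \tag{3.9}$$
From (3.9) and the observation that $\sum_{i=1}^{n}(x_{i1}^2 + \cdots + x_{id}^2)$ is a constant (i.e., independent of the line), we get the equivalence: minimizing the sum of the squared distances of the points to the line is the same as maximizing the sum of the squared lengths of their projections onto the line.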
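The First Singular Vector: With this in mind, define the first singular vector $\mathbf{v}_1$ of $A$, which is a column vector, as the best-fit line through the origin for the $n$ points in $d$-space that are the rows of $A$. Thus
$$\mathbf{v}_1 = \arg\max_{|\mathbf{v}| = 1} |A\mathbf{v}|.$$
The First Singular Value: The value $\sigma_1(A) = |A\mathbf{v}_1|$ is called the first singular value of $A$. Note that $\sigma_1^2$ is the sum of the squares of the projections of the points onto the line determined by $\mathbf{v}_1$.
The Second Singular Vector: The second singular vector $\mathbf{v}_2$ is defined by the best-fit line perpendicular to $\mathbf{v}_1$:
$$\mathbf{v}_2 = \arg\max_{\mathbf{v} \perp \mathbf{v}_1,\ |\mathbf{v}| = 1} |A\mathbf{v}|.$$
The Second Singular Value: The value $\sigma_2(A) = |A\mathbf{v}_2|$ is called the second singular value of $A$. Note that $\sigma_2^2$ is the sum of the squares of the projections of the points onto the line determined by $\mathbf{v}_2$.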
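The variational definition of the first singular vector can be checked by brute force in the plane. The sketch below scans unit directions, picks the one that maximizes the sum of squared projections, and compares the result with the largest eigenvalue of $A^TA$. The data points and function names are hypothetical, and the angle scan is only an illustration, not an efficient algorithm.

```python
import math

# Hypothetical data: each row is a point in the plane (a row of the matrix A).
points = [(2.0, 1.1), (-1.0, -0.4), (3.0, 1.4), (-2.0, -1.2), (0.5, 0.3)]

def projection_energy(theta):
    """Sum of squared projection lengths onto the unit vector at angle theta."""
    vx, vy = math.cos(theta), math.sin(theta)
    return sum((x * vx + y * vy) ** 2 for x, y in points)

def squared_distance_sum(theta):
    """Sum of squared perpendicular distances to the line at angle theta."""
    vx, vy = math.cos(theta), math.sin(theta)
    return sum((x * vy - y * vx) ** 2 for x, y in points)

# Brute force over [0, pi): a direction and its opposite define the same line.
steps = 20000
angles = [math.pi * k / steps for k in range(steps)]
best_theta = max(angles, key=projection_energy)
sigma_1 = math.sqrt(projection_energy(best_theta))

# Closed form for comparison: the largest eigenvalue of the 2x2 matrix A^T A.
sxx = sum(x * x for x, _ in points)
sxy = sum(x * y for x, y in points)
syy = sum(y * y for _, y in points)
trace, det = sxx + syy, sxx * syy - sxy * sxy
lam_max = (trace + math.sqrt(trace * trace - 4.0 * det)) / 2.0

print(sigma_1, math.sqrt(lam_max))  # the two values agree closely
```

Because the projection energy and the squared-distance sum always add up to the constant $\sum_i |\mathbf{x}_i|^2$, the direction that maximizes the first also minimizes the second.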
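Consider one row, say $\mathbf{a}_j$, of the matrix $A$. Since $\mathbf{v}_1, \dots, \mathbf{v}_r$ span the space of all rows of $A$, $\mathbf{a}_j \cdot \mathbf{v} = 0$ for all $\mathbf{v}$ perpendicular to $\mathbf{v}_1, \dots, \mathbf{v}_r$. Thus, for each row $\mathbf{a}_j$, $\sum_{i=1}^{r}(\mathbf{a}_j \cdot \mathbf{v}_i)^2 = |\mathbf{a}_j|^2$. Summing over all rows,
$$\sum_{j=1}^{n}|\mathbf{a}_j|^2 = \sum_{j=1}^{n}\sum_{i=1}^{r}(\mathbf{a}_j \cdot \mathbf{v}_i)^2 = \sum_{i=1}^{r}|A\mathbf{v}_i|^2 = \sum_{i=1}^{r}\sigma_i^2(A),$$
that is, the sum of the squared singular values equals the sum of the squares of all entries of $A$.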
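The identity between the squared singular values and the sum of squared entries is easy to confirm numerically. The following sketch handles an $n \times 2$ matrix only, so that the eigenvalues of the $2 \times 2$ matrix $A^TA$ have a closed form. The function name `singular_values_nx2` and the sample matrix are assumptions for illustration.

```python
import math

def singular_values_nx2(rows):
    """Singular values of an n-by-2 matrix, from the eigenvalues of the 2x2 matrix A^T A."""
    g11 = sum(x * x for x, _ in rows)
    g12 = sum(x * y for x, y in rows)
    g22 = sum(y * y for _, y in rows)
    trace, det = g11 + g22, g11 * g22 - g12 * g12
    disc = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
    lam1 = (trace + disc) / 2.0
    lam2 = max((trace - disc) / 2.0, 0.0)  # clamp tiny negative rounding errors
    return math.sqrt(lam1), math.sqrt(lam2)

A = [(1.0, 1.0), (0.0, 1.0), (2.0, -1.0)]  # hypothetical 3-by-2 matrix
s1, s2 = singular_values_nx2(A)
sum_of_squared_entries = sum(x * x + y * y for x, y in A)

print(s1 ** 2 + s2 ** 2, sum_of_squared_entries)  # both are 8 up to rounding
```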
As shown in the figure, the singular values can be interpreted as the semiaxes of an ellipse in 2D. This concept can be generalized to n-dimensional Euclidean space, with the singular values of any square matrix being viewed as the semiaxes of an n-dimensional ellipsoid. See below for further details.
Since $U$ and $V^*$ are unitary, the columns of each of them form a set of orthonormal vectors, which can be regarded as basis vectors. The matrix $A$ maps the basis vector $\mathbf{v}_i$ to the stretched unit vector $\sigma_i\mathbf{u}_i$. By the definition of a unitary matrix, the same is true for their conjugate transposes $U^*$ and $V$, except that the geometric interpretation of the singular values as stretches is lost. In short, the columns of $U$, $U^*$, $V$, and $V^*$ are orthonormal bases. (For real matrices, the conjugate transpose $V^*$ is simply $V^T$.)
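Show that a real, symmetric matrix $A$ satisfying the eigenvector equation $A\mathbf{u}_i = \lambda_i\mathbf{u}_i$ can be expressed as an expansion in its eigenvalues and eigenvectors of the following form
$$A = \sum_{i=1}^{n}\lambda_i\mathbf{u}_i\mathbf{u}_i^T, \tag{4.1}$$
and that its inverse has the corresponding expansion
$$A^{-1} = \sum_{i=1}^{n}\frac{1}{\lambda_i}\mathbf{u}_i\mathbf{u}_i^T. \tag{4.2}$$
The proof of (4.1) and (4.2) uses the fact that a real symmetric matrix is orthogonally similar to a diagonal matrix: with $U = [\mathbf{u}_1, \dots, \mathbf{u}_n]$ and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$,
$$U^TAU = \Lambda, \tag{4.5}$$
$$U^TU = UU^T = I. \tag{4.6}$$
For any column vector $\mathbf{x}$, the orthonormality (4.6) gives the expansion
$$\mathbf{x} = \sum_{i=1}^{n}(\mathbf{u}_i^T\mathbf{x})\,\mathbf{u}_i, \tag{4.7}$$
and so we have
$$A\mathbf{x} = \sum_{i=1}^{n}(\mathbf{u}_i^T\mathbf{x})\,A\mathbf{u}_i = \sum_{i=1}^{n}(\mathbf{u}_i^T\mathbf{x})\,\lambda_i\mathbf{u}_i.$$
Since the inner product $\mathbf{u}_i^T\mathbf{x}$ in (4.7) is a scalar, and $\lambda_i$ is also a scalar, we can change the order of the terms,
$$A\mathbf{x} = \sum_{i=1}^{n}\lambda_i\mathbf{u}_i(\mathbf{u}_i^T\mathbf{x}) = \Big(\sum_{i=1}^{n}\lambda_i\mathbf{u}_i\mathbf{u}_i^T\Big)\mathbf{x}.$$
Because this holds for every $\mathbf{x}$, (4.1) follows.
Since $U^TAU = \Lambda$, inverting both sides gives $U^{-1}A^{-1}(U^T)^{-1} = \Lambda^{-1}$, and hence $A^{-1} = U\Lambda^{-1}U^T$. Applying the above result to $A^{-1}$, noting that $\Lambda^{-1}$ is just the diagonal matrix of the inverses of the diagonal elements of $\Lambda$, we have proved (4.2).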
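The expansion and its counterpart for the inverse can be verified on a small example. The sketch below uses the symmetric matrix $[[2, 1], [1, 2]]$, whose eigenvalues 3 and 1 and orthonormal eigenvectors are written out by hand; the helper names `outer` and `weighted_sum` are illustrative.

```python
import math

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def weighted_sum(weights, mats):
    """Entrywise sum of w_k * M_k over square matrices of equal size."""
    n = len(mats[0])
    return [[sum(w * m[i][j] for w, m in zip(weights, mats)) for j in range(n)]
            for i in range(n)]

s = 1.0 / math.sqrt(2.0)
eigenvalues = [3.0, 1.0]
eigenvectors = [[s, s], [-s, s]]  # orthonormal eigenvectors of [[2, 1], [1, 2]]

projectors = [outer(u, u) for u in eigenvectors]
A = weighted_sum(eigenvalues, projectors)                             # sum_i lambda_i u_i u_i^T
A_inv = weighted_sum([1.0 / lam for lam in eigenvalues], projectors)  # sum_i (1/lambda_i) u_i u_i^T

print(A)      # approximately [[2, 1], [1, 2]]
print(A_inv)  # approximately [[2/3, -1/3], [-1/3, 2/3]]
```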
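Let $A$ be an $n \times d$ matrix and think of the rows of $A$ as $n$ points in $d$-dimensional space. There are two important matrix norms, the Frobenius norm, denoted $\|A\|_F = \sqrt{\sum_{j,k}a_{jk}^2}$, and the 2-norm, denoted $\|A\|_2 = \max_{|\mathbf{v}| = 1}|A\mathbf{v}|$.
Let
$$A = \sum_{i=1}^{r}\sigma_i\mathbf{u}_i\mathbf{v}_i^T \tag{5.1}$$
and, for $k \in \{1, \dots, r\}$,
$$A_k = \sum_{i=1}^{k}\sigma_i\mathbf{u}_i\mathbf{v}_i^T. \tag{5.2}$$
The rows of the matrix $A_k$ are the projections of the rows of $A$ onto the subspace $V_k$ spanned by the first $k$ singular vectors of $A$.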
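Let $A$ be an $n \times d$ matrix. For any matrix $B$ of rank at most $k$, it holds that $\|A - A_k\|_F \le \|A - B\|_F$.
Let $A$ be an $n \times d$ matrix. For any matrix $B$ of rank at most $k$, it holds that $\|A - A_k\|_2 \le \|A - B\|_2$.
Let $A$ be an $n \times d$ matrix. For $A_k$ in (5.2) it holds that $\|A - A_k\|_2^2 = \sigma_{k+1}^2$.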
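A small check of the best rank-$k$ approximation property, as a sketch only: it uses the symmetric matrix $[[2, 1], [1, 2]]$, whose SVD is written out by hand here (singular values 3 and 1, with identical left and right singular vectors), and compares the truncated matrix $A_1$ against two arbitrarily chosen rank-one competitors. All names and competitor matrices are assumptions for illustration.

```python
import math

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def subtract(a, b):
    """Entrywise difference of two matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def frobenius(m):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in m for x in row))

A = [[2.0, 1.0], [1.0, 2.0]]
s = 1.0 / math.sqrt(2.0)
sigma = [3.0, 1.0]          # singular values of A (known in closed form)
u = [[s, s], [-s, s]]       # left singular vectors
v = [[s, s], [-s, s]]       # right singular vectors (equal to u for this matrix)

# Truncate the expansion after one term: A_1 = sigma_1 u_1 v_1^T.
A1 = [[sigma[0] * x for x in row] for row in outer(u[0], v[0])]
best_error = frobenius(subtract(A, A1))

# Two hypothetical rank-one competitors.
B1 = [[2.0, 1.0], [2.0, 1.0]]
B2 = [[2.0, 1.0], [1.0, 0.5]]

print(best_error)                   # equals sigma_2 = 1 up to rounding
print(frobenius(subtract(A, B1)))   # larger than best_error
print(frobenius(subtract(A, B2)))   # larger than best_error
```

For a $2 \times 2$ matrix the residual $A - A_1$ has a single non-zero singular value, so its Frobenius norm and 2-norm coincide and both equal $\sigma_2$.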
Let us begin by looking at some simple matrices, namely those with two rows and two columns. Our first example is the diagonal matrix
$$M = \begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix}.$$
Geometrically, we may think of a matrix like this as taking a point $(x, y)$ in the plane and transforming it into another point using matrix multiplication:
$$\begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 3x \\ y \end{bmatrix}.$$
The effect of this transformation is shown below: the plane is horizontally stretched by a factor of 3, while there is no vertical change.
Now let's look at
$$M = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}.$$
It is not so clear how to describe simply the geometric effect of this transformation. However, let's rotate our grid through a 45° angle and see what happens. The four vertices of the rotated red square move only along the directions of the rotated grid lines, which produces this effect
We see now that this new grid is transformed in the same way that the original grid was transformed by the diagonal matrix: the grid is stretched by a factor of 3 in one direction.
This is a very special situation due to the fact that the matrix $M$ is symmetric, i.e., $M^T = M$. If we have a symmetric $2 \times 2$ matrix, it turns out that
we may always rotate the grid in the domain so that the matrix acts by stretching and perhaps reflecting in the two directions. In other words, symmetric matrices behave like diagonal matrices.
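The figures above address the following question: given a symmetric matrix $M$, i.e., $M^T = M$,
- How should we place the coordinate grid (in other words, how do we choose the position and orientation of a unit square in the coordinate system, keeping in mind that the square can be described by two mutually perpendicular unit vectors $\mathbf{v}_1$ and $\mathbf{v}_2$) so that, when the transformation represented by the symmetric matrix $M$ is applied, the square deforms purely by stretching or compressing along the directions of $\mathbf{v}_1$ and $\mathbf{v}_2$? This connects to the eigenvectors and eigenvalues of a matrix discussed below. That is,
$$M\mathbf{v}_i = \lambda_i\mathbf{v}_i$$
says that after the eigenvector $\mathbf{v}_i$ is transformed by the matrix $M$, the new vector is parallel to the original one (in the same or the opposite direction); only its length has changed.
- How do we find such $\mathbf{v}_1$ and $\mathbf{v}_2$? When $M$ is symmetric (a special case, of course; more general matrices are discussed next), $\mathbf{v}_1$ and $\mathbf{v}_2$ are simply the two eigenvectors of $M$. For the example $M = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$, solving $M\mathbf{v} = \lambda\mathbf{v}$ gives the eigenvalues and eigenvectors
$$\lambda_1 = 3,\ \mathbf{v}_1 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}; \qquad \lambda_2 = 1,\ \mathbf{v}_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} -1 \\ 1 \end{bmatrix},$$
which accords with the rotation of the red square shown above.
- For this special symmetric matrix, the SVD reduces to Lemma 4-1: a real symmetric matrix is orthogonally similar to a diagonal matrix, as shown in (4.5). This can be viewed as a special case of the SVD; that is, the matrix $M$ has the SVD
$$M = \begin{bmatrix} \tfrac{1}{\sqrt{2}} & -\tfrac{1}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} \end{bmatrix}\begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} \\ -\tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} \end{bmatrix}.$$
- For a general matrix $A \in \mathbb{R}^{m \times n}$, there exist orthogonal matrices $U$ and $V$ (i.e., $U^TU = I$ and $V^TV = I$) such that
$$A = U\Sigma V^T. \tag{6.2}$$
In (6.2), the columns of $U$ are the eigenvectors of $AA^T$, the columns of $V$ are the eigenvectors of $A^TA$, and the diagonal matrix $\Sigma$ consists of the positive square roots of the eigenvalues of $AA^T$ (or $A^TA$).
- When $A$ is a real symmetric matrix, there exists an orthogonal matrix $U$ (i.e., $U^TU = I$) such that
$$A = U\Lambda U^T. \tag{6.3}$$
In (6.3), the columns of $U$ are the eigenvectors of the symmetric matrix $A$, and the diagonal matrix $\Lambda$ consists of its eigenvalues. The decomposition can also be obtained by the method described above, with $U$ built from the eigenvectors of $AA^T$ and the diagonal matrix built from the positive square roots of the eigenvalues of $AA^T$. The two approaches agree when all eigenvalues of $A$ are nonnegative; a negative eigenvalue $\lambda$ appears in the SVD as the singular value $|\lambda|$, with the sign absorbed into one of the singular vectors.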
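Said with more mathematical precision, given a symmetric matrix $M$, we may find a set of orthogonal vectors $\mathbf{v}_i$ so that $M\mathbf{v}_i$ is a scalar multiple of $\mathbf{v}_i$; that is, $M\mathbf{v}_i = \lambda_i\mathbf{v}_i$, where $\lambda_i$ is a scalar.
Geometrically, this means that the vectors $\mathbf{v}_i$ are simply stretched and/or reflected (that is, their direction is reversed by 180°) when multiplied by $M$. Because of this property, we call the vectors $\mathbf{v}_i$ eigenvectors of $M$; the scalars $\lambda_i$ are called eigenvalues.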
An important fact, which is easily verified, is that eigenvectors of a symmetric matrix corresponding to different eigenvalues are orthogonal. If we use the eigenvectors of a symmetric matrix to align the grid, the matrix stretches and/or reflects the grid in the same way that it does the eigenvectors.
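The verification mentioned above takes only a few lines. As a worked derivation (using the symbols $M$, $\mathbf{v}_1$, $\mathbf{v}_2$, $\lambda_1$, $\lambda_2$ for a symmetric matrix and two of its eigenpairs with $\lambda_1 \ne \lambda_2$):

```latex
\begin{align*}
\lambda_1\,(\mathbf{v}_1 \cdot \mathbf{v}_2)
  &= (M\mathbf{v}_1)^T \mathbf{v}_2
   = \mathbf{v}_1^T M^T \mathbf{v}_2 \\
  &= \mathbf{v}_1^T M \mathbf{v}_2
   && \text{since } M^T = M \\
  &= \lambda_2\,(\mathbf{v}_1 \cdot \mathbf{v}_2), \\
\text{so}\quad (\lambda_1 - \lambda_2)\,(\mathbf{v}_1 \cdot \mathbf{v}_2) &= 0
  \quad\Longrightarrow\quad \mathbf{v}_1 \cdot \mathbf{v}_2 = 0 .
\end{align*}
```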
The geometric description we gave for this linear transformation is a simple one: the grid is simply stretched in one direction. For more general matrices, we will ask if we can find an orthogonal grid that is transformed into another orthogonal grid. Let's consider a final example using a matrix that is not symmetric:
$$M = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}.$$
This matrix produces the geometric effect known as a shear, shown as
It’s easy to find one family of eigenvectors along the horizontal axis. However, our figure above shows that these eigenvectors cannot be used to create an orthogonal grid that is transformed into another orthogonal grid.
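Following the discussion of (6.2), the columns of $V$ are the eigenvectors of $M^TM$. For the shear matrix $M = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$ this results in
$$M^TM = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}, \qquad \lambda_{1,2} = \frac{3 \pm \sqrt{5}}{2},$$
and the singular values are the positive square roots $\sigma_1 = \frac{1 + \sqrt{5}}{2}$ and $\sigma_2 = \frac{\sqrt{5} - 1}{2}$.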
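This is the geometric essence of the singular value decomposition for $2 \times 2$ matrices:
for any $2 \times 2$ matrix, we may find an orthogonal grid that is transformed into another orthogonal grid. We will express this fact using vectors:
- with an appropriate choice of orthogonal unit vectors $\mathbf{v}_1$ and $\mathbf{v}_2$, the vectors $M\mathbf{v}_1$ and $M\mathbf{v}_2$ are orthogonal.
We will use $\mathbf{u}_1$ and $\mathbf{u}_2$ to denote unit vectors in the direction of $M\mathbf{v}_1$ and $M\mathbf{v}_2$. The lengths of $M\mathbf{v}_1$ and $M\mathbf{v}_2$, denoted by $\sigma_1$ and $\sigma_2$, describe the amount that the grid is stretched in those particular directions. These numbers are called the singular values of $M$. (In this case, the singular values are the golden ratio and its reciprocal, but that is not so important here.)
We therefore have
$$M\mathbf{v}_1 = \sigma_1\mathbf{u}_1, \qquad M\mathbf{v}_2 = \sigma_2\mathbf{u}_2.$$
We may now give a simple description of how the matrix $M$ treats a general vector $\mathbf{x}$. Since the vectors $\mathbf{v}_1$ and $\mathbf{v}_2$ are orthogonal unit vectors, we have
$$\mathbf{x} = (\mathbf{v}_1 \cdot \mathbf{x})\,\mathbf{v}_1 + (\mathbf{v}_2 \cdot \mathbf{x})\,\mathbf{v}_2.$$
This means that
$$M\mathbf{x} = (\mathbf{v}_1 \cdot \mathbf{x})\,M\mathbf{v}_1 + (\mathbf{v}_2 \cdot \mathbf{x})\,M\mathbf{v}_2 = (\mathbf{v}_1 \cdot \mathbf{x})\,\sigma_1\mathbf{u}_1 + (\mathbf{v}_2 \cdot \mathbf{x})\,\sigma_2\mathbf{u}_2.$$
Remember that the dot product may be computed using the vector transpose, $\mathbf{v} \cdot \mathbf{x} = \mathbf{v}^T\mathbf{x}$, which leads to
$$M\mathbf{x} = \mathbf{u}_1\sigma_1\mathbf{v}_1^T\mathbf{x} + \mathbf{u}_2\sigma_2\mathbf{v}_2^T\mathbf{x}, \qquad M = \mathbf{u}_1\sigma_1\mathbf{v}_1^T + \mathbf{u}_2\sigma_2\mathbf{v}_2^T.$$
This is usually expressed by writing $M = U\Sigma V^T$, where $U$ is a matrix whose columns are the vectors $\mathbf{u}_1$ and $\mathbf{u}_2$, $\Sigma$ is a diagonal matrix whose entries are $\sigma_1$ and $\sigma_2$, and $V$ is a matrix whose columns are $\mathbf{v}_1$ and $\mathbf{v}_2$.
This shows how to decompose the matrix $M$ into the product of three matrices: $V$ describes an orthonormal basis in the domain, $U$ describes an orthonormal basis in the co-domain, and $\Sigma$ describes how much the vectors in $V$ are stretched to give the vectors in $U$.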
The power of the singular value decomposition lies in the fact that we may find it for any matrix. How do we do it? Let's look at our earlier example and add the unit circle in the domain. Its image will be an ellipse whose major and minor axes define the orthogonal grid in the co-domain.
Notice that the major and minor axes are defined by $M\mathbf{v}_1$ and $M\mathbf{v}_2$. These vectors therefore are the longest and shortest vectors among all the images of vectors on the unit circle.
In other words, the function $|M\mathbf{x}|$ on the unit circle has a maximum at $\mathbf{v}_1$ and a minimum at $\mathbf{v}_2$. This reduces the problem to a rather standard calculus problem in which we wish to optimize a function over the unit circle. It turns out that the critical points of this function occur at the eigenvectors of the matrix $M^TM$. Since this matrix is symmetric (clearly $(M^TM)^T = M^TM$), eigenvectors corresponding to different eigenvalues will be orthogonal. This gives the family of vectors $\mathbf{v}_i$.
The singular values are then given by $\sigma_i = |M\mathbf{v}_i|$, and the vectors $\mathbf{u}_i$ are obtained as unit vectors in the direction of $M\mathbf{v}_i$.
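But why are the vectors $\mathbf{u}_i$ orthogonal? To explain this, we will assume that $\sigma_i$ and $\sigma_j$ are distinct singular values. We have
$$M\mathbf{v}_i = \sigma_i\mathbf{u}_i, \qquad M\mathbf{v}_j = \sigma_j\mathbf{u}_j.$$
Let's begin by looking at the expression $M\mathbf{v}_i \cdot M\mathbf{v}_j$ and assuming, for convenience, that the singular values are non-zero.
- On one hand, this expression is zero because the vectors $\mathbf{v}_i$ and $\mathbf{v}_j$ are orthogonal to one another, being eigenvectors of the symmetric matrix $M^TM$, i.e.,
$$M^TM\mathbf{v}_j = \lambda_j\mathbf{v}_j.$$
Therefore,
$$M\mathbf{v}_i \cdot M\mathbf{v}_j = \mathbf{v}_i^TM^TM\mathbf{v}_j = \lambda_j\,\mathbf{v}_i \cdot \mathbf{v}_j = 0.$$
- On the other hand, we have
$$M\mathbf{v}_i \cdot M\mathbf{v}_j = \sigma_i\sigma_j\,\mathbf{u}_i \cdot \mathbf{u}_j = 0.$$
Therefore, $\mathbf{u}_i$ and $\mathbf{u}_j$ are orthogonal, so we have found an orthogonal set of vectors $\mathbf{v}_i$ that is transformed into another orthogonal set $\mathbf{u}_i$. The singular values describe the amount of stretching in the different directions.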
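The procedure just described (eigenvectors of $M^TM$, then $\sigma_i = |M\mathbf{v}_i|$ and $\mathbf{u}_i = M\mathbf{v}_i / \sigma_i$) can be sketched for an invertible $2 \times 2$ matrix in plain Python. The function name `svd_2x2` is hypothetical, the closed-form eigenvector formula only works in two dimensions, and this is a teaching sketch rather than a numerically robust routine.

```python
import math

def matvec(m, v):
    """Multiply a 2x2 matrix by a 2-vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]

def svd_2x2(m):
    """Return (us, sigmas, vs) for an invertible 2x2 matrix, following the text."""
    (p, q), (r, s) = m
    g11, g12, g22 = p * p + r * r, p * q + r * s, q * q + s * s  # entries of M^T M
    trace, det = g11 + g22, g11 * g22 - g12 * g12
    lam1 = (trace + math.sqrt(max(trace * trace - 4.0 * det, 0.0))) / 2.0
    if abs(g12) > 1e-15:
        v1 = [g12, lam1 - g11]  # solves (M^T M - lam1 I) v = 0
    else:
        v1 = [1.0, 0.0] if g11 >= g22 else [0.0, 1.0]  # M^T M is already diagonal
    length = math.hypot(v1[0], v1[1])
    v1 = [v1[0] / length, v1[1] / length]
    v2 = [-v1[1], v1[0]]  # in the plane, the other eigenvector is perpendicular
    us, sigmas = [], []
    for v in (v1, v2):
        mv = matvec(m, v)
        sigma = math.hypot(mv[0], mv[1])  # sigma_i = |M v_i|
        sigmas.append(sigma)
        us.append([mv[0] / sigma, mv[1] / sigma])  # u_i = M v_i / sigma_i
    return us, sigmas, [v1, v2]

M = [[1.0, 1.0], [0.0, 1.0]]  # the shear example
us, sigmas, vs = svd_2x2(M)

# Rebuild M as sigma_1 u_1 v_1^T + sigma_2 u_2 v_2^T.
rebuilt = [[sum(sigmas[k] * us[k][i] * vs[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
print(sigmas)   # approximately [1.618, 0.618]: the golden ratio and its reciprocal
print(rebuilt)  # approximately [[1, 1], [0, 1]]
```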
In practice, this is not the procedure used to find the singular value decomposition of a matrix since it is not particularly efficient or well-behaved numerically.
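Let's now look at the singular matrix
$$M = \begin{bmatrix} 1 & 1 \\ 2 & 2 \end{bmatrix}.$$
We can get $M^TM = \begin{bmatrix} 5 & 5 \\ 5 & 5 \end{bmatrix}$, whose eigenvalues are $\lambda_1 = 10$ and $\lambda_2 = 0$; the corresponding unit eigenvectors are $\mathbf{v}_1 = \frac{1}{\sqrt{2}}(1, 1)^T$ and $\mathbf{v}_2 = \frac{1}{\sqrt{2}}(-1, 1)^T$.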
The geometric effect of this matrix is the following:
In this case, the second singular value is zero so that we may write:
$$M = \mathbf{u}_1\sigma_1\mathbf{v}_1^T.$$
In other words, if some of the singular values are zero, the corresponding terms do not appear in the decomposition for $M$. In this way, we see that the rank of $M$, which is the dimension of the image of the linear transformation, is equal to the number of non-zero singular values.
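Counting non-zero singular values gives a simple rank test. The sketch below does this for $2 \times 2$ matrices, using a rank-one matrix chosen for illustration; the function names and the relative tolerance are assumptions, and in floating point "non-zero" has to mean "above a small threshold".

```python
import math

def singular_values_2x2(m):
    """Singular values of a 2x2 matrix: square roots of the eigenvalues of M^T M."""
    (p, q), (r, s) = m
    g11, g12, g22 = p * p + r * r, p * q + r * s, q * q + s * s
    trace, det = g11 + g22, g11 * g22 - g12 * g12
    disc = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
    lam1 = (trace + disc) / 2.0
    lam2 = max((trace - disc) / 2.0, 0.0)  # clamp tiny negative rounding errors
    return math.sqrt(lam1), math.sqrt(lam2)

def numerical_rank(m, rel_tol=1e-12):
    """Number of singular values that are non-negligible relative to the largest."""
    s1, s2 = singular_values_2x2(m)
    if s1 == 0.0:
        return 0
    return sum(1 for s in (s1, s2) if s > rel_tol * s1)

singular = [[1.0, 1.0], [2.0, 2.0]]  # second row is twice the first
shear = [[1.0, 1.0], [0.0, 1.0]]

print(singular_values_2x2(singular))  # (sqrt(10), 0.0)
print(numerical_rank(singular))       # 1
print(numerical_rank(shear))          # 2
```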
Singular value decompositions can be used to represent data efficiently. Suppose, for instance, that we wish to transmit the following image, which consists of an array of black or white pixels.
Since there are only three types of columns in this image, as shown below, it should be possible to represent the data in a more compact form.
We will represent the image as a matrix $M$ in which each entry is either a 0, representing a black pixel, or 1, representing white. As such, there are $15 \times 25 = 375$ entries in the matrix. If we perform a singular value decomposition on $M$, we find there are only three non-zero singular values
Therefore, the matrix may be represented as
$$M = \mathbf{u}_1\sigma_1\mathbf{v}_1^T + \mathbf{u}_2\sigma_2\mathbf{v}_2^T + \mathbf{u}_3\sigma_3\mathbf{v}_3^T.$$
This means that we have three vectors $\mathbf{v}_i$, each of which has 15 entries, three vectors $\mathbf{u}_i$, each of which has 25 entries, and three singular values $\sigma_i$. This implies that we may represent the matrix using only $3 \times (15 + 25 + 1) = 123$ numbers rather than the 375 that appear in the matrix. In this way, the singular value decomposition discovers the redundancy in the matrix and provides a format for eliminating it.
Why are there only three non-zero singular values? Remember that the number of non-zero singular values equals the rank of the matrix. In this case, we see that there are three linearly independent columns in the matrix, which means that the rank is 3.
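The storage saving is simple bookkeeping: keeping $k$ terms of the decomposition costs $k$ left vectors, $k$ right vectors, and $k$ singular values. A minimal sketch, with hypothetical function names and a 25-by-15 image of rank 3 assumed for the example:

```python
def full_storage(rows, cols):
    """Numbers needed to store every entry of the matrix."""
    return rows * cols

def truncated_svd_storage(rows, cols, k):
    """Numbers needed for k left vectors, k right vectors and k singular values."""
    return k * (rows + cols + 1)

print(full_storage(25, 15))              # 375
print(truncated_svd_storage(25, 15, 3))  # 123
```

The truncated form only pays off when $k(m + n + 1) < mn$, so it helps for low-rank or nearly low-rank matrices and not for arbitrary ones.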
The previous example showed how we can exploit a situation where many singular values are zero. Typically speaking, the large singular values point to where the interesting information is. For example, imagine we have used a scanner to enter this image into our computer. However, our scanner introduces some imperfections (usually called "noise") in the image.
We may proceed in the same way: represent the data using a matrix and perform a singular value decomposition. We find the following singular values:
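Clearly, the first three singular values are the most important so we will assume that the others are due to the noise in the image and make the approximation
$$M \approx \mathbf{u}_1\sigma_1\mathbf{v}_1^T + \mathbf{u}_2\sigma_2\mathbf{v}_2^T + \mathbf{u}_3\sigma_3\mathbf{v}_3^T.$$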
Noise also arises anytime we collect data: no matter how good the instruments are, measurements will always have some error in them. If we remember the theme that large singular values point to important features in a matrix, it seems natural to use a singular value decomposition to study data once it is collected. As an example, suppose that we collect some data as shown below:
We may take the data and put it into a matrix:
With one singular value so much larger than the other, it may be safe to assume that the small value of $\sigma_2$ is due to noise in the data and that this singular value would ideally be zero. In that case, the matrix would have rank one, meaning that all the data lies on the line defined by $\mathbf{u}_1$.
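To make this concrete, here is a sketch with made-up measurements that lie close to the line $y = 2x$ (the data values are hypothetical, not those of the original example). The data is arranged as a 2-by-$n$ matrix whose rows are the $x$ and $y$ coordinates, so its singular values come from the eigenvalues of the $2 \times 2$ matrix $AA^T$.

```python
import math

# Hypothetical measurements that lie close to the line y = 2x.
xs = [-1.0, 0.7, 0.0, 0.5, -1.3, 1.0]
ys = [-1.95, 1.36, 0.02, 0.97, -2.56, 1.98]

# For the 2-by-n data matrix A with rows xs and ys, A A^T has these entries.
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
syy = sum(y * y for y in ys)

trace, det = sxx + syy, sxx * syy - sxy * sxy
disc = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
lam1, lam2 = (trace + disc) / 2.0, max((trace - disc) / 2.0, 0.0)
sigma_1, sigma_2 = math.sqrt(lam1), math.sqrt(lam2)

# u_1 is the eigenvector of A A^T for lam1; its direction is the fitted line.
u1 = (sxy, lam1 - sxx)
slope = u1[1] / u1[0]

print(sigma_1, sigma_2)  # sigma_2 is much smaller than sigma_1
print(slope)             # close to 2
```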
This brief example points to the beginnings of a field known as principal component analysis (PCA), a set of techniques that uses singular values to detect dependencies and redundancies in data.
In a similar way, singular value decompositions can be used to detect groupings in data, which explains why singular value decompositions are being used in attempts to improve Netflix’s movie recommendation system. Ratings of movies you have watched allow a program to sort you into a group of others whose ratings are similar to yours. Recommendations may be made by choosing movies that others in your group have rated highly.
[1]: Kirk Baker, Singular Value Decomposition Tutorial, https://www.ling.ohio-state.edu/~kbaker/pubs/Singular_Value_Decomposition_Tutorial.pdf.
[2]: Singular Value Decomposition (SVD) tutorial, http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm.
[3]: We Recommend a Singular Value Decomposition, http://www.ams.org/samplings/feature-column/fcarc-svd.
[4]: Computation of the Singular Value Decomposition, http://www.cs.utexas.edu/users/inderjit/public_papers/HLA_SVD.pdf.
[5]: CMU, SVD Tutorial, https://www.cs.cmu.edu/~venkatg/teaching/CStheory-infoage/book-chapter-4.pdf.
[6]: Wiki: Singular value decomposition, https://en.wikipedia.org/wiki/Singular_value_decomposition.
[7]: Chapter 6 Eigenvalues and Eigenvectors, http://math.mit.edu/~gs/linearalgebra/ila0601.pdf.
[8]: Expressing a matrix as an expansion of its eigenvalues, http://math.stackexchange.com/questions/331826/expressing-a-matrix-as-an-expansion-of-its-eigenvalues.
[9]: Wiki: Eigenvalues and eigenvectors, https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors.
Original post: http://www.cnblogs.com/glory-of-family/p/5645554.html