This post is dedicated to figuring out why SVD works when it does. The image below is a Java applet that demonstrates the use of SVD for image compression. Click on the image and use the arrow keys on your keyboard to change the rank of the pixel matrix. Up/down keys change the rank by a factor of two, left/right keys increase and decrease the rank by one. The original image is a 256x256 pixel matrix. It is still recognizable when we reduce the rank to 16. The comments below are notes I took while I was trying to understand SVD.
Latent Semantic Indexing
Latent Semantic Indexing is an idea to improve document retrieval performance by making the system less sensitive to word choice. "LSI explicitly represents terms and documents in a rich, high-dimensional space, allowing the underlying (``latent''), semantic relationships between terms and documents to be exploited during searching." A large incidence matrix is constructed where rows correspond to words, columns to documents, and entries to the number of times each word occurs in each document. The entries are rescaled according to some weighting scheme. Then (this is the important step) the matrix is reduced in dimensionality using Singular Value Decomposition. Now why should this work?
Documents, terms, and queries are represented as points in the same reduced-dimension term-document space. Queries are just mini documents. What about terms? And what does it mean for a document to be a "point" in the reduced term-document space when it was a whole column in the original incidence matrix? We need to understand SVD first to figure out the rest. The oft-cited reference seems to be:
http://www.cs.utk.edu/~library/TechReports/1994/ut-cs-94-270.ps.Z
Using Linear Algebra for Intelligent Information Retrieval. Michael W. Berry, Susan T. Dumais, and Gavin W. O'Brien, December 1994. Published in SIAM Review 37:4 (1995), pp. 573-595.
Link: http://www.cs.utk.edu/~berry/lsi++/node8.html
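To make the mechanics concrete, here is a minimal sketch in Python/NumPy with a made-up five-word, four-document toy matrix (the vocabulary, the counts, and the choice to skip any weighting are all just for illustration). Documents and queries are projected into the reduced space with U_k', the same recipe summarized further down in this post.

import numpy as np

# Toy term-document matrix: rows are terms, columns are documents (raw counts, no weighting).
A = np.array([
    [2, 0, 1, 0],   # "ship"
    [1, 0, 0, 0],   # "boat"
    [0, 1, 0, 2],   # "ocean"
    [0, 1, 1, 0],   # "voyage"
    [0, 0, 1, 1],   # "trip"
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk = U[:, :k]                    # keep the k directions with the largest singular values

docs_k = Uk.T @ A                # every document becomes a point with k coordinates
q = np.array([1, 0, 1, 0, 0], dtype=float)   # a query is a mini document: a term-count vector
q_k = Uk.T @ q                   # project the query into the same reduced space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank the documents by cosine similarity to the query in the reduced space.
print([round(cosine(q_k, docs_k[:, j]), 3) for j in range(docs_k.shape[1])])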
Latent Semantic Analysis
Latent Semantic Analysis (LSA) is a mathematical/statistical technique for extracting and representing the similarity of meaning of words and passages by analysis of large bodies of text. It uses singular value decomposition, a general form of factor analysis, to condense a very large matrix of word-by-context data into a much smaller (but still large, typically 100-500 dimensional) representation (Deerwester, Dumais, Furnas, Landauer & Harshman, 1990). The right number of dimensions appears to be crucial; the best values yield up to four times as accurate simulation of human judgments as ordinary co-occurrence measures.
Link:
http://lsa.colorado.edu
Singular Value Decomposition I
From wikipedia: "The geometric content of the SVD theorem can thus be summarized as follows: for every linear map T: K^n → K^m one can find orthonormal bases of K^n and K^m such that T maps the i-th basis vector of K^n to a non-negative multiple of the i-th basis vector of K^m, and sends the left-over basis vectors to zero. With respect to these bases, the map T is therefore represented by a diagonal matrix with non-negative real diagonal entries."
What does this mean when the matrix is an incidence matrix of words and documents?
Here are a bunch of statements about SVD from various sources:
- The SVD represents an expansion of the original data in a coordinate system where the covariance matrix is diagonal.
- Performing PCA is the equivalent of performing Singular Value Decomposition (SVD) on the covariance matrix of the data.
- The ith singular value indicates the amount of variation along the ith axis.
Link:
http://en.wikipedia.org/wiki/Singular_value_decomposition
Trefethen Book
This book seems to have a good explanation and chapters 4 and 5 are available online. Basically, a matrix can be identified with the hyper-ellipse it takes the unit sphere to. The unit vectors along the principal axes of the ellipse form the columns of the U matrix (left singular vectors), their magnitudes form the diagonal S matrix (singular values), and their pre-images on the unit sphere form the columns of the V matrix (right singular vectors). The action of any matrix can be summarized as rotate-stretch-rotate. A=USV* means first rotate the pre-images to align with the axes (V* is the conjugate transpose of V, hence its inverse since V is unitary; V, like any matrix, takes the unit vectors to its columns), then stretch the axes by amounts specified in S, then rotate to the target orientation using U. Both U and V are unitary matrices (their rows and columns form orthonormal bases). U is defined to be that way, but I am not sure why V has to be.
One confusion is that there is more than one way to define the dimensions of U, V, S. The standard is to define U and V as square while S has the same dimensions as A. Another is to define S as square and adjust U and V accordingly. There are only rank(A) non-zero entries on the diagonal of S, so the rest of the values are padded with zeros. We can further reduce S by throwing away the smallest values, in which case we get a lower rank matrix which is the closest approximation (by several measures) to A.
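A quick NumPy illustration of the two shape conventions and of truncation; the matrix here is just random, to show the shapes and the error behaviour:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

# Convention 1: U and V square, S has the same shape as A (padded with zeros).
U, s, Vt = np.linalg.svd(A, full_matrices=True)      # U: 6x6, s: 4 values, Vt: 4x4
S = np.zeros_like(A)
np.fill_diagonal(S, s)
assert np.allclose(U @ S @ Vt, A)

# Convention 2: S square, U and V trimmed accordingly ("economy" SVD).
U2, s2, Vt2 = np.linalg.svd(A, full_matrices=False)  # U2: 6x4, s2: 4, Vt2: 4x4
assert np.allclose(U2 @ np.diag(s2) @ Vt2, A)

# Throwing away the smallest singular values gives the closest rank-k matrix
# in the least-squares (Frobenius) sense.
k = 2
Ak = U2[:, :k] @ np.diag(s2[:k]) @ Vt2[:k, :]
print("rank:", np.linalg.matrix_rank(Ak), " error:", np.linalg.norm(A - Ak))
print("error predicted from the dropped singular values:", np.sqrt(np.sum(s2[k:] ** 2)))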
SVD is different from eigendecomposition, since eigenvectors do not have to be orthogonal. (A matrix takes its eigenvectors to multiples of themselves.) When the eigenvectors are orthogonal and the eigenvalues are non-negative, as for a symmetric positive semi-definite matrix, the eigendecomposition is just a special case of SVD.
Now that we have the geometric interpretation down pat, all that is left is to figure out what goes on when we have a covariance matrix or a data matrix. These matrices are not very intuitive to think about as geometric transforms.
Link:
http://web.comlab.ox.ac.uk/oucl/work/nick.trefethen/text.html
Data Matrices
Coming back to the question of why SVD makes sense for data matrices (matrices whose columns are high dimensional data points, e.g. documents expressed as word frequency vectors, or experiments expressed as gene expression vectors). At first visualizing the geometric interpretation of the SVD (the unit sphere being transformed into a hyper-ellipsoid) on an arbitrary data matrix proved difficult. What is the action of a data matrix? According to the standard transform concept, a matrix transforms vectors in its row space into vectors in its column space. This meant a term-document matrix transforms a term into a document, or a gene matrix transforms a gene into an experiment, which made no sense.
The great Trefethen book starts with one simple observation: Ax is a linear combination of the columns of A. This is an alternative viewpoint to the usual one: Ax is a transformation of x into a new vector. But it makes certain things easier to understand. For example, a matrix transforms the standard unit vectors into its column vectors. The alternative reading is that a unit vector selects a column of the matrix. This may sound like hair splitting, but let's look at its consequence for data matrices. Each standard unit vector (do not interpret these as entities in the row space like a new gene or a new term, but just as a selector) picks a column, i.e. one of the data points. Thus the images of these unit vectors give us the cloud of data points (documents or gene experiments) in their multidimensional space. Back to the SVD concept of the unit sphere going into the ellipsoid. The cloud of data points all lie at the edges of our ellipsoid (because all the unit vectors they come from lie at the edges of the unit sphere). They determine its outline. We would expect the extent of the ellipsoid along a certain direction to be related to the spread of our data along that direction.
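The selector reading in a couple of lines of NumPy (a tiny made-up matrix):

import numpy as np

A = np.array([[3., 1., 0.],      # toy data matrix: columns are the data points
              [0., 2., 1.]])
e1 = np.array([0., 1., 0.])      # the second standard basis (unit) vector
print(A @ e1)                    # picks out the second column, i.e. the second data point
print(np.allclose(A @ e1, A[:, 1]))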
What does reduced SVD do? Well, imagine an ellipsoid shaped like a convex lens, i.e. one of its dimensions is dwarfed by the others. Reducing the SVD will get rid of that dimension and flatten the thing into a disk. Why would you have a dwarf dimension to start with? Well, maybe one of the genes is expressed very little compared to others, or one of the document terms is seen very rarely. In that case you may not want the elimination, so maybe you should rescale that dimension. Another reason may be several terms or genes that are correlated. In the perfect correlation case our ellipse would lie on the 45 degree hyperplane between those two axes, so both terms get expressed as a single dimension. I imagine this is the more relevant effect everybody is after. Manning's document-term example has two docs that have no words in common, so originally they look orthogonal with 0 similarity. But words from both documents co-occur frequently in other documents, so SVD collapses them and in the modified space the two docs look pretty similar.
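Here is a hedged mock-up of that effect (the matrix is invented, not Manning's actual example, but it has the same structure): d1 and d2 share no words, so their raw cosine similarity is zero, yet their words co-occur in three other documents, and after a rank-2 reduction the two documents collapse onto the same point while a document about an unrelated term stays orthogonal.

import numpy as np

# Rows = terms t1, t2, t3; columns = documents.
# d1 uses only t1, d2 uses only t2 (no words in common),
# d3-d5 use t1 and t2 together, d6-d7 are about the unrelated term t3.
A = np.array([
    [1, 0, 1, 1, 1, 0, 0],   # t1
    [0, 1, 1, 1, 1, 0, 0],   # t2
    [0, 0, 0, 0, 0, 1, 1],   # t3
], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print("raw cos(d1,d2):", cosine(A[:, 0], A[:, 1]))      # 0.0: no shared words
print("raw cos(d1,d6):", cosine(A[:, 0], A[:, 5]))      # 0.0 as well

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
D = U[:, :k].T @ A                                      # all documents in the rank-2 space

print("reduced cos(d1,d2):", round(cosine(D[:, 0], D[:, 1]), 3))  # ~1.0: collapsed together
print("reduced cos(d1,d6):", round(cosine(D[:, 0], D[:, 5]), 3))  # ~0.0: still a different topic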
SVD on Images
SVD can also be applied to images (represented as a matrix of pixels) to compress them. This is very different from data matrices where each column represents an individual object of interest. The reason SVD works in the image application is that when one eliminates the smallest singular values and computes the reduced rank matrix, the result is the closest possible to the original for that rank (the Eckart-Young theorem). It is closest under several definitions of close, but the relevant one for images is the least squares of individual pixel value differences, i.e. the Frobenius norm. Here is an applet to demonstrate:
Use the arrow keys of your keyboard to change the rank of the matrix, left/right goes by one, up/down goes by factors of two. You may have to click on the image first.
http://home.ku.edu.tr/~dyuret/java/svd.html
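For anyone without the applet handy, the same experiment in a few lines of NumPy; the "image" below is a smooth synthetic stand-in, so swap in any real grayscale picture as a 2-D array:

import numpy as np

# Smooth synthetic stand-in for a 256x256 grayscale image.
x = np.linspace(0, 1, 256)
img = np.sin(8 * np.pi * np.outer(x, x)) + np.outer(x, 1 - x)

U, s, Vt = np.linalg.svd(img, full_matrices=False)

for k in (256, 64, 16, 4):
    approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # closest rank-k matrix in least squares
    err = np.linalg.norm(img - approx) / np.linalg.norm(img)
    stored = k * (2 * 256 + 1)                            # numbers kept, vs 256*256 for the original
    print(f"rank {k:3d}: relative error {err:.4f}, numbers stored {stored}")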
The next step is to understand covariance matrices and the statement "PCA is equivalent to performing SVD on the covariance matrix". There is also the relation with clustering. http://crd.lbl.gov/~cding/Spectral talks about the equivalence of K-means clustering and PCA.
Here is another excerpt that tries to explain the relation of SVD, PCA and Eigen decomposition:
Formally: There is a direct relation between PCA and SVD in the case where principal components are calculated from the covariance matrix. If one conditions the data matrix X by centering each column, then X'X = sum gi gi' is proportional to the covariance matrix of the variables gi. Diagonalization of X'X yields V, whose columns are the principal components of {gi}. So, the right singular vectors {vk} are the same as the principal components of {gi}. The eigenvalues of X'X are equal to s_k^2, which are proportional to the variances of the principal components. The matrix US then contains the principal component scores, which are the coordinates of the activations in the space of principal components.
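A small numerical check of the excerpt's claims, assuming its convention that the rows of X are observations and each column has been centered (the data here is random and correlated just for the test):

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))   # correlated toy data
X = X - X.mean(axis=0)                                            # center each column

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Eigendecomposition of X'X (proportional to the covariance matrix), sorted descending.
evals, evecs = np.linalg.eigh(X.T @ X)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

print(np.allclose(evals, s ** 2))                     # eigenvalues of X'X = squared singular values
# Right singular vectors match the principal components up to sign.
print(np.allclose(np.abs(evecs), np.abs(Vt.T), atol=1e-8))
# U @ diag(s) gives the principal component scores (coordinates of each observation).
scores = U @ np.diag(s)
print(np.allclose(np.abs(scores), np.abs(X @ evecs), atol=1e-8))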
Some thoughts from last night:
1. I need to go over the proofs, but basically both the Ripley and Trefethen books seem to suggest that SVD and PCA based on the covariance matrix give the same results, and that the SVD solution is more stable.
2. I have been thinking about irrelevant and redundant features. Here are two simple observations: if you add extra dimensions to a data set, (a) distances always increase, (b) separable points stay separable, and non-separable ones may become separable (the x^2 SVM example). Naturally, the opposite is true for decreasing dimensions: sets that are not separable will not become separable when you get rid of dimensions. This argues that discriminant approaches like SVMs will probably do well in the presence of extra dimensions, whereas nearest-neighbor and clustering type approaches will be misled. (The SVM will find a separation along the correct dimensions if one exists, but the random dimensions may prevent it from finding the largest-margin one.)
3. Why would you ever want to decrease the number of dimensions? It obviously doesn't help separability. Some ideas:
- to get better similarity: irrelevant dims may make similar points look different (but then how do you know SVD gives the right solution, maybe the highest variation directions are the noise)
- to uncover correlations: the astronaut text example from manning.
- to fight sparse data: if the data is just the tip of a histogram, sparseness makes the real distribution difficult to estimate (again the astronaut text example). I think in that case the dimensions of the SVD correspond to whole distributions, and the mapping to lower dimensions allows the tip of a histogram to be matched to the right full histogram learnt from more frequent words. (I should look at the dims of the SVD in Peter's data to see what sort of distributions they give.)
4. This may argue that SVD actually performs clustering. Imagine n word distributions which have no intersection (orthogonal). Wouldn't SVD have the first n singular vectors point to the centers of these clusters? (A quick numerical check of this is sketched after this list.)
5. SVD is just one way to project data. If we have a quality metric (like how well similar pairs can be distinguished from dissimilar pairs), one can numerically search for dimensions along which this distinction can be made better. See projection pursuit.
6. Finally our data may be such that each semantic relation is represented along distinct dimensions, rather than by some linear combination of common dimensions. Will any kind of projection work in such a problem?
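Here is the quick check promised in item 4: a sketch with three made-up "topics" whose word distributions have disjoint supports (so they are orthogonal). In this toy setup the first three left singular vectors do line up with the three cluster centers.

import numpy as np

rng = np.random.default_rng(2)

# Three topics over a 12-word vocabulary with disjoint (orthogonal) word distributions.
supports = [list(range(0, 4)), list(range(4, 8)), list(range(8, 12))]
docs = []
for sup in supports:
    for _ in range(30):                        # 30 documents drawn from each topic
        d = np.zeros(12)
        d[sup] = rng.poisson(5.0, size=4)      # word counts only on that topic's words
        docs.append(d)
A = np.array(docs).T                           # columns are documents, rows are words

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Compare the first three left singular vectors with the three cluster centers.
centers = [A[:, 30 * j:30 * (j + 1)].mean(axis=1) for j in range(3)]
for i in range(3):
    u = U[:, i]                                # unit vector, sign is arbitrary
    cos = [abs(u @ c) / np.linalg.norm(c) for c in centers]
    # each u_i lines up with exactly one center (cosine near 1) and is ~0 on the others
    print(f"u{i} vs cluster centers:", [round(x, 2) for x in cos])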
Some observations from looking at the spectral clustering slides by Ding with Alkan yesterday:
1. There is incredible symmetry in the problem. No wonder I am so confused. Take the data matrix. Find cross products between features: that is the covariance matrix. Find cross products between observations: that is the similarity matrix. They are duals of each other. (Guess you have to make the means zero first.)
2. The eigenvectors of the covariance matrix are the same as the singular vectors of the data matrix. I bet the eigenvectors of the similarity matrix are the other singular vectors. Spectral clustering works basically by finding the eigenvectors of the Laplacian matrix of a graph. The Laplacian matrix is the degree matrix minus the adjacency matrix, i.e. the negated adjacency matrix with the degrees on the diagonal. A similarity matrix is sort of a continuous analogue of the adjacency matrix. (At least I am guessing that's how people use spectral clustering.) Interestingly, it is the vectors with the smallest eigenvalues that are of interest in spectral clustering.
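A rough sketch of how I imagine those pieces fit together (two blobs of points, a Gaussian similarity matrix standing in for the adjacency matrix, and the eigenvector of the Laplacian with the second-smallest eigenvalue, the Fiedler vector, used to split them):

import numpy as np

rng = np.random.default_rng(3)
# Two well-separated blobs of 2-D points.
pts = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

# Similarity matrix: a continuous analogue of a graph's adjacency matrix.
d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2)                        # Gaussian similarities
np.fill_diagonal(W, 0)

# Graph Laplacian: degree matrix minus the similarity (adjacency) matrix.
L = np.diag(W.sum(axis=1)) - W

# The eigenvector with the second-smallest eigenvalue splits the graph.
evals, evecs = np.linalg.eigh(L)       # eigenvalues in ascending order
fiedler = evecs[:, 1]
labels = (fiedler > 0).astype(int)
print(labels)                          # the two blobs get the two different labels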
SVD vs PCA:
A = USV'
A'A = VSU'USV' = V(S^2)V'
If the rows of A are the (centered) observations, A'A is proportional to the covariance matrix, and V(S^2)V' is its eigendecomposition. Still would be nice to have a geometric interpretation.
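The identity is easy to check numerically on a random matrix:

import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A.T @ A, Vt.T @ np.diag(s ** 2) @ Vt))   # A'A = V(S^2)V'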
To stop confusing rows and columns this note will summarize what to do to get compressed vectors:
1. A is mxn, columns are objects, rows are attributes. So there are n objects with m attributes. A transforms R^n -> R^m. SVD produces A=USV'. Columns of U are orthonormal vectors in R^m. When reduced, U has r column vectors in R^m, S is rxr, and V has r column vectors in R^n (if not reduced, r = min(m,n)). To reduce an object v from m dimensions to r dimensions we use U'v. To reduce all n objects to the new r-dimensional coordinate system use U'A. We can just use the transpose because the columns of U are orthonormal.
2. B is nxm, rows are objects, columns are attributes. So there are n objects with m attributes. B transforms R^m -> R^n. SVD produces B=USV'. Columns of U are orthonormal vectors in R^n. When reduced, U has r column vectors in R^n, S is rxr, and V has r column vectors in R^m (if not reduced, r = min(m,n)). In fact, since B=A' and (USV')' = VSU', we obtain the same three matrices for A and B, except that U and V change roles. To reduce an object v (row vector) from m dimensions to r dimensions we can use vV. To reduce all objects to r dimensions we can use BV (which is just the transpose of U'A from the last example).
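The two recipes side by side in NumPy, on a small random matrix (r here is just the number of singular values kept):

import numpy as np

rng = np.random.default_rng(5)
m, n, r = 6, 4, 2

# Case 1: columns are objects.
A = rng.standard_normal((m, n))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
reduced_cols = U[:, :r].T @ A            # r x n: every object (column) now has r coordinates
v = A[:, 0]
print(np.allclose(U[:, :r].T @ v, reduced_cols[:, 0]))   # reducing one object = one column of U'A

# Case 2: rows are objects. B = A', and the roles of U and V swap.
B = A.T
Ub, sb, Vbt = np.linalg.svd(B, full_matrices=False)
reduced_rows = B @ Vbt.T[:, :r]          # n x r: every object (row) now has r coordinates
# Same coordinates as case 1 up to the arbitrary signs of the singular vectors.
print(np.allclose(np.abs(reduced_rows), np.abs(reduced_cols.T)))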
I wrote up a little primer on SVD that includes a link to Matlab code I wrote for visualizing the SVD; it can be found here.
Your site provided some useful references.
Nice posts! Informative.
I have a doubt: assume A is a matrix which is the sum of I (identity) and a*b' for some vectors a and b.
What can you say about the eigenvectors and eigenvalues of A'A,
and also about its SVD?
In my comment of Aug 26, 2005 I wrongly claimed that a matrix takes the surface of the unit sphere to the surface of the hyper-ellipse. That is not true in general: the whole unit sphere maps onto the whole hyper-ellipse, but the surfaces are not necessarily mapped to one another. This does not invalidate the rest of my observations about the SVD of the data matrix.
In fact, points on the unit sphere are exactly the unit-length vectors in the row space. When the matrix is multiplied with a unit-length vector, what comes out is a linear combination of its columns where the squares of the coefficients add up to one.