PCA On Large Matrices: You don't need Spark.

The other day I found this post on the Domino Data Science blog that covers calculating a PCA of a matrix with 1 million rows and 13,000 columns. This is pretty big as far as PCA usually goes. They used a Spark cluster on a 16 core machine with 30GB of RAM and it took them 27 hours.

I read up a bit on PCA and realized you can do PCA on large (several billion element) matrices much faster and without using any Big Data tech like Spark by using better algorithms and more RAM. In this blog post I’ll show how I computed the same problem in only 5 minutes using a method called Randomized SVD.

An overview of SVD and PCA

Quick SVD Refresher: SVD factorizes a matrix A into 3 pieces:

A = U V E*

The V matrix is k x k, where k is called the rank. The other two are m x k and k x n. Usually k is much smaller than n, and this makes the U, V, E matrices much smaller than the original, A. For this reason SVD is sometimes used for data compression. SVD has a ton of other technical applications like solving systems of equations, but I won’t go into them her.

Another thing to note is that the E matrix contains the Principal Components of A. Computing PCA is pretty much always done by computing an SVD first, then returning E, so I’ll use SVD and PCA interchangeably in this post as the algorithms are nearly the same. SVD and PCA are some of the biggest, most fundamental algorithms in statistics and data analysis. The default method of doing SVD involves finding eigenvalues and is really slow, as mentioned in the datadominoes blog post.

A technique for Fast PCA on large matrices

I would like to introduce Randomized SVD. It multiplies your original (transposed) m x n matrix by a random m x k matrix to get a kind of “fingerprint” of the matrix, which has the same properties and is much smaller (n x k), and then does a traditional, slow SVD on that print.

Here are the three best resources I have found on the topic:

An algorithm for the Principal Component Analysis of Large Datasets by Halko et al, which describes a technique called Randomized PCA
Facebook’s PCA, implementing the above paper
Memory Efficient PCA Methods for Large Group ICA by Rachakonda et al, basically a study of how to do PCA on large fMRI matrices exactly like the one in the blog post.

For people interested in the background behind the algorithm, there is a bit more math on the Facebook post and a lot more in the Halko et al paper.

These randomized SVD methods are really fast with large matrices. Halko et al got the following results on a 40B element matrix, computed with a 2GB RAM machine:

The top table has timings for keeping the matrix on disk, and the bottom table generated the matrix on the fly, which is the equivalent of having a computer with enough RAM to fit the dataset. They were able to do SVD on 40 billion elements with k=24 in 132 seconds! This is much faster than 27 hours. If you notice, the on-disk times in the first table don’t change much, showing that the computation is RAM-bound.

Key takeaway: get enough RAM, use a Randomized SVD, and you can fly.

A randomized SVD example in python

Here I’ll show how to apply this fast SVD in python, using two different libraries: fbpca, written by Facebook in 2014, and scikit-learn, which added Randomized SVD in 2011 with version 0.11.

Up until last year fbpca was much faster and numerically stable than scikit-learn’s randomized SVD. But this year, scikit-learn implemented numerical stabilisation techniques described in Halko et al and fixed an important performance bug. They were so confident about their fixes they merged the RandomizedPCA algorithm into the main PCA function. If your matrix is large enough, randomized PCA will now automatically kick in when you use the default sklearn.decomposition.pca. You can also do it manually by using the solver=‘randomized’ parameter when creating a sklearn pca object.

These changes were merged into the dev branch on github a year ago and were just published in the 0.18 release. But if just want to test the technique for a prototype and don’t want to bother changing scikit-learn versions, go with fbpca. It’s one file and both libraries have nearly exactly the same implementation.

Here is code that shows how to use both:

from __future__ import print_function
import fbpca
import numpy as np
import numpy.random
import time

from sklearn.utils.extmath import randomized_svd
r = int(1e6)
c = int(1.5e4)
start = time.time()
A = np.random.uniform(size=[r, c])
end = time.time()
print("Done generating test matrix, it took %2.2f seconds." % (end - start))

print("Running pca on matrix of size %d" % (r * c))

for k in [2, 10, 20, 50, 100]:
    print("Using k=", k)
    start = time.time()
    (U, s, Va) = fbpca.pca(A, k=k, raw=True, n_iter=2)
    end = time.time()
    print("fbpca: %2.2f" % (end - start))

    start = time.time()
    randomized_svd(A, k, n_iter=2)
    # or sklearn_pca = pca(k, copy=False, whiten=False,
    # svd_solver='randomized', iterated_power=2)
    # sklearn_pca.fit(A)
    end = time.time()
    print("sklearn: %2.2f" % (end - start))

As mentioned above, you could also use sklearn.pca’s pca.fit() function, but it is a bit slower because it computes a couple of other things, like explained_variance fractions.

I used a matrix here of 1 million rows and 15 thousand columns, roughly the same dimensions as the matrix in the datadominoes blog post. I ran this on a 244GB RAM r3.8xlarge ec2 spot instance for $0.70 an hour.

This is overkill as the original 15B samples are probably 32 bit floating point and would then fit in 60GB of RAM, but I had the instance time left over from a previous task.

In theory, you can scale this approach pretty high without changing a thing as AWS has instances that go up to ~2TB of RAM. That one’s price is currently $13, which looks like a good deal in exchange for having PCA results in under an hour without any more programming. The table below summarizes the results I got from the above script:

method	k	Time (seconds)
sklearn	2	33.3
sklearn	10	28.3
sklearn	20	36.6
sklearn	50	54.4
sklearn	100	90.7
fbpca	2	25.5
fbpca	10	31.7
fbpca	20	28.7
fbpca	50	49.9
fbpca	100	102.7

The reason for the slightly slower sklearn run times is that sklearn plays it slightly numerically safer with its default settings. Overall, these times compares very favorably with the original blog post: you can do a top 20 component PCA on a 15B dense matrix in just 40 seconds if your data fits in RAM.

Errors

The method we’re using here in an approximation. The original paper includes some thorough error bounds, and others have investigated the error. The n_iter parameter changes how many “passes” of matrix multiplication the algorithm does, increasing the running time but getting more accurate results. Empirical results have shown that n_iter beyond 6 does not go very far, and around 2 is usually fine.

Conclusion

I introduced Randomized SVD, which is becoming the de facto standard for PCA on bigger matrices and allows us to scale to heretofore very large matrices. I showed some resources for understanding Randomized SVD, and two ways to use randomized SVD in python. I timed both libraries on a 15B element matrix and got sub-minute results.

Footnote: The future with GPUs

If for some reason you have a need for frequently doing SVD on >10B elements and need a faster implementation than shown here, you should implement randomized SVD on GPUs, possibly with theano or a similar library. Randomized SVD’s computational bottleneck is matrix multiplications, which is the best kind of computation to speed up with GPUs.

I tried using Berkeley’s BIDMach, which is a GPU computation platform and (non-randomized) SVD on a 1.5B element matrix with k=50 in 13 seconds (on a g2.8xlarge). Then I ran into a bug where you can’t have more than 2B elements (Long.max_num) in a matrix because all the matrix sizes are int, so I stopped investigating there, but there’s still a lot of room for GPU speedups.