The other day I found this post on the Domino Data Science blog that covers calculating a PCA of a matrix with 1 million rows and 13,000 columns. This is pretty big as far as PCA usually goes. They used a Spark cluster on a 16 core machine with 30GB of RAM and it took them 27 hours.
I read up a bit on PCA and realized you can do PCA on large (several billion element) matrices much faster and without using any Big Data tech like Spark by using better algorithms and more RAM.
Continue reading