Amédée d'Aboville

PCA On Large Matrices: You don't need Spark.

October 11, 2016 · 7 minute read · Tags: data science, pca, spark

The other day I found a post on the Domino Data Science blog about calculating a PCA of a matrix with 1 million rows and 13,000 columns, which is pretty big as far as PCA usually goes. They ran Spark on a 16-core machine with 30GB of RAM, and it took them 27 hours. I read up a bit on PCA and realized you can do PCA on large (several-billion-element) matrices much faster, and without any Big Data tech like Spark, by using better algorithms and more RAM.
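To make "better algorithms" a bit more concrete, here is a minimal sketch of one candidate: randomized PCA built on a randomized range finder with a few power iterations, written in plain NumPy. This is an assumption on my part about what such an algorithm could look like, not necessarily the method the full post uses. The function name `randomized_pca` and its parameters (`n_oversamples`, `n_iter`, `seed`) are hypothetical names chosen for illustration. For scale, a 1,000,000 × 13,000 matrix has 13 billion entries, roughly 52 GB in float32, which is where the "more RAM" part comes in.

```python
import numpy as np


def randomized_pca(X, n_components, n_oversamples=10, n_iter=4, seed=0):
    """Approximate the top principal components of X (rows = samples).

    Illustrative sketch only: names and defaults are assumptions,
    not taken from the original post.
    """
    rng = np.random.default_rng(seed)

    # Center the columns. This makes a copy; for a matrix that barely
    # fits in RAM you would center in place instead.
    mean = X.mean(axis=0)
    Xc = X - mean

    # Project onto a small random subspace (k columns instead of d).
    k = n_components + n_oversamples
    Q = rng.standard_normal((X.shape[1], k))
    Q, _ = np.linalg.qr(Xc @ Q)

    # Power iterations sharpen the subspace; QR keeps it well conditioned.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(Xc.T @ Q)
        Q, _ = np.linalg.qr(Xc @ Q)

    # Exact SVD of the small (k x d) projected matrix.
    B = Q.T @ Xc
    _, s, Vt = np.linalg.svd(B, full_matrices=False)

    components = Vt[:n_components]
    explained_variance = s[:n_components] ** 2 / (X.shape[0] - 1)
    return components, explained_variance, mean
```

The expensive steps are a handful of passes of `Xc @ Q` and `Xc.T @ Q`, each a dense matrix product against only `k` columns, so the cost scales with the number of components you want rather than with a full decomposition of the 13,000-column matrix.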
