Databricks, the startup focused on commercializing the popular Apache Spark data-processing framework, has used Spark to crush a benchmark record previously set using Hadoop MapReduce. The company says it’s a misconception that Spark is significantly faster than MapReduce only for datasets that fit in a cluster’s memory, and the fact that this test ran entirely on disk helps to prove the point.
Using 206 machines and nearly 6,600 cores on the Amazon Web Services cloud, Databricks completed the Daytona GraySort test, which involves sorting 100 terabytes of data, in just 23 minutes. The previous record was set by Yahoo, which used a 2,100-node Hadoop cluster with more than 50,000 cores to complete the test (albeit on 102.5 terabytes) in 72 minutes.
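To put those figures side by side, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above (data size, run time and machine count); the variable and function names are illustrative, and per-node throughput is a simplification that ignores differences in hardware, core counts and the slightly different dataset sizes.

```python
# Back-of-the-envelope comparison using only the figures quoted in the article.
# Names here are illustrative, not taken from any benchmark tooling.

def throughput_tb_per_min(terabytes, minutes):
    """Aggregate sort throughput in terabytes per minute."""
    return terabytes / minutes

# Databricks / Spark run: 100 TB in 23 minutes on 206 machines.
spark = {"tb": 100.0, "minutes": 23, "nodes": 206}
# Yahoo / Hadoop MapReduce run: 102.5 TB in 72 minutes on 2,100 nodes.
hadoop = {"tb": 102.5, "minutes": 72, "nodes": 2100}

spark_rate = throughput_tb_per_min(spark["tb"], spark["minutes"])
hadoop_rate = throughput_tb_per_min(hadoop["tb"], hadoop["minutes"])

# How much faster the whole cluster sorted data.
overall_speedup = spark_rate / hadoop_rate

# How much more data each machine sorted per minute (a crude normalization).
per_node_speedup = (spark_rate / spark["nodes"]) / (hadoop_rate / hadoop["nodes"])

print(f"Spark:  {spark_rate:.2f} TB/min")
print(f"Hadoop: {hadoop_rate:.2f} TB/min")
print(f"Overall speedup:  {overall_speedup:.1f}x")
print(f"Per-node speedup: {per_node_speedup:.1f}x")
```

By this crude measure, the Spark run sorted data roughly three times faster overall while using about a tenth as many machines, which works out to roughly 31 times the throughput per machine.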
To address early concerns about Spark’s ability to reliably handle large-scale datasets, the Databricks team then ran an unofficial test, completing the same benchmark across a petabyte of data, on 190…