Databricks demolishes big data benchmark to prove Spark is fast on disk, too


Databricks, the startup focused on commercializing the popular Apache Spark data-processing framework, has used Spark to crush a benchmark record previously set using Hadoop MapReduce. The company says it’s a misconception that Spark is only significantly faster than MapReduce for datasets that can fit a cluster’s memory, and that this test ran entirely on disk helps to prove that.

Using 206 machines and nearly 6,600 cores on the Amazon Web Services cloud, Databricks completed the Daytona GraySort test, which involves sorting 100 terabytes of data, in just 23 minutes. The previous record was set by Yahoo, which used a 2,100-node Hadoop cluster with more than 50,000 cores to complete the test (albeit on 102.5 terabytes) in 72 minutes.

To address early concerns about Spark’s ability to reliably handle large-scale datasets, the Databricks team then ran an unofficial the test and completed the same benchmark across a petabyte of data, on 190…

View original post 422 more words


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s