Tree algorithms for larger-than-memory data sets

Follow the Data

We came across a couple of interesting papers about decision tree algorithms for data sets that don’t fit into memory. In general, we have been thinking a bit lately about this type of approximate data analysis where various tricks can be used to summarize relevant characteristics of a huge data set. The Mining of Massive Datasets book (free PDF) has some nice material to chew over when it comes to approximate “streaming” algorithms (e g chapter 4).

The first of the “streaming decision tree” methods came to our attention via a post on the Revolution Analytics blog, rxDTree(): a new type of tree algorithm for big data. It links to a paper from 2010, A streaming parallel decision tree algorithm (note: PDF link) by Ben-Haim and Yom-Tov, which describes a tree algorithm that addresses the problem of sorting a very large number of values (to determine split…

View original post 224 more words

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s