In this post we peek under the hood to explore a handy technique used here at BigML for understanding numeric data.
If you are familiar with the MapReduce paradigm, you’ve probably seen the word count example. It’s a simple, distributed way to reduce a large set of words into a more concise summary (a set of unique words and their counts). But what if you want a summary for a set of numeric values? Tracking basic statistics like sum, count, and variance are easy but aren’t enough to really understand the data. A better way is to estimate the data’s underlying distribution.
There are some pretty standard ways to compress a distribution, like histograms and kernel density estimators. But we’re a bit picky. We want a technique with some fancy extras:
- Memory constrained, so we can define the memory available up front. The less memory allocated, the more lossy the compressed distribution.
- Parameter free-ish. Other than constraining the memory footprint, we want the method to adapt itself to the data without any intervention.
- Streaming, so we only need one pass over the data.
- Fast, so we can capture millions of data points in seconds, not minutes or hours.
- Anytime accessibility, so that we can get an estimate of the distribution whenever we want (even while processing mid-stream).
- Merge friendly, so that distributions constructed on disjoint subsets of the data can be combined afterword. This makes parallelization and distributed computing (like MapReduce) easy.
- Robust to ordered data, so that we end up with a decent model of the distribution even if a data stream is sorted or otherwise non-stationary.
That’s quite the wish list. Thankfully, Ben-Haim published a great streaming histogram that does it all. We implemented it as a Clojure/Java library, so with a touch of Incanter magic we can show you a histogram dynamically fitting itself to a stream of data (alcohol content for 2500 Portuguese white wines):
We only allowed the histogram to use 16 bins. While that’s much less memory than we use in real-life, it helps make the dynamic bin allocation easier to see. This is especially true in the next video where we stream the same wine data into the histogram, but this time in sorted order. The histogram starts out with a fine resolution and then is then forced to reduce the detail as more and more data appears:
Some streaming histogram techniques choose their bin locations by peeking the beginning of a data stream. No surprise – these peek methods fail for non-stationary data. Yet Ben-Haim’s histogram manages pretty well.
But wait, there’s more!
Stephen Tyree, Kilian Weinberger et al. extended the histogram to capture information about a second numeric field. This is nice for understanding the correlation between fields and, specifically, helps when building regression trees. This video shows the same sorted wine alcohol data as before, but now the histogram is also tracking the average wine quality:
And there we have it, white wine drinkers prefer more alcohol!
We’ve extended the histograms further so that they can track a categorical field, a set of fields (numeric and categorical), or even a nested histogram. Yep, we put histograms in our histograms. What that really means is a heat map. But a heat map with the same dynamic/streaming properties as the histograms. Here’s an example of a 16×16 bin heat map being built on some census data (illustrating the relationship between age and weekly hours worked):
And that’s our short tour of BigML’s streaming histograms. We’re hoping to open source the library in the near future. So check back soon if you’re a Clojure or Java dev.