We’re happy to announce that we’ve open-sourced our “fancy” streaming histograms. We’ve talked about them before, but now the project has been tidied up and is ready to share.

The histograms are a handy way to compress streams of numeric data. When you want to summarize a stream using limited memory, there are two general options. You can either store a sample of the data in the hope that it is representative of the whole (such as a reservoir sample), or you can maintain summary statistics that are updated as data arrives. The histogram library provides a tool for the latter approach.
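To make the two options concrete, here is a minimal Java sketch of each: a reservoir sample (the classic Algorithm R) and a running mean/variance (Welford's method). This is only an illustration of the general idea, not code from the histogram library; the class names `StreamSummaries`, `Reservoir`, and `RunningStats` are made up for this example.

```java
import java.util.Random;

/** Two ways to summarize a stream in fixed memory. All names here are illustrative. */
final class StreamSummaries {

    /** Option 1: keep a uniform random sample of at most k values (Algorithm R). */
    static final class Reservoir {
        private final double[] sample;
        private final Random rng;
        private long seen = 0;

        Reservoir(int k, long seed) {
            this.sample = new double[k];
            this.rng = new Random(seed);
        }

        void add(double x) {
            seen++;
            if (seen <= sample.length) {
                sample[(int) (seen - 1)] = x;              // fill the reservoir first
            } else {
                long j = (long) (rng.nextDouble() * seen); // uniform index in [0, seen)
                if (j < sample.length) {
                    sample[(int) j] = x;                   // kept with probability k/seen
                }
            }
        }

        int size() { return (int) Math.min(seen, sample.length); }
    }

    /** Option 2: update summary statistics as each value arrives (Welford's method). */
    static final class RunningStats {
        private long n = 0;
        private double mean = 0.0;
        private double m2 = 0.0; // sum of squared deviations from the current mean

        void add(double x) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }

        long count() { return n; }
        double mean() { return mean; }
        double variance() { return n > 1 ? m2 / (n - 1) : 0.0; } // sample variance
    }

    public static void main(String[] args) {
        Reservoir reservoir = new Reservoir(100, 42L);
        RunningStats stats = new RunningStats();
        Random source = new Random(7L);
        for (int i = 0; i < 200_000; i++) {
            double x = source.nextGaussian();
            reservoir.add(x);
            stats.add(x);
        }
        System.out.printf("sample size=%d mean=%.4f variance=%.4f%n",
                reservoir.size(), stats.mean(), stats.variance());
    }
}
```

The sample keeps real data points but may miss rare values; the running statistics are exact for mean and variance but say nothing about the shape of the distribution. A histogram sits in between, since it is a summary that still approximates the shape.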
The project is a Clojure/Java library. Since we use a lot of Clojure at BigML, the readme’s examples are all Clojure-oriented. However, Java developers can still find documentation for the histogram’s public methods.
Since the histogram provides an approximation of the data’s original distribution, you can find all the basic stats you’d expect, such as mean, median, and arbitrary percentiles. You can even generate functions for the PDF and CDF. Below we show the library in action (using a Clojure REPL) while exploring a histogram built on 200K samples from a normal distribution (mean of 0, variance of 1).
examples> (def hist (reduce insert! (create) ex/normal-data))
examples> (mean hist)
-0.0026
examples> (median hist)
-0.0009
examples> (variance hist)
0.9985
examples> (sum hist 0)
100077.6513
examples> (density hist 0)
80165.2707
examples> (percentiles hist 0.5 0.95 0.99)
{0.5 -0.0009, 0.95 1.6446, 0.99 2.3263}
examples> (map (cdf hist) [-2 0 2])
(0.0233 0.5004 0.9775)
examples> (map (pdf hist) [-2 0 2])
(0.0558 0.4008 0.0537)
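For readers curious how a bounded-memory histogram can support queries like these, below is a simplified Java sketch of one well-known approach: keep at most a fixed number of bins, and whenever a new point pushes the count over the limit, merge the two closest bins into their weighted mean (the scheme described by Ben-Haim and Tom-Tov). It is a sketch under that assumption, not the library's implementation or API; `SketchHistogram` and its methods are hypothetical names, and only the mean is computed here.

```java
import java.util.Map;
import java.util.TreeMap;

/** Simplified fixed-size streaming histogram. Illustrative only, not the library's code. */
final class SketchHistogram {
    private final int maxBins;
    // bin mean -> number of points in that bin, kept sorted by mean
    private final TreeMap<Double, Long> bins = new TreeMap<>();
    private long total = 0;

    SketchHistogram(int maxBins) {
        if (maxBins < 2) throw new IllegalArgumentException("maxBins must be >= 2");
        this.maxBins = maxBins;
    }

    /** Add one point as its own bin, then shrink back to maxBins if needed. */
    void insert(double x) {
        bins.merge(x, 1L, Long::sum);
        total++;
        if (bins.size() > maxBins) mergeClosestPair();
    }

    /** Replace the two adjacent bins with the smallest gap by their weighted mean. */
    private void mergeClosestPair() {
        Double prev = null;
        Double left = null;
        double bestGap = Double.POSITIVE_INFINITY;
        for (Double key : bins.keySet()) {
            if (prev != null && key - prev < bestGap) {
                bestGap = key - prev;
                left = prev;
            }
            prev = key;
        }
        Double right = bins.higherKey(left);
        long leftCount = bins.remove(left);
        long rightCount = bins.remove(right);
        double merged = (left * leftCount + right * rightCount) / (leftCount + rightCount);
        bins.merge(merged, leftCount + rightCount, Long::sum);
    }

    /** Weighted mean of the bin means; merging preserves it up to rounding. */
    double mean() {
        double sum = 0.0;
        for (Map.Entry<Double, Long> e : bins.entrySet()) {
            sum += e.getKey() * e.getValue();
        }
        return sum / total;
    }

    long count() { return total; }
    int binCount() { return bins.size(); }

    public static void main(String[] args) {
        SketchHistogram hist = new SketchHistogram(32);
        java.util.Random source = new java.util.Random(1L);
        for (int i = 0; i < 200_000; i++) hist.insert(source.nextGaussian());
        System.out.printf("bins=%d count=%d mean=%.4f%n",
                hist.binCount(), hist.count(), hist.mean());
    }
}
```

Because each merge replaces two bins with their weighted mean, the overall mean survives merging, while statistics such as the median or percentiles become approximations whose quality depends on the number of bins.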

The histograms have a few more tricks. Along with the primary variable, the histograms can track information about secondary numeric or categorical variables. We use this feature when growing decision trees, but it could be useful whenever you want to watch for correlation between variables in a streaming context. For example, you could build a histogram on time-of-day for HTTP requests and also track the response time. With that, you might see that evenings show a spike in the number of requests and a corresponding increase in response time.
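As a rough picture of the time-of-day example, the Java sketch below lets each bin carry a running sum of a secondary numeric value alongside its count, so that merging two bins combines both. This is an assumption about how such tracking could work, not a description of the library's internals; `TargetHistogram`, `Bin`, and `targetMean` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Bins over a primary variable that also summarize a secondary one. Names are hypothetical. */
final class TargetHistogram {

    static final class Bin {
        double mean;      // mean of the primary variable (e.g. hour of day)
        long count;       // number of points in the bin
        double targetSum; // sum of the secondary variable (e.g. response time in ms)

        Bin(double mean, long count, double targetSum) {
            this.mean = mean;
            this.count = count;
            this.targetSum = targetSum;
        }

        double targetMean() { return targetSum / count; }
    }

    private final int maxBins;
    private final List<Bin> bins = new ArrayList<>(); // kept sorted by mean

    TargetHistogram(int maxBins) {
        if (maxBins < 2) throw new IllegalArgumentException("maxBins must be >= 2");
        this.maxBins = maxBins;
    }

    /** Insert a (primary, secondary) pair, merging the closest bins if over the limit. */
    void insert(double x, double target) {
        int i = 0;
        while (i < bins.size() && bins.get(i).mean < x) i++;
        bins.add(i, new Bin(x, 1, target));
        if (bins.size() > maxBins) mergeClosestPair();
    }

    private void mergeClosestPair() {
        int best = 0;
        double bestGap = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < bins.size(); i++) {
            double gap = bins.get(i + 1).mean - bins.get(i).mean;
            if (gap < bestGap) {
                bestGap = gap;
                best = i;
            }
        }
        Bin a = bins.get(best);
        Bin b = bins.remove(best + 1);
        long n = a.count + b.count;
        a.mean = (a.mean * a.count + b.mean * b.count) / n; // weighted primary mean
        a.targetSum += b.targetSum;                         // secondary sums simply add
        a.count = n;
    }

    List<Bin> bins() { return bins; }

    public static void main(String[] args) {
        TargetHistogram hist = new TargetHistogram(2);
        hist.insert(9.0, 100.0);   // 9:00, 100 ms
        hist.insert(10.0, 120.0);
        hist.insert(20.0, 300.0);  // 20:00, 300 ms
        hist.insert(21.0, 340.0);
        for (Bin b : hist.bins()) {
            System.out.printf("hour~%.1f requests=%d avg response=%.1f ms%n",
                    b.mean, b.count, b.targetMean());
        }
    }
}
```

Reading the bins in order then gives both the request volume and the average response time for each stretch of the day, which is the kind of side-by-side view the example above describes.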
If you’re interested, there’s a lot more info on the histograms in our previous post and on the project page. As always, feel free to share questions and comments. Thanks!
Clone or fork the project here: