Streaming Histograms for Clojure and Java

We’re happy to announce that we’ve open-sourced our “fancy” streaming histograms. We’ve talked about them before, but now the project has been tidied up and is ready to share.

PDF & CDF for a 32-bin histogram approximating a multimodal distribution.

The histograms are a handy way to compress streams of numeric data. When you want to summarize a stream using limited memory, there are two general options: store a sample of the data in the hope that it is representative of the whole (such as a reservoir sample), or maintain summary statistics that are updated as data arrives. The histogram library provides a tool for the latter approach.
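To make the summary-statistics approach concrete, here is a minimal Java sketch of a fixed-size streaming histogram. It is not the library's implementation: it assumes a simple scheme in which each bin stores a mean and a count, and whenever the bin budget is exceeded the two bins with the closest means are merged. The class and method names (StreamingHistogramSketch, insert, binCount, and so on) are hypothetical and chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative fixed-size streaming histogram; NOT the library's code. */
final class StreamingHistogramSketch {
    private final int maxBins;
    // Bins are kept sorted by mean; each entry is {mean, count}.
    private final List<double[]> bins = new ArrayList<>();

    StreamingHistogramSketch(int maxBins) {
        if (maxBins < 1) throw new IllegalArgumentException("maxBins must be >= 1");
        this.maxBins = maxBins;
    }

    /** Adds one point, then merges the closest pair of bins if over budget. */
    void insert(double x) {
        int i = 0;
        while (i < bins.size() && bins.get(i)[0] < x) i++;
        if (i < bins.size() && bins.get(i)[0] == x) {
            bins.get(i)[1] += 1;                 // exact match: bump the count
        } else {
            bins.add(i, new double[] {x, 1});    // new single-point bin
        }
        if (bins.size() > maxBins) mergeClosestPair();
    }

    private void mergeClosestPair() {
        int best = 0;
        double bestGap = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < bins.size(); i++) {
            double gap = bins.get(i + 1)[0] - bins.get(i)[0];
            if (gap < bestGap) { bestGap = gap; best = i; }
        }
        double[] a = bins.get(best);
        double[] b = bins.remove(best + 1);
        double count = a[1] + b[1];
        a[0] = (a[0] * a[1] + b[0] * b[1]) / count;  // count-weighted mean
        a[1] = count;
    }

    int binCount() { return bins.size(); }

    double totalCount() {
        double n = 0;
        for (double[] bin : bins) n += bin[1];
        return n;
    }

    /** Count-weighted mean of all bins; merging preserves this value. */
    double mean() {
        double sum = 0;
        for (double[] bin : bins) sum += bin[0] * bin[1];
        return sum / totalCount();
    }

    public static void main(String[] args) {
        StreamingHistogramSketch h = new StreamingHistogramSketch(32);
        Random rng = new Random(42);
        for (int i = 0; i < 200_000; i++) h.insert(rng.nextGaussian());
        System.out.println(h.binCount() + " bins summarize "
                + h.totalCount() + " points, mean " + h.mean());
    }
}
```

Memory use is bounded by the bin budget regardless of how many points arrive, which is the trade-off this approach makes against keeping a raw sample. The linear scans keep the sketch short; a sorted map would be a more natural choice for larger bin counts.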

The project is a Clojure/Java library. Since we use a lot of Clojure at BigML, the readme’s examples are all Clojure-oriented. However, Java developers can still find documentation for the histogram’s public methods.

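Since the histogram provides an approximation of the data’s original distribution, you can find all the basic stats you’d expect, such as mean, median, and arbitrary percentiles. You can even generate functions for the PDF and CDF. Below we show the library in action (using a Clojure REPL) while exploring a histogram built on 200K samples from a normal distribution (mean of 0, variance of 1). In the output, sum and density return estimated point counts rather than proportions: (sum hist 0) is roughly half of the 200K points, and dividing either result by the total count gives the corresponding CDF or PDF value.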

examples> (def hist (reduce insert! (create) ex/normal-data))
examples> (mean hist)
-0.0026
examples> (median hist)
-0.0009
examples> (variance hist)
0.9985
examples> (sum hist 0)
100077.6513
examples> (density hist 0)
80165.2707
examples> (percentiles hist 0.5 0.95 0.99)
{0.5 -0.0009, 0.95 1.6446, 0.99 2.3263}
examples> (map (cdf hist) [-2 0 2])
(0.0233 0.5004 0.9775)
examples> (map (pdf hist) [-2 0 2])
(0.0558 0.4008 0.0537)
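For readers curious how statistics like these can be read off a compact set of bins, the Java sketch below computes a mean, an approximate count at or below a point, and an approximate quantile from sorted (mean, count) pairs. It is an illustration under stated assumptions, not the library's method: it assumes half of each bin's count lies on either side of the bin mean and interpolates linearly between neighboring bins. The class name, method names, and the sample bins are all hypothetical.

```java
/** Illustrative statistics over sorted (mean, count) bins; NOT the library's code. */
final class BinStatsSketch {

    /** Count-weighted mean of the bin means. */
    static double mean(double[] means, double[] counts) {
        double sum = 0, n = 0;
        for (int i = 0; i < means.length; i++) {
            sum += means[i] * counts[i];
            n += counts[i];
        }
        return sum / n;
    }

    /** Cumulative count at each bin mean, assuming half a bin lies on each side. */
    static double[] cumulativeAtMeans(double[] counts) {
        double[] c = new double[counts.length];
        double below = 0;
        for (int i = 0; i < counts.length; i++) {
            c[i] = below + counts[i] / 2.0;
            below += counts[i];
        }
        return c;
    }

    /** Approximate number of points <= p (tails are clamped, a simplification). */
    static double countAtOrBelow(double[] means, double[] counts, double p) {
        double[] c = cumulativeAtMeans(counts);
        int last = means.length - 1;
        if (p < means[0]) return 0;
        if (p >= means[last]) return c[last] + counts[last] / 2.0;  // total count
        for (int i = 0; i < last; i++) {
            if (p < means[i + 1]) {
                double t = (p - means[i]) / (means[i + 1] - means[i]);
                return c[i] + t * (c[i + 1] - c[i]);
            }
        }
        throw new IllegalStateException("means must be sorted ascending");
    }

    /** Approximate q-quantile (0 <= q <= 1), inverting the interpolation above. */
    static double quantile(double[] means, double[] counts, double q) {
        double[] c = cumulativeAtMeans(counts);
        int last = means.length - 1;
        double target = q * (c[last] + counts[last] / 2.0);
        if (target <= c[0]) return means[0];
        if (target >= c[last]) return means[last];
        for (int i = 0; i < last; i++) {
            if (target <= c[i + 1]) {
                double t = (target - c[i]) / (c[i + 1] - c[i]);
                return means[i] + t * (means[i + 1] - means[i]);
            }
        }
        throw new IllegalStateException("counts must be positive");
    }

    public static void main(String[] args) {
        double[] means = {-1.5, -0.5, 0.5, 1.5};   // hypothetical bin means
        double[] counts = {10, 40, 40, 10};        // hypothetical bin counts
        System.out.println("mean   = " + mean(means, counts));
        System.out.println("median = " + quantile(means, counts, 0.5));
        System.out.println("n <= 0 = " + countAtOrBelow(means, counts, 0.0));
    }
}
```

Dividing countAtOrBelow by the total count gives a CDF-style proportion, which mirrors the relationship between the sum and cdf results in the REPL session. Tracking the observed minimum and maximum would allow the tails to be interpolated instead of clamped.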
A 64-bin histogram built on (x, y) pairs where x is drawn from normal and y is sine(x).

The histograms have a few more tricks. Along with the primary variable, the histograms can track information about secondary numeric or categorical variables. We use this feature when growing decision trees, but it could be useful whenever you want to watch for correlation between variables in a streaming context. For example, you could build a histogram on time-of-day for HTTP requests and also track the response time. With that, you might see that evenings show a spike in the number of requests and a corresponding increase in response time.
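The time-of-day example can be sketched in Java as follows. The post does not describe how the library stores secondary information, so this sketch makes its own assumption: each bin keeps a running sum of a numeric secondary variable next to its count, and merging two bins adds their sums. All names here (TargetHistogramSketch, averageTargetNear, and the simulated traffic in main) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative histogram that tracks a secondary numeric variable; NOT the library's code. */
final class TargetHistogramSketch {
    private final int maxBins;
    // Sorted by primary mean; each entry is {primaryMean, count, secondarySum}.
    private final List<double[]> bins = new ArrayList<>();

    TargetHistogramSketch(int maxBins) {
        if (maxBins < 1) throw new IllegalArgumentException("maxBins must be >= 1");
        this.maxBins = maxBins;
    }

    /** Adds one (primary, secondary) pair, merging the closest bins if over budget. */
    void insert(double x, double y) {
        int i = 0;
        while (i < bins.size() && bins.get(i)[0] < x) i++;
        if (i < bins.size() && bins.get(i)[0] == x) {
            bins.get(i)[1] += 1;
            bins.get(i)[2] += y;
        } else {
            bins.add(i, new double[] {x, 1, y});
        }
        if (bins.size() > maxBins) mergeClosestPair();
    }

    private void mergeClosestPair() {
        int best = 0;
        double bestGap = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < bins.size(); i++) {
            double gap = bins.get(i + 1)[0] - bins.get(i)[0];
            if (gap < bestGap) { bestGap = gap; best = i; }
        }
        double[] a = bins.get(best);
        double[] b = bins.remove(best + 1);
        double count = a[1] + b[1];
        a[0] = (a[0] * a[1] + b[0] * b[1]) / count;  // count-weighted primary mean
        a[1] = count;
        a[2] += b[2];                                // secondary sums simply add
    }

    int binCount() { return bins.size(); }

    /** Average secondary value in the bin whose primary mean is nearest to x. */
    double averageTargetNear(double x) {
        if (bins.isEmpty()) throw new IllegalStateException("no data inserted");
        double[] best = bins.get(0);
        for (double[] bin : bins) {
            if (Math.abs(bin[0] - x) < Math.abs(best[0] - x)) best = bin;
        }
        return best[2] / best[1];
    }

    public static void main(String[] args) {
        TargetHistogramSketch h = new TargetHistogramSketch(24);
        Random rng = new Random(7);
        for (int i = 0; i < 50_000; i++) {
            double hour = rng.nextDouble() * 24;            // simulated time of day
            boolean evening = hour >= 18 && hour < 22;
            double ms = (evening ? 300 : 120) + rng.nextGaussian() * 20;  // simulated latency
            h.insert(hour, ms);
        }
        System.out.println("avg response near 10h: " + h.averageTargetNear(10));
        System.out.println("avg response near 20h: " + h.averageTargetNear(20));
    }
}
```

Because the secondary sum travels with each bin, the per-bin average stays available after any number of merges, so a rise in response time for evening bins can be spotted without storing individual requests.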

If you’re interested, there’s a lot more info on the histograms in our previous post and on the project page. As always, feel free to share questions and comments. Thanks!

Clone or fork the project here:

https://github.com/bigmlcom/histogram
