At BigML we take advantage of many fantastic open source libraries produced by the Clojure community. We want to do our bit as well, so we’re happy to announce a Clojure library for random sampling.
Perhaps random sampling isn’t the sexiest of topics, but it is a fundamental tool for machine learning and, more generally, statistics. At its core the problem is simple; select N items from M items. As with many subjects, however, the devil is in the details. Do you want to sample with replacement or without replacement? Do you need to seed the sample for repeatable results? Do you need to weight the items so some are more likely to be sampled? Can you fit the original population or the resulting sample in memory?
Our library tries to offer solutions no matter how you answer those questions. For example, depending on your memory situation the library offers three implementations of random sampling (each with its own namespace): simple sampling, reservoir sampling, and stream sampling.
Simple sampling is the best option if your data is small enough to comfortably keep in memory. If that’s not true but the sample you want to take is small, then reservoir sampling is a good choice. If neither the original population nor the sample can reside in memory, then stream sampling is appropriate.
Simple Sampling
To show off a few of the features, we’ll fire up the REPL and walk through some examples. First we’ll explore simple sampling. It keeps the original population in memory and returns a lazy sequence of samples. By default the sample is without replacement, so it’s equivalent to a lazy Fisher-Yates shuffle.
user> (ns demo (:require (bigml.sampling [simple :as simple] [reservoir :as reservoir] [stream :as stream]) (bigml.sampling.test [stream :as stream-test]))) demo> (simple/sample [:apple :orange :banana]) (:banana :apple :orange)
Setting the replace
parameter as true will sample with replacement. Since there is no limit to the number of items that may be taken with replacement from a population, the result will be an infinite length list. So make sure to take
however many samples you need.
demo> (take 6 (simple/sample [:heads :tails] :replace true)) (:heads :tails :tails :heads :heads :tails) demo> (frequencies (take 100 (simple/sample [:heads :tails] :replace true))) {:tails 48, :heads 52}
A sample may be weighted using the weigh
parameter. When given a function that takes an item and produces a non-negative weight, then the resulting sample will be weighted accordingly.
demo> (frequencies (take 100 (simple/sample [:heads :tails] :weigh {:heads 2 :tails 1} :replace true))) {:heads 66, :tails 34}
By default, every call to simple/sample
will result in a new sample order. If you need a sample to be reproducible, just set the seed
parameter to any hashable value.
demo> (simple/sample (range 5)) (2 3 1 0 4) demo> (simple/sample (range 5)) (4 3 0 1 2) demo> (simple/sample (range 5) :seed "foo") (3 1 4 0 2) demo> (simple/sample (range 5) :seed "foo") (3 1 4 0 2)
Reservoir Sampling
Reservoir sampling keeps the sample in memory (hence the ‘reservoir’). However, the original population is streamed through the reservoir so we don’t need to worry about the population size. This makes reservoirs useful when the original population is too large to fit into memory or the overall size of the population is unknown.
To create a sample reservoir, use reservoir/create
and give it the number of samples you desire. The resulting reservoir acts as a collection, so you can simply conj
values into it to build a sample.
demo> (reduce conj (reservoir/create 3 :seed 0) (range 8)) (0 1 7) demo> (reductions conj (reservoir/create 3 :seed 0) (range 8)) (() (0) (0 1) (2 0 1) (3 0 1) (3 0 1) (3 0 1) (3 0 1) (0 1 7))
A collection can also be passed through a reservoir with into
. Alternatively, reservoir/sample
accepts the original population and a reservoir size and returns the final reservoir.
demo> (into (reservoir/create 3 :seed 0) (range 8)) (0 1 7) demo> (reservoir/sample (range 8) 3 :seed 0) (0 1 7)
Like simple sampling, reservoirs support the seed
, replace
, and weight
parameters. Thanks goes out to this blog post and this paper for tips on weighted reservoir sampling.
Stream Sampling
Stream sampling is useful when taking a large sample from a large population. Neither the original population nor the resulting sample are kept in memory. There are a couple of caveats. First, unlike the other sampling techniques, the resulting sample stream will not be in random order but in the order of the original population. So if you need a random ordering, you’ll want to shuffle the sample. The second caveat is that, unlike reservoir sampling, the size of the population must be declared up-front.
To use stream sampling, call stream/sample
with the population, the desired number of samples, and the size of the population. The result is a lazy sequence of samples.
As an example, we take five samples from a population of ten values:
demo> (stream/sample (range) 5 10) (1 2 4 7 9)
As with the other sampling techniques, stream sampling supports replace
and seed
.
demo> (stream/sample (range) 5 10 :replace true :seed 6) (0 2 5 7 7)
If an item isn’t selected as part of a sampling, it’s called out-of-bag. Setting the out-of-bag
parameter to true will return a sequence of the out-of-bag items instead of the sampled items. When paired with seed
, this can be handy for partitioning a population.
demo> (stream/sample (range) 7 10 :seed 0) (0 2 3 5 6 7 9) demo> (stream/sample (range) 7 10 :seed 0 :out-of-bag true) (1 4 8)
The rate
parameter is useful if you want to sample the population at a particular frequency rather than collect a specific sample size. To illustrate, when stream/sample
is given an infinite list of values as the population, the default behavior is to take the requested samples from the expected population. In this case, it means taking exactly two samples from the first ten items in the population:
demo> (stream/sample (range) 2 10) (3 7)
However, when rate
is true each item is considered independently from the rest. Each item has a 20% chance of being selected as part of the sample. Since the original population is infinite, the resulting sample is also infinite. Here are the first 10 items from that infinite sample:
demo> (take 10 (stream/sample (range) 2 10 :rate true)) (0 2 5 9 13 18 19 30 33 36)
Stream sampling does not support weighted samples, however the stream/cond-sample
function provides a way to fine tune a sample. cond-sample
accepts a population followed by pairs of clauses and sample definitions. A clause should be a function that accepts an item and returns either true of false. Following each clause should be a sample definition that describes the sampling behavior when the condition is true.
As an example, we’ll use the well known iris dataset:
demo> (first stream-test/iris-data) [5.1 3.5 1.4 0.2 "Iris-setosa"]
There are 50 instances of each species:
demo> (frequencies (map last stream-test/iris-data)) {"Iris-setosa" 50, "Iris-versicolor" 50, "Iris-virginica" 50}
Let’s say we want to sample all of Iris-setosa, half as many Iris-versicolor, and none of the Iris-virginica. If you knew the population for each class ahead of time, you could use cond-sample
like so:
demo> (def new-sample (stream/cond-sample stream-test/iris-data #(= "Iris-setosa" (last %)) [50 50] #(= "Iris-versicolor" (last %)) [25 50] #(= "Iris-virginica" (last %)) [0 50])) demo> (frequencies (map last new-sample)) {"Iris-setosa" 50, "Iris-versicolor" 25}
If you did not know the class populations ahead of time, a similar sample could be done using rate
. Also, an item that doesn’t satisfy any condition will be left out of the final sample. So Iris-virginica does not need to have its own clause:
demo> (def new-sample (stream/cond-sample stream-test/iris-data #(= "Iris-setosa" (last %)) [1 1 :rate true] #(= "Iris-versicolor" (last %)) [1 2 :rate true])) demo> (frequencies (map last new-sample)) {"Iris-setosa" 50, "Iris-versicolor" 23}
Wrap Up
We’ve left out a few details such as stream/multi-sample
, stream/multi-reduce
, reservoir implementation options, and the configurable random number generator (Marsenne twister or linear congruential). But we don’t want to spoil all the mystery of the official documentation. Nonetheless, you should now have a pretty good idea what the library offers. We hope you’ll find it useful. As always, please feel free to give us feedback and to pitch in with your own contributions!
4 comments