
Programmatically Fill in Missing Values in Your Dataset with WhizzML

For new WhizzML developers, WhizzML’s power as a full-blown functional programming language can sometimes obscure the relationship between WhizzML and the BigML Machine Learning platform. At BigML, we refer to WhizzML as a functional programming language for orchestrating workflows on the BigML platform. In this post we describe an example script from the WhizzML script gallery that fills in missing data values in a BigML Dataset object, to help elucidate how WhizzML and the BigML Machine Learning platform interact.

The BigML developer documentation provides one view into the Machine Learning functions the BigML platform makes available to users. This functionality can be accessed through multiple programming methods, including the BigML REST API, the downloadable BigMLer command line tool, the BigML bindings for all popular programming languages, and now WhizzML functions. There is an important difference between the first three programming methods (REST API, BigMLer, and bindings) and WhizzML: solutions using the first three methods run on the user's own platform, increasing the volume of data and metadata transferred between that platform and the BigML platform. Production WhizzML scripts run on the BigML platform itself, eliminating data transport costs and leveraging BigML's parallelism and performance optimizations for Machine Learning on large datasets.

WhizzML and Flatline

When using WhizzML to orchestrate workflows, you might quickly come up against an additional subtlety: to realize the full potential of WhizzML, WhizzML functions should not themselves process the data in datasets, but only orchestrate execution of BigML Machine Learning functions in the BigML API. However, in your ML application you might need to process dataset data in unique ways. For example, the API to create a BigML Cluster object from a BigML Dataset object includes an argument “default_numeric_value” that lets us specify a single kind of numeric value ("mean", "median", "minimum", "maximum", or "zero"; the first four computed on a per-column basis) to be used to fill all missing values in all columns considered in the clustering operation.

It is easy to conceive of applications where you might need more flexibility in filling missing values in a dataset. We don’t want to do this by processing the data in WhizzML itself, because we couldn’t leverage all the performance benefits the BigML platform provides for handling datasets. This is where we can turn to Flatline in WhizzML. Flatline is a row-oriented processing language for datasets in the BigML platform itself. The BigML Developer tools include a Flatline editor for directly applying Flatline operations to datasets, but we can also use Flatline directly in WhizzML.

The WhizzML script Clean Data Fill in the BigML WhizzML script gallery is an example of how we can use WhizzML and Flatline to fill in missing values in a dataset by using default values supplied in a map to a WhizzML function. We can’t cover all of the Flatline operations and use cases here, so in our example we’ll just show how to apply the Flatline function:

(all-with-defaults <field-designator-0> <field-value-0>
                   <field-designator-1> <field-value-1>
                   ...
                   <field-designator-n> <field-value-n>)

to modify a dataset. We do this by using the built-in WhizzML (flatline …) function and a macro-like structure to fill in the field information:

(flatline "(all-with-defaults @{{fargs}})")

where fargs is a WhizzML list that includes field name/value pairs as sequential entries. Our example script does this by using four WhizzML functions, three of which simply ease the task of specifying default values for the fourth function, which does the real work using Flatline.
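To make the macro expansion concrete, here is a hypothetical sketch; the field names and default values are invented for illustration:

;; Hypothetical fargs list for two fields (names and defaults invented):
;;   fargs = ["Employment Rate" 0.0 "Region" "Unknown"]
;; With that list in scope,
;;   (flatline "(all-with-defaults @{{fargs}})")
;; expands to the Flatline expression string
;;   "(all-with-defaults \"Employment Rate\" 0.0 \"Region\" \"Unknown\")"

The expansion happens in WhizzML before the string is handed to the BigML backend, so the backend only ever sees a plain Flatline expression.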

Specifying Default Values for Missing Dataset Values

Arguably the most burdensome task we have to undertake is building a map of default values to fill in the missing values in a dataset. Three of the four functions in our example WhizzML script, (extract-meta …), (extract-meta-func …), and (generate-configmap …), implement our illustrative approach for doing this. Before discussing these functions, note that the metadata for BigML Dataset objects includes two properties, “input_fields” and “fields”, that provide the metadata items we need to build our default value map. The “input_fields” property is a list of field (column) IDs in the dataset, e.g.:

{ ...

The “fields” property is a dictionary of summary information for each field keyed on the IDs in the “input_fields” property, e.g.:

{ ...
  { ...
   :datatype "double",
   :name "Employment Rate",
   :optype "numeric",
    :summary { ...
    :mean 58.35941,
    :median 58.00162,
    :minimum 29.96302,
    :maximum 83.55616,
     ... }}
    ... }

The first three functions in our script process the “input_fields” and “fields” properties of the input dataset metadata to generate a template map for specifying default values to the function that fills the missing values in the dataset.

The first function, (extract-meta …), is a helper function that accepts the submap for a single field (the value stored under an ID such as :00000 in the “fields” property) as an input parameter:

(define (extract-meta mpi) 
  (let (mpis (get mpi "summary")
        mpos {"mean" (get mpis "mean")
              "median" (get mpis "median")
              "minimum" (get mpis "minimum")
              "maximum" (get mpis "maximum")})
    {"datatype" (get mpi "datatype") 
     "name" (get mpi "name")
     "optype" (get mpi "optype")
     "summary" mpos}))

The function extracts and returns just the contents we need for the corresponding field entry in our default value map:

{:datatype "double",
 :name "Employment Rate",
 :optype "numeric",
 :summary {:mean 58.35941,
           :median 58.00162,
           :minimum 29.96302,
           :maximum 83.55616}}

This map provides the minimum information you might find useful about the type and contents of a column.

The next function, (extract-meta-func …), is a factory function that returns a lambda function suitable for use in a WhizzML (reduce fn {…} […]) call.

(define (extract-meta-func ds)
  (let (fields (get ds "fields"))
    (lambda (mp id)
      (let (mpi (get fields id)
            mpo (extract-meta mpi))
        (assoc mp id mpo)))))

This function creates a closure that captures the contents of the “fields” property of the metadata for the dataset whose ID is supplied as the “ds” input parameter. The returned lambda function (lambda (mp id) …) accepts a partial metadata map “mp” and a column “id” (from the “input_fields” property of the dataset metadata map) as input parameters. It returns a new version of the input map augmented with the submap returned by the (extract-meta …) function for the column specified by “id”.

Our third function (generate-configmap …) just repetitively applies the function returned by (extract-meta-func …) to the dataset metadata to build up a template map for supplying default values to our dataset:

(define (generate-configmap dataset-id)
  (let (ds (fetch dataset-id)
        flds (get ds "input_fields")
        metafn (extract-meta-func ds))
    (reduce metafn {} flds)))

The result is a WhizzML map with one entry per column, each containing the minimum metadata for that column. For each field, we can then add a property “default” to the submap for the field to specify the value that should be plugged into the rows of the dataset with missing values in that column:

 {:datatype "double",
  :name "Employment Rate",
  :optype "numeric",
  :summary {:mean 58.35941,
            :median 58.00162,
            :minimum 29.96302,
            :maximum 83.55616},
  :default 0.0}
    ... }
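One way to fill in those “default” entries programmatically is with a small helper like the following sketch; the helper name set-default is ours, not part of the gallery script:

;; Hypothetical helper (not in the gallery script): set the "default"
;; property for the field with the given id in the template map.
(define (set-default mp id value)
  (assoc mp id (assoc (get mp id) "default" value)))

Applied repeatedly (or via a reduce over the field IDs), this turns the template returned by (generate-configmap …) into the complete default value map.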

Filling Missing Dataset Values with Flatline

Once we have a map that explicitly specifies the default values for the columns of our dataset, we can use the fourth function in the example WhizzML script (fill-missing …) to create a new dataset with all missing values in the source dataset specified by “dataset-id” replaced with the default values in the “dflt-mp” map:

(define (fill-missing dataset-id dflt-mp)
  (let (frdce (lambda (lst itm) 
                (let (dkey (get itm "name")
                      dval (get itm "default"))
                  (append (append lst dkey) dval)))
        fargs (reduce frdce [] (values dflt-mp)))
    (log-info fargs)
    (create-and-wait-dataset {"origin_dataset" dataset-id
                              "all_fields" false
                              "new_fields" [{"fields" (flatline "(all-with-defaults @{{fargs}})")}]})))

This function first declares a function frdce that is used in a WhizzML (reduce …) function to extract a WhizzML list fargs of sequential per-column name-value pairs.

The heart of our example (fill-missing …) function is the WhizzML (create-and-wait-dataset …) function that creates a modified copy of the source dataset with our default values inserted. Referring to the BigML API documentation on the Dataset object arguments for extending a dataset, a false value for the “all_fields” argument specifies that the function should not pass any of the input fields of the source dataset directly to the new dataset. The “new_fields” argument specifies new fields that should be added to the new dataset by using Flatline.

Our example function uses a “new_fields” argument form that includes a WhizzML map, [{“fields” (flatline …)}], which specifies values for all of the fields in the new dataset with a single Flatline expression. The (flatline …) function accepts a single string argument that is passed to the BigML backend at execution time. The string argument “(all-with-defaults @{{fargs}})” in turn incorporates a WhizzML macro form, where fargs is the WhizzML list of sequential per-column name-value pairs defined earlier. When the (flatline …) function is executed, WhizzML expands the string argument with the value of fargs. The resulting string value for the “new_fields” argument is passed to the BigML backend along with the other arguments by the (create-and-wait-dataset …) function. The BigML platform backend then generates the new dataset with the default values inserted, using whatever optimizations it can.
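Putting the pieces together, a hypothetical end-to-end use of the script's functions might look like the sketch below; the dataset ID variable, the field ID "000000", and the default value 0.0 are all placeholders:

;; Hypothetical usage sketch; my-dataset-id, "000000", and 0.0 are placeholders.
(let (template (generate-configmap my-dataset-id)
      ;; add a "default" entry for one field, e.g. zero for a numeric column:
      dflt-mp (assoc template "000000"
                     (assoc (get template "000000") "default" 0.0)))
  (fill-missing my-dataset-id dflt-mp))

In a real script you would of course set a "default" for every field you want filled, not just one.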

A Final Comment

Our WhizzML script is primarily intended as an example of how you can use WhizzML and Flatline to process datasets on the BigML platform backend. In your applications, you may want to compute default values in other ways or perform other data manipulations. For instance, you may want to compute default values on a per-row basis using Flatline rather than on a per-column basis. Most data manipulations can be accomplished by using WhizzML and Flatline, but some computations may be harder to implement than others. We will take up other ways to use WhizzML and Flatline to facilitate Machine Learning tasks in subsequent WhizzML demonstration scripts and blog posts.


Using Anomaly Detectors to Assess Covariate Shift


BigML first discussed some time ago how the performance of a predictive model can suffer when the model is applied to new data generated from a different distribution than the data used to train the model. Machine Learning practitioners commonly identify two types of data variations that can cause problems for predictive models. The first, Covariate Shift, refers to differences between the distributions of the data fields used as predictors in the training and production datasets. The other type of variation, Dataset Shift, denotes changes in the joint distribution of the predictors and the predicted data fields between the training and production datasets. A recent blog post showed how to implement one technique for detecting both types of data shift by using WhizzML.

The introduction of Anomaly Detectors in BigML provides yet another means of detecting, with WhizzML, data shifts that can affect the performance of predictive models. As background for the simple technique we describe next, you can read more about anomaly detection in a previous BigML blog post. In a nutshell, an anomaly detector is an iforest of over-fitted decision trees. Anomalous data items are outliers in the dataset and therefore are isolated at shallower depth in the decision trees. The depth at which an item is classified, compared to the average depth of the decision trees, is converted to an anomaly score that ranges from 0 (least anomalous) to 1 (most anomalous).

BigML provides two anomaly detector functions in WhizzML useful for building a data shift detector:

  • (create-and-wait-anomaly …): We can use this function to build an anomaly detector object from the same training dataset we use to build a predictive model.
  • (create-and-wait-batchanomalyscore …): Once we have built an anomaly detector, we can use this function to apply that anomaly detector to a production dataset.

There are some features of the (create-and-wait-batchanomalyscore …) function that are useful for our purpose.   When this function is applied to the input production dataset, it creates a Batch Anomaly Score object and an output Dataset object that includes every row of the input production dataset object with an added score field containing the anomaly score for that row.  The batch anomaly score function also adds summary metadata to the output dataset metadata that we can use to compute the desired dataset shift measure.

The BigML WhizzML script gallery includes an example Anomaly Shift Estimate script that demonstrates how to use the anomaly detector functions to create a dataset shift measure. In the rest of this post, we describe the component functions in this demonstration WhizzML script. The script can be used as-is, or you can use the component functions as starting points for custom WhizzML scripts.

A Few Helper Functions to get Started

To begin, we recall that predictive models are typically built by learning a model from a training subset of the source data.  The data shift detection script starts with a simple WhizzML function (sample-dataset …) that allows us to select a subset of the training dataset:

(define (sample-dataset dst-id rate oob seed)
  (create-and-wait-dataset {"sample_rate" rate
                            "origin_dataset" dst-id
                            "out_of_bag" oob
                            "seed" seed}))

This minimal helper function primarily illustrates the few parameters one would likely want to use to select a subset of an input dataset. In a WhizzML script customized for your application, you may want to use other parameters of the (create-and-wait-dataset …) function, which is described in the BigML Dataset documentation.

The script also includes a minimal WhizzML helper function (anomaly-evaluation …) to apply an anomaly detector to every row of the production dataset:

(define (anomaly-evaluation anomaly-id dst-id)
  (create-and-wait-batchanomalyscore {"anomaly" anomaly-id
                                      "dataset" dst-id
                                      "all_fields" true
                                      "output_dataset" true }))

Again, just those parameters of the (create-and-wait-batchanomalyscore …) function needed to apply the “anomaly” detector with anomaly-id to an input “dataset” with dst-id are used. Specifying both the “output_dataset” and “all_fields” parameters as true requests creation of an output dataset that includes all fields of the input dataset dst-id plus an anomaly score for each row. The Batch Anomaly Score documentation describes the full set of parameters you might find useful in your own WhizzML scripts.

The script includes one last WhizzML helper function, (avg-anomaly …), that uses metadata the (create-and-wait-batchanomalyscore …) function adds to the output dataset it creates. From that metadata it computes an average anomaly score measuring how anomalous the input dataset is relative to the training set used to build the anomaly detector:

(define (avg-anomaly evdst-id)
  (let (evdst (fetch evdst-id)
        score-field (get-in evdst ["objective_field" "id"])
        sum (get-in evdst ["fields" score-field "summary" "sum"])
        population (get-in evdst ["fields" score-field "summary" "population"]))
    (/ sum population)))

There are a few details worth noting here. We first must fetch the output dataset evdst identified by evdst-id.  The metadata associated with evdst includes a map “objective_field” that includes a sub-map “id”  that identifies the score-field in the metadata containing the anomaly results we need. Using that score-field value, we can access the “summary” sub-map in the “fields” sub-map of the metadata, where the total sum of the anomaly scores for all rows in the dataset and the population count of the number of rows in the dataset are found.  We return the quotient of these two quantities as our average anomaly score for the entire dataset as a measure of data shift.  The Dataset Properties section of the Dataset documentation provides more information about the properties we describe here as well as other properties you might find useful in a custom WhizzML script.

Using our Helper Functions to do Anomaly Scoring

We now can combine these minimal helper functions into a single function that computes an anomaly score for an entire production dataset relative to a training dataset.

(define (anomaly-measure train-dst train-exc prod-dst prod-exc seed clean)
  (let (traino-dst (sample-dataset train-dst 0.8 false seed)
        prodo-dst (sample-dataset prod-dst 0.8 true seed)
        anomaly (create-and-wait-anomaly {"dataset" traino-dst
                                          "excluded_fields" train-exc})
        ev-id (anomaly-evaluation anomaly prodo-dst)
        evdst-id (get-in (fetch ev-id) ["output_dataset_resource"])
        score (avg-anomaly (wait evdst-id)))
    (if clean
      (prog (delete evdst-id)
            (delete ev-id)
            (delete anomaly)
            (delete prodo-dst)
            (delete traino-dst)))
    score))

In summary, this (anomaly-measure …) function:

  1. Creates samples of both datasets (traino-dst,  prodo-dst)
  2. Creates an anomaly detector from the training sample (anomaly)
  3. Applies the anomaly detector to the production sample to create a batch score (ev-id)
  4. Computes the average anomaly score for the entire production sample (score)

This function also includes several details you might handle differently in your own WhizzML scripts. The “train-exc” parameter is a WhizzML list of fields in the training dataset that should be ignored when the anomaly detector is created. The “prod-exc” input parameter is ignored here, since the contents of the “train-exc” parameter determine which fields the anomaly detector ignores in the production dataset.

In addition to these input parameters, there are some internal details of the function worth noting. The (anomaly-evaluation …) function returns the ID of a Batch Anomaly Score object identified by ev-id; the metadata map for this object includes a property “output_dataset_resource” that contains the BigML ID evdst-id of the output dataset created by the batch anomaly score function. Note that the BigML platform backend produces the batch anomaly score object before the output dataset object is complete. We must use the (wait …) function or an equivalent operation to ensure the dataset referenced by “output_dataset_resource” is available before we attempt to access the anomaly score information in the output dataset metadata.

Finally, our (anomaly-measure …) function includes some housekeeping features to support the higher-level functions in our WhizzML script. You might find similar features useful in your own scripts. The “seed” input string parameter passed to the (sample-dataset …) functions causes deterministic, and therefore repeatable, sampling. Specifying the “clean” input as true causes the function to delete the intermediate working objects it creates before returning the average anomaly score. This can be helpful when one repetitively computes the average anomaly score on a sequence of pairs of subsets of the training and production datasets.
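A hypothetical single-shot use of the function might look like the following sketch; the dataset IDs, exclusion lists, and seed string are all placeholders:

;; Hypothetical call; dataset IDs, exclusions, and seed are placeholders.
(anomaly-measure "dataset/<training-id>"    ;; training dataset
                 []                         ;; train-exc: no excluded fields
                 "dataset/<production-id>"  ;; production dataset
                 []                         ;; prod-exc (currently ignored)
                 "my-seed"                  ;; deterministic sampling seed
                 true)                      ;; clean up intermediate objects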

As just suggested, in practice we would likely want to repetitively sample the training and production datasets and compute a sequence of average anomaly scores for that sequence of samples. The (anomaly-loop …) function in the script does exactly this, in a form that illustrates how you could easily add other computations or logging in your own custom WhizzML scripts:

(define (anomaly-loop train-dst train-exc prod-dst prod-exc seed niter clean logf)
  (loop (iter 1
         scores-list [])
    (if logf
      (log-info "Iteration " iter))
    (let (score (anomaly-measure train-dst train-exc prod-dst prod-exc (str seed " " iter) clean)
          scores-list (append scores-list score))
      (if logf
        (log-info "Iteration " iter scores-list))
      (if (< iter niter)
        (recur (+ iter 1) scores-list)
        scores-list))))

This function just calls the (anomaly-measure …) function “niter” times and returns the resulting sequence of average anomaly scores.  Note that the input parameters include the “clean” boolean parameter specifying whether the intermediate objects created by each use of the (anomaly-measure …) function should be preserved or deleted.  Finally, this function illustrates how we can use logging features on the BigML platform to log results from the sequence of (anomaly-measure …) calls under control of the “logf” boolean input parameter.

Next in our script we call the (anomaly-loop …) function inside a wrapper function:

(define (anomaly-measures train-dst train-exc prod-dst prod-exc seed niter clean logf)
  (let (values (anomaly-loop train-dst train-exc prod-dst prod-exc seed niter clean logf))
    values))

Although this function could be eliminated in our script, you might find a similar function useful in your own custom WhizzML script as the place for adding additional computations on the sequence of average anomaly scores returned by the (anomaly-loop …) function.

The final high-level function in our script computes our final single numeric measure of data shift.  In our script, this is simply the average of the sequence of average anomaly scores returned by the (anomaly-measures …) function:

(define (anomaly-estimate train-dst train-exc prod-dst prod-exc seed niter clean logf)
  (let (values (anomaly-measures train-dst train-exc prod-dst prod-exc seed niter clean logf)
        sum (reduce + 0 values)
        cnt (count values))
    (/ sum cnt)))

In a custom WhizzML script one could combine the (anomaly-estimate …) and (anomaly-measures …) functions by simply replacing the use of (anomaly-measures …) with (anomaly-loop …). If one doesn’t need to access the list of scores, one could also pull the contents of the (anomaly-loop …) function into this function. On the other hand, you might need to use the list of scores from (anomaly-measures …) directly in your own WhizzML scripts, rather than just computing the average of the average anomaly scores in that list.

Finally, the example in the WhizzML script gallery concludes with the definition required to use the script in the BigML Dashboard:

(define result (anomaly-estimate train-dst train-exc prod-dst prod-exc seed niter clean logf))

This definition also demonstrates how you would call the top-level (anomaly-estimate …) function directly in your own WhizzML scripts. Thanks to WhizzML’s composability, using Anomaly Detectors to detect covariate shift is more convenient than ever. We hope you get a chance to give it a spin and let us know how it goes!



Predictive Analytics in the Financial Industry – The Art of What, How and Why

Mobey Forum, the global industry association empowering banks and other financial institutions to play a leading role in ushering in the future of digital financial services, has just published the first in a new series of reports exploring the most important aspects, challenges, and key application areas of predictive analytics in financial services. As Co-chair of the Mobey Forum’s predictive analytics workgroup, I had a front row seat in observing the challenges the industry is facing as it transitions to a much more data-driven operational mode necessitated by competitive pressures. I would like to thank the colleagues from Danske Bank, UBS, Nets, PostFinance, Ericson, HSBC, Nordea, CaixaBank, Teconcon, Giesecke&Devrient and many more leading institutions that contributed to the final report.

Statistics vs. Machine Learning

‘Predictive Analytics in the Financial Industry – The Art of What, How and Why’ is a primer that lays the groundwork for subsequent reports that will go into much more detail in exploring different technical and organizational aspects of predictive analytics. The Mobey Forum workgroup aims to strike a balance between the technical underpinnings of key enabling technologies such as Machine Learning and the real-life commercial applications and best practices that can serve as benchmarks for beginners in their initiatives.

We hope this effort provides the spark to get your organization started with predictive analytics, in a way that accelerates innovative data-driven products and services that can adapt to a dynamic marketplace threatening to make one-size-fits-all traditional product and service portfolios obsolete.

As a reminder, an in-depth discussion by the authors of the report will be broadcast on BrightTALK on July 11th at 4PM CET.

Automatically Estimate the Best K for K-Means Clustering with WhizzML

(Thanks to Alex Schwarm for bringing to our attention the Pham, Dimov, and Nguyen paper, which is the subject of this post.)

The BigML platform offers a robust K-Means Clustering API that uses the G-Means algorithm to determine K if you don’t have a good guess for K. However, you may sometimes find that the divisive top-down approach of the G-Means algorithm does not yield the best K for your dataset. After a little experimentation, you may also discover that the G-Means algorithm does not choose a value of K that makes sense based on your knowledge of your dataset (see the “k” and “critical_value” arguments in the Cluster Arguments section). You could manually run the cluster operation on your dataset for a range of K, but that approach does not inherently include a way to recognize the best K. And it can be very time consuming!

The Pham, Dimov, and Nguyen Algorithm and the K-means Algorithm in BigML

Fortunately, WhizzML allows us to easily implement another approach for choosing K, using an algorithm by Pham, Dimov, and Nguyen (D. T. Pham, S. S. Dimov, and C. D. Nguyen, “Selection of K in K-means clustering”, Proc. IMechE, Part C: J. Mechanical Engineering Science, v. 219, pp. 103-119). Pham, Dimov, and Nguyen define a measure of concentration f(K) on a K-means clustering and use it as an evaluation function to determine the best K. In this post, we show how to use the Pham-Dimov-Nguyen (PDN) algorithm in WhizzML to calculate f(k) over an arbitrary range from Kmin to Kmax. You can then consider the k that yields the optimum (minimum) value of f(k) as the best K for a K-means clustering of your dataset.

Before jumping into the WhizzML code, we first note the clustering functions WhizzML provides via the BigML API calls:

  • (create-and-wait-cluster …):  Using this function we can create a BigML Cluster object for a BigML Dataset object using K-means or G-means clustering.
  • (create-and-wait-centroid …):  Once we have a BigML Cluster for a BigML Dataset we can create a BigML Centroid object for a row in the dataset using this function.
  • (create-and-wait-batchcentroid …):  Given a Cluster object and a Dataset object, we can use this function to create a BigML Batch Centroid object and a new Dataset that labels every row with the number of the cluster centroid to which the row is assigned.
  • (create* “cluster” …): With this function we can initiate the creation of a sequence of BigML Cluster objects on the BigML platform in parallel.
  • (wait* …): Although not a clustering function, this synchronization function re-establishes serial program flow in WhizzML after (create* …) initiates parallel creation of BigML objects.

We’ll use the latter two parallel operations to increase the speed of our WhizzML script that implements the PDN algorithm.
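As a minimal sketch of that parallel pattern (the dataset variable ds and the argument maps are invented for illustration), (create* …) launches all the cluster creations at once and (wait* …) blocks until every one has finished:

;; Hypothetical sketch of the create*/wait* pattern; ds is a placeholder.
(let (args-list [{"dataset" ds "k" 2} {"dataset" ds "k" 3} {"dataset" ds "k" 4}]
      ids (create* "cluster" args-list))  ;; start all creations in parallel
  (wait* ids))                            ;; block until every cluster is ready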

Our WhizzML script in the BigML gallery uses the PDN concentration function f(k) and finds the best K in several steps.  Given a BigML Dataset object, the steps of the generic algorithm are:

  1. Compute a sequence of BigML Cluster objects for k ranging from Kmin to Kmax.
  2. Evaluate f(k) for each cluster in the sequence of BigML Cluster objects.
  3. Choose the k with the optimum (minimum) value of f(k) as the best K.
  4. Finally, if desired, create a BigML Batch Centroid object from the best K Cluster object and the source Dataset object.

It turns out that our example WhizzML script implements a sequence of component WhizzML functions that aren’t quite one-to-one with the steps in this generic algorithm. The functions in our script are organized into three layers: the base layer consists of foundation functions that enable computation of the PDN concentration function f(k); the functions in the middle layer use these foundation functions to implement our algorithm for finding the best k for K-Means clustering of a dataset; and the top layer consists of WhizzML functions that provide examples of different ways to use our best-k implementation of K-Means clustering in your own workflows.

Foundation Functions for a PDN-based Approach to Finding the Best k

Our WhizzML script begins with a set of four simple foundation functions: (generate-clusters …), (extract-eval-data …), (alpha-func …), and (evaluation-func …). The (generate-clusters …) function implements the first step in the generic algorithm we outlined. Given a BigML dataset ID and a range of values for k, this function creates a sequence of BigML Cluster objects:

(define (generate-clusters dataset cluster-args k-min k-max)
  (let (dname (get (fetch dataset) "name")
        fargs (lambda (k)
                (assoc cluster-args "dataset" dataset
                                    "k" k
                                    "name" (str dname " - cluster (k=" k ")")))
        clist (map fargs (range k-min (+ 1 k-max)))
        ids (create* "cluster" clist))
    (map fetch (wait* ids))))

In addition to the “dataset” ID and range for k specified by “k-min” and “k-max”, the function accepts a map “cluster-args” of arguments for the BigML API to create Cluster objects. This base “cluster-args” map is expanded to a map for a specific value of k by the function fargs(k) created as a lambda function.

The rest of the function creates the clist of argument maps for each value of k by using the WhizzML (map …) function. The WhizzML (create* …) and (wait* …) functions are then used to create the BigML Cluster objects for k in “k-min” to “k-max” in parallel. The function then returns a list of the metadata for the resulting clusters on the BigML server.

As we will explain subsequently, the PDN concentration function f(k) for a given k is computed from certain members of the metadata map for the cluster object for k.  To illustrate this and simplify the code,  the next helper function (extract-eval-data …) in the script encapsulates the required values from the metadata map in a separate map:

(define (extract-eval-data cluster)
  (let (id (get cluster "resource")
        k (get cluster "k")
        n (count (get cluster "input_fields"))
        within_ss (get-in cluster ["clusters" "within_ss"])
        total_ss (get-in cluster ["clusters" "total_ss"]))
    {"id" id "k" k "n" n "within_ss" within_ss "total_ss" total_ss}))

In addition to the BigML cluster “id” and “k”, this smaller map includes the number “n” of fields in the dataset that are actually considered when doing the clustering. The “within_ss” property is the total sum of squared distances between each dataset row and the centroid of its assigned cluster. Similarly, “total_ss” is the total sum of squared distances between every row in the entire dataset and the global centroid of the dataset; it is therefore the same value for every clustering of the dataset.
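To make the two statistics concrete, here is a small Python sketch (our own toy code, not part of the script) that computes a within_ss / total_ss pair for a hand-made clustering:

```python
def centroid(points):
    """Mean point of a list of equal-length tuples."""
    n = len(points)
    return [sum(xs) / n for xs in zip(*points)]

def sum_squares(points, center):
    """Sum of squared Euclidean distances from each point to center."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)

def ss_stats(clusters):
    """clusters: list of clusters, each a list of points.
    Returns (within_ss, total_ss) as described above."""
    within = sum(sum_squares(pts, centroid(pts)) for pts in clusters)
    all_pts = [p for pts in clusters for p in pts]
    total = sum_squares(all_pts, centroid(all_pts))
    return within, total
```

For two tight clusters far apart, within_ss is much smaller than total_ss; that gap is exactly what the PDN concentration function exploits.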

The next two functions, (alpha-func …) and (evaluation-func …), are factory functions that together create the PDN concentration function f(k) for a clustering. This function includes an internal weighting function a(k) parameterized on the number n of input fields considered in clustering the dataset. WhizzML does not provide an equivalent of the LISP (apply …) or the Clojure (partial …) for creating partial function evaluations, but it does create standard closures. This lets us use JavaScript-style patterns based on lambda functions and closures to build the PDN concentration function f(k) parameterized on n in WhizzML. We do this with a factory function (alpha-func …) that returns the weighting function a(k), and a factory function (evaluation-func …) that returns a custom version of the concentration function f(k).

The concentration function f(k) in the PDN paper incorporates a weighting function a(k) that is recursive in k and parameterized on n (eqns. (3a) and (3b) in the paper).  Because we want to evaluate f(k) over an arbitrary range of k, we need a closed form expression for a(k).  We can’t go through the derivation here, but the closed form we need is:

       | 1 - 3/(4n)                             k=2
a(k) = |
       | (5/6)^(k-2) a(2) + [1 - (5/6)^(k-2)]   k>2
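The closed form can be checked numerically against the paper's recursion a(k) = a(k-1) + (1 - a(k-1))/6. Here is a quick Python sketch (ours, purely for verification) doing exactly that:

```python
def alpha_closed(n, k):
    """Closed form of the PDN weight a(k), n = number of input fields."""
    a2 = 1 - 3.0 / (4 * n)
    if k <= 2:
        return a2
    w = (5.0 / 6.0) ** (k - 2)
    return w * a2 + (1 - w)

def alpha_recursive(n, k):
    """The original PDN recursion: a(k) = a(k-1) + (1 - a(k-1)) / 6."""
    a = 1 - 3.0 / (4 * n)
    for _ in range(3, k + 1):
        a = a + (1 - a) / 6
    return a
```

The two agree for any n and k, which is what lets the script evaluate f(k) over an arbitrary range of k without threading the recursion through.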

We could write our factory function (alpha-func …) in multiple ways.  The implementation in our WhizzML script follows a simple JavaScript-style pattern of returning an anonymous function:

(define (alpha-func n)
  (let (alpha_2 (- 1 (/ 3 (* 4 n)))
        w (/ 5 6))
    (lambda (k)
      (if (<= k 2)
        alpha_2
        (+ (* (pow w (- k 2)) alpha_2) (- 1 (pow w (- k 2))))))))

This factory function implicitly creates a closure that captures the input parameter “n” and then returns a lambda function that computes a(k).

We next use (alpha-func …) in our factory function (evaluation-func …) that creates the concentration function f(k).  As with the weighting function a(k), since we want to evaluate f(k) over an arbitrary range of k we need to slightly transform f(k) in the PDN paper (eqn. (2)):

                     | 1                   k=1 
f(k, S(k), S(k-1)) = | 1                   S(k-1) undefined or S(k-1)=0
                     | S(k)/[a(k)S(k-1)]   otherwise

where S(k) is the “within_ss” property in the map returned by the (extract-eval-data …) function we described above.  Our factory function again follows the simple JavaScript-style pattern of returning an anonymous function:

(define (evaluation-func n)
  (let (fa (alpha-func n))
    (lambda (k sk skm)
      (if (or (<= k 1) (not skm) (zero? skm))
        1
        (/ sk (* (fa k) skm))))))

This factory function accepts the single input parameter “n”, implicitly creates a closure that includes an instance of the weighting function a(k), and then returns an anonymous instance of our modified concentration function f(k, S(k), S(k-1)).
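Putting the two factories together, a Python rendering (ours, mirroring the same closure and branch structure) of f(k, S(k), S(k-1)) looks like this:

```python
def alpha_func(n):
    """Factory returning the weighting function a(k) for n input fields."""
    a2 = 1 - 3.0 / (4 * n)
    def alpha(k):
        if k <= 2:
            return a2
        w = (5.0 / 6.0) ** (k - 2)
        return w * a2 + (1 - w)
    return alpha

def evaluation_func(n):
    """Factory returning f(k, S(k), S(k-1)) closed over an instance of a(k)."""
    fa = alpha_func(n)
    def f(k, sk, skm):
        if k <= 1 or skm is None or skm == 0:
            return 1.0
        return sk / (fa(k) * skm)
    return f
```

Values of f below 1 indicate a clustering that concentrates the data more than the weighting a(k) predicts, which is why the best k minimizes f.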

At this point it’s worth recapping the functions we’ve built so far.  In just a few lines of WhizzML code, we’ve implemented four routines that form the foundation layer of the Best-K script in the WhizzML script gallery and illustrate the power of WhizzML.  The (generate-clusters …) function orchestrates a potentially large amount of work on the BigML backend to create a sequence of BigML Cluster objects for K-means clusterings of our dataset over a range of k.  Each BigML Cluster object itself embodies a large amount of data and metadata, so we’ve defined a function (extract-eval-data …), which you could customize further in your own WhizzML scripts, to extract just the metadata we’ll need.  Finally, we’ve implemented two factory functions (alpha-func …) and (evaluation-func …) that together generate a version of the Pham-Dimov-Nguyen concentration function f(k) suitable for our needs.

Using Our Foundation Functions to Implement a Best k Algorithm

We next combine our foundation functions with other WhizzML built-in functions in a set of three functions at the heart of our implementation of the PDN algorithm for choosing the best K-means clustering.  The first function (evaluate-clusters …) accepts a list of clusters created by (generate-clusters …) and returns a corresponding list of metadata maps:

(define (evaluate-clusters clusters)
  (let (cmdata (map extract-eval-data clusters)
        n (get (nth cmdata 0) "n")
        fe (evaluation-func n))
    (loop (in cmdata
           out []
           ckz {})
      (if (= [] in)
        out
        (let (ck (head in)
              ckr (tail in)
              k (get ck "k")
              within_ss (get ck "within_ss")
              within_ssz (if (<= k 2) (get ck "total_ss") (get ckz "within_ss"))
              cko (assoc ck "fk" (fe k within_ss within_ssz)))
          (recur ckr (append out cko) ck))))))

Each metadata map in the returned list includes a property “fk” that is  the value of the PDN function f(k) for the corresponding K-means clustering.

This function uses (extract-eval-data …) to build a list cmdata of metadata maps for the list of K-means clusterings, and the factory function (evaluation-func …) to create a function “fe” that is our version f(k, S(k), S(k-1)) of the PDN concentration function f(k).  The body of the function is a WhizzML (loop …) expression that steps through the input list “in” of metadata maps (initially the cmdata list) to sequentially build the output list “out” of metadata maps.  Each iteration operates on the head metadata map of the “in” list and on “ckz”, the metadata map processed in the previous iteration, to compute the “fk” property, and then appends the augmented metadata map to the output list “out”.  Note that the input list of “clusters” can span an arbitrary range of k, and that the computation of “within_ssz” supplies the initial value of S(k-1) required by our concentration function f(k, S(k), S(k-1)) for the first cluster in the “clusters” list.
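The same scan is easy to follow in Python. This sketch (ours) mirrors the loop, threading the previous metadata map through each step to supply S(k-1):

```python
def evaluate_clusters(cmdata, fe):
    """cmdata: metadata maps in increasing-k order; fe: f(k, S(k), S(k-1)).
    Returns the maps augmented with an "fk" property."""
    out = []
    prev = {}  # plays the role of ckz
    for ck in cmdata:
        k = ck["k"]
        # seed S(k-1) with total_ss for k <= 2, otherwise use the
        # within_ss of the previous clustering in the list
        skm = ck["total_ss"] if k <= 2 else prev.get("within_ss")
        cko = dict(ck)
        cko["fk"] = fe(k, ck["within_ss"], skm)
        out.append(cko)
        prev = ck
    return out
```

In the test below we plug in a simplified fe (without the a(k) weight) just to exercise the threading of S(k-1).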

Our next two functions are helper functions used by the top-level functions we describe next.  The first, (clean-clusters …), simply deletes BigML Cluster objects created by our PDN-based algorithm that are no longer needed:

(define (clean-clusters evaluations cluster-id logf)
  (for (x evaluations)
    (let (id (get x "id")
          _ (if logf (log-info "Testing for deletion " id " " cluster-id)))
      (if (!= id cluster-id)
        (prog (delete id)
              (if logf (log-info "Deleted " id))))))

We note that this function includes an input parameter “logf”. When this parameter is true, the function logs information about the delete operation to the BigML logging system.  The function is intended to be a base example you could expand with additional logging information in your own version of the script.

The other function (best-cluster …) generates a new BigML Cluster object:

(define (best-cluster dataset cluster-args k)
  (let (dname (get (fetch dataset) "name")
        ckargs (assoc cluster-args "dataset" dataset
                                   "k" k
                                   "name" (str dname " - cluster (k=" k ")")))
    (create-and-wait-cluster ckargs)))

This helper function is intended to increase the flexibility of our WhizzML script. In the initial evaluation stage we generate a list of BigML Cluster objects using the (generate-clusters …) function using an arbitrary map “cluster-args” of  values for the BigML clustering operation arguments.  Using this helper function, we can generate a final version of the BigML Cluster object for a given k using a different “cluster-args” map.

Before introducing the final top-level functions in our example WhizzML script, a few additional notes are in order.  First, our middle-level functions only access data in WhizzML to do their work; once the BigML Cluster objects have been created with the (generate-clusters …) function, they don’t need to touch them again on the BigML system.  Correspondingly, our example (clean-clusters …) WhizzML function queues the object deletion requests to the BigML platform but doesn’t need to wait for them to complete.  Finally, although the sample (best-cluster …) function regenerates the K-means clustering for the best k and waits for BigML to complete, in your own custom WhizzML script you could just queue the request to create the BigML Cluster object and check whether it is complete with the (wait …) function when you need it.  The BigML platform takes care of all the cumbersome work of creating and deleting objects, and provides our WhizzML code with just the small amount of data we need.  This greatly simplifies orchestrating and optimizing the performance of our workflows.

Functions that Illustrate Several Applications of the PDN Best k Approach

The final group of functions in our example WhizzML script are three simple top level functions that provide us with a stack of operations relevant to different applications.  We step through them in order. We then provide example WhizzML calls of each function.

The first top-level function (evaluate-k-means …) just creates the list of BigML Cluster objects for K-means clustering for k ranging from “k-min” to “k-max” and returns the list of metadata maps that includes the value of the PDN concentration function f(k) as the property “fk”:

(define (evaluate-k-means dataset cluster-args k-min k-max clean logf)
  (let (clusters (generate-clusters dataset cluster-args k-min k-max)
        evaluations (evaluate-clusters clusters))
    (prog (if clean
            (clean-clusters evaluations "" logf))
          evaluations)))

In addition to the basic input parameters “dataset”, “k-min”, and “k-max”, the function lets us specify a WhizzML map “cluster-args” with our choice of arguments for the BigML cluster operation.  When the “clean” parameter is true, the function calls (clean-clusters …) to delete the BigML Cluster objects on the BigML platform before returning the result list.  In this example function, the value of the parameter “logf” is just passed on to the (clean-clusters …) function.  In your own custom version of this WhizzML script you can use this parameter to control whatever additional logging you might want.

Our next function (best-k-means …) builds on (evaluate-k-means …) to return a BigML Cluster object for the best k:

(define (best-k-means dataset cluster-args k-min k-max bestcluster-args clean logf)
  (let (evaluations (evaluate-k-means dataset cluster-args k-min k-max false logf)
        _ (if logf (log-info "Evaluations " evaluations))
        besteval (min-key (lambda (x) (get x "fk")) evaluations)
        _ (if logf (log-info "Best " besteval))
        cluster-id (if (= cluster-args bestcluster-args)
                     (get besteval "id")
                     (best-cluster dataset bestcluster-args (get besteval "k"))))
    (prog (if clean
            (clean-clusters evaluations cluster-id logf))
          cluster-id)))

After we generate the list “evaluations” of metadata maps with the PDN concentration function values, we use the WhizzML (min-key …) built-in function to find the metadata map for the best k.  We then check whether the “cluster-args” map used in the first stage, when we find k, differs from the “bestcluster-args” map.  If the two maps don’t agree, we generate a new BigML Cluster object for the best k.  Either way, if “clean” is true, we direct the BigML platform to asynchronously delete the BigML Cluster objects we no longer need.  Finally, we return the ID of the BigML Cluster object for the best k the routine found.
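In Python terms, the selection step amounts to a min over the “fk” values plus the reuse-or-rebuild check. A rough sketch (ours, with rebuild standing in for a call like (best-cluster …)):

```python
def pick_cluster_id(evaluations, cluster_args, bestcluster_args, rebuild):
    """Return the cluster ID for the best k, rebuilding the cluster
    only when the final argument map differs from the search-stage one."""
    best = min(evaluations, key=lambda x: x["fk"])  # (min-key ...) analogue
    if cluster_args == bestcluster_args:
        return best["id"]
    return rebuild(best["k"])  # stand-in for (best-cluster ...)
```

The rebuild callback keeps the sketch independent of any platform call; in the real script that slot is filled by the (best-cluster …) helper.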

Our last routine, (best-batchcentroid …), uses the BigML Cluster object created by the (best-k-means …) function and the input BigML Dataset object to create a BigML Batch Centroid object:

(define (best-batchcentroid dataset cluster-args k-min k-max bestcluster-args clean logf)
  (let (cluster-id (best-k-means dataset cluster-args k-min k-max bestcluster-args clean logf)
        batchcentroid-id (create-and-wait-batchcentroid {"cluster" cluster-id
                                                         "dataset" dataset
                                                         "output_dataset" true
                                                         "all_fields" true}))
    batchcentroid-id))

Because the argument map in our call to the WhizzML (create-and-wait-batchcentroid …) function includes the “all_fields” and “output_dataset” properties, the function also creates a BigML Dataset object that includes all columns of the input “dataset” plus an extra column specifying the cluster to which each dataset row was assigned.

Using the Best k Algorithm Implementation

Our three top-level WhizzML functions take similar parameters, so we can call them in a similar way:

(define bestk-evaluations (evaluate-k-means dataset cluster-args k-min k-max clean logf))

(define bestk-cluster (best-k-means dataset cluster-args k-min k-max bestcluster-args clean logf))

(define bestk-batchcentroid (best-batchcentroid dataset cluster-args k-min k-max bestcluster-args clean logf))

These three examples illustrate how we compute a list of PDN concentration function f(k) evaluations, the BigML Cluster object for the best k, and the BigML Batch Centroid object for the best k, respectively.

In your application, you might have a guess for the best k.  In that case, you might want to specify a range “k-min” to “k-max” that brackets that value.  You could then use the first call to the (evaluate-k-means …) function above, examine the results, and choose the best k.  Alternatively, you could use (evaluate-k-means …) in a loop to test a series of intervals [k_1, k_2], [k_2, k_3], …, [k_N-1, k_N], and then choose the best k from all of those tests.  Finally, if you already know a range “k-min” to “k-max”, you can use (best-k-means …) or (best-batchcentroid …) to generate the BigML Cluster object or BigML Batch Centroid object, respectively, for the best k.
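The interval-scanning idea can be sketched as a small driver loop (Python, ours; evaluate stands in for a call to something like (evaluate-k-means …)):

```python
def best_k_over_intervals(evaluate, boundaries):
    """Scan [k1,k2], [k2,k3], ... and keep the evaluation with the
    smallest f(k) across all intervals."""
    best = None
    for k_min, k_max in zip(boundaries, boundaries[1:]):
        for ev in evaluate(k_min, k_max):
            if best is None or ev["fk"] < best["fk"]:
                best = ev
    return best
```

In the test below the evaluate stand-in returns synthetic “fk” values with a minimum at k=5, so the scan over the intervals picks that k.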



Machine Learning in Objective-C Has Never Been Easier


Taking the opportunity provided by our recent Spring release, BigML is pleased to announce our new SDK for Objective-C, bigml-objc, which provides a modern, block-based Objective-C API, and a new more maintainable and coherent design. Additionally, bigml-objc includes support for WhizzML, our exciting new DSL for the automation of ML workflows. bigml-objc evolves and supersedes the old ML4iOS library, which has been kindly supported by BigML’s friend Felix Garcia Lainez.


bigml-objc‘s API design follows along the lines of our Swift SDK. Its main aim is to allow iOS, OS X, watchOS, and tvOS developers to easily integrate BigML services into their apps while also benefitting from modern Objective-C features (first and foremost Objective-C blocks), which allow for a simple handling of asynchronous operations.

The main features BigML SDK for Objective-C provides can be divided into two areas:

  • Remote resource processing: BigML SDK exposes BigML’s REST API through a higher-level Objective-C API that makes it easier for you to create, retrieve, update, and delete remote resources. Supported resources are:

    • Sources
    • Datasets
    • Models
    • Clusters
    • Anomalies
    • Ensembles
    • Predictions
    • WhizzML
  • Local resource processing: BigML SDK allows you to mix local and remote distributed processing in a seamless and transparent way. You will be able to download your remote resources (e.g. a cluster) and then apply supported algorithms to them (e.g. calculate its nearest centroid based on your input data). This is one definite advantage that BigML offers in comparison to competing services, which mostly lock you into either using their remote services or doing everything locally. BigML’s SDK for Objective-C combines the benefits of both approaches by making it possible to use the power of a cloud solution and to enjoy the flexibility and transparency of local processing right when you need it. The following is a list of currently supported algorithms that BigML’s SDK for Objective-C provides:

    • Model predictions
    • Ensemble predictions
    • Clustering
    • Anomaly detections.

    A dive into BigML’s Objective-C API

    The BMLAPIConnector class is the workhorse of all remote processing: it allows you to create, delete, and get remote resource of any supported type. When instantiating it, you should provide your BigML’s account credentials and specify whether you want to work in development or production mode:

    //-- the mode: and version: labels follow the prose description above
    BMLAPIConnector* connector =
      [[BMLAPIConnector alloc]
         initWithUsername:@"your BigML username here"
                   apiKey:@"your BigML API Key here"
                     mode:BMLModeDevelopment
                  version:nil];

    You can safely pass nil for the version argument, since there is actually only one API version supported by BigML.

    Once your connector is instantiated, you can use it to create a data source from a local CSV file:

    NSString* filePath = ...;
    BMLMinimalResource* file =
      [[BMLMinimalResource alloc]
         initWithName:@"My Data Source"
                 type:BMLResourceTypeFile
                 uuid:filePath
           definition:@{}];
    [connector createResource:BMLResourceTypeSource
                         name:@"My first data source"
                      options:@{}
                         from:file
                   completion:^(id<BMLResource> resource, NSError* error) {
        if (error == nil) {
            //-- use resource
        } else {
            //-- handle error
        }
    }];

    As you can see, BMLAPIConnector’s createResource allows you to specify the type of resource you want to create, its name, a set of options, and the resource that should be used to create it, in this case a local file.

    BigML SDK for Objective-C’s API is entirely asynchronous and relies on completion blocks, where you will get the resource that has been created, if any, or the error that aborted the operation as applicable. The resource you will receive in the completion block is an instance of the BMLMinimalResource type, which conforms to the BMLResource protocol.

    typedef NSString BMLResourceUuid;
    typedef NSString BMLResourceFullUuid;

    @class BMLResourceTypeIdentifier;

    /**
     * This protocol represents a generic BigML resource.
     */
    @protocol BMLResource <NSObject>

    /// the json body of the resource (see the BigML REST API doc)
    @property (nonatomic, strong) NSDictionary* jsonDefinition;

    /// the current status of the resource
    @property (nonatomic) BMLResourceStatus status;

    /// the resource progress, a float between 0 and 1
    @property (nonatomic) float progress;

    /// the resource name
    - (NSString*)name;

    /// the resource type
    - (BMLResourceTypeIdentifier*)type;

    /// the resource UUID
    - (BMLResourceUuid*)uuid;

    /// the resource full UUID
    - (BMLResourceFullUuid*)fullUuid;

    @end

    The BMLResource protocol encodes the most basic information that all resources share: a name, a type, a UUID, the resource’s current status, and a JSON object that describes all of its properties. You are supposed to create your own custom class that conforms to the BMLResource protocol and that best suits your needs e.g., it might be a Core Data class that allows you to persist your resource to a local cache. Of course you are welcome to reuse our BMLMinimalResource implementation as you wish.

    In a pretty similar way you can create a dataset from the data source just created:

    //-- source: the BMLResource received when creating the data source above
    [connector createResource:BMLResourceTypeDataset
                         name:@"My first dataset"
                      options:@{}
                         from:source
                   completion:^(id<BMLResource> resource, NSError* error) {
        if (error == nil) {
            //-- use resource
        } else {
            //-- handle error
        }
    }];

    If you know the UUID of an existing resource of a given type and want to retrieve it from BigML, you can use BMLAPIConnector’s getResource method:

    [connector getResource:BMLResourceTypeDataset
                      uuid:datasetUuid
                completion:^(id<BMLResource> resource, NSError* error) {
        if (error == nil) {
            //-- use resource
        } else {
            //-- handle error
        }
    }];

    Creating WhizzML Scripts

    You can create a WhizzML script in a way similar to how you create a datasource, i.e., by using BMLAPIConnector‘s createResource method and providing a BMLResourceTypeWhizzmlSource resource that encodes the WhizzML source code:

        NSDictionary* dict = @{ @"source_code" : @"My source code here",
                                @"description" : @"My first WhizzML script",
                                @"inputs" : @[@{@"name" : @"inDataset", @"type" : @"dataset-id"}],
                                @"tags" : @[@"tag1", @"tag2"] };
        BMLMinimalResource* resource =
          [[BMLMinimalResource alloc] initWithName:@"My first WhizzML Script"
                                              type:BMLResourceTypeWhizzmlSource
                                              uuid:@""
                                        definition:dict];
        [[BMLAPIConnector newConnector]
         createResource:BMLResourceTypeWhizzmlScript
         name:@"My first WhizzML Script"
         options:@{}
         from:resource
         completion:^(id<BMLResource> resource, NSError* error) {
             if (resource) {
                 // execute the script passed in resource
             } else {
                 // handle error
             }
         }];

    Creating WhizzML scripts yourself is not the only way to take advantage of our new workflow automation DSL. Indeed, you can browse our WhizzML script Gallery and find a growing collection of scripts to solve recurrent machine learning tasks such as removing anomalies from a dataset, identifying a dataset’s best features, doing cross-validation, and many more. Once you have found what you are looking for, you can clone that script (many are even free!) to your account for use from your Objective-C program.

    Once you have created or cloned your script from the gallery, you can execute it very easily:

    //-- script: the BMLResource for the script created or cloned above
    [connector createResource:BMLResourceTypeWhizzmlExecution
                         name:@"New Execution"
                      options:@{ @"inputs" : @[@[@"inDataset", @"dataset/573d9b147e0a8d70da01a0b5"]] }
                         from:script
                   completion:^(id<BMLResource> resource, NSError* error) {
        if (resource) {
            //-- use the execution results
        } else {
            //-- handle error
        }
    }];

    Read a thorough description of WhizzML and how you can use WhizzML scripts, libraries and executions in our REST API documentation! A great resource to learn about the language is our series of training videos.

    Local algorithms

    The most exciting part of BigML’s SDK for Objective-C is surely its support for a collection of the most widely used ML algorithms such as model prediction, clustering, anomaly detection etc. What is even more exciting is that the family of algorithms that BigML’s SDK for Objective-C supports is constantly growing!

    As an example, say that you have a model in your BigML account and that you want to use it to make a prediction based on some set of data that you have got. This is a two step process:

    • Retrieve the model from your account, as shown above, with getResource.
    • Use BigML’s SDK for Objective-C to calculate a prediction locally.

    The second step can be executed inside of the completion block that you pass to getResource. This could look like the following:

    [connector getResource:BMLResourceTypeModel
                      uuid:modelUuid
                completion:^(id<BMLResource> resource, NSError* error) {
        if (error == nil) {
            NSDictionary* prediction =
              [BMLLocalPredictions localPredictionWithJSONModelSync:resource.jsonDefinition
                                                          arguments:@{ @"sepal length": @(6.02),
                                                                       @"sepal width": @(3.15),
                                                                       @"petal width": @(1.51),
                                                                       @"petal length": @(4.07) }
                                                            options:@{}];
            //-- use prediction
        } else {
            //-- handle error
        }
    }];

    The prediction object returned is a dictionary containing the value of the prediction and its confidence. In similar ways, you can calculate the nearest centroid, or do anomaly scoring.

    Practical Info

    The BigML SDK for Objective-C is compatible with Objective-C 2.0 and later. You can fork BigML’s SDK for Objective-C from BigML’s GitHub account and send us your PRs. As always, let us know what you think about it and how we can improve it to better suit your requirements!

PAPIs 2016 – Call for Proposals Deadline is This Friday!

PAPIs 2016 Boston

2016 marks the first year PAPIs is making it across the Atlantic to Boston. The conference will take place on October 10-12, 2016, and the deadline for proposals is this Friday.  As a founding member and initial sponsor of PAPIs, BigML will be actively participating in this third edition too. Besides BigML, last year’s event in Sydney included presenters from large tech companies such as Amazon, Microsoft, Google, and NVIDIA, as well as key government organizations and innovative startups focusing on Machine Learning.

PAPIs remains the premier forum for the presentation of new machine learning APIs, techniques, architectures and tools to build predictive applications. It is a community conference that brings together practitioners from industry, government and academia to present new developments, identify new needs and trends, and discuss the challenges of building real-world predictive, intelligent applications.

This year’s conference program will feature 4 types of presentations:

  • Technical and Business Talks (e.g., use cases, innovations, challenges, lessons learnt)
  • Tutorials
  • Research Presentations
  • Startup Pitches (as part of the AI Startup Battle)

As evidenced by the 700+ attendees who came from 25 different countries to the 4 previous events, presenting at the conference is a great way to share your learnings, showcase leadership on behalf of your organization, and engage with like-minded peers. With the aim of a diverse and creative line-up of speakers, the organizing committee welcomes practical presentations spanning a wide range of experience levels, from beginner-friendly how-tos to cautionary tales to deep dives for experienced professionals.

Please follow these guidelines for the best chance of having your proposal selected. If you have additional questions, don’t hesitate to email the organizers.

We’re looking forward to receiving your best proposals!

WhizzML Training Videos are Here!

This week we completed four in-depth training webinars focused on WhizzML, BigML’s new domain-specific language for automating Machine Learning workflows, implementing high-level Machine Learning algorithms, and easily sharing them with others. We already have our first batch of WhizzML graduates merely a week after launch. However, many of you were either not able to secure a live webinar spot or not able to join us at the scheduled date and time. Don’t fret if you missed any of these training sessions. You can now watch the whole series at your own pace on BigML’s YouTube channel.

We suggest that you follow the series in order, as there are dependencies that may slow down your comprehension if you skip ahead.  Here is a brief guide to how the series is structured:

1. Introduction to WhizzML

The first session covers all the basics describing how WhizzML is implemented on the BigML platform. Ryan Asensio, BigML’s Machine Learning Engineer, introduces the purpose of the language and some benefits over other ways of implementing Machine Learning workflows and algorithms.

2. Language Overview and Basic Workflows

This intermediate webinar explores the WhizzML domain-specific language in greater detail, with a whirlwind tour of its syntax, programming constructs and basic standard library functions. In this second training session, Charles Parker, BigML’s VP of Machine Learning Algorithms, explains how to create and use WhizzML resources (libraries, scripts and executions) by means of several simple yet fully functional example workflows.

3. Advanced Machine Learning Workflows

The third training session is an advanced webinar where we continue our exploration of the WhizzML language, diving into more complex examples and using more advanced features of the language. Charles Parker, BigML’s VP of Machine Learning Algorithms, explains how some of the most effective Machine Learning algorithms can be implemented and automated on top of the BigML platform with WhizzML.

4. Real-world Machine Learning Workflows

In the fourth session, Poul Petersen, BigML’s Chief Infrastructure Officer, walks you through some real-world workflow automations with an eye towards the kind of problems posed by complex use cases. In this advanced webinar we use some of the best tricks to solve your Machine Learning problems with confidence.

You can always visit the dedicated WhizzML landing page for the most up to date info and resources.

Have an idea for a new script for a Machine Learning task? As always, forward us your questions or comments anytime. We are looking forward to hearing about the Machine Learning projects that you are looking to automate.

Happy WhizzMLing!




WhizzML Tutorial II: Covariate Shift


If this is your first time writing in the new WhizzML language, I suggest that you start here with a simpler tutorial. In this post, we are going to write a WhizzML script that automates the process of investigating covariate shift. To get a deeper understanding of what we’re trying to do, read the beginning of this article first.

We want a workflow that:

  1. Takes two datasets (one that represents the data used to train a predictive model, one that represents production data)
  2. Returns an indication of whether the distribution of data is different between the two datasets


As we read in the article (or on Wikipedia), the indicator of change in our data distribution is called the phi coefficient. Our WhizzML script will return us this number, so let’s name our base function phi-coefficient.


[Screenshot: the phi-coefficient function definition]

What are we doing here?

To start, the function takes three arguments. The first two are ids for our training and production datasets, respectively. We call them training-dataset and production-dataset. The third argument, seed, is used to make our sampling deterministic. We’ll talk about this later.

There’s quite a bit going on in this function, but it’s all broken into manageable pieces. First, we use let to set local variables. These local variables are the result of a few different functions, which we will have to define. The local variables are comb-data, ids, model, and eval. After these are set, we can compute the phi coefficient with the function avg-phi. Let’s go over each of the local variables.


comb-data is the result of (combined-data training-dataset production-dataset). Here, we combine the two datasets into one big dataset. But before they are combined, we have to do a transformation on each dataset (add the “Origin” field). We’ll talk about that transformation when we define combined-data.

The dataset returned by our comb-data function looks something like this:

[Screenshot: sample rows of the combined dataset with the added “Origin” field]


Next, we have a variable called ids. This is a list of two dataset IDs – the result of:

[Screenshot: the call to split-dataset on comb-data]

Our split-dataset function takes the comb-data (one big dataset) and randomly splits it into two datasets. We split it so that we can train a predictive model with the larger portion of the split, and then evaluate its performance on the smaller part. The split-dataset function returns something like this:

["dataset/83bf92b0b38gbgb" "dataset/83hf93gf012bg84b20"]


model is a BigML predictive model resource. We are creating this model from the first element of our ids list: "dataset" (nth ids 0). The model is built to predict whether the value for the “Origin” field is “Training” or “Production”. Thus, the “objective_field” is “Origin”: "objective_field" "Origin".


eval is a BigML evaluation resource. To create an evaluation, we need two arguments: a predictive model and a dataset we want to test the model against. Our model is stored in model and our dataset is the second element in the ids list, hence: (nth ids 1)


We’re done with the local variables, but what does the whole phi-coefficient function return – what’s our end product?

Screen Shot 2016-06-02 at 10.44.32 AM.png
That line gives us the average phi score for the evaluation we just created. A bunch of information is stored inside the eval data object that will be retrieved from BigML. But of course we have to tell the function avg-phi how to get what we want! We’ll save that for later.

So we have built our base function (phi-coefficient) and understand its components. Now we have to go back and build the functions we haven’t defined yet, specifically combined-data, split-dataset, model-evaluation and avg-phi. We’ll start with combined-data.


Screen Shot 2016-06-02 at 12.55.42 PM.png

Again, this function combines two datasets. We tell BigML what datasets we want to combine using the “origin_datasets” parameter and passing it a list of dataset ids.

But what are train-data and prod-data?

Those are helper functions that add the “Origin” field we talked about.

  • train-data adds the “Origin” field with the value “Training” in each row
  • prod-data adds the “Origin” field with the value “Production” in each row

They are defined here:

Screen Shot 2016-06-02 at 10.48.00 AM.png
Since we are doing pretty similar things in both functions (adding an “Origin” field), we can separate that logic into its own function. Here it is:


Screen Shot 2016-06-02 at 10.48.10 AM.png
In that function we are…

  1. Creating a new dataset from an existing one "origin_dataset" dataset-id
  2. Adding a new field "new_fields" [...]
  3. Giving the new field a column name and label "name" "Origin" "label" "Origin"
  4. Setting the row’s value "field" value

The value will either be the string "Production" or "Training". This string is passed in as an argument where prod-data and train-data are defined.
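A sketch of these helpers, following the four steps above (the exact Flatline quoting of the constant value is an assumption):

```whizzml
;; Create a copy of a dataset with a constant "Origin" column added.
(define (add-origin-field dataset-id value)
  (create-and-wait-dataset
    {"origin_dataset" dataset-id
     "new_fields" [{"name" "Origin"
                    "label" "Origin"
                    ;; Flatline string literal, e.g. "\"Training\""
                    "field" (str "\"" value "\"")}]}))

(define (train-data dataset-id) (add-origin-field dataset-id "Training"))
(define (prod-data dataset-id) (add-origin-field dataset-id "Production"))

;; Combine the two labeled datasets into one big dataset.
(define (combined-data train-id prod-id)
  (create-and-wait-dataset
    {"origin_datasets" [(train-data train-id) (prod-data prod-id)]}))
```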

Nice. Now let’s go over split-dataset.


Screen Shot 2016-06-02 at 10.51.12 AM.png

  1. What are we splitting? dataset-id – the dataset we pass in.
  2. How are we splitting it, 80%/20%? 90%/10%? We can do whatever we want. This is determined by rate.
  3. How are we going to shuffle our data before we split it? The seed determines this.

As you can see, we are sampling the same dataset twice. One sample will be used to build a predictive model, the other will be used to evaluate the predictive model.

sample-dataset is another function. Here it is below:


Screen Shot 2016-06-02 at 10.52.42 AM.png
This function interacts with the BigML API. We create a new dataset, passing in the rate, the original dataset (dataset-id), whether it is out_of_bag or not (we’ll go over this) and the seed used to determine how the original dataset was shuffled.

Here’s a little diagram that will help explain how the seed and out_of_bag (oob) work.

Screen Shot 2016-06-02 at 10.53.41 AM.png
So if out_of_bag is set to true, we grab the rows labeled “oob”. Otherwise, we grab the ones marked “x”. The seed just changes which rows we label “oob” and “x”. The seed also enables this whole process to be deterministic: if you run the phi-coefficient function with the same seed (and the same datasets), you’ll get the same results!
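Based on that diagram, the two functions might be sketched like this (parameter names assumed):

```whizzml
;; Sample a dataset: the same seed and rate with out_of_bag flipped
;; yields the complementary set of rows.
(define (sample-dataset dataset-id rate oob seed)
  (create-and-wait-dataset {"origin_dataset" dataset-id
                            "sample_rate" rate
                            "out_of_bag" oob
                            "seed" seed}))

;; Deterministically split one dataset into [training-part test-part].
(define (split-dataset dataset-id rate seed)
  [(sample-dataset dataset-id rate false seed)
   (sample-dataset dataset-id rate true seed)])
```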

Cool. That wraps up our split-dataset function. Next up, model-evaluation.


Screen Shot 2016-06-02 at 10.55.01 AM.png

I apologize if you were hoping for something more exciting. This function is just a wrapper for the method included with WhizzML, create-and-wait-evaluation. As you can see, we are simply creating an evaluation with a model and a dataset. Our last function is…
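Something like:

```whizzml
;; Thin wrapper around the built-in create-and-wait-evaluation.
(define (model-evaluation model-id dataset-id)
  (create-and-wait-evaluation {"model" model-id
                               "dataset" dataset-id}))
```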


Screen Shot 2016-06-02 at 10.55.46 AM.png
Pretty simple too!

We take the evaluation ev-id and fetch its data from BigML (fetch ev-id). Then we access the “average_phi” attribute nested under “model” and “result”. The data object looks like this:

Screen Shot 2016-06-02 at 10.56.19 AM.png
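In code, that lookup might be (a sketch, assuming the nesting shown above):

```whizzml
;; Fetch the evaluation and pull out result -> model -> average_phi.
(define (avg-phi ev-id)
  (get (get (get (fetch ev-id) "result") "model") "average_phi"))
```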

And there we have it. A WhizzML script that helps detect covariate shift.

All together:

Screen Shot 2016-06-02 at 12.57.38 PM.png

We can run our function like this:

Screen Shot 2016-06-02 at 11.01.25 AM.png

As we read in the previous post, it is best to do this process several times and look at the average of the results. How could we add some more code to do this programmatically? Here’s one implementation.


Screen Shot 2016-06-02 at 12.58.13 PM.png

Again, we are giving this function our training-dataset and production-dataset. But we are also passing in n, which is the number of phi-coefficients we want to calculate. As you can see, we are defining a loop.

Within this loop, we set some variables.

seeds, we give the default (starting) value of (range 0 n). If we pass in 4 for the value of n then the initial value of seeds = [0 1 2 3]

out is our output. We will add the result of a phi-coefficient run each time through the loop. Initially, out = []

We also define the end-scenario.

seeds = (tail seeds). This grabs everything but the first element of seeds. So the first time through, it might be [0 1 2 3], then it will be [1 2 3], then [2 3]

If seeds is not empty, we go back to the loop, but define values for seeds and out.

If seeds is empty, then we return a map with the values list and average (we’ll explain these in a bit).

out = (append out (phi-coefficient ...)) We take the result of our phi-coefficient function and add it to the out list. The first time through, it’s [], then [-0.0838], then [-0.0838, 0.1240] etc.

The seed we will use for each of these phi-coefficient runs will be "test-0", "test-1", "test-2" etc. That’s what (str "test-" (head seeds)) is doing – joining the string "test-" with the first element of the seeds list.

The last thing we should discuss is the end-case return value:

Screen Shot 2016-06-02 at 11.13.26 AM.png
The value of “list” (out) is just the list of phi-coefficient values from each run. The “average” is… yep, the average of all the runs. reduce adds up the elements, count counts the number of elements, and / divides the first argument by the second. That’s it!*
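The loop described above could be sketched like this (loop/recur syntax; the summing is written as an explicit lambda):

```whizzml
;; Run phi-coefficient n times with seeds "test-0", "test-1", ...
;; and return the individual scores plus their average.
(define (multi-phis training-dataset production-dataset n)
  (loop (seeds (range 0 n)
         out [])
    (if (empty? seeds)
        {"list" out
         "average" (/ (reduce (lambda (a b) (+ a b)) 0 out)
                      (count out))}
        (recur (tail seeds)
               (append out (phi-coefficient training-dataset
                                            production-dataset
                                            (str "test-" (head seeds))))))))
```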

Example run:

Screen Shot 2016-06-02 at 11.20.43 AM.png

We have now automated the process to investigate whether our distribution of data has changed. Great! You might want to create a scheduled job to check your production data against the data you used to create a predictive model. When the covariate shift exceeds a threshold, retrain the model!

Why WhizzML?

Wait… couldn’t we already do this with the API bindings? What’s special about WhizzML?

Yes, we could use the API bindings. However, there are two significant advantages to WhizzML. First, what if we write this workflow in Python and later decide we want to do the same thing in a NodeJS app? We would have to rewrite the whole workflow! WhizzML lets us codify our workflow once and use it from any language. Second, WhizzML removes the complexity and brittleness of sending multiple HTTP requests to the BigML server (for creating intermediate resources, fetching data, etc.). One HTTP request is all you need to execute a workflow with WhizzML.

Stay tuned for more blog posts like this that will help you get started automating your own Machine Learning workflows and algorithms.

*There is actually one more thing we can do: a performance enhancement. In each phi-coefficient run, we recreate the train-data, prod-data and comb-data datasets. This is unnecessary – we can reuse the comb-data dataset and just sample it differently for each run! You can check out the code that includes this improvement here. Note that the comb-data logic from the phi-coefficient function is moved into the loop of multi-phis, and thus the phi-coefficient function is renamed to sample-and-score.


WhizzML Tutorial I: Automated Dataset Transformation


I hope you’re as excited as I am to start using WhizzML to automate BigML workflows! (If you don’t know what WhizzML is yet, I suggest you check out this article first.) In this post, we’ll write a simple WhizzML script that automates a dataset transformation process.

As those of you who have dealt with datasets in a production environment know, sometimes there are fields that are missing a lot of data – so much so that we want to ignore the field altogether. Luckily, BigML will automatically detect useless fields like this and ignore them when we create a predictive model. But what if we want to specify the required “completeness” of the data field? For example, we might only want to include fields that have values for more than 95% of the rows.

We can use WhizzML!

Let’s do it! Look to the WhizzML Reference Guide if you need it along the way. Also, the source code can be found in this GitHub Gist.

We want to write a function that: given a dataset and a specified threshold (e.g., 0.95), returns a new dataset with only the fields that are more than 95% populated. Our top-level function is defined below.



Hey, slow down!

Ok. Let’s take it step-by-step. We define a new function called filtered-dataset that takes two arguments: our starting dataset, dataset-id and a threshold (e.g., 0.95).

Screen Shot 2016-05-26 at 5.12.15 PM.png

What do we want this function to do? We want it to return a new dataset, hence:

Screen Shot 2016-05-26 at 5.12.23 PM.png

But we don’t just want any old dataset, we want one based off our old dataset:

Screen Shot 2016-05-26 at 5.12.32 PM.png

And we also want to exclude some fields from our old dataset!

Screen Shot 2016-05-26 at 5.12.39 PM.png

Ah, but which fields do we want to exclude? We can let a new function called excluded-fields figure that out for us. But for now, all we need to know is that this new function (excluded-fields) takes two arguments: our original dataset and our specified threshold.

The line above becomes: (indentation removed for clarity)

Screen Shot 2016-05-26 at 5.12.49 PM.png

As we progress, keep in mind that we want this new function (excluded-fields) to return a list of field names (e.g., ["field_1" "field_2" "field_3"]).

Great! We have defined our base function. Now we have to tell our new function,  excluded-fields, how to give us the list that we want.
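Putting those pieces together, the base function might read (a sketch; excluded-fields is defined next):

```whizzml
;; New dataset = old dataset minus the under-populated fields.
(define (filtered-dataset dataset-id threshold)
  (create-and-wait-dataset
    {"origin_dataset" dataset-id
     "excluded_fields" (excluded-fields dataset-id threshold)}))
```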



Wow what?
You can use that code for reference, but don’t be intimidated. We’ll go over each piece. First we define the function, declaring its two arguments: our original dataset, and the threshold we want to use.

Screen Shot 2016-05-26 at 5.12.59 PM.png

Before we write any more code, let’s talk about the meat of this function. We want to look at all the fields (columns) of this dataset, and find the ones that are missing too much data. We’ll keep the names of these “bad” fields so that we can exclude them from our new dataset. To do this, we can use the function filter. It takes two arguments: a list and a predicate (a predicate is like a test) and will return a new list based on the predicate. In our case, the predicate is that the field has less than 95% of the rows populated.

Screen Shot 2016-05-26 at 5.13.31 PM.png

The predicate should be a function that either evaluates to true or false based on each element of the list we pass to it. If the predicate returns true, then that element of the list is kept. Otherwise, it is thrown out.

We can define the predicate function using lambda. lambda is like any other function definition. We have to tell it the name of the thing we are passing into it

Screen Shot 2016-05-26 at 5.13.43 PM.png

and also tell it what we are going to do with that thing.

Screen Shot 2016-05-26 at 5.13.50 PM.png

In our case, we are checking to see if the threshold is greater than the percentage of data present. We will keep the field-name(s) that do not have enough data. (Remember, these are the fields that will be excluded from our new dataset.) Two things are still missing from our filter:

  1. all-field-names
  2. <percent-of-data-that-the-field-has>

How do we get these? The first isn’t too difficult because BigML Datasets have this information readily available. We just have to “fetch” it from BigML first.

Screen Shot 2016-05-26 at 5.14.10 PM.png

and then specify which value we want to get.

Screen Shot 2016-05-26 at 5.14.17 PM.png

Nice. To figure out what percent of the rows are populated for a specific field, we get to… Define a new function! But before we do that, let’s talk about some things we skipped over in our excluded-fields function. Here it is again, for convenience.


What is let?

let is the method for declaring local variables in WhizzML.

  1. We set the value of data to the result of (fetch dataset-id).
  2. We set the value of all-field-names to the result of (get data "input_fields")
  3. We set the value of total-rows to the result of (get data "rows"). (We didn’t talk about this yet. It’s one of the values we need to pass to the present-percent function)

let is useful for a couple of reasons in this function. First, we use data twice. So we can avoid the repetition of writing (fetch dataset-id) twice. Second, naming these variables at the top of the function makes the rest much easier to read and comprehend!

So to wrap up this excluded-fields function, let’s talk through what it does again.
First, it declares local variables that we’ll need. Then, we take the list of all-field-names and filter it based on a function that checks its “present percent” of data points. We keep the names of the fields that do not pass our predicate. Cool! Now we’ll go over that present-percent function.
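As a sketch, the whole excluded-fields function might look like this (field names and keys as described above):

```whizzml
;; Names of fields whose "present percent" falls below the threshold.
(define (excluded-fields dataset-id threshold)
  (let (data (fetch dataset-id)
        all-field-names (get data "input_fields")
        total-rows (get data "rows"))
    (filter (lambda (field-name)
              (> threshold (present-percent data field-name total-rows)))
            all-field-names)))
```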



Ah. Not so bad. To calculate the percentage of data points that are present in a given field, we need a few things:

  1. The big collection of data from our dataset (data).
  2. The name of the field we are inspecting (field-name).
  3. The total number of rows in our dataset (total-rows).

We’ll set another local variable using let and call it fields. This is another object containing data about each of the fields. We’ll be using it below.

Screen Shot 2016-05-26 at 5.14.30 PM.png

Then, we divide the missing-count from the field by the total-rows. This gives us a “missing percent”.

Screen Shot 2016-05-26 at 5.14.39 PM.png

We subtract the “missing percent” from 1 and that gives us the “present percent”!

Screen Shot 2016-05-26 at 5.14.45 PM.png
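A sketch of present-percent following those steps:

```whizzml
;; 1 - (missing rows / total rows) = fraction of rows populated.
(define (present-percent data field-name total-rows)
  (let (fields (get data "fields"))
    (- 1 (/ (missing-count field-name fields) total-rows))))
```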

But “missing-count” is another function!

Yes it is!

missing-count

Screen Shot 2016-05-30 at 2.09.51 PM.png


missing-count takes two arguments. First, the name of the field we are inspecting (field-name). Second, the fields object we mentioned earlier, which holds a bunch of information about each of the Dataset fields. To get the count of missing rows of data in the field, we do this:

Screen Shot 2016-05-26 at 5.14.54 PM.png

It lets us access an inner value (e.g., 10) from a data object structured like so:

Screen Shot 2016-05-26 at 5.27.40 PM.png
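So missing-count might be sketched as below (the exact nesting of the keys is an assumption based on the object shown above):

```whizzml
;; Look up the field's missing-row count inside the fields object
;; (nesting assumed: <field-name> -> "summary" -> "missing_count").
(define (missing-count field-name fields)
  (get (get (get fields field-name) "summary") "missing_count"))
```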

And… That’s it! We have now written all the pieces to make our filtered-dataset function work! All together, the code should look like this:


And we can run it like this:

Screen Shot 2016-05-26 at 5.15.07 PM

And get a result like this: "dataset/574317c346522fcd53000102" – a new dataset without those sparsely populated fields. I can add this script to my BigML dashboard and use it with one click. Or I can put it in a library, and incorporate it into a more advanced workflow. Awesome!

Stay tuned for more blog posts like this that will help you get started automating your own Machine Learning workflows and algorithms.

WhizzML Launch Webinar Recording is Here! In-depth WhizzML Training Series Open for Registration

Last week BigML announced WhizzML, a new domain-specific language for automating Machine Learning workflows, implementing high-level Machine Learning algorithms, and easily sharing them with others.  If you missed the announcement event, you can watch the launch webinar by clicking the link below. This webinar will be complemented by a series of in-depth training sessions for the true innovators, who are looking to push the envelope when it comes to the uptake of Machine Learning in their organizations. Consider this your FREE invitation to join this exclusive four part online event. See the details below.

WhizzML marks a turning point in how companies can automate Machine Learning as it offers out-of-the-box scalability, abstracts away the complexity of underlying infrastructure, and helps analysts, developers, and scientists double or even triple their productivity by reducing the burden of repetitive, brittle and time-consuming Machine Learning tasks.  If you complete the following four training sessions, you will not only leap ahead in your understanding of real life Machine Learning automation challenges but also receive a BigML T-shirt to commemorate your achievement.


The first session will cover all the basics describing how WhizzML is implemented on the BigML platform. Ryan Asensio, BigML’s Machine Learning Engineer, will be introducing the purpose of the language and some benefits over other ways of implementing Machine Learning workflows and algorithms. Join us on Monday, May 30, 2016 at 10:00 AM US PDT (Portland, Oregon. GMT -07:00) / 7:00 PM CEST (Valencia, Spain. GMT +02:00).

webinar2In this intermediate webinar, Charles Parker, BigML’s VP of Machine Learning Algorithms, will start exploring the WhizzML domain-specific language in greater detail, with a whirlwind tour of its syntax, programming constructs and basic standard library functions. We will also learn how to create and use WhizzML resources (libraries, scripts and executions) by means of several simple yet fully functional example workflows. It will take place on Tuesday, May 31, 2016 at 10:00 AM US PDT (Portland, Oregon. GMT -07:00) /7:00 PM CEST (Valencia, Spain. GMT +02:00). Register now, as space is limited!

webinar3In this advanced webinar, we will continue our exploration of the WhizzML language, diving into more complex examples and using more advanced features of the language. Charles Parker, BigML’s VP of Machine Learning Algorithms, will explain how some of the most effective Machine Learning algorithms can be implemented and automated on top of the BigML with WhizzML. Sign up and reserve your spot for Wednesday, June 1, 2016 at 10:00 AM US PDT (Portland, Oregon. GMT -07:00) / 7:00 PM CEST (Valencia, Spain. GMT +02:00).

webinar4In this advanced session, we will walk you through some real-world workflow automations with an eye towards the kind of problems posed by complex use cases, and use some of the best tricks to solve them with confidence. This webinar will be presented by Poul Petersen, BigML’s Chief Infrastructure Officer. It will take place on Thursday, June 2, 2016 at 10:00 AM US PDT (Portland, Oregon. GMT -07:00) / 7:00 PM CEST (Valencia, Spain. GMT +02:00). We hope to see you all there!

More training resources:

In addition to these online training sessions, if you prefer the self-study approach, you may want to download and read our WhizzML guides, documentation, tutorials, as well as the slide decks with basic, intermediate and advanced Machine Learning workflows. We have also prepared a number of useful scripts that you can practice with to get more hands on with WhizzML. You’ll find those on BigML’s Gallery. There are also plenty of example scripts and libraries available in the WhizzML Github repository. Please visit our release page and the dedicated WhizzML page to easily navigate to your resource of choice. Welcome to the world of WhizzML!
