Hands-on Summer School on Machine Learning in Valencia – 2nd Edition

The Machine Learning revolution shows no signs of slowing down, as evidenced by the continued success of leading companies like Google and Facebook, as well as the numerous tech startups putting it at the core of their value propositions. For us, it is especially encouraging to see how much the pace has picked up since our beginnings in 2011, when the BigML Team decided to take on the worthy challenge of making Machine Learning beautifully simple for everyone!

In order to play our part in increasing the awareness and application areas of Machine Learning, BigML has been actively organizing summer schools. Last year BigML helped organize the first edition of our summer school on Machine Learning, and this year we intend to improve it further with this second edition, which will take place on September 8 and 9 in Valencia, Spain.


BigML will be holding the two-day hands-on summer school for business leaders, advanced undergraduates, graduate students and industry practitioners who are interested in boosting their productivity by applying Machine Learning techniques. All lectures will take place at Las Naves from 8:30 AM to 6:00 PM CEST on September 8 and 9. You will be guided through this Machine Learning journey, starting with the basic concepts and techniques that you need to know to become the master of your data with BigML. Check out the program here!


The summer school 2016 is FREE, but by invitation only. The deadline to apply is Friday, September 2, at 9 PM CEST. Applications will be processed on an as received basis, and invitations will be granted right after individual confirmations to allow for travel plans. Make sure that you register soon since space is limited!

P.S: Following the tradition, any attendee contributing to the classroom discussion by asking questions will get a BigML t-shirt!


Datatrics is Bridging the Gap between Machine Learning and Marketing with BigML

We first ran into the predictive marketing startup Datatrics from the Netherlands at the PAPIs Connect event in Valencia earlier this year, where they competed in the first ever AI Startup Battle. The Dutch startup offers marketing teams an easy and actionable way to leverage Machine Learning with its innovative data management platform, which we believe sets a great example for other startups in showing how BigML can add to their competitive edge and supercharge their growth. So we interviewed Bas Nieland, CEO and co-founder of Datatrics, to find out more.

Bridging the ML Gap

BigML: Congrats on your high score at the first ever AI Startup Battle. Can you tell us what was the motivation behind starting Datatrics?

Bas Nieland: Nowadays digital marketers are awash with data due to the fragmentation of consumer attention across many more channels. Naturally, they are all looking for better ways to leverage all the data their companies collect, yet there is a big gap between what data can offer marketing teams and what marketers actually use. The main culprit is the perceived need for a team of data scientists and collaborating developers to make sense of all that data. Since the average small or medium-sized marketing team does not have access to such resources, new tools are needed to translate data into meaningful actions that optimize the digital customer journey.

‘An example of a 360 degree customer profile in Datatrics’

BigML: What is the lowdown on Datatrics? How does it help bridge that gap?

Bas Nieland: Datatrics was founded in 2014 and it currently has 10 employees in the Netherlands. We define ourselves as a data management platform (DMP) that helps marketing teams gain actionable insights. It is an easy and accessible platform that gives concrete insights and actions every marketer can understand. It allows marketing teams to build 360-degree customer profiles, based on internal data sources such as their CRM tools, social media accounts, websites and external data sources such as the weather, social trends and traffic information. By following the recommended Next Best Actions by Datatrics, marketing teams know exactly who to contact, at what time, with what content, and through which channel.

BigML: Can you tell a bit about how Machine Learning comes into play?

Bas Nieland: All of this is driven by smart algorithms applied to those data sources. These are powered by BigML’s Machine Learning platform, among other components that make up our platform. We especially love how BigML helps us deploy many predictive models in a fast and scalable way by abstracting away the infrastructure-level concerns involved in crunching the data. This way our product team can concentrate on the actual analytics tasks and on developing the platform for our clients. BigML is also very user-friendly and has a well-documented API, which is very important if you want to go beyond simply gaining insights and deploy scalable predictive applications to your end users.

‘An example of a Next Best Action in Datatrics’

BigML: What are some of the predictive use cases you have and which other ones are you looking to add?

Bas Nieland: I already mentioned the Next Best Action models, which are a big benefit to our audience.  We are also in the process of testing BigML’s ‘Associations’ functionality to see how it can benefit us. We believe it can make our product recommendations even more relevant.

BigML: Can you share specifics on customer traction and the measurable business outcomes Datatrics has been delivering?

Bas Nieland: We are seeing great uptake especially in retail and travel industries. Over the past year, we have noted a clear demand in the travel industry for DMPs such as Datatrics. As it is a highly competitive market, it is important for companies such as travel agencies and hotel chains to use customer insights from their data in order to communicate in a more personal and relevant way. Some of our customers have increased their revenue by as much as 30%!

BigML: That sounds great. What would you recommend to other startups and self-starting developers who want to implement similar smart applications? Any key lessons learnt that you would like to share?

Bas Nieland: They should think hard before going the route of building their Machine Learning infrastructure from scratch. Provided that you have pertinent data, platforms like BigML can help you build real-world applications very fast, and at a fraction of the cost of hiring a new analyst. Of course our platform consists of many more components and there is no one solution that fits all, but a good Machine Learning platform such as the one BigML provides can get you a long way.

BigML: Thanks Bas. It is very impressive to see how you have been able to ramp up your Machine Learning efforts in such a limited time period despite constrained resources. We hope stories like yours inspire many more startups to realize that they too can turn their data and know-how into sustainable competitive advantages.

How to Put Machine Learning in your Machine Learning

There are so many Machine Learning algorithms and so many parameters for each one.  Why can’t we just use a meta-algorithm (maybe even one that uses Machine Learning) to select the best algorithm and parameters for our dataset?

— Every first year grad student who has taken a Machine Learning class

It seems obvious, right?  Many Machine Learning problems are formalized as an optimization wherein you’re given some data, there are some free parameters, and you have some sort of function to measure the performance of those parameters on that data.  Your goal is to choose the parameters to minimize (or maximize) the given function.

But this sounds exactly like what we do when we select a Machine Learning algorithm!  We try different algorithms and parameters for those algorithms on our data, evaluate their performance and finally select the best ones according to our evaluation.  So why can’t we use the former to do the latter?  Instead of stabbing around blindly by hand, why can’t we use our own algorithms to do this for us?

In just the last five years or so, there’s been a lot of work in the academic community around this very topic (usually it’s called hyperparameter optimization, and the particular type which is getting the attention lately is the Bayesian variety) which in turn has led to a number of open source libraries like hyperopt, spearmint, and Auto-WEKA.  They all have loosely the same flavor:

  1. Try a bunch of random parameter configurations to learn models on the data
  2. Evaluate those models
  3. Create a Machine Learning dataset from these evaluations where the features are the parameter values and the objective is the result of the evaluation
  4. Model this dataset
  5. Use the model to select the “most promising” set of next parameter sets to evaluate
  6. Learn models with those parameter sets
  7. Repeat steps 2-6, adding new evaluations to the dataset described in step 3 at each iteration

Most of the subtlety here is in steps four and five.  What is the best way to model this dataset and how do we use the model to select the next sets of parameters to evaluate?

My favorite specialization of the above is SMAC.  The original version of SMAC is a bit fancier than is necessary for our purposes, so I’ll dumb it down a little here in the name of simplicity (let’s call the simpler algorithm SMACdown):

  • In step four, we’re going to grow a random regression forest as our model for the parameter space.  Say we grow 32 trees: This means that for each parameter set we evaluate using our model, we’ll get 32 separate estimates of the performance of our algorithm.  Importantly, the mean and variance of these 32 estimates can be used to define a Gaussian distribution of probable performances given that parameter set.

  • In step five, we generate a whole bunch of parameter sets (say, thousands) and pass them through the model from step four to generate a Gaussian for each one.  We then measure, for each Gaussian, how much of its lower tail lies below our current best evaluation.  The ones with the most area below the current best are our most promising candidates.
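To make the scoring in step five concrete, here is a minimal Python sketch (not the WhizzML from the script). It assumes that lower evaluation values are better, as the "lower tail" wording suggests, and that each candidate comes with one estimate per tree. The function names and the toy numbers are made up for illustration.

```python
from statistics import NormalDist, mean, pstdev

def improvement_probability(tree_estimates, current_best):
    """Share of the fitted Gaussian's area that lies below current_best."""
    mu = mean(tree_estimates)
    sigma = pstdev(tree_estimates)
    if sigma == 0:
        # All trees agree, so the outcome is treated as certain.
        return 1.0 if mu < current_best else 0.0
    return NormalDist(mu, sigma).cdf(current_best)

def most_promising(candidates, current_best, k=1):
    """candidates maps a candidate label to its list of per-tree estimates."""
    ranked = sorted(
        candidates,
        key=lambda name: improvement_probability(candidates[name], current_best),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical per-tree estimates for three candidate parameter sets.
candidates = {
    "a": [0.30, 0.32, 0.31, 0.29],  # confidently worse than the best so far
    "b": [0.18, 0.22, 0.20, 0.24],  # probably better
    "c": [0.10, 0.40, 0.25, 0.35],  # very uncertain
}
best_so_far = 0.25
print(most_promising(candidates, best_so_far, k=2))
```

Note that a highly uncertain candidate can outrank one whose mean is only slightly worse but tightly clustered, which is what lets the search keep exploring.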

With most of the details settled, all that’s left is to choose a language in which to implement the algorithm.

How about WhizzML?

Why would we choose WhizzML?  For starters, it allows us to kiss our worries about scalability goodbye.  We can prototype our script on some small datasets, then run exactly the same script on datasets that are gigabytes in size.  No extra libraries or hardware; it will just work out of the box.

Second, because the script itself is a BigML resource, it can be run from any language from which you can POST an HTTP request to BigML, and you can consume the results of that call as a JSON structure.  With WhizzML, there’s no longer the necessity of working in a particular language; you can implement once in WhizzML and run from anywhere.

We aren’t going to go through all of the code in detail, but we’ll hit on some of the major points here.

Our goal here is going to be to optimize the parameters for an ensemble of trees.  We’ll start by creating a function that generates a random set of parameters for an ensemble.  That looks like this:

[Code listing: random params]

We use WhizzML’s lambda to define a function with no arguments that will generate a random set of parameters for our ensemble.  Note that we need to know if this is going to be a classification or a regression in advance, as setting balance_objective to true for regression problems is invalid.  This function returns a function that can be invoked over and over again to generate different sets of parameters each time.
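The same "function that returns a function" idea can be sketched in Python. This is only an illustration of the pattern, not the script's code: apart from balance_objective, which is mentioned above, the parameter names and ranges are assumptions chosen for the example.

```python
import random

def make_params_generator(is_classification, seed=None):
    """Return a zero-argument function producing a fresh random parameter set."""
    rng = random.Random(seed)

    def generate():
        # Illustrative parameters; names and ranges are assumptions.
        params = {
            "number_of_models": rng.randint(2, 64),
            "sample_rate": rng.uniform(0.5, 1.0),
            "randomize": rng.random() < 0.5,
        }
        # balance_objective only makes sense for classification problems,
        # so it is left out entirely for regression.
        if is_classification:
            params["balance_objective"] = rng.random() < 0.5
        return params

    return generate

generator = make_params_generator(is_classification=True, seed=42)
first, second = generator(), generator()
print(first)
print(second)
```

Because the problem type is captured once by the closure, every later call can stay argument-free.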

The process of evaluating these generated parameter sets is fairly simple; for each parameter set you want to evaluate, you create an ensemble, perform an evaluation on your holdout set (you did hold out some data, didn’t you?), then pull out or create the metric on which you want to evaluate your candidates.

Once you have these evaluations in hand, you need to model them (step four).  That’s done here:

[Code listing: make ensemble]

Here, we make the random forest described above.  The helper smackdown--data->dataset creates a dataset from our list of parameter evaluations.  We then create a series of random seeds and create a model for each one, returning the list of IDs.

The next thing is to create a bunch of new parameter sets and use our constructed model to evaluate them:

[Code listing: make predictions]

The data argument here is our new list of parameter sets (created elsewhere by multiple invocations of the model-params-generator defined above), and mod-ids is the list of model IDs created by the smacdown--create-ensemble.  The logic here is again fairly simple:  We create a batch prediction for each model, then create a sample from each batch predicted dataset so we can pull all of the rows for each prediction into memory.  We’re left with a row of predictions for each datapoint in data.
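The reshaping described here, from one list of predictions per model to one row of predictions per datapoint, is essentially a transpose. A small Python sketch with made-up numbers (the function names are hypothetical and nothing here calls BigML):

```python
from statistics import mean, pvariance

def rows_from_model_predictions(per_model_predictions):
    """Turn [model][datapoint] predictions into [datapoint][model] rows."""
    return [list(row) for row in zip(*per_model_predictions)]

def summarize(rows):
    """Mean and variance of the per-model predictions for each datapoint."""
    return [(mean(row), pvariance(row)) for row in rows]

# Three models, each predicting for the same three candidate parameter sets.
per_model = [
    [0.20, 0.35, 0.50],
    [0.24, 0.31, 0.40],
    [0.22, 0.33, 0.60],
]
rows = rows_from_model_predictions(per_model)
for row, (mu, var) in zip(rows, summarize(rows)):
    print(row, mu, var)
```

Each row's mean and variance are what the next step needs to describe a Gaussian for that candidate.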

Another function is applied to these lists to pull out the mean and variance from each one, then to compute, given the current best evaluation, which of these has the greatest chance to improve on our current best solution (that is, which has the highest percentage of the area under its Gaussian below the current best solution).

There are a number of details here we’re glossing over, but thankfully you don’t have to know them all to run the script.  In fact, you can clone it right out of BigML’s script gallery.

What’s the takeaway from all of this?  Mainly, we want you to see that WhizzML is expressive enough to let you compose even complex meta-algorithms on top of BigML’s API.  When you choose to use it, WhizzML offers you scalability and language-agnosticity for your Machine Learning workflows, so that you can run them on any data, any time.

No excuses left now!  Go give it a shot and let us know what you think in the comments below.

WhizzML: Level Up with Gradient Boosting

Let’s get serious.

Sure, you can use WhizzML to fill in missing values or to do some basic data cleaning, but what if you want to go crazy?  WhizzML is a fully-fledged programming language, after all.  We can go as far down the rabbit hole as we want.

As we’ve mentioned before, one of the great things about writing programs in WhizzML is access to highly-scalable, library-free machine learning.  To put it another way, cloud-based machine learning operations (learn an ensemble, create a dataset, etc.) are primitives built into the language.

Put these two facts together, and you have a language that does more than just automate machine learning workflows.  We have the tools here to actually compose new machine learning algorithms that run on BigML’s infrastructure without any need for you, the intrepid WhizzML programmer, to worry about hardware requirements, memory management, or even the details of the API calls.

What sort of algorithms are we talking about, here?  Truth be told, many of your favorite machine learning algorithms could be implemented in WhizzML.  One important reason for this is that many machine learning algorithms feature machine learning operations as primitives.  That is, the algorithm itself is composed of steps like model, predict, evaluate, etc.

As a demonstration, we’ll take a look at gradient tree boosting.  This is an algorithm that has gotten a lot of praise and press lately due to its performance in general, and the popularity of the xgboost library in particular.  Let’s see if we can cook up a basic version of this algorithm in WhizzML.

The steps to gradient boosting (for classification) are as follows:

  1. Compute the gradient of the objective with respect to the currently predicted class probabilities (which start out as, e.g., uniform over all classes) for each training point (optionally, on only a sample of the data)
  2. Learn a tree for each class as a functional approximation of this gradient step
  3. Use the tree to predict the approximate gradient at all training points
  4. Sum the gradient predictions with the running gradient sums for each point (these all start out as zero, of course).
  5. Use something like the softmax transformation to generate class probabilities from these scores
  6. Iterate steps 1 through 5 until a stopping condition is met (such as a small gradient magnitude).
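The arithmetic in steps 1, 4 and 5 can be sketched in plain Python. This is a simplified illustration, not the WhizzML implementation. It assumes a softmax cross-entropy objective, for which the negative gradient with respect to the scores is the one-hot label minus the predicted probabilities. It also replaces the trees of steps 2 and 3 with a stand-in that reproduces the gradient exactly, and it stops after a fixed number of rounds. The learning_rate parameter is likewise an assumption.

```python
import math

def softmax(scores):
    """Step 5: turn a row of scores into class probabilities."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def negative_gradient(probs, true_class):
    """Step 1: one-hot(true_class) - probs for softmax cross-entropy."""
    return [(1.0 if k == true_class else 0.0) - p for k, p in enumerate(probs)]

def boost(labels, n_classes, n_rounds=10, learning_rate=0.5):
    # Running gradient sums, one row per training point, all starting at zero.
    scores = [[0.0] * n_classes for _ in labels]
    for _ in range(n_rounds):
        probs = [softmax(row) for row in scores]
        grads = [negative_gradient(p, y) for p, y in zip(probs, labels)]
        # Steps 2-3 stand-in: a real implementation would fit one tree per
        # class to these gradients and use the trees' predictions instead.
        predicted = grads
        # Step 4: add the (approximate) gradient step to the running sums.
        scores = [
            [s + learning_rate * g for s, g in zip(score_row, grad_row)]
            for score_row, grad_row in zip(scores, predicted)
        ]
    return [softmax(row) for row in scores]

labels = [0, 2, 1]
for label, probs in zip(labels, boost(labels, n_classes=3)):
    print(label, [round(p, 3) for p in probs])
```

With the stand-in, every training point simply drifts towards its own label. The interesting behaviour in real boosting comes from the trees generalizing the gradient across points with similar features.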

You can see here that machine learning primitives feature prominently in the algorithm.  Step two involves learning one or more trees.  Step three uses those trees to make predictions.  Obviously, those steps are easily accomplished with the WhizzML builtins create-model and create-batchprediction, respectively.

But there are a few other steps where the WhizzML implementation isn’t as clear.  The gradient computation, summing of the predictions, and application of the softmax transformation don’t have (very) obvious WhizzML implementations, because they are operations that iterate over the whole dataset.  In general, the way we work with the data in WhizzML is via calls to BigML rather than explicit iteration.

So are there calls to the BigML API that we can make that will do the computations above?  There are, if we use Flatline.  Flatline is BigML’s DSL for dataset transformation, and fortunately all of the above steps that aren’t learning or prediction can be encoded as Flatline transformations.  Since Flatline is a first class citizen of WhizzML, we can easily specify those transformations in our WhizzML implementation.

Take step four, for example.  Suppose we have our current sum of gradient steps for each training point stored in a column of the dataset, and our predictions for the current gradient step in another.  If those columns are named current_sum and current_prediction, respectively, then the Flatline expression for the sum of those two columns is:

(+ (f "current_sum") (f "current_prediction"))

where the f Flatline operator gets the value for a field given the name.  Knowing that we have a running sum and a set of predictions for each class, we need to construct a set of Flatline expressions to perform these sums.  We can use WhizzML (and especially the flatline builtin) to construct these programmatically:

[Code listing: sum columns]

Here, we get the names for all of the running sum, current prediction, and new sum columns into the last-sums, this-preds, and this-sums variables, respectively.  We then construct the flatline expression that creates the sum, and call make-fields (a helper defined elsewhere) to create the list of flatline expressions mapped to the new field names.  The helper add-fields then creates a new dataset containing the created fields.
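To illustrate the idea of composing one Flatline expression per class, here is a Python sketch that builds the expressions as plain strings. It assumes the two-column sum has the form (+ (f "a") (f "b")). The column naming scheme, the function names, and the "name"/"field" keys of the resulting maps are assumptions made for the example, not taken from the script.

```python
def sum_expression(sum_col, pred_col):
    """Flatline expression (as a string) adding two named columns."""
    return f'(+ (f "{sum_col}") (f "{pred_col}"))'

def make_sum_fields(class_names, iteration):
    """One new-field description per class for the given boosting iteration."""
    fields = []
    for name in class_names:
        last_sum = f"{name}_sum_{iteration - 1}"   # running sum so far
        this_pred = f"{name}_pred_{iteration}"     # this round's prediction
        this_sum = f"{name}_sum_{iteration}"       # column to be created
        fields.append({
            "name": this_sum,
            "field": sum_expression(last_sum, this_pred),
        })
    return fields

for field in make_sum_fields(["setosa", "versicolor"], iteration=3):
    print(field["name"], "<-", field["field"])
```

The point is that the program only assembles expressions. The actual row-by-row arithmetic is left to the platform that evaluates them.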

We can do roughly the same thing to compute the gradient and apply the softmax transformation: we use WhizzML to compose Flatline expressions, then allow BigML to do the dataset operation on its servers.

This is just a peek into what a gradient boosting implementation might look like in WhizzML.  For a full implementation of this and some other common machine learning workflows, check out the WhizzML tutorials.  We’ve even got a Sublime Text Package to get you started writing WhizzML as quickly as possible.  What are you waiting for?

PAPIs ’16 in Boston is Open for Registration

Registration is now open for PAPIs ’16, which will take place in Boston on October 11–12, 2016. This will be the third edition of the International Conference on Predictive Applications and APIs and the first time the event will be held in the United States. PAPIs brings together leaders from all around the world, as well as newcomers to the field, to discuss opportunities, challenges and new developments in the space of intelligent applications, Machine Learning tools and APIs.

Here is a teaser video shared by the organizers to whet your appetite:

Watch the Video

This year the conference will be held at Microsoft’s New England Research & Development center. Talks by distinguished speakers from Uber, Telefonica, Google, Amazon and BigML, multiple networking sessions, as well as the second edition of the AI Startup Battle are all part of this year’s program.

The PAPIs organizers always make a big effort to keep this event high on both quality and affordability for all interested parties. As a result, tickets tend to sell out quickly. So don’t procrastinate, and secure your registration well in advance!

Machine Learning Prague Videos are Ready!

BigML had the pleasure of participating in the inaugural Machine Learning Prague conference, which brought together European companies and startups as well as academics specializing in Machine Learning. To us, it was one more piece of evidence that, far from being a Silicon Valley fad, Machine Learning is a global phenomenon, and that the creativity, talent and ambition to match are already present in many corners of the world.

Machine Learning Prague 2016 - Adam Ashenfelter

In the spirit of passing the knowledge on to thousands more who could not be there, the organizers of Machine Learning Prague have now made video recordings of all the sessions available on their YouTube channel. Among the highlights, you will notice the presentation on Anomaly Detection by BigML’s Adam Ashenfelter.  The session starts with a high-level review of various anomaly detection techniques and then delves into the specifics of the versatile unsupervised Isolation Forest technique, making it a great primer on the topic.

Also of note is the presentation by Yandex’s Michael Levin, which explains how Yandex has been able to adopt Machine Learning across its teams by investing in a homegrown platform built mainly on Gradient Boosted Trees.  This platform has successfully been applied to many different use cases across the company.  As such, it is one more data point in support of a standardized approach instead of relying on custom implementations on a project-by-project basis. Other examples, like the announcement by Facebook that it is prioritizing Machine Learning as a core developer competency, are especially striking. Google’s recent article about their ML Ninja program is yet another example. These are great signs that the Wild West era of Machine Learning is coming to a close, and we are seeing a maturing marketplace with tools that can measure up to the biggest unmet challenge: how do we take Machine Learning from being seen as Voodoo Magic to becoming an essential component of every developer’s toolbox?

BigML’s mission has always been democratizing Machine Learning by providing companies of all sizes a consumable, programmable and scalable Machine Learning platform so they can tackle even complex problems with nothing more than their domain expertise, development skills, and the passion to innovate.  How so?  By providing free educational material, a well-documented API and even a domain specific language to automate sophisticated Machine Learning workflows, implement high level algorithms and share those with others. Let us know your thoughts on how your organization is planning on managing this key transformation.

Programmatically Fill in Missing Values in Your Dataset with WhizzML

For new WhizzML developers, WhizzML’s power as a full-blown functional programming language can sometimes obscure the relationship between WhizzML and the BigML Machine Learning platform. At BigML, we refer to WhizzML as a functional programming language for orchestrating workflows on the BigML platform. In this post we describe an example script in the WhizzML script gallery for filling in missing data values in a BigML Dataset object to help elucidate how WhizzML and the BigML machine learning platform interact.

The BigML developer documentation provides one view into the Machine Learning functions the BigML platform makes available to users. This functionality can be accessed through multiple programming methods, including the BigML REST API, the downloadable BigMLer command line tool, BigML bindings for all popular programming languages, and now WhizzML functions.  There is an important difference between the first three programming methods (REST API, BigMLer, and bindings) and WhizzML: solutions using the first three methods run on the user’s own platforms, increasing the volume of data and metadata transfers between those platforms and the BigML platform. Production WhizzML scripts run on the BigML platform, eliminating data transport costs and leveraging parallelism and performance optimizations for BigML Machine Learning on large datasets.

WhizzML and Flatline

When using WhizzML to orchestrate workflows, you might quickly come up against an additional subtlety: To realize the full potential of WhizzML, WhizzML functions should not themselves process the data in datasets, but only orchestrate execution of BigML Machine Learning functions in the BigML API.  However, in your ML application, you might need to process dataset data in unique ways. For example, the API to create a BigML Cluster object from a BigML Dataset object includes an argument “default_numeric_value” that allows us to specify the single type of the numeric value — “mean”, “median”, “minimum”, “maximum”, “zero” — (the first four types computed on a per-column basis) that should be used to fill all missing values in a dataset in all columns considered in the clustering operation.

It is easy to conceive of applications where you might need more flexibility in filling missing values in a dataset.  We don’t want to do this by processing the data in WhizzML itself, because then we couldn’t leverage all the performance benefits the BigML platform provides for handling datasets.  This is where we can turn to Flatline in WhizzML.  Flatline is a row-oriented processing language for datasets in the BigML platform itself.  The BigML Developer tools include a Flatline editor for directly applying Flatline operations to datasets, but we can also use Flatline directly in WhizzML.

The WhizzML script Clean Data Fill in the BigML WhizzML script gallery is an example of how we can use WhizzML and Flatline to fill in missing values in a dataset by using default values supplied in a map to a WhizzML function.  We can’t cover all of the Flatline operations and use cases here, so in our example we’ll just show how to apply the Flatline function:

(all-with-defaults <field-designator-0> <field-value-0>
                   <field-designator-1> <field-value-1>
                   <field-designator-n> <field-value-n>)

to modify a dataset.  We do this by using the built-in WhizzML (flatline …) function and a macro-like structure to fill in the field information:

(flatline "(all-with-defaults @{{fargs}})")

where fargs is a WhizzML list that includes field key/value (field names/values) pairs as sequential entries.  Our example script does this by using four WhizzML functions, three of which are actually functions to simplify the task of specifying default values for the fourth function that does the real work by using Flatline.

Specifying Default Values for Missing Dataset Values

Arguably the most burdensome task we have to undertake is to build a map of default values to fill in the missing values in a dataset. Three of the four functions in our example WhizzML script (extract-meta …), (extract-meta-func …), and (generate-configmap …) implement our illustrative approach for doing this.  Before discussing these functions, note that the metadata for BigML Dataset objects include two properties “input_fields” and “fields” that provide the metadata items we need to build our default value map.  The “input_fields” property is a list of field (column) IDs in the dataset, e.g.:

{ ...

The “fields” property is a dictionary of summary information for each field keyed on the IDs in the “input_fields” property, e.g.:

{ ...
  { ...
   :datatype "double",
   :name "Employment Rate",
   :optype "numeric",
   { ...
    :mean 58.35941,
    :median 58.00162,
    :minimum 29.96302,
    :maximum 83.55616,
     ... }}
    ... }

The first three functions in our script process the “input_fields” and “fields” properties of the input dataset metadata to generate a template map for specifying default values to the function that fills the missing values in the dataset.

The first function (extract-meta …), is a helper function that accepts the submap {:fields {:00000 {…}} for a single field from the “fields” property as an input parameter:

(define (extract-meta mpi) 
  (let (mpis (get mpi "summary")
        mpos {"mean" (get mpis "mean")
              "median" (get mpis "median")
              "minimum" (get mpis "minimum")
              "maximum" (get mpis "maximum")})
    {"datatype" (get mpi "datatype") 
     "name" (get mpi "name")
     "optype" (get mpi "optype")
     "summary" mpos}))

The function extracts and returns just the contents we need for the corresponding field entry in our default value map:

{:datatype "double",
 :name "Employment Rate",
 :optype "numeric",
 {:mean 58.35941,
  :median 58.00162,
  :minimum 29.96302,
  :maximum 83.55616}}

This map provides the minimum information you might find useful about the type and contents of a column.

The next function (extract-meta-func …) is a factory function that returns a lambda function suitable for use in a WhizzML (reduce fn {…} [..]) function.

(define (extract-meta-func ds)
  (let (fields (get ds "fields"))
    (lambda (mp id)
      (let (mpi (get fields id)
            mpo (extract-meta mpi))
        (assoc mp id mpo)))))

This function creates a closure that captures the contents of the “fields” property of the metadata for the dataset whose ID is supplied as the “ds” input parameter. The returned lambda function (lambda (mp id) …) accepts a partial metadata map “mp” and a column “id” (from the “input_fields” property of the dataset metadata map) as input parameters. It returns a new version of the input map augmented with the submap returned by the (extract-meta …) function for the column specified by “id”.

Our third function (generate-configmap …) just repeatedly applies the function returned by (extract-meta-func …) to the dataset metadata to build up a template map for supplying default values to our dataset:

(define (generate-configmap dataset-id)
  (let (ds (fetch dataset-id)
        flds (get ds "input_fields")
        metafn (extract-meta-func ds))
    (reduce metafn {} flds)))

The result is a WhizzML map of maps, keyed on field ID, with one submap per column holding the minimum metadata for that column.  For each field, we can then add a property “default” to the submap for the field to specify the value that should be plugged in to the rows of the dataset with missing values in that column:

 {:datatype "double",
  :name "Employment Rate",
  :optype "numeric",
  {:mean 58.35941,
   :median 58.00162,
   :minimum 29.96302,
   :maximum 83.55616},
  :default 0.0}
    ... }
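For readers more at home in Python, the closure-plus-reduce pattern behind these three helper functions can be sketched as follows. This is a loose analogue, not a translation that runs on BigML: the dataset is a hand-written dictionary rather than a fetched resource, the field ID "000000" and the extra "preferred" key are made up, and using the column mean as the default is just one possible choice.

```python
from functools import reduce

SUMMARY_KEYS = ("mean", "median", "minimum", "maximum")

def extract_meta(field_info):
    """Keep only the metadata needed for the default-value template."""
    summary = field_info.get("summary", {})
    return {
        "datatype": field_info.get("datatype"),
        "name": field_info.get("name"),
        "optype": field_info.get("optype"),
        "summary": {key: summary.get(key) for key in SUMMARY_KEYS},
    }

def extract_meta_func(dataset):
    """Factory: returns a reducer that closes over the dataset's fields."""
    fields = dataset["fields"]

    def add_field(acc, field_id):
        # Return a new map instead of mutating the accumulator.
        return {**acc, field_id: extract_meta(fields[field_id])}

    return add_field

def generate_configmap(dataset):
    return reduce(extract_meta_func(dataset), dataset["input_fields"], {})

# Hand-written stand-in for dataset metadata (hypothetical field ID).
dataset = {
    "input_fields": ["000000"],
    "fields": {
        "000000": {
            "datatype": "double",
            "name": "Employment Rate",
            "optype": "numeric",
            "preferred": True,
            "summary": {"mean": 58.35941, "median": 58.00162,
                        "minimum": 29.96302, "maximum": 83.55616},
        }
    },
}

config = generate_configmap(dataset)
# Fill in a default per field; here, the column mean.
for meta in config.values():
    meta["default"] = meta["summary"]["mean"]
print(config)
```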

Filling Missing Dataset Values with Flatline

Once we have a map that explicitly specifies the default values for the columns of our dataset, we can use the fourth function in the example WhizzML script (fill-missing …) to create a new dataset with all missing values in the source dataset specified by “dataset-id” replaced with the default values in the “dflt-mp” map:

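;; Creates a new dataset in which missing values are replaced by the
;; defaults in dflt-mp (per-field submaps with "name" and "default").
(define (fill-missing dataset-id dflt-mp)
  ;; frdce appends one field's name and default value to the running list
  (let (frdce (lambda (lst itm) 
                (let (dkey (get itm "name")
                      dval (get itm "default"))
                  (append (append lst dkey) dval)))
        ;; fargs is the flat list of name/default pairs for all fields
        fargs (reduce frdce [] (values dflt-mp)))
    (log-info fargs)
    ;; Drop the original fields and regenerate them all through Flatline
    (create-and-wait-dataset {"origin_dataset" dataset-id
                              "all_fields" false
                              "new_fields" [{"fields" (flatline "(all-with-defaults @{{fargs}})")}]})))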

This function first declares a function frdce that is used in a WhizzML (reduce …) function to extract a WhizzML list fargs of sequential per-column name-value pairs.

The heart of our example (fill-missing …) function is the WhizzML (create-and-wait-dataset …) function that creates a modified copy of the source dataset with our default values inserted. Referring to the BigML API documentation for the Dataset object API arguments for extending a dataset, a false value for the “all_fields” argument specifies that the function should not pass any of the input fields of the source dataset directly to the new dataset. The “new_fields” argument specifies new fields that should be added to the new dataset by using Flatline.

Our example function uses a “new_fields” argument form that includes a WhizzML map in the argument  [{“fields” (flatline …)}], which specifies values for all of the fields in the new dataset with a single Flatline expression.   The (flatline …) function  accepts a single string argument that is passed to the BigML backend at execution time. The string argument “(all-with-defaults @{{fargs}})” in turn incorporates a WhizzML macro form where fargs is the WhizzML list of sequential per-column name-value pairs, which were defined earlier. When the (flatline …) function is executed, WhizzML expands the string argument with the value of fargs.  The resulting string value for the “new_fields” argument is passed to the BigML backend along with the other arguments by the (create-and-wait-dataset …) function. The BigML platform backend generates the new dataset with the default values inserted by using whatever optimizations it can.

A Final Comment

Our WhizzML script is primarily intended as an example of how you can use WhizzML and Flatline to process datasets on the BigML platform backend.   In your applications, you may want to compute default values in other ways or perform other data manipulations. For instance,  you may want to compute default values on a row-basis by using Flatline rather than on a column-basis. Most data manipulations can be accomplished by using WhizzML and Flatline, but some computations may be harder to implement than others. We will take up other ways to use WhizzML and Flatline to facilitate Machine Learning tasks in subsequent WhizzML demonstration scripts and blog posts.


Using Anomaly Detectors to Assess Covariate Shift


BigML first discussed some time ago how the performance of a predictive model can suffer when the model is applied to new data generated from a different distribution than the data used to train the model. Machine Learning practitioners commonly identify two types of data variations that can cause problems for predictive models. The first, Covariate Shift, refers to differences between the distribution of the data fields used as predictors in the training and production datasets. The other type of variation, Dataset Shift, denotes changes in the joint distribution of the predictors and the predicted data fields arising between the training and production datasets. A recent blog post showed how to implement one technique for detecting both types of data shift by using WhizzML.

The introduction of Anomaly Detectors in BigML provides yet another means of using WhizzML to detect data shifts that can affect the performance of predictive models. As background for the simple technique we describe next, you can read more about anomaly detection in a previous BigML blog post. In a nutshell, an anomaly detector is an isolation forest (iforest) of over-fitted decision trees. Anomalous data items are outliers from the dataset and are therefore isolated at shallower depths in the decision trees. The depth at which an item is isolated, compared to the average depth of the decision trees, is converted to an anomaly score that ranges from 0 (least anomalous) to 1 (most anomalous).
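To make the depth-to-score conversion concrete, here is a minimal Python sketch of the normalization used in the standard isolation forest formulation. It is an assumption that BigML's conversion resembles this one; the exact formula used by the platform is not stated in this post, and the function names are ours.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def average_path_length(n):
    """Expected depth c(n) of an unsuccessful search in a binary tree built on n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_depth, sample_size):
    """Map the mean isolation depth of an item to a score in (0, 1]."""
    return 2.0 ** (-mean_depth / average_path_length(sample_size))


# Items isolated at shallow depth score higher than items isolated deep in the trees.
print(anomaly_score(2.0, 256), anomaly_score(10.0, 256))
```

Under this normalization, an item isolated at exactly the average depth scores 0.5, and shallower isolation pushes the score toward 1.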

BigML provides two anomaly detector functions in WhizzML useful for building a data shift detector:

  • (create-and-wait-anomaly …): We can use this function to build an anomaly detector object from the same training dataset we use to build a predictive model.
  • (create-and-wait-batchanomalyscore …): Once we have built an anomaly detector, we can use this function to apply that anomaly detector to a production dataset.

There are some features of the (create-and-wait-batchanomalyscore …) function that are useful for our purpose.   When this function is applied to the input production dataset, it creates a Batch Anomaly Score object and an output Dataset object that includes every row of the input production dataset object with an added score field containing the anomaly score for that row.  The batch anomaly score function also adds summary metadata to the output dataset metadata that we can use to compute the desired dataset shift measure.

The BigML WhizzML script gallery includes an example Anomaly Shift Estimate script that demonstrates how to use the anomaly detector functions to create a dataset shift measure. In the rest of this post, we describe the component functions in this demonstration WhizzML script. The script can be used as-is, or you can use the component functions as starting points for custom WhizzML scripts.

A Few Helper Functions to get Started

To begin, we recall that predictive models are typically built by learning a model from a training subset of the source data.  The data shift detection script starts with a simple WhizzML function (sample-dataset …) that allows us to select a subset of the training dataset:

(define (sample-dataset dst-id rate oob seed)
  (create-and-wait-dataset {"sample_rate" rate
                            "origin_dataset" dst-id
                            "out_of_bag" oob
                            "seed" seed}))

This minimal helper function primarily illustrates the few parameters one would likely want to use to select a subset of an input dataset. In a WhizzML script customized for your application, you may want to use other parameters of the (create-and-wait-dataset …) function, which is described in the BigML Dataset documentation.

The script also includes a minimal WhizzML helper function (anomaly-evaluation …) to apply an anomaly detector to every row of the production dataset:

(define (anomaly-evaluation anomaly-id dst-id)
  (create-and-wait-batchanomalyscore {"anomaly" anomaly-id
                                      "dataset" dst-id
                                      "all_fields" true
                                      "output_dataset" true }))

Again, just those parameters of the (create-and-wait-batchanomalyscore …) function needed to apply the “anomaly” detector with anomaly-id to an input “dataset” with dst-id are used.  Specifying both the “output_dataset”  and “all_fields” parameters as true requests creation of an output dataset that includes all  fields in the input dst-id and an anomaly score for each row in the dataset.  The Batch Anomaly Score documentation describes the full set of parameters you might find useful in your own WhizzML scripts.

The script includes one last WhizzML helper function (avg-anomaly …). The (create-and-wait-batchanomalyscore …) function adds summary metadata to the basic metadata of the output dataset it creates, and (avg-anomaly …) uses that metadata to compute an average anomaly score that measures how anomalous the input dataset is relative to the training set used to build the anomaly detector:

(define (avg-anomaly evdst-id)
  (let (evdst (fetch evdst-id)
        score-field (get-in evdst ["objective_field" "id"])
        sum (get-in evdst ["fields" score-field "summary" "sum"])
        population (get-in evdst ["fields" score-field "summary" "population"]))
    (/ sum population)))

There are a few details worth noting here. We first must fetch the output dataset evdst identified by evdst-id. The metadata associated with evdst includes a map “objective_field” with a property “id” that identifies the score-field in the metadata containing the anomaly results we need. Using that score-field value, we can access the “summary” sub-map in the “fields” sub-map of the metadata, where the total sum of the anomaly scores for all rows in the dataset and the population count of the number of rows in the dataset are found. We return the quotient of these two quantities, the average anomaly score for the entire dataset, as our measure of data shift. The Dataset Properties section of the Dataset documentation provides more information about the properties we describe here as well as other properties you might find useful in a custom WhizzML script.
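The same lookup can be illustrated in plain Python against a mocked-up metadata map. This is only a sketch: the dictionary below is a hypothetical, heavily trimmed stand-in for real dataset metadata, and the field ID and numbers are invented.

```python
def avg_anomaly(dataset_meta):
    """Average anomaly score = sum of the score field / number of rows."""
    score_field = dataset_meta["objective_field"]["id"]
    summary = dataset_meta["fields"][score_field]["summary"]
    return summary["sum"] / summary["population"]


# Hypothetical metadata for an output dataset with 100 scored rows.
mock_meta = {
    "objective_field": {"id": "000005"},
    "fields": {
        "000005": {
            "name": "score",
            "summary": {"sum": 42.0, "population": 100},
        }
    },
}

print(avg_anomaly(mock_meta))
```

Because the sum and population are already present in the summary metadata, no row-level data has to be downloaded to compute the average.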

Using our Helper Functions to do Anomaly Scoring

We now can combine these minimal helper functions into a single function that computes an anomaly score for an entire production dataset relative to a training dataset.

(define (anomaly-measure train-dst train-exc prod-dst prod-exc seed clean)
  (let (traino-dst (sample-dataset train-dst 0.8 false seed)
        prodo-dst (sample-dataset prod-dst 0.8 true seed)
        anomaly (create-and-wait-anomaly {"dataset" traino-dst
                                          "excluded_fields" train-exc})
        ev-id (anomaly-evaluation anomaly prodo-dst)
        evdst-id (get-in (fetch ev-id) ["output_dataset_resource"])
        score (avg-anomaly (wait evdst-id)))
    (if clean
      (prog (delete evdst-id)
            (delete ev-id)
            (delete anomaly)
            (delete prodo-dst)
            (delete traino-dst)))
    score))

In summary, this (anomaly-measure …) function:

  1. Creates samples of both datasets (traino-dst,  prodo-dst)
  2. Creates an anomaly detector from the training sample (anomaly)
  3. Applies the anomaly detector to the production sample to create a batch score (ev-id)
  4. Computes the average anomaly score for the entire production sample (score)

This function also includes several details you might handle differently in your own WhizzML scripts. The “train-exc” parameter is a WhizzML list of fields in the training dataset that should be ignored when the anomaly detector is created. The “prod-exc” input parameter is ignored here since the contents of the “train-exc” input parameter determine what fields the anomaly detector will ignore in the production dataset.

In addition to these input parameters, there are some internal details of the function that should be noted. The (anomaly-evaluation …) function returns the ID of a Batch Anomaly Score object identified by ev-id; the metadata map for this object includes a property “output_dataset_resource” that contains the BigML ID evdst-id of the output dataset created by the batch anomaly score function. We note that the BigML platform backend produces the batch anomaly score object before the output dataset object is complete. We must use the (wait …) function or an equivalent operation to ensure the dataset referenced by “output_dataset_resource” is available before we attempt to access the anomaly score information we need in the output dataset metadata.

Finally, our (anomaly-measure …) function includes some housekeeping features to support the higher-level functions in our WhizzML script. You might find similar features useful in your own scripts. The “seed” input string parameter passed to the (sample-dataset …) functions causes deterministic, and therefore repeatable, sampling. Specifying the “clean” input as true causes the function to delete the intermediate working objects it creates before returning the average anomaly score. This can be helpful when one repeatedly computes the average anomaly score on a sequence of pairs of subsets of the training and production datasets.

As just suggested, in practice we likely would want to repeatedly sample the training and production datasets and compute a sequence of average anomaly scores for that sequence of samples. The (anomaly-loop …) function in the script explicitly does this in a form that illustrates how you could also easily add other computations or logging in your own custom WhizzML scripts:

(define (anomaly-loop train-dst train-exc prod-dst prod-exc seed niter clean logf)
  (loop (iter 1
         scores-list [])
    (if logf
      (log-info "Iteration " iter))
    (let (score (anomaly-measure train-dst train-exc prod-dst prod-exc (str seed " " iter) clean)
          scores-list (append scores-list score))
      (if logf
        (log-info "Iteration " iter scores-list))
      (if (< iter niter)
        (recur (+ iter 1) scores-list)
        scores-list))))

This function just calls the (anomaly-measure …) function “niter” times and returns the resulting sequence of average anomaly scores.  Note that the input parameters include the “clean” boolean parameter specifying whether the intermediate objects created by each use of the (anomaly-measure …) function should be preserved or deleted.  Finally, this function illustrates how we can use logging features on the BigML platform to log results from the sequence of (anomaly-measure …) calls under control of the “logf” boolean input parameter.

Next in our script we call the (anomaly-loop …) function inside a wrapper function:

(define (anomaly-measures train-dst train-exc prod-dst prod-exc seed niter clean logf)
  (let (values (anomaly-loop train-dst train-exc prod-dst prod-exc seed niter clean logf))
    values))

Although this function could be eliminated in our script, you might find a similar function useful in your own custom WhizzML script as the place for adding additional computations on the sequence of average anomaly scores returned by the (anomaly-loop …) function.

The final high-level function in our script computes our final single numeric measure of data shift.  In our script, this is simply the average of the sequence of average anomaly scores returned by the (anomaly-measures …) function:

(define (anomaly-estimate train-dst train-exc prod-dst prod-exc seed niter clean logf)
  (let (values (anomaly-measures train-dst train-exc prod-dst prod-exc seed niter clean logf)
        sum (reduce + 0 values)
        cnt (count values))
    (/ sum cnt)))

In a custom WhizzML script one could combine the (anomaly-estimate …) function and (anomaly-measures …) function by just replacing the use of (anomaly-measures …) with (anomaly-loop …). If one doesn’t need to access the list of scores, one could also pull the contents of the (anomaly-loop …) function into this function. On the other hand, you might need to use the list of scores from (anomaly-measures …) directly in your own WhizzML scripts, rather than just computing the average of the average anomaly scores in that list.
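The control flow of the loop and the final averaging can be sketched in plain Python. In this illustration the call to (anomaly-measure …) is replaced by a stand-in function, fake_measure, that returns invented scores; nothing here talks to the BigML platform, and all names are ours.

```python
def anomaly_loop(measure, seed, niter):
    """Call measure once per iteration with a per-iteration seed string."""
    scores = []
    for i in range(1, niter + 1):
        # Mirrors (str seed " " iter): a distinct but repeatable seed per iteration.
        scores.append(measure(f"{seed} {i}"))
    return scores


def anomaly_estimate(measure, seed, niter):
    """Average of the per-iteration average anomaly scores."""
    scores = anomaly_loop(measure, seed, niter)
    return sum(scores) / len(scores)


def fake_measure(seed):
    """Stand-in for the real measure: derives an invented score from the seed."""
    iteration = int(seed.split()[-1])
    return 0.4 + 0.01 * iteration


print(anomaly_loop(fake_measure, "demo", 3))
print(anomaly_estimate(fake_measure, "demo", 3))
```

Keeping the list of scores separate from the final average, as the script does, leaves room for other summaries such as a minimum, maximum, or spread.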

Finally, the example in the WhizzML script gallery concludes with the definition required to use the script in the BigML Dashboard:

(define result (anomaly-estimate train-dst train-exc prod-dst prod-exc seed niter clean logf))

This definition also demonstrates how you would call the top-level (anomaly-estimate …) function directly in your own WhizzML scripts. Thanks to WhizzML’s composability, using Anomaly Detectors to detect covariate shift is more convenient than ever. We hope you get a chance to give it a spin and let us know how it goes!



Predictive Analytics in the Financial Industry – The Art of What, How and Why

Mobey Forum, the global industry association empowering banks and other financial institutions to play a leading role in ushering the future of digital financial services, has just published the first in a new series of reports that are exploring the most important aspects, challenges and key application areas of predictive analytics in financial services. As Co-chair of the Mobey Forum’s predictive analytics workgroup, I had a front row seat in observing the challenges the industry is facing, while transitioning to a much more data-driven operational mode necessitated by competitive pressures. I would like to thank the colleagues from Danske Bank, UBS, Nets, PostFinance, Ericson, HSBC, Nordea, CaixaBank, Teconcon, Giesecke&Devrient and many more leading institutions that contributed to the final report.

Statistics vs. Machine Learning

‘Predictive Analytics in the Financial Industry – The Art of What, How and Why’ is a primer that lays the groundwork for subsequent reports that will go into much more detail in exploring different technical and organizational aspects of predictive analytics. The Mobey Forum workgroup aims to strike a balance between the technical underpinnings of key enabling technologies such as Machine Learning and the real-life, best-practice commercial applications that can serve as benchmarks for beginners in their initiatives.

We hope this effort provides the spark to get your organization to start investing in predictive analytics, and to do so in a way that can accelerate innovative data-driven products and services that adapt to the dynamic marketplace that is threatening to make one-size-fits-all traditional product and service portfolios obsolete.

As a reminder, an in-depth discussion by the authors of the report will be broadcast on BrightTALK on July 11th at 4PM CET.

Automatically Estimate the Best K for K-Means Clustering with WhizzML

(Thanks to Alex Schwarm of for bringing to our attention the Pham, Dimov, and Nguyen paper, which is the subject of this post.)

The BigML platform offers a robust K-Means Clustering API that uses the G-Means algorithm for determining K if you don’t have a good guess for K.  However, sometimes you may find that the divisive top-down approach of the G-Means algorithm does not always yield the Best-K for your dataset.  After a little experimentation, you may also discover that the G-Means algorithm does not choose a value of K that makes sense based on your knowledge of your dataset (see the “k” and “critical_value” arguments in the Cluster Arguments section).  You could manually try running the cluster operation on your dataset for a range of K, but that approach does not inherently include a way to recognize the best K.  And it can be very time consuming!

The Pham, Dimov, and Nguyen Algorithm and the K-means Algorithm in BigML

Fortunately, WhizzML allows us to easily implement another approach for choosing K using an algorithm by Pham, Dimov, and Nguyen. (D.T. Pham, S. S. Dimov, and C. D. Nguyen, “Selection of K in K-means clustering“. Proc. IMechE, Part C: J. Mechanical Engineering Science, v. 219, pp. 103-119.) Pham, Dimov, and Nguyen define a measure of concentration f(k) on a K-means clustering and use that as an evaluation function to determine the best K. In this post, we show how to use the Pham-Dimov-Nguyen (PDN) algorithm in WhizzML to calculate f(k) over an arbitrary range of Kmin to Kmax. You can then consider using the k that yields the optimum (minimum) value of f(k) as the best K for a K-means clustering of your dataset.

Before jumping into the WhizzML code, we first note the clustering functions WhizzML provides via the BigML API calls:

  • (create-and-wait-cluster …):  Using this function we can create a BigML Cluster object for a BigML Dataset object using K-means or G-means clustering.
  • (create-and-wait-centroid …):  Once we have a BigML Cluster for a BigML Dataset we can create a BigML Centroid object for a row in the dataset using this function.
  • (create-and-wait-batchcentroid …):  Given a Cluster object and a Dataset object, we can use this function to create a BigML Batch Centroid object and a new Dataset that labels every row with the number of the cluster centroid to which the row is assigned.
  • (create* “cluster” …): With this function we can initiate the creation of a sequence of BigML Cluster objects on the BigML platform in parallel.
  • (wait* …): Although not a clustering function, this synchronization function re-establishes serial program flow in WhizzML after (create* …) initiates parallel creation of BigML objects.

We’ll use the latter two parallel operations to increase the speed of our WhizzML script that implements the PDN algorithm.

Our WhizzML script in the BigML gallery uses the PDN concentration function f(k) and finds the best K in several steps.  Given a BigML Dataset object, the steps of the generic algorithm are:

  1. Compute a sequence of BigML Cluster objects for k ranging from Kmin to Kmax.
  2. Evaluate f(k) for each cluster in the sequence of BigML Cluster objects.
  3. Choose the k with the optimum (minimum) value of f(k) as the best K.
  4. Finally, if desired, create a BigML Batch Centroid object from the best K Cluster object and the source Dataset object.

It turns out that our example WhizzML script implements a sequence of component WhizzML functions that aren’t quite one-to-one with the steps in this generic algorithm. The functions in our script are organized into three layers. The base layer consists of foundation functions that enable computation of the PDN concentration function f(k). The functions in the middle layer use these foundation functions to implement our algorithm to find the best k for K-Means clustering of a dataset. The top layer consists of WhizzML functions that provide examples of different ways to use our best k implementation of K-Means clustering in your own workflows.

Foundation Functions for a PDN-based Approach to Finding the Best k

Our WhizzML script begins with a set of four simple foundation functions: (generate-clusters …), (extract-eval-data …), (alpha-func …), and (evaluation-func …). The (generate-clusters …) function implements the first step in the generic algorithm we outlined. Given a BigML dataset ID and a range for values of k, this function creates a sequence of BigML Cluster objects:

(define (generate-clusters dataset cluster-args k-min k-max)
  (let (dname (get (fetch dataset) "name")
        fargs (lambda (k)
                (assoc cluster-args "dataset" dataset
                                    "k" k
                                    "name" (str dname " - cluster (k=" k ")")))
        clist (map fargs (range k-min (+ 1 k-max)))
        ids (create* "cluster" clist))
    (map fetch (wait* ids))))

In addition to the “dataset” ID and range for k specified by “k-min” and “k-max”, the function accepts a map “cluster-args” of arguments for the BigML API to create Cluster objects. This base “cluster-args” map is expanded to a map for a specific value of k by the function fargs(k) created as a lambda function.

The rest of the function uses the WhizzML (map …) function to create the list clist of argument maps, one for each value of k. The WhizzML (create* …) and (wait* …) functions are then used to create the BigML Cluster objects for k from “k-min” to “k-max” in parallel. The function then returns a list of the metadata for the resulting clusters on the BigML server.
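The construction of the per-k argument maps can be sketched in plain Python. This only illustrates the map-building step; the parallel creation performed by (create* …) and (wait* …) is not modeled, and the dataset ID, dataset name, and base arguments below are hypothetical.

```python
def cluster_arg_maps(dataset_id, dataset_name, cluster_args, k_min, k_max):
    """Expand a base argument map into one map per k in [k_min, k_max]."""
    return [
        {
            **cluster_args,  # base arguments are copied, not mutated
            "dataset": dataset_id,
            "k": k,
            "name": f"{dataset_name} - cluster (k={k})",
        }
        for k in range(k_min, k_max + 1)  # inclusive upper bound, like (+ 1 k-max)
    ]


# Hypothetical inputs for illustration only.
base_args = {"default_numeric_value": "mean"}
for args in cluster_arg_maps("dataset/hypothetical-id", "iris", base_args, 2, 4):
    print(args)
```

Each resulting map differs only in “k” and “name”, so any option placed in the base map applies uniformly across the whole range of k.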

As we will explain subsequently, the PDN concentration function f(k) for a given k is computed from certain members of the metadata map for the cluster object for k.  To illustrate this and simplify the code,  the next helper function (extract-eval-data …) in the script encapsulates the required values from the metadata map in a separate map:

(define (extract-eval-data cluster)
  (let (id (get cluster "resource")
        k (get cluster "k")
        n (count (get cluster "input_fields"))
        within_ss (get-in cluster ["clusters" "within_ss"])
        total_ss (get-in cluster ["clusters" "total_ss"]))
    {"id" id "k" k "n" n "within_ss" within_ss "total_ss" total_ss}))

In addition to the BigML cluster “id” and “k”, this smaller map includes the number “n” of fields in the dataset that are actually considered when doing the clustering. The “within_ss” property is the total sum-squared distance between every dataset row in the cluster and the centroid of the cluster. Similarly, “total_ss” is the total sum-squared distance between every row in the entire dataset and the global centroid of the dataset. Therefore, it will have the same value for every clustering of the dataset, whatever the value of k.

The next two functions, (alpha-func …) and (evaluation-func …), are actually factory functions that together create the PDN concentration function f(k) for a clustering. This function includes an internal weighting function a(k) parameterized on the number n of input fields considered in clustering the dataset. WhizzML does not provide an equivalent to the LISP (apply …) or the Clojure (partial …) for creating partial function evaluations, but it does create standard closures. This allows us to use JavaScript-style techniques based on lambda functions and closures to build the PDN concentration function f(k) parameterized on n in WhizzML. We do this by using a factory function (alpha-func …) that returns the weighting function a(k), and a factory function (evaluation-func …) that returns a custom version of the concentration function f(k).

The concentration function f(k) in the PDN paper incorporates a weighting function a(k) that is recursive in k and parameterized on n (eqns. (3a) and (3b) in the paper).  Because we want to evaluate f(k) over an arbitrary range of k, we need a closed form expression for a(k).  We can’t go through the derivation here, but the closed form we need is:

       | 1 - 3/(4n)                             k=2
a(k) = |
       | (5/6)^(k-2) a(2) + [1 - (5/6)^(k-2)]   k>2

We could write our factory function (alpha-func …) in multiple ways. The implementation in our WhizzML script follows a simple JavaScript pattern that returns an anonymous function:

(define (alpha-func n)
  (let (alpha_2 (- 1 (/ 3 (* 4 n)))
        w (/ 5 6))
    (lambda (k)
      (if (<= k 2)
        alpha_2
        (+ (* (pow w (- k 2)) alpha_2) (- 1 (pow w (- k 2))))))))

This factory function implicitly creates a closure that captures the input parameter “n” and then returns a lambda function that computes a(k).
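As a sanity check on the closed form, the following Python sketch compares it against the recursive definition. It assumes the recursion in the paper reads a(2) = 1 - 3/(4n) and a(k) = a(k-1) + (1 - a(k-1))/6 for k > 2, which is our reading of eqns. (3a) and (3b); the Python function names are ours.

```python
def alpha_func(n):
    """Factory returning the closed-form weighting function a(k) for n input fields."""
    alpha_2 = 1 - 3 / (4 * n)
    w = 5 / 6

    def alpha(k):
        if k <= 2:
            return alpha_2
        return w ** (k - 2) * alpha_2 + (1 - w ** (k - 2))

    return alpha


def alpha_recursive(n, k):
    """Same weight computed step by step from the assumed recursion."""
    a = 1 - 3 / (4 * n)
    for _ in range(3, k + 1):
        a = a + (1 - a) / 6
    return a


alpha = alpha_func(4)
for k in range(2, 8):
    print(k, alpha(k), alpha_recursive(4, k))
```

Rewriting the recursion as 1 - a(k) = (5/6)(1 - a(k-1)) shows why the two agree: the gap between a(k) and 1 shrinks by a factor of 5/6 at every step.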

We next use (alpha-func …) in our factory function (evaluation-func …) that creates the concentration function f(k).  As with the weighting function a(k), since we want to evaluate f(k) over an arbitrary range of k we need to slightly transform f(k) in the PDN paper (eqn. (2)):

                     | 1                   k=1 
f(k, S(k), S(k-1)) = | 1                   S(k-1) undefined or S(k-1)=0
                     | S(k)/[a(k)S(k-1)]   otherwise

where S(k) is the “within_ss” property in the map returned by the (extract-eval-data …) function we described above. Our factory function implements the simple JavaScript pattern that returns an anonymous function:

(define (evaluation-func n)
  (let (fa (alpha-func n))
    (lambda (k sk skm)
      (if (or (<= k 1) (not skm) (zero? skm))
        1
        (/ sk (* (fa k) skm))))))

This factory function accepts the single input parameter “n”, implicitly creates a closure that includes an instance of the weighting function a(k), and then returns an anonymous instance of our modified concentration function f(k, S(k), S(k-1)).

At this point it’s worth recapping the functions we’ve built so far. In just a few lines of WhizzML code, we’ve implemented four routines that form the foundation layer of the Best-K script in the WhizzML script gallery and illustrate the power of WhizzML. The (generate-clusters …) function orchestrates a potentially large amount of work on the BigML backend to create a sequence of BigML cluster objects for K-means clusterings of our dataset over a range of k. Each BigML cluster object itself embodies a large amount of data and metadata, so we’ve defined a function (extract-eval-data …) that you could customize further in your own WhizzML scripts to extract just the metadata we’ll need. Finally, we’ve implemented two factory functions (alpha-func …) and (evaluation-func …) that together generate a version of the Pham-Dimov-Nguyen concentration function f(k) suitable for our needs.

Using Our Foundation Functions to Implement a Best k Algorithm

We next combine our foundation functions with other WhizzML built-in functions in a set of three functions at the heart of our implementation of the PDN algorithm for choosing the best K-means clustering.  The first function (evaluate-clusters …) accepts a list of clusters created by (generate-clusters …) and returns a corresponding list of metadata maps:

(define (evaluate-clusters clusters)
  (let (cmdata (map extract-eval-data clusters)
        n (get (nth cmdata 0) "n")
        fe (evaluation-func n))
    (loop (in cmdata
           out []
           ckz {})
       (if (= [] in)
         out
         (let (ck (head in)
               ckr (tail in)
               k (get ck "k")
               within_ss (get ck "within_ss")
               within_ssz (if (<= k 2) (get ck "total_ss") (get ckz "within_ss"))
               cko (assoc ck "fk" (fe k within_ss within_ssz)))
           (recur ckr (append out cko) ck))))))

Each metadata map in the returned list includes a property “fk” that is  the value of the PDN function f(k) for the corresponding K-means clustering.

This function uses (extract-eval-data …) to build a list cmdata of metadata maps for the list of K-means clusterings, and the factory function (evaluation-func …) to create a function “fe” that is our version f(k, S(k), S(k-1)) of the PDN concentration function f(k). The body of the function is a WhizzML (loop …) function that steps through the input list “in” of metadata maps (initially the cmdata list) to sequentially generate the output list “out” of metadata maps. The loop body operates on the head metadata map of the “in” list and the metadata map “ckz” from the head member of the previous iteration to compute the “fk” property, and then appends an augmented metadata map to the output list “out”. We note that the input list of “clusters” spans an arbitrary range of k and that the computation to generate “within_ssz” generates the initial value for S(k-1) required by our concentration function f(k, S(k), S(k-1)) for the first cluster in the “clusters” list.
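The evaluation loop can be tried out locally with a plain-Python sketch. It re-implements a(k) and f(k, S(k), S(k-1)) as described in this post and runs them on invented “within_ss” and “total_ss” values; the sample numbers and Python names are hypothetical, and no BigML objects are involved.

```python
def evaluation_func(n):
    """Factory returning f(k, S(k), S(k-1)) for n input fields."""
    alpha_2 = 1 - 3 / (4 * n)

    def alpha(k):
        if k <= 2:
            return alpha_2
        w = (5 / 6) ** (k - 2)
        return w * alpha_2 + (1 - w)

    def f(k, sk, skm):
        if k <= 1 or not skm:  # covers a missing or zero S(k-1)
            return 1.0
        return sk / (alpha(k) * skm)

    return f


def evaluate_clusters(cmdata):
    """Attach an "fk" value to each metadata map, walking the list in order of k."""
    fe = evaluation_func(cmdata[0]["n"])
    out, prev = [], {}
    for ck in cmdata:
        k = ck["k"]
        # For k <= 2 the total sum of squares stands in for S(k-1).
        skm = ck["total_ss"] if k <= 2 else prev.get("within_ss")
        out.append({**ck, "fk": fe(k, ck["within_ss"], skm)})
        prev = ck
    return out


# Invented metadata for k = 2, 3, 4 on a dataset with 4 input fields.
sample = [
    {"k": 2, "n": 4, "within_ss": 400.0, "total_ss": 1000.0},
    {"k": 3, "n": 4, "within_ss": 150.0, "total_ss": 1000.0},
    {"k": 4, "n": 4, "within_ss": 130.0, "total_ss": 1000.0},
]
evaluations = evaluate_clusters(sample)
best = min(evaluations, key=lambda e: e["fk"])
print(best["k"], best["fk"])
```

With these invented values the smallest f(k) occurs at k = 3, where the within-cluster sum of squares drops sharply relative to k = 2 and barely improves afterwards.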

Our next two functions are helper functions used by our top level functions we describe next.  The first function (clean-clusters …) just deletes unneeded BigML Cluster objects created by our PDN-based algorithm:

(define (clean-clusters evaluations cluster-id logf)
  (for (x evaluations)
    (let (id (get x "id")
          _ (if logf (log-info "Testing for deletion " id " " cluster-id)))
      (if (!= id cluster-id)
        (prog (delete id)
              (if logf (log-info "Deleted " id)))))))

We note that this function includes an input parameter “logf”. When this parameter is true, the function logs information about the delete operation to the BigML logging system.  The function is intended to be a base example you could expand with additional logging information in your own version of the script.

The other function (best-cluster …) generates a new BigML Cluster object:

(define (best-cluster dataset cluster-args k)
  (let (dname (get (fetch dataset) "name")
        ckargs (assoc cluster-args "dataset" dataset
                                   "k" k
                                   "name" (str dname " - cluster (k=" k ")")))
    (create-and-wait-cluster ckargs)))

This helper function is intended to increase the flexibility of our WhizzML script. In the initial evaluation stage we generate a list of BigML Cluster objects using the (generate-clusters …) function using an arbitrary map “cluster-args” of  values for the BigML clustering operation arguments.  Using this helper function, we can generate a final version of the BigML Cluster object for a given k using a different “cluster-args” map.

Before introducing the final top level functions in our example WhizzML script, we can add a few additional notes. First, note that our middle level functions only access data in WhizzML to do their work; they don’t need to access the BigML Cluster objects in the BigML system after we have created them with the (generate-clusters …) function. Correspondingly, our example (clean-clusters …) WhizzML function queues the object deletion requests to the BigML platform but doesn’t need to wait for them to complete. Finally, although the sample (best-cluster …) function allows us to regenerate the K-means clustering for the best k and waits for BigML to complete, you could just queue the request to create the BigML Cluster object in your own custom WhizzML script and check if it is complete with the (wait …) function when you need it. The BigML platform takes care of all the cumbersome work of creating and deleting objects, and just provides our WhizzML code with the small amount of data we need. This greatly simplifies orchestrating and optimizing the performance of our workflows.

Functions that Illustrate Several Applications of the PDN Best k Approach

The final group of functions in our example WhizzML script are three simple top level functions that provide us with a stack of operations relevant to different applications.  We step through them in order. We then provide example WhizzML calls of each function.

The first top-level function (evaluate-k-means …) just creates the list of BigML Cluster objects for K-means clustering for k ranging from “k-min” to “k-max” and returns the list of metadata maps that includes the value of the PDN concentration function f(k) as the property “fk”:

(define (evaluate-k-means dataset cluster-args k-min k-max clean logf)
  (let (clusters (generate-clusters dataset cluster-args k-min k-max)
        evaluations (evaluate-clusters clusters))
    (if clean
      (clean-clusters evaluations "" logf))
    evaluations))

In addition to the basic input parameters “dataset”, “k-min”, and “k-max”, the function allows us to specify a WhizzML map “cluster-args” of our choice of arguments for the BigML cluster operation. When the “clean” parameter is true, the function calls (clean-clusters …) to delete the BigML Cluster objects on the BigML platform before returning the result list. In this example function, the value of the parameter “logf” is just passed on to the (clean-clusters …) function. In your own custom version of this WhizzML script you can use this parameter to control whatever additional logging you might want.

Our next function (best-k-means …) builds on (evaluate-k-means …) to return a BigML Cluster object for the best k:

(define (best-k-means dataset cluster-args k-min k-max bestcluster-args clean logf)
  (let (evaluations (evaluate-k-means dataset cluster-args k-min k-max false logf)
        _ (if logf (log-info "Evaluations " evaluations))
        besteval (min-key (lambda (x) (get x "fk")) evaluations)
        _ (if logf (log-info "Best " besteval))
        cluster-id (if (= cluster-args bestcluster-args)
                     (get besteval "id")
                     (best-cluster dataset bestcluster-args (get besteval "k"))))
    (if clean
      (clean-clusters evaluations cluster-id logf))
    cluster-id))

After we generate the list “evaluations” of metadata maps with the PDN concentration function values, we use the WhizzML (min-key …) built-in function to find the metadata map for the best k. We then check whether the “cluster-args” map used in the first stage to find k differs from the “bestcluster-args” map. If the two maps don’t agree, we generate a new BigML Cluster object for the best k. Regardless, if “clean” is specified as true, we direct the BigML platform to asynchronously delete the BigML Cluster objects on the platform that we don’t need. Finally, we return the ID of the BigML Cluster object for the best k the routine found.
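As a small illustration of the selection step, the sketch below applies (min-key …) to a hand-written list of metadata maps. The “k” and “fk” values are made up for the example and are not results from any real dataset; real metadata maps also carry other properties such as “id”.

```
;; Made-up evaluations, for illustration only.
(define sample-evaluations [{"k" 2 "fk" 0.9}
                            {"k" 3 "fk" 0.4}
                            {"k" 4 "fk" 0.7}])

;; Select the map whose "fk" value is smallest, then read off its k.
(define sample-best (min-key (lambda (x) (get x "fk")) sample-evaluations))
(define sample-best-k (get sample-best "k"))
```

Because (min-key …) returns the whole metadata map rather than just the minimum value, the same result gives us both the best k and any other property stored with it.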

Our last routine (best-batchcentroid …), uses the BigML Cluster object created by the (best-k-means …) function and the input BigML Dataset object to create a BigML Batch Centroid object:

(define (best-batchcentroid dataset cluster-args k-min k-max bestcluster-args clean logf)
  (let (cluster-id (best-k-means dataset cluster-args k-min k-max bestcluster-args clean logf)
        batchcentroid-id (create-and-wait-batchcentroid {"cluster" cluster-id
                                                         "dataset" dataset
                                                         "output_dataset" true
                                                         "all_fields" true}))
    batchcentroid-id))

Because the argument map in our call to the WhizzML (create-and-wait-batchcentroid …) function includes the “all_fields” and the “output_dataset” properties, the function also creates a BigML Dataset object that includes all columns in the input “dataset” and an extra column that specifies the cluster number to which the dataset row was assigned.

Using the Best k Algorithm Implementation

Our three top-level WhizzML functions take nearly the same parameters (only (evaluate-k-means …) omits “bestcluster-args”), so we can call them in much the same way:

(define bestk-evaluations (evaluate-k-means dataset cluster-args k-min k-max clean logf))

(define bestk-cluster (best-k-means dataset cluster-args k-min k-max bestcluster-args clean logf))

(define bestk-batchcentroid (best-batchcentroid dataset cluster-args k-min k-max bestcluster-args clean logf))

These three examples illustrate how we compute a list of PDN concentration function f(k) evaluations, the BigML Cluster object for the best k, and the BigML Batch Centroid object for the best k, respectively.

In your application, you might have a guess for the best k. In that case, you might want to specify a range “k-min” to “k-max” that brackets that k value. You could then use the first call to the (evaluate-k-means …) function above, examine the results, and choose the best k. Alternatively, you could use (evaluate-k-means …) in a loop to test a series of intervals [k_1, k_2], [k_2, k_3] … [k_(N-1), k_N], and then choose the best k from all of those tests. Finally, if you know a range “k-min” to “k-max”, you can use (best-k-means …) or (best-batchcentroid …) to generate the BigML Cluster object or BigML Batch Centroid object, respectively, for the best k.
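The interval-by-interval search could be sketched along these lines. The helper (best-k-over-intervals …) and the interval values are hypothetical and not part of the example script; the sketch assumes the (evaluate-k-means …) function defined above, and that each interval is given as a two-element list.

```
;; Hypothetical helper: evaluate each [k-min k-max] interval in turn,
;; pool all the metadata maps, and return the one with the smallest "fk".
(define (best-k-over-intervals dataset cluster-args intervals logf)
  (let (run (lambda (acc interval)
              (concat acc
                      (evaluate-k-means dataset
                                        cluster-args
                                        (nth interval 0)
                                        (nth interval 1)
                                        true   ;; clean: delete the clusters
                                        logf)))
        evaluations (reduce run [] intervals))
    (min-key (lambda (x) (get x "fk")) evaluations)))

;; Illustrative intervals only; choose ranges that suit your data.
(define best-eval
  (best-k-over-intervals dataset cluster-args [[2 5] [5 10] [10 20]] false))
(define best-k (get best-eval "k"))
```

Two caveats apply to this sketch. Intervals that share an endpoint evaluate that k twice, so you may prefer non-overlapping ranges. Also, because “clean” is passed as true here, the clusters are deleted after each evaluation, so only the “k” of the result is meant to be used, not its “id”.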


