
Machine Learning in Objective-C Has Never Been Easier

Taking the opportunity provided by our recent Spring release, BigML is pleased to announce our new SDK for Objective-C, bigml-objc, which provides a modern, block-based Objective-C API and a new, more maintainable and coherent design. Additionally, bigml-objc includes support for WhizzML, our exciting new DSL for the automation of ML workflows. bigml-objc evolves and supersedes the old ML4iOS library, which has been kindly supported by BigML’s friend Felix Garcia Lainez.


bigml-objc’s API design follows along the lines of our Swift SDK. Its main aim is to allow iOS, OS X, watchOS, and tvOS developers to easily integrate BigML services into their apps, while also benefitting from modern Objective-C features (first and foremost Objective-C blocks), which allow for simple handling of asynchronous operations.

The main features the BigML SDK for Objective-C provides can be divided into two areas:

  • Remote resource processing: BigML SDK exposes BigML’s REST API through a higher-level Objective-C API that will make it easier for you to create, retrieve, update, and delete remote resources. Supported resources are:

    Sources, Datasets, Models, Clusters, Anomalies, Ensembles, Predictions, and WhizzML resources.
  • Local resource processing: BigML SDK allows you to mix local and remote distributed processing in a seamless and transparent way. You will be able to download your remote resources (e.g., a cluster) and then apply supported algorithms to them (e.g., calculate the nearest centroid for your input data). This is one definite advantage that BigML offers over competing services, which mostly lock you into either using their remote services or doing everything locally. BigML’s SDK for Objective-C combines the benefits of both approaches by making it possible to use the power of a cloud solution and to enjoy the flexibility/transparency of local processing right when you need it. The following is a list of currently supported algorithms that BigML’s SDK for Objective-C provides:

    • Model predictions
    • Ensemble predictions
    • Clustering
    • Anomaly detection

    A dive into BigML’s Objective-C API

    The BMLAPIConnector class is the workhorse of all remote processing: it allows you to create, delete, and get remote resources of any supported type. When instantiating it, you should provide your BigML account credentials and specify whether you want to work in development or production mode:

    BMLAPIConnector* connector =
      [[BMLAPIConnector alloc]
         initWithUsername:@"your BigML username here"
                   apiKey:@"your BigML API Key here"
                     mode:BMLModeProduction
                   server:nil
                  version:nil];
    

    You can safely pass nil for the version argument, since there is actually only one API version supported by BigML.

    Once your connector is instantiated, you can use it to create a data source from a local CSV file:

    
    NSString* filePath = ...;
    BMLMinimalResource* file =
    [[BMLMinimalResource alloc]
       initWithName:@"My Data Source"
               type:BMLResourceTypeFile
               uuid:filePath
         definition:nil];
    
    [connector createResource:BMLResourceTypeSource
                         name:@"My first data source"
                      options:nil
                         from:file
                   completion:^(id resource, NSError* error) {
    
          if (error == nil) {
                 //-- use resource
          } else {
                 //-- handle error
          }
    }];
    
    

    As you can see, BMLAPIConnector’s createResource allows you to specify the type of resource you want to create, its name, a set of options, and the resource that should be used to create it, in this case a local file.

    BigML SDK for Objective-C’s API is entirely asynchronous and relies on completion blocks, where you will get the resource that has been created, if any, or the error that aborted the operation as applicable. The resource you will receive in the completion block is an instance of the BMLMinimalResource type, which conforms to the BMLResource protocol.

    typedef NSString BMLResourceUuid;
    typedef NSString BMLResourceFullUuid;
    
    @class BMLResourceTypeIdentifier;
    
    /**
     * This protocol represents a generic BigML resource.
     */
    @protocol BMLResource <NSObject>
    
    /// the json body of the resource. See BigML REST API doc (https://tropo.dev.bigml.com/developers/)
    @property (nonatomic, strong) NSDictionary* jsonDefinition;
    
    /// the current status of the resource
    @property (nonatomic) BMLResourceStatus status;
    
    /// the resource progress, a float between 0 and 1
    @property (nonatomic) float progress;
    
    /// the resource name
    - (NSString*)name;
    
    /// the resource type
    - (BMLResourceTypeIdentifier*)type;
    
    /// the resource UUID
    - (BMLResourceUuid*)uuid;
    
    /// the resource full UUID
    - (BMLResourceFullUuid*)fullUuid;
    
    @end
    

    The BMLResource protocol encodes the most basic information that all resources share: a name, a type, a UUID, the resource’s current status, and a JSON object that describes all of its properties. You are expected to create your own custom class conforming to the BMLResource protocol that best suits your needs, e.g., a Core Data class that allows you to persist resources to a local cache. Of course, you are welcome to reuse our BMLMinimalResource implementation as you wish.

    In a pretty similar way you can create a dataset from the data source just created:

    [connector createResource:BMLResourceTypeDataset
                         name:@"My first dataset"
                      options:nil
                         from:myDatasource
                   completion:^(id resource, NSError* error) {
    
            if (error == nil) {
                  //-- use resource
            } else {
                  //-- handle error
            }
    }];
    

    If you know the UUID of an existing resource of a given type and want to retrieve it from BigML, you can use BMLAPIConnector’s getResource method:

    [connector getResource:BMLResourceTypeDataset
                         uuid:resourceUUID
                   completion:^(id resource, NSError* error) {
    
              if (error == nil) {
                   //-- use resource
              } else {
                   //-- handle error
              }
    }];
    

    Creating WhizzML Scripts

    You can create a WhizzML script in a way similar to how you create a datasource, i.e., by using BMLAPIConnector‘s createResource method and providing a BMLResourceTypeWhizzmlSource resource that encodes the WhizzML source code:

        BMLMinimalResource* resource =
        [[BMLMinimalResource alloc]
           initWithName:@"My first WhizzML script"
                   type:BMLResourceTypeWhizzmlSource
                   uuid:@""
             definition:@{}];
        NSDictionary* dict = @{ @"source_code" : @"My source code here",
                                @"description" : @"My first WhizzML script",
                                @"inputs" : @[@{@"name" : @"inDataset", @"type" : @"dataset-id"}],
                                @"tags" : @[@"tag1", @"tag2"] };
    
        [connector
         createResource:BMLResourceTypeWhizzmlScript
         name:@"My first WhizzML Script"
         options:dict
         from:resource
         completion:^(id<BMLResource> resource, NSError* error) {
    
             if (resource) {
                // execute script passed in resource
             } else {
                // handle error
             }
         }];
    

    Creating WhizzML scripts yourself is not the only way to take advantage of our new workflow automation DSL. Indeed, you can browse our WhizzML script Gallery and find a growing collection of scripts to solve recurrent machine learning tasks such as removing anomalies from a dataset, identifying a dataset’s best features, doing cross-validation, and many more. Once you have found what you are looking for, you can clone that script (many are even free!) to your account for use from your Objective-C program.

    Once you have created or cloned your script from the gallery, you can execute it very easily:

    [connector createResource:BMLResourceTypeWhizzmlExecution
                         name:@"New Execution"
                      options:@{ @"inputs" : @[@[@"inDataset", @"dataset/573d9b147e0a8d70da01a0b5"]] }
                         from:myScript
                   completion:^(id<BMLResource> resource, NSError* error) {

        if (resource) {
            //-- use the execution and its results
        } else {
            //-- handle error
        }
    }];
    

    Read a thorough description of WhizzML and how you can use WhizzML scripts, libraries and executions in our REST API documentation! A great resource to learn about the language is our series of training videos.

    Local algorithms

    The most exciting part of BigML’s SDK for Objective-C is surely its support for a collection of the most widely used ML algorithms, such as model prediction, clustering, and anomaly detection. What is even more exciting is that the family of algorithms that BigML’s SDK for Objective-C supports is constantly growing!

    As an example, say that you have a model in your BigML account and you want to use it to make a prediction based on some data you have. This is a two-step process:

    • Retrieve the model from your account, as shown above, with getResource.
    • Use BigML’s SDK for Objective-C to calculate a prediction locally.

    The second step can be executed inside of the completion block that you pass to getResource. This could look like the following:

    [connector getResource:BMLResourceTypeModel
                      uuid:resourceUUID
                completion:^(id<BMLResource> resource, NSError* error) {

        if (error == nil) {
            NSDictionary* prediction =
            [BMLLocalPredictions localPredictionWithJSONModelSync:resource.jsonDefinition
                                                        arguments:@{ @"sepal length": @(6.02),
                                                                     @"sepal width": @(3.15),
                                                                     @"petal width": @(1.51),
                                                                     @"petal length": @(4.07) }
                                                          options:nil];
            //-- use prediction
        } else {
            //-- handle error
        }
    }];
    

    The prediction object returned is a dictionary containing the value of the prediction and its confidence. In similar ways, you can calculate the nearest centroid, or do anomaly scoring.

    Practical Info

    The BigML SDK for Objective-C is compatible with Objective-C 2.0 and later. You can fork BigML’s SDK for Objective-C from BigML’s GitHub account and send us your PRs. As always, let us know what you think about it and how we can improve it to better suit your requirements!

PAPIs 2016 – Call for Proposals Deadline is This Friday!


2016 marks the first year PAPIs.io is making it across the Atlantic to Boston. The conference will take place on October 10-12, 2016, and the deadline for proposals is this Friday. As a founding member and initial sponsor of PAPIs.io, BigML will be actively participating in this third edition too. Besides BigML, last year’s event in Sydney included presenters from large tech companies such as Amazon, Microsoft, Google, and NVIDIA, as well as key government organizations and innovative startups focusing on Machine Learning.

PAPIs remains the premier forum for the presentation of new machine learning APIs, techniques, architectures and tools to build predictive applications. It is a community conference that brings together practitioners from industry, government and academia to present new developments, identify new needs and trends, and discuss the challenges of building real-world predictive, intelligent applications.

This year’s conference program will feature 4 types of presentations:

  • Technical and Business Talks (e.g., use cases, innovations, challenges, lessons learnt)
  • Tutorials
  • Research Presentations
  • Startup Pitches (as part of the AI Startup Battle)

As evidenced by the 700+ attendees who came from 25 different countries to the 4 previous events, presenting at the conference is a great way to share your learnings, showcase leadership on behalf of your organization, and engage with like-minded peers. With the aim of a diverse and creative line-up of speakers, the organizing committee welcomes practical presentations covering a wide range of experience levels — from beginner-friendly how-to’s to cautionary tales to deep dives for experienced professionals.

Please follow these guidelines for the best chance of having your proposal selected. If you have additional questions, don’t hesitate to email the organizers at cfp@papis.io.

We’re looking forward to receiving your best proposals!

WhizzML Training Videos are Here!

This week we completed four in-depth training webinars focused on WhizzML, BigML’s new domain-specific language for automating Machine Learning workflows, implementing high-level Machine Learning algorithms, and easily sharing them with others. We already have our first batch of WhizzML graduates merely a week after launch. However, many of you were either not able to secure a live webinar spot or not able to join us at the scheduled date and time. Don’t fret if you missed any of these training sessions. You can now watch the whole series at your own pace on BigML’s YouTube channel.

We suggest that you follow the same order in the series as there are dependencies that may slow down your comprehension if you skip things. Here is a brief guide to how the series is structured:

1. Introduction to WhizzML

The first session covers all the basics describing how WhizzML is implemented on the BigML platform. Ryan Asensio, BigML’s Machine Learning Engineer, introduces the purpose of the language and some benefits over other ways of implementing Machine Learning workflows and algorithms.

2. Language Overview and Basic Workflows

This intermediate webinar explores the WhizzML domain-specific language in greater detail, with a whirlwind tour of its syntax, programming constructs and basic standard library functions. In this second training session, Charles Parker, BigML’s VP of Machine Learning Algorithms, explains how to create and use WhizzML resources (libraries, scripts and executions) by means of several simple yet fully functional example workflows.

3. Advanced Machine Learning Workflows

The third training session is an advanced webinar where we continue our exploration of the WhizzML language, diving into more complex examples and using more advanced features of the language. Charles Parker, BigML’s VP of Machine Learning Algorithms, explains how some of the most effective Machine Learning algorithms can be implemented and automated on top of the BigML platform with WhizzML.

4. Real-world Machine Learning Workflows

In the fourth session, Poul Petersen, BigML’s Chief Infrastructure Officer, walks you through some real-world workflow automations with an eye towards the kind of problems posed by complex use cases. In this advanced webinar we use some of the best tricks to solve your Machine Learning problems with confidence.

You can always visit the dedicated WhizzML landing page for the most up to date info and resources.

Have an idea for a new script for a Machine Learning task? As always, forward us your questions or comments anytime at info@bigml.com. We look forward to hearing about the Machine Learning projects that you are looking to automate.

Happy WhizzMLing!

WhizzML Tutorial II: Covariate Shift


If this is your first time writing in the new WhizzML language, I suggest that you start here with a simpler tutorial. In this post, we are going to write a WhizzML script that automates the process of investigating covariate shift. To get a deeper understanding of what we’re trying to do, read the beginning of this article first.

We want a workflow that:

  1. Takes two datasets (one that represents the data used to train a predictive model, one that represents production data), and
  2. Returns an indication of whether the distribution of data is different between the two datasets.


As we read in the article (or on Wikipedia), the indicator of change in our data distribution is called the phi coefficient. Our WhizzML script will return us this number, so let’s name our base function phi-coefficient.

phi-coefficient

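The original post shows the definition as an image; reconstructed from the walkthrough below, it looks roughly like this (the 0.8 split rate and the create-and-wait-* helper names are assumptions):

;; Sketch reconstructed from the description; details may differ.
(define (phi-coefficient training-dataset production-dataset seed)
  (let (comb-data (combined-data training-dataset production-dataset)
        ids (split-dataset comb-data 0.8 seed)
        model (create-and-wait-model {"dataset" (nth ids 0)
                                      "objective_field" "Origin"})
        eval (model-evaluation model (nth ids 1)))
    (avg-phi eval)))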

What are we doing here?

To start, the function takes three arguments. The first two are ids for our training and production datasets, respectively. We call them training-dataset and production-dataset. The third argument, seed, is used to make our sampling deterministic. We’ll talk about this later.

There’s quite a bit going on in this function, but it’s all broken into manageable pieces. First, we use let to set local variables. These local variables are the result of a few different functions, which we will have to define. The local variables are comb-data, ids, model, and eval. After these are set, we can compute the phi coefficient with the function avg-phi. Let’s go over each of the local variables.

comb-data

comb-data is the result of (combined-data training-dataset production-dataset). Here, we combine the two datasets into one big dataset. But before they are combined, we have to do a transformation on each dataset (add the “Origin” field). We’ll talk about that transformation when we define combined-data.

The dataset returned by our comb-data function looks something like this:

[Screenshot: the combined dataset, showing the original fields plus the new “Origin” column]

ids

Next, we have a variable called ids. This is a list of two dataset IDs – the result of:

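Presumably a call along these lines (the 0.8 rate is an assumption):

(split-dataset comb-data 0.8 seed)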

Our split-dataset function takes the comb-data (one big dataset) and randomly splits it into two datasets. We split it so that we can train a predictive model with the larger portion of the split, and then evaluate its performance on the smaller part. The split-dataset function returns something like this:

["dataset/83bf92b0b38gbgb" "dataset/83hf93gf012bg84b20"]

model

model is a BigML predictive model resource. We are creating this model from the first element of our ids list: "dataset" (nth ids 0). The model is built to predict whether the value for the “Origin” field is “Training” or “Production”. Thus, the “objective_field” is “Origin”: "objective_field" "Origin".

eval

eval is a BigML evaluation resource. To create an evaluation, we need two arguments: a predictive model and a dataset we want to test the model against. Our model is stored in model and our dataset is the second element in the ids list, hence: (nth ids 1)

avg-phi

We’re done with the local variables, but what does the whole phi-coefficient function return – what’s our end product?

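Presumably just the final expression of the sketch above:

(avg-phi eval)
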
That line gives us the average phi score for the evaluation we just created. A bunch of information is stored inside the eval data object that will be retrieved from BigML. But of course we have to tell the function avg-phi how to get what we want! We’ll save that for later.

So we have built our base function (phi-coefficient) and understand its components. Now we have to go back and build the functions we haven’t defined yet, specifically combined-data, split-dataset, model-evaluation, and avg-phi. We’ll start with combined-data.

combined-data

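A sketch of the function, assuming the create-and-wait-dataset helper:

;; Combine the two transformed datasets into one.
(define (combined-data training-dataset production-dataset)
  (create-and-wait-dataset
    {"origin_datasets" [(train-data training-dataset)
                        (prod-data production-dataset)]}))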

Again, this function combines two datasets. We tell BigML what datasets we want to combine using the “origin_datasets” parameter and passing it a list of dataset ids.

But what are train-data and prod-data?

Those are helper functions that add the “Origin” field we talked about.

  • train-data adds the “Origin” field with the value “Training” in each row
  • prod-data adds the “Origin” field with the value “Production” in each row

They are defined here:

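Roughly:

(define (train-data dataset-id)
  (with-origin-field dataset-id "Training"))

(define (prod-data dataset-id)
  (with-origin-field dataset-id "Production"))
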
Since we are doing pretty similar things in both functions (adding an “Origin” field), we can separate that logic into its own function. Here it is:

with-origin-field

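A sketch (note that in the actual API the "field" key takes a Flatline expression, so the original may quote the value differently):

;; Create a new dataset that adds an "Origin" column with a constant value.
(define (with-origin-field dataset-id value)
  (create-and-wait-dataset
    {"origin_dataset" dataset-id
     "new_fields" [{"name" "Origin"
                    "label" "Origin"
                    "field" value}]}))
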
In that function we are…

  1. Creating a new dataset from an existing one "origin_dataset" dataset-id
  2. Adding a new field "new_fields" [...]
  3. Giving the new field a column name and label "name" "Origin" "label" "Origin"
  4. Setting the row’s value "field" value

The value will either be the string "Production" or "Training". This string is passed in as an argument where prod-data and train-data are defined.

Nice. Now let’s go over split-dataset.

split-dataset

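Along these lines:

;; Return the two complementary samples as a list of dataset ids.
(define (split-dataset dataset-id rate seed)
  [(sample-dataset dataset-id rate false seed)
   (sample-dataset dataset-id rate true seed)])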

  1. What are we splitting? dataset-id – the dataset we pass in.
  2. How are we splitting it, 80%/20%? 90%/10%? We can do whatever we want. This is determined by rate.
  3. How are we going to shuffle our data before we split it? The seed determines this.

As you can see, we are sampling the same dataset twice. One sample will be used to build a predictive model, the other will be used to evaluate the predictive model.

sample-dataset is another function. Here it is below:

sample-dataset

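Roughly:

(define (sample-dataset dataset-id rate oob seed)
  (create-and-wait-dataset {"origin_dataset" dataset-id
                            "sample_rate" rate
                            "out_of_bag" oob
                            "seed" seed}))
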
This function interacts with the BigML API. We create a new dataset, passing in the rate, the original dataset (dataset-id), whether it is out_of_bag or not (we’ll go over this) and the seed used to determine how the original dataset was shuffled.

Here’s a little diagram that will help explain how the seed and out_of_bag (oob) work.

[Diagram: how the seed shuffles the rows, and how out_of_bag selects the complementary sample]
So if out_of_bag is set to true, we grab the rows labeled “oob”. Otherwise, we grab the ones marked “x”. The seed just changes which rows we label “oob” and “x”. The seed also enables this whole process to be deterministic. So if you run the phi-coefficient function with the same seed (and the same datasets), you’ll get the same results!

Cool. That wraps up our split-dataset function. Next up, model-evaluation.

model-evaluation

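Something like:

(define (model-evaluation model-id dataset-id)
  (create-and-wait-evaluation {"model" model-id
                               "dataset" dataset-id}))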

I apologize if you were hoping for something more exciting. This function is just a wrapper for the method included with WhizzML, create-and-wait-evaluation. As you can see, we are simply creating an evaluation with a model and a dataset. Our last function is…

avg-phi

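A sketch (the exact lookup path is an assumption, based on the structure shown below):

(define (avg-phi ev-id)
  (get-in (fetch ev-id) ["result" "model" "average_phi"]))
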
Pretty simple too!

We take the evaluation ev-id and fetch its data from BigML (fetch ev-id). Then we access the “average_phi” attribute nested under “model” and “result”. The data object looks like this:

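Schematically, with illustrative values:

{"result" {"model" {"average_phi" 0.89
                    ...}
           ...}
 ...}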

And there we have it. A WhizzML script that helps detect covariate shift.

All together, the full script is simply the functions sketched above, assembled into a single file.


We can run our function like this:

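For example, with hypothetical dataset ids:

(phi-coefficient "dataset/<training-dataset-id>"
                 "dataset/<production-dataset-id>"
                 "test-seed")
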
But…

As we read in the previous post, it is best to do this process several times and look at the average of the results. How could we add some more code to do this programmatically? Here’s one implementation.

multi-phis

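A sketch reconstructed from the explanation below:

;; Run phi-coefficient n times with different seeds and collect the results.
(define (multi-phis training-dataset production-dataset n)
  (loop (seeds (range 0 n)
         out [])
    (if (empty? seeds)
      {"list" out
       "average" (/ (reduce + 0 out) (count out))}
      (recur (tail seeds)
             (append out (phi-coefficient training-dataset
                                          production-dataset
                                          (str "test-" (head seeds))))))))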

Again, we are giving this function our training-dataset and production-dataset. But we are also passing in n, which is the number of phi-coefficients we want to calculate. As you can see, we are defining a loop.

Within this loop, we set some variables.

seeds, we give the default (starting) value of (range 0 n). If we pass in 4 for the value of n then the initial value of seeds = [0 1 2 3]

out is our output. We will add the result of a phi-coefficient run each time through the loop. Initially, out = []

We also define the end-scenario.

seeds = (tail seeds). This grabs everything but the first element of seeds. So the first time through, it might be [0 1 2 3], then it will be [1 2 3], then [2 3]

If seeds is not empty, we go back to the loop, but define values for seeds and out.

If seeds is empty, then we return a map with the values list and average (we’ll explain these in a bit).

out = (append out (phi-coefficient ...)) We take the result of our phi-coefficient function and add it to the out list. The first time through, it’s [], then [-0.0838], then [-0.0838, 0.1240] etc.

The seed we will use for each of these phi-coefficient runs will be "test-0", "test-1", "test-2", etc. That’s what (str "test-" (head seeds)) is doing – joining the string "test-" with the first element of the seeds list.

The last thing we should discuss is the end-case return value:

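That is, the map built at the end of the sketch above:

{"list" out
 "average" (/ (reduce + 0 out) (count out))}
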
The value of “list” (out) is just the list of phi-coefficient values from each run. And “average” is… yep, the average of all the runs: reduce adds up the elements, count counts the number of elements, and / divides the first argument by the second. That’s it!*

Example run:

[Screenshot: an example execution, returning the list of phi coefficients and their average]

We have now automated the process to investigate whether our distribution of data has changed. Great! You might want to create a scheduled job to check your production data against the data you used to create a predictive model. When the covariate shift exceeds a threshold, retrain the model!

Why WhizzML?

Wait… couldn’t we already do this with the API bindings? What’s special about WhizzML?

Yes, we could use the API bindings. However, there are two significant advantages to WhizzML. First, what if we write this workflow in Python and later decide we want the same workflow in a Node.js app? We would have to rewrite the whole workflow! WhizzML lets us codify our workflow once and use it from any language. Second, WhizzML removes the complexity and brittleness of sending multiple HTTP requests to the BigML server (for creating intermediate resources, fetching data, etc.). One HTTP request is all you need to execute a workflow with WhizzML.

Stay tuned for more blog posts like this that will help you get started automating your own Machine Learning workflows and algorithms.


*There is actually one more thing we can do: a performance enhancement. In each phi-coefficient run, we recreate the train-data, prod-data and comb-data datasets. This is unnecessary – we can reuse the comb-data dataset and just sample it differently for each run! You can check out the code that includes this improvement here. Note that the comb-data logic from the phi-coefficient function is moved into the loop of multi-phis, and thus the phi-coefficient function is renamed to sample-and-score.

 

WhizzML Tutorial I: Automated Dataset Transformation


I hope you’re as excited as I am to start using WhizzML to automate BigML workflows! (If you don’t know what WhizzML is yet, I suggest you check out this article first.) In this post, we’ll write a simple WhizzML script that automates a dataset transformation process.

As those of you who have dealt with datasets in a production environment know, sometimes there are fields that are missing a lot of data. So much so that we want to ignore the field altogether. Luckily, BigML automatically detects useless fields like this and ignores them when we create a predictive model. But what if we want to specify the required “completeness” of the data field? Say we only want to include fields that have values for more than 95% of the rows.

We can use WhizzML!

Let’s do it! Look to the WhizzML Reference Guide if you need it along the way. Also, the source code can be found in this GitHub Gist.

We want to write a function that, given a dataset and a specified threshold (e.g., 0.95), returns a new dataset with only the fields that are more than 95% populated. Our top-level function is defined below.

filtered-dataset

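The original shows the code as an image; assembled from the step-by-step walkthrough that follows, it looks roughly like this (helper names and option keys are taken from the description):

;; Sketch reconstructed from the walkthrough; details may differ.
(define (filtered-dataset dataset-id threshold)
  (create-and-wait-dataset
    {"origin_dataset" dataset-id
     "excluded_fields" (excluded-fields dataset-id threshold)}))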

Hey, slow down!

Ok. Let’s take it step-by-step. We define a new function called filtered-dataset that takes two arguments: our starting dataset, dataset-id, and a threshold (e.g., 0.95).

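In sketch form:

(define (filtered-dataset dataset-id threshold)
  ...)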

What do we want this function to do? We want it to return a new dataset, hence:

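That is, a call to the dataset-creation primitive:

(create-and-wait-dataset {...})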

But we don’t just want any old dataset, we want one based off our old dataset:

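Continuing the sketch:

(create-and-wait-dataset {"origin_dataset" dataset-id})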

And we also want to exclude some fields from our old dataset!

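Continuing the sketch:

(create-and-wait-dataset {"origin_dataset" dataset-id
                          "excluded_fields" [...]})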

Ah, but which fields do we want to exclude? We can let a new function called excluded-fields figure that out for us. But for now, all we need to know is that this new function (excluded-fields) takes two arguments: our original dataset and our specified threshold.

The line above becomes: (indentation removed for clarity)

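Presumably:

"excluded_fields" (excluded-fields dataset-id threshold)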

As we progress, keep in mind that we want this new function (excluded-fields) to return a list of field names (e.g., ["field_1" "field_2" "field_3"]).

Great! We have defined our base function. Now we have to tell our new function, excluded-fields, how to give us the list that we want.

excluded-fields

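A sketch assembled from the pieces discussed below:

;; Sketch reconstructed from the walkthrough; details may differ.
(define (excluded-fields dataset-id threshold)
  (let (data (fetch dataset-id)
        all-field-names (get data "input_fields")
        total-rows (get data "rows"))
    (filter (lambda (field-name)
              (> threshold (present-percent data field-name total-rows)))
            all-field-names)))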

Wow what?
You can use that code for reference, but don’t be intimidated. We’ll go over each piece. First we define the function, declaring its two arguments: our original dataset, and the threshold we want to use.

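In sketch form:

(define (excluded-fields dataset-id threshold)
  ...)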

Before we write any more code, let’s talk about the meat of this function. We want to look at all the fields (columns) of this dataset and find the ones that are missing too much data. We’ll keep the names of these “bad” fields so that we can exclude them from our new dataset. To do this, we can use the function filter. It takes a predicate (a predicate is like a test) and a list, and returns a new list based on the predicate. In our case, the predicate is that the field has less than 95% of the rows populated.

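Schematically:

(filter <predicate> <list>)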

The predicate should be a function that either evaluates to true or false based on each element of the list we pass to it. If the predicate returns true, then that element of the list is kept. Otherwise, it is thrown out.

We can define the predicate function using lambda. lambda is like any other function definition. We have to tell it the name of the thing we are passing into it

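Something like:

(lambda (field-name)
  ...)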

and also tell it what we are going to do with that thing.

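Presumably:

(lambda (field-name)
  (> threshold <percent-of-data-that-the-field-has>))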

In our case, we are checking to see if the threshold is greater than the amount of data present. We will keep the field-name(s) that do not have enough data. (Because remember, these are the fields that will be excluded from our new dataset.) Two things are still missing from our filter:

  1. all-field-names
  2. <percent-of-data-that-the-field-has>

How do we get these? The first isn’t too difficult because BigML Datasets have this information readily available. We just have to “fetch” it from BigML first.

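That is:

(fetch dataset-id)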

and then specify which value we want to get.

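Presumably:

(get (fetch dataset-id) "input_fields")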

Nice. To figure out what percent of the rows are populated for a specific field, we get to… Define a new function! But before we do that, let’s talk about some things we skipped over in our excluded-fields function. Here it is again, for convenience.

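(The same sketch as above.)

(define (excluded-fields dataset-id threshold)
  (let (data (fetch dataset-id)
        all-field-names (get data "input_fields")
        total-rows (get data "rows"))
    (filter (lambda (field-name)
              (> threshold (present-percent data field-name total-rows)))
            all-field-names)))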

What is let?

let is the method for declaring local variables in WhizzML.

  1. We set the value of data to the result of (fetch dataset-id).
  2. We set the value of all-field-names to the result of (get data "input_fields")
  3. We set the value of total-rows to the result of (get data "rows"). (We didn’t talk about this yet. It’s one of the values we need to pass to the present-percent function)

let is useful for a couple of reasons in this function. First, we use data twice. So we can avoid the repetition of writing (fetch dataset-id) twice. Second, naming these variables at the top of the function makes the rest much easier to read and comprehend!

So to wrap up this excluded-fields function, let’s talk through what it does again.
First, it declares the local variables that we’ll need. Then, we take the list of all-field-names and filter it based on a function that checks each field’s “present percent” of data points. We keep the names of the fields that do not have enough data, i.e., those for which the predicate returns true. Cool! Now we’ll go over that present-percent function.

present-percent

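A sketch (the "fields" key on the fetched dataset is an assumption):

;; Sketch reconstructed from the walkthrough; details may differ.
(define (present-percent data field-name total-rows)
  (let (fields (get data "fields"))
    (- 1 (/ (missing-count field-name fields) total-rows))))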

Ah. Not so bad. To calculate the percentage of data points that are present in a given field, we need a few things:

  1. The big collection of data from our dataset (data).
  2. The name of the field we are inspecting (field-name).
  3. The total number of rows in our dataset (total-rows).

We’ll set another local variable using let and call it fields. This is another object containing data about each of the fields. We’ll be using it below.

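Something like:

(let (fields (get data "fields"))
  ...)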

Then, we divide the missing-count from the field by the total-rows. This gives us a “missing percent”.

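That is:

(/ (missing-count field-name fields) total-rows)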

We subtract the “missing percent” from 1 and that gives us the “present percent”!

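And so:

(- 1 (/ (missing-count field-name fields) total-rows))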

But “missing-count” is another function!

Yes it is!

missing-count

missing-count takes two arguments. First, the name of the field we are inspecting (field-name). Second, the fields object we mentioned earlier, which holds a bunch of information about each of the dataset’s fields. To get the count of missing rows of data in the field, we do this:

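Presumably a nested lookup along these lines (the exact path is an assumption based on the structure shown next):

(define (missing-count field-name fields)
  (get-in fields [field-name "summary" "missing_count"]))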

It lets us access an inner value (e.g., 10) from a data object structured like so:

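Schematically, with illustrative values:

{"field_1" {"summary" {"missing_count" 10
                       ...}
            ...}
 ...}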

And… That’s it! We have now written all the pieces to make our filtered-dataset function work! All together, the code should look like this:

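A sketch of the assembled script (the GitHub Gist linked above has the definitive version):

;; Sketch assembled from the walkthrough; details may differ.
(define (missing-count field-name fields)
  (get-in fields [field-name "summary" "missing_count"]))

(define (present-percent data field-name total-rows)
  (let (fields (get data "fields"))
    (- 1 (/ (missing-count field-name fields) total-rows))))

(define (excluded-fields dataset-id threshold)
  (let (data (fetch dataset-id)
        all-field-names (get data "input_fields")
        total-rows (get data "rows"))
    (filter (lambda (field-name)
              (> threshold (present-percent data field-name total-rows)))
            all-field-names)))

(define (filtered-dataset dataset-id threshold)
  (create-and-wait-dataset
    {"origin_dataset" dataset-id
     "excluded_fields" (excluded-fields dataset-id threshold)}))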

And we can run it like this:

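For example, with a hypothetical dataset id:

(filtered-dataset "dataset/<your-dataset-id>" 0.95)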

And get a result like this: "dataset/574317c346522fcd53000102" – a new dataset without those mostly empty fields. I can add this script to my BigML dashboard and use it with one click. Or I can put it in a library, and incorporate it into a more advanced workflow. Awesome!

Stay tuned for more blog posts like this that will help you get started automating your own Machine Learning workflows and algorithms.

WhizzML Launch Webinar Recording is Here! In-depth WhizzML Training Series Open for Registration

Last week BigML announced WhizzML, a new domain-specific language for automating Machine Learning workflows, implementing high-level Machine Learning algorithms, and easily sharing them with others. If you missed the announcement event, you can watch the launch webinar by clicking the link below. This webinar will be complemented by a series of in-depth training sessions for the true innovators who are looking to push the envelope when it comes to the uptake of Machine Learning in their organizations. Consider this your FREE invitation to join this exclusive four-part online event. See the details below.

WhizzML marks a turning point in how companies can automate Machine Learning as it offers out-of-the-box scalability, abstracts away the complexity of underlying infrastructure, and helps analysts, developers, and scientists double or even triple their productivity by reducing the burden of repetitive, brittle and time-consuming Machine Learning tasks.  If you complete the following four training sessions, you will not only leap ahead in your understanding of real life Machine Learning automation challenges but also receive a BigML T-shirt to commemorate your achievement.


The first session will cover all the basics describing how WhizzML is implemented on the BigML platform. Ryan Asensio, BigML’s Machine Learning Engineer, will be introducing the purpose of the language and some benefits over other ways of implementing Machine Learning workflows and algorithms. Join us on Monday, May 30, 2016 at 10:00 AM US PDT (Portland, Oregon. GMT -07:00) / 7:00 PM CEST (Valencia, Spain. GMT +02:00).

In this intermediate webinar, Charles Parker, BigML’s VP of Machine Learning Algorithms, will start exploring the WhizzML domain-specific language in greater detail, with a whirlwind tour of its syntax, programming constructs and basic standard library functions. We will also learn how to create and use WhizzML resources (libraries, scripts and executions) by means of several simple yet fully functional example workflows. It will take place on Tuesday, May 31, 2016 at 10:00 AM US PDT (Portland, Oregon. GMT -07:00) / 7:00 PM CEST (Valencia, Spain. GMT +02:00). Register now, as space is limited!

In this advanced webinar, we will continue our exploration of the WhizzML language, diving into more complex examples and using more advanced features of the language. Charles Parker, BigML’s VP of Machine Learning Algorithms, will explain how some of the most effective Machine Learning algorithms can be implemented and automated on top of the BigML platform with WhizzML. Sign up and reserve your spot for Wednesday, June 1, 2016 at 10:00 AM US PDT (Portland, Oregon. GMT -07:00) / 7:00 PM CEST (Valencia, Spain. GMT +02:00).

In this advanced session, we will walk you through some real-world workflow automations with an eye towards the kind of problems posed by complex use cases, and use some of the best tricks to solve them with confidence. This webinar will be presented by Poul Petersen, BigML’s Chief Infrastructure Officer. It will take place on Thursday, June 2, 2016 at 10:00 AM US PDT (Portland, Oregon. GMT -07:00) / 7:00 PM CEST (Valencia, Spain. GMT +02:00). We hope to see you all there!

More training resources:

In addition to these online training sessions, if you prefer the self-study approach, you may want to download and read our WhizzML guides, documentation, and tutorials, as well as the slide decks with basic, intermediate and advanced Machine Learning workflows. We have also prepared a number of useful scripts that you can practice with to get more hands-on with WhizzML. You’ll find those in BigML’s Gallery. There are also plenty of example scripts and libraries available in the WhizzML GitHub repository. Please visit our release page and the dedicated WhizzML page to easily navigate to your resource of choice. Welcome to the world of WhizzML!

Let Your Voice Be Heard with Vision Mobile Developer Survey and Win a Prize

We’re proud to be supporting the new developer survey run by our friends at VisionMobile. This is the 11th developer survey and it’s entitled Developer Tools Benchmarking – as you can understand, the focus is on developer tools.


The survey features questions on topics like programming languages, platforms, app categories, tool categories, revenue models, and IoT verticals. This year, they have included specific survey questions on Machine Learning tools such as BigML as a new area to explore. It’s a survey made by developers, for developers – so the questions are very relevant.

Whether you’re looking to share your thoughts with the dev community, find out something new, contribute to the leading developer research – or win a great prize – this is the survey for you. Participants can win one of tens of prizes, including an iPhone 6s, an Xperia Z5, a Nexus 6P, and more. VisionMobile will also share the survey findings with you, and show you how your answers compare to those of other developers in your country.

Thanks for your participation in advance. Feel free to share/tweet and otherwise promote this post in your network, so we have the best possible coverage of the Machine Learning developer community.

Please start here: http://vmob.me/DE3Q16BigML

Automating Machine Learning in Madrid!

We are very excited with all the positive feedback about BigML’s latest release. It was a huge milestone to announce that WhizzML, the very first domain-specific language for automating Machine Learning workflows, implementing high-level Machine Learning algorithms, and sharing them with others, is now publicly available.

Thanks to everyone who attended. For those who couldn’t make it, we’ll publish the video recording soon.

More than ever, BigML is committed to its mission to make Machine Learning beautifully simple for everyone. We’ll be hosting four new FREE sessions about WhizzML to advance this worthy vision. Be sure to reserve your spot here.

We will also be spreading the word on multiple fronts. Next week, Wayra Madrid is hosting our first public presentation of WhizzML after our release yesterday. Our CIO, Poul Petersen, will demonstrate how WhizzML can help analysts, developers, and scientists reduce the burden of repetitive and time-consuming analytics tasks with its out-of-the-box scalability and its ability to abstract away the complexity of underlying infrastructure. Check out the poster below for further details on the special event.

The presentation will take place at Wayra Academy in Madrid, on Wednesday, May 25, 2016 at 7:00 PM CEST (GMT +02:00). Registration is FREE and by invitation only. To apply and secure your spot for this meetup, please register here before Monday, May 23, 2016 at 11:00 PM CEST (GMT +02:00).

If you are in Madrid or nearby, reserve your seat now and get your invite. Space is limited!

P.S. There will be beer, drinks and pizza for networking among attendees!

Automating Machine Learning Workflows



Machine Learning (ML) services are quickly becoming a taken-for-granted part of the software developer’s toolbox, in any domain. These days, databases and networking are standard components of almost any non-trivial application, so easily integrated that almost no special expertise is required. We expect to see Machine Learning becoming, in the very near future, a similar layer in the software stack.

This commoditization of ML services has been driven so far by service-oriented platforms such as BigML, which have provided a key ingredient of the process: abstraction. Simple, easy-to-use REST APIs hide away not only the details of the sophisticated algorithms underlying the services at hand, but also the complexities of scaling those computations both over CPU cycles and input data volumes. A programmer no longer needs to be an expert in distributed systems or the fine details of implementing ML algorithms in order to incorporate intelligent behaviour in her applications.

Or does she? The answer is, well, sometimes. Machine Learning platforms excel at hiding away the complexities of distributed computation, and are already providing powerful building blocks in the algorithmic department: decision trees, ensembles, random forests, anomaly detectors, automated feature engineering and many other pieces are already there. And, sometimes, they are enough to cover the needs of the end user. However, it is also often the case that the machine learning solution for non-trivial problems requires combining many of those primitives and algorithms into a complex workflow.

A Fresh Approach to Automation

Machine learning workflows are most of the time iterative, and typically involve the creation of many intermediate datasets, models, evaluations and predictions, all of them dynamically interwoven. There are many examples of such workflows, and we have covered several of them in this very blog, from basic ones that automate repetitive tasks (e.g., create a dataset, filter it by removing anomalies, then cluster it to label instances and finally model the result and make predictions) to sophisticated algorithms that enhance our machine learning arsenal (e.g., feature selection or hyperparameter optimization techniques).

So, quite often, the predictive components of end-user applications are workflows that must be built on top of our machine learning platforms using a general purpose language such as Python or Java and its bindings to the corresponding platform’s simple and nice API. That is, we write client-side programs that combine server-side API calls to solve our problem.

While effective, this way of automating destroys a good deal of the simplicity and abstraction that those APIs had worked very hard to provide. In particular, automation based on client-side programming against remote APIs presents, among others, the following drawbacks:

  • Complexity Coding the workflow in a standard programming language implies worrying over details outside of the problem domain and not directly related to the machine learning task.
  • Reuse REST APIs such as BigML’s provide a uniform, language-agnostic interface for solving machine learning problems; however, the moment we codify a workflow using a specific programming language, we lose that interoperability. For instance, workflows written in Python won’t be reusable by the Node.js or Java communities.
  • Scalability Complex, client-side workflows are potentially very hard to optimize, and they lose the automatic parallelism and scalability offered by sophisticated server-side services. Furthermore, there are performance penalties inherent to the client-server paradigm: asynchronously pipelining a remote workflow introduces an extra number of intermediate delays just to check the status of a created resource before moving forward in the workflow.
  • Client-server brittleness Client-side programs need to take care of non-trivial error conditions derived from network failures. When any step in a complex workflow can potentially fail, handling errors and resuming from interruptions without losing valuable work already performed becomes a big challenge.

In short, the underlying REST API is already providing an abstraction layer over computing resources and the nitty-gritty details of many algorithms, but client-side workflows are re-introducing inessential details in our workflow specifications.

We need a way to recover, for complete workflows, the nice properties that ML platforms already give us for each individual step a workflow is made of.

In the software engineering field, we have long known a very effective method to cure the kinds of problems outlined above. To increase the abstraction level of our workflow descriptions and free them from inessential detail, we must express our solution in a language that is closer to the domain at hand, i.e., machine learning and feature engineering. In other words, we need a domain-specific language, or DSL, for machine learning workflows.

If you are a BigML user, chances are that you are already familiar with another DSL that the platform offers to perform feature synthesis and dataset transformations, namely, Flatline. With that domain-specific language, we offer a solution to the problem of specifying, via mostly declarative symbolic expressions, new fields synthesized from existing ones in a dataset that is traversed using a sliding window. The DSL lets you also name things and use those names inside your expressions, an archetypal means of abstraction.

In Flatline, we talk only of the domain we are in, namely features (old and new), as well as their combination and composition. Also note how the Flatline language is based on expressions for the new features we are creating, rather than a procedural sequence of instructions describing how to synthesize those features.

Now, we want to be able to formulate machine learning workflows with a similarly expressive DSL. And that is precisely what WhizzML provides: a language specifically tailored to the domain of machine learning workflows, containing primitives for all the tasks available via BigML’s API (such as model creation, evaluation or dataset transformation) and providing powerful means of composition, iteration, recursion and abstraction.

An Example WhizzML Workflow

To give you a quick flavor of what WhizzML syntax looks like, here’s a very simple workflow. It generates a batchcentroid dataset from a source, encapsulating all the steps needed. Namely, creating a dataset, creating a cluster, creating a batchcentroid and extracting the resulting dataset from it:

(define (one-click-batch-centroid src-id)
  (let (ds-id (create-and-wait "dataset" {"source" src-id})
        cl-id (create-and-wait "cluster" {"dataset" ds-id
                                          "default_numeric_value" "median"})
        bc-id (create-and-wait "batchcentroid" {"cluster" cl-id
                                                "dataset" ds-id
                                                "output_dataset" true
                                                "all_fields" true}))
     (get (fetch bc-id) "output_dataset_resource")))

(define dataset-id (one-click-batch-centroid source-id))

Here, we can already recognize some of the qualities we have been discussing: the keywords define and let introduce new names for different artifacts: the identifiers of the dataset, cluster and batchcentroid that are created (ds-id, cl-id and bc-id), the function one-click-batch-centroid that encapsulates the workflow, and the identifier of the resulting dataset, dataset-id, which will be the workflow’s output. We also see that primitives for the creation of BigML resources are readily available (create-and-wait, etc.) and take care of their time evolution (in this case, by waiting for their successful completion). All the steps of the workflow are thus trivially rendered.

Admittedly, careful use of high-level language bindings such as Python’s can attain similar levels of readability for workflows as simple as this example. However, as you’ll see in other examples, WhizzML’s readability scales better when the complexity of the workflow increases.

Moreover, there are other immediate advantages of using the WhizzML script for expressing your workflow, even in this simple case. For one, it is automatically cross-language and available for any development platform. Even more importantly, it is executable as a single unit fully on the server side, inside BigML’s system: there, we not only avoid the inefficiency of all the network calls needed by, say, a Python script, but also automatically parallelize the job. In that way, we are recovering one of the big advantages of Machine Learning-as-a-service: scalability of the available machine learning tasks, which are mostly lost for composite workflows when using the traditional, bindings-based solution.

The Power of Reusability

Another very important advantage of using WhizzML instead of client-side scripts written in your favorite programming language is that code written in WhizzML becomes reusable server side resources. Specifically, you have three kinds of resources at your disposal:


  • Library A collection of WhizzML definitions that can be imported by other libraries or scripts. Libraries therefore do not describe directly executable workflows, but rather, new building blocks (in the form of new WhizzML procedures, such as one-click-batch-centroid above) to be (re)used in other workflows.
  • Script A WhizzML script (as the ones we have seen in the previous section) has executable code and an optional list of parameters, settable at run time, that describe an actual workflow. Scripts can use any of the WhizzML primitives and standard library procedures. They can also import other libraries and use their exported definitions. When creating a script resource, users can provide a list of outputs in its metadata, in the form of a list of variable names defined in the script’s source code.
  • Execution When a user creates a script, its syntax is checked and it gets pre-compiled and readied for execution. Each script execution is described as a new BigML resource, with a dynamic status (from queued to started to running to finished) and a series of actual outputs. When creating an execution, besides the identifier of the script to be run, users will typically provide the actual values for the script’s free parameters (just as you provide its arguments’ values, when you call a function in any programming language). It is also possible to pipe together several scripts in a single execution.

Workflow scripts written in WhizzML are reusable server side resources, which makes WhizzML’s procedural abstraction capabilities especially powerful. New procedures can be made available, as part of a library (just another kind of REST resource), to other scripts. We have thus gained the ability to define new machine learning primitives, climbing even higher in our abstraction layer, without losing the benefits of performance or scalability associated with a Service-oriented platform such as BigML.

To learn more about WhizzML jump right away to its home page and keep an eye on future posts in this series!

BigML Spring 2016 Release and Webinar: Automating Machine Learning!

BigML Spring 2016 release is here! Join us on Thursday, May 19, 2016 at 10:00 AM US PDT (Portland, Oregon. GMT -07:00) / 7:00 PM CEST (Valencia, Spain. GMT +02:00) for a FREE live webinar to learn about the latest and greatest version of BigML. We’ll be focusing exclusively on WhizzML, a new domain-specific language that lets you automate Machine Learning workflows, implement high-level Machine Learning algorithms, and share them with others. WhizzML stands to make a big difference not only in how developers conceive of and implement smart applications, but also how analysts and scientists reduce the burden of repetitive analyses.


Simply stated, WhizzML is a domain-specific programming language that helps you automate sophisticated Machine Learning workflows. Why come up with a new language? Machine Learning is an iterative and computationally intensive process, where each iteration consists of multiple stages. This makes it a perfect candidate for automation. WhizzML utilizes server-side optimization to streamline workflows without burdening the end-user with all the complexities of managing numerous backend tasks, e.g., setting up and configuring servers according to the size of a Machine Learning task at hand. This eliminates brittle client-server communication bottlenecks and the need to configure infrastructure, significantly boosting your Machine Learning productivity.


WhizzML can also be put to great use by implementing high-level Machine Learning algorithms of your own, e.g., Boosting or Stacked Generalization. The modular design of underlying BigML capabilities and algorithms will not only help you get more creative with your solutions, but it will also let you codify them in a standardized way that lends itself well for troubleshooting, maintenance, and further enhancements.


Finally, WhizzML provides an infrastructure for creating and sharing Machine Learning Scripts and Libraries, facilitating the reuse of proven techniques within your team. We have already prepared a gallery with a collection of useful scripts that you can use to practice with before you go ahead and add your own. The fact that the scripts are programming language agnostic doesn’t hurt either!


During the webinar, we will be showcasing a diverse set of practical WhizzML Libraries and Scripts that can save you a significant amount of time and effort.

Ready to see WhizzML in action? Join us on Thursday, May 19, 2016 at 10:00 AM US PDT (Portland, Oregon. GMT -07:00) / 7:00 PM CEST (Valencia, Spain. GMT +02:00). Be sure to reserve your free spot today as space is limited! We will also be giving away BigML t-shirts to those who submit questions during the webinar. Don’t forget to request yours!
