
Must See Machine Learning Talk by Geoff Webb in Valencia

BigML and Las Naves are getting ready to host the 2nd Machine Learning Summer School in Valencia (September 8-9), which is fully booked. Although we are not able to extend any new invitations for the Summer School, we are happy to share that BigML’s Strategic Advisor Professor Geoff Webb (Monash University, Melbourne) will be giving an open talk on September 8th at the end of the first day of the Summer School.  All MLVLC meetup members are cordially invited to attend this talk, which will start promptly at 6:30 PM CEST, in Las Naves. After Professor Webb’s talk, there will be time allocated for free drinks and networking. Below are the details of this unique talk.

A multiple test correction for streams and cascades of statistical hypothesis tests

Statistical hypothesis testing is a popular and powerful tool for inferring knowledge from data. For every such test performed, there is always a non-zero probability of making a false discovery, i.e. rejecting a null hypothesis in error. Family-wise error rate (FWER) is the probability of making at least one false discovery during an inference process. The expected FWER grows exponentially with the number of hypothesis tests that are performed, almost guaranteeing that an error will be committed if the number of tests is big enough and the risk is not managed; a problem known as the multiple testing problem. State-of-the-art methods for controlling FWER in multiple comparison settings require that the set of hypotheses be predetermined. This greatly hinders statistical testing for many modern applications of statistical inference, such as model selection, because neither the set of hypotheses that will be tested, nor even the number of hypotheses, can be known in advance.
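To make the multiple testing problem concrete, here is a minimal Python sketch. It assumes the tests are independent and all run at the same significance level alpha, which the abstract does not state; under that assumption the FWER for m tests is 1 - (1 - alpha)^m. The function names are illustrative only. The sketch also includes the classic Bonferroni correction, which needs the number of tests m up front. That is the kind of requirement that becomes a problem when the number of hypotheses cannot be known in advance.

```python
def fwer(alpha: float, m: int) -> float:
    """Family-wise error rate for m independent tests, each run at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_alpha(alpha: float, m: int) -> float:
    """Per-test level under the Bonferroni correction (m must be known in advance)."""
    return alpha / m


if __name__ == "__main__":
    for m in (1, 10, 100):
        uncorrected = fwer(0.05, m)
        corrected = fwer(bonferroni_alpha(0.05, m), m)
        print(f"m={m:>3}  uncorrected FWER={uncorrected:.3f}  Bonferroni FWER={corrected:.3f}")
```

With alpha = 0.05, the uncorrected FWER is already above 40% at ten tests, while the Bonferroni-corrected FWER stays at or below 0.05.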

Subfamilywise Multiple Testing is a multiple-testing correction that can be used in applications for which there are repeated pools of null hypotheses from each of which a single null hypothesis is to be rejected and neither the specific hypotheses nor their number are known until the final rejection decision is completed.

To demonstrate the importance and relevance of this work to current machine learning problems, Professor Webb and co-authors further refine the theory to the problem of model selection and show how to use Subfamilywise Multiple Testing for learning graphical models.

They assess its ability to discover graphical models on more than 7,000 datasets, studying the ability of Subfamilywise Multiple Testing to outperform the state-of-the-art on data with varying size and dimensionality, as well as with varying density and power of the present correlations. Subfamilywise Multiple Testing provides a significant improvement in statistical efficiency, often requiring only half as much data to discover the same model, while strictly controlling FWER.

Please RSVP for this talk soon and be sure to take advantage of this unique chance to learn more about this cutting-edge technique, while joining our Summer School attendees from around the world for a stimulating session of networking afterwards.

The Ghost Olympic Event: Machine Learning Startup Acquisition

With no lack of drama both on and off the track, the 31st Summer Olympics and its 39 events have been wrapped up recently. As the city of Rio is preparing for the first Paralympic Games to take place in the Southern Hemisphere, some are experiencing Synchronized Swimming, Canoe Slalom and Modern Pentathlon withdrawal symptoms.  As Usain Bolt, Michael Phelps and Simone Biles stole the show, Silicon Valley has not just quietly sat and watched the proceedings. Not at all.  In fact, VCs, investment bankers and tech giants active in the Machine Learning space have been in a race of their own that goes on unabated even if they don’t get the benefit of prime time NBC TV coverage.


Machine Learning as strategic weapon

It is fair to say that we have been witnessing the unfolding of the ghost Olympic event of Machine Learning startup acquisition. The business community’s scores are not fully revealed yet and the acquisition amounts are mostly being kept under wraps, albeit in a leaky kind of way. Regardless, the most recent deals include Apple acquiring Turi and Gliimpse, Salesforce purchasing BeyondCore, Intel picking up Nervana Systems, and Genee being scooped up by Microsoft. So what is driving this recent surge?

The bulk of the M&A activity has been led by household B2C names like Google, Apple and Facebook, which are sitting on top of piles of consumer data that can result in a new level of innovation when coupled with existing as well as emerging Machine Learning techniques like Deep Learning. The dearth of talent needed to make this opportunity a reality has resulted in a very uneven distribution of that talent, as deep-pocketed “acquihirers” outbid other suitors to the tune of $10M per FTE for early stage startups (and even higher in the case of accomplished academic brains).

The emerging need for a platform approach

As great as having some of the brightest minds work on complex problems is, it is no guarantee of success without the right tools and processes to maximize the collaboration with and the learning among developers, analysts and resident subject-matter experts.  Indeed, the best way to scale and amplify the impact from the efforts of these highly capable, centralized yet still relatively tiny teams is adopting a Machine Learning platform approach.

It turns out that those that started on the path of prioritizing Machine Learning as a key innovation enabler early on have already poured countless developer man-years into building their own platforms from scratch. Facebook’s FBLearner Flow, which the company recently revealed, is a great example of this trend. The platform claims to have supported over 1 million modeling experiments to date, which make 6 million predictions per second possible for various Facebook modules such as the news feed. But perhaps the most impressive statistic is that 25% of Facebook engineers have become users of the platform over the years. This is very much in line with Google’s current efforts to train more developers to help themselves when it comes to building Machine Learning powered smart features and applications.

Machine Learning haves (1%) and have nots (99%)

Examples like the above are inspirational, but this raises the question of how many companies can realistically afford to build their own platform from scratch. The short answer is “Not too many!”

Left to their own devices, these firms face the following options:

  • Hire a few Data Scientists, who may each bring their own open source tools and libraries of varying levels of complexity, potentially limiting the adoption of Machine Learning in other functions of the organization, where the ownership of mission critical applications and core industry expertise reside.

  • Turn to commercial point solution providers with a few built in blackbox Machine Learning driven use cases per function e.g., HR, Marketing, Sales etc.

  • Count on the larger B2B players’ recently launched Machine Learning platforms to catch up and mature in a way that can not only engage highly experienced Machine Learning specialists, but also serve the needs of developers and analysts alike e.g., IBM, Microsoft (Azure), Amazon (AWS) etc.

Although these options may be acceptable ways to dip your toes in the water, or to stop the bleeding by going to market with a very specific use case, they are not satisfactory longer-term approaches. None of them strikes the optimal balance between time to market, return on investment and a collaborative transformation that leads to a data-driven culture of continuous innovation, one that transcends what can be achieved with small teams of PhDs. As a result, despite the recent advances in data collection, storage and processing, we are stuck with a data rich but insights (and business outcomes) poor environment awash with a cacophony of buzzwords in many industries.

Luckily, there’s still an incipient industry of independent Machine Learning platforms like BigML, H2O and Skytree (no more Turi) that can supply this unfulfilled demand from the so far lagging 99%. However, we must remember that replacing those platforms with new complete ones may require years of arduous work by highly specialized teams, which runs counter to the present day two co-founder, Silicon Valley accelerator startup recipe targeting a quick exit despite little to no Intellectual Property.

Regardless of whether any tech bellwether is able to create a monopoly, it seems safe to assume that for the foreseeable future the race for Machine Learning talent is only going to get hotter as more companies get a taste of its value. We will all see whether this game of inverse musical chairs will last long enough to make it to the official program of Tokyo 2020!

BigML for Google Sheets Endorsed as Outstanding Add-on

The good folks over at the innovative web application integration and event based automation startup Zapier have recently prepared a comprehensive list of Top 50 Google Sheets Add-Ons to Supercharge Your Spreadsheets. We were flattered to find out that they listed BigML for Google Sheets (aka BigML-GAS) as a top add-on for “number crunching”.

Top Google Sheet Add-ons

Like all of our BigML-Tools, BigML-GAS is super-easy to use and it is a great way to expose your predictive models externally to your workgroup or other partners regardless of their experience with Machine Learning. For example, imagine that you have a model predicting the likelihood of a prospect to become a customer based on your historical sales pipeline results.  Your Field Sales team that may collect data on prospects can easily score each of their prospects without needing a BigML account of their own or knowing anything about predictive models or Machine Learning for that matter! The beauty of it is that you need no complex IT integration with any CRM tools to get going with such a use case.

But why stop there? With a vast ecosystem of apps, Google Sheets offers users a wide array of options to piece together functionality without relying on programming skills or internal developers. Besides BigML-GAS, there are other noteworthy add-ons for the analytically minded, in case they decide to get serious about automating the gathering and analysis of data in Google Sheets as well as implementing new processes based on the resulting actionable insights. Here are a couple from Zapier’s eBook:

  • Blockspring: Lets you import and analyze data with APIs. For example, you can extract pricing data for your competitors’ products from the Amazon API. Once you import the data, you can analyze it manually or on a schedule.
  • Aylien: This is a great add-on for text analytics tasks such as sentiment analysis. For instance, you can pull in Tweets about your products and analyze their sentiment scores to see which ones are being received more favorably in which geographies.
  •  With this service, you can turn any website into a spreadsheet. If you would like to augment your Sheet with data from external sites, use this handy tool to scrape the data and infuse with what you have. is Free for up to 10k queries per month.
  • Geocode by Awesome Table: This add-on helps you get latitudes & longitudes from addresses in a Google Sheet to display them on a map that you can share.

In the coming weeks, we will present some interesting use cases that leverage the Google Sheets ecosystem along with BigML. In the meantime, do you have a cool idea that you have already implemented on BigML? Are you planning to extend it to Google Sheets with BigML-GAS? If so, let us know and we will present it on the BigML blog to spread the learning.


Scriptify: 1-Click Reification of Complex Machine Learning Workflows

Real world Machine Learning is not just about the application of an algorithm to a dataset but a workflow, which involves a sequence of steps such as adding new features, sampling, removing anomalies, applying a few algorithms in cascade, and stacking a few others. The exact steps are often arrived at during iterative experiments performed by the practitioner. In other words, when it comes to the real life Machine Learning process, not everything is as automatic as various business media may make you believe.

Usually, one starts by playing around a bit with the data to assess its quality and to get more familiar with it. Then, a significant amount of time is spent in feature engineering datasets, configuring models, evaluating them, and iterating or combining resources to improve results. Finally, when the right workflow is found, traceability and replicability become must have concerns to bring the workflow to a production environment. Without those attributes, one can’t ensure that errors are eliminated, or workflows can be rerun and otherwise improved by everyone in a workgroup.

You are probably asking, “That all sounds great, but how does one achieve this without creating even more complexity?” That’s precisely why today BigML has launched a new game-changing feature, which can create a WhizzML script for any Machine Learning workflow in a single click: Scriptify.


Auto-scripting workflows to generate your resources

With this update, all BigML resources (datasets, models, evaluations, etc.) have a new menu option named Scriptify that allows you to automatically generate a script that can regenerate that resource (or generate a similar one if it is used with different inputs with the same field structure)! BigML resources have always been downloadable as JSON objects, where the properties specified by the user when the resource was created or updated are stored as attributes. This white-box approach is crucial to ensure that everything you do with BigML is traceable, reusable and replicable. So we can use our recently launched scripting language, WhizzML, to inspect any existing resource in BigML and easily extract the entire workflow needed to generate it.

Let’s start by explaining an example use case. Say you created an evaluation some time ago, but you don’t remember which model was evaluated, whether it was a balanced model or not, or which test dataset was used. No problem! This information is already stored in each of the resources. You don’t need to track it down manually or document it in a separate repo. Clicking the Scriptify your evaluation link  in the actions menu of your evaluation view screen will unravel it, and swiftly generate a script, which can reproduce the original evaluation.


This new script will now be available as another resource. Like any other WhizzML script, you will be able to execute it to recreate your evaluation on demand.

Scriptify steps

To create a resource in BigML, you usually provide the starting resource (e.g., if you want to build a model, you’ll need to decide from which dataset) and some configuration. The corresponding Scriptify action retrieves this information, and recursively does the same for all resources used throughout your entire workflow.

Following the example in the previous section, first the evaluation is analyzed to find out the IDs of both the model that was evaluated and the dataset used as test dataset since these are the origin resources for the evaluation. Then, each of them is analyzed recursively to find out the origin resources that were used to build them. The model was built from a dataset and the test dataset from a source. Finally, it turns out that the dataset used to build the model was built from the same source as the test dataset by using a 80%-20% split. In general, any Scriptify call will bubble up through the hierarchy of parent objects until it finds the source object for every resource involved in the analysis. As the Scriptify process needs to explore every object in the hierarchy, it will be stopped if any intermediate resource has been deleted. For each of the resources, the script extracts the attributes used in the create and/or update calls and generates a new script which contains the WhizzML code able to replicate them.


As you can see in the code example, all the resources derive from the data in a file named s3://bigml-public/csv/diabetes.csv, which was initially uploaded to build a source object. This URL is kept as input for the script, but you can change it if need be. In a production environment, you periodically need to repeat the same process on new data. Using this script, you would only need to provide the new URL as input for the script to rebuild the evaluation on new data by using the same procedure and configuration.
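The recursive lookup described in this section can be illustrated with a small Python sketch. This is not the WhizzML that Scriptify generates, and it does not call the BigML API: the RESOURCES dictionary, the resource IDs and the creation_order function are all hypothetical stand-ins. It only shows the idea of walking from a resource back to its origins, failing when an intermediate resource is missing, and emitting the resources in an order in which they could be recreated.

```python
from typing import Dict, List

# Hypothetical stand-in for stored resources: each ID maps to the IDs of the
# resources it was created from (its origin resources).
RESOURCES: Dict[str, List[str]] = {
    "evaluation/1": ["model/1", "dataset/test"],
    "model/1": ["dataset/train"],
    "dataset/train": ["source/1"],
    "dataset/test": ["source/1"],
    "source/1": [],
}


def creation_order(resource_id: str, resources: Dict[str, List[str]]) -> List[str]:
    """Return resource IDs ordered so that every resource follows its origins."""
    order: List[str] = []
    seen = set()

    def visit(rid: str) -> None:
        if rid in seen:
            return
        if rid not in resources:
            # Mirrors the rule that the process stops on a deleted resource.
            raise KeyError(f"{rid} is missing, so the workflow cannot be rebuilt")
        seen.add(rid)
        for origin in resources[rid]:
            visit(origin)
        order.append(rid)

    visit(resource_id)
    return order


if __name__ == "__main__":
    print(creation_order("evaluation/1", RESOURCES))
```

Replaying the returned list from first to last recreates the source before the datasets, the datasets before the model, and the model before the evaluation.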

Scriptify as the building block for automation

Another interesting property of scripts in BigML is that you can modify them to create new scripts. The link create a new script using this one opens an editor screen, where you can modify the code in the original script. Following the example, if you find out that your model was not balanced and you want to try the evaluation on a balanced model, you can do so by adding the balance_objective flag to the model creation call attributes.


Clicking the validate button checks your changes and points out any errors. You can also change the outputs of the script. This view searches all the possible outputs in the script and offers you a list to select from. In this case, we set the model ID to be returned as output in addition to the original evaluation ID.

As you can see, you don’t really need to know WhizzML to start scripting your workflows. Rather, you can create any workflow on BigML’s Dashboard and let Scriptify magically transform it into WhizzML code! You can even make it a 1-click automation option on your menu. Finally, this code can also be modified and composed to create, share, and execute new workflows in a scalable, parallelizable and reproducible way. Now it’s your turn to transform your custom Machine Learning resources into blazingly fast, traceable and reproducible workflows ready for production environments!

How to Discover What is Important to Your Clusters

Raw information is useless if you don’t understand what it means, but sometimes there is just so much it’s hard to get a handle on what is going on. One way to better understand your data is through cluster analysis – grouping similar data in “clusters”. At BigML we use a centroid-based clustering to group your data with just one click. While this is terribly convenient, it can obscure how the clustering decisions are actually made. When a dataset has dozens of input fields (or more!), how can you tell which ones were actually important in grouping your data?

Cluster Classification

What is important anyway?

This is a really big question, but at BigML we specialize in turning big questions into answers. Here something is important to a process if it affects the outcome of that process. For example, consider the importance of an input field when building a decision tree. A BigML decision tree automatically finds the importance of each input field by finding every time an input field was used to make a split in the tree and then averaging how different the prediction would have been without that split. With this definition of importance, more is better. While a single tree may give some understanding of which fields are important, a whole forest of trees would give even more certainty that the most important fields are identified.

How can we apply this definition of importance to the case of clusters? If one of the input fields was which cluster the datapoint belonged to, we could build a model to predict that field. Find the importance of other input fields in this model, and that would give their importance to deciding cluster membership.

In fact, there is already a one-click way to grow a tree based on a cluster just this way. When viewing a cluster in the BigML dashboard, if you shift click on a particular cluster, below on the right you can click to create a model (a single tree) from this cluster. This will give you the importance of each input field, but only to this cluster. But until now, there was no easy way to see the overall importance of each field considering all the clusters.


Global importance is here!

This new BigML script creates not just one, but an ensemble of trees designed to find the importance of your input fields. With just a few clicks, you will know which fields contribute to how your clusters were decided. And because the script uses an ensemble, you can be more confident that these fields really are the ones you want.

You can import this script directly from our gallery here. Now you are ready to analyze your cluster! Pick any of your clusters from the dropdown menu. Whether it was a fast one-click cluster, or you spent a long time carefully tailoring your cluster options, this script will be able to tell which input fields were the important ones. Once you’ve got your cluster, click “Execute” and let BigML do its thing. When complete, the output is importance-list, a map of input field id, field name and importance, ordered from most to least important.

Cluster Classification in more detail

This script is written in WhizzML, BigML’s new domain-specific language. If you have a complicated task to do, just a few lines of WhizzML can replace the repetitive clicking needed to massage data in the dashboard. The cluster classification script takes a cluster ID and uses WhizzML to:

  1. Create an extension of the cluster’s source dataset, adding a new field ‘cluster’
  2. Create an ensemble from those resources.
  3. Put it all together to report each field’s importance.

This only takes a few steps because the script exploits the features of two BigML resources: the batchcentroid and the ensemble. Many BigML resources automatically contain calculated information that would be too much to show in the dashboard. But with a little WhizzML we can reveal their secrets.

Here’s the function that creates the extension of the cluster’s source dataset:


It begins with define, which is how all WhizzML functions are defined. Here, a function label-by-cluster is defined to take as an input the cluster id cl-id. Next, a let expression assigns some variables to objects pulled out of that cluster. Here’s where BigML resources really shine. We ultimately want a dataset resource, and we could create the dataset we are after using Flatline to edit every row of the original dataset. But instead we will create a batchcentroid. Set the output_dataset parameter to true, and a batchcentroid resource automatically creates a dataset where each row is labeled by its cluster. But we want this dataset to have all the same fields as the original in addition to this new cluster field. So we set the parameter output_fields to be the same fields as the original, and we’ve got exactly what we want!
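As a conceptual illustration of what the labeled dataset looks like, here is a minimal Python sketch. It is not the WhizzML function and does not use the BigML API. It assumes purely numeric fields and plain Euclidean distance, which may differ from how BigML measures distance to a centroid, and the names label_by_cluster, rows and centroids are hypothetical. Each row keeps all of its original fields and gains a new cluster field naming its nearest centroid.

```python
import math
from typing import Dict, List, Union

Row = Dict[str, float]


def label_by_cluster(rows: List[Row], centroids: Dict[str, Row]) -> List[Dict[str, Union[float, str]]]:
    """Copy each row and add a 'cluster' field naming its nearest centroid."""
    labeled = []
    for row in rows:
        fields = sorted(row)  # fixed field order for the distance computation
        point = [row[f] for f in fields]

        def distance(name: str) -> float:
            # Euclidean distance is an assumption made for this sketch.
            return math.dist(point, [centroids[name][f] for f in fields])

        nearest = min(centroids, key=distance)
        labeled.append({**row, "cluster": nearest})
    return labeled


if __name__ == "__main__":
    rows = [{"x": 0.0, "y": 0.0}, {"x": 10.0, "y": 10.0}]
    centroids = {"Cluster 0": {"x": 1.0, "y": 1.0}, "Cluster 1": {"x": 9.0, "y": 9.0}}
    for labeled_row in label_by_cluster(rows, centroids):
        print(labeled_row)
```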

Now that we have this extended dataset, we can figure out how important each field is in determining cluster membership by building an ensemble of trees with cluster membership as the objective. BigML automatically calculates the importance of each field in a model; we just have to know where to look to get those numbers.

Here’s the function that creates a map of field ids and their importance:


Just as before, define creates the function make-importance-map, which takes an ensemble id and a list of input field ids as inputs. In the let statement, we go into the ensemble and pull out a list of all the models it contains, then go into each model to pull out its list of field importances. Now we just have to put everything together. Without getting too lost in details, the helper function list-to-map turns our lists into maps and merge-with combines all the maps by addition. One final map to divide by the number of models, and we have a map of field importance averaged over all the models of the ensemble.
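The averaging step can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not the WhizzML code: the function name average_importance and the input format (one field-to-importance map per model) are assumptions made for the example. Maps are combined by addition, the totals are divided by the number of models, and the result is ordered from most to least important.

```python
from collections import defaultdict
from typing import Dict, List


def average_importance(model_importances: List[Dict[str, float]]) -> Dict[str, float]:
    """Average per-model field importances; absent fields count as zero."""
    if not model_importances:
        raise ValueError("at least one model is required")
    totals: Dict[str, float] = defaultdict(float)
    for importances in model_importances:
        for field, value in importances.items():
            totals[field] += value  # combine the maps by addition
    n_models = len(model_importances)
    averaged = {field: total / n_models for field, total in totals.items()}
    # Order from most to least important.
    return dict(sorted(averaged.items(), key=lambda item: item[1], reverse=True))


if __name__ == "__main__":
    per_model = [{"age": 0.6, "bmi": 0.4}, {"age": 0.2, "bmi": 0.8}]
    print(average_importance(per_model))
```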

That’s it in a nutshell! If you have ever spent a lot of time carefully setting cluster parameters, only to find you aren’t really sure why your clusters were chosen as they were, this is the script for you. It will tell you exactly which fields are important not just to a single cluster, but across all of them. More understanding is just a few clicks away.

Feature Selection not only for Programmers

Nowadays, many knowledge workers have access to lots of data that can be analyzed to extract interesting insights. This has brought people with different profiles and skill sets to explore Machine Learning. However, this data deluge can also be likened to a deep forest hiding the golden tree. Generating a large number of features to solve a Machine Learning problem can be a very time and resource-intensive affair. One can easily get lost in the complexity. How does one tell which features to use and which ones to spare then? Fortunately, machines can also help extract relevant features to our problem while discarding the ones that don’t add value. This process is known as automated feature selection.

An example of feature selection: The Boruta package

Different methods can be used to automatically select the significant features for a classification or regression problem. In this post we’ll follow the idea implemented in the Boruta package, written in R. In short, this method takes advantage of random decision forests and uses the field importance derived from the random decision forest splits to decide which features  are really relevant.

A random decision forest is an ensemble of decision trees, each of which is built from a random sample of your data. The random decision forest predictions are then computed by majority vote or average of the individual model predictions. The sampling to build the trees is twofold: it picks a random subset of the available rows for each tree as well as considering a random subset of the available features for each split. This procedure makes random decision forests quite powerful compared to other Machine Learning methods as the ensemble is able to generalize well when facing entirely new input data.

Let’s dig into what the feature selection algorithm in the Boruta package does:

  • Adds to the original dataset a new feature per column (a shadow feature), except for the one that you want to predict. The new feature will contain values of the original feature chosen at random. Thus, these values will have no correlation with the target value to be predicted.
  • Creates a random decision forest using the extended dataset. As a result, we get the importance that each feature has when it is used to split the data.
  • Compares the importance of the original features to the maximum of the importance for the new shadow features, which acts as a threshold of what level of importance can be achieved just randomly.
  • Detects the features whose importance is above and below that threshold. Those below the threshold are considered unimportant and removed from the original dataset.
  • The new reduced dataset goes again through these steps until all features are tagged as important or unimportant (or a maximum number of runs is reached).
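The loop above can be sketched in plain Python. To keep the example self-contained, it swaps the random decision forest importance for a much simpler stand-in, the absolute Pearson correlation between a feature and the target, so this is not the Boruta package's method nor the BigML script. It also stops as soon as a run removes nothing, a simplification of the tagging rule described above. All names (abs_correlation, boruta_like_selection) are hypothetical.

```python
import random
from statistics import fmean
from typing import Dict, List

Columns = Dict[str, List[float]]


def abs_correlation(xs: List[float], ys: List[float]) -> float:
    """Stand-in importance: |Pearson correlation|, or 0.0 for a constant column."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def boruta_like_selection(features: Columns, target: List[float],
                          max_runs: int = 10, seed: int = 0) -> List[str]:
    """Repeatedly drop features that do not beat the best shadow feature."""
    rng = random.Random(seed)
    kept = dict(features)
    for _ in range(max_runs):
        if not kept:
            break
        # One shadow per feature: the same values in random order, so any
        # relationship with the target is broken.
        shadows = {name: rng.sample(values, len(values)) for name, values in kept.items()}
        threshold = max(abs_correlation(s, target) for s in shadows.values())
        survivors = {name: values for name, values in kept.items()
                     if abs_correlation(values, target) > threshold}
        if len(survivors) == len(kept):
            break  # nothing was removed in this run
        kept = survivors
    return list(kept)


if __name__ == "__main__":
    rng = random.Random(1)
    target = [rng.gauss(0, 1) for _ in range(200)]
    features = {
        "signal": list(target),                           # perfectly informative
        "noise": [rng.gauss(0, 1) for _ in range(200)],   # unrelated to the target
        "const": [1.0] * 200,                             # carries no information
    }
    print(boruta_like_selection(features, target))
```

Because each shadow is only a shuffled copy, the best shadow importance gives a rough estimate of what chance alone can achieve, which is the role the threshold plays in the steps above.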

So here random decision forests are used as a mechanism to reduce the dimensions you need to cope with in your problem in a quick and effective way. You may be thinking that methods like this are only available for programmers or that they require a lot of expertise to be properly used. Well, here’s where BigML comes to the rescue.

Boruta feature selection for non-programmers

For those who are not interested in the implementation details or don’t have the programming skills to dive into scripts and libraries, we’ve created a script in BigML, which mimics the procedure used in the Boruta package. The good news is that this script is available in the scripts gallery and you just need to clone it and add it to your BigML Dashboard menu as a “one-click action”. From then on, you’ll be able to create a new filtered dataset that automatically excludes all the unimportant features in a single click!

This will certainly:

  • reduce both the size and elapsed time of the Machine Learning tasks that you want to perform next, e.g., creating new model iterations.
  • clarify the relationships between the remaining features and your target variable (i.e., the objective field to be predicted).
  • probably also reduce the time and cost involved in data acquisition overall.

If you think this would be a handy addition to your Machine Learning arsenal, just:

And voila, Boruta feature selection has now become one of your 1-click menu options! Note: you don’t need to read the rest of this post or bother about the programming details if you are happy with what we have gone over thus far. If your curiosity is not yet fully quenched, keep reading.

Peeking behind the scenes: WhizzML

For those of you that are interested in the implementation details, this script has been built by using WhizzML. WhizzML is the domain specific language that BigML recently added to the platform. In WhizzML, all Machine Learning tasks and resources are first-class citizens. Here’s an example from the Boruta feature selection script. When you need to create the random decision forest, you just use a standard library function to create ensembles by providing the configuration parameters: the dataset ID you start from and the field that you want to predict (the objective field in BigML’s language).


The lines prefixed with ;; are comments and (define ...) is the directive, which defines a function or a variable in WhizzML. So here we define the random-decision-forest function. The body of this function is calling the create-and-wait-ensemble standard library function, which needs a map of arguments to create the random decision forest ensemble. The map is written as an alternating list of keys and their associated values. The first pairs in the map of arguments are the dataset ID and the objective field. The remaining attributes are used to define the sampling settings in the random decision forest. You can learn more about the available arguments in the API Documentation for Developers.

Another key step in the algorithm is transforming the original dataset by extending the number of features with a shadow field per feature. You might think that this would be the tricky bit, because you must randomly select values of the original field to fill the new one. Flatline, BigML’s language for dataset transformations, comes in handy here. Flatline offers weighted-random-value, which returns a value in the range of a given field while maintaining its distribution of values. Thus, by combining WhizzML and Flatline, creating a new extended dataset boils down to just two lines of code:


In the first line, the new-fields structure needed for the dataset transformation is created (you can learn more about that in the API Documentation for Developers). We set the new field names by prepending "shadow " to the original ones, and create their corresponding value using the flatline WhizzML function. Finally, the create-and-wait-dataset WhizzML function of the standard library is used to handle the new dataset creation.

The rest of the code is basically managing input and output formats, and handling iterations with a loop WhizzML structure. In a handful of lines of code, we’ve been able to implement the powerful Boruta feature selection algorithm in a parallelized, scalable and distributed fashion such that even non-programmers can adopt it. That’s what we call democratizing Machine Learning. Now it’s your turn to make the best of it for your next Machine Learning project!

Hands-on Summer School on Machine Learning in Valencia – 2nd Edition

The Machine Learning revolution shows no signs of slowing down, as evidenced by the proven success and continued momentum that leading companies like Google or Facebook are experiencing, as well as by the numerous tech startups putting it at the core of their value propositions. For us, it is especially encouraging to observe how much the pace has picked up recently compared to our beginnings in 2011, when the BigML Team decided to take on the worthy challenge of making Machine Learning beautifully simple for everyone!

In order to play our part in increasing the awareness and application areas of Machine Learning, BigML has been actively organizing summer schools. Last year BigML helped organize the first edition of our summer school on Machine Learning, and this year we intend to improve it further with this second edition, which will take place on September 8 and 9 in Valencia, Spain.


BigML will be holding the two-day hands-on summer school for business leaders, advanced undergraduates, as well as graduate students and industry practitioners, who are interested in boosting their productivity by applying Machine Learning techniques. All lectures will take place at Las Naves from 8:30 AM to 6:00 PM CEST during September 8 and 9. You will be guided through this Machine Learning journey starting with basic concepts and techniques that you need to know to become the master of your data with BigML. Check out the program here!


The summer school 2016 is FREE, but by invitation only. The deadline to apply is Friday, September 2, at 9 PM CEST. Applications will be processed in the order they are received, and invitations will be granted right after individual confirmations to allow for travel plans. Make sure that you register soon since space is limited!

P.S: Following the tradition, any attendee contributing to the classroom discussion by asking questions will get a BigML t-shirt!


Datatrics is Bridging the Gap between Machine Learning and Marketing with BigML

We first ran into the predictive marketing startup Datatrics from the Netherlands at the PAPI’s Connect event in Valencia earlier this year, where they competed in the first ever AI Startup Battle. The Dutch startup offers marketing teams an easy and actionable way to leverage Machine Learning with its innovative data management platform, which we believe sets a great example for other startups in showing how BigML can add to their competitive edge and supercharge their growth. So we interviewed Bas Nieland, CEO and co-founder of Datatrics to find out more.

Bridging the ML Gap

BigML: Congrats on your high score at the first ever AI Startup Battle. Can you tell us what was the motivation behind starting Datatrics?

Bas Nieland: Nowadays digital marketers are awash with data due to the fragmentation of consumer attention across many more channels. Naturally, they are all looking for better ways to leverage all the data their companies collect, yet there is a big gap between what data can offer marketing teams and what marketers actually use. The main culprit is the perceived need for a team of data scientists and collaborating developers to make sense of all that data. Since the average small and medium sized marketing teams do not have access to such resources, new tools are needed to translate data into meaningful actions to optimize the digital customer journey.

An example of a 360 degree customer profile in Datatrics


BigML: What is the lowdown on Datatrics? How does it help bridge that gap?

Bas Nieland: Datatrics was founded in 2014 and it currently has 10 employees in the Netherlands. We define ourselves as a data management platform (DMP) that helps marketing teams gain actionable insights. It is an easy and accessible platform that gives concrete insights and actions every marketer can understand. It allows marketing teams to build 360-degree customer profiles, based on internal data sources such as their CRM tools, social media accounts, websites and external data sources such as the weather, social trends and traffic information. By following the recommended Next Best Actions by Datatrics, marketing teams know exactly who to contact, at what time, with what content, and through which channel.

BigML: Can you tell a bit about how Machine Learning comes into play?

Bas Nieland: All of this is driven by smart algorithms applied to those data sources, which are powered by BigML’s Machine Learning platform, among other components that make up our platform. We especially love how BigML helps us to deploy many predictive models in a fast and scalable way by abstracting away the infrastructure level concerns needed to crunch the data. This way our product team can concentrate on the actual analytics tasks and development of the platform for our clients. BigML is also very user-friendly and has a well-documented API, which is very important if you want to go beyond simply gaining insights by deploying scalable predictive applications to your end users.

An example of a Next Best Action in Datatrics


BigML: What are some of the predictive use cases you have and which other ones are you looking to add?

Bas Nieland: I already mentioned the Next Best Action models, which are a big benefit to our audience.  We also are in the process of testing BigML’s ‘Associations’ functionality to see how it can benefit us. We believe it can make our product recommendations even more relevant.

BigML: Can you share specifics on customer traction and measurable business outcomes Datatrics has been delivering?

Bas Nieland: We are seeing great uptake especially in retail and travel industries. Over the past year, we have noted a clear demand in the travel industry for DMPs such as Datatrics. As it is a highly competitive market, it is important for companies such as travel agencies and hotel chains to use customer insights from their data in order to communicate in a more personal and relevant way. Some of our customers have increased their revenue by as much as 30%!

BigML: That sounds great. What would you recommend to other startups and self-starting developers that want to implement similar smart applications? Any key lessons learnt that you would like to share?

Bas Nieland: They should think hard before going the route of building their Machine Learning infrastructure from scratch. Provided that you have pertinent data, platforms like BigML can help you in building real world applications very fast while letting you get there at a fraction of the cost of hiring a new analyst. Of course our platform consists of many more components and there is not one solution that fits all, but a good Machine Learning platform such as BigML provides can get you a long way.

BigML: Thanks Bas. It is very impressive to see how you have been able to ramp up your Machine Learning efforts in such a limited time period despite constrained resources. We hope stories like yours inspire many more startups to realize that they too can turn their data and know-how into sustainable competitive advantages.

How to Put Machine Learning in your Machine Learning

There are so many Machine Learning algorithms and so many parameters for each one.  Why can’t we just use a meta-algorithm (maybe even one that uses Machine Learning) to select the best algorithm and parameters for our dataset?

— Every first year grad student who has taken a Machine Learning class

It seems obvious, right?  Many Machine Learning problems are formalized as an optimization wherein you’re given some data, there are some free parameters, and you have some sort of function to measure the performance of those parameters on that data.  Your goal is to choose the parameters to minimize (or maximize) the given function.

ML in ML

But this sounds exactly like what we do when we select a Machine Learning algorithm!  We try different algorithms and parameters for those algorithms on our data, evaluate their performance and finally select the best ones according to our evaluation.  So why can’t we use the former to do the latter?  Instead of stabbing around blindly by hand, why can’t we use our own algorithms to do this for us?

In just the last five years or so, there’s been a lot of work in the academic community around this very topic (usually it’s called hyperparameter optimization, and the particular type which is getting the attention lately is the Bayesian variety) which in turn has led to a number of open source libraries like hyperopt, spearmint, and Auto-WEKA.  They all have loosely the same flavor:

  1. Try a bunch of random parameter configurations to learn models on the data
  2. Evaluate those models
  3. Create a Machine Learning dataset from these evaluations where the features are the parameter values and the objective is the result of the evaluation
  4. Model this dataset
  5. Use the model to select the “most promising” set of next parameter sets to evaluate
  6. Learn models with those parameter sets
  7. Repeat steps 2-6, adding new evaluations to the dataset described in step 3 at each iteration

Most of the subtlety here is in steps four and five.  What is the best way to model this dataset and how do we use the model to select the next sets of parameters to evaluate?
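The seven steps above can be sketched as a short Python loop. This is only an illustration under loud assumptions: the objective is a made-up one-parameter function where lower is better, evaluating a configuration stands in for "learn a model and evaluate it", and the surrogate is a deliberately naive nearest-neighbor lookup rather than anything hyperopt, spearmint, or Auto-WEKA actually use. All names here (optimize, nearest_neighbor_surrogate, toy_objective) are hypothetical.

```python
import random


def nearest_neighbor_surrogate(history):
    """Stand-in model: predict the score of the closest point evaluated so far."""
    def predict(x):
        _, score = min(history, key=lambda pair: abs(pair[0] - x))
        return score
    return predict


def optimize(evaluate, low, high, n_init=5, n_iter=20, n_candidates=200, seed=0):
    rng = random.Random(seed)
    # Steps 1-2: try a few random configurations and evaluate them.
    history = []
    for _ in range(n_init):
        x = rng.uniform(low, high)
        history.append((x, evaluate(x)))
    for _ in range(n_iter):
        # Steps 3-4: treat (parameter, score) pairs as a dataset and model it.
        predict = nearest_neighbor_surrogate(history)
        # Step 5: score fresh candidates with the model and keep the most
        # promising one (lowest predicted score, since lower is better here).
        candidates = [rng.uniform(low, high) for _ in range(n_candidates)]
        chosen = min(candidates, key=predict)
        # Steps 6-7: evaluate the chosen configuration for real and add the
        # result to the dataset before the next iteration.
        history.append((chosen, evaluate(chosen)))
    best = min(history, key=lambda pair: pair[1])
    return best, history


def toy_objective(x):
    # Hypothetical objective with its minimum at x = 2.
    return (x - 2.0) ** 2


(best_x, best_score), history = optimize(toy_objective, -5.0, 5.0)
```

The loop structure is the part that carries over; the naive surrogate and the "pick the lowest prediction" rule are exactly the two pieces (steps four and five) where real methods differ.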

My favorite specialization of the above is SMAC.  The original version of SMAC is a bit fancier than is necessary for our purposes, so I’ll dumb it down a little here in the name of simplicity (let’s call the simpler algorithm SMACdown):

  • In step four, we’re going to grow a random regression forest as our model for the parameter space.  Say we grow 32 trees: This means that for each parameter set we evaluate using our model, we’ll get 32 separate estimates of the performance of our algorithm.  Importantly, the mean and variance of these 32 estimates can be used to define a Gaussian distribution of probable performances given that parameter set.

  • In step five, we generate a whole bunch of parameter sets (say, thousands) and pass them through the model from step four to generate a Gaussian for each one.  We then measure, for each Gaussian, how much of its lower tail falls below our current best evaluation.  The ones with the most area below the current best are our most promising candidates.
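As a rough Python sketch of the two bullets above: given the per-tree estimates for each candidate, fit a Gaussian from their mean and variance and measure the area below the current best. The assumptions are that lower scores are better, that the population variance is an acceptable choice, and that the per-tree estimates are already available (the numbers below are invented). The function names are hypothetical and not taken from SMAC or from the WhizzML script.

```python
import math
from statistics import mean, pvariance


def improvement_probability(tree_estimates, current_best):
    """Area of the fitted Gaussian that lies below `current_best`."""
    mu = mean(tree_estimates)
    sigma = math.sqrt(pvariance(tree_estimates, mu))
    if sigma == 0.0:
        # All trees agree: the candidate either beats the best or it does not.
        return 1.0 if mu < current_best else 0.0
    z = (current_best - mu) / sigma
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rank_candidates(estimates_by_candidate, current_best):
    """Order candidate names from most to least promising."""
    scored = [
        (improvement_probability(estimates, current_best), name)
        for name, estimates in estimates_by_candidate.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]


# Invented per-tree estimates (four trees instead of 32, for brevity).
estimates_by_candidate = {
    "a": [0.30, 0.32, 0.28, 0.31],
    "b": [0.20, 0.22, 0.18, 0.21],
    "c": [0.10, 0.40, 0.25, 0.30],
}
ranking = rank_candidates(estimates_by_candidate, current_best=0.25)
```

Note how a candidate with a slightly worse mean but a wide spread (like "c") can still outrank a confidently mediocre one (like "a"), which is what lets the search keep exploring.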

With most of the details settled, all that’s left is to choose a language in which to implement the algorithm.

How about WhizzML?

Why would we choose WhizzML?  For starters, it allows us to kiss our worries about scalability goodbye.  We can prototype our script on some small datasets, then run exactly the same script on datasets that are gigabytes in size.  No extra libraries or hardware; it will just work out of the box.

Second, because the script itself is a BigML resource, it can be run from any language from which you can POST an HTTP request to BigML, and you can consume the results of that call as a JSON structure.  With WhizzML, there’s no longer the necessity of working in a particular language; you can implement once in WhizzML and run from anywhere.

We aren’t going to go through all of the code in detail, but we’ll hit on some of the major points here.

Our goal here is going to be to optimize the parameters for an ensemble of trees.  We’ll start by creating a function that generates a random set of parameters for an ensemble.  That looks like this:

random params

We use WhizzML’s lambda to define a function with no arguments that will generate a random set of parameters for our ensemble.  Note that we need to know if this is going to be a classification or a regression in advance, as setting balance_objective to true for regression problems is invalid.  This function returns a function that can be invoked over and over again to generate different sets of parameters each time.
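The same "function that returns a function" pattern looks like this in Python. Treat it purely as a sketch: only balance_objective is named in the text, so the other parameter names and all of the value ranges below are illustrative assumptions, and make_params_generator is a hypothetical name.

```python
import random


def make_params_generator(is_classification, seed=None):
    """Return a zero-argument function that yields a fresh random parameter set."""
    rng = random.Random(seed)

    def generate():
        # Parameter names and ranges here are assumptions for illustration.
        params = {
            "number_of_models": rng.randint(2, 64),
            "randomize": rng.choice([True, False]),
            "node_threshold": rng.randint(32, 2048),
        }
        if is_classification:
            # Only valid for classification problems, so it is left out
            # entirely when the objective is numeric.
            params["balance_objective"] = rng.choice([True, False])
        return params

    return generate


classification_params = make_params_generator(True, seed=1)
regression_params = make_params_generator(False, seed=1)
samples = [regression_params() for _ in range(20)]
```

Deciding classification versus regression once, when the generator is built, means every later call can stay argument-free.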

The process of evaluating these generated parameter sets is fairly simple; for each parameter set you want to evaluate, you create an ensemble, perform an evaluation on your holdout set (you did hold out some data, didn’t you?), then pull out or create the metric on which you want to evaluate your candidates.

Once you have these evaluations in hand, you need to model them (step four).  That’s done here:

make ensemble

Here, we make the random forest described above.  The helper smacdown--data->dataset creates a dataset from our list of parameter evaluations.  We then create a series of random seeds and create a model for each one, returning the list of IDs.

The next thing is to create a bunch of new parameter sets and use our constructed model to evaluate them:

make predictions

The data argument here is our new list of parameter sets (created elsewhere by multiple invocations of the model-params-generator defined above), and mod-ids is the list of model IDs created by the smacdown--create-ensemble function.  The logic here is again fairly simple:  We create a batch prediction for each model, then create a sample from each batch predicted dataset so we can pull all of the rows for each prediction into memory.  We’re left with a row of predictions for each datapoint in data.

Another function is applied to these lists to pull out the mean and variance from each one, then to compute, given the current best evaluation, which of these has the greatest chance to improve on our current best solution (that is, which has the highest percentage of the area under its Gaussian below the current best solution).

There are a number of details here we’re glossing over, but thankfully you don’t have to know them all to run the script.  In fact, you can clone it right out of BigML’s script gallery.

What’s the takeaway from all of this?  Mainly, we want you to see that WhizzML is expressive enough to let you compose even complex meta-algorithms on top of BigML’s API.  When you choose to use it, WhizzML offers you scalability and language-agnosticity for your Machine Learning workflows, so that you can run them on any data, any time.

No excuses left now!  Go give it a shot and let us know what you think at or in the comments below.

WhizzML: Level Up with Gradient Boosting

Let’s get serious.

Sure, you can use WhizzML to fill in missing values or to do some basic data cleaning, but what if you want to go crazy?  WhizzML is a fully-fledged programming language, after all.  We can go as far down the rabbit hole as we want.

As we’ve mentioned before, one of the great things about writing programs in WhizzML is access to highly-scalable, library-free machine learning.  To put it another way, cloud-based machine learning operations (learn an ensemble, create a dataset, etc.) are primitives built into the language.

Put these two facts together, and you have a language that does more than just automate machine learning workflows.  We have the tools here to actually compose new machine learning algorithms that run on BigML’s infrastructure without any need for you, the intrepid WhizzML programmer, to worry about hardware requirements, memory management, or even the details of the API calls.

What sort of algorithms are we talking about, here?  Truth be told, many of your favorite machine learning algorithms could be implemented in WhizzML.  One important reason for this is that many machine learning algorithms feature machine learning operations as primitives.  That is, the algorithm itself is composed of steps like model, predict, evaluate, etc.

As a demonstration, we’ll take a look at gradient tree boosting.  This is an algorithm that has gotten a lot of praise and press lately due to its performance in general, and the popularity of the xgboost library in particular.  Let’s see if we can cook up a basic version of this algorithm in WhizzML.

The steps to gradient boosting (for classification) are as follows:

  1. Compute the gradient of the objective with respect to the currently predicted class probabilities (which start out as, e.g., uniform over all classes) for each training point (optionally, on only a sample of the data)
  2. Learn a tree for each class as a functional approximation of this gradient step
  3. Use the tree to predict the approximate gradient at all training points
  4. Sum the gradient predictions with the running gradient sums for each point (these all start out as zero, of course).
  5. Use something like the softmax transformation to generate class probabilities from these scores
  6. Iterate steps 1 through 5 until a stopping condition is met (such as a small gradient magnitude).
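To see the arithmetic in steps 1, 4, and 5, here is a small Python sketch that follows a single training point through a few rounds. It assumes a multinomial log loss, for which the descent direction with respect to the scores is the one-hot target minus the predicted probability. The trees of steps 2 and 3 are replaced by a stand-in that returns the exact gradient, and the learning rate is an extra assumption that is not part of the list above. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Turn a list of scores into class probabilities (step 5)."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_step(probabilities, true_class):
    """Descent direction for log loss: one-hot target minus probability (step 1)."""
    return [
        (1.0 if k == true_class else 0.0) - p
        for k, p in enumerate(probabilities)
    ]


def boost_one_point(true_class, n_classes, n_rounds=10, learning_rate=0.5):
    scores = [0.0] * n_classes          # running sums start at zero
    probabilities = softmax(scores)     # which makes the start uniform
    trace = [probabilities[true_class]]
    for _ in range(n_rounds):
        step = gradient_step(probabilities, true_class)
        # Steps 2-3 stand-in: a real implementation would learn one tree per
        # class and use its prediction; here we pretend the trees are perfect.
        predicted_step = step
        # Step 4: add the predicted step to the running sums.
        scores = [s + learning_rate * g for s, g in zip(scores, predicted_step)]
        probabilities = softmax(scores)
        trace.append(probabilities[true_class])
    return probabilities, trace


probabilities, trace = boost_one_point(true_class=2, n_classes=3)
```

The trace starts at one third (uniform over three classes) and climbs with every round, which is the behavior the real trees are trying to approximate across the whole dataset.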

You can see here that machine learning primitives feature prominently in the algorithm.  Step two involves learning one or more trees.  Step three uses those trees to make predictions.  Obviously, those steps are easily accomplished with the WhizzML builtins create-model and create-batchprediction, respectively.

But there are a few other steps where the WhizzML implementation isn’t as clear.  The gradient computation, summing of the predictions, and application of the softmax transformation don’t have (very) obvious WhizzML implementations, because they are operations that iterate over the whole dataset.  In general, the way we work with the data in WhizzML is via calls to BigML rather than explicit iteration.

So are there calls to the BigML API that we can make that will do the computations above?  There are, if we use Flatline.  Flatline is BigML’s DSL for dataset transformation, and fortunately all of the above steps that aren’t learning or prediction can be encoded as Flatline transformations.  Since Flatline is a first class citizen of WhizzML, we can easily specify those transformations in our WhizzML implementation.

Take step four, for example.  Suppose we have our current sum of gradient steps for each training point stored in a column of the dataset, and our predictions for the current gradient step in another.  If those columns are named current_sum and current_prediction, respectively, then the Flatline expression for the sum of those two columns is:

    (+ (f "current_sum") (f "current_prediction"))


Where the f Flatline operator gets the value for a field given the name.  Knowing that we have a running sum and a set of predictions for each class, we need to construct a set of Flatline expressions to perform these sums.  We can use WhizzML (and especially the flatline builtin) to construct these programmatically:

sum columns

Here, we get the names for all of the running sum, current prediction, and new sum columns into the last-sums, this-preds, and this-sums variables, respectively.  We then construct the flatline expression that creates the sum, and call make-fields (a helper defined elsewhere) to create the list of flatline expressions mapped to the new field names.  The helper add-fields then creates a new dataset containing the created fields.
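The string-building part of this idea can be illustrated in plain Python. This sketch assumes Flatline’s prefix syntax, in which adding two fields a and b would be written (+ (f "a") (f "b")). The column naming scheme, the function names, and the name/field dictionary layout are all assumptions made for illustration, not the helpers from the actual script.

```python
def sum_expression(sum_field, prediction_field):
    """Build a Flatline-style expression that adds two named fields."""
    return '(+ (f "{}") (f "{}"))'.format(sum_field, prediction_field)


def make_sum_fields(class_names, iteration):
    """One new-field description per class: new sum = last sum + this prediction."""
    new_fields = []
    for class_name in class_names:
        # Hypothetical naming scheme for the per-class, per-iteration columns.
        last_sum = "{} sum {}".format(class_name, iteration - 1)
        this_pred = "{} prediction {}".format(class_name, iteration)
        this_sum = "{} sum {}".format(class_name, iteration)
        new_fields.append({
            "name": this_sum,
            "field": sum_expression(last_sum, this_pred),
        })
    return new_fields


fields = make_sum_fields(["setosa", "virginica"], iteration=3)
```

Because the expressions are generated from the class list, the same code works for two classes or two hundred without any hand-written Flatline.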

We can do roughly the same thing to compute the gradient and apply the softmax transformation; we use WhizzML to compose Flatline expressions, then allow BigML to do the dataset operation on its servers.

This is just a peek into what a gradient boosting implementation might look like in WhizzML.  For a full implementation of this and some other common machine learning workflows, check out the WhizzML tutorials.  We’ve even got a Sublime Text Package to get you started writing WhizzML as quickly as possible.  What are you waiting for?
