Start 2014 with Faster, Easier, and more Programmatic Machine Learning
Happy New Year everybody! It has been a while since we last blogged, but for good reason: we’ve been busier than ever developing our Winter release and serving our fast-growing user base at the same time. By now, BigML has been used to create more than 600,000 active predictive models—half of them were created during the last few weeks. We started 2013 with less than 10,000 models and we’re on track to eclipse the millionth model mark early in 2014.
BigML’s 2014 Winter release constitutes a significant step forward in three directions:
- Speed: you can now build a model in 1/8 of the time it previously took and also get fast, real-time predictions with BigML’s PredictServer.
- Programmability: we have empowered our API with programmatic means to filter, sample, or extend with new fields any dataset or list of datasets. Dozens of pre-built functions are already available and defining new functions is very easy.
- Data-driven decisions for everyone: BigML’s new development mode allows you to run unlimited tasks of up to 16 MB for FREE, making BigML the ideal framework to practice, teach, and learn machine learning or predictive analytics. There’s no excuse to not start making data-driven decisions today!
Let’s now quickly review the most salient new features. In the coming days we’ll have added blog posts to explain a few of them in further detail.
Much faster models
A third generation of our algorithm has significantly improved our performance when it comes to model building. To give you a quick idea, modeling a dataset with 50 fields and a little more than 500,000 rows (~80 MB of data) took around 8 minutes in our previous version. Now, it takes less than a minute.
For use cases like fraud or intrusion detection, ad click prediction, or best customer retention, the class of interest is the minority class (i.e., the number of instances of the class of interest is under-represented compared to the number of instances of the other classes). These situations are called imbalanced and they constitute a serious challenge, since most traditional supervised machine learning algorithms overlook the minority classes in favor of the majority ones. Techniques like under-sampling or over-sampling the input data can help, but they usually provide sub-optimal performance.
BigML’s new algorithm comes with three ways to elegantly cope with this problem and create weighted models. Using them you’ll be able to build models that will consider at building time every instance or class according to the weight criteria that you establish.
We have developed a Lisp-like language codenamed Flatline (after the legendary cyber-cowboy McCoy Pauley) and open sourced the specification here. Flatline can be used for both filtering the rows and columns of a dataset and for generating new fields. For example, if the temperature in your dataset is expressed in Fahrenheit degrees you can easily transform it to Celsius using a single Flatline expression.
Flatline allows you to horizontally select different fields in the same row or select a finite sliding window of rows to traverse a dataset vertically. This is useful to generate values based on a number of front and rear values. In other words, you can generate a new field based on computing a function over the previous values of another field. Imagine adding a new field with the 7-day average of the daily maximum temperatures.
Flatline comes with dozens of pre-built functions and it makes very easy to perform standard analytics tasks on your dataset, such as discretezing continuous variables, removing outliers, replacing missing values, normalizing variables, etc. Flatline expressions have a json-like equivalent for those who prefer it. You can read more about BigML’s dataset transformations here.
Being able to programmatically transform a dataset via a high-level language and a cloud-based API together, plus the rest of features that we will mention below, opens a new range of possibilities to program machine learning tasks on the cloud as it was not available before. We’ll give you some examples in the next days. For example, detecting covariate shift between your training data and production data in only a few API calls. We start calling this new paradigm “Programmatic Machine Learning“.
Multi datasets and multi-dataset models
Another cool feature in our Winter release is the ability to create a dataset using multiple datasets as input. This is very useful when you need to combine multiple sources of data into a single dataset or when you want to build an online solution that collects data in batches. You can even sample each dataset individually.
Also, you can use multiple datasets as an input to build models, ensembles and evaluations (i.e., you do not need to first merge them into a single dataset). You can read more about multi-datasets here.
New Prediction Strategies
We have developed a second strategy to deal with missing values in your input data. So far, when the input data that you use to generate a new prediction contained a missing value, BigML returned the prediction that had been computed up to the node (split) that needed that input. We call this strategy last prediction.
Now you can choose an alternative strategy named proportional that will evaluate all the subtrees of a missing split and will recombine their predictions based on the proportion of data in each subtree.
We have also developed a new threshold-based combiner for classification ensembles that comes in handy to implement either conservative or aggressive prediction strategies.
This combiner lets you trigger predictions based on a given threshold k. Imagine that you have created a 20-model ensemble to detect intruders in a computer network and you use a k-threshold of 1 for predictions. Then, as long as a single model in the ensemble predicts true, the ensemble will predict that an intrusion is going on. Or imagine that you have another 30-model ensemble to predict the success of marketing campaigns and this time you want to reduce the number of false positives. Then, you can set up k to a high value like 27 or 28 to make sure that you do not spend your dollars with customers who won’t react to your campaign.
New Development Mode
We noticed that many BigML users weren’t aware of our FREE development mode. As a result, many users would experiment with the promotional datasets that we offer in production mode and soon run out of credits before finishing their initial projects. To address this, we have now made the development switch more obvious in our web interface and have also increased the max task size to 16 MB.
BigML’s development mode has 3 limitations compared to production mode: 1) the maximum number of models of an ensemble cannot be higher than 10; 2) the maximum number of terms in text analysis is limited to 32; and, 3) the maximum number of nodes in a tree cannot be higher than 512. All other features are exactly the same and you can run unlimited tasks.
A few more things
That’s not all. There are some other great features in store like a new search box in your dashboard, multi predictions in Excel-exported models, sharing evaluations via private links, and much, much more. We’ll soon list all of these features in our What’s New section. Last but not least, BigML’s PredictServer is now available from Amazon Marketplace.
So let 2014 be the year that you start adding predictive analytics to your business, and that you start building Predictive Apps with BigML. To help you out, we are offering an extra 15% discount to the first 50 annual BigML subscriptions contracted this year. If you are interested in one, send us an email to email@example.com and we’ll be happy to send a coupon your way.