BigML is very excited to be collaborating with Tableau Software as a newly minted Tableau Technology Partner. This is a natural fit as both companies share similar philosophies on making analytics accessible (and enjoyable) to more people through incorporation of intuitive and easy-to-use tools. Not surprisingly, there’s a lot of overlap between our users: our latest survey showed that over 50% of BigML’s users also use Tableau. And of course, we’re eager to introduce Tableau’s customers to the predictive power of BigML.
So what does this mean for BigML and Tableau users? First and foremost, it means that BigML will be working on ways to enable easier usage across both tools. For starters, we have just launched a new feature that lets you export a BigML model directly to Tableau as a calculated field, as is demonstrated in the video below:
This approach unleashes the power of Tableau for prediction, letting users interact with a BigML model just like any other Tableau field. In the video, for example, we color a bar chart by predicted profit, which yields valuable insights into Tableau’s “superstore” retail dataset.
Moving forward, we’ll be looking at other ways that we can combine the powerful analytic and visual capabilities of Tableau and BigML to enable joint customers to do amazing things.
We’d love to hear from any BigML users who are also Tableau customers so we can get added direction on this collaboration and provide you with early access to future implementations. If you’re willing and interested, please email us at email@example.com!
Just over a year ago BigML hit a milestone when the 10,000th predictive model was created using our platform. At the time, we were very happy to see that our idea of building a service that helped people easily build predictive applications had started to get some traction. One year and several thousand users later we are super excited that BigML has now supported the creation of over 1,000,000 predictive models—400,000 of which were created in just the last two months! We are extremely happy to see a growing community of fellow data practitioners and developers using BigML across a number of domains and use cases: predictive advertising, predictive lead scoring, preventive customer churn, security, fraud detection, etc.
And we’re just getting started! We have a roadmap of exciting features that will be coming out just around the corner… Stay tuned!
When trying to make data-driven decisions, we’re often faced with datasets that contain many more features than what we actually need for decision-making. As a consequence, building models becomes more computationally demanding, and even worse, model performance can suffer, as heavily parameterized models can lead to overfitting. How, then, can we pare down our data to the most relevant set of features? In this post, we will discuss a solution proposed by Stanford scientists Ron Kohavi (now partner-level architect at Microsoft) and George H. John (now CEO at Rocket Fuel) in this article, which is in fact one of the top 300 most-cited articles according to CiteSeerX.
The Basic Idea
Let’s consider a simple case where the data are composed of three features. We can denote each possible feature subset as a three-digit binary string, where the value of each digit denotes whether the corresponding feature is included in the subset. For example, the string ‘101’ represents the subset which includes the first and third features from the original set of three. Our goal, therefore, is to find which subset gives the best model performance. For any set of n features, there exist 2^n different subsets, so clearly an exhaustive search is out of the question for any data containing more than a handful of features.
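To make the encoding concrete, here is a minimal Python sketch of the binary-string representation described above. The function names and the feature names in the usage note are made up for illustration; they are not part of BigML’s bindings.

```python
from itertools import product


def subsets_as_bitstrings(n_features):
    """List every feature subset of n_features features as a binary string."""
    return ["".join(bits) for bits in product("01", repeat=n_features)]


def decode(bitstring, features):
    """Return the features whose digit in the bitstring is '1'."""
    return [name for bit, name in zip(bitstring, features) if bit == "1"]
```

For instance, `decode("101", ["age", "income", "zip"])` keeps the first and third (hypothetical) features, and `len(subsets_as_bitstrings(3))` is 2^3 = 8. The list doubles with every extra feature, which is why enumerating it stops being practical so quickly.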
The Advanced Idea
Now, let’s say each of these feature subsets is a node in a state space, and we form connections between those nodes that differ by only a single digit, i.e. the addition or deletion of a single feature. Our challenge therefore is to explore this state space in a manner which will quickly arrive at the most desirable feature subset. To accomplish this, we will employ the Best-First Search algorithm. To direct the search algorithm, we need a method of scoring each node in the state space. We will define the score as the mean accuracy after n-fold cross validation of models trained with the feature subset, minus a penalty of 0.1% per feature. This penalty is introduced to mitigate overfitting. As an added bonus, it also favors models which are quicker to build and evaluate. The above figure illustrates how a single node is evaluated using 3-fold cross validation. Lastly, we need to decide where in the state space to begin the search. Most of the branching will occur in the nodes encountered earlier in the search, so we will begin at the subset containing zero features, so that less time is spent building and evaluating models to explore these early nodes.
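The search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation linked later in this post: `accuracy` is a hypothetical callable that you supply (it stands in for the mean cross-validated accuracy of a model trained on the given subset, expressed between 0 and 1), and the `patience` stopping rule (give up after a number of expansions without improvement) is our own assumption, since the post does not say when the search stops.

```python
import heapq


def neighbors(state):
    """States that differ from `state` (a tuple of 0/1 flags) in exactly one feature."""
    return [state[:i] + (1 - state[i],) + state[i + 1:] for i in range(len(state))]


def best_first_search(n_features, accuracy, penalty=0.001, patience=5):
    """Explore feature subsets best-first, starting from the empty subset.

    `accuracy(state)` is assumed to return a value in [0, 1];
    `penalty` is the 0.1% per-feature penalty; `patience` is an assumed stopping rule.
    Returns (best subset, its score, number of subsets evaluated).
    """
    def score(state):
        return accuracy(state) - penalty * sum(state)

    start = (0,) * n_features
    best_state, best_score = start, score(start)
    open_heap = [(-best_score, start)]  # max-heap emulated with negated scores
    seen = {start}
    stale = 0
    while open_heap and stale < patience:
        neg_score, state = heapq.heappop(open_heap)
        if -neg_score > best_score:
            best_state, best_score = state, -neg_score
            stale = 0
        else:
            stale += 1
        # Score every unseen neighbor and queue it for later expansion.
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(open_heap, (-score(nxt), nxt))
    return best_state, best_score, len(seen)
```

In a real run, `accuracy` would build and evaluate models for each fold, so the number of evaluated subsets (the third return value) is what dominates the cost.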
Testing the Idea
We’ll try this approach using the credit-screening dataset from the UC Irvine Machine Learning Repository, which contains 15 categorical and numerical features. Building a model using all the features, we get an average accuracy of 81.74% from five-fold cross validation. Not bad, but maybe we can do better… BigML’s Python bindings make it simple and straightforward to implement the search algorithm described above. You can find such an implementation, which uses scikit-learn to create the k-fold cross validation data sources, here. After running the feature subset search, we arrive at a feature subset containing just 3 features, which achieves an average cross-validation accuracy of 86.37%. Moreover, we arrived at this result after evaluating only 117 different feature subsets, a far cry from the 32768 which would be required for an exhaustive search.
Navigating the web nowadays, we constantly paddle through clouds of tags and other classification or taxonomy labels. We’ve grown so used to it that now we hardly realize their presence, but they are everywhere. They flow around the main stream of contents, pervade posts (such as this one), articles, personal pages, catalogs and all major content repositories. To store this kind of content, one naturally drifts to non-normalized fields or documental databases, where they end up stacked in a multi-occurrence field. From a Machine Learning point of view, they become a multi-labeled feature.
Machine Learning has well-known methods to cope with this kind of multi-labeled feature. On this blog we devoted a previous post to multi-labeled category prediction using BigMLer, the command line utility for BigML. Now we present the next step: using multi-labeled fields’ labels as predictors.
Multi-labels in professional profiles: Recruiting toy example
In the spirit of the aforementioned post, let’s have a look at the typical profile page in your favourite recruiting web site. At first sight, we detect some contents that might as well be stored in multi-labeled fields: the companies we’ve worked for, the associations we belong to, the languages we speak, the positions we’ve assumed and our skills. Should you face the task of predicting which profile would be more suitable for a certain position, you would probably want to use this information. For example, in technical positions people who understand English will probably perform better in their job. But maybe this is not so true for the chemical or pharmaceutical sector, where German has traditionally been a dominant communication language. The number of spoken languages could also be a determining factor. Who knows what other relations are hidden in your multi-labeled features!
OK, so there seems to be a bunch of valuable information stored in multi-labeled fields, but it looks like it must be reshaped to be useful in your machine learning system. Each label per se is a new feature you can use as input for your prediction rules. Even aggregation functions over the labels, like count, first or last, can be useful as new input. Now the good news: BigMLer can do this for you!
Multiple predictor multi-label fields.
Let’s build an example based on our recruiting site scenario. Suppose we make up a sample with some of the features available in users’ profile pages, such as the name, age, gender, marital status, number of certifications, number of recommendations, number of courses, titles, languages and skills. This is an excerpt of the training data:
The last three fields above are filled with colon-separated multiple values. For each person, they contain:
- Titles: the titles of the posts she has occupied
- Languages: the languages she speaks
- Skills: the skills she has
As our goal is to produce a quick and simple proof of concept, the values for the Titles field are restricted to just four categories: Student, Engineer, Manager and CEO. The skills have also been chosen from a list of popular skills. Looking at the data we have, the first question that arises is: could we build a model that predicts if a new profile would fit a certain position? Well, probably some skills are a must in CEOs’ profiles, while others are still missing when you are a Student. Our data implicitly has all of this information, but we would need some work to rearrange it in order to build a predictive model from it. For example, we would need to build a new field for each of the available skills and populate it with a True value if the profile has that skill and False otherwise. What if you could let BigMLer take care of these uninteresting details for you? Guess what: you can!
See the next BigMLer command?
The --multi-label-fields option tells BigMLer that the contents of the fields Titles, Languages and Skills are a series of labels. Using --label-separator you set colon as the labels’ delimiter and… ta da! BigMLer generates an extended file where each of the fields declared as multi-labeled is transformed into a series of new binary fields, one for each different label. The extended file is used to build the corresponding source, dataset and model objects that you need to make a prediction. The value to predict is the Titles field, which has been targeted as our objective using the --objective option. So, in just one line and with no coding at all, you have been able to add each and every label of your Titles, Languages and Skills fields as an independent new feature available to be used as predictor or objective in your models.
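If you are curious about what this expansion amounts to, the following Python sketch mimics it on a list of dictionaries. It is an illustration of the idea only, not BigMLer’s code: the function name, the sample rows in the usage note and the "Field - label" naming of the generated columns are assumptions made for this example.

```python
def expand_multi_label(rows, fields, separator=":"):
    """Return copies of `rows` with one True/False column per label of each field.

    `rows` is a list of dicts; `fields` names the multi-label columns.
    Generated columns are named "<field> - <label>" (an assumed convention).
    """
    # Collect every distinct label seen in each multi-label field.
    labels = {
        field: sorted({label for row in rows
                       for label in row[field].split(separator) if label})
        for field in fields
    }
    expanded = []
    for row in rows:
        new_row = dict(row)
        for field in fields:
            present = set(row[field].split(separator))
            for label in labels[field]:
                new_row[f"{field} - {label}"] = label in present
        expanded.append(new_row)
    return expanded
```

Calling `expand_multi_label([{"Name": "Ann", "Skills": "Python:SQL"}, {"Name": "Bob", "Skills": "SQL"}], ["Skills"])` adds a "Skills - Python" and a "Skills - SQL" column to each (made-up) profile, set to True or False depending on whether the profile lists that skill.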
Still, you might be missing additional features, such as the number of languages or the last occupied position. BigMLer has also added a new --label-aggregate option that can help you with that. You can use count to create a new field holding the number of occurrences of labels in a multi-label field, and first or last to generate a new field with the first or last label found in the multi-label fields. In our example, we could use count and last, so that the Titles - count, Languages - count, Skills - count, Titles - last, Languages - last and Skills - last new fields will be added to our original source data and used in model building.
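As a rough picture of what these aggregates contain, here is a small Python sketch that computes count, first and last for each multi-label field. The function name and the sample row in the usage note are hypothetical, and the sketch is not BigMLer’s implementation; only the "Titles - count" style of field names is taken from the text above.

```python
def add_label_aggregates(rows, fields, separator=":"):
    """Return copies of `rows` with count, first and last columns per multi-label field."""
    result = []
    for row in rows:
        new_row = dict(row)
        for field in fields:
            labels = [label for label in row[field].split(separator) if label]
            new_row[f"{field} - count"] = len(labels)
            # An empty field has no first or last label.
            new_row[f"{field} - first"] = labels[0] if labels else None
            new_row[f"{field} - last"] = labels[-1] if labels else None
        result.append(new_row)
    return result
```

For a made-up profile such as `{"Titles": "Student:Engineer:Manager"}`, this yields a count of 3, a first label of Student and a last label of Manager.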
Selecting significant input data
We have just built an extended file and generated the BigML resources you need to create a trained model. Nevertheless, we focused mainly on showing the advantages that BigMLer offers to build up new features from multi-labeled fields, disregarding the convenience of including or excluding some of them as inputs for our model. In our example, the Name fields should be excluded from the model building input fields, as they are external references that have nothing to do with the Titles values (we don’t expect all CEOs to be named Francisco). Also, once the new label fields are included, we may prefer to exclude the prior multi-labeled fields from the model input fields to ensure that the prediction rules are only based on separate labels. For the Titles objective field, we would also like to ignore the generated aggregation fields to avoid useless prediction rules, like saying that if the last occupied post in a profile is Manager, then the profile is suitable to be a Manager. In addition to that, we exclude the Marital Status fields because we don’t want these features to influence our analysis. This is how we chose to build our final model:
The --model-fields option has been set to a comma-separated list of all the fields that you would like to be excluded as input for your model, each prefixed by a - sign. Then, having a look at the generated models, one for each position, some rules appear. You can see the entire process in the next video:
According to our data, Students have a limited number of skills such as web programming but lack others such as Business Intelligence (that is found in Managers’ profiles), Algorithm Design (frequently found in Engineers’ profiles) or Software Engineering Management (appearing in CEOs’ profiles). Similar patterns can be found in the Engineers’ model, the Managers’ model and the CEOs’ model, so that when you come across new profiles, you could use these models (tagged with the multi-label-recruiting text) to predict the positions they would be suitable for by calling
where new_profiles.csv would contain the information of the new profiles in CSV format. The predictions.csv file generated by BigMLer would store the predicted values for each new profile and their confidences separated by comma, all ready to use.
This is a simplified example of how BigMLer can empower you to easily use multi-labeled fields in your machine learning system. It can split the fields’ contents, generate new binary fields for each existing label and even aggregate the labels’ information with count, first or last functions. BigMLer’s functionality keeps growing steadily, and new options like weighted models and threshold limited predictions are ready for you to try, but this will be the subject of another post, so stay tuned!
One of the great things about sports these days is the number of bright minds taking fresh looks at sports statistics and analytics. This has been particularly true in baseball, where there’s healthy debate between baseball Traditionalists (who favor subjective-based scouting reports and place value on traditional statistics like Home Runs, Wins, etc.) and “Sabermetricians” (who favor advanced statistics, many of which were created by Nate Silver), with more and more credence being lent to analytic-driven decisions for roster composition, player salaries and the like. The Sabermetrics approach to baseball is well-documented in “Moneyball” and throughout the baseball blogosphere.
Football (the American variety) also has a burgeoning statistical community, and Football Outsiders is one of the leaders in this arena. At the core of their approach is their proprietary Defense-adjusted Value Over Average (DVOA) system that breaks down every single NFL play and compares a team’s performance to a league baseline based on situation in order to determine value over average—which they explain in detail here. Another example of cool football analytics in action is this report by Lock Analytics on how to use Random Forests to estimate a game’s win probability on a play by play basis.
What was the objective of the model?
With the Super Bowl being right around the corner, we thought it would be interesting to work with some of Football Outsiders’ historical data plus key football gambling metrics to see if we could come up with predictions for the big game.
What is the data source?
We used a few sources to create a dataset for this project. First and foremost, we pulled statistical data from Football Outsiders. Then, to get some added context on the Super Bowl for some of the betting metrics, we pulled historical scores, point spreads and over/under totals from Vegas Insider.
What was the modeling strategy?
As stated, we combined Football Outsiders’ DVOA-driven analysis with historical betting statistics, with the objectives being to predict the NFC (Seattle Seahawks) and AFC (Denver Broncos) total points. However, since there were only relevant statistics from the past 25 seasons we had a pretty small dataset. As such, we decided to build a 100-model ensemble in order to gain a stronger predictive value.
What fields were selected for the ensembles?
We used AFC & NFC Efficiency—both weighted (factoring in late-season performance) and non-weighted; AFC & NFC Offensive & Defensive ranks (weighted & non-weighted), AFC & NFC Special Teams ranks, the point spread, and the over/under total. We also included historical points scored for both the AFC & NFC team from past Super Bowls. You can view and clone the full dataset that we built here, although you’ll have to deselect some of the fields if you want to replicate this specific ensemble.
What did we find?
Using the Prediction function within the BigML interface, we entered the key values for this season to predict both AFC and NFC points (pictured below)
And the winner is..:
As you can see below, the result of our prediction is NFC 26.38 (with error ±4.95), AFC 22.16 (with error ±2.20). This reflects a prediction using confidence weighting for the ensemble. If we use a plurality-based prediction the result is NFC 29.93 (±13.74), AFC 23.97 (±6.0).
Incidentally, we also ran a categorical ensemble to simply predict the winner (NFC/AFC), and that ensemble also pointed to a Seahawks’ victory, with ~76% confidence. Another ensemble predicting the winner against the spread showed Seattle with a ~72% confidence.
But before you Seattle denizens start planning a victory parade through the Emerald City, please bear in mind a few things:
- These ensembles were built from small datasets so treat the results with a grain of salt. In other words, don’t wager your mortgage on 24 rows of data.
- Factoring in the expected error (which reflects our 95% confidence threshold), Denver could still win.
- This blog was authored by a lifelong Seahawks fan.
Last but not least, do not forget the greatest football variable of all, which a wise data analyst recently described to me as the “Any Given Sunday” coefficient…
When BigML first launched the SunBurst visualization for decision trees, I was amazed at what a big improvement it was over the traditional decision tree viz. Instead of showing each node as the same size, the SunBurst shows each node as an arc with length proportional to its number of instances. You can then color these arcs in different ways—my favorite is “confidence view”, which colors each arc according to the amount of separation between classes. So to find the strongest patterns in your model, just look for the longest arcs that are the brightest green (and to find weak spots, just look for arcs that are brown or red).
Today we launched the Open SunBurst feature, which lets you drop a SunBurst model into any web page. This is particularly useful for blogs and news sites that want to present an interactive model to their readers as part of a broader story. Simply copy this bit of HTML from your model’s “secret link” section:
The result is a miniaturized version of the SunBurst, suitable for embedding in a news article or blog post. You still get all the interactivity of the SunBurst, including the ability to zoom in on a node, mouse over nodes to find rules, and toggle between prediction view and confidence view. Check out the video below for more details!
We had a great turnout for Tuesday’s webinar that detailed BigML’s Winter 2014 Release. In the webinar, BigML’s CIO Poul Petersen used loan data from Prosper (a peer-to-peer lending site) to assess risk for prospective loan applicants.
Some of the key features that were highlighted in the webinar include:
- Dataset filtering
- Dataset sampling
- How (and why) to use weights to balance models
- How to adjust the node threshold (i.e., the depth) in BigML’s decision trees
- How to add new fields to a dataset
- K-threshold ensembles
- Batch predictions
Check it out for yourself here and stay tuned for details on our next webinar, which will focus on Programmatic Machine Learning via BigML’s API!
Building a predictive model that performs reasonably well scoring new data in production is a multi-step and iterative process that requires the right mix of training data, feature engineering, machine learning, evaluations, and black art. Once a model is running “in the wild”, its performance can degrade significantly when the distribution generating the new data varies from the distribution that generated the data used to train the model. Unfortunately, this is often the norm and not the exception in many real-world domains. We briefly described this issue recently. This problem is formally known as Covariate Shift, when the distribution of the inputs used as predictors (covariates) changes between training and production stages, or as Dataset Shift, when the joint distribution of inputs and the output (the target being predicted) also changes. Both Covariate Shift and Dataset Shift are receiving more attention from the research community. But, in practical settings, how can you automatically detect that there’s a significant difference between training and production data to properly take action and retrain or adjust your model accordingly?
In this post, I’m going to show you how to use Machine Learning (as it couldn’t be otherwise) to quickly check whether there’s a covariate shift between training data and production data. You read it right: Machine Learning to learn whether machine-learned models will perform well or not. I’ll describe a quick and dirty method and leave rigorous explanations and formal comparisons with other techniques (e.g., KL-divergence, Wald-Wolfowitz test, etc) for other forums. The method will also be helpful as a sneak peek of a couple of new exciting capabilities that we just brought to BigML: dataset transformations and multi-dataset models.
Let me start by giving you the basic intuition behind the method.
The Basic Idea
Most supervised machine learning techniques are built on the assumption that data at the training and production stages follow the same distribution. So, to test whether that is the case in a particular scenario, we can just create a new dataset mixing training and production data, where each instance in the new dataset has been labeled either “train” or “production”, according to its provenance. In the absence of covariate shift, the distributions of the train and production data instances will be nearly identical, and we would have no easy way of distinguishing between them. On the other hand, when the distributions of the two labels differ (i.e., when we do have a covariate shift in our data), it must be possible to predict whether an instance will have either the “train” or the “production” label. So our strategy to detect covariate shift will consist of building a predictive model with the provenance label as its target, and evaluating how well it scores when used to sift training from production data. More specifically, these are the steps we will follow:
- Create a random sample of your training data adding a new feature (Origin) with the value “train” to each instance in the sample.
- Create a random sample of your production data adding a new feature (Origin) with the value “production” to each instance in the sample.
- Create a model to predict the Origin of an instance using a sample (e.g., 80%) of the previous samples as training data.
- Evaluate the new model using the rest (e.g., 20%) of the previous samples as test data.
- If the phi coefficient of the evaluation is smaller than 0.2 then you can say that your training and production data are indistinguishable and they come from the same or at least very similar distribution. However, if the phi coefficient is greater than 0.2 then you can say that there’s a covariate shift and your training data is not really representative of your production data.
- To keep the result from being just a matter of chance, you should run a number of trials and average the result.
I leave a formal explanation about the right number of trials and the right size of the samples for other forums, but if you run at least 10 trials and use 80% of your data you’ll be in the safe zone. Another topic for a formal discussion would be threshold setting, that is, how to estimate the degradation of our model as phi increases; e.g., if phi is, say, 0.21, how bad is our model?… It will be disastrous in some domains and unnoticeable in others.
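Before moving to the API, the steps above can be tried locally with a short, self-contained Python sketch. To keep it dependency-free it swaps the BigML model for a one-split decision stump, which is a much weaker learner than a decision tree, so treat it purely as an illustration of the procedure. All function names are made up for this example, and instances are assumed to be tuples of numeric features.

```python
import math
import random


def phi_coefficient(tp, fp, fn, tn):
    """Phi coefficient of a 2x2 confusion matrix (0.0 when it is undefined)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def fit_stump(rows):
    """Fit a one-split classifier on (features, label) rows; a stand-in for a real model."""
    best = (-1, 0, 0.0, 1)  # (correct predictions, feature index, threshold, polarity)
    for j in range(len(rows[0][0])):
        for t in sorted({x[j] for x, _ in rows}):
            for polarity in (1, -1):
                correct = sum(
                    1 for x, y in rows
                    if (1 if polarity * (x[j] - t) > 0 else 0) == y
                )
                if correct > best[0]:
                    best = (correct, j, t, polarity)
    _, j, t, polarity = best
    return lambda x: 1 if polarity * (x[j] - t) > 0 else 0


def average_phi(train, production, trials=10, train_fraction=0.8, seed=0):
    """Average phi of an Origin classifier over several random trials."""
    rng = random.Random(seed)
    n = min(len(train), len(production))  # keep the two origins balanced
    phis = []
    for _ in range(trials):
        # Origin label: 0 means "train", 1 means "production".
        rows = [(x, 0) for x in rng.sample(train, n)]
        rows += [(x, 1) for x in rng.sample(production, n)]
        rng.shuffle(rows)
        cut = int(train_fraction * len(rows))
        predict = fit_stump(rows[:cut])
        tp = fp = fn = tn = 0
        for x, y in rows[cut:]:
            p = predict(x)
            tp += p == 1 and y == 1
            fp += p == 1 and y == 0
            fn += p == 0 and y == 1
            tn += p == 0 and y == 0
        phis.append(phi_coefficient(tp, fp, fn, tn))
    return sum(phis) / len(phis)
```

An average phi above the 0.2 rule of thumb from the list above would then be read as a sign of covariate shift.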
Next, I will show you how to quickly test the idea using as an example the training and test data (using the latter as “production” data) of the Titanic dataset as they were provided at the Kaggle competition, but removing the label and the ids of each instance.
Testing for Covariate Shift in a few API calls
I’m going to use BigML’s raw REST API within a simple bash script. A simpler function providing covariate shift checking will be available in BigML’s bindings, command-line and web interfaces very soon.
First of all, I create a source for the training data (lines 58-61) and another for the production data (lines 63-66) using their respective remote locations defined at the beginning of the script (lines 22 and 23). Then I create a full dataset for the training source (lines 70-73) and another for the production source (lines 76-79). Then I sample the training data and add a new field named “Origin”, with the value “train” to each instance (lines 90-98) and do the same for the production data but adding the value “production” to each instance (lines 100-108).
The next step is to create a multi-dataset model using both datasets above. Notice that I also subsample the dataset with a bigger number of instances to make sure that the class distribution (train and production) is balanced. I also specify a seed to make sure that I can later evaluate against a completely disjoint subset of the data. Once the model is created, I’m ready to create an evaluation of the new model with the portion of the dataset that I didn’t use to create the model (“out_of_bag”: true). I iterate 10 times and in each iteration display and save the phi coefficient to finally compute and show the average phi at the end of the 10 iterations.
Next I’m going to run the same test but I’ll induce a covariate shift first. To do that, I filter the training data to just contain instances of “male” passengers and filter the production data to just contain instances of “female” passengers (just uncommenting lines 31 and 32 does the trick).
In this case, the average phi turns out to be 0.6564, so I can say that there is a covariate shift between the training and production data. I shared one of the models of the last test here. You can see that Name is a great predictor for Sex or, in other words, it gets almost perfect discrimination between the training and production data. So just to eliminate any suspicions about the method, I will repeat the test, in this case excluding the field Name from the model (uncommenting line 35). As you can see below, the method still detects the covariate shift in this case with an average phi of 0.3635.
Once you detect a covariate shift you’ll need to retrain your model using new production data. If that is not feasible, and depending on the nature of your data, there are a number of techniques that can help you adjust your training data.
Notice that in the API calls above, I used two new BigML features:
- Dataset transformations: I was able to filter and sample a dataset based on the value of a field, and also to add new fields to existing datasets. You’ll see in an upcoming post many other ways to add fields to an existing dataset, as well as how to remove outliers, discretize continuous variables or create windows, among many other tricks.
- Multi-dataset models: I could create a model using two datasets as input, sampling each one differently/at differing rates. We’ll also see in a future post that you can create models with up to 32 different datasets individually sampled. This can be very useful to deal with imbalanced datasets or to weight your datasets differently.
Both new capabilities are available via our API and soon via our web interface as well. As you are probably beginning to see, BigML is opening a new avenue in automating predictive modeling in the cloud, in ways that had never been available before.
To quickly check whether there’s a covariate shift between your training and production data you can create a predictive model using a mix of instances from the training and production data. If the model is capable of telling apart training instances from production instances then you can say that there’s a covariate shift. The idea behind this method is not new; I first heard of it from Prof. Tom Dietterich, but it’s probably part of the “folk knowledge” needed to develop robust machine learning applications, and it is hard to trace it to a specific original reference. Anyway, being able to implement it in a few API calls is kind of cool, isn’t it? It will be even cooler once BigML provides it just one call or click away. Stay tuned!
Happy New Year everybody! It has been a while since we last blogged, but for good reason: we’ve been busier than ever developing our Winter release and serving our fast-growing user base at the same time. By now, BigML has been used to create more than 600,000 active predictive models, half of which were created during the last few weeks. We started 2013 with fewer than 10,000 models and we’re on track to eclipse the millionth model mark early in 2014.
BigML’s 2014 Winter release constitutes a significant step forward in three directions:
- Speed: you can now build a model in 1/8 of the time it previously took and also get fast, real-time predictions with BigML’s PredictServer.
- Programmability: we have empowered our API with programmatic means to filter, sample, or extend with new fields any dataset or list of datasets. Dozens of pre-built functions are already available and defining new functions is very easy.
- Data-driven decisions for everyone: BigML’s new development mode allows you to run unlimited tasks of up to 16 MB for FREE, making BigML the ideal framework to practice, teach, and learn machine learning or predictive analytics. There’s no excuse to not start making data-driven decisions today!
Let’s now quickly review the most salient new features. In the coming days we’ll have added blog posts to explain a few of them in further detail.
Much faster models
A third generation of our algorithm has significantly improved our performance when it comes to model building. To give you a quick idea, modeling a dataset with 50 fields and a little more than 500,000 rows (~80 MB of data) took around 8 minutes in our previous version. Now, it takes less than a minute.
For use cases like fraud or intrusion detection, ad click prediction, or best customer retention, the class of interest is the minority class (i.e., the number of instances of the class of interest is under-represented compared to the number of instances of the other classes). These situations are called imbalanced and they constitute a serious challenge, since most traditional supervised machine learning algorithms overlook the minority classes in favor of the majority ones. Techniques like under-sampling or over-sampling the input data can help, but they usually provide sub-optimal performance.
BigML’s new algorithm comes with three ways to elegantly cope with this problem and create weighted models. Using them you’ll be able to build models that will consider at building time every instance or class according to the weight criteria that you establish.
We have developed a Lisp-like language codenamed Flatline (after the legendary cyber-cowboy McCoy Pauley) and open sourced the specification here. Flatline can be used for both filtering the rows and columns of a dataset and for generating new fields. For example, if the temperature in your dataset is expressed in Fahrenheit degrees you can easily transform it to Celsius using a single Flatline expression.
Flatline allows you to horizontally select different fields in the same row or select a finite sliding window of rows to traverse a dataset vertically. This is useful to generate values based on a number of front and rear values. In other words, you can generate a new field based on computing a function over the previous values of another field. Imagine adding a new field with the 7-day average of the daily maximum temperatures.
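To picture the two kinds of computation just described, here is a plain Python analogue. Note that this is Python, not Flatline syntax, and the function names are made up for this sketch. How Flatline treats the first rows of a window is not covered in this post, so the shorter average used at the start of the series is an assumption of this example.

```python
def fahrenheit_to_celsius(temp_f):
    """Row-wise transformation: one new value computed from a field in the same row."""
    return (temp_f - 32) * 5 / 9


def trailing_mean(values, window=7):
    """Window transformation: mean of each value and up to window - 1 previous values."""
    means = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        means.append(sum(chunk) / len(chunk))
    return means
```

With a list of daily maximum temperatures, `trailing_mean(daily_max, window=7)` would give the values of a new 7-day average field, one per row.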
Flatline comes with dozens of pre-built functions and makes it very easy to perform standard analytics tasks on your dataset, such as discretizing continuous variables, removing outliers, replacing missing values, normalizing variables, etc. Flatline expressions have a JSON-like equivalent for those who prefer it. You can read more about BigML’s dataset transformations here.
Being able to programmatically transform a dataset via a high-level language together with a cloud-based API, plus the rest of the features that we mention below, opens up a range of possibilities for programming machine learning tasks in the cloud that was not available before. We’ll give you some examples in the coming days, such as detecting covariate shift between your training data and production data in only a few API calls. We have started calling this new paradigm “Programmatic Machine Learning”.
Multi datasets and multi-dataset models
Another cool feature in our Winter release is the ability to create a dataset using multiple datasets as input. This is very useful when you need to combine multiple sources of data into a single dataset or when you want to build an online solution that collects data in batches. You can even sample each dataset individually.
Also, you can use multiple datasets as an input to build models, ensembles and evaluations (i.e., you do not need to first merge them into a single dataset). You can read more about multi-datasets here.
New Prediction Strategies
We have developed a second strategy to deal with missing values in your input data. So far, when the input data that you use to generate a new prediction contained a missing value, BigML returned the prediction that had been computed up to the node (split) that needed that input. We call this strategy last prediction.
Now you can choose an alternative strategy named proportional that will evaluate all the subtrees of a missing split and will recombine their predictions based on the proportion of data in each subtree.
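The difference between the two strategies is easier to see on a toy tree. The following is only a sketch of the idea for a regression tree, not BigML's implementation: the nested-dict tree layout, the field names and the `predict` function are assumptions made for this example.

```python
def predict(node, inputs, strategy="last_prediction"):
    """Walk a toy regression tree, handling missing inputs with the chosen strategy."""
    if "field" not in node:                      # leaf: nothing left to test
        return node["prediction"]
    value = inputs.get(node["field"])
    if value is None:                            # the split needs a missing input
        if strategy == "last_prediction":
            return node["prediction"]            # stop here and answer with this node
        # proportional: evaluate both subtrees and weight them by their share of data
        children = (node["left"], node["right"])
        total = sum(child["count"] for child in children)
        return sum(child["count"] / total * predict(child, inputs, strategy)
                   for child in children)
    branch = "left" if value <= node["threshold"] else "right"
    return predict(node[branch], inputs, strategy)

# Hypothetical tree: each node stores its own prediction and training instance count.
tree = {
    "field": "age", "threshold": 30, "count": 100, "prediction": 44.0,
    "left": {"count": 40, "prediction": 20.0},
    "right": {
        "field": "income", "threshold": 50, "count": 60, "prediction": 60.0,
        "left": {"count": 30, "prediction": 40.0},
        "right": {"count": 30, "prediction": 80.0},
    },
}

inputs = {"income": 80}                          # "age" is missing
print(predict(tree, inputs, "last_prediction"))  # answers at the root
print(predict(tree, inputs, "proportional"))     # still uses the known income
```

In this example the proportional strategy can still take advantage of the known `income` value below the missing `age` split, while last prediction stops at the root.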
We have also developed a new threshold-based combiner for classification ensembles that comes in handy to implement either conservative or aggressive prediction strategies.
This combiner lets you trigger predictions based on a given threshold k. Imagine that you have created a 20-model ensemble to detect intruders in a computer network and you use a k-threshold of 1 for predictions. Then, as long as a single model in the ensemble predicts true, the ensemble will predict that an intrusion is going on. Or imagine that you have another 30-model ensemble to predict the success of marketing campaigns and this time you want to reduce the number of false positives. Then, you can set k to a high value like 27 or 28 to make sure that you do not spend your dollars on customers who won’t react to your campaign.
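As a rough sketch of how such a combiner could work, here is a small Python function. The name `threshold_combine` is hypothetical, and the fallback to the most common remaining class when the threshold is not met is an assumption of this example rather than something stated above.

```python
from collections import Counter

def threshold_combine(votes, positive_class, k):
    """Predict positive_class if at least k models voted for it."""
    if not 1 <= k <= len(votes):
        raise ValueError("k must be between 1 and the number of models")
    if sum(1 for vote in votes if vote == positive_class) >= k:
        return positive_class
    # Assumed fallback: plurality vote among the remaining classes.
    others = Counter(vote for vote in votes if vote != positive_class)
    return others.most_common(1)[0][0]

# Aggressive: one alarm out of 20 models is enough to flag an intrusion.
intrusion_votes = [True] + [False] * 19
print(threshold_combine(intrusion_votes, True, k=1))

# Conservative: 26 of 30 models are not enough when k is 27.
campaign_votes = ["success"] * 26 + ["failure"] * 4
print(threshold_combine(campaign_votes, "success", k=27))
```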
New Development Mode
We noticed that many BigML users weren’t aware of our FREE development mode. As a result, many users would experiment with the promotional datasets that we offer in production mode and soon run out of credits before finishing their initial projects. To address this, we have now made the development switch more obvious in our web interface and have also increased the max task size to 16 MB.
BigML’s development mode has 3 limitations compared to production mode: 1) the maximum number of models of an ensemble cannot be higher than 10; 2) the maximum number of terms in text analysis is limited to 32; and, 3) the maximum number of nodes in a tree cannot be higher than 512. All other features are exactly the same and you can run unlimited tasks.
A few more things
That’s not all. There are some other great features in store like a new search box in your dashboard, multi predictions in Excel-exported models, sharing evaluations via private links, and much, much more. We’ll soon list all of these features in our What’s New section. Last but not least, BigML’s PredictServer is now available from Amazon Marketplace.
So let 2014 be the year that you start adding predictive analytics to your business, and that you start building Predictive Apps with BigML. To help you out, we are offering an extra 15% discount to the first 50 annual BigML subscriptions contracted this year. If you are interested in one, send us an email to firstname.lastname@example.org and we’ll be happy to send a coupon your way.
Slowly, machine learning has been creeping across the software world, especially in the last 5-10 years. Time was, only huge companies could leverage it, making sense out of data that would be useless to their smaller counterparts. Thanks to BigML, a number of other companies, and some great open source projects, machine learning is now democratized. Now nearly anyone, regardless of technical ability, can use machine learning at a fraction of the cost of a single software engineer.
But when I say “machine learning” above, I’m really only talking about supervised classification and regression. While incredibly useful and perhaps the easiest facet of machine learning to understand, it is a very small part of the field. Some companies (including BigML) know this, and have begun to make forays just slightly further into the machine learning wilderness. With supervised classification becoming a larger and larger part of the public consciousness, it’s natural to wonder where the next destination will be.
Here, I’ll speculate on three directions that machine learning commercialization could take in the near future. To one degree or another, there are probably people working on all of these ideas right now, so don’t be surprised to see them at an internet near you quite soon.
Concept Drift or Covariate Shift: You’re Living in the Past, Man!
One problem with machine-learned models is that, in a sense, they’re already out-of-date before you even use them. Models are (by necessity) trained and tested on data that already happened. It does us no good to have accurate predictive analytics on data from the past!
The big bet, of course, is that data from the future looks enough like data from the past that the model will perform usefully on it as well. But, given a long enough future, that bet is bound to be a losing one. The world (or, in data science parlance, the distribution generating the data) will not stand still simply because that would be convenient for your model. It will change, and probably only after your model has been performing as expected for some time. You will be caught off guard! You and your model are doomed!
Okay, slow down. There’s a lot of research into this very problem, with various closely related versions labeled “concept drift”, “covariate shift”, and “dataset shift”. There is a fairly simple, if imperfect, solution to this problem: Every time you make a prediction on some data, file it away in your memory. When you find yourself making way more bad predictions than you’d normally expect, recall that memorized data and construct a new model based on it. Use this model going forward.
Essentially the only subtlety in that plan is the bit about “more bad predictions than you’d normally expect”. How do you know how many you’d expect? How many more is “way more”? It turns out that you can figure this out, but it often involves some fancy math.
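One simple way to put a number on “way more” is sketched below; it is one possibility among many, not a description of any particular product. We assume you know the model's error rate from a held-out evaluation, treat the number of mistakes in a window of recent predictions as binomial, and raise a flag when the count climbs more than z standard deviations above its expected value. The `DriftMonitor` class and its parameters are hypothetical.

```python
import math
from collections import deque

class DriftMonitor:
    """Flags when recent mistakes exceed what the baseline error rate explains."""

    def __init__(self, expected_error_rate, window=100, z=3.0):
        p = expected_error_rate
        # Mean plus z standard deviations of a Binomial(window, p) mistake count.
        self.threshold = window * p + z * math.sqrt(window * p * (1 - p))
        self.mistakes = deque(maxlen=window)  # True where a prediction was wrong
        self.memory = deque(maxlen=window)    # (inputs, actual) pairs kept for retraining

    def record(self, inputs, predicted, actual):
        """Store one outcome; return True when it looks like time to retrain."""
        self.mistakes.append(predicted != actual)
        self.memory.append((inputs, actual))
        window_full = len(self.mistakes) == self.mistakes.maxlen
        return window_full and sum(self.mistakes) > self.threshold

monitor = DriftMonitor(expected_error_rate=0.10, window=100, z=3.0)

# Phase 1: the world behaves as expected (one mistake in ten).
calm = [monitor.record({"x": i}, predicted=1, actual=0 if i % 10 == 0 else 1)
        for i in range(100)]
# Phase 2: the distribution shifts and half of the predictions are now wrong.
shifted = [monitor.record({"x": i}, predicted=1, actual=i % 2)
           for i in range(100, 200)]

print(any(calm), any(shifted))
```

When `record` returns True, the pairs in `monitor.memory` are the filed-away data from which a replacement model would be built.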
We’ve already done some writing about this, and we’re working hard to make this one a reality sooner rather than later. Readers here will know the moment it happens!
Semi-supervised Learning: What You Don’t Know Can Hurt You
Supervised classification is great when you have a bunch of data points, already nicely labeled with the proper objective. But what if you don’t have those labels? Sometimes, getting the data can be cheap but labeling it can be expensive. A nice example of this is image labeling: We can collect loads of images from the internet with simple searching and scraping, but labeling them means a person looking at every single one. Who wants to pay someone to do that?
Enter semi-supervised learning. In semi-supervised learning, we generally imagine that we have a small number of labeled points and a huge number of unlabeled points. We then use fancy algorithms to infer labels for the remainder of the data. We can then, if we like, learn a new model over the now-completely labeled data. The extension from the purely supervised case is fairly clear: One need only provide an interface where a user can upload a vast number of training points and a mechanism to hand-label a few of them.
Perhaps even more interesting is the special case provided by co-training, where two separate datasets containing different views of the data are uploaded. The differing views must be taken from different sources of information. The canonical example of this is webpages, where one view of the data is the text of the page and the other is the text of the links from other pages to that page. Another good example may be cellphones, where one view is the accelerometer data and another view is the GPS data.
With these two views and a small number of labels, we learn separate classifiers in each view. Then, iteratively, each classifier is used to provide more data for the other: First, the classifier from view one finds some points for which it knows the correct class with high confidence. These high confidence points are added to the labeled set from view two, and that classifier is retrained. The process then operates in reverse, and the loop is continued until all points are labeled. The idea is fairly simple, but performs very well in practice, and is probably the best choice for semi-supervised learning in the case where one has two views of the data.
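The loop described above can be sketched in a few dozen lines. This is a toy version under strong assumptions: each view is a single number per point, the classifier in each view is a nearest-centroid rule, confidence is the margin between the two closest centroids, and the seed labels must contain at least one example of every class. All function names and the data are invented for illustration.

```python
def fit_centroids(values, labels):
    """Train a 1-D nearest-centroid classifier; returns {label: centroid}."""
    sums, counts = {}, {}
    for i, label in labels.items():
        sums[label] = sums.get(label, 0.0) + values[i]
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict_with_confidence(centroids, x):
    """Return (label, margin between the closest and second-closest centroid)."""
    ranked = sorted((abs(x - c), label) for label, c in centroids.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else 0.0
    return ranked[0][1], margin

def teach(teacher_view, teacher_labels, student_labels, per_round):
    """Label the teacher's most confident points and add them to the student's set."""
    centroids = fit_centroids(teacher_view, teacher_labels)
    scored = []
    for i, x in enumerate(teacher_view):
        if i not in student_labels:
            label, confidence = predict_with_confidence(centroids, x)
            scored.append((confidence, i, label))
    scored.sort(reverse=True)
    for _, i, label in scored[:per_round]:
        student_labels[i] = label

def co_train(view1, view2, seed_labels, per_round=2):
    """Alternate between the views until every point is labeled in both sets."""
    labels1, labels2 = dict(seed_labels), dict(seed_labels)
    while len(labels1) < len(view1) or len(labels2) < len(view2):
        teach(view1, labels1, labels2, per_round)  # view one teaches view two
        teach(view2, labels2, labels1, per_round)  # and then the reverse
    return labels1, labels2

# Two hypothetical views of the same eight points; only points 0 and 4 are labeled.
view1 = [0.1, 0.3, 0.2, 0.4, 2.0, 2.2, 2.1, 1.9]
view2 = [5.0, 5.2, 4.9, 5.1, 9.0, 9.1, 8.8, 9.2]
labels1, labels2 = co_train(view1, view2, {0: "a", 4: "b"})
print(labels1)
print(labels2)
```

On real data the two labeled sets can disagree, and a mislabeled point added early will mislead the other view, so practical versions usually add only points above a confidence cutoff.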
Reinforcement Learning: The Carrot and The Stick
Further from traditional supervised learning is reinforcement learning. The reinforcement learning problem sounds similar to the classification problem: We are trying to learn a mapping from states to actions, much as in the supervised learning case where we’re learning a mapping from points to labels. The main thing that changes the flavor of the problem is that it has much more the feel of a game: When we take an action we occasionally get a reward or punishment, depending on the state. Also, the action we take determines the next state in which we find ourselves.
A nice example of this sort of problem is the game of chess: The state of the game is the various locations of the pieces on the board, and the action you take is to move a piece (and thus change the state). The reward or punishment comes at the end of the game, when you win or lose. The main cleverness is to figure out how that reward or punishment should modify how you select your actions. More precisely, which thing or things that you did led to your eventual loss or victory?
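Chess is far too big for a short example, but the credit-assignment idea can be sketched on a tiny made-up game using tabular Q-learning, one standard technique among several. Everything here is an assumption for illustration: a five-cell corridor where cell 0 loses, cell 4 wins, the player starts in the middle, and ties between equally valued moves go to the first action listed.

```python
import random

ACTIONS = (+1, -1)             # step right or step left along the corridor
TERMINAL = {0: -1.0, 4: 1.0}   # losing end and winning end, with their rewards

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(5) for a in ACTIONS}
    for _ in range(episodes):
        state = 2
        for _step in range(100):                  # cap the length of an episode
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)      # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
            nxt = state + action
            reward = TERMINAL.get(nxt, 0.0)       # reward only arrives at the end
            future = 0.0 if nxt in TERMINAL else max(q[(nxt, a)] for a in ACTIONS)
            # Move the estimate toward the reward plus the discounted future value.
            q[(state, action)] += alpha * (reward + gamma * future - q[(state, action)])
            state = nxt
            if state in TERMINAL:
                break
    return q

def greedy_policy(q):
    """Best known action in each non-terminal cell."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in (1, 2, 3)}

q = train()
print(greedy_policy(q))
print({s: round(q[(s, +1)], 2) for s in (1, 2, 3)})
```

The discount factor `gamma` is what spreads the final reward back over earlier moves: a step right from cell 3 is worth about 1, while the same step from cell 2 is worth about 0.9 because the win is one move further away.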
Because this learning setting has such a game-like feel, it might not be surprising that it can be used to learn how to play games. Backgammon and checkers have turned out to be particularly amenable to this type of learning.
So What’s It Going To Be?
It’s difficult to say if any of these things or something else entirely will be The Next Big Thing in commercial machine learning. A lot of that depends on the cleverness of people with problems and whether they can find ways to solve their problems with these tools. Many of these technologies don’t have broad traction yet simply because . . . they don’t have broad traction yet (no, I’m not in the tautology club). Once a few pioneers have shown how such things can be used, lots of other people will see the light. I’m reminded of a story about the first laser, long before fiber-optic cables and corrective eye surgery, when an early pioneer said that lasers were “a solution looking for a problem”. There are plenty of technologies like these in the machine learning literature: great answers looking for the right question, waiting for someone to match them up.
Got other ideas about what machine learning technology should be democratized next? We’d love to hear about it in the comments below!