Skip to content

Automatic Weighting of Imbalanced Datasets

Very often datasets are imbalanced. That is, the number of instances for each of the classes in the target variable that you want to predict is not proportional to the real importance of each class in your problem. Usually, the class of interest is not the majority class. Imagine a dataset containing clickstream data that you want to use to create a predictive advertising application. The number of instances of users that did not click on an ad would probably be much higher than the number of click-through instances. So when you build a statistical machine-learning model of an imbalanced dataset, the majority (i.e., most prevalent) class will outweigh the minority classes. These datasets usually lead you to build predictive models with suboptimal classification performance. This problem is known as the class-imbalance problem and occurs in a multitude of domains (fraud prevention, intrusion detection, churn prediction, etc). In this post, we’ll see how you can deal with imbalanced datasets configuring your models or ensembles to use weights via BigML’s web interface. You can read how to create weighted models using BigML’s API here and via BigML’s command line here.

BigML Weighthing

 

Weights

A simple solution to cope with imbalanced datasets is re-sampling. That is, undersampling the majority class or oversampling the minority classes. In BigML, you can easily implement re-sampling by using multi-datasets and sampling for each class differently. However, usually basic undersampling takes away instances that might turn to be informative and basic oversampling does not add any extra information to your model.

Another way to not dismiss any information and actually work closer to the root of the problem is to use weights. That is, weighing instances accordingly to the importance that they have in your problem. This enables things like telecom customer churn models where each customer is weighted according to their Lifetime Value. Let’s next see what the impact of weighting a model might be and also examine the options to use weights in BigML.

The Impact of Weighting

Let me illustrate the impact of weighting on model creation by means of two sunburst visualizations for models of the Forest Covertype dataset. This dataset has 581,012 instances that belong to 7 different classes distributed as follows:

  • Lodgepole Pine:  283,301 instances
  • Spruce-Fur: 211,840 instances
  • Ponderosa Pine: 35,754 instances
  • Krummholz: 20,510 instances
  • Douglas-fir: 17,367 instances
  • Aspen: 9,493 instances
  • Cottonwood-Willow: 2,747 instances

The first sunburst below corresponds to a single (512-node) model colored by prediction that I created without using any weighting. The second one corresponds to a single (512-node) weighted model created using BigML’s new balance objective option (more on it below). In both sunbursts, red corresponds to the Cottonwood-Willow class and light green to the Aspen class. In the first sunburst, you can see that the model hardly ever predicts those classes. However, in the sunburst of the weighted model you can see that there are many more red and light green nodes that will predict those classes.

Screen Shot 2014-02-26 at 3.09.20 PM

Screen Shot 2014-02-26 at 3.17.42 PM

So as you can see, weighting helped make us aware of outcomes of predicted classes that are under-represented in the input data that otherwise would be shadowed by over-represented values.

Weighting Models in BigML

BigML gives you three ways to apply weights to your dataset:

  1. Using one of the fields in your dataset as a weight field;
  2. Specifying a weight for each class in the objective field; or
  3. Automatically balancing all the classes in the objective field.

 

Weights

Weight Field

Using this option, BigML will use one of the fields in your dataset as a weight for each instance. If your dataset does not have an explicit weight, you can add one using BigML new dataset transformations.  Any numeric field with no negative or missing values is valid as a weight field. Each instance will be weighted individually according to the value of its weight field. This method is valid for both classification and regression models.

Weighing by LTV

Objective Weight

This method for adding weights only applies to classification models. A set of objective weights may be defined, one weight per objective class. Each instance will be weighted according to its class weight. If a class is not listed in the objective weights, it is assumed to have a weight of 1.  Weights of value zero are valid as long as there are some other positive valued weights. If every weight does end up being zero (this can happen, for instance, if sampling the dataset produces only instances of classes with zero weight) then the resulting model will have a single node with a nil output.

Weight using Objective Weights

Balance Objective

The third method is a convenience shortcut for specifying weights for a classification objective which are inversely proportional to their category counts. This gives you an easy way to make sure that all the classes in your dataset are evenly represented.

Balance Objective

 

Finally, bear in mind that when you use weights to create a model its performance can be significantly impacted. In classification models, you’re actually trading off precision and recall of the classes involved. So it’s very important to pay attention not only to the plain performance measures returned by evaluations but also to the corresponding misclassification costs. That is, if you know the cost of  a false positive and the cost of a false negative in your problem, then you will want to weigh each class to minimize the overall misclassification costs when you build your model.

BigML + Tableau = Powerful Predictive Visualizations for Everyone

BigML is very excited to be collaborating with Tableau Software as a newly minted Tableau Technology Partner.  This is a natural fit as both tableau_partner_logocompanies share similar philosophies on making analytics accessible (and enjoyable) to more people through incorporation of intuitive and easy-to-use tools. Not surprisingly, there’s a lot of overlap between our users: our latest survey showed that over 50% of BigML’s users also use Tableau. And of course, we’re eager to introduce Tableau’s customers to the predictive power of BigML.

So what does this mean for BigML and Tableau users?  First and foremost, it means that BigML will be working on ways to enable easier usage across both tools. For starters, we have just launched a new feature that lets you export a BigML model directly to Tableau as a calculated field, as is demonstrated in the video below:

This approach unleashes the power of Tableau for prediction, letting users interact with a BigML model just like any other Tableau field.  In the video, for example, we color a bar chart by predicted profit, which yields valuable insights into Tableau’s “superstore” retail dataset.

Moving forward, we’ll be looking at other ways that we combine the powerful analytic and visual capabilities of Tableau and BigML to enable joint customers to do amazing things.

We’d love to hear from any BigML users who are also Tableau customers so we can get added direction on this collaboration and provide you with early access to future implementations. If you’re willing and interested, please email us at tableau@bigml.com!

One Million Predictive Models Created and Counting!

Just over a year ago BigML hit a milestone when the 10,000th predictive model was created using our platform. At the time, we were very happy to see that our idea of building a service that helped people easily build predictive applications had started to get some traction.  One year and several thousand users later we are super excited that BigML has now supported the creation of over 1,000,000 predictive models—400,000 of which were created in just the last two months!  We are extremely happy to see a growing community of fellow data practitioners and developers using BigML across a number of domains and use cases: predictive advertising, predictive lead scoring, preventive customer churn, security, fraud detection, etc.

And we’re just getting started!  We have a roadmap of exciting features that will be coming out just around the corner… Stay tuned!

1000000_models

Smart Feature Selection with scikit-learn and BigML’s API

When trying to make data-driven decisions, we’re often faced with datasets that contain many more features than what we actually need for decision-making. As a consequence, building models becomes more computationally demanding, and even worse, model performance can suffer, as heavily parameterized models can lead to overfitting. How then, can we pare down our data to the most relevant set of features? In this post, we will discuss a solution proposed by Stanford scientists Ron Kohavi (now partner-level architect at Microsoft) and George H. John (now CEO at Rocket Fuel) in this article, in fact one of the the top 300 most-cited articles according to CiteSeerX.

The Basic Idea

Let’s consider a simple case where the data are composed of three features. We can denote each possible feature subset as a three digit binary string, where the value of each digit denotes whether the corresponding feature is included in the subset. For example, the string ‘101’ represents the subset which includes the first and third features from the original set of three. Our goal therefore, is to find which subset gives the best model performance. For any set of n features, there exist 2^n different subsets, so clearly an exhaustive search is out of the question for any data containing more than a handful of features.

Feature Subset Selection

The Advanced Idea

Now, let’s say each of these feature subsets is a node in a state space, and we form connections between those nodes that differ by only a single digit, i.e. the addition or deletion of a single feature. Our challenge therefore is to explore this state space in a manner which will quickly arrive at the most desirable feature subset. To accomplish this, we will employ the Best-First Search algorithm. To direct the search algorithm, we need a method of scoring each node in the state space. We will define the score as the mean accuracy after n-fold cross validation of models trained with the feature subset, minus a penalty of 0.1% per feature. This penalty is introduced to mitigate overfitting. As an added bonus, it also favors models which are quicker to build and evaluate. The above figure illustrates how a single node is evaluated using 3-fold cross validation. Lastly, we need to decide where in the state space to begin the search. Most of the branching will occur in the nodes encountered earlier in the search, so we will begin at the subset containing zero features, so that less time is spent building and evaluating models to explore these early nodes.

Testing the Idea

We’ll try this approach using the credit-screening dataset from the UC Irvine Machine Learning Repository, which contains 15 categorical and numerical features. Building a model using all the features, we get an average accuracy of 81.74% from five-fold cross validation. Not bad, but maybe we can do better… BigML’s Python bindings make it simple and straightforward to implement the search algorithm described above. You can find such an implementation using scikit-learn to create kfold cross validation data sources here. After running the feature subset search, we arrive at feature subset containing just 3 features, which achieves an average cross-validation accuracy of 86.37%. Moreover, we arrived at this result after evaluating only 117 different feature subsets, a far cry from 32768 which would be required for an exhaustive search.

Surfing the sea of tags with multi-label Machine Learning and BigMLer

Navigating the web nowadays, we constantly paddle through clouds of tags and other classification or taxonomy labels. We’ve grown so used to it that now we hardly realize their presence, but they are everywhere. They flow around the main stream of contents, pervade posts (such as this one), articles, personal pages, catalogs and all major content repositories. To store this kind of content, one naturally drifts to non-normalized fields or documental databases, where they end up stacked in a multi-occurrence field. From a Machine Learning point of view, they become a multi-labeled feature.

Machine Learning has well-known methods to cope with this kind of multi-labeled features. In this blog we devoted a previous post to talk about multi-labeled categories prediction using BigMLer, the command line utility for BigML. Now we present the next step: using multi-labeled fields’ labels as predictors.

Multi-labels in professional profiles: Recruiting toy example

In the spirit of the aforementioned post, let’s have a look at the typical profile page in your favourite recruiting web site. At first sight, we detect some contents that might as well be stored in multi-labeled fields: the companies we’ve worked for, the associations we belong to, the languages we speak, the positions we’ve assumed and our skills. Should you face the task of predicting which profile would be more suitable for a certain position, you would probably want to use this information. For example, in technical positions people that understand English will probably perfom better in their job. But maybe this is not so true for the chemical or pharmaceutical sector, where German has traditionally been a dominant communication language. Also the number of spoken languages could be a determinant factor. Who knows what other relations are hidden in your multi-labeled features!

OK, so there seems to be a bunch of valuable information stored in multi-labeled fields, but it looks like it must be reshaped to be useful in your machine learning system. Each label per se is a new feature you can use as input for your prediction rules. Even aggregation functions over the labels, like count, first or last, can be useful as new input. Now the good news: BigMLer can do this for you!

Multiple predictor multi-label fields.

Let’s build an example based on our recruiting site scenario. Suppose we make up a sample with some of the features available in users’ profile pages, such as the name, age, gender, marital status, number of certifications, number of recommendations, number of courses, titles, languages and skills. This is an excerpt of the training data:

The last three fields above are filled with colon-separated multiple values. For each person, they contain:

  • Titles: the titles of the posts she has occupied
  • Languages: the languages she speaks
  • Skills: the skills she has

As our goal is to produce a quick and simple proof of concept, the values for the Titles field are restricted to just four categories: Student, Engineer, Manager and CEO. The skills have also been chosen from a list of popular skills. Looking at the data we have, the first question that arises is: could we build a model that predicts if a new profile would fit a certain position? Well, probably some skills are a must in CEO‘s profiles, while other are still missing when you are a Student. Our data has implicitly all of this information, but we would need some work to rearrange it in order to build a predictive model from it. For example, we would need to build a new field for each of the available skills and populate it with a True value if the profile has that skill and False otherwise. What if you could let BigMLer take care of these uninteresting details for you? Guess what: you can!

See the next BigMLer command?

The --multi-label and --multi-label-fields options tell BigMLer that the contents of the fields TitlesLanguages and Skills are a series of labels. Using --label-separator you set colon as the labels’ delimiter and… ta da! BigMLer generates an extended file where each of the fields declared as multi-labeled fields are transformed into a series of binary new fields, one for each different label. The extended file is used to build the corresponding source, dataset and model objects that you should need to make a prediction. The value to predict is the Titles field, that has been targeted as our objective using the --objective option. So, in just one line and with no coding at all, you have been able to add each and every label of your Skills, Languages and Titles as an independent new feature available to be used as predictor or objective in your models.

Still, you might be missing additional features, such as the number of languages or the last occupied position. BigMLer has also added a new --label-aggregate option that can help you with that. You can use count to create a new field holding the number of occurrences of labels in a multi-label field and first or last to generate a new field with the first or last label found in the multi-label fields. In our example, we could use

and some Titles - count, Languages - count, Skills - count, Titles - last, Languages - last, Skills - last new fields will be added to our original source data and used in model building.

Selecting significant input data

We have just built an extended file and generated the BigML resources you need to create a trained model. Nevertheless, we focused mainly in showing the advantages that BigMLer offers to build up new features from multi-labeled fields, disregarding the convenience of including or excluding some of them as inputs for our model. In our example, the ID and Name fields, for example, should be excluded from the model building input fields, as they are external references that have nothing to do with the Titles values (we don’t expect all CEO’s to be named Francisco). Also, once the new label fields are included, we may prefer to exclude the prior multi-labeled fields from the model input fields to ensure that the prediction rules are only based in separate labels. For the Titles objective field, we would also like to ignore the generated aggregation fields to avoid useless prediction rules, like saying that if the last occupied post in a profile is Manager, then the profile is suitable to be a Manager. In addition to that, we exclude the Age, Gender and Marital Status fields because we don’t want these features to influence our analysis. This is how we chose to build our final model:

where the --model-fields option has been set to a comma-separated list of all the fields that you would like to be excluded as input for your model prefixed by a - sign. Then, having a look at the generated models, one for each position, some rules appear. You can see the entire process in the next video:

According to our data, Students have a limited number of skills such as web programming but lack others such as Business Intelligence (that is found in Managers‘ profiles), Algorithm Design (frequently found in Engineers‘ profiles) or Software Engineering Management (appearing in CEOs‘ profiles). Similar patterns can be found in the Engineers‘ model, the Managers‘ model and the CEOs‘ model, so that when we come across new profiles, you could use these models (tagged with the multi-label-recruiting text) to predict the positions they would be suitable to by calling

where new_profiles.csv would contain the information of the new profiles in CSV format. The predictions.csv file generated by BigMLer would store the predicted values for each new profile and their confidences separated by comma, all ready to use.

This is a simplified example of how BigMLer can empower you to easily use multi-labeled fields in your machine learning system. It can split the fields’ contents, generate new binary fields for each existing label and even aggregate the labels information with count, first or last functions. BigMLer‘s functionality keeps growing steadily, and new options like weighted models and threshold limited predictions are ready for you to try–but this will be the subject for another post, so stay tuned!

So you want to predict the Super Bowl?

One of the great things about sports these days are the number of bright minds taking fresh looks at sports statistics and analytics. This has been particularly true in baseball, where there’s healthy debate between baseball Traditionalists (who favor subjective-based scouting reports and place value on traditional statistics like Home Runs, Wins, etc) and “Sabermetricians” (who favor advanced statistics—many of which were created by Nate Silver), with more and more credence being lent to analytic-driven decisions for roster composition, player salaries and the like. The Sabermetrics approach to baseball is well-documented in “Moneyball” and throughout the baseball blogosphere.

Football (the American variety) also has a burgeoning statistical community, and Football Outsiders is one of the leaders in this arena. At the core of their approach is their proprietary Defense-adjusted Value Over Average (DVOA) system that breaks down every single NFL play and compares a team’s performance to a league baseline based on situation in order to determine value over average—which they explain in detail here. Another example of cool football analytics in action is this report by Lock Analytics on how to use Random Forests to estimate a game’s win probability on a play by play basis.

super-bowl-48-broncos-vs.-seahawks
What was the objective of the model?
With the Super Bowl being right around the corner, we thought it would be interesting to work with some of Football Outsiders’ historical data plus key football gambling metrics to see if we could come up with predictions for the big game.

What is the data source?
We used a few sources to create a dataset for this project. First and foremost, we pulled statistical data from Football Outsiders. Then, to get some added context on the Super Bowl for some of the betting metrics, we pulled historical scores, point spreads and over/under totals from Vegas Insider.

What was the modeling strategy?
As stated, we combined Football Outsiders’ DVOA-driven analysis with historical betting statistics, with the objectives being to predict the NFC (Seattle Seahawks) and AFC (Denver Broncos) total points. However, since there were only relevant statistics from the past 25 seasons we had a pretty small dataset. As such, we decided to build a 100-model ensemble in order to gain a stronger predictive value.

What fields were selected for the ensembles?
We used AFC & NFC Efficiency—both weighted (factoring in late-season performance) and non-weighted; AFC & NFC Offensive & Defensive ranks (weighted & non-weighted), AFC & NFC Special Teams ranks, the point spread, and the over/under total. We also included historical points scored for both the AFC & NFC team from past Super Bowls. You can view and clone the full dataset that we built here, although you’ll have to deselect some of the fields if you want to replicate this specific ensemble.

What did we find?
Using the Prediction function within the BigML interface, we entered the key values for this season to predict both AFC and NFC points (pictured below)

SB prediction

And the winner is..:
As you can see below, the result of our prediction is NFC 26.38 (with error ±4.95), AFC 22.16 (with error ±2.20). This reflects a prediction using confidence weighting for the ensemble. If we use a plurality-based prediction the result is NFC 29.93 (±13.74), AFC 23.97 (±6.0).

NFC prediction

AFC prediction

Incidentally, we also ran a categorical ensemble to simply predict the winner (NFC/AFC), and that ensemble also pointed to a Seahawks’ victory, with ~76% confidence. Another ensemble predicting the winner against the spread showed Seattle with a ~72% confidence.

Important Caveats:
But before you Seattle denizens start planning a victory parade through the Emerald City, please bear in mind a few things:

  • These ensembles were built from small datasets so treat the results with a grain of salt. In other words, don’t wager your mortgage on 24 rows of data.
  • Factoring in the expected error (which reflects our 95% confidence threshold), Denver could still win.
  • This blog was authored by a lifelong Seahawks fan.

Last but not least, do not forget the greatest football variable of all, which a wise data analyst recently described to me as the “Any Given Sunday” coefficient…

Share the SunBurst!

When BigML first launched the SunBurst visualization for decision trees, I was amazed at what a big improvement it was over the traditional decision tree viz. Instead of showing each node as the same size, the SunBurst shows each node as an arc with length proportional to its number of instances.  You can then color these arcs in different ways—my favorite is “confidence view”, which colors each arc according to the amount of separation between classes. So to find the strongest patterns in your model, just look for the longest arcs that are the brightest green (and to find weak spots, just look for arcs that are brown or red).

SunBurst viz of StumbleUpon model

Today we launched the Open SunBurst feature, which lets you drop a SunBurst model into any web page. This is particularly useful for blogs and news sites who want to present an interactive model to their readers as part of a broader story.  Simply copy this bit of HTML from your model’s “secret link” section:

Image

The result is a miniaturized version of the SunBurst, suitable for embedding in a news article or blog post. You still get all the interactivity of the SunBurst, including the ability to zoom in on a node, mouse over nodes to find rules, and toggle between prediction view and confidence view. Check out the video below for more details!

Using Dataset Transformations and Machine Learning to Assess Loan Risk

We had a great turnout for Tuesday’s webinar that detailed BigML’s Winter 2014 Release.  In the webinar, BigML’s CIO Poul Petersen used loan data from Prosper (a peer-to-peer lending site) to assess risk for prospective loan applicants.

Some of the key features that were highlighted in the webinar include:

  • Dataset filtering
  • Dataset sampling
  • How (and why) to use weights to balance models
  • How to adjust the node threshold (i.e., the depth) in BigML’s decision trees
  • How to add new fields to a dataset
  • K-threshold ensembles
  • Batch predictions

Check it out for yourself here and stay tuned for details on our next webinar, which will focus on Programmatic Machine Learning via BigML’s API!

A Simple Machine Learning Method to Detect Covariate Shift

Building a predictive model that performs reasonably well scoring new data in production is a multi-step and iterative process that requires the right mix of training data, feature engineering, machine learning, evaluations, and black art. Once a model is running “in the wild”, its performance can degrade significantly when the distribution generating the new data varies from the distribution that generated the data used to train the model. Unfortunately, this is often the norm and not the exception in many real-world domains. We briefly described this issue recently. This problem is formally known as Covariate Shift, when the distribution of the inputs used as predictors (covariates) changes between training and production stages, or as Dataset Shift, when the joint distribution of inputs and the output (the target being predicted) also changes. Both Covariate Shift and Dataset Shift are receiving more attention from the research community. But, in practical settings, how can you automatically detect that there’s a significant difference between training and production data to properly take action and retrain or adjust your model accordingly?

In this post,  I’m going to show you how to use Machine Learning (as it couldn’t be otherwise) to quickly check whether there’s a covariate shift between training data and production data. You read it right: Machine Learning to learn whether machine-learned models will perform well or not. I’ll describe a quick and dirty method and leave rigorous explanations and formal comparisons with other techniques (e.g., KL-divergence, Wald-Wolfowitz test, etc)  for other forums. The method will also be helpful as a sneak peek of a couple of new exciting capabilities that we just brought to BigML: dataset transformations and multi-dataset models.

Covariate Shift

Let me start giving you the basic intuition behind the method.

The Basic Idea

Most supervised machine learning techniques are built on the assumption that data at the training and production stages follow the same distribution. So, to test whether that is the case in a particular scenario, we can just create a new dataset mixing training and production data, where each instance in the new dataset has been labeled either “train” or “production”, according to its provenance. In the absence of covariate shift, the distributions of the train and production data instances will be nearly identical, and we would have no easy way of distinguising between them.  On the other hand, when the distributions of the two labels difer (i.e., when we do have a covariate shift in our data), it must be possible to predict whether an instance will have either the “train” or the “production” label.  So our strategy to detect covariate shift will consist on building a predictive model with the provenance label as its target, and evaluating how well it scores when used to sift training from production data.  More specifically, these are the steps we will follow:

  1. Create a random sample of your training data adding a new feature (Origin) with the value “train” to each instance in the sample.
  2. Create a random sample of your production data adding a new feature (Origin) with the value “production” to each instance in the sample.
  3. Create a model to predict the Origin of an instance using a sample (e.g., 80%) of the previous samples as training data.
  4. Evaluate the new model using the rest (e.g., 20%) of the previous samples as test data.
  5. If the phi coefficient of the evaluation is smaller than 0.2 then you can say that your training and production data are indistinguishable and they come from the same or at least very similar distribution. However, if the phi coefficient is greater than 0.2 then you can say that there’s a covariate shift and your training data is not really representative of your production data.
  6. To avoid the result to be just a matter of chance, you should run a number of trials and average the result.

I leave a formal explanation about the right number of trials and the right size of the samples for other forums but if you run at least 10 trials and use 80% of your data you’ll be on the safe zone. Another topic for a formal discussion would be threshold setting, that is, how to estimate the degradation of our model as phi increases; e.g., if phi is, say, 0.21, how bad is our model?… It will be disastrous in some domains and unnoticeable in others.

Next, I will show you how to quickly test the idea using as an example the training and test data (using the latter as “production” data) of the Titanic dataset as they were provided at the Kaggle competition, but removing the label and the ids of each instance.

Testing for Covariate Shift in a few API calls

I’m going to use BigML’s raw REST API within a simple bash script. A simpler function providing covariate shift checking will be available in BigML’s bindings, command-line and web interfaces very soon.

First of all,  I create a source for the training data (lines 58-61) and another for the production data (lines 63-66) using their respective remote locations defined at the beginning of the script (lines 22 and 23). Then I create a full dataset for the training source (lines 70-73) and another for the production source (lines 76-79). Then I sample the training data and add a new field named “Origin”, with the value “train” to each instance (lines 90-98) and do the same for the production data but adding the value “production” to each instance (lines 100-108).

The next step is to create a multi-dataset model using both datasets above. Notice that I also subsample the dataset with a bigger number of instances to make sure that the class distribution (train and production) is balanced. I also specify a seed to make sure that I can later evaluate against a complete disjoint subset of the data. Once the model is created, I’m ready to create an evaluation of the new model with the portion of the dataset that I didn’t use to create the model (“out_of_bag”: true). I iterate 10 times and in each iteration display and save the phi coefficient to finally compute and show the average phi at the end of the 10 iterations.

No covariate shiftAs the average phi coefficient is a poor 0.1201, We conclude that there’s not a covariate shift.

Next I’m going to run the same test but I’ll induce a covariate shift first. To do that, I filter the training data to just contain instances of “male” passengers and filter the production data to just contain instances of female passengers (just uncommenting lines 31 and 32 makes the trick).

titanic_covariate_shiftIn this case, the average phi turns to be 0.6564 so I can say that there is a covariate shift between the training data and production. I shared one of the models of the last test here. You can see that Name  is a great predictor for Sex or, in other words,  it gets almost perfect discrimination between the training and production data. So  just to eliminate any suspicions about the method, I will repeat the test again in this case excluding the field Name from the model (uncommenting line 35). As you can see below, the method still detects the covariate shift in this case with an average phi of 0.3635.

Titanic Covariate Shift No Name

Once you detect a covariate shift you’ll need to retrain your model using new production data. If that is not feasible and depending on the nature of your data there is a number of techniques that can help you adjust your training data.

Notice that In the API calls above, I used two new BigML features:

  • Dataset transformations: I was able to filter and sample a dataset based on the value of a field, and also adding new values to existing datasets.You’ll see in an upcoming post many other ways to add fields to an existing dataset, as well as how to remove outliers, discretize continuous variables or creating windows, among many other tricks.
  • Multi-dataset models:  I could create a model using two datasets as input, sampling each one differently/at differing rates. We’ll also see in a future post that you can create models with up to 32 different datasets individually sampled. This can be very useful to deal with imbalanced datasets or to weight your datasets differently.

Both new capabilities are available via our API and soon via our web interface as well. As you are probably beginning to see, BigML is opening a new venue in automating predictive modeling in the cloud, in ways that had never been available before.

Recap

To quickly check whether there’s a covariate shift between your training and production data you can create a predictive model using a mix of instances from the training and production data. If the model is capable of telling apart training instances from production instances then you can say that there’s a covariate shift. The idea behind this method is not new, I first heard of it from Prof. Tom Dietterich, but it’s probably part of the “folk knowledge” needed to develop robust machine learning applications, and it is hard to track it to a specific or original first reference. Anyway, being able to implement it in a few API calls is kind of cool, isn’t it? It will be even cooler once BigML provides it just one call or click away. Stay tuned!

Start 2014 with Faster, Easier, and more Programmatic Machine Learning

Happy New Year everybody! It has been a while since we last blogged, but for good reason: we’ve been busier than ever developing our Winter release and serving our fast-growing user base at the same time. By now, BigML has been used to create more than 600,000 active predictive models—half of them were created during the last few weeks. We started 2013 with less than 10,000 models and we’re on track to eclipse the millionth model mark early in 2014.

BigML’s 2014 Winter release constitutes a significant step forward in three directions:

  • Speed: you can now build a model in 1/8 of the time it previously took and also get fast, real-time predictions with BigML’s PredictServer.
  • Programmability: we have empowered our API with programmatic means to filter, sample, or extend with new fields any dataset or list of datasets. Dozens of pre-built functions are already available and defining new functions is very easy.
  • Data-driven decisions for everyone:  BigML’s new development mode allows you to run unlimited tasks of up to 16 MB for FREE, making BigML the ideal framework to practice, teach, and learn machine learning or predictive analytics. There’s no excuse to not start making data-driven decisions today!

BigML

Let’s now quickly review the most salient new features. In the coming days we’ll have added blog posts to explain a few of them in further detail.

Much faster models

A third generation of our algorithm has significantly improved our performance when it comes to model building. To give you a quick idea, modeling a dataset with 50 fields and a little more than 500,000 rows (~80 MB of data) took around 8 minutes in our previous version. Now, it takes less than a minute.

Weighted models

For use cases like fraud or intrusion detection, ad click prediction, or best customer retention, the class of interest is the minority class (i.e., the number of instances of the class of interest is under-represented compared to the number of instances of the other classes). These situations are called imbalanced and they constitute a serious challenge, since most traditional supervised machine learning algorithms overlook the minority classes in favor of the majority ones. Techniques like under-sampling or over-sampling the input data can help, but they usually provide sub-optimal performance.

BigML’s new algorithm comes with three ways to elegantly cope with this problem and create weighted models. Using them you’ll be able to build models that will consider at building time every instance or class according to the weight criteria that you establish.

Dataset Transformations

We have developed a Lisp-like language codenamed Flatline (after the legendary cyber-cowboy McCoy Pauley) and open sourced the specification here. Flatline can be used for both filtering the rows and columns of a dataset and for generating new fields. For example, if the temperature in your dataset is expressed in Fahrenheit degrees you can easily transform it to Celsius using a single Flatline expression.

Flatline allows you to horizontally select different fields in the same row or select a finite sliding window of rows to traverse a dataset vertically. This is useful to generate values based on a number of front and rear values. In other words, you can generate a new field based on computing a function over the previous values of another field. Imagine adding a new field with the 7-day average of the daily maximum temperatures.

Flatline comes with dozens of pre-built functions and it makes very easy to perform standard analytics tasks on your dataset, such as discretezing continuous variables, removing outliers, replacing missing values, normalizing variables, etc. Flatline expressions have a json-like equivalent for those who prefer it.  You can read more about BigML’s dataset transformations here.

Being able to programmatically transform a dataset via a high-level language and a cloud-based API together, plus the rest of features that we will mention below, opens a new range of possibilities to program machine learning tasks on the cloud as it was not available before. We’ll give you some examples in the next days. For example, detecting covariate shift between your training data and production data in only a few API calls.  We start calling this new paradigm “Programmatic Machine Learning“.

BigML

Multi datasets and multi-dataset models

Another cool feature in our Winter release is the ability to create a dataset using multiple datasets as input. This is very useful when you need to combine multiple sources of data into a single dataset or when you want to build an online solution that collects data in batches. You can even sample each dataset individually.

Also, you can use multiple datasets as an input to build models, ensembles and evaluations (i.e., you do not need to first merge them into a single dataset). You can read more about multi-datasets here.

New Prediction Strategies

We have developed a second strategy to deal with missing values in your input data. So far, when the input data that you use to generate a new prediction contained a missing value, BigML returned the prediction that had been computed up to the node (split) that needed that input. We call this strategy last prediction.

Missing strategy

Now you can choose an alternative strategy named proportional that will evaluate all the subtrees of a missing split and will recombine their predictions based on the proportion of data in each subtree.

We have also developed a new threshold-based combiner for classification ensembles that comes in handy to implement either conservative or aggressive prediction strategies.

Threshold-based Combiner

This combiner lets you trigger predictions based on a given threshold k. Imagine that you have created a 20-model ensemble to detect intruders in a computer network and you use a k-threshold of  1 for predictions.  Then, as long as a single model in the ensemble predicts true, the ensemble will predict that an intrusion is going on. Or imagine that you have another 30-model ensemble to predict the success of marketing campaigns and this time you want to reduce the number of false positives. Then, you can set up k to a high value like 27 or 28 to make sure that you do not spend your dollars with customers who won’t react to your campaign.

New Development Mode

We noticed that many BigML users weren’t aware of our FREE development mode. As a result, many users would experiment with the promotional datasets that we offer in production mode and soon run out of credits before finishing their initial projects. To address this, we have now made the development switch more obvious in our web interface and have also increased the max task size to 16 MB.

Development Mode

BigML’s development mode has 3 limitations compared to production mode: 1) the maximum number of models of an ensemble cannot be higher than 10; 2) the maximum number of terms in text analysis is limited to 32; and, 3) the maximum number of nodes in a tree cannot be higher than 512.  All other features are exactly the same and you can run unlimited tasks.

BigML FREE

A few more things

That’s not all. There are some other great features in store like a new search box in your dashboard, multi predictions in Excel-exported models, sharing evaluations via private links, and much, much more. We’ll soon list all of these features in our What’s New section. Last but not least, BigML’s PredictServer is now available from Amazon Marketplace.

So let 2014 be the year that you start adding predictive analytics to your business, and that you start building Predictive Apps with BigML. To help you out, we are offering an extra 15% discount to the first 50 annual BigML subscriptions contracted this year. If you are interested in one, send us an email to info@bigml.com and we’ll be happy to send a coupon your way.

BigML

Follow

Get every new post delivered to your Inbox.

Join 110 other followers

%d bloggers like this: