
BigML Clusters in Action!

After a few weeks in Beta, we’ve now released BigML Clusters in final form into the BigML interface and also as part of our Python bindings—many thanks to all of you who gave feedback on this important new piece of BigML functionality!

Curious to see what you can do with BigML clusters? Check out our archived webinar where Poul Petersen gives a high-level overview of clustering concepts in general, and then digs deeper to walk through several use cases ranging from grouping similar tasting whiskies (Item Discovery) to identifying high-value mobile gaming users (Customer Segmentation) to disease diagnoses (Active Learning). The webinar showcases how to leverage clusters both in the BigML interface, as well as through our underlying API via an iPython Notebook.

You can find the YouTube video here, the SlideShare presentation here, and the iPython notebook (from the Active Learning use case) here.

clusters

Got more questions about clustering? Feel free to email us, or better yet, join one of our weekly Google Hangouts on Wednesday at 10:30 AM Pacific where we interact with BigML customers to answer all sorts of machine learning and predictive analytics questions.

The People Who Would Teach Machines to Learn

Have you ever had an idea about how the human mind works? Douglas Hofstadter has had that idea. He’s also thought of all of the arguments against it and all of the counter arguments to those arguments. He’s refined it, polished it, and if it was really special, it’s in one of his books, expressed with impeccable clarity and sparkling wit. If you haven’t attempted to read his book Gödel, Escher, Bach, try now. For sheer intellectual scope, it’s a singular experience.

A few months ago, there was an article in the Atlantic profiling Dr. Hofstadter and his approach to AI research. It was well written and I thought it gave unfamiliar readers a reasonable sense of Hofstadter’s work. It contrasted this work with machine learning research, however, in a way that minimized the scope and quality of the work being done in that area.

In this little riposte, I’ll present my case that the machine learning research community is doing work that is just as high-minded as Hofstadter’s when viewed from the proper angle.  One caveat: It may at times sound like I am unnecessarily minimizing Hofstadter’s work. I can assure you I have neither the inclination nor the intelligence to offer that or any such opinion. All I contend here is that his approach to the problem he’s trying to solve isn’t the only one, and there are reasons to believe the machine learning approach is also valid.

AI brain

Everybody Loves A Good Story

The Atlantic story goes something like this: Dr. Hofstadter has a singular ambition, to build a computer that thinks like a human. This leads to a lot of speculation, research, and programming, but little in the way of what academics call “results”. He doesn’t publish papers or attend conferences. He is content to work with his small group of graduate students on the “big questions” in artificial intelligence, trying to replicate mind-level cognition in computers, and have his work largely ignored by the rest of the AI community.

Meanwhile, the rest of the AI community has enjoyed considerable scientific and commercial success, albeit on smaller questions. Techniques born in AI planning save billions of dollars and countless person-hours by automating logistics for many industries. Computer vision systems allow individual faces to be picked out from a database of tens of thousands. And machine learning has turned raw data into useful systems for speech recognition, autonomous vehicle navigation, medical diagnosis, and on and on.

But none of these systems come close to mimicking the human mind, especially in its ability to generalize. And here, some would say, we have settled for less. We have used techniques inspired by cognition to craft some nice solutions, but we’ve fallen short of the original goal: To make a system that thinks.

So Dr. Hofstadter, unimpressed, continues to work on the big questions in relative solitude, while the rest of AI research, lured by the siren’s song of easy success, attacks lesser problems.

It’s a nice, clean story:  The eccentric but brilliant loner versus the successful but small-minded establishment, and I suppose if you’re reading a magazine you’d prefer such a story to a messy reality. But while there’s certainly some truth to it, I don’t see it in nearly such black and white terms. Hofstadter’s way of working is to observe the human mind at work, then write a computer program that you can observe doing the same thing. This is, of course, a very direct approach, but I’m not convinced it’s the only or even necessarily the best one.

In the other corner, we have machine learning, frequently implied in the article as being one of the easier problems that Hofstadter avoids. Surely research into such a (relatively) simple problem would never contribute directly towards creating a human-level intelligence, would it?

I’m going to say that it just might. Here are three reasons why.

Accidental Intelligence

Remember where cognition came from: a system (evolution) did its best to produce a solution to the following optimization problem:

  1. Survive long enough to have as many children as possible
  2. (Optional) Assist your children in executing step 1

Out of this (this!) came a solution that, as a side effect, produced things like Brahms’ Fourth Symphony and the Riemann hypothesis.

With this in mind, it doesn’t seem unreasonable to think that a really marvelous solution to problems as general and useful as those in machine learning might lead to a system that exhibits some awfully interesting ancillary behaviors, one of which might even be something like creativity.

Am I saying that this will happen, or that I have any idea how it might? Absolutely not. But look at how the human mind was created in the first place, and tell me honestly that you would have seen it coming. The question is whether the set of problems that machine learning researchers tackle is interesting enough to inspire solutions with mind-level complexity.

There are suggestions in recent research that they might be.  In the right hands, the process of solving “simple” machine learning problems can produce deep insights about how information in our world is structured and how it can be used. I’m thinking of, for example, Josh Tenenbaum and Tom Griffiths’ ideas about inductive bias, Andrew Ng’s work on efficient sparse coding, and Carlos Guestrin’s formulation of submodular optimizations. These ideas say fundamental things about many problems that used to require human intelligence to solve. Do these things have any relationship to human cognition? Given the solutions they’ve led to, it would certainly be hasty to dismiss the idea out of hand.

The Opacity of the Mind

Several years ago, a colleague and I were developing a machine learning solution that could figure out if a photograph was oriented right-side up or not.  After much algorithmic toying and feature engineering, we reached a point where I said, “Well, it works, but this is totally unrelated to the way people do it”.

My colleague’s reply was “How do you know? You’re imagining that you can keep track of everything that happens in your own mind, but you could just be inventing a story after the fact that explains it in a simpler way.  When you recognize that a photo is upside-down, it probably happens far too fast for you to recognize all of the steps that happened in between.”

There’s a fundamental problem in trying to model human cognition by observing cognitive processes, which is that the observations themselves are results of the process.  If we create a computer program consistent with those observations, have we modeled the actual underlying process, or just the process we used to make the observations?  Hofstadter himself admits the difficulty of the problem, and says that his understanding of his own mind’s operation is based on “clues” like linguistic errors and game-playing strategies.  As of now, our understanding of our own subconscious is vague at best.

Rather than trying to match observations that might be red herrings, why not use actual performance as the similarity metric?  Machine learning algorithms can learn to recognize handwritten digits nearly as well as humans can and often make the same mistakes.  Isn’t it at least possible that the inner workings of such an algorithm bear some relationship to human cognition, regardless of whether that relationship is easily observed?

To Learn is to Become Intelligent

Hofstadter, while working on the big questions, generally works on them in toy domains, like word games, trying to make his machines solve them like humans might. Machine learning research focuses on a more universal problem:  How to make sense out of large amounts of data with limited resources.

Consider the data that has been fed into the brain of a five-year-old. We can start with several years of high resolution stereo video and audio. Hours of speech. Sensory input from her nose, mouth, and hands. Perhaps hundreds of thousands of actions taken and their consequences. Using this data, a newborn who cannot understand language is transformed into an intelligence that can play basic word games.  Can there be any doubt that life is at least in part a big data problem?

Moreover, consider how important learning is to our perception of intelligence. Suppose you had two computer systems that could answer questions in fluent Chinese (with a nod to John Searle). The first was built by a fleet of programmers and turned on yesterday. The second was programmed five years ago, had no initial ability to speak Chinese, but slowly learned how by interacting with hundreds of Chinese speakers. Which system would you say has more in common with the human mind?

Big Answers from Small Questions

Is the machine learning research community suffering from a lack of ambition? I don’t buy it. Just because the community isn’t setting out, in general, to build a human-level intelligence doesn’t mean they’re headed in a different direction. After all, the systems we’ve built so far can translate between languages, recognize your speech, face, and handwriting, pronounce words they have never seen, and do it all by learning from examples.

If the machine learning community does create a human-level intelligence, it may not be any one person who had the “aha!” idea that allowed it to happen. It might be more like what Minsky envisioned in “Society of Mind”: not a single trick but a collection of specialized processes, intelligent only in combination.

If that sounds like a cop-out, like an excuse not to look for big answers, consider the path followed by children’s cancer research: Science has yet to find a single wonder drug or treatment that cures all childhood cancers, yet decades of piecemeal research on the parts of the problem has driven cure rates from about 10% to nearly 90%. A bunch of people working on smaller problems may not look like much in situ, but the final product is what matters most.

Follow your data’s inner voice! Evaluation-guided techniques for Machine Learning

Spring has come, and the steady work of gardening data is starting to bloom in BigML. We’ve repeatedly stressed in our blog posts the importance of listening to what data has to tell us through evaluations.  A couple of months ago, we published a post explaining how you could achieve accuracy improvements in your models by carefully selecting subsets of features used to build them. In BigMLer we stick to our evaluations to show us the way, and we’d like to introduce you to the ready-to-use evaluation-guided techniques that we’ve recently added to BigMLer’s wide list of capabilities: smart feature selection and node threshold selection. Both are available via a new BigMLer subcommand, analyze, which gives access to these new features through two new command line options.  Now, the power of evaluation-directed modeling starts in BigMLer, your command line ML tool of choice.

k-fold cross-validation

To measure the performance of the models used in both procedures, we have included k-fold cross-validation in BigMLer. In k-fold cross-validation, the training dataset is divided into k subsets of equal size. One of the subsets is reserved as holdout data to evaluate the model generated using the rest of the training data. The procedure is repeated for each of the k subsets and, finally, the cross-validation result is computed as the average of the k evaluations generated. In BigMLer, k-fold cross-validation is run through the analyze subcommand, which takes the id of the training dataset, for example dataset/536653050af5e86d9c01549e (if you need help creating a dataset in BigML with BigMLer, you can read our previous posts).
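The procedure is easy to sketch in plain Python. This is only an illustration of the idea, not BigMLer's implementation: the helper names (k_fold_indices, cross_validate) and the toy majority-class "model" are invented for the example, and the ten labels are made up.

```python
from collections import Counter


def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k folds of (nearly) equal size."""
    return [list(range(i, n_rows, k)) for i in range(k)]


def cross_validate(rows, labels, k, train, evaluate):
    """Average the evaluation score over k train/holdout splits."""
    scores = []
    for holdout in k_fold_indices(len(rows), k):
        held = set(holdout)
        train_idx = [i for i in range(len(rows)) if i not in held]
        # Build the model on everything except the holdout fold...
        model = train([rows[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        # ...and evaluate it on the holdout fold only.
        scores.append(evaluate(model,
                               [rows[i] for i in holdout],
                               [labels[i] for i in holdout]))
    return sum(scores) / k


def train_majority(rows, labels):
    """Toy model: always predict the most common training label."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(model, rows, labels):
    """Fraction of holdout labels equal to the toy model's prediction."""
    return sum(1 for y in labels if y == model) / len(labels)


rows = list(range(10))                 # made-up instances
labels = ["a"] * 8 + ["b"] * 2         # made-up target values
cv_accuracy = cross_validate(rows, labels, 5, train_majority, accuracy)
```

Any real model and evaluation measure can be plugged in through the train and evaluate arguments; the averaging step stays the same.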

Smart feature selection

Those who read our previous post on this matter will remember that, following the article by Ron Kohavi and George H. John, the idea was to find the subset of available features that will produce a model with better performance. The key thing here is being clever about the way of searching for this feature subset, because the number of possibilities grows exponentially with the number of features. Kohavi and John use an algorithm that finds a shorter path by starting with one-feature models and scoring their performance to select the best one. Then, new subsets are repeatedly created by adding each remaining feature to the best subset, and its scoring is used to select again. The process ends when the score stops improving or you end up with the entire feature set. In that post, we implemented a command tool using BigML’s python bindings to help you do that process using the accuracy (or r-squared for regressions) of BigML’s evaluations, but now BigMLer has been extended to include it as a regular option.

So let’s say you have a dataset in BigML and you want to select the dataset’s features that produce better models. You just run the feature selection option of bigmler analyze on that dataset, and the magic begins. BigMLer:

  • Creates a 5-fold split of the dataset to be used in evaluations.
  • Uses the smart feature selection algorithm to choose the subsets of features for model construction
  • Creates the 5-fold evaluation of the models created with each subset of features and uses its accuracy to score the results, choosing the best subset.
  • Outputs the optimal feature subset and its accuracy.

You probably realize that this procedure generates a large number of resources in BigML. Each k-fold evaluation generates k datasets and, for each feature subset that is being tested, k more models and evaluations are created. Thus, you will probably want to tune the number of folds, as well as other parameters like the penalty per feature (used to avoid overfitting) or the number of iterations with no score improvement that causes the algorithm to stop. This is easy: for example, you can ask BigMLer to use a penalty of 0.002 per feature and stop the search the third time the score does not improve in 2-fold evaluations. You can even speed up the computation by parallelizing the creation of the k models (and evaluations), so that all models and evaluations are created in parallel.
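To make the search concrete, here is a simplified sketch of a greedy forward selection with a per-feature penalty and a "no improvement" counter. It is not BigMLer's code: select_features, its parameter names and the toy scoring function are assumptions made for illustration, and in practice score(subset) would be a k-fold evaluation measure such as accuracy.

```python
def select_features(features, score, penalty=0.0, staleness=1):
    """Greedy forward search over feature subsets.

    `features` is a list of feature names (strings).
    `score(subset)` returns an evaluation measure; higher is better.
    `penalty` is subtracted once per feature in the subset.
    `staleness` is the number of non-improving steps that stops the search.
    """
    best_subset, best_score = [], float("-inf")
    current, remaining, stale = [], list(features), 0
    while remaining and stale < staleness:
        # Score every subset obtained by adding one remaining feature.
        candidates = [
            (score(current + [f]) - penalty * (len(current) + 1), f)
            for f in remaining
        ]
        cand_score, cand_feature = max(candidates)
        current = current + [cand_feature]
        remaining.remove(cand_feature)
        if cand_score > best_score:
            best_subset, best_score = list(current), cand_score
            stale = 0
        else:
            stale += 1
    return best_subset, best_score


# Toy score (invented): only the two petal fields carry any signal.
USEFUL = {"petal length": 0.60, "petal width": 0.30}


def toy_score(subset):
    return sum(USEFUL.get(f, 0.0) for f in subset)


features = ["sepal length", "sepal width", "petal length", "petal width"]
subset, value = select_features(features, toy_score,
                                penalty=0.002, staleness=3)
# subset is ["petal length", "petal width"]
```

The penalty is what keeps the two uninformative fields out: adding them never raises the toy score, so each extra feature only costs 0.002.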

But sometimes a good accuracy in evaluations will not lead you to a really good model for your Machine Learning problem. Such is the case for spam or anomaly detection problems, where you are interested in detecting the instances that correspond to very sparse classes. The trivial classifier that always predicts false for this class will have high accuracy in these cases, but of course won’t be of much help. That’s why it may sometimes be better to focus on precision, recall or composed metrics, such as phi or f-measure. BigMLer is ready to help here too: choosing recall with the --maximize option will extract the subset of features that maximizes the evaluations’ recall.
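A small worked example shows why accuracy can mislead on sparse classes. The data below is invented (990 legitimate messages, 10 spam), and the evaluate helper is a hypothetical stand-in for a real evaluation, not a BigML call.

```python
def evaluate(actual, predicted, positive):
    """Return accuracy, precision and recall for the `positive` class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)
    return {
        "accuracy": correct / len(pairs),
        # Ratios with an empty denominator are reported as 0.0 here.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Invented toy data: 990 legitimate messages and 10 spam messages.
actual = ["ham"] * 990 + ["spam"] * 10
always_ham = ["ham"] * 1000            # the trivial classifier
trivial = evaluate(actual, always_ham, positive="spam")
# accuracy is 0.99, yet recall for "spam" is 0.0
```

The trivial classifier is right 99% of the time and still finds none of the spam, which is exactly the situation where recall is the better score to maximize.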

Node threshold selection

Another parameter that can be used to tune your models is their node threshold (that is, the maximum number of nodes in the final tree). Decision tree models usually grow a large number of nodes to fit the training data by maximizing some sort of information gain on each split. Sometimes, though, you may be interested in growing them until they maximize a different evaluation measure. BigMLer’s analyze subcommand offers an option that will look for the node threshold that leads to the best score. The command will:

  • Generate a 5-fold split of a dataset to be used in evaluations
  • Grow models using node thresholds going from 3 (the minimum allowed value) to 2000, in steps of 100
  • Create the 5-fold evaluation of the models and use its accuracy as score
  • Stop when scores don’t improve or the node threshold limit is reached, and output the optimal node threshold and its accuracy
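The steps above amount to a simple sweep, sketched here in Python. The function name and the toy scoring function are assumptions for illustration; the defaults mirror the values in the list (3 to 2000 in steps of 100), and in practice score(nodes) would be a k-fold evaluation of models grown with that node threshold.

```python
def select_node_threshold(score, min_nodes=3, max_nodes=2000, nodes_step=100):
    """Sweep node thresholds and stop as soon as the score stops improving."""
    best_nodes, best_score = None, float("-inf")
    for nodes in range(min_nodes, max_nodes + 1, nodes_step):
        current = score(nodes)
        if current <= best_score:
            break                      # no improvement: stop the search
        best_nodes, best_score = nodes, current
    return best_nodes, best_score


def toy_score(nodes):
    """Invented score that peaks at 303 nodes and decays on either side."""
    return 0.9 - abs(nodes - 303) / 10000


best_nodes, best_score = select_node_threshold(toy_score)
# The sweep visits 3, 103, 203, 303, 403 and stops, returning 303.
```

This sketch stops at the first non-improving step; a more tolerant search could allow a few stale steps before giving up.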

As shown in the last section, the associated parameters, i.e. --k-folds, --min-nodes, --max-nodes and --nodes-step, can be configured to your liking. The --maximize option is also available to choose the evaluation metric that you prefer as score. In the full-fledged version of the command, you can choose the value of every parameter.

These are the first evaluation-based tools available in BigMLer to tend to your data, but we plan to add some more to help you get the most of your models. Spring has only just begun, and more of BigML’s buds are about to burst into blossoms, providing colorful new features for you and your data. Stay tuned!

BigML Spring Release: A Cluster of Powerful New Features!

The BigML team has been hard at work the past few months. While we haven’t figured out how to work more than 24 hours in a day, we think you’ll be pleased with the key features of BigML’s Spring Release.

First and foremost, we’re excited to announce Cluster Analysis, our first foray into unsupervised learning. Other key features in the Spring release include more filter and new field options, online dataset creation, major updates to BigMLer and more.


Clusters

Now you can automatically group together the most similar instances in your dataset into different groups (i.e., clusters).

BigML’s clustering algorithm has been inspired by k-means, so you can select the number of groups to create (i.e., k) and also how much each field in your dataset influences which group each data point belongs to (i.e., scales). Similar to our trees and ensembles, you can choose to create a cluster in only one click from your dataset. By default, BigML will create 8 clusters and apply automatic scaling to all the numeric fields, although the number of clusters as well as field scaling and weighting can be easily modified when you configure your cluster.

Scaling is very important as very often datasets will contain fields with very different magnitudes. For example, a demographics dataset might contain age and salary. If clustering is performed on those fields, salary will dominate the clusters while age is mostly ignored. Generally that’s not what you want when clustering, hence the auto-scale fields (balance_fields in the API) option. When auto-scale is enabled, all the numeric fields will be scaled so that their standard deviations are 1. This makes each field have roughly equivalent influence. You can also pick your own scale for each field.
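The age and salary example can be worked through in a few lines. This is a sketch under assumptions: the text only says that fields are scaled to a standard deviation of 1, so dividing by the population standard deviation (without centering) is our choice here, and the four people are invented.

```python
import math
from statistics import pstdev


def auto_scale(column):
    """Divide a numeric column by its standard deviation (result has std 1)."""
    sd = pstdev(column)
    return [x / sd for x in column] if sd else list(column)


# Invented demographics: four people, two numeric fields.
ages = [25, 35, 45, 55]
salaries = [30000, 50000, 70000, 90000]

# Raw distance between the first two people is almost entirely salary.
raw_distance = math.dist((ages[0], salaries[0]), (ages[1], salaries[1]))

scaled_ages = auto_scale(ages)
scaled_salaries = auto_scale(salaries)

# After scaling, both fields contribute equally to the same distance.
age_gap = scaled_ages[1] - scaled_ages[0]
salary_gap = scaled_salaries[1] - scaled_salaries[0]
```

Before scaling, the ten-year age difference is invisible next to the 20,000 salary difference; after scaling, the two gaps are the same size.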

Once you build a cluster you can use it to predict the centroid (i.e., find the closest centroid for a new data point) and also to create batch centroids in the same way batch predictions work.
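Predicting a centroid boils down to a nearest-neighbor lookup over the cluster centers. The sketch below assumes purely numeric, already scaled fields and plain Euclidean distance; how BigML handles other field types is not covered here, and the centroid names and coordinates are invented.

```python
import math


def closest_centroid(point, centroids):
    """Return the name of the centroid nearest to `point`."""
    return min(centroids, key=lambda name: math.dist(point, centroids[name]))


def batch_centroids(points, centroids):
    """Apply the same lookup to many points, like a batch prediction."""
    return [closest_centroid(p, centroids) for p in points]


# Invented cluster centers in a two-field space.
centroids = {"Cluster 0": (1.0, 1.0), "Cluster 1": (5.0, 5.0)}

single = closest_centroid((1.5, 2.0), centroids)
batch = batch_centroids([(0.0, 0.5), (4.0, 6.0), (2.0, 2.0)], centroids)
```

Here single is "Cluster 0", and the batch call labels the three points "Cluster 0", "Cluster 1" and "Cluster 0".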

Cluster Analysis has been released in Beta so please let us know if something does not work as expected or if you think we’ve overlooked any helpful features.

By the way, you can also share clusters via private links in the same way you can share datasets or models:

cluster share

More Filter and New Field Options

In our Winter release we introduced Flatline, a Lisp-like language that can be used to filter rows of a dataset, and also to generate new fields using a mix of columns and rows. This language makes it easy to extend the number of filtering and new field creation options that BigML offers, and is now implemented in the BigML interface as part of our Spring Release. For example, you can now easily filter a dataset using different comparison, equality, missing value, and statistics functions. You can also create new fields by discretizing, replacing missing values, normalizing, and performing all kinds of math transformations on previous values of your dataset. Have a look at the Filter Dataset and Add Field to Dataset options, and remember that you can also use Flatline to input any complex function that you might need.

dataset filter

Segment-based dataset creation in models

Have you ever wanted to create a new dataset for further analysis from a specific node in a tree? Now you can! When you’re in a model or sunburst view, simply mouse over a node and then press your keyboard’s shift button. This will freeze the view and allow you to export the rules for that segment and/or create a new dataset with the instances at that node.

model segment

Dataset exports

Now you can also export datasets from a dataset view into a comma-separated values (CSV) file. This works very well in combination with the dataset creation above as it can help you identify the instances that follow certain criteria.

dataset export

Ensemble Summary Report

A nifty (and perhaps underappreciated) feature of BigML is the ability to get a summary report of your model which shows your predicted data distribution, field importance and associated outcomes. You can now get a similar report for your Ensembles! This is a great way to get a quick summary on what fields have the greatest impact on your predictive outcome–something that can be very illustrative when working with new and/or wide datasets.

ensemble summary report

New BigMLer

BigMLer, our popular command line tool for machine learning, now features powerful new evaluation-guided techniques to support advanced predictive modeling. Specifically, through a new subcommand bigmler analyze you can quickly perform smart feature selection and node threshold selection. Feature selection detects the subset of features that will produce better models according to their evaluation measures (be it accuracy, phi or the one you like best). Node threshold selection finds out the number of nodes that your model should grow to in order to optimize evaluations. We’ll give you more details in an upcoming blog post.

And More...

There’s also a bunch of small things, like direct login with GitHub and processing Firebase URLs, and of course tons of improvements in our API, backend and infrastructure.

The Spring Release features are available immediately—simply log into your account and get started today! And be sure to let us know your feedback—we love hearing from our users and want to make sure that we continue to deliver the best machine learning platform possible.

Predicting Stock Swings with PsychSignal, Quandl and BigML

People like to tweet about stocks, so much so that ticker symbols get their own special dollar sign like $AAPL or $FB.  What if you could mine this data for insight into public sentiment about these stocks?  Even better, what if you could use this data to predict activity in the stock market? That’s the premise behind PsychSignal, a provider of “real time financial sentiment”. They harvest large streams of data from Twitter and other sources, then compute real time sentiment scores (one “bullish” and one “bearish”) on a scale from 0 to 4.  In a blog post titled Can the Bloomberg Terminal be “Toppled”?, a former Managing Director of Bloomberg Ventures asks an intriguing question: Could this kind of crowdsourced data be used to replace some of the functionality of a Bloomberg terminal?

So just for fun, we combined Quandl price and volume data with PsychSignal sentiment scores for 20 technology stocks. We then trained a simple model to predict whether the percentage “swing” (intraday high minus low) is higher or lower than the median, using only data available before the commencement of the trading day. Looking at the SunBurst view, we see a lot of bright green, which means the model is picking up some interesting correlations. For example, if the previous day’s close is down more than 3% from the previous day’s open, the opening price is less than the previous day’s close, and the previous day’s bearish signal is more than 0.84, then the model strongly predicts a price swing higher than the median (shown as a category named “2nd” in the screenshot below).
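For readers who want to see how such a target can be built, here is a rough sketch. Several things in it are assumptions, not details of the actual model: the swing is expressed as a percentage of the opening price, the label "1st" for the lower half is our guess at the counterpart of "2nd", and the price rows are invented.

```python
from statistics import median


def percentage_swing(high, low, reference):
    """Intraday high minus low, as a percentage of a reference price."""
    return 100.0 * (high - low) / reference


def label_swings(days):
    """Label each (high, low, open) row as above ("2nd") or not ("1st")
    the median swing of the whole set."""
    swings = [percentage_swing(h, l, o) for h, l, o in days]
    cutoff = median(swings)
    return ["2nd" if s > cutoff else "1st" for s in swings]


# Invented (high, low, open) rows, one per stock and day.
days = [(102, 98, 100), (51, 50, 50), (210, 200, 200), (303, 300, 300)]
labels = label_swings(days)
```

The swings here are 4%, 2%, 5% and 1%, the median is 3%, so the rows are labeled "2nd", "1st", "2nd", "1st".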

BigML Stock Swing Model

Evaluating the model on a single holdout set shows that it does much better than random guessing (see below). To be extra thorough, I used the BigML API to run 5-fold cross validation (not shown), confirming that average accuracy really is more than 64%.

Evaluation of BigML Stock Swing Model

Interestingly, if I try to predict whether a stock simply went up or down, the model is barely better than flipping a coin; for whatever reason, it’s easier to predict a tech stock’s intraday volatility than its daily gain or loss. Still, the accuracy of the “swing” model is impressive: just look at all that green in the SunBurst view. And you can bet your Greek symbols that options traders are interested in predicting volatility.

Company founder James Crane-Baker puts this all in perspective: “Social media is such a rich vein of data about investor sentiment, it would be surprising if it didn’t contain useful information.” And this data is available in real time, so you could try building a model to make predictions using same-day data. Then maybe you’ll really start seeing some green!

Using BigML and Tableau to Visualize … Graffiti?

We’ve all heard about case interview questions where the victim (sorry, applicant) is asked to explain why manhole covers are round, or estimate how many piano tuning experts there are in the world. Well, here’s a new one for you: How many incidents of graffiti have been logged in San Francisco since July 2008? The answer: more than 162,000. That’s almost 80 incidents per day for more than 5 years, and those are just the ones that got called in to 311; presumably a great deal of street art went unappreciated.

Thankfully, the City has painstakingly recorded a wealth of detail for each graffito, including whether it is “offensive” or “not offensive”. Interestingly, about 61% of incidents are deemed “not offensive”, although this is the home of the Folsom Street Fair (NSFW) so we assume folks grade on a curve.

As a bonus, the data is already geocoded, so we can easily train a model to find parts of the city likely to have (un)offensive graffiti. The result is here, with 56% of offensive graffiti contained in only 27% of the dataset, a rectangle that stretches from the south end of Market Street to Ocean Beach (to the west) and the Marina (to the north). Within this rectangle, public property is defaced by offensive orange dots while private property is, er, enhanced with relatively unoffensive blue dots. (Click the map to see the full Tableau goodness.)

Graffiti in San Francisco

I’m still pondering why this particular rectangle, and specifically the public property within this rectangle, is such a magnet for offensive graffiti. Is public property just more vulnerable to the orange stuff? But then why didn’t the model find a heavy mix of orange around Van Ness and Civic Center (never mind the Tenderloin)? Is this rectangle more sparsely populated than downtown, making people more likely to indulge in the extra-bad behavior of writing not just graffiti, but offensive graffiti? I don’t like either of these explanations; perhaps San Francisco 311 has some insight.

What I do know is that the SunBurst is full of green, which always makes me happy, and evaluation on a single holdout set shows more than 80% accuracy, which is much better than guessing. The confusion matrix tells us that the model’s main weakness is false negatives, i.e. offensive graffiti mislabeled as not offensive.


This is another great example of combining BigML and Tableau to create a compelling visualization: BigML finds a geographic pattern simply by analyzing latitude and longitude as numbers, and Tableau displays this pattern using its (really cool) built-in maps. And in typical Tableau fashion, the finding just leaps off the page.

Three Workshops on Predictive Applications

This is a guest post by Louis Dorard, author of Bootstrapping Machine Learning

When it comes to learning about Machine Learning APIs from experts, webinars are a great place to start. However, there’s nothing like in-person workshops. Today, BigML is announcing three exceptional workshops on Predictive Applications taking place in Spain in May. I will have the honour of speaking at these workshops alongside Francisco and jao, who are the CEO and CTO of BigML. Here are the three dates:

  • Bellaterra (Barcelona): May 7 at 12pm, IIIA (Artificial Intelligence Research Institute) at the Campus of the Universitat Autònoma de Barcelona.
  • Madrid: May 8 at 5pm, Wayra/Telefonica.
  • Valencia: May 13 at 4pm, Universitat Politècnica de València.

Workshop on Predictive Applications Workshop on Predictive Applications Workshop on Predictive Applications
If you’re anywhere near these cities at that time, you should definitely come! Click the links to check out the posters above and get your invite!

Some more details

The aim of these workshops is to give you a quick and practical view of why and how to embrace predictive applications.

I will start with an introduction to Machine Learning, its possibilities, its limitations, and I will provide some advice on how to apply Machine Learning to your domain. I have the privilege of being the one to introduce BigML, but I will also review some other Prediction APIs. Then, Francisco will provide an introduction to BigML’s REST API. Finally, jao’s part will be the most hands-on as he will show step-by-step how to build a predictive application in 30 minutes.

We’re also thrilled to be joined in Madrid by Enrique Dans, who authors one of the most read technology blogs in Spanish and who also blogs in English, and by Richard Benjamins, who is Group Director BI & Big Data Internal Exploitation at Telefonica.  In Valencia, we’ll be hosted by Professor Vicent Botti who leads the Artificial Intelligence Group at the Universitat Politècnica de València.

I am happy to announce to the readers of the BigML blog that these workshops will coincide with the launch of my new book on Prediction APIs, on May 7. I will cover some of its content in my talks, along with other material. You can already get a flavour of what’s coming by downloading the first chapters and the table of contents for free.

Don’t forget to send an email to info@bigml.com to request an invite and don’t hesitate to say hi if you’re coming! I can be found on Twitter: @louisdorard. See you soon in Spain!

BigML API + Matlab = Beautiful, Interactive Trees.

Whatever your preferred working environment, we want to make BigML available to you. That’s why we’re happy to announce a new set of BigML API bindings for MATLAB. These bindings, in the form of a MATLAB class, expose all the functions of BigML’s RESTful API.


Some of you may be aware that MATLAB includes a classification and regression tree implementation.  So what makes BigML different? Here are some points that stand out in our mind:

  • MATLAB presents its trees as static line drawings, as seen in the screenshot below of a classification tree trained on the iris dataset. In contrast, BigML’s models are fully interactive. You can view a model built with BigML here. We think that our model interface makes it easy to follow decision paths and tease out data patterns at a glance.
  • Users of MATLAB will admit that it’s not the most efficient computational platform, and that heavyweight scripts will quickly consume all your workstation’s resources. BigML on the other hand, is a cloud-based service. You can train your models in the background, leaving your machine free to perform other number crunching tasks.
  • One last important difference is that in order to use MATLAB’s trees, your MATLAB installation needs to include the statistics toolbox. Our API bindings will work with just a base MATLAB install.

[Screenshot: MATLAB classification tree trained on the iris dataset]

 

Keep your eyes on our blog for an application that showcases MATLAB and BigML working together. In the meantime, grab the API bindings here and start playing. We can’t wait to see what neat things you can accomplish!

Automatic Weighting of Imbalanced Datasets

Datasets are very often imbalanced. That is, the number of instances for each class of the target variable that you want to predict is not proportional to the real importance of each class in your problem. Usually, the class of interest is not the majority class. Imagine a dataset containing clickstream data that you want to use to create a predictive advertising application. The number of instances of users who did not click on an ad would probably be much higher than the number of click-through instances. So when you build a machine learning model from an imbalanced dataset, the majority (i.e., most prevalent) class will outweigh the minority classes, and the resulting predictive model will usually have suboptimal classification performance. This is known as the class-imbalance problem, and it occurs in a multitude of domains (fraud prevention, intrusion detection, churn prediction, etc.).

In this post, we’ll see how you can deal with imbalanced datasets by configuring your models or ensembles to use weights via BigML’s web interface. You can read how to create weighted models using BigML’s API here and via BigML’s command line here.


Weights

A simple solution to cope with imbalanced datasets is re-sampling, that is, undersampling the majority class or oversampling the minority classes. In BigML, you can easily implement re-sampling by using multi-datasets and sampling each class differently. However, basic undersampling usually takes away instances that might turn out to be informative, and basic oversampling does not add any extra information to your model.
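To make the idea of re-sampling concrete, here is a minimal sketch in plain Python. It is not how BigML’s multi-datasets work internally; the `resample` helper, its parameters, and the toy clickstream rows are all made up for illustration.

```python
import random
from collections import defaultdict

def resample(rows, label_of, target_per_class, seed=0):
    """Return a list with exactly target_per_class rows for each class.

    Classes with enough rows are undersampled (without replacement);
    classes with too few rows are oversampled (with replacement).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    balanced = []
    for members in by_class.values():
        if len(members) >= target_per_class:
            # Undersampling: some potentially informative rows are dropped.
            balanced.extend(rng.sample(members, target_per_class))
        else:
            # Oversampling: existing rows are repeated, no new information.
            balanced.extend(rng.choices(members, k=target_per_class))
    return balanced

# Hypothetical toy data: 95 "no-click" rows and 5 "click" rows.
rows = [("no-click", i) for i in range(95)] + [("click", i) for i in range(5)]
balanced = resample(rows, label_of=lambda r: r[0], target_per_class=20)
```

After this runs, `balanced` holds 20 rows per class, but the 20 "click" rows are drawn from only 5 distinct originals, and 75 of the "no-click" rows never reach the model. Both drawbacks are exactly the ones described above.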

Another way to avoid discarding information, and to work closer to the root of the problem, is to use weights. That is, weighting instances according to the importance that they have in your problem. This enables things like telecom customer churn models where each customer is weighted according to their Lifetime Value. Let’s next see what the impact of weighting a model might be and also examine the options for using weights in BigML.

The Impact of Weighting

Let me illustrate the impact of weighting on model creation by means of two sunburst visualizations for models of the Forest Covertype dataset. This dataset has 581,012 instances that belong to 7 different classes distributed as follows:

  • Lodgepole Pine:  283,301 instances
  • Spruce-Fir: 211,840 instances
  • Ponderosa Pine: 35,754 instances
  • Krummholz: 20,510 instances
  • Douglas-fir: 17,367 instances
  • Aspen: 9,493 instances
  • Cottonwood-Willow: 2,747 instances

The first sunburst below corresponds to a single (512-node) model colored by prediction that I created without using any weighting. The second one corresponds to a single (512-node) weighted model created using BigML’s new balance objective option (more on it below). In both sunbursts, red corresponds to the Cottonwood-Willow class and light green to the Aspen class. In the first sunburst, you can see that the model hardly ever predicts those classes. However, in the sunburst of the weighted model you can see that there are many more red and light green nodes that will predict those classes.

[Sunburst visualization: model built without weighting]

[Sunburst visualization: weighted model built with the balance objective option]

So as you can see, weighting surfaces predictions for classes that are under-represented in the input data and that would otherwise be overshadowed by the over-represented classes.

Weighting Models in BigML

BigML gives you three ways to apply weights to your dataset:

  1. Using one of the fields in your dataset as a weight field;
  2. Specifying a weight for each class in the objective field; or
  3. Automatically balancing all the classes in the objective field.
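If you prefer the API to the web interface, the three options above map to three model-creation arguments. The sketch below only builds the argument dictionaries. The argument names (`weight_field`, `objective_weights`, `balance_objective`) are given as I recall them from BigML’s API documentation, so treat them as assumptions and check the docs before relying on them. The helper functions, the field id `"000004"`, and the class weights are hypothetical.

```python
def weight_field_args(field_id):
    """Option 1: use an existing field as the per-instance weight."""
    return {"weight_field": field_id}

def objective_weights_args(class_weights):
    """Option 2: one weight per objective class, as [class, weight] pairs."""
    return {"objective_weights": [[name, w] for name, w in class_weights.items()]}

def balance_objective_args():
    """Option 3: let the service balance the classes automatically."""
    return {"balance_objective": True}

# Hypothetical values, for illustration only.
by_field = weight_field_args("000004")
by_class = objective_weights_args({"click": 10, "no-click": 1})
balanced = balance_objective_args()

# With the Python bindings, the dictionary would be passed along these
# lines (not executed here, since it needs credentials and a dataset):
#   from bigml.api import BigML
#   api = BigML()
#   model = api.create_model(dataset, by_class)
```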

 

Weights

Weight Field

With this option, BigML uses one of the fields in your dataset as the weight for each instance. If your dataset does not have an explicit weight field, you can add one using BigML’s new dataset transformations. Any numeric field with no negative or missing values is valid as a weight field. Each instance will be weighted individually according to the value of its weight field. This method is valid for both classification and regression models.
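Before picking a weight field, it can help to check locally that it meets the rule above: numeric, with no negative or missing values. The following is a minimal sketch of that check, not BigML’s own validation code. The function name and the sample Lifetime Value figures are made up.

```python
import math

def is_valid_weight_field(values):
    """True if every value is numeric, present, and not negative."""
    for v in values:
        if v is None:
            return False                      # missing value
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            return False                      # not numeric
        if math.isnan(v) or v < 0:
            return False                      # NaN counts as missing
    return True

# Hypothetical Lifetime Value column used as a weight field.
lifetime_values = [1200.0, 0, 35.5, 480]
ok = is_valid_weight_field(lifetime_values)
```

Note that zero is accepted here, since the rule only excludes negative and missing values.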

Weighing by LTV

Objective Weight

This method for adding weights only applies to classification models. A set of objective weights may be defined, one weight per objective class. Each instance will be weighted according to its class weight. If a class is not listed in the objective weights, it is assumed to have a weight of 1. Weights of zero are valid as long as some other weights are positive. If every weight does end up being zero (this can happen, for instance, if sampling the dataset produces only instances of classes with zero weight), then the resulting model will have a single node with a nil output.
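The rules in this paragraph are easy to mimic in a few lines of Python. This is only an illustration of the described behaviour (default weight of 1, zero weights allowed, and the all-zero corner case), not BigML’s implementation. The class names and weights are hypothetical.

```python
def instance_weights(labels, objective_weights):
    """Map each instance's class to its weight; unlisted classes get 1."""
    return [objective_weights.get(label, 1) for label in labels]

def all_weights_zero(weights):
    """True in the corner case where no instance carries any weight."""
    return not any(w > 0 for w in weights)

# Hypothetical objective weights: "stay" is not listed, so it defaults to 1.
objective_weights = {"churn": 5, "unknown": 0}

weights = instance_weights(["churn", "stay", "stay", "unknown"], objective_weights)

# A sample that happens to contain only zero-weight instances.
unlucky_sample = instance_weights(["unknown", "unknown"], objective_weights)
degenerate = all_weights_zero(unlucky_sample)
```

Here `weights` comes out as `[5, 1, 1, 0]`, while the unlucky sample carries no weight at all, which is the situation that produces the single-node model mentioned above.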

Weight using Objective Weights

Balance Objective

The third method is a convenience shortcut that assigns each class of a classification objective a weight inversely proportional to its category count. This gives you an easy way to make sure that all the classes in your dataset are evenly represented.
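As a worked example, here are weights inversely proportional to the Forest Covertype class counts listed earlier. The scaling is my own choice for readability (the largest class gets weight 1); the exact normalization BigML uses is not stated in this post, so treat that part as an assumption.

```python
def balance_weights(class_counts):
    """Weights inversely proportional to class counts.

    Scaled so that the most frequent class has weight 1 (an assumption
    made for this sketch; only the proportions matter).
    """
    largest = max(class_counts.values())
    return {name: largest / count for name, count in class_counts.items()}

covertype_counts = {
    "Lodgepole Pine": 283301,
    "Spruce-Fir": 211840,
    "Ponderosa Pine": 35754,
    "Krummholz": 20510,
    "Douglas-fir": 17367,
    "Aspen": 9493,
    "Cottonwood-Willow": 2747,
}

weights = balance_weights(covertype_counts)

# The weighted size of every class is now the same.
weighted = {name: covertype_counts[name] * weights[name] for name in covertype_counts}
```

With this scaling, each Cottonwood-Willow instance counts roughly 103 times as much as a Lodgepole Pine instance, so every class ends up with the same total weight.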

Balance Objective

 

Finally, bear in mind that using weights to create a model can significantly affect its performance. In classification models, you’re actually trading off precision and recall of the classes involved. So it’s very important to pay attention not only to the plain performance measures returned by evaluations but also to the corresponding misclassification costs. That is, if you know the cost of a false positive and the cost of a false negative in your problem, then you will want to weight each class to minimize the overall misclassification cost when you build your model.
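A small worked example may help here. The confusion counts and costs below are invented purely to illustrate the trade-off; they do not come from any real BigML evaluation.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for the class of interest."""
    return tp / (tp + fp), tp / (tp + fn)

def misclassification_cost(fp, fn, cost_fp, cost_fn):
    """Total cost, given a cost per false positive and per false negative."""
    return fp * cost_fp + fn * cost_fn

# Hypothetical costs: missing a positive is ten times worse than a false alarm.
COST_FP, COST_FN = 1, 10

# Hypothetical evaluations on the same 100 positive test instances.
unweighted = {"tp": 20, "fp": 5, "fn": 80}
weighted = {"tp": 70, "fp": 60, "fn": 30}

unweighted_pr = precision_recall(**unweighted)
weighted_pr = precision_recall(**weighted)

unweighted_cost = misclassification_cost(unweighted["fp"], unweighted["fn"], COST_FP, COST_FN)
weighted_cost = misclassification_cost(weighted["fp"], weighted["fn"], COST_FP, COST_FN)
```

In this made-up scenario, the weighted model has lower precision (about 0.54 versus 0.80) but much higher recall (0.70 versus 0.20), and its total cost is 360 instead of 805. With different costs, the comparison could easily go the other way.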

BigML + Tableau = Powerful Predictive Visualizations for Everyone

BigML is very excited to be collaborating with Tableau Software as a newly minted Tableau Technology Partner. This is a natural fit as both companies share similar philosophies on making analytics accessible (and enjoyable) to more people through incorporation of intuitive and easy-to-use tools. Not surprisingly, there’s a lot of overlap between our users: our latest survey showed that over 50% of BigML’s users also use Tableau. And of course, we’re eager to introduce Tableau’s customers to the predictive power of BigML.

So what does this mean for BigML and Tableau users?  First and foremost, it means that BigML will be working on ways to enable easier usage across both tools. For starters, we have just launched a new feature that lets you export a BigML model directly to Tableau as a calculated field, as is demonstrated in the video below:

This approach unleashes the power of Tableau for prediction, letting users interact with a BigML model just like any other Tableau field.  In the video, for example, we color a bar chart by predicted profit, which yields valuable insights into Tableau’s “superstore” retail dataset.
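To give a rough feel for what a decision tree looks like as a calculated field, here is a small sketch that turns a toy tree into nested IF/THEN/ELSE/END logic. This is not BigML’s actual export code. The tree structure, the field names, the thresholds, and the predicted values are all hypothetical.

```python
def to_calculated_field(node):
    """Render a toy decision tree as a nested IF ... THEN ... ELSE ... END."""
    if "predict" in node:
        # Leaf: emit the predicted value as a numeric literal.
        return repr(node["predict"])
    return "IF [{}] <= {} THEN {} ELSE {} END".format(
        node["field"],
        node["threshold"],
        to_calculated_field(node["low"]),   # branch taken when the test holds
        to_calculated_field(node["high"]),  # branch taken otherwise
    )

# Hypothetical two-level tree predicting profit.
tree = {
    "field": "Sales",
    "threshold": 500,
    "low": {"predict": 12.5},
    "high": {
        "field": "Discount",
        "threshold": 0.2,
        "low": {"predict": 140.0},
        "high": {"predict": -30.0},
    },
}

expression = to_calculated_field(tree)
```

For this toy tree, `expression` is `IF [Sales] <= 500 THEN 12.5 ELSE IF [Discount] <= 0.2 THEN 140.0 ELSE -30.0 END END`. Once a model is expressed this way, it can be dragged onto a chart like any other field.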

Moving forward, we’ll be looking at other ways to combine the powerful analytic and visual capabilities of Tableau and BigML so that joint customers can do amazing things.

We’d love to hear from any BigML users who are also Tableau customers so we can get added direction on this collaboration and provide you with early access to future implementations. If you’re willing and interested, please email us at tableau@bigml.com!
