We here at BigML are big fans of ensemble algorithms (and Ronald Reagan movies). Using them, a simple model like a BigML decision tree can be leveraged into a very high-powered predictor. During my regular surveys of machine learning research literature, however, I’ve noticed that a very popular class of ensemble algorithms, boosters, has been getting some bad press lately, and I thought I’d offer our readers a brief synopsis.
The tl;dr for this post is absolutely not “boosting is bad”. Adaptive boosting still gives excellent performance in a ton of important applications. In the cases where it fails, another type of ensemble algorithm, even another type of booster, usually succeeds. The overall moral is more about complexity. Boosting is a more complex way of creating an ensemble than are the two types we provide at BigML: Bootstrap Aggregation and Random Decision Forests. We use the simpler methods because they are faster, more easily parallelized, and don’t have some of the weaknesses that boosting has. Sometimes, less is more in machine learning.
A final caveat before I begin: This post is fairly technical. If you’re not up for a deep dive into some machine learning papers at the moment, you might want to check this out.
Learning to Rank
The first paper I’ll mention is Kilian Weinberger’s paper on the Yahoo! Learning to Rank Challenge. Not surprisingly, the paper is about learning to rank search query results by relevance. One of the surprising results is that boosting is outperformed by random forests for a variety of parameter settings. This is surprising because gradient boosted regression trees are one of the commercially deployed algorithms in web search ranking.
The authors manage to “fix” boosting by using a two-phase algorithm in which the gradient booster is initialized with the model generated by the random forest algorithm. Still, however, the fact that plain ol’ boosting gets beat by random forests in such an important and common application is tough to ignore.
Bayesian Model Combination
Another recent paper that I like a lot (and that hasn’t gotten the attention it deserves) is Kristine Monteith and Tony Martinez’s paper on Bayesian Model Combination. In this paper, we see that boosting outperforms bootstrap aggregation slightly when performance is averaged over a large number of public datasets. This is expected.
However, the authors then try a fairly simple augmentation of the bagged model: They essentially choose random weighted combinations of the models learned by bagging and pick several of the best performing combinations to compose their final model (note that this is a bit of an oversimplification; the variety of randomness they use is a good deal fancier than your garden-variety coin-flipping randomness, and “choosing the best performers” is also done in a clever fashion). They find that this simple step gives the bagged model a significant performance boost; enough to outperform boosting on average.
This is particularly hard on boosting because the final output of boosting is more or less the same as the final output of the weighted, bagged model; that is, a weighted combination of trees. The fact that we can produce weights better than boosting by selecting them at random suggests that the reason boosting is better than bagging is that it is allowed to weight its constituent models, and not the additional complexity used to choose those weights.
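To make the idea concrete, here is a minimal sketch of "try random weightings of bagged models and keep the best ones". It is not the authors’ algorithm: the paper’s sampling and selection are more sophisticated, and the toy votes, labels, and function names below are all assumptions made up for illustration.

```python
import random

def weighted_vote(weights, votes):
    """Combine per-model votes (+1/-1) into one prediction using model weights."""
    score = sum(w * v for w, v in zip(weights, votes))
    return 1 if score >= 0 else -1

def accuracy(weights, model_votes, labels):
    """Fraction of held-out instances where the weighted vote matches the label."""
    hits = sum(
        weighted_vote(weights, [votes[i] for votes in model_votes]) == labels[i]
        for i in range(len(labels))
    )
    return hits / len(labels)

def random_weights(n_models, rng):
    """Draw a random weight vector that sums to 1 (normalized exponential draws)."""
    raw = [rng.expovariate(1.0) for _ in range(n_models)]
    total = sum(raw)
    return [r / total for r in raw]

def pick_best_weightings(model_votes, labels, n_candidates=200, keep=5, seed=0):
    """Score random weightings (plus the uniform one) and keep the top few."""
    rng = random.Random(seed)
    n_models = len(model_votes)
    candidates = [[1.0 / n_models] * n_models]  # plain bagging as a baseline
    candidates += [random_weights(n_models, rng) for _ in range(n_candidates)]
    candidates.sort(key=lambda w: accuracy(w, model_votes, labels), reverse=True)
    return candidates[:keep]

# Hypothetical held-out labels and the votes of three bagged models on them.
labels = [1, 1, -1, -1, 1, -1]
model_votes = [
    [1, 1, -1, -1, 1, 1],
    [1, -1, 1, -1, -1, 1],
    [-1, 1, 1, -1, 1, 1],
]

uniform = [1.0 / 3] * 3
best = pick_best_weightings(model_votes, labels)
print("uniform accuracy:", accuracy(uniform, model_votes, labels))
print("best weighted accuracy:", accuracy(best[0], model_votes, labels))
```

Because the uniform weighting is always among the candidates, the selected weighting can never score worse than plain bagging on the data used for selection. In practice you would want to select on held-out data to avoid fooling yourself.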
Noise Sensitivity
The last and scariest result comes from a paper that is close to five years old now by Phil Long and Rocco Servedio. This paper is easily the most theoretical and mathematical of the bunch, but within all the math is a not-too-complex idea.
Suppose we have a training data set with a noise example. That is, a training data set where one of the instances is accidentally mislabeled. This is extremely common in data analysis: Maybe in your customer survey, someone clicked the “would recommend” button instead of the “would not recommend” button, or the technical support staff marked a permanently broken part as being “repaired”. In other words, data is rarely completely free of errors.
The authors show that, if you carefully construct a dataset, you can make certain types of boosters (ones that optimize convex potential functions, for those who care) produce models that perform no better than chance by adding just a single carefully selected noise instance. This is a pretty rough result for boosting. The fact that you can destroy an algorithm’s performance by moving a single training instance around means that the algorithm is a lot more fragile than we previously thought.
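For some intuition about why a single bad label can matter so much, here is a small arithmetic sketch using the exponential potential, one common convex potential. It is not the construction from the paper; the margins and instance counts are made-up numbers, chosen only to show how one confidently misclassified instance can account for a large share of the total potential the booster is trying to reduce.

```python
import math

def exponential_potential(margin):
    """A convex potential of the margin (label times model score): exp(-margin)."""
    return math.exp(-margin)

def noisy_share(n_clean, clean_margin, noisy_margin):
    """Share of the total potential contributed by one noisy instance."""
    clean_total = n_clean * exponential_potential(clean_margin)
    noisy = exponential_potential(noisy_margin)
    return noisy / (clean_total + noisy)

# Hypothetical setup: 99 correctly labeled instances with margin +2, and one
# mislabeled instance that the same model gets "wrong" with margin -2.
share = noisy_share(n_clean=99, clean_margin=2.0, noisy_margin=-2.0)
print(f"1 instance out of 100 carries {share:.1%} of the total potential")
```

A booster that greedily reduces this potential will spend a disproportionate amount of its effort on that one instance, which is the kind of behavior the paper exploits.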
Now of course this should be taken with a grain of salt. The authors have gone through a lot of trouble to cause a lot of trouble for boosting, and maybe nothing like the data set they’ve constructed really exists in the real world. However, in my experience (and that of others cited in the paper), boosting really is fragile in some cases, and combined with the other two results above this fragility starts to come into clearer focus.
The Moral
So what’s the point? I still like boosting. It’s still got a lot of good theory and even more good empirical results behind it. You’re probably using boosted models every day without even knowing it. But the three results above show that even a great tool like boosting can’t escape the No Free Lunch Theorem: Boosting provides performance benefits at times by using additional algorithmic complexity, but that additional complexity causes weaknesses at other times.
Those weaknesses can sometimes be remedied by using techniques that are simpler and faster, and we’re always looking for such remedies at BigML. As the great E. W. Dijkstra said, “The lurking suspicion that something could be simplified is the world’s richest source of rewarding challenges.”
There’s a lot of hype these days around predictive analytics, and maybe even more hype around the topics of “real-time predictive analytics” or “predictive analytics on streaming data”. Like most things that are over-hyped, what is actually meant by the term is often lost in the noise. In this case that’s really a shame, because these terms refer to at least two different things, either or both of which may be important in a given context.
This post forms the basis of a lightning talk I’m giving (remotely) at the Real-time Big Data Meetup in Menlo Park, California. Join the group if you’re interested!
What is Machine Learning from Streaming Data?
Generally, when I hear people talking about “machine learning from streaming data”, they may be talking about a couple of things.
- They want a model that takes into account recent history when it makes its predictions. A good example is the weather: if it has been sunny and 80 degrees the last two days, it is unlikely that it will be 20 and snowing the next day.
- They want a model that is updatable. That is, they want their model to in some sense “evolve” as data streams through their infrastructure. A good example might be a retail sales model that remains accurate as the business gets larger.
These two phenomena sound like the same thing, but they are potentially very different. The central question is whether the underlying source generating the data is changing. In the case of the weather, it really isn’t (okay, okay): Given the weather from the previous few days you can usually make a pretty good guess at the weather for the next day, and your guess, given recent history, will be roughly the same from year to year. The same model for last year will work for this year.
In the case of the business, the underlying source is changing; the business is growing, and so your guess of the sales given the previous few days of sales is probably going to be different from last year. So last year’s data, when the business was small, is really not relevant to this year, when the business is large. We need to update the model (or scrap it completely and retrain) to get something that works.
The first case, where you want the prediction conditioned on history, I’m going to call time-series prediction. That problem deserves a post all its own, but it suffices to say that solutions to this problem revolve largely around feeding the prediction history to the model as input. That’s a massive oversimplification, but there’s plenty of information out there if you’re more interested.
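As a rough sketch of what "feeding the history to the model as input" can look like, the function below turns a series into training rows whose inputs are the previous few values. The function name and the temperature readings are made up for illustration.

```python
def make_lagged_rows(series, n_lags):
    """Turn a series into (history, target) rows.

    Each row pairs the n_lags values preceding time t with the value at t,
    so any ordinary batch learner can be trained to predict the next value.
    """
    rows = []
    for t in range(n_lags, len(series)):
        history = list(series[t - n_lags:t])
        rows.append((history, series[t]))
    return rows

# Hypothetical daily high temperatures.
temperatures = [78, 80, 81, 79, 80]
rows = make_lagged_rows(temperatures, n_lags=2)
for history, target in rows:
    print(history, "->", target)
```

Note that the model trained on these rows never changes; only its inputs do. That is the difference between conditioning on history and updating the model.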
The second case, where you need to update the model or retrain completely, is about dealing with non-stationarity, and that’s largely what the rest of this post is going to be about. But consider this first lesson learned: Time series prediction and non-stationary data distributions are two different problems.
Some Options
When I think about the second case above, a couple of classes of approaches jump to mind:
- Incremental Algorithms: These are machine learning algorithms that learn incrementally over the data. That is, the model is updated each time it sees a new training instance. There are incremental versions of Support Vector Machines and Neural networks. Bayesian Networks can be made to learn incrementally.
- Periodic Re-training with a batch algorithm: Perhaps the more straightforward solution. Here, we simply buffer the relevant data and retrain our model “every so often”.
Note that any incremental algorithm can work in a batch setting (by simply feeding the input instances in the batch into the algorithm one after another). The reverse, however, isn’t trivially true. Many batch algorithms can only be made to work incrementally with significant work or power sacrifices, and some things just can’t be done.
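Here is a minimal sketch of the first half of that claim, using about the simplest incremental learner there is: a running mean that is updated one instance at a time. The class and function names are assumptions for illustration, not any particular library’s API.

```python
class RunningMean:
    """A tiny incremental 'model': predicts the mean of everything seen so far."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        """Fold one new instance into the model without storing old data."""
        self.count += 1
        self.mean += (value - self.mean) / self.count

    def predict(self):
        return self.mean

def fit_batch(model, batch):
    """Use an incremental learner in a batch setting: feed instances one by one."""
    for value in batch:
        model.update(value)
    return model

model = fit_batch(RunningMean(), [1, 2, 3, 6])
print(model.predict())
```

Going the other way is the hard part: a batch algorithm that needs to see all of the data at once has no obvious `update` method to call.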
So is the sacrifice worth it? Here are a couple of considerations to think about:
- Data Horizon: How quickly do you need the most recent datapoint to become part of your model? Does the next point need to modify the model immediately, or is this a case where the model needs to behave conditionally based on that point? If it is the latter, perhaps this is a time-series prediction problem rather than an incremental learning problem.
- Data Obsolescence: How long does it take before data should become irrelevant to the model? Is the relevancy somehow complex? Are some older instances more relevant than some newer instances? Is it variable depending on the current state of the data? Good examples come from economics; generally, newer data instances are more relevant. However, in some cases data from the same month or quarter from the previous year are more relevant than the previous month or quarter of the current year. Similarly, if it is a recession, data from previous recessions may be more relevant than newer data from a different part of the economic cycle.
With these two concerns in mind, along with architecture and implementation, you can get a pretty good idea of whether you’re looking at a problem where incremental learning is desirable or not.
Obviously, the shorter the data horizon, the more likely you are to want incremental learning. However, it’s a common mistake to confuse a short data horizon with a time-series prediction problem: If you want your model to behave differently based on the last few instances, the right thing to do is condition the behavior of the model on those instances. If you want the model to behave differently based on the last few thousand instances, you may want incremental learning.
This, however, is where the second concern rears its ugly head. Incremental learners all have some built-in parameter or assumption that controls the relevancy of old data. This parameter may or may not be modifiable and the relationship may be complex, but the algorithm will be making some implicit assumption about how relevant old data is. This is the second lesson: Be wary of the data relevance assumptions made by incremental learning algorithms!
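An exponential moving average makes this kind of assumption easy to see. In the sketch below (names and numbers are illustrative assumptions), the single parameter `alpha` fixes how quickly old instances stop mattering, whether or not that schedule suits your data.

```python
class ExponentialMovingAverage:
    """Incremental estimate whose memory of old data is controlled by alpha."""

    def __init__(self, alpha):
        self.alpha = alpha  # closer to 1.0 means old data is forgotten faster
        self.value = None

    def update(self, x):
        if self.value is None:
            self.value = x  # the first instance seeds the average
        else:
            self.value = self.alpha * x + (1 - self.alpha) * self.value
        return self.value

def weight_of_instance(alpha, age):
    """Weight of an instance seen `age` updates ago (ignoring the seed instance)."""
    return alpha * (1 - alpha) ** age

ema = ExponentialMovingAverage(alpha=0.5)
for x in [0, 10, 10]:
    ema.update(x)
print(ema.value)

# With alpha = 0.5, an instance from three updates ago has already
# lost most of its influence on the current estimate.
print(weight_of_instance(0.5, 3))
```

Nothing in the update rule knows about seasons or recessions; the decay is the same for every instance, which is exactly the sort of implicit relevance assumption to watch out for.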
By contrast, retraining in batches has lots of flexibility in this regard. It is easy to select data for retraining, filter by relevant criteria, even weight the data according to some relevancy function using one of the many batch training algorithms that take weighting into account. There’s even been some recent work in automatically detecting when retraining is necessary, based essentially on how different the incoming data is from previous recent data.
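A sketch of what that flexibility might look like before a retraining run: drop instances that are too old and weight the rest by a relevancy function of your choosing. The half-life decay, the field layout, and the function name are all assumptions for illustration; any batch learner that accepts per-instance weights could consume the result.

```python
def select_and_weight(instances, now, max_age_days, half_life_days):
    """Filter and weight buffered instances for a retraining run.

    instances: list of (day, features, label) tuples, where day is a number.
    Returns (features, label, weight) tuples for instances that are recent
    enough, with a weight that halves every half_life_days.
    """
    selected = []
    for day, features, label in instances:
        age = now - day
        if age > max_age_days:
            continue  # too old to be relevant under this (assumed) policy
        weight = 0.5 ** (age / half_life_days)
        selected.append((features, label, weight))
    return selected

# Hypothetical buffered sales data: (day, features, label).
buffered = [
    (0, {"visits": 10}, "low"),
    (90, {"visits": 55}, "high"),
    (100, {"visits": 60}, "high"),
]
training_set = select_and_weight(buffered, now=100, max_age_days=30, half_life_days=10)
for row in training_set:
    print(row)
```

The point is that the relevancy policy lives in your code, where you can change it (say, to favor the same quarter of the previous year), rather than inside the learning algorithm.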
Summing Up
First of all, don’t confuse learning from streaming data with time series prediction; while the data sources from the two problems look similar, the two concerns are often orthogonal.
Incremental learning is great for two cases: First, simplicity. There’s no buffering and no explicit retraining of the model. Second, speed. You always have a model that’s up to date. You make sacrifices, however, in terms of the power of the model that you can learn and the flexibility of the model to incorporate old data to different degrees. There are also some corner cases where incremental learning is necessary, such as when data privacy demands that instances be discarded immediately after they are seen.
Periodic retraining requires more decisions and more complex implementation. However, you get all of the power of any supervised classification algorithm, and specialized tools can be built on top of it to allow you to retrain on only relevant data and only when necessary. It also offers the nice benefit of being able to plug-and-play different machine learning algorithms into your architecture with a minimum of hassle, as the learning bit is built completely from off-the-shelf algorithms.
BigML has mainly gone the second route, allowing you to easily upload new data and trigger retraining (either manually or via the BigML API) whenever you find it necessary. One thing on our to do list is to implement some of those specialized tools on top of the BigML API, so that data can be streamed to BigML and the model is automatically re-trained only when it needs to be. We’ll update you on our blog about any big changes!
EDIT: This article originally used the terms “classifier” and “predictor” more or less interchangeably, and they’re not really interchangeable as one astute reader pointed out. I’ve replaced both with the term “model”, which follows the convention we’ve been using at BigML. Sorry for any confusion that might have caused!
This is a guest post by Erik Meijer (@headinthebox). He is an accomplished programming-language designer who runs the Cloud Programmability Team at Microsoft and a professor of Cloud Programming at TUDelft.
There is a lot of hype and mystique around Machine Learning these days. The combination of the words “machine” and “learning” induces hallucinations of intelligent machines that magically learn by soaking up Big Data and then both solving world hunger and making us rich while we lie on the beach sipping a cold one.
Worse yet, the esoteric and mathematical terminology of many Machine Learning textbooks and research papers fuels the mystique, resulting in the persona of the Data Scientist as the 21st century druid that mystically distills insight and knowledge from raw data.
However, just as normal programmers can write code without needing to understand Universal Turing Machines, power domains, or predicate transformers, we believe that normal programmers can use Machine Learning without needing to understand vectors, features, probability density, Jacobians, etc. In fact, the very essence of Machine Learning is creating code from a finite set of sample input/output pairs. This is something that programmers are already deeply familiar with; in other words, Machine Learning is Test Driven Development (TDD) performed by code.
For example, given the well-known “Iris flower dataset” as our test case:
| Sepal length | Sepal width | Petal length | Petal width | Iris |
| 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 7.0 | 3.2 | 4.7 | 1.4 | Versicolor |
| 6.1 | 2.6 | 5.6 | 1.4 | Virginica |
we can apply Machine Learning to generate executable code that “predicts” the class of Iris given the four parameters:
enum Iris { Setosa, Versicolor, Virginica }
Iris Predict(double sepalLength, double sepalWidth, double petalLength, double petalWidth){ … }
The nice folks at BigML have created a service that allows developers like you and me to either use Machine Learning as TDD programmatically using a simple REST API, or manually using an elegant website. The process involves uploading a datasource (typically a csv file), converting it into a typed dataset, and finally creating a model using the decision tree generator algorithm.
This model can then be rendered as code in several programming languages such as C#, Java, Objective-C, Python, etc., or exposed as an interactive webpage where users can navigate the model by answering questions.
When confronted with a REST API, the very first thing that every sane developer does is to implement a high level abstraction that hides all the low-level details of making HTTP requests, creating query strings, and munching JSON. Developers from many languages that use BigML’s REST API are no exception to this rule, and have created bindings for Clojure, iOS, Java, Python, R, Ruby and now also .NET.
The .NET bindings for BigML that are available on Github expose a full LINQ provider, a strongly typed projection of all the JSON objects exposed by the REST API, as well as the ability to compile models to .NET assemblies.
To access BigML using the .NET bindings, you first create a new client object by passing your user name and API key. The client object provides (strongly typed) methods for all the operations provided by the BigML API as documented here; for example listing, filtering and sorting your BigML sources using LINQ queries. Of course the binding may not reflect all of the latest features (for example, we do not implement Evaluations yet), but that is why we provide the source on Github. The implementation of the LINQ provider may be an interesting topic of study by itself, and follows the pattern as outlined in the CACM paper “The World According to LINQ”.
// New BigML client using username and API key.
Console.Write("user: "); var User = Console.ReadLine();
Console.Write("key: "); var ApiKey = Console.ReadLine();
var client = new Client(User, ApiKey);
Ordered<Source.Filterable, Source.Orderable, Source> result
= (from s in client.ListSources()
orderby s.Created descending
select s);
var sources = await result;
foreach(var src in sources) Console.WriteLine(src.ToString());
Below is an example of how to create a new source from an in-memory collection, then a dataset and finally a model. Since the BigML resource creation is asynchronous we need to poll until we get the status code “finished” back from the service. Note that BigML (and the .NET bindings) also supports creating sources from local files, Amazon S3, or Azure Blob store.
// New source from in-memory stream, with separate header.
var source = await client.Create(iris, "Iris.csv", "sepal length, sepal width, petal length, petal width, species");
// No push, so we need to busy wait for the source to be processed.
while ((source = await client.Get(source)).StatusMessage.StatusCode != Code.Finished) await Task.Delay(10);
Console.WriteLine(source.StatusMessage.ToString());
// Default dataset from source
var dataset = await client.Create(source);
// No push, so we need to busy wait for the dataset to be processed.
while ((dataset = await client.Get(dataset)).StatusMessage.StatusCode != Code.Finished) await Task.Delay(10);
Console.WriteLine(dataset.StatusMessage.ToString());
// Default model from dataset
var model = await client.Create(dataset);
// No push, so we need to busy wait for the model to be processed.
while ((model = await client.Get(model)).StatusMessage.StatusCode != Code.Finished) await Task.Delay(10);
Console.WriteLine(model.StatusMessage.ToString());
Of course what we are really after, since we want to show that Machine Learning is automated TDD, is the generated model for our source. The model description is a giant JSON object that represents the decision tree that BigML has “learned” from the data we fed it. In the example below, we translate the model into a .NET expression tree, compile the expression tree into a .NET delegate, and then call it on one of the test inputs to see if it predicts the same kind of iris:
var description = model.ModelDescription;
Console.WriteLine(description.ToString());
// First convert it to a .NET expression tree
var expression = description.Expression();
Console.WriteLine(expression.ToString());
// Then compile the expression tree into MSIL
var predict = expression.Compile() as Func<double,double,double,double,string>;
// And try the first flower of the example set.
var result2 = predict(5.1, 3.5, 1.4, 0.2);
Console.WriteLine("result = {0}, expected = {1}", result2, "setosa");
We hope that this library makes BigML easily accessible to .NET programmers that want to incorporate Machine Learning in their applications. And we hope to see it flourish like all the other bindings for BigML that are available at https://github.com/bigmlcom/io.
We love data, big and small and we are always on the lookout for interesting datasets. Over the last two years, the BigML team has compiled a long list of sources of data that anyone can use. It’s a great list for browsing, importing into our platform, creating new models and just exploring what can be done with different sets of data.
In this post, we are sharing this list with you. Why? Well, searching for great datasets can be a time consuming task. We hope this list will support you in that search and help you to find some inspiring datasets. Some data sources are great for complementing your own data. Others are interesting or just fun to play with. If you have your own list of favorite data sources and want to share them, feel free to let us know and we’ll update our list.
Categories of data sources
We grouped the links into some categories that bit.ly calls ‘Bundles’ to help you find what you are looking for and bundled the Bundles into a single Data Sources Bundle. Here is a short discussion of the categories, with some examples.
Machine Learning Datasets
Although many datasets can be used for machine learning tasks, the sources in this Bundle are specifically pre-processed for machine learning. We included one of the most famous sources of machine learning datasets in this Bundle: the UCI Machine Learning Repository. There are hundreds of datasets in this repository, nicely categorized so you have multiple angles to search. All datasets are well documented, including data set descriptions. One of the datasets you can find here is the widely used ‘iris’ dataset. Another example is this vertebral column dataset that has data on 6 features to diagnose orthopaedic patients.
| Example: | Vertebral Column Model, predicting one of three conditions of the vertebral column, based on various metrics. |
| File: | vertebral_column_data.zip |
| Format: | .csv and .arff |
| Access: | direct download |
| License: | Adhere to citation policy |
Machine Learning Challenges
Our next bundle of links contains links to Machine Learning Challenges. Each challenge comes with data and usually this data is available for download, even after the challenge is closed. So challenges offer an interesting source of all kinds of data. One of the leaders in this field is Kaggle. But lesser known challenges like Digging Into Data or Causality Workbench have interesting repositories too.
Marketplaces and data hubs
There is an ever growing number of places where one can offer data, search data and download data. Some are commercial offerings that have both paid and free datasets. Others are more community style, non-commercial places where people share datasets. In this bundle we have combined these into a nice collection of places that have thousands of datasets. One of our favorites is Infochimps. Although they are now focusing on monetizing their big data platform, they still have a diverse data marketplace that is easy to browse through. Another great place is Microsoft’s Windows Azure Marketplace. This marketplace has both paid and free datasets that you can subscribe to. You collect the data through an API, through setting a query and downloading the results or through BigML’s integrated Azure Marketplace widget.
| Example: | Car crash with fatalities, day of week: models day of week patterns in car crash fatalities. |
| File: | USA 2011 Car Crash Data |
| Format: | CSV |
| Access: | OAuth |
| License: | Terms set by the publisher of the data. |
Open companies
We found some companies that share some of their data through downloads or an API. We hope this short list is just the beginning and many more will follow! Among the early adopters are Best Buy, eBay and LendingClub. Also on this list is Prosper: we’ve created a daily feed through their API to model Prosper Status, Borrower Rate, Loan Status and Bid Count.
| Example: | Prosper Borrower Rate model, predicting the rate at which a borrower can get money. |
| File: | various files |
| Format: | XML and CSV |
| Access: | Direct download and through ProsperAPI |
| License: | Prosper API Terms of Use |
Data search engines
There are even special search engines that help you find data and datasets, like Quandl, where you can search in over 3,000,000 financial, economic and social datasets. Open data @CTIC will let you scout open data initiatives worldwide. Zanran is a web site where you can search the web for data and statistics.
Data Journals
Then there are Data Journals. We only found a few. They not only publish data related stories, but also the data to go with the story. This way you can make your own analysis, visualization or models and even share them. At Sargasso most posts are in Dutch but you’ll find some in English with data-links and nice visualizations. The Guardian Data blog is iconic of course. This blog is very open to what you can do with the data they provide. It even has a separate Flickr group where you can upload new visualizations. As an example, here’s a post on credit ratings per country by S&P, Moody’s and Fitch that we created some models for.
| Example: | Souvereigns S&P Ratings, predicting a country’s S&P rating based on the previous ratings by the big rating agencies. |
| File: | https://docs.google.com/a/bigml.com/spreadsheet/ccc?key=0AonYZs4MzlZbdDdpVmxmVXpmUTJCcm0yYTV2UWpHOVE#gid=19 |
| Format: | Google Docs Spreadsheets |
| Access: | Download through Google Docs in multiple formats |
| License: | public data |
Open Data Sources
There’s a big push for governments to be transparent and share data. We found a lot of sources on open data and have grouped them into a number of bundles that are more or less self explanatory.
The Local Government bundle contains links to open data sites of cities and counties around the world. The National Government bundle has links for countries, but also includes (US) states and regions. There is a bundle for international bodies and agencies, like the United Nations, the World Bank et cetera. Finally, there is a special bundle for US agencies.
One of the International Organizations that has made an amazing amount of data available is the World Bank. There is for example the World Development Indicators Database. You can extract data from this database using a simple interface and download the data you are looking for. For this sample ‘Net Migration Model’ we combined various economic and demographic indicators with net migration data.
| Example: | Net Migration Model, predicting migration streams based on a number of indicators. |
| File: | data taken from World Development Indicators Database |
| Format: | csv, xls, txt |
| Access: | Direct Download |
| License: | Terms of Use for Datasets |
A list of lists
We also have a bundle that contains lists of data sources. You will find interesting new sources but also some doubles in these lists. Sources are for instance Hilary Mason’s Bundle of links on where to find research quality datasets, links to Quora questions & answers that contain references to data sources, blog posts that feature data source lists and a variety of other lists we found.
Miscellaneous
We’ll end with an interesting collection of miscellaneous sites that publish specific data, like Cosm.com where you can publish and download sensor data. You can find the Million Song Dataset here, Google’s n-gram datasets, data ranging from scientific to personal. It’s a great list to spend a rainy day on. As an example, take a look at the data from the Pew Internet & American Life Project. It is a rich source of data on internet related and social topics. Like this survey regarding the use of Facebook. We used it to make a predictive model on ‘Do you expect to spend more/less time on Facebook the upcoming year?’.
| Example: | Will you spend more/less time on Facebook?, predicting the amount of time people spend on Facebook. |
| File: | Omnibus_Dec_2012_csv.csv |
| Format: | csv, SPSS, Word |
| Access: | Direct Download |
| License: | Use Policy |
A Final Word
Just a final word about these lists. We focused on the English speaking part of the world, very aware of the fact that there are many sites in other languages that we missed. Even for the English speaking part, there’s no way we would find every possible source. Feel free to send us your favorite links at datasources@bigml.com and we will add them to the list!
We hope this great list of sources encourages you to go out and have fun with data. Not familiar with BigML’s services? No problem: registration only takes a minute, is free and will get you a good set of promotional credits to get you going with some of these datasets!
Last year, our friend Wendy the Wine Merchant showed you how business owners can use BigML to help them make their business run smarter. Today, Sara the Software Developer has arrived to show us how developers can partner with BigML to help them build applications that learn about their users.
Let’s give her a warm Internets round of applause to welcome her to BigML’s landing page!
If you are a meticulous developer or you are learning the basics of machine learning, you are going to love our new sandbox feature, inline sources.
As you might know, BigML comes with a free Machine Learning Sandbox, a development mode that you can switch on/off in your account settings. It lets you do everything that you want to do in the production mode for free but with the limitation of 1 MB maximum for datasets, models, and evaluations.
While in development mode, we charge you no credits for using BigML. This way you can focus on developing and testing your application, while not spending yourself into the poor house.
While developing and testing applications, you may want to manually create input data to simulate specific situations. To that end we have added an inline editor. Now you can create inline sources in BigML by manually inputting any data you like. Of course you can also copy/paste the content of an existing file and edit that data.
So how does it work? As always: with almost shocking simplicity! Make sure you are in Development mode first. Then go to the Source-tab of your dashboard. Here you’ll see an icon for creating an inline Source.
Clicking that icon opens up the editor. Now you can create, paste, edit all you want to create your new inline source. Make sure you give it an appropriate name, hit ‘Create’ and you are good to go!
Inline sources might also be useful to help you understand how machine learning works in general and how BigML’s algorithm specifically behaves. Imagine that you are taking an introductory class on learning with decision trees and want to check how different data alters how the trees are built. You can now easily do that in a matter of clicks. In fact, the example above is the party dataset taken from Stephen Marsland’s introductory book on machine learning from an algorithmic perspective. See below the corresponding model.
Inline sources is a simple feature that we trust will be helpful as you are developing great applications using BigML, or just learning more about how machine learning works.
In my last post, I started discussing Pedro Domingos’ excellent paper reviewing some of the underpinnings and pitfalls of machine learning, making the paper a little less academic and adding a few examples. I’ll continue that review in this post. Many of the headings come directly from the paper, and I’ll use quote marks when quoting directly.
Finally, I’m using “feature” and “field” here interchangeably, as Domingos uses the former term and BigML generally uses the latter.
Feature Engineering is the Key
We’ll start out this time with a topic that is so important that it deserves an instructive example. In fact, Domingos calls it “easily the most important factor” in determining the success of a machine learning project, and I agree with him.
Suppose you have a dataset in which you have pairs of cities, coupled with a prediction of whether most people would consider the two cities to be comfortably drivable within a single day. You’ve got a nice database with the longitudes and latitudes for all of your cities, and so these are your input fields. Your dataset might look something like this (note that the values don’t correspond to actual city locations – they are random):
| City 1 Lat. | City 1 Lng. | City 2 Lat. | City 2 Lng. | Drivable? |
|---|---|---|---|---|
| 123.24 | 46.71 | 121.33 | 47.34 | Yes |
| 123.24 | 56.91 | 121.33 | 55.23 | Yes |
| 123.24 | 46.71 | 121.33 | 55.34 | No |
| 123.24 | 46.71 | 130.99 | 47.34 | No |
And you expect to construct a model that can predict for any two cities whether the distance is drivable or not.
Probably not going to happen.
The problem here is that no single input field, or even any single pair of fields, is closely correlated with the objective. It is a combination of all four fields (the distance from one pair of geo-coordinates to the other), and a combination by a fairly complex formula, that is correlated with the objective. Machine learning algorithms are limited in the way they can combine input fields; if they weren’t, they could totally exhaust themselves trying everything.
But all is not lost! Even if the machine doesn’t have any knowledge about how longitudes and latitudes work, you do. So why don’t you do it? Apply the formula to each instance and get a dataset like this (again, random values):
| Distance (mi.) | Drivable? |
|---|---|
| 14 | Yes |
| 28 | Yes |
| 705 | No |
| 2432 | No |
Ah. Much more manageable. This is what we mean by feature engineering. It’s when you use your knowledge about the data to create fields that make machine learning algorithms work better.
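To make the transformation concrete, here is a minimal sketch of how the four coordinate fields could be collapsed into one distance field. It assumes the haversine (great-circle) formula, which is one common choice and not necessarily "the formula" meant above; the field names (`lat1`, `lng1`, `lat2`, `lng2`, `drivable`) and the Earth-radius constant are illustrative assumptions. Note that it gives straight-line distance over the globe, not road distance.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius (assumed constant)


def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # min() guards against tiny floating-point overshoot above 1.0
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(a)))


def add_distance_field(rows):
    """Replace the four coordinate fields with one engineered distance field."""
    return [
        {
            "distance_mi": haversine_miles(r["lat1"], r["lng1"], r["lat2"], r["lng2"]),
            "drivable": r["drivable"],
        }
        for r in rows
    ]
```

The learner never has to discover trigonometry on its own; it only has to find a cutoff on a single number.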
Domingos mentions that feature engineering is where “most of the effort in a machine learning project goes”. I couldn’t agree more. In my career, I would say an average of 70% of the project’s time goes into feature engineering, 20% goes towards figuring out what comprises a proper and comprehensive evaluation of the algorithm, and only 10% goes into algorithm selection and tuning.
How does one engineer a good feature? One good rule of thumb is to try to design features where the likelihood of a certain class goes up monotonically with the value of the field. So in our example above, “drivable = no” is more likely as distance increases, but that’s not true of longitude or latitude. You probably won’t be able to engineer a feature where this is strictly true, but it is a good feature even if it is somewhat close to that ideal.
Typically, there isn’t a single data transformation that makes learning immediately easy (as there was in the above example), but at least as typically there are one or more things you can do to the data to make machine learning easier. There’s no formula for this, and a lot of it happens by itch and by twitch. BigML attempts to do some of the easy ones for you (automated date parsing is an example) but far more interesting transformations can happen with detailed knowledge of your specific data. Great things happen in machine learning when human and machine work together, combining a person’s knowledge of how to create relevant features from the data with the machine’s talent for optimization.
More Data Beats A Cleverer Algorithm
More data wins. It’s not just Domingos saying this; there’s increasingly good evidence that, in a lot of problems, very simple machine learning techniques can be levered into incredibly powerful classifiers with the addition of loads of data.
A big reason for this is that, once you’ve defined your input fields, there’s only so much analytic gymnastics you can do. Computer algorithms trying to learn models have only a relatively few tricks they can do efficiently, and many of them are not so very different. As we have said before, performance differences between algorithms are typically not large. So if you want better classifiers, you should spend your time:
- Engineering better features
- Getting your hands on more high-quality data
Learn Many Models, Not Just One
Here, Domingos discusses the power of ensembles, which we’ve blogged about before. There’s no need to discuss it at greater length here, but it bears repeating in brief: One can often make a more powerful model by learning multiple classifiers over different random subsets of the data. The downside is that some interpretability is lost; instead of a single sequence of questions and answers to arrive at a prediction, you now have a sequence for each model, and the models vote for the final prediction. However, if your application is very performance sensitive, the loss in interpretability might be worth the increase in power.
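As a rough illustration of "multiple classifiers over different random subsets, then vote", here is a small sketch of bootstrap aggregation. Everything in it is an assumption made for illustration: the helper names (`fit_stump`, `bagged_models`, `vote`), the deliberately crude base model, and the use of sampling with replacement. It is not how BigML builds its ensembles.

```python
import random
from collections import Counter


def fit_stump(rows):
    """A deliberately simple base model: split at the mean value and
    predict the majority label seen on each side of the split."""
    cut = sum(value for value, _ in rows) / len(rows)

    def majority(labels, default):
        return Counter(labels).most_common(1)[0][0] if labels else default

    overall = majority([label for _, label in rows], None)
    low = majority([label for value, label in rows if value <= cut], overall)
    high = majority([label for value, label in rows if value > cut], overall)
    return lambda value: low if value <= cut else high


def bagged_models(rows, fit, n_models, seed=0):
    """Fit each model on its own random resample (with replacement) of the rows."""
    rng = random.Random(seed)
    return [fit([rng.choice(rows) for _ in rows]) for _ in range(n_models)]


def vote(models, value):
    """Each model casts one vote; the most common prediction wins."""
    return Counter(model(value) for model in models).most_common(1)[0][0]
```

The interpretability cost is visible in `vote`: there is no longer one sequence of questions to read off, only a tally.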
Simplicity Does Not Imply Accuracy
There is an old saying in hypothesis testing known as Occam’s Razor. In the original fancy Latin it reads something like, “Plurality is not to be posited without necessity”. In layman’s terms, if you have two explanations for something, you should generally prefer the simpler one. For example, if you wake up in the middle of a cornfield having no memory of what you did the previous night, one explanation is that you were abducted by aliens and they implanted a memory suppression device in your brain. Another is that you got really drunk. The latter is simpler and therefore preferred (unless you are a member of the fringe media).
So too in machine learning. If we have two models that fit the data equally well, many machine learning algorithms have a way of mathematically preferring the simpler of the two. The folk wisdom here is that a simpler model will perform better on out-of-sample testing data, because it has fewer parameters to fit, and thus is less likely to be overfit (see part one for more on the dangers of overfitting).
One should not take this rule too far. There are many places in machine learning where additional complexity can benefit performance. On top of that, it is not quite accurate to say that model complexity leads to overfitting. More accurate is that the procedure used to fit all that complexity leads to overfitting if it is not very clever. But there are plenty of cases where the complexity is brought to heel by cleverness in the model fitting process.
Thus, prefer simple models because they are smaller, faster to fit, and more interpretable, but not necessarily because they will lead to better performance; the only way to know that is to evaluate your model on test data.
Representable Does Not Imply Learnable
The creators of many machine learning algorithms are fond of saying that the function representing an accurate prediction on your data is representable by the learning algorithm. This means that it is possible for the algorithm to build a good model on your data.
Unfortunately, this possibility is rarely comforting by itself. Building a good model may require much more data than you have, or the good model might simply never be found by the algorithm. Just because there’s a good model out there that the algorithm could find does not mean that it will find it.
This is another great argument for feature engineering: If the algorithm can’t find a good model, but you are pretty sure that a good model exists, try engineering features that will make that model a little more obvious to the algorithm.
Correlation Does Not Imply Causation
This is such an old adage in statistics that Domingos almost decides not to mention it, but it is so important that he does.
The point of this common saying is that modeling observational data can only show us that two variables are related, but it cannot tell us the “why”. In a soon-to-be-classic example from the excellent book Freakonomics, data from public school test scores showed that children who lived in homes with a high number of books tended to have higher standardized test scores than those with a lower number of books in the house. The governor of Illinois, doubtless while screaming “Science!” at all of his advisors, proposed a program to send free books to the homes of children, which would definitely raise their test scores, right?
It turns out, of course, that intelligence doesn’t really work like that. More likely, the relationship exists because smarter and more affluent parents tend to buy books and also do a whole bunch of other helpful things for their children. The books don’t cause the kids to be successful, they are just good indicators of success.
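A tiny simulation can make the trap easier to see. The sketch below is entirely made up for illustration: a hidden "affluence" value drives both the number of books and the test score, and books never enter the score calculation at all. The function names and the noise model are assumptions, not anything from Freakonomics or Domingos.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def simulate_households(n, seed=0):
    """Affluence is the hidden common cause of both books and scores."""
    rng = random.Random(seed)
    affluence = [rng.gauss(0, 1) for _ in range(n)]
    books = [a + rng.gauss(0, 1) for a in affluence]
    scores = [a + rng.gauss(0, 1) for a in affluence]  # books play no role here
    return books, scores


books, scores = simulate_households(5000)
correlation = pearson(books, scores)
```

In this toy world `correlation` comes out clearly positive, yet mailing extra books to every household would change nothing, because `scores` is computed without ever looking at `books`.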
You should take similar care when interpreting your models. Just because one thing predicts another doesn’t mean it causes another, and making business (or public policy) decisions based on some imagined causal relationship should be done with extreme caution.
The Big Picture
Machine learning is an awfully powerful tool, and like any powerful tool, misuses of it can cause a lot of damage. Understanding how machine learning works and some of the potential pitfalls can go a long way towards keeping you out of trouble.
An overall attitude that I find helpful is one of skeptical optimism: If you think you have a good model, try first to find all of the ways you think it might be broken, paying special attention to problems caused by overfitting or too many features.
If your model isn’t performing well, don’t lose heart! With a combination of feature engineering and gathering more data, the path to a better model is sometimes shorter than you think.
Recently, Professor Pedro Domingos, one of the top machine learning researchers in the world, wrote a great article in the Communications of the ACM entitled “A Few Useful Things to Know about Machine Learning”. In it, he not only summarizes the general ideas in machine learning in fairly accessible terms, but he also manages to impart most of the things we’ve come to regard as common sense or folk wisdom in the field.
It’s a great article because it’s a brilliant man with deep experience who is an excellent teacher writing for “the rest of us”, and writing about things we need to know. And he manages to cover a huge amount of ground in nine pages.
Now, while it’s very light reading for the academic literature, it’s fairly dense by other comparisons. Since so much of it is relevant to anyone trying to use BigML, I’m going to try to give our readers the Cliff’s Notes version right here in our blog, with maybe a few more examples and a little less academic terminology. Often I’ll be rephrasing Domingos, and I’ll indicate where I’m quoting directly.
How Does Machine Learning Work?

We know that supervised machine learning models (the ones you build at BigML) can predict one field in your data (the objective field) from some or all of the others (the input fields). But how does the model building process actually work? All machine learning algorithms (the ones that build the models) basically consist of the following three things:
- A set of possible models to look through
- A way to test whether a model is good
- A clever way to find a really good model with only a few tests
A good analogy here might be trying to find a really good restaurant in your neighborhood: Your set of possibilities is all of the restaurants in your neighborhood, and to test if the restaurant is good you can have a meal there.
Our third necessity is a bit more elusive; you want to find a good restaurant without having to eat at every single one. How best to be clever about it? Maybe you take a look at the menu, or the outside of the building, or the surrounding neighborhood. Maybe you ask some people you trust. In any case, you know a lot of tricks that will let you find a good restaurant without trying all of them. You may not find the best one; that would be a lot more work, and is probably not necessary. But you could probably find one that’s pretty good.
With machine learning, things are even more tricky: The set of possible models for any real machine learning algorithm is big. Really, really big. In many cases, it’s not even finite. Fortunately, we already know loads of clever tricks for making the search manageable (we are using many of them here at BigML!).
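Here is a toy sketch of those three ingredients for about the simplest learner imaginable: a single cutoff on one numeric field. The function names and the sample rows are assumptions for illustration only. The set of possible models (every real-valued cutoff) is infinite, the test is plain accuracy, and the clever trick is noticing that any cutoff between two neighbouring data values behaves exactly like their midpoint on the training rows, so only a handful of candidates need testing.

```python
def candidate_thresholds(values):
    """Shrink the infinite set of cutoffs to one midpoint per gap
    between neighbouring distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def accuracy(threshold, rows):
    """The test: fraction of rows where 'value <= threshold' matches the label."""
    correct = sum((value <= threshold) == label for value, label in rows)
    return correct / len(rows)


def best_threshold(rows):
    """The search: score each midpoint once and keep the best."""
    candidates = candidate_thresholds([value for value, _ in rows])
    return max(candidates, key=lambda t: accuracy(t, rows))


# Illustrative rows: (distance in miles, is it drivable in a day?)
rows = [(14, True), (28, True), (705, False), (2432, False)]
cutoff = best_threshold(rows)
```

Real algorithms search far richer model sets, but the shape is the same: candidates, a score, and a shortcut that avoids scoring everything.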
Once you’ve found a good model, “the proof is in the pudding”, as they say. We hope that the model we’ve found will also apply to data outside of your original dataset, so you can make predictions from the inputs when you don’t know the objective.
Unfortunately, even with this fairly simple idea, there are many ways that learning algorithms fail, and things you can do to make sure they don’t. Domingos visits several of the most common ones in his paper.
Overfitting Has Many Faces

The general moral of this section of the paper is to always measure the performance of your classifier on out-of-sample data. That is, to know if your model is good, you must do at least one test where you train on some of your data and test on the rest. Even if that test succeeds, to have real knowledge you must do several (where you split the data differently each time). If your data is placed in time somehow (weekly retail sales, for example), you might do a test where you train on all of the weeks except one, and test on that week, and you should do this for each of the weeks in your data.
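The "train on every week but one, test on that week, repeat" procedure can be sketched in a few lines. The helper name `leave_one_group_out` and the sample rows are hypothetical; the point is only the shape of the loop, in which every week takes a turn as the held-out test set.

```python
def leave_one_group_out(rows, group_of):
    """Yield (held_out_group, train_rows, test_rows) once per distinct group."""
    groups = sorted({group_of(row) for row in rows})
    for held_out in groups:
        train = [row for row in rows if group_of(row) != held_out]
        test = [row for row in rows if group_of(row) == held_out]
        yield held_out, train, test


# Placeholder rows: (week number, units sold); the values are made up.
sales = [(1, 120), (1, 95), (2, 130), (3, 80), (3, 110)]
splits = list(leave_one_group_out(sales, group_of=lambda row: row[0]))
```

You would fit a model on each `train` list, evaluate it on the matching `test` list, and then look at how the results vary from week to week.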
You cannot do too many of these training and testing splits. You should even make some predictions on data you imagine yourself, to see what the model does in certain situations. It is only by doing this that you will understand how good your model is, and what sort of mistakes it makes when it makes them.
Domingos focuses on a particular way of analyzing the errors of your model, called bias-variance decomposition, but BigML provides many other ways of understanding the performance of your model on out-of-sample data. You can see them all when you create an evaluation.
Intuition Fails in High Dimensions
So you have a model, with some input fields, and some objective field, and it is not performing as well as you’d like. One of your first intuitions might be to add more input fields. After all, if the model can do as well as it does with the input fields you have, surely if you give it more information it will do better, right?
Not so fast. While this might work if you add a particularly useful input field (that is, one that helps you predict the objective very well), this often doesn’t work. In fact, if you add input fields that are not useful, or redundant ones that contain information already in other input fields, you may very well get a model that performs worse than the original. One reason is that the more input fields you have, the more likely that the model will see some relationship between one of them and the objective that isn’t real, but just the product of random noise. Of course, this will make the model think that field is important when it is not, and so the model will be the worse because of it.
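To see how easily noise can masquerade as signal, consider this small experiment. It generates an objective and a pile of input fields that are all pure random noise, then reports the strongest correlation found among the first 2, 20, and 200 of those fields. The function names, the row count, and the use of Pearson correlation are assumptions chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def strongest_spurious_correlation(n_rows, field_counts, seed=0):
    """Return {k: largest |correlation| among the first k noise fields}."""
    rng = random.Random(seed)
    objective = [rng.gauss(0, 1) for _ in range(n_rows)]
    strengths = [
        abs(pearson([rng.gauss(0, 1) for _ in range(n_rows)], objective))
        for _ in range(max(field_counts))
    ]
    return {k: max(strengths[:k]) for k in field_counts}


result = strongest_spurious_correlation(n_rows=30, field_counts=[2, 20, 200])
```

Because each larger set contains the smaller ones, the strongest spurious correlation can only stay the same or grow as fields are added. None of these fields carries any real information, yet the more of them there are, the more convincing the best one looks.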
What this means in practice is that as you add more and more input fields, you must also add more and more training data to “fill up” the space created by the additional inputs if you want to use them accurately.
This is known as the curse of dimensionality in machine learning, and it is only one problem with having too many input fields. You can combat this problem in your own modeling efforts by selecting as inputs only the fields you know are relevant to predicting the objective. You can do this by opening up the configuration panel when you create your model and deselecting the less relevant fields.
Theoretical Guarantees Are Not What They Seem
Many machine learning papers offer fancy mathematics that show that if your training set is a certain size, you can guarantee the error of your model will be better than some number. Often, people see these guarantees and think, “Wow, with such fancy math behind it, this algorithm must surely be better than all others”. The problem is that these guarantees are more often than not irrelevant in practice because the guaranteed number is so pathetic that any reasonable algorithm will hit it. In fact, some recent work showed that some of these theoretical guarantees are useless not only in practice, but even in theory.
In my machine learning career, I’ve seen a similar phenomenon with people becoming attached to families of algorithms. That is, “Algorithm X cannot be good because it is not a neural network” or “it is not a support vector machine” or what have you. For example, our illustrious CEO recalls a failed machine learning project at a major company. The reason for the failure? A high-level manager thought that the type of classifier being used was inferior to one he had heard about in an undergraduate class years before. Unfortunately, his lack of solid evidence didn’t translate into a lack of authority and the project was scrapped at a late stage, costing the company millions of dollars in wasted effort.
All such biases for and against algorithms are at best misleading. The only certain way (that we know of now) to know if an algorithm will model your data well is to try it out. While you may know some data-specific things that may help you select a machine learning algorithm before trying it out, you may be surprised by the results anyway. Come to the process with no preconceptions and you are likely to find the best answer.
Stay Tuned
Let’s take a breather. In the next post I’ll review the rest of the paper, outlining a few more of the things to do and the things to avoid when modeling your data. See you then!
We’re happy to announce that we’ve open-sourced our “fancy” streaming histograms. We’ve talked about them before, but now the project has been tidied up and is ready to share.
The histograms are a handy way to compress streams of numeric data. When you want to summarize a stream using limited memory there are two general options. You can either store a sample of data in hopes that it is representative of the whole (such as a reservoir sample) or you can construct some summary statistics, updating as data arrives. The histogram library provides a tool for the latter approach.
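For a feel of how a summary can be updated as data arrives while memory stays bounded, here is a simplified sketch. It assumes a merge-the-two-closest-bins strategy, which is one well-known way to build streaming histograms; the class and method names are hypothetical, and the library's actual internals and API may well differ.

```python
import bisect


class StreamingHistogram:
    """Keeps at most max_bins [mean, count] pairs, sorted by mean."""

    def __init__(self, max_bins=32):
        self.max_bins = max_bins
        self.bins = []

    def insert(self, value):
        # Every new point starts as its own bin of count 1.
        bisect.insort(self.bins, [float(value), 1])
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        # Find the adjacent pair of bins with the smallest gap between means.
        i = min(range(len(self.bins) - 1),
                key=lambda j: self.bins[j + 1][0] - self.bins[j][0])
        (m1, c1), (m2, c2) = self.bins[i], self.bins[i + 1]
        total = c1 + c2
        # The count-weighted mean lies between m1 and m2, so order is preserved.
        self.bins[i:i + 2] = [[(m1 * c1 + m2 * c2) / total, total]]

    def count(self):
        return sum(c for _, c in self.bins)

    def mean(self):
        # Assumes at least one value has been inserted.
        return sum(m * c for m, c in self.bins) / self.count()
```

Merging by weighted mean keeps the overall count and mean exact (up to floating-point rounding); quantities such as the median and percentiles become approximations whose quality depends on the number of bins.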
The project is a Clojure/Java library. Since we use a lot of Clojure at BigML, the readme’s examples are all Clojure oriented. However, Java developers can still find documentation for the histogram’s public methods.
Since the histogram provides an approximation of the data’s original distribution, you can find all the basic stats you’d expect, such as mean, median, and arbitrary percentiles. You can even generate functions for the PDF and CDF. Below we show the library in action (using a Clojure REPL) while exploring a histogram built on 200K samples from a normal distribution (mean of 0, variance of 1).
examples> (def hist (reduce insert! (create) ex/normal-data))
examples> (mean hist)
-0.0026
examples> (median hist)
-0.0009
examples> (variance hist)
0.9985
examples> (sum hist 0)
100077.6513
examples> (density hist 0)
80165.2707
examples> (percentiles hist 0.5 0.95 0.99)
{0.5 -0.0009, 0.95 1.6446, 0.99 2.3263}
examples> (map (cdf hist) [-2 0 2])
(0.0233 0.5004 0.9775)
examples> (map (pdf hist) [-2 0 2])
(0.0558 0.4008 0.0537)
The histograms have a few more tricks. Along with the primary variable the histograms can track information about secondary numeric or categorical variables. We use this feature when growing decision trees, but it could be useful whenever you want to watch for correlation between variables in a streaming context. For example, you could build a histogram on time-of-day for HTTP requests and also track the response time. With that, you might see that evenings show a spike in the number of requests and a corresponding increase in response time.
If you’re interested, there’s a lot more info on the histograms in our previous post and on the project page. As always, feel free to share questions and comments. Thanks!
Clone or fork the project here:
BigML’s decision tree visualizations are a powerful way to gain insight into your predictive models. Today, we’re making them even better with enhanced filtering features, enabling you to locate branches and predictions that wouldn’t otherwise be easy to find. If you need a quick overview of the basics of how our decision trees work, please check out this earlier blog post. The new filtering feature is available for your personal BigML models, and is also available for any given model in our gallery.
Filter by Support, Prediction, and Confidence
The new controls for support, prediction, and confidence filtering will show up below the original toolbar controls. Each control specifies a filter, and changing the filters will determine which branches are shown in the decision tree. For instance, the tree in the above image is predicting whether or not a given person makes more than $50,000. We can easily change the filter to only show branches that lead to a prediction of over $50,000 by adjusting the corresponding prediction filter (Note: this filter was previously available, we’ve only relocated it with the other new filter controls).
In addition to filtering by the prediction, it is also possible to filter by the support, or percentage of training data that a given branch received. This filter is useful for identifying branches that were well supported by training data, and the following image shows the tree after a minimum support threshold of 1% was applied.
The final way of filtering is by confidence, which is a measure of certainty for the given model on a given branch. Check our recent blog post on this for more information. This filter is useful for identifying branches that consistently contained certain outcomes, which in turn lead to predictions that are very likely given the training data. In the next picture, we see that by adjusting the minimum confidence filter to 95%, we’ve eliminated all but two of the remaining branches. The currently highlighted branch indicates that individuals who are married, have more than 11 years of education, have capital gains of more than $5,095, and are younger than 61 are very likely to make more than $50,000.
Keep in mind that all of the filters are cumulative. So, each branch must pass each filter in order to be displayed. If you want to reset all of the filters, you can simply press the “Reset Filter” button to return the tree to its original state, as shown in the following picture.
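In code terms, "cumulative" simply means a logical AND across the filters. The sketch below assumes each branch is a plain dictionary with hypothetical `support`, `confidence`, and `prediction` keys; it illustrates the filtering logic only and is not BigML's implementation or API.

```python
def passes(branch, min_support=0.0, min_confidence=0.0, prediction=None):
    """A branch is shown only if it satisfies every active filter."""
    return (
        branch["support"] >= min_support
        and branch["confidence"] >= min_confidence
        and (prediction is None or branch["prediction"] == prediction)
    )


def filter_branches(branches, **filters):
    """Calling this with no filters is the equivalent of 'Reset Filter'."""
    return [branch for branch in branches if passes(branch, **filters)]


# Placeholder branches; the numbers are made up for illustration.
branches = [
    {"prediction": ">50K", "support": 0.040, "confidence": 0.97},
    {"prediction": ">50K", "support": 0.004, "confidence": 0.99},
    {"prediction": "<=50K", "support": 0.300, "confidence": 0.92},
]
shown = filter_branches(branches, min_support=0.01, min_confidence=0.95, prediction=">50K")
```

Here only the first branch survives all three filters at once; relaxing any single filter can only add branches back.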
Finding “Interesting” Branches
In addition to the reset filter button, we’ve included two additional buttons that let you quickly set filters for certain interesting branches. The “Rare Interesting Patterns” button will select all branches of the tree that have less than 1% support while having at least 50% confidence (or, in regression trees, that are in the lowest decile for expected error).
These “rare” branches do not have as much support, but have high confidence. They might be interesting patterns of behavior in your data that are worth exploring. Likewise, the “Frequent Interesting Patterns” button will filter for branches that have support above 1%, with high confidence, as shown in the following image.
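The two presets can be thought of as ready-made filter combinations. This sketch covers classification trees only and uses hypothetical branch dictionaries with `support` and `confidence` keys. The 1% support cutoff and the 50% confidence floor for rare patterns come from the description above; reusing the same 50% floor for frequent patterns is an assumption, since "high confidence" is not given a number there.

```python
def rare_interesting(branches, support_cutoff=0.01, min_confidence=0.5):
    """Branches with little support that are nevertheless confident."""
    return [b for b in branches
            if b["support"] < support_cutoff and b["confidence"] >= min_confidence]


def frequent_interesting(branches, support_cutoff=0.01, min_confidence=0.5):
    """Well-supported branches that are also confident
    (min_confidence here is an assumed value)."""
    return [b for b in branches
            if b["support"] > support_cutoff and b["confidence"] >= min_confidence]


# Placeholder branches; the numbers are made up for illustration.
branches = [
    {"prediction": ">50K", "support": 0.004, "confidence": 0.99},
    {"prediction": "<=50K", "support": 0.300, "confidence": 0.92},
    {"prediction": ">50K", "support": 0.002, "confidence": 0.30},
]
rare = rare_interesting(branches)
frequent = frequent_interesting(branches)
```

With these made-up values, the first branch is the only "rare" pattern, the second is the only "frequent" one, and the third is dropped by both presets for lack of confidence.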
Whether you’re only interested in understanding predictions for a certain class, you’re looking for the most confident predictions, or you’re looking for branches that are “off the beaten path”, our new filter interfaces have you covered!