
How BigML Finds Important Variables in Wide Datasets

This blog post is based on a talk I gave at the Dare2Data conference in Madrid.

I recently found a fascinating sociology survey with more than 39,000 responses to almost 400 questions. The survey, which has been given in the United States since 1972, covers a wide range of topics. Besides demographic info like age, gender, race and income, the survey also covers personal beliefs (“Should racists be allowed to teach college?”), living situation (“Have you been too tired to do housework recently?”) and life experience (“Have you ever injected illicit drugs?”).

While it’s great to have a dataset that’s so, um, rich, most of the variables are simply not relevant to whatever it is I want to predict. If I’m predicting whether your income is higher or lower than the United States median of $50,000, it doesn’t really matter if you’ve received a traffic ticket for a moving violation, or if you think marriage counseling is scientific. (Yes, those are actual questions.)

This is where BigML comes in. Because our algorithm does a “greedy” search through the data, examining every input individually to see how well it predicts the output, it excels at finding the needle of insight in a haystack of irrelevance. BigML actually does check whether moving violations predict income, but quickly learns that marital status, education, employment and age are much more useful.
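BigML's actual algorithm is more sophisticated, but a greedy per-feature search of this kind is commonly scored with a measure such as information gain, as in this minimal, illustrative sketch on invented toy survey data:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction from splitting the data on one feature."""
    base, n, split = entropy(labels), len(labels), {}
    for row, label in zip(rows, labels):
        split.setdefault(row[feature], []).append(label)
    return base - sum(len(g) / n * entropy(g) for g in split.values())

# Toy survey: education separates incomes; traffic tickets do not.
rows = [
    {"education": "college", "ticket": "yes"},
    {"education": "college", "ticket": "no"},
    {"education": "high school", "ticket": "yes"},
    {"education": "high school", "ticket": "no"},
]
labels = [">50k", ">50k", "<=50k", "<=50k"]

gains = {f: information_gain(rows, labels, f) for f in ("education", "ticket")}
# education gets the maximum gain of 1.0; ticket scores 0.0
```

A greedy search simply computes such a score for every input column and keeps the best, which is why irrelevant questions are examined once and then discarded.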

Top 10 Variables for Predicting Income

Of course, if you change what you’re trying to predict, the list of important variables changes too. At Dare2Data, I tried predicting political beliefs instead of income, with interesting results. (Since I excluded moderates from the training set, it’s more accurate to say that I’m predicting strongly held political beliefs.)

For example, if you meet these five criteria, then you identify as conservative more than 85% of the time:

  1. You disapprove of homosexuality (or don’t respond to the question);
  2. You disapprove of sex before marriage (or don’t respond to the question);
  3. You are white;
  4. You go to church almost every week;
  5. You live in a single-family detached house (a proxy for living in the suburbs).

Of the 2,550 people who meet these five criteria, 2,224 (more than 85%) identify as conservative. This group, who might call themselves “social conservatives”, are an impressive 19% of conservatives in the entire dataset.

The model even finds a sixth factor: if you are also Protestant, but not United Methodist, then you are even more likely to be conservative.  At first I thought this was just noise, but there is actually a large liberal wing within the United Methodist Church that supports same-sex marriage. Amazingly, BigML is able to find this nuance in the data—talk about a needle in a haystack!

On the liberal side, there’s a group that doesn’t disapprove of homosexuality, does disapprove of the death penalty, and is strongly pro-choice. This group is about 85% liberal, accounting for 12% of all liberals in the dataset. Again, it’s remarkable that BigML can find groups of people that behave in such recognizable ways, even though it knows nothing about politics, religion, or other touchy subjects.

Once again, only a small subset of the 400 variables actually matters for prediction:

Top 10 Variables for Predicting Political Beliefs

Hopefully I’ve conveyed how great BigML is at sifting through a dataset with lots of variables.  This type of “wide” dataset pops up all the time in business, especially when examining customer behavior, and traditional tools like Excel or Tableau simply aren’t designed to handle the analysis. By examining the full richness of your data, BigML helps you focus on what’s really important—even if it’s traffic tickets.

Advancing Machine Learning integration with Apple ResearchKit and HealthKit


At BigML we are excited to announce BigMLKit, a new open source framework for iOS and OS X that blends the power of BigML’s best-in-class Machine Learning platform with the ease and immediacy of Apple technologies.


BigMLKit brings the capability of “one-click-to-predict” to iOS and OS X developers by making it really easy to interact with BigML’s REST API through a higher-level view of a “task.” A task is, in its most basic version, a sequence of steps carried out through BigML’s API. Each step has traditionally required a certain amount of work, such as preparing the data, launching the remote operation, waiting for it to complete, collecting the right data to prepare the next step, and so on. BigMLKit takes care of all of this “glue logic” for you in a streamlined manner, while also providing an abstracted way to interact with BigML and build complex tasks on top of our platform.
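BigMLKit itself is written for iOS and OS X, but the "glue logic" a task encapsulates can be sketched in a few lines of Python. Everything here (the `api` object and its `create`/`status` methods, the status names) is illustrative, not BigMLKit's actual API:

```python
import time

FINISHED, FAULTY = "finished", "faulty"

def run_task(api, steps, data, poll_seconds=0):
    """Run a sequence of dependent steps: each step's output feeds the next.

    `api` is assumed to offer create(kind, input) -> resource_id and
    status(resource_id) -> FINISHED, FAULTY, or a pending state.
    """
    resource = data
    for kind in steps:
        resource = api.create(kind, resource)        # launch the remote operation
        while api.status(resource) not in (FINISHED, FAULTY):
            time.sleep(poll_seconds)                 # wait for it to complete
        if api.status(resource) == FAULTY:
            raise RuntimeError(f"step {kind!r} failed")
    return resource                                  # ready for the next step

# A "one-click" prediction task then reduces to something like:
# model = run_task(api, ["source", "dataset", "model"], "iris.csv")
```

The point is that the launch/poll/collect loop is identical for every step, so the framework can own it once rather than every app reimplementing it.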

BigML already offers a variety of tools and libraries to make it easy to integrate BigML with whatever environment you might be working in. This includes a REST API, as well as bindings that provide a higher-level view of it from the most popular programming languages, including Python, Node.js, Objective-C, and so on.  We also provide more advanced tools such as our powerful bigmler, a veritable command-line Swiss Army knife for machine learning, and we have many more surprises in the works that will make machine learning capabilities ever more accessible.

The introduction of HealthKit put the iPhone into the rapidly growing field of health tracking devices that can be used to monitor daily activities that impact one’s health. The Apple Watch will certainly fuel the trend towards health-oriented applications, and the recent open-sourcing of ResearchKit by Apple is providing further momentum for this to extend into medical research.

All of this surely creates a powerful constellation, but it leaves behind a key factor which is not included in the solution that Apple provides with HealthKit and ResearchKit: an easy way to make sense of the collected data. This is where BigML is happy to enter the picture with BigMLKit, which we believe will be a key enabler for a new class of applications in health care and medical research that will empower researchers, doctors, hospitals and health professionals to learn from health data collected via HealthKit and ResearchKit.

BigMLKit thus reaffirms BigML’s commitment to enabling new machine-learning-powered applications on any platform – and adds a special focus on the Apple ecosystem, where the combination of existing and emerging devices and solutions (such as the iPhone, HealthKit, Apple Watch and ResearchKit) promises to revolutionize health care and health research.

BigMLKit is still a very young project that can be found on GitHub. We welcome your feedback and we really appreciate your pull requests. Stay tuned for more updates, including a follow-up post with more information about the way you can integrate BigMLKit in your app.

Democratizing Machine Learning: The More, The Merrier!

The machine learning marketplace is heating up. The latest news in the machine learning front was Amazon’s launch of Amazon Machine Learning, which follows a few months on the heels of the commercial release of Azure Machine Learning from Microsoft.  These forays from technology stalwarts (along with IBM Watson) show that the marketplace is ready for machine learning at scale, which certainly reflects the growing business imperative to be able to make smarter decisions from Big Data backends. And more companies providing machine learning solutions is good for the industry at large:  it provides customers with more choices, and will further hasten the pace of innovation from machine learning providers, including BigML.

While BigML clearly isn’t as big as Microsoft, Amazon and the like, we do have the benefit of perspective, as we were the first company to bet on democratizing machine learning way back in 2011. (At that time the Google Prediction API existed, but it was oriented only to developers and hasn’t evolved much since.)  Rather than pointing out that imitation is the sincerest form of flattery (and yes, we are flattered!), we think this is a good opportunity to highlight some top attributes of BigML in relation to emerging solutions on the marketplace.


BigML provides a robust, full-featured and scalable platform which has been informed by feedback from over 17,000 users who have created tens of millions of predictive models and machine learning tasks that have supported a countless number of predictions.

  1. Key differentiators of the BigML platform include:
  • Support for both supervised and unsupervised learning techniques:  in addition to classification and regression tasks solved by interpretable decision trees or ensembles for top tier performance,  BigML supports cluster analysis and anomaly detection.  And our 2015 roadmap is chock full of added algorithms and techniques for data exploration.
  • Best-of-market interface and visualizations: “Beautiful,” “wow” and “amazing” are typical reactions I’ve heard while presenting BigML to customers and at conferences.  Check it out for yourself and let us know of another interface that is as rich, enjoyable and intuitive as BigML’s.
  • Full-featured REST API for programmatic access to advanced ML capabilities, with bindings in several languages:  as beautiful as our interface may be, the brawn and brains of BigML rests in our open API that developers and analysts alike can use to quickly create predictive workflows and other machine learning tasks.
  • Easy sharing of resources and models, including the ability to export models from BigML locally and/or for incorporation into related systems & services:  want to export a model from Azure or Amazon ML?  Good luck with that.  BigML makes it easy to export your models via the interface or API, and you’re free to use your models wherever you wish.
  • BigML Private Deployments can be implemented in any cloud and/or on premise:  As BigML penetrates deeper into the enterprise, our willingness and ability to run in a corporate datacenter has become a critical differentiator.  In addition, we’ve implemented BigML not just on AWS, but also in Azure and other public and private clouds.
  • In-platform feature engineering and data transformations:  BigML’s Flatline makes it easy to extend your dataset and create new features without having to go back to your source – both in the BigML interface and programmatically, using a rich set of predefined, ML-aware functions or building your own.
  2. BigML is suitable for developers and enterprises alike:
  • Pricing starts at $30/mo for individual users & developers – and you can actually use BigML for free in our Developer mode for tasks under 16MB.
  • Enterprises can purchase fully loaded “custom” subscriptions (bundled with training, support and more) and/or implement a BigML Private Deployment – either in the cloud or behind their firewall.
  • All of these approaches (subscriptions or Private Deployments) include unlimited machine learning tasks along with the ability to export models.
  • BigML never charges subscribers for predictions against their own models (in contrast to Azure and Amazon).
  • With BigML subscriptions you can train models as many times as you want — and in parallel — at no extra fee.
  3. BigML offers customers both an advanced analytics platform as well as a foundation for development and deployment of predictive applications:
  • It was almost two years ago when Mike Gualtieri at Forrester stated “predictive apps are the next big thing” – and we here at BigML are seeing the reality of that vision on a daily basis both with ISVs and with enterprise developers.
  • As BigML models can be exported, they can easily be incorporated into apps and services – enabling developers to focus on their solution rather than on creating and maintaining ML algorithms.
  • BigML offers expert services (directly and through our partners) to help with development and deployment of predictive apps

Beyond the tangible differences listed above, as a nimble, hungry company BigML will constantly innovate at a furious pace to meet and exceed our customers’ needs.  We’re passionate about supporting our users and engage with our enterprise customers on a very integrated basis to ensure not only the success of their implementations, but also that our platform evolves according to current and emerging business requirements.

Want to learn more about BigML and/or get an update on our latest & greatest features?  Contact us and we’ll be happy to run you through a demonstration and discuss our various engagement options.  Or, you can simply get started today!

PAPIs Connect: Europe’s First Machine Learning Event for Decision Makers

A few weeks ago we told you about PAPIs’15, the 2nd International Conference on Predictive APIs and Apps, taking place on August 6-7, 2015 in Sydney, Australia. BigML was a proud sponsor of PAPIs’14 and we look forward to meeting the community again in August.

We’ll also have more opportunities to meet with predictive APIs and predictive apps enthusiasts with the new PAPIs Connect series of events. PAPIs Connect complements the annual PAPIs conference by focusing more on business cases and applications with the aim of educating decision makers about the possibilities of machine learning. BigML will be sponsoring the first edition of PAPIs Connect, which will take place on May 21, 2015 in Paris, France.

PAPIs Connect'15

For the predictive revolution to happen, it is essential to have tools like BigML that lower the barrier to entry for machine learning. Knowing how to use this new technology is not enough, though: we also need to connect it to the domains in which it can have an impact. To do this, it is important to know how to target the right problems that will allow us to create business value from data through machine learning.

PAPIs Connect attendees will gain a business understanding of machine learning and of its importance for their organizations. They will discover what others are doing with predictive technologies, which will likely inspire them to develop their own use cases. Connect is also a great opportunity to meet thought leaders and experts who have used data to deliver an impact on their organizations. Moreover, BigML’s VP of Data Science David Gerster will be showcasing the unique automatic anomaly detection capability recently introduced by BigML!

You can see a preliminary version of the program on Lanyrd and can register for the Paris event at the early bird rate until April 17th. In addition, if you have an interesting case study or application built using BigML that you’d like to share with the rest of the world, please let us know and we’ll get you invited to PAPIs Connect in Paris or PAPIs’15 in Sydney!

BigML Roadshow Down Under!

This week, I had the honor of presenting at AIIA’s cross-industry luncheon here in Melbourne, thanks to the support of BigML’s local partner GCS Agile. The Australian Information Industry Association (AIIA) is the peak representative body and advocacy group for the ICT (Information and Communications Technology) industry in Australia. For over 35 years, its mission as a not-for-profit organization has been to advocate for, promote, represent and grow the ICT industry in Australia, with over 400 member organizations spanning hardware, software, and services companies.

AIIA Event Melbourne

This year, the AIIA is running three cross-industry events under the umbrella theme of ‘Building the Digital Economy’, featuring the utility, airport, transport, logistics, retail and finance sectors, with separate sessions exploring intelligent operations, connectedness and the digitization of the customer experience. These themes align with the Victorian government’s goals of achieving sustainability, productivity and citizen engagement through technology, as discussed in its 2014 ICT Strategy and Digital Strategy.

‘Intelligent Operations’ is a mouthful of a term, but it really describes how intelligent technologies, including machine learning and predictive analytics, can be used by businesses to drive operational efficiencies, employee productivity and improved customer service. Guest speakers at our luncheon included Paul Bunker, Manager, Business Systems & ICT, Melbourne Airport; Sue O’Connor, Deputy Chair, Goulburn Valley Water Corporation; as well as myself (Atakan Cetinsoy, V.P. – Predictive Applications, BigML). After Rebecca Campbell-Burns of AIIA set the stage for the afternoon, Mr. Bunker took the podium, making a strong case for how Melbourne International Airport’s track record of operational excellence has added to the continued economic vibrancy of the state of Victoria. He stressed that they run a 24/7 operation, where cargo planes carry precious commodities to Asian destinations every night after the passenger airliner traffic subsides. Managing physical assets efficiently in this fast-paced context, while targeting a world-class traveler experience from the point of arrival until departure, requires an analytical blanket that can adapt to sudden changes caused by inclement weather or tightened security, which makes for a very interesting predictive analytics challenge.

Sue O’Connor’s presentation focused on Goulburn Valley Water’s efforts to maintain a very affordable price point for drinking water, the most basic of human needs, at a time of environmental challenges, all the while making the necessary infrastructure investments to ensure the ability to meet growing demand now and in the future despite tight capital and operational expenditure budgets.  Sue went on to stress that they intend to invest in Internet-enabled sensor networks to the extent that there is a clear business case and attractive ROI.

As I alluded to in my presentation, the utility and aviation industries have a huge economic upside in efficiency terms ($95 billion USD in savings, as per a recent GE study) from being able to better manage their existing infrastructure with the help of real-time sensor measurements.  As long as there is a way to analyze and interpret this tsunami of data in order to detect key signals, business value can be drawn in multiple ways. For instance, it may be wise to prioritize big data initiatives targeting cost savings first, due to their clear return on investment.  Predictive maintenance schemes can avoid unnecessary dispatches of field maintenance personnel, saving utilities significant amounts in costs.  However, sensor data can also be interpreted in ways that help launch completely new context-sensitive value-added services that can create new revenue streams altogether. Luckily, machine learning is here to help with all these use cases. BigML’s “API first” approach to massively scaling carefully curated and well-proven machine learning algorithms has been designed to streamline the process from raw data ingestion to real predictive insights. If interested in the topic, you can view my presentation deck on Slideshare.

Up next for us is a trip to Sydney, where we will be presenting at two different events on Wednesday (March 25, 2015). Feel free to come by and join us at either forum by following the links below.

We will do BigML demos followed by interactive discussions on the promise of machine learning in Australia.  It should be fun!

Topic Modeling Coming to BigML

Machine learning and data mining play very nicely with data in a row-column format, where each row represents a data point and each column represents information about that point.  It’s a natural format, and is of course the basis for things like spreadsheets, databases, and CSV files.

But what if your data isn’t so conveniently formatted?  Let’s say you have an arbitrary pile of documents, like product reviews, and you’d like to classify each one.  A simple thing to do would be to use word counts as features, but then you’re forced to make arbitrary decisions about which words are important.  If you just use all words, you end up with thousands or maybe tens of thousands of features, which generally decreases the efficiency of machine learning.  Moreover, simply counting words gives no information about the context in which the word was used.
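A plain bag-of-words encoding makes the feature explosion concrete: every distinct word across the collection becomes a column. A minimal sketch using only the standard library:

```python
from collections import Counter

def bag_of_words(documents):
    """Encode each document as raw word counts over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    return vocabulary, [
        [Counter(doc)[word] for word in vocabulary] for doc in tokenized
    ]

reviews = ["great product great price", "terrible product"]
vocab, features = bag_of_words(reviews)
# vocab    -> ['great', 'price', 'product', 'terrible']
# features -> [[2, 1, 1, 0], [0, 0, 1, 1]]
```

With two short reviews the vocabulary is four words; with thousands of real reviews it balloons into thousands of mostly-zero columns, which is exactly the inefficiency described above.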

Gettysburg Topic Markup

Does machine learning know a better way?

Thankfully, there are technologies called topic models which take some steps towards solving this problem.  The general idea is to look for “topics” in the data, which are essentially groups of words that often occur together (this is a gross oversimplification, but gives the correct flavor).  For example, in a collection of news articles, you may discover a topic that has the words “Obama”, “Congress”, and “President”, which would correspond to the real-world topic of politics.  We can then assign each document a score for each topic, indicating that this document is “well-explained” by that topic.  When we do this, we transform possibly tens of thousands of words into a small number (~10-100) of features, each one packed with information.
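As a toy illustration of the scoring step only (a real topic model such as LDA learns the topics from the data; here the topic word groups are hand-supplied), a document's score for a topic can be taken as the share of its words that the topic explains:

```python
def topic_scores(document, topics):
    """Score a document against each named topic: the fraction of its
    words that belong to that topic's word group."""
    words = document.lower().split()
    return {
        name: sum(w in group for w in words) / len(words)
        for name, group in topics.items()
    }

# Hand-supplied topics for the sake of the example.
topics = {
    "politics": {"obama", "congress", "president"},
    "sports": {"game", "score", "team"},
}
doc = "congress and the president met before the game"
scores = topic_scores(doc, topics)
# scores -> {'politics': 0.25, 'sports': 0.125}
```

Instead of thousands of word counts, the document is now summarized by one number per topic, which is the dimensionality reduction the paragraph above describes.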

This is a fairly general way of thinking about this problem.  For example, you could use the same technology on shopping baskets (arbitrary lists of product serial numbers, say), and the “topics” would be groups of serial numbers that are often purchased together.  The main limitation on the usefulness of this is the average length of each document.  Because we’re relying on word co-occurrence, we’d like our documents to be as long as possible so that we have lots of co-occurrences to work with.  Twitter-length documents are around the point where this stops being very useful.

All in all, topic modeling is basically just a fancy, automated form of feature engineering that often works nicely on arbitrarily structured documents.

Enter AdFormare

As a proof of concept, I’ve developed a small service called AdFormare.  To work with the website, you upload a collection of documents and we do some processing to figure out the topics in the collection, and the topic scores for each document.  As a bonus, we produce a nice visualization that shows you things like which topics often occur together, and shows you examples of documents with high scores for each topic.

AdFormare Topics

Without going too deeply into it, here’s a sample visualization produced from a large collection of movie reviews:

And here’s a little tutorial that tells you what you’re looking at:

Coming Soon To BigML

We’re going to integrate the guts of this technology into BigML, so you can do topic modeling on the text fields in your BigML datasets, but I’m soliciting people to try this out on their own document collections so we can work out the bugs before we deploy.  If you’ve got a collection of documents you’d like to see processed like this, by all means e-mail me (

The Need for Machine Learning is Everywhere!


In my work, which is predominantly information technology, the need for Machine Learning is everywhere. And I don’t mean just in the somewhat obvious ways like security or log file analysis. Consider: my work experience goes back to the tail end of the mainframe era. In fact, one of my first jobs was networking together what were at the time considered powerful desktop computers, as part of a pilot project to replace a mainframe. Since then I’ve seen the birth of the internet and the gradual transition of computing resources to the cloud. The difference between those two extremes is that adding capacity used to mean messing around in ceiling tiles for hours, whereas now, when I need additional capacity, I run a few lines of code and instantiate machines nearly instantly in the cloud.

This ability to scale up quickly to meet rising demand, and, I would argue even more importantly, to scale down to save costs, is a huge advantage. Everyone who is familiar with the cloud knows this, but there is more. Once the allocation of resources is programmatic, the next logical step is to make that allocation intelligent: to not just respond to current demand, but to be able to predict demand in real time and have an optimal infrastructure running at the exact instant that it is needed.
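As a deliberately naive sketch of that idea (a real system would use a proper forecasting model and actual cloud APIs; every name and number here is invented for illustration):

```python
from math import ceil

def forecast_demand(recent_requests, window=3):
    """Naive forecast: the average of the last `window` observations."""
    tail = recent_requests[-window:]
    return sum(tail) / len(tail)

def instances_needed(recent_requests, per_instance_capacity, headroom=1.25):
    """Provision enough instances for forecast demand plus safety headroom."""
    demand = forecast_demand(recent_requests) * headroom
    return max(1, ceil(demand / per_instance_capacity))

# Requests per minute over the last six minutes; each instance handles 100.
history = [220, 260, 300, 340, 420, 500]
# instances_needed(history, per_instance_capacity=100) -> 6
```

The interesting part is the forecast: replace the moving average with a learned model and the same loop becomes the "intelligent allocation" described above.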


Indeed, the need for Machine Learning is everywhere!

But what is driving this demand for Machine Learning, why now? One factor, as in the previous example, is the ready availability of cheap machine power, in particular the cloud, which is continually making computation cheaper. As computation gets cheaper, the impossible becomes possible. And as things become possible, people tend to start doing them and you end up with people like me making intelligent infrastructure.

But another, related factor is the explosive growth of data in the last few years. The interesting thing is that there has been a lot of focus on collecting, storing and even doing computation with so-called Big Data, but much less discussion of how to actually derive useful insights from it. As it turns out, Machine Learning is particularly well suited to the task.

As companies with big data initiatives realize this, it is driving a big demand for Machine Learning. How big? To answer that, I think it is important to realize that this is not just about mining data to improve the company’s bottom line, although that’s certainly a probable outcome. No, this is a question of survival. We often see startups using a data-centric approach to disrupt existing markets. Entrenched companies that don’t understand this trend risk extinction.

So we have a growing demand for Machine Learning, but what work needs to be done? Do we need new and better algorithms? Probably – I mean, it’s always possible that someone will invent a new algorithm tomorrow that blows everything else away, and it would be really hard to argue that this is a bad thing. On the other hand, there are already a lot of really good algorithms available, and lots of problems ready to be solved.

In fact, I recently read a paper titled “Do we need hundreds of classifiers to solve real world problems?”. In this paper, the authors evaluated the performance of 179 classifiers on a variety of datasets. That’s 179 different algorithms available just to handle a classification problem. Interestingly, the best performer in that paper was the random decision forest (RDF), which is hardly new.

Now don’t get me wrong, the need for newer algorithms to advance the science of Machine Learning will never go away. But there is a huge amount of work that needs to be done right now to move existing algorithms from the lab and into the practical world. To make the existing algorithms more robust and consumable. This work is more important than the elusive perfect model algorithm because the reality is that for most projects, a perfect model will not be the final product; the model will form only a piece of the entire application that most companies need to implement and it’s often more important to deliver results. After all, data often has only a temporal value.

The need for Machine Learning is everywhere, and BigML is here to deliver it.

BigML + Informatica = Connected Machine Learning for the Enterprise

We’re very excited to share the news that we’ve collaborated with Informatica to release the Informatica Connector for BigML. This connector debuted yesterday at Informatica’s Data Mania event, where BigML was named a winner of their Connect-a-Thon competition!

BigML INFA Connector

The connector, which is being made available through the Informatica Cloud Marketplace, can be used to access, blend, structure and connect data from any app in the Informatica Cloud directly to a source in BigML through a simple visual wizard flow. This opens the gates for data stored in your ERP, CRM, HR, Marketing Automation, Business Intelligence, Procurement, Financial or many other mission-critical enterprise systems from marquee vendors (e.g. Salesforce, Oracle, NetSuite, Workday, Birst, Concur, Coupa, etc.) to be pulled directly into BigML.  This of course augments our current source import options, such as pulling from local files, using web links for remote resources and/or our web-based connectors with Dropbox and the recently announced Google Cloud integration.

In addition, you will be able to use the connector in conjunction with BigML’s API to incorporate actionable models created in BigML into your enterprise applications and services.  As you know, creation of such end-to-end machine learning workflows is a critical component for development and deployment of predictive apps. With BigML Private Deployments available either on-premises or in the cloud, BigML is a highly versatile machine learning platform for enterprises of all shapes and sizes; the Informatica Connector makes our offering even more powerful and flexible.

We’ll be sharing more details on the connector and our partnership with Informatica in the near future, so stay tuned. In the interim, feel free to contact us to learn how you can tap into this exciting new functionality today!

Machine Learning Coming to your Mac OS X


Here at BigML our mission is to make machine learning easy, beautiful and understandable to everyone. We work hard to make sure that BigML’s REST API and web-based interface offer very intuitive workflows for many types of machine learning tasks. Today, we are proud to announce the BigML native app for Mac OS X, which streamlines machine learning workflows even further. With BigML for Mac OS X you can now generate predictions from your data by just dragging and dropping a file. It’s as simple as that: you do not even need to click once!

There are several key points we’d like to emphasize before giving you more details:

  • Optimized for Cloud Computing: Building machine-learned models is a computationally expensive task, because it tends to go through many iterations in an effort to achieve a more accurate model. BigML leverages the cloud’s power, and the native Mac OS X app is capable of handling all the necessary cloud resources.

  • Fast and Cost-free Predictions: Using our Mac OS X client, you can get predictions from your local data not only faster than with cloud-only alternatives, but, as a bonus, it will also cost you a sum total of NOTHING! On top of that, BigML models are white-box, meaning that you can interact with them to better understand why a specific prediction was made.

  • Workflow Templates: Creating Machine Learning models from your data involves some basic steps going from raw data to a final model structure. The app takes care of the workflow and creates all intermediate resources. It will also let you access, modify or reuse those resources to perform other evaluations or optimizations.

Introducing BigML for Mac OS X

BigML for Mac OS X goes to great lengths to ensure that you have all the resources you need in a single, comprehensive view.

BigML for Mac OS X

As seen above, BigML for Mac OS X’s main window has three areas:

  • Project Area: Allows you to create a new project or select an existing one that you can modify.
  • Workflow Area: This section lets you monitor the current state of any ongoing operation, select a workflow type and start an operation. The Workflow Area is an “active” area, in that you can drag and drop files or BigML resources onto it to have them processed.
  • Resource Browser: Makes it possible to look up any existing resource (i.e. datasets, models, clusters etc.), inspect their state, and reuse them to start new workflows.

The most important part of BigML’s main window is the Workflow Area. If desired, it is possible to shrink down the Resource Browser area so it does not take any space on your desktop.

Your First Prediction

Creating your first prediction takes only a few seconds. Just locate a data file and drag it onto the central workflow area. After dragging the data file onto the workflow area, you will observe how BigML connects to your remote account to create all the intermediate resources, such as a data source, a dataset, and finally a model, which will eventually be used to generate your predictions. The advantage of BigML for Mac OS X is that you do NOT necessarily need to understand how a model is created and what intermediate steps are taken (or, for that matter, how predictions are computed using the model): the app will take care of that for you.

Wine Sales Predictions

Eventually, the prediction window will be displayed. There, you will be able to change your model's input fields to produce a new prediction. BigML recalculates the prediction “on the fly” as you change the values of the variables shown in the prediction window. It is important to keep in mind that your models are created remotely in the cloud, but they are also replicated locally, so that predictions based on your inputs can be computed locally. That means no network access is required to make new predictions, nor are any BigML credits spent.

Beyond Simple Workflows

BigML for Mac OS X not only supports tree-based models but also enables cluster analysis (and, very soon, anomaly detection). Switching from creating models to creating clusters is very easy: you just select the corresponding workflow. Now, if you drag the same file as before onto BigML's main window, you will be able to make new predictions using BigML’s unsupervised clustering algorithms. BigML for Mac OS X is smart enough to automatically update all related resources and workflows in the background whenever you make UI selection changes. This saves you the headache of manually maintaining the proper state of each resource and instead lets you concentrate on getting valuable insights from your data. In the near future, you will be able not only to configure your own workflows, but also to personalize them.

BigML Cluster Workflow

Remote and Local Resources

As mentioned above, when you drag a source file over to BigML, a set of resources is created remotely that is also mirrored locally. The Resource Browser allows you to keep an eye on the resources as you create, delete or rename them, and even when you use them to create new predictions, so you do not have to start over every time.

To access the Resource Browser, just click the down-arrow in the bottom-right corner of the BigML main window. You can select a specific resource type in the Resource Browser to see what resources of that type you have created. If you right-click on a resource, you will get a pop-up menu offering a few options, such as renaming the resource, deleting it from the server, etc. Finally, you can drag any resource into the Workflow Area, where it will be used as a starting point to create a new model or cluster. This gives you an alternate way to create a prediction.

How Can I Get It?

We are about to kick off the BigML for Mac OS X private beta. If you are interested, drop us a note and we’ll include you in our list of beta testers as soon as the private beta starts.

A Note for Developers

BigML for Mac OS X has been implemented using the BigML iOS bindings, that is, plain-vanilla calls to BigML’s REST API. So if you are in a hacking mood, imagine how easy powering your own Mac OS X application with machine learning can be.

What’s Next? Tell Me about the ‘Big Picture’ Already

You bet! It is our conviction that as soon as 5 years from now, we will be living in a world where interfacing with machine-learned models will be as natural and seamless for a business analyst as interfacing with an iPhone is for any of us today. If we succeed, it will all be as natural (and taken for granted) as the air around us. BigML’s native Mac OS X client may be one small step in this direction, but make no mistake about it: in hindsight, it may also prove to be one giant step for 21st century business in the not so distant future. Will you be there with us? Or will you still be staring at an Excel spreadsheet? The choice is yours!

Divining the ‘K’ in K-means Clustering

The venerable K-means algorithm is a well-known and popular approach to clustering. It does, of course, have some drawbacks, the most obvious being the need to choose a predetermined number of clusters (the ‘k’). So BigML has now released a new feature for automatically choosing ‘k’, based on Hamerly and Elkan’s G-means algorithm.

The G-means algorithm takes a hierarchical approach to detecting the number of clusters. It repeatedly tests whether the data in the neighborhood of a cluster centroid looks Gaussian, and if not, it splits the cluster. A strength of G-means is that it deals well with non-spherical data (stretched-out clusters). We’ll walk through a short example using a 2-dimensional dataset with two clusters, each with a different covariance (stretched in different directions).

G-means starts with a single cluster. The cluster’s centroid will be the same as you’d get if you ran K-means with k=1.


G-means then tests the quality of that cluster by first finding the points in its neighborhood (those nearest to the centroid). Since we only have one cluster right now, that’s everything. Using those points, it runs K-means with k=2 and finds two candidate clusters. It then creates a vector between those two candidates. G-means considers this vector to be the most important direction for clustering the neighborhood, and projects all the points in the neighborhood onto it.


Finally, G-means uses the Anderson-Darling test to determine whether the projected points have a Gaussian distribution. If they do, the original cluster is kept and the two candidates are rejected. Otherwise, the candidates replace the original. In our example the distribution is clearly bimodal and fails the test, so we throw away the original cluster and adopt the two candidates.
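The split test just described can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions, not BigML's implementation: it uses `scipy.stats.anderson` for the Anderson-Darling test, a plain Lloyd's 2-means with a simple farthest-point initialization, and the strictest (1% significance) critical value as the split threshold.

```python
import numpy as np
from scipy.stats import anderson

def should_split(points, n_iter=20, seed=0):
    """G-means split test (sketch): run 2-means on the neighborhood,
    project every point onto the vector joining the two candidate
    centroids, and Anderson-Darling-test that 1-D projection."""
    rng = np.random.default_rng(seed)
    # farthest-point initialization keeps the two candidates well apart
    a = points[rng.integers(len(points))]
    b = points[np.linalg.norm(points - a, axis=1).argmax()]
    c = np.array([a, b], dtype=float)
    for _ in range(n_iter):  # plain Lloyd iterations, k = 2
        labels = np.linalg.norm(points[:, None] - c[None], axis=2).argmin(axis=1)
        for j in range(2):
            if (labels == j).any():
                c[j] = points[labels == j].mean(axis=0)
    v = c[1] - c[0]                # the candidate split direction
    proj = points @ v / (v @ v)    # project the neighborhood onto it
    res = anderson(proj, dist='norm')
    # statistic above the (strictest, 1%) critical value: not Gaussian, split
    return bool(res.statistic > res.critical_values[-1])
```

On a clearly bimodal neighborhood this returns `True` (split), while on a single Gaussian blob the projection stays Gaussian and it returns `False` (keep).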


After G-means decides whether to replace each cluster with its candidates, it runs the K-means update step over the new set of clusters until their positions converge. In our example, we now have two clusters and two neighborhoods (the orange points and the blue points). We repeat the previous process of finding candidates for each neighborhood, making a vector, projecting the points, and testing for a Gaussian distribution.


This time, however, the distributions for both clusters look fairly Gaussian.


When all clusters appear to be Gaussian, no new clusters are added and G-means is done.
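Putting the walkthrough together, here is a compact, self-contained sketch of the whole loop. It is illustrative only, not BigML's implementation: names are my own, `scipy.stats.anderson` supplies the normality test at the strictest (1%) threshold, and a simple farthest-point seeding stands in for smarter initialization.

```python
import numpy as np
from scipy.stats import anderson

def lloyd(points, centroids, n_iter=20):
    """Standard K-means update steps for a fixed number of iterations."""
    c = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        labels = np.linalg.norm(points[:, None] - c[None], axis=2).argmin(axis=1)
        for j in range(len(c)):
            if (labels == j).any():
                c[j] = points[labels == j].mean(axis=0)
    return c

def g_means(points, max_k=16, seed=0):
    """G-means sketch: split any cluster whose neighborhood, projected onto
    the vector between its two 2-means candidates, fails the
    Anderson-Darling normality test."""
    rng = np.random.default_rng(seed)
    centroids = np.array([points.mean(axis=0)])   # start with k = 1
    while len(centroids) < max_k:
        labels = np.linalg.norm(
            points[:, None] - centroids[None], axis=2).argmin(axis=1)
        new_centroids, split = [], False
        for j, cj in enumerate(centroids):
            nbhd = points[labels == j]
            if len(nbhd) < 8:                     # too small to test
                new_centroids.append(cj)
                continue
            # 2-means candidates, seeded with two far-apart points
            a = nbhd[rng.integers(len(nbhd))]
            b = nbhd[np.linalg.norm(nbhd - a, axis=1).argmax()]
            cand = lloyd(nbhd, [a, b])
            v = cand[1] - cand[0]
            proj = nbhd @ v / (v @ v)             # 1-D projection
            res = anderson(proj, dist='norm')
            if res.statistic > res.critical_values[-1]:
                new_centroids.extend(cand)        # not Gaussian: adopt the split
                split = True
            else:
                new_centroids.append(cj)          # Gaussian: keep the original
        centroids = np.array(new_centroids)
        if not split:                             # every cluster passed: done
            break
        centroids = lloyd(points, centroids)      # refine after splitting
    return centroids
```

On data drawn from a few well-separated Gaussians, this sketch recovers one centroid per true cluster and then stops, just as described above.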

The original G-means has a single parameter that determines how strict the Anderson-Darling test is when deciding whether a distribution is Gaussian. In BigML this is the critical value parameter, which we allow to range from 1 to 20. The smaller the critical value, the stricter the test, which generally means more clusters.
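You can see why a smaller critical value can only produce more clusters with scipy's Anderson-Darling implementation. This is shown only to illustrate the statistics; BigML's 1-to-20 parameter is its own scale, and nothing below is BigML's code.

```python
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(0)
sample = rng.normal(size=500)              # a genuinely Gaussian sample
res = anderson(sample, dist='norm')
# critical values at the 15%, 10%, 5%, 2.5% and 1% significance levels,
# ordered from smallest (the strictest test) to largest
thresholds = list(res.critical_values)
# a cluster is split when the statistic exceeds the threshold, so shrinking
# the threshold can only add split decisions, never remove them
decisions = [bool(res.statistic > t) for t in thresholds]
```

Because the statistic is fixed for a given sample, lowering the threshold flips decisions in one direction only: toward more splits, hence more clusters.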

Our version of G-means includes a few changes from the original, as we built on top of our existing K-means implementation. The alterations include a sampling/gradient-descent technique called mini-batch k-means to handle large datasets more efficiently. We also reused a technique called K-means|| to quickly pick quality initial points when selecting the candidate clusters for each neighborhood.
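Mini-batch k-means itself is easy to sketch. This is the generic technique from the literature with my own function and variable names, not BigML's implementation, and a simple farthest-point heuristic stands in for K-means|| initialization.

```python
import numpy as np

def minibatch_kmeans(points, k, batch_size=32, n_iter=300, seed=0):
    """Each iteration assigns a small random batch to the nearest centroid,
    then nudges each assigned centroid toward its batch points with a
    per-centroid learning rate that decays as the centroid sees more data."""
    rng = np.random.default_rng(seed)
    # farthest-point initialization (a simple stand-in for K-means||)
    idx = [int(rng.integers(len(points)))]
    while len(idx) < k:
        d = np.linalg.norm(points[:, None] - points[idx][None], axis=2).min(axis=1)
        idx.append(int(d.argmax()))
    centroids = points[idx].astype(float)
    counts = np.zeros(k)
    for _ in range(n_iter):
        batch = points[rng.choice(len(points), batch_size, replace=False)]
        labels = np.linalg.norm(
            batch[:, None] - centroids[None], axis=2).argmin(axis=1)
        for x, j in zip(batch, labels):
            counts[j] += 1
            eta = 1.0 / counts[j]          # decaying per-centroid step size
            centroids[j] += eta * (x - centroids[j])
    return centroids
```

Because each step touches only a small batch instead of the full dataset, the cost per iteration is constant in the dataset size, which is the point of using it on large datasets.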

BigML’s version of G-means also alters the stopping criteria. In addition to stopping when all clusters pass the Anderson-Darling test, we stop if multiple iterations introduce new clusters without any improvement in cluster quality. The intent is to prevent situations where G-means struggles on datasets without clearly differentiated clusters, which can result in many low-utility clusters. This part of our algorithm is tentative, however, and likely to change. We also plan to offer a ‘classic’ mode that stops only when all clusters pass the Anderson-Darling test.

All that said, we’ve been happy with how well G-means handles datasets with complicated underlying structure. We hope you’ll find it useful too!
