The venerable K-means algorithm is a well-known and popular approach to clustering. It does, of course, have some drawbacks, the most obvious being the need to choose the number of clusters (the ‘k’) in advance. So BigML has now released a new feature for automatically choosing ‘k’, based on Hamerly and Elkan’s G-means algorithm.
The G-means algorithm takes a hierarchical approach to detecting the number of clusters. It repeatedly tests whether the data in the neighborhood of a cluster centroid looks Gaussian, and if not it splits the cluster. A strength of G-means is that it deals well with non-spherical data (stretched-out clusters). We’ll walk through a short example using a two-dimensional dataset with two clusters, each with a unique covariance (stretched in different directions).
G-means starts with a single cluster. The cluster’s centroid will be the same as you’d get if you ran K-means with k=1.
G-means then tests the quality of that cluster by first finding the points in its neighborhood (nearest to the centroid). Since we only have one cluster right now, that’s everything. Using those points it runs K-means with k=2 and finds two candidate clusters. It then creates a vector between those two candidates. G-means considers this vector to be the most important for clustering the neighborhood. It projects all the points in the neighborhood onto that vector.
Finally, G-means uses the Anderson-Darling test to determine whether the projected points have a Gaussian distribution. If they do the original cluster is kept and the two candidates are rejected. Otherwise the candidates replace the original. In our example the distribution is clearly bimodal and fails the test, so we throw away the original and adopt the two candidate clusters.
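The split test described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not BigML's implementation: the function names (`two_means`, `project`, `anderson_darling`, `should_split`) are hypothetical, the candidate clusters come from a naive randomly initialized Lloyd's loop, and the statistic uses a commonly cited small-sample correction for the case where mean and variance are estimated from the data.

```python
import math
import random


def two_means(points, iters=20, seed=0):
    """Lloyd's k-means with k=2; returns the two candidate centroids."""
    rng = random.Random(seed)
    c0, c1 = rng.sample(points, 2)  # naive random initialization
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            d0 = sum((a - b) ** 2 for a, b in zip(p, c0))
            d1 = sum((a - b) ** 2 for a, b in zip(p, c1))
            groups[0 if d0 <= d1 else 1].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split; keep the previous centroids
        c0 = tuple(sum(col) / len(groups[0]) for col in zip(*groups[0]))
        c1 = tuple(sum(col) / len(groups[1]) for col in zip(*groups[1]))
    return c0, c1


def project(points, c0, c1):
    """Scalar projection of every point onto the vector from c0 to c1."""
    v = [b - a for a, b in zip(c0, c1)]
    norm_sq = sum(x * x for x in v)  # assumes the two candidates differ
    return [sum(pi * vi for pi, vi in zip(p, v)) / norm_sq for p in points]


def anderson_darling(xs):
    """Corrected A^2 statistic for normality (mean and variance estimated)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    z = sorted((x - mean) / sd for x in xs)
    # Standard normal CDF, clamped so the logarithms stay finite.
    cdf = [min(max(0.5 * (1 + math.erf(v / math.sqrt(2))), 1e-12), 1 - 1e-12)
           for v in z]
    s = sum((2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
            for i in range(n))
    a_sq = -n - s / n
    return a_sq * (1 + 4 / n - 25 / n ** 2)


def should_split(points, critical_value):
    """True when the projected neighborhood fails the Gaussian test."""
    c0, c1 = two_means(points)
    return anderson_darling(project(points, c0, c1)) > critical_value
```

Calling `should_split(neighborhood, 5.0)` on a list of coordinate tuples returns `True` when the candidates should replace the original cluster. The threshold 5.0 is an arbitrary value picked for the sketch; a lower threshold rejects more neighborhoods and therefore produces more clusters.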
After G-means decides whether to replace each cluster with its candidates, it runs the K-means update step over the new set of clusters until their positions converge. In our example, we now have two clusters and two neighborhoods (the orange points and the blue points). We repeat the previous process of finding candidates for each neighborhood, making a vector, projecting the points, and testing for a Gaussian distribution.
This time, however, the distributions for both clusters look fairly Gaussian.
When all clusters appear to be Gaussian, no new clusters are added and G-means is done.
The original G-means has a single parameter, which determines how strict the Anderson-Darling test is when deciding whether a distribution is Gaussian. In BigML this is the critical value parameter, and we allow values from 1 to 20. The smaller the critical value, the stricter the test, which generally means more clusters.
Our version of G-means has a few changes from the original as we built on top of our existing K-means implementation. The alterations include a sampling/gradient-descent technique called mini-batch k-means to more efficiently handle large datasets. We also reused a technique called K-means|| to quickly pick quality initial points when selecting the candidate clusters for each neighborhood.
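For readers unfamiliar with mini-batch k-means, here is a minimal sketch of the general technique: each step draws a small random batch, assigns its points to the nearest centroid, and nudges that centroid toward each point with a per-centroid learning rate of one over the number of points it has absorbed so far. The function names and default parameters are assumptions made for the example; this is not BigML's code, and it takes the initial centers as given rather than using K-means||.

```python
import random


def nearest(point, centers):
    """Index of the center closest to the point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centers[j])))


def mini_batch_kmeans(points, centers, batch_size=32, steps=100, seed=0):
    """Refine the given centers using small random batches of points."""
    rng = random.Random(seed)
    centers = [list(c) for c in centers]
    counts = [0] * len(centers)  # points absorbed by each center so far
    for _ in range(steps):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the whole batch first, then update the centers.
        assigned = [nearest(p, centers) for p in batch]
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]  # learning rate shrinks as the center matures
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return centers
```

Because each step touches only `batch_size` points, the cost per step does not grow with the size of the dataset, which is what makes the approach attractive for large data.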
BigML’s version of G-means also alters the stopping criteria. In addition to stopping when all clusters pass the Anderson-Darling test, we stop if new clusters are introduced over multiple iterations without any improvement in cluster quality. The intent is to prevent situations where G-means struggles on datasets without clearly differentiated clusters, which can result in many low-utility clusters. This part of our algorithm is tentative, however, and likely to change. We also plan to offer a ‘classic’ mode that stops only when all clusters pass the Anderson-Darling test.
All that said, we’ve been happy with how well G-means handles datasets with complicated underlying structure. We hope you’ll find it useful too!
Following up on the success of its inaugural event last year, PAPIs.io 2015 is fast approaching. This year’s event will take place down under in the beautiful “harbour city” of Sydney. It is conveniently scheduled for the Thursday and Friday (6-7 August 2015) preceding KDD, the ACM conference on knowledge discovery and data mining, which attracts 2000+ Big Data practitioners and researchers. As a founding member and initial sponsor of PAPIs.io, BigML will be participating in this year’s event too.
PAPIs.io is a unique event in that it has been able to bring together data scientists, developers and practitioners from 20+ countries representing many different industries and educational institutions. Past participants included large tech companies such as Amazon, AXA, Banc Sabadell, BBVA, IBM, ING, Intel, Microsoft, Samsung and SAP, as well as leading startups in the field (e.g. BigML, Dataiku, Indico, RapidMiner), all gathered to discuss all things Predictive APIs and Predictive Apps. The very hands-on and interactive agenda is centered on addressing the challenges of building real-world predictive applications based on a growing number of Predictive APIs that are making Machine Learning more and more accessible to developers. As a bonus, this year’s event will also introduce a technical track.
Here is a reminder of some of the real-life predictive applications that were showcased in great detail at last year’s event:
- Real-time Online Ad Bidding Optimization
- Overcoming Challenges in Sentiment Analysis
- Winning Kaggle’s Yandex Personalized Web Search Challenge
- Forecasting Bitcoin Exchange Rates
- Paris Area Transportation Optimization via Predictive Analytics
- Personalized Card-linked Offers for Consumers
- Bikesharing Optimization and Balancing
- Office 365 Infrastructure Health Engine
We have recently announced our Dynamic Scatterplot capability, which is one of many goodies to have come out of BigML Labs. You can utilize dynamic scatterplots to do a deeper dive into and interact with your multidimensional data points before or after your modeling.
In this post, we will visualize the clusters we build based on Numbeo’s Quality of Life metrics per country. This dataset has only 86 observations, each recording the following quality of life metrics for one country:
- Quality of Life Index
- Purchasing Power Index
- Safety Index
- Health Care Index
- Consumer Price Index
- Property Price to Income Ratio
- Traffic Commute Time Index
- Pollution Index
Quality of Life Index is a proprietary value calculated by Numbeo based on a weighted average of the other indicator fields in the dataset e.g. Safety Index, Pollution Index etc. Therefore, we removed this field prior to our clustering to get a better sense of how clusters are formed without any subjective weighting measures.
We used G-Means clustering with default settings to analyze all 86 records in the dataset. The process ended up with 2 clusters with 45 countries in Cluster 0 and 41 in Cluster 1 respectively — a pretty even split all in all. A quick gaze at the descriptive stats on the side panel shows that Cluster 1 tends to have a higher representation of wealthier, more developed nations, whereas Cluster 0 mainly consists of developing nations.
This is a great start, but what if you want to dive deeper into the makeup of each cluster? Well, in that case, BigML already offers you the ability to build a separate decision tree model from each cluster, as an option before or even after you create your clusters. As your clusters are created, so are the corresponding trees, which you can traverse to better understand which variables best explain the grouping of instances in a given cluster.
For example, the screenshot below reveals that Purchasing Power Index had the most influence for Cluster 0, where any country with PPI less than 42 (the short right branch) was automatically classified as belonging to Cluster 0 among other more complex rules (shown on the more complex left branch).
Now, we have a better idea about the method behind our clusters. However, at times, we may need to dive even deeper into the data and see how individual records are laid out on a plane in relation to each other much like the cluster visualization itself, but applied to individual instances. This is especially useful if there are thousands or more data points to be analyzed.
Our brand new Dynamic Scatterplot feature lets you do just that. Once you navigate to the Dynamic Scatterplot screen, BigML asks you to specify which dataset it needs to use for plotting. As you type letters, matching datasets appear in the dropdown. After you select your dataset, you can pick the dimensions you would like to visualize. You can show up to three at a time: one on the X axis, one on the Y axis, and one as the color coding.
The example image below depicts how each country in our Numbeo dataset is positioned according to Purchasing Power Index (X axis), Health Care Index (Y axis) and Cluster identifier (Color dimension). The familiar Data Inspector panel on the right hand side shows the values for a particular data point you can mouse over.
As you can see, even though our cluster analysis took into account all available fields, the dispersion in this visualization still shows a pretty obvious concentration of Cluster 0 (dark blue) in the bottom-left quadrant and Cluster 1 (light blue) in the top-right quadrant. This confirms our gut-feel expectation that countries with higher purchasing power would also have higher quality healthcare.
However, there are interesting exceptions to note. For instance, the dark blue dot near the coordinate (40,80) is Thailand. (Please note that we have manually superimposed relevant country flags on the actual output.) Thailand is a developing nation. Nevertheless, it is punching well above its weight in terms of health care services. A little research reveals that there is a growing healthcare tourism industry in Bangkok drawing many foreigners seeking more affordable care. Similarly, the Dominican Republic also presents us with an interesting case.
We then get curious about the group of dots that have relatively high purchasing power (PPI>=80), yet not as high a healthcare score (HCI<=60) as one would expect at that level of purchasing power. The zoom-in feature of Dynamic Scatterplots comes in handy for this. By marking the aforementioned area with our mouse, we can instantly visualize just that portion of our chart as follows. (Please note that we have manually superimposed relevant country flags on the actual output.)
The 4 light blue (Cluster 1) dots here represent Puerto Rico, United Arab Emirates, Saudi Arabia and Ireland. These turn out to be wealthier nations with subpar healthcare.
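Outside the UI, the same kind of zoom amounts to a simple box filter over the rows of the dataset. The sketch below is an illustration only: the `zoom` helper, the field names and the three placeholder rows are made up for the example and do not come from the Numbeo data.

```python
def zoom(rows, x_field, y_field, x_min=None, x_max=None, y_min=None, y_max=None):
    """Keep only the rows whose (x, y) values fall inside the selected box."""
    def inside(value, low, high):
        return (low is None or value >= low) and (high is None or value <= high)

    return [r for r in rows
            if inside(r[x_field], x_min, x_max) and inside(r[y_field], y_min, y_max)]


# Placeholder rows with invented values, purely to show the call.
rows = [
    {"country": "Country A", "ppi": 95.0, "hci": 55.0},
    {"country": "Country B", "ppi": 90.0, "hci": 75.0},
    {"country": "Country C", "ppi": 40.0, "hci": 50.0},
]
selected = zoom(rows, "ppi", "hci", x_min=80, y_max=60)
```

Here `selected` contains only the placeholder "Country A", the one row with PPI of at least 80 and HCI of at most 60.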
As seen in this straightforward example, playing with the Dynamic Scatterplot is both easy and very instructive. One cannot always find easy explanations when utilizing Machine Learning techniques, but effective visualizations can help provide additional “color” and confidence to our findings, where other methods may fail.
We hope that you will give this cool new offering from BigML a try as part of your next data mining project. As always, please let us know how we can improve it further. The best part is, it comes FREE with all existing subscription levels, so have at it!
It is no surprise by now that we are having to deal with lots of data in many different formats in our everyday life. From Database Managers taming growing quantities of data in large companies to the handy spreadsheet that can work just fine for small tasks or personal use, BigML has been on a mission to bring Machine Learning predictions to every dataset. In that spirit, we continue this week’s Google integration theme with news on our upcoming Google Sheets add-on.
Google Sheets is a truly wonderful tool to store your datasets. It is fully functional as a spreadsheet, but it turns out that you can still improve its utility by taking advantage of add-ons. The add-ons are macro-like Google Apps Scripts that can interact with the contents of your Google Sheet (or Docs or Forms) and automate repetitive tasks, or connect to other services. At BigML, we’ve built our own add-on that will let you use your models or clusters to add predictions to your data in Google Sheets.
BigML users already know how easy it is to start making predictions in BigML. Basically, you register, upload your data to BigML (it can be in local or remote CSV files, Excel spreadsheets, inline data etc.) and in one click build a dataset, where all the statistical information is summarized. With a second click, you can build a model, where the hidden patterns in your data are unveiled. Those rules can later be used to predict the contents of empty fields in new data instances. With our new add-on, it’s now possible to perform those predictions directly in your Google Sheet.
The wine shop use case
The first time you log in to BigML, you land in a development area with a bunch of sample data sources available for you to play with at no cost. Let’s use one of these to build an example: the fictional wine sales dataset. It contains historical wine sales figures and the related features for each wine such as the country, type of grape, rating, origin, and price. Imagine you want to carry new wines in your store. It would be great to have an estimate of the total sales you can expect from each new wine, so that you can choose the ones that will sell better, right?
Using the above dataset, you can easily create a BigML decision tree model that can predict the total sales for a wine given its features. Thus, for every new wine, you can use the model to compute the expected total sales and choose the new wines most likely to maximize your revenue. But what if your list of new wines is in a Google Sheet? Good news! You can also use your BigML model from within your Google sheet to quickly compute the predicted sales values for the new wines.
Using BigML models from Google Sheets
To use this new functionality, you’ll need to first install the BigML add-on (coming soon). Once installed, it will appear under the add-ons menu as seen below. You can now choose the ‘Predict’ submenu item, which will display the form needed to access all your models and clusters in BigML (provided that you’ve authenticated with your BigML credentials). In this case, you’ll sort through your list of models and select the one that was built on your historical wine sales data. Finally, you’ll be ready to add predictions to the empty cells in your Google Sheet.
To do this, select the range of cells that contain the information available for your new wines list. When you press the ‘Predict’ button on the right-hand side panel, the prediction for each row is placed in the next empty cell on the right, and the associated error is appended in a second column. In this example, the prediction is a number, but you can add predictions for categorical fields just as easily:
So how does the BigML add-on work behind the scenes? The add-on code is executed on Google Apps Script servers. Google Apps Script code can connect to BigML and download your models to those servers (after validating your credentials in BigML). It can also interact with your Google Sheet. The BigML model you choose is downloaded to the Google Apps Script server environment, where the script runs each row in your selected range through the model and updates the cells in your sheet with the computed predictions. Thus, no data in your sheet has to reach BigML to obtain predictions. It stays on Google servers the whole time. This video shows the basic steps for this and other examples dealing with categorical models or clusters.
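To make the idea of local prediction concrete, here is a minimal sketch in Python (the add-on itself runs as Google Apps Script). Everything in it is an assumption made for illustration: the nested-dictionary tree layout, the field names, the numbers and the helpers `predict_row` and `fill_predictions` are hypothetical and do not reflect BigML's actual model format or the add-on's code.

```python
def predict_row(node, row):
    """Walk a downloaded tree from the root to a leaf for one row."""
    while "prediction" not in node:
        branch = "left" if row[node["field"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["prediction"], node["error"]


def fill_predictions(tree, header, rows):
    """Append the prediction and its error to every selected row."""
    filled = []
    for cells in rows:
        row = dict(zip(header, cells))  # map column names to cell values
        prediction, error = predict_row(tree, row)
        filled.append(list(cells) + [prediction, error])
    return filled


# Hypothetical one-split tree with invented values.
toy_tree = {
    "field": "price", "threshold": 20,
    "left": {"prediction": 100.0, "error": 10.0},
    "right": {"prediction": 40.0, "error": 5.0},
}
sheet = fill_predictions(toy_tree, ["name", "price"], [["wine x", 12], ["wine y", 35]])
```

The point of the sketch is that once the model is available locally, scoring is just a loop over the selected rows, so the sheet's contents never need to be sent anywhere.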
Our add-on will be visible under the add-ons menu in Google Sheets as soon as the Google add-ons approval process is completed. We will update this post accordingly. However, if you want to be an early adopter, just let us know today!
Attention Google power users: We have made a number of improvements to make BigML more compatible with Google services for your convenience. Google Cloud is becoming the fastest growing cloud provider. As such, we have been receiving requests from users all over the world, so we finally took the hint.
For starters, in addition to Amazon and GitHub, you can now log in to BigML with your Google ID. Click on the Google option under the Login button and you will be authenticated and ready to start your machine learning project.
Since our aim is to make it super easy to upload your data to BigML regardless of your cloud provider, we have added both Google Drive and Google Storage support as well. Similar to our integrations with Azure Marketplace and Dropbox, connecting to your cloud storage only takes a few clicks, starting from the cloud icon located on the Sources tab.
The first time you go through this flow, you will be asked to allow BigML to access your Google Drive or Storage, which automatically generates and displays an access token.
Next time you want to access one of your data sources stored on Google, you can use the same menu on the Sources tab and it will bring up all your folders in a modal window as shown below.
Select the one you are interested in and it will be uploaded as a new source on BigML right away. So you are off to the races with your machine learning project just like that.
Let us know how this works out for you. If you like it, please give a shout out to other fellow Google users, so they too can take advantage of it.
We had a great turnout for our 2015 Winter Release webinar earlier this week. Poul Petersen, BigML’s CIO, demoed many features including G-Means Clustering, Dynamic Scatterplot Visualization, BigML Projects, Google Integration and Sample Service, all recently deployed in the BigML platform. Thank you to all those who attended!
In case you missed it, you can view the webinar in its entirety below and you can also check it out on the BigML Training page, where it will be posted shortly along with our other video resources.
This past week BigML was a proud sponsor of the 29th Annual AAAI Conference in Austin, TX. The conference drew over 1,200 attendees from across the AI community, from researchers and practitioners to scientists and engineers. The program was filled with incredible presentations, discussions and professionals passionate about what they do. The main conference hall was buzzing at all hours of the day and BigML was right in the middle of it, giving live demonstrations of our powerful platform and discussing all facets of machine learning with students, professors, and seasoned AI professionals alike.
Aside from the generous servings of coffee and donuts, we are very fortunate to have interacted with such a talented community! Read more about our adventures below!
ACM SIGAI Career Panel
At one point in our lives we’ve all been there, searching for a job in an often precarious job market. Writing and re-writing resumes, submitting job applications, exhausting your personal network and hoping for that phone call back. It can be a daunting and unflattering experience but once you receive that job offer, all the hard work you put in comes full circle. Once on the other side of the desk, so to speak, it’s often easy to overlook your past experience. However, it’s always important to give back and put yourself in a position where you can help those who are sitting exactly where you once were.
BigML had the opportunity to do just that this week. Poul Petersen, CIO of BigML, was selected to be a panelist for the ACM SIGAI Career Panel hosted by SIGAI in Austin. Poul was one of the experts seated alongside Peter Clark from the Allen Institute for AI, Eric Horvitz from Microsoft Research, and Ben Kuperman from Oberlin College, each bringing their own unique perspective to the table. The goal of the event was for the panelists to engage with the 40+ senior PhDs and post-doctorates in order to address their questions and concerns about the AI job market, applying for the right position, career planning and positioning for success. All too often workshops such as these are overlooked, and we are proud to have participated in helping the next generation of AI professionals take the appropriate steps toward their dream job! Below are a few quick insights regarding the size and importance of AI in the job market:
- There are more than 26,000 computer and information science researchers in the US
- Artificial Intelligence is growing by 20% annually
- Since 2012, more than 170 AI dedicated start-ups have entered the market
- VCs pledged $188M to AI companies in Q4 2014
- Top U.S. salaries are reported in California, Texas, the Northwest, and the Northeast (MA to VA) with an overall median base salary of $105k.
BigML Meet-Up: Austin Data Geeks!
Not only is the AI scene ingrained in the city, but the technology industry itself has also found a home in Austin. In recent years, Austin was the birthplace of four start-ups that were either acquired or reached public offering at more than $1 billion each. In addition, in Q1 2014 Austin-based start-ups received more than $350 million in VC funding. This city is bursting with opportunity.
So, it’s no wonder that we were thrilled to participate in a MeetUp in such a vibrant and active community. The amount of talent and excitement around technology made it a perfect fit for us. BigML sat down with the Austin Data Geeks at RackSpace to discuss our platform. The meet-up drew 60+ attendees from Austin-area technology companies, start-ups and all-around tech enthusiasts! The session was very interactive, and Poul Petersen, CIO of BigML, gave a presentation showcasing how our intuitive interface can help shorten the time to actionable insights even for the most hardcore data wranglers out there.
We left Austin with smiles on our faces (and stomachs full of Texas BBQ) after a fantastic week of immersing ourselves in the AI world. We look forward to all that’s coming our way and you should too. BigML has plenty to offer and we are adding new features to our platform every day. Join us on February 11 for our 2015 Winter Release webinar and see what’s new!
If you haven’t yet experienced BigML, we’d like to remind you that you can try it for FREE with tasks under 16MB in development mode. If you would like to add capacity, keep in mind that we always offer plans for as low as $15/month with the 50% special discount for students, public researchers, or NGOs.
We look forward to circling back with all of you, and until then, keep predicting!
BigML is kicking off 2015 with many great new capabilities that are included in our Winter 2015 Release, which we’ll be sharing with you in a webinar on February 11 at 9:00 Pacific / 17:00 GMT.
If you are located in Australia, New Zealand or elsewhere in the Pacific Rim, we’ll be co-hosting a separate webinar along with our alliance partner GCS Agile. It will be on February 12 at Noon Australia Eastern Daylight Time / 4AM GMT. Please register for our ANZ/APAC webinar to reserve your seat.
Some key highlights from the release include:
BigML’s new Sample Service provides fast access to datasets that are kept in an in-memory cache which enables a variety of sampling, filtering and correlation techniques. We have leveraged BigML’s sample service to create a Dynamic Scatterplot visualization that we’ve released into BigML Labs, and which we’ll showcase on the webinar.
G-means clustering, the latest addition to BigML’s unsupervised learning algorithms, is ideal for when you may not know how many clusters you wish to build from your dataset.
We’re happy to introduce Projects to help you organize your machine learning resources. On the webinar we’ll show you how to create a project from a new data source and how to manage your associated tasks and workflows.
With the Winter Release, you’ll now be able to add sources to BigML through Google Cloud Storage and Google Drive, similar to our prior integrations with Dropbox and Azure Data Marketplace. You can also now log into BigML using your Google ID.
Our team is constantly working on innovative applications built on top of BigML’s API. We’re now unveiling several of these in early access through our “BigML Labs”. Join us to see in action two of our latest applications codenamed BigML GAS and BigML X.
We’ve also made many UI tweaks, API bindings updates, BigMLer enhancements and general improvements that we’ll highlight in the webinar as we show off the Winter Release.
Once again, webinar space is limited, so please register today!
As faithful BigML blog readers will recall, last year we used the BigML platform to build a variety of analyses to predict the outcome of Super Bowl XLVIII. You can read full details here, but a short summary is this:
- The Denver Broncos were favored to beat the Seattle Seahawks by 2.5 points
- BigML predicted the Seahawks not just to cover the point spread, but also to win the game outright
- BigML also predicted that Seattle would win by the exact score of 43-8
Well, the last point isn’t true but the first two are. We weren’t the only machine learning or analytics vendor aiming to pick the Super Bowl. SAP took a crack at it (perhaps they should stick to futbol), as did a Microsoft researcher, as did Facebook. The difference between these predictions and our own is that our prediction was not only bolder (picking Seattle not just against the spread but also to win outright), but also accurate.
Our data source and approach:
We used team rankings data again from Football Outsiders, which features their DVOA (Defense-adjusted Value Over Average) system. I like DVOA as it’s a more advanced way of judging a team’s performance. Football Outsiders’ very short summary of DVOA is “DVOA measures a team’s efficiency by comparing success on every single play to a league average based on situation and opponent”, so rather than just looking at raw statistics, we have a much more nuanced basis. This is especially critical since we’re working with a narrow range of features in addition to a ridiculously small number of data points.
If you’re interested in learning more about the DVOA approach (and in getting a snapshot of the type of thinking behind advanced sports analytics in general), this summary from Football Outsiders is a great read.
What fields were selected for the ensembles?
We used Football Outsiders’ NFL-wide rankings for Team Efficiency, both weighted (factoring in late-season performance) and non-weighted; Team Offense & Defense ranks (weighted & non-weighted); Special Teams ranks; the point spread; and the over/under total. We also included historical points scored for both the AFC & NFC team from past Super Bowls. You can view and clone the full dataset that we built here, although you’ll have to deselect some of the fields if you want to replicate this specific ensemble.
The same caveats as last year pertain — namely that these predictions are based on a mere 25 rows of data and are only using aggregate season rankings for team performance, whereas professional handicapping systems will incorporate much more nuanced data. If you want to get a professional’s point of view, you can check out the Prediction Machine or many other pundits who leverage machine learning to try and beat the odds and/or help out your average wagering enthusiast.
Evaluating our models:
As the data is so limited it’s not surprising that single models for any outcome evaluated poorly — we had to build 100-model ensembles to achieve evaluations that performed better than mode or random guessing. Even then, it’s important to note that the hold-out set (using a standard 80/20 train/test split) only has a handful of instances. I was curious to see how my evaluations stacked up, so I used BigML’s handy new tweak to the interface that allows you to sort evaluations by their performance (I also leveraged the search feature to filter for “superbowl ensemble”):
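To see why the hold-out set is so small, and what "better than mode guessing" means, here is a minimal sketch. The helper names are hypothetical and the split is a plain shuffled split, which may differ from how BigML samples data; the example uses row indices rather than the actual Super Bowl rows.

```python
import random
from collections import Counter


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(rows) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def mode_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    mode = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == mode) / len(test_labels)


# 25 rows, as in the dataset described above; indices stand in for the rows.
train, test = train_test_split(list(range(25)))
```

With 25 rows, the 80/20 split leaves 20 rows for training and only 5 for testing, so a single test instance moves the measured accuracy by 20 percentage points. A classifier is only worth keeping if it beats the value returned by `mode_baseline_accuracy` on the same hold-out set.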
Cut to the chase: who’s going to win?
Noting our prior caveats regarding sample size and mixed evaluations, here are our predictions against the 100-tree ensembles for each bet (all using plurality weighting):
- NFC (Seattle Seahawks) points: 39.37 ±14.6
- AFC (New England Patriots) points: 19.41 ±6.83
- Outright Winner: Seattle Seahawks (73.13% confidence)
- ATS (Against the Spread) Winner: Seattle Seahawks (66.30% confidence)
- Underdog/Favorite prediction: Underdog (Seahawks) (61.33% confidence)
- Over/under 49 prediction: Over (69.51% confidence)
There you have it: Seattle will beat the odds again and in taking home the championship will become the first repeat Super Bowl champions in a decade. It also looks to be a high-scoring affair, although the NFC total is undoubtedly skewed by the 43 points Seattle put up last year.
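For readers curious about plurality weighting, the basic idea is that each model in the ensemble casts one vote and the most frequent class wins. The sketch below illustrates only that voting step; `plurality_vote` is a hypothetical helper, and the vote share it returns is a simple stand-in that is not necessarily how the confidence values listed above are computed.

```python
from collections import Counter


def plurality_vote(predictions):
    """Return the most frequent class and its share of the votes."""
    counts = Counter(predictions)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(predictions)


# Made-up votes from a four-model ensemble, purely to show the call.
winner, share = plurality_vote(["A", "A", "B", "A"])
```

In this made-up example class "A" wins with three of the four votes.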
What to do now?
So should you run to your local betting hall or online sportsbook and bet your life savings on the Seahawks? Of course not — at least not based on this blog post. As stated above, these Super Bowl models have been built with very limited data, and perhaps have been subconsciously biased by the author’s Seahawks fanship. Last but not least, there’s the Beli-cheat / Deflate-gate factor — who knows what’s going to come from the team that has pushed all limits in the name of securing victories?
What we’d love to see you do is leverage BigML to build your own predictive model for the Super Bowl. For example, you could access data on all past games (not just Super Bowls) and use BigML’s clustering algorithm to find the most similar matchups and then make a prediction based on those results. Or, you can simply access more data, factoring in not only games beyond the Super Bowl but also more advanced statistics on team trends, match-up trends (e.g., what happens in *any* game when the top-ranked defense faces the 5th ranked offense), individual player match-ups, etc. Are you into the “wisdom of crowds”? Pull social media data and leverage BigML’s text analysis to see what the general public is thinking.
Interested? We’re happy to help you out with your efforts. For starters, we’ll give you a free one-month Pro subscription – simply enter coupon code “XLIX” in the payment form. Stumped? You can always send us email at firstname.lastname@example.org, or join our weekly “Meet the Team” hangout next Wednesday at 9:30 AM Pacific.
No matter if you’re wagering or simply cheering for one team or the other, we hope you enjoy the Super Bowl.
We’re excited for a great visit to Austin, Texas in conjunction with BigML’s sponsorship of the AAAI-15 conference, which will take place during the week of January 25. The AAAI (Association for the Advancement of Artificial Intelligence) conference takes place every year and brings together the brightest minds and viewpoints on artificial intelligence from research, industry and academia. This is the first ever Winter edition of the event, which certainly reflects global growth and awareness of AI, machine learning and related disciplines.
For starters, Poul Petersen will be sitting on the SIGAI (the ACM Special Interest Group on Artificial Intelligence) career panel on Monday the 26th alongside Oren Etzioni from the Allen Institute for Artificial Intelligence, Eric Horvitz of Microsoft Research, and Ben Kuperman of Oberlin College.
From Tuesday through Thursday BigML will be sponsoring and demoing at the AAAI-15 conference itself, and on Tuesday evening we’re very pleased to be a guest of the Austin Data Geeks MeetUp! We’ll additionally be meeting with customers and partners in the area — so be sure to catch us at one of the events listed above and/or ping us at email@example.com to set up a time to chat.