We have recently announced our Dynamic Scatterplot capability, one of many goodies to come out of BigML Labs. You can use dynamic scatterplots to dive deeper into and interact with your multidimensional data points before or after modeling.
In this post, we will visualize the clusters we build based on Numbeo’s Quality of Life metrics per country. This dataset has only 86 observations, each recording the following quality of life metrics for a country:
- Quality of Life Index
- Purchasing Power Index
- Safety Index
- Health Care Index
- Consumer Price Index
- Property Price to Income Ratio
- Traffic Commute Time Index
- Pollution Index
Quality of Life Index is a proprietary value calculated by Numbeo as a weighted average of the other indicator fields in the dataset (e.g., Safety Index, Pollution Index). Therefore, we removed this field prior to clustering to get a better sense of how clusters form without any subjective weighting measures.
We used G-Means clustering with default settings to analyze all 86 records in the dataset. The process yielded 2 clusters: 45 countries in Cluster 0 and 41 in Cluster 1 — a pretty even split all in all. A quick glance at the descriptive stats on the side panel shows that Cluster 1 tends to have a higher representation of wealthier, more developed nations, whereas Cluster 0 mainly consists of developing nations.
This is a great start, but what if you want to dive deeper into the makeup of each cluster? In that case, BigML already offers the ability to build a separate decision tree model from each cluster, an option you can select before or even after you create your clusters. As your clusters are created, so are the corresponding trees, which you can traverse to better understand which variables best explain the grouping of instances in a given cluster.
For example, the screenshot below reveals that Purchasing Power Index had the most influence for Cluster 0, where any country with PPI less than 42 (the short right branch) was automatically classified as belonging to Cluster 0 among other more complex rules (shown on the more complex left branch).
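That short right branch amounts to a one-line rule. A rough sketch in Python, where the 42 threshold comes from the screenshot but the field name and the left-branch fallback are illustrative, not BigML’s actual model output:

```python
# Hypothetical sketch of the short right branch of the Cluster 0 tree.
# The 42 threshold is from the screenshot; everything else is illustrative.
def cluster_rule(country):
    """Classify a country record (a dict of index values) by the simple branch."""
    if country["Purchasing Power Index"] < 42:
        return "Cluster 0"  # the short right branch
    # The left branch applies more complex rules not reproduced here.
    return "undetermined by this rule"

print(cluster_rule({"Purchasing Power Index": 35.0}))  # Cluster 0
```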
Now, we have a better idea about the method behind our clusters. However, at times, we may need to dive even deeper into the data and see how individual records are laid out on a plane in relation to each other much like the cluster visualization itself, but applied to individual instances. This is especially useful if there are thousands or more data points to be analyzed.
Our brand new Dynamic Scatterplot feature lets you do just that. Once you navigate to the Dynamic Scatterplot screen, BigML asks you to specify which dataset to use for plotting. As you type, matching datasets appear in the dropdown. After you select your dataset, you can pick the dimensions you would like to visualize: up to 3 at a time, split between the X axis, the Y axis, and the color coding.
The example image below depicts how each country in our Numbeo dataset is positioned according to Purchasing Power Index (X axis), Health Care Index (Y axis) and Cluster identifier (Color dimension). The familiar Data Inspector panel on the right hand side shows the values for a particular data point you can mouse over.
As you can see, even though our cluster analysis took into account all available fields, the dispersion in this visualization still shows a pretty obvious concentration of Cluster 0 (Dark blue) on the left bottom quadrant and Cluster 1 (Light blue) on the right top quadrant. This confirms our gut feel expectation that countries with higher purchasing power would also have higher quality healthcare.
However, there are interesting exceptions to note. For instance, the dark blue dot near the coordinate (40,80) is Thailand. (Please note that we have manually superimposed relevant country flags on the actual output.) Thailand is a developing nation; nevertheless, it is punching well above its weight in terms of health care services. A little research reveals a growing healthcare tourism industry in Bangkok drawing many foreigners seeking more affordable care. Similarly, the Dominican Republic presents us with an interesting case.
We then get curious about the group of dots that have relatively high purchasing power (PPI>=80), yet not as high a healthcare score (HCI<=60) as one would expect at that level of purchasing power. The zoom feature of Dynamic Scatterplots comes in handy here. Marking the aforementioned area with our mouse, we can instantly visualize just that portion of our chart as follows. (Please note that we have manually superimposed relevant country flags on the actual output.)
The 4 light blue (Cluster 1) dots here represent Puerto Rico, United Arab Emirates, Saudi Arabia and Ireland. These turn out to be wealthier nations with subpar healthcare.
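In effect, the zoom selection is a bounding-box filter over the plotted dimensions. A minimal sketch, with made-up index values rather than the actual Numbeo records:

```python
# A zoom selection is just a bounding-box filter over the plotted points.
# Field names and values below are illustrative, not the actual dataset.
points = [
    {"country": "Puerto Rico",          "ppi": 83.5, "hci": 55.2, "cluster": 1},
    {"country": "United Arab Emirates", "ppi": 95.1, "hci": 58.7, "cluster": 1},
    {"country": "Thailand",             "ppi": 40.2, "hci": 79.9, "cluster": 0},
]

def in_zoom(p, ppi_min=80, hci_max=60):
    """True if a point falls inside the zoomed region (PPI>=80, HCI<=60)."""
    return p["ppi"] >= ppi_min and p["hci"] <= hci_max

selected = [p["country"] for p in points if in_zoom(p)]
print(selected)  # ['Puerto Rico', 'United Arab Emirates']
```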
As seen in this straightforward example, playing with the Dynamic Scatterplot is both easy and very instructive. One cannot always find easy explanations when utilizing Machine Learning techniques, but effective visualizations can provide additional “color” and confidence to our findings where other methods may fail.
We hope that you will give this cool new offering from BigML a try as part of your next data mining project. As always, please let us know how we can improve it further. The best part: it comes FREE with all existing subscription levels, so have at it!
It is no surprise by now that we are having to deal with lots of data in many different formats in our everyday life. From Database Managers taming growing quantities of data in large companies to the handy spreadsheet that can work just fine for small tasks or personal use, BigML has been on a mission to bring Machine Learning predictions to every dataset. In that spirit, we continue this week’s Google integration theme with news on our upcoming Google Sheets add-on.
Google Sheets is a truly wonderful tool to store your datasets. It is fully functional as a spreadsheet, but it turns out that you can still improve its utility by taking advantage of add-ons. The add-ons are macro-like Google Apps Scripts that can interact with the contents of your Google Sheet (or Docs or Forms) and automate repetitive tasks, or connect to other services. At BigML, we’ve built our own add-on that will let you use your models or clusters to add predictions to your data in Google Sheets.
BigML users already know how easy it is to start making predictions in BigML. Basically, you register, upload your data to BigML (it can be in local or remote CSV files, Excel spreadsheets, inline data, etc.) and in one click build a dataset, where all the statistical information is summarized. With a second click, you can build a model, where the hidden patterns in your data are unveiled. Those rules can later be used to predict the contents of empty fields in new data instances. With our new add-on, it’s now possible to perform those predictions directly in your Google Sheet.
The wine shop use case
The first time you log in to BigML, you land in a development area with a bunch of sample data sources available for you to play with at no cost. Let’s use one of these to build an example: the fictional wine sales dataset. It contains historical wine sales figures and the related features for each wine, such as the country, type of grape, rating, origin, and price. Imagine you want to carry new wines in your store. It would be great to have an estimate of the total sales you can expect from each new wine, so that you can choose the ones that will sell best, right?
Using the above dataset, you can easily create a BigML decision tree model that can predict the total sales for a wine given its features. Thus, for every new wine, you can use the model to compute the expected total sales and choose the new wines most likely to maximize your revenue. But what if your list of new wines is in a Google Sheet? Good news! You can also use your BigML model from within your Google sheet to quickly compute the predicted sales values for the new wines.
Using BigML models from Google Sheets
To use this new functionality, you’ll need to first install the BigML add-on (coming soon). Once installed, it will appear under the add-ons menu as seen below. You can now choose the ‘Predict’ submenu item, which will display the form needed to access all your models and clusters in BigML (provided that you’ve authenticated with your BigML credentials). In this case, you’ll sort through your list of models and select the one that was built on your historical wine sales data. Finally, you’ll be ready to add predictions to the empty cells in your Google Sheet.
To do this, select the range of cells that contain the information available for your new wines list. When you press the ‘Predict’ button on the right-hand side panel, the prediction for each row is placed in the next empty cell on the right, and the associated error is appended in a second column. In this example the prediction was a number, but you can add predictions for categorical fields just as easily:
So how does the BigML add-on work behind the scenes? The add-on code is executed on Google Apps Script servers, which can both connect to BigML (after validating your credentials) and interact with your Google Sheet. The model you choose is downloaded to the Apps Script server environment, where the script runs each row in your selected range through the model and updates the cells in your sheet with the computed predictions. Thus, no data in your sheet has to reach BigML to obtain predictions; it stays on Google servers the whole time. This video shows the basic steps for this and other examples dealing with categorical models or clusters.
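The idea of running rows through a downloaded model locally boils down to a simple tree walk, sketched here in Python (the real add-on runs JavaScript on Apps Script servers). The node layout below is hypothetical and far simpler than BigML’s actual model JSON:

```python
# Toy decision tree predicting wine sales from a rating; the structure
# and numbers are made up purely to illustrate local prediction.
tree = {
    "field": "rating", "value": 90,
    "left":  {"output": 120},   # rating <= 90
    "right": {"output": 480},   # rating > 90
}

def predict(node, row):
    """Walk the tree for one row until a leaf's output is reached."""
    while "output" not in node:
        branch = "right" if row[node["field"]] > node["value"] else "left"
        node = node[branch]
    return node["output"]

rows = [{"rating": 95}, {"rating": 70}]
print([predict(tree, r) for r in rows])  # [480, 120]
```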
Our add-on will be visible under the add-ons menu in Google Sheets as soon as the Google add-ons approval process is completed. We will update this post accordingly; however, if you want to be an early adopter, just let us know today!
Attention Google power users: we have made a number of improvements to make BigML more compatible with Google services for your convenience. Google Cloud is becoming the fastest-growing cloud provider, and we have been receiving requests from users all over the world, so we finally took the hint.
For starters, in addition to Amazon and GitHub, you can now log in to BigML with your Google ID. Click on the Google option under the Login button and you will be authenticated right away, ready to start on your machine learning project.
Since our aim is to make it super easy to upload your data to BigML regardless of your cloud provider, we have added both Google Drive and Google Storage support as well. Similar to our integrations with Azure Marketplace and Dropbox, connecting to your cloud storage only takes a few clicks starting from the cloud icon located on the Sources tab.
The first time you go through this flow, you will be asked to allow BigML to access your Google Drive or Storage, which automatically generates and displays an access token.
The next time you want to access one of your data sources stored on Google, you can use the same menu on the Sources tab and it will bring up all your folders in a modal window as shown below.
Select the one you are interested in and it will be uploaded as a new source on BigML right away. So you are off to the races with your machine learning project just like that.
Let us know how this works out for you. If you like it, please give a shout out to other fellow Google users, so they too can take advantage of it.
We had a great turnout for our 2015 Winter Release webinar earlier this week. Poul Petersen, BigML’s CIO, demoed many features including G-Means Clustering, Dynamic Scatterplot Visualization, BigML Projects, Google Integration and Sample Service — all recently deployed in the BigML platform. Thank you to all those who attended!
In case you missed it, you can view the webinar in its entirety below and you can also check it out on the BigML Training page, where it will be posted shortly along with our other video resources.
This past week BigML was a proud sponsor of the 29th Annual AAAI Conference in Austin, TX. The conference drew over 1,200 attendees from across the AI community: researchers, practitioners, scientists, and engineers. The program was filled with incredible presentations, discussions and professionals passionate about what they do. The main conference hall was buzzing at all hours of the day, and BigML was right in the middle of it, giving live demonstrations of our powerful platform and discussing all facets of machine learning with students, professors, and seasoned AI professionals alike.
Aside from the generous servings of coffee and donuts, we are very fortunate to have interacted with such a talented community! Read more about our adventures below!
ACM SIGAI Career Panel
At one point in our lives we’ve all been there: searching for a job in an often precarious job market, writing and re-writing resumes, submitting applications, exhausting your personal network and hoping for that phone call back. It can be a daunting and unflattering experience, but once you receive that job offer, all the hard work you put in comes full circle. Once on the other side of the desk, so to speak, it’s often easy to overlook your past experience. However, it’s always important to give back and put yourself in a position where you can help those who are sitting exactly where you once were.
BigML had the opportunity to do just that this week. Poul Petersen, CIO of BigML, was selected to be a panelist for the ACM SIGAI Career Panel hosted by SIGAI in Austin. Poul was one of the experts seated alongside Peter Clark from the Allen Institute for AI, Eric Horvitz from Microsoft Research, and Ben Kuperman from Oberlin College; each brought their own unique perspective to the table. The goal of the event was for the panelists to engage with the 40+ senior PhD students and post-doctorates in order to address their questions and concerns about the AI job market, applying for the right position, career planning and positioning for success. All too often workshops such as these are overlooked, and we are proud to have participated in helping the next generation of AI professionals take the appropriate steps toward reaching their dream jobs! Below are a few quick insights regarding the size and importance of AI in the job market:
- There are more than 26,000 computer and information science researchers in the US
- Artificial Intelligence is growing by 20% annually
- Since 2012, more than 170 AI dedicated start-ups have entered the market
- VCs pledged $188M to AI companies in Q4 2014
- Top U.S. salaries are reported in California, Texas, the Northwest, and the Northeast (MA to VA) with an overall median base salary of $105k.
BigML Meet-Up: Austin Data Geeks!
Not only is the AI scene ingrained in the city, but the technology industry itself has also found a home in Austin. In recent years, Austin has been home to four start-ups that were either acquired or reached a public offering at more than $1 billion each. In addition, in Q1 2014 Austin-based start-ups received more than $350 million in VC funding. This city is bursting with opportunity.
So, it’s no wonder that we were thrilled to participate in a MeetUp in such a vibrant and active community. The amount of talent and excitement around technology made it a perfect fit for us. BigML sat down with the Austin Data Geeks at RackSpace to discuss our platform. The meet-up drew 60+ attendees from Austin-area technology companies, start-ups and all-around tech enthusiasts! The session was very interactive, and Poul Petersen, CIO of BigML, gave a presentation showcasing how our intuitive interface can help shorten the time to actionable insights, even for the most hardcore data wranglers out there.
We left Austin with smiles on our faces — and stomachs full of Texas BBQ — after a fantastic week of immersing ourselves in the AI world. We look forward to all that’s coming our way and you should too. BigML has plenty to offer and we are adding new features to our platform every day. Join us on February 11 for our 2015 Winter Release webinar and see what’s new!
If you haven’t yet experienced BigML, we’d like to remind you that you can try it for FREE with tasks under 16MB in development mode. If you would like to add capacity, keep in mind that we always offer plans for as low as $15/month with the 50% special discount for students, public researchers, or NGOs.
We look forward to circling back with all of you, and until then, keep predicting!
BigML is kicking off 2015 with many new great capabilities that are included in our Winter 2015 Release, which we’ll be sharing with you in a webinar on February 11 at 9:00 Pacific / 17:00 GMT.
If you are located in Australia, New Zealand or elsewhere in the Pacific Rim, we’ll be co-hosting a separate webinar along with our alliance partner GCS Agile. It will be on February 12 at Noon Australia Eastern Daylight Time / 4AM GMT. Please register for our ANZ/APAC webinar to reserve your seat.
Some key highlights from the release include:
BigML’s new Sample Service provides fast access to datasets that are kept in an in-memory cache which enables a variety of sampling, filtering and correlation techniques. We have leveraged BigML’s sample service to create a Dynamic Scatterplot visualization that we’ve released into BigML Labs, and which we’ll showcase on the webinar.
G-Means clustering, the latest addition to BigML’s unsupervised learning algorithms, is ideal for when you may not know how many clusters you wish to build from your dataset.
We’re happy to introduce Projects to help you organize your machine learning resources. On the webinar we’ll show you how to create a project from a new data source and how to manage your associated tasks and workflows.
With the Winter Release, you’ll now be able to add sources to BigML through Google Cloud Storage and Google Drive, similar to our prior integrations with Dropbox and Azure Data Marketplace. You can also now log into BigML using your Google ID.
Our team is constantly working on innovative applications built on top of BigML’s API. We’re now unveiling several of these in early access through our “BigML Labs”. Join us to see in action two of our latest applications codenamed BigML GAS and BigML X.
We’ve also made many UI tweaks, API bindings updates, BigMLer enhancements and general improvements that we’ll highlight in the webinar as we show off the Winter Release.
Once again, webinar space is limited, so please register today!
As faithful BigML blog readers will recall, last year we used the BigML platform to build a variety of analyses to predict the outcome of Super Bowl XLVIII. You can read full details here, but a short summary is this:
- The Denver Broncos were favored to beat the Seattle Seahawks by 2.5 points
- BigML predicted the Seahawks not just to cover the point spread, but also to win the game outright
- BigML also predicted that Seattle would win by the exact score of 43-8
Well, the last point isn’t true, but the first two are. We weren’t the only machine learning or analytics vendor aiming to pick the Super Bowl. SAP took a crack at it (perhaps they should stick to futbol), as did a Microsoft researcher, as did Facebook. The difference is that our prediction was not only bolder (picking Seattle not just against the spread but to win outright), it was also accurate.
Our data source and approach:
We used team rankings data again from Football Outsiders, which features their DVOA (Defense-adjusted Value Over Average) system. I like DVOA as it’s a more advanced way of judging a team’s performance. Football Outsiders’ very short summary of DVOA is “DVOA measures a team’s efficiency by comparing success on every single play to a league average based on situation and opponent” — so rather than just looking at raw statistics, we have a much more nuanced basis. This is especially critical since we’re working with a narrow range of features in addition to a ridiculously small number of data points.
If you’re interested in learning more about the DVOA approach (and in getting a snapshot of the type of thinking behind advanced sports analytics in general), this summary from Football Outsiders is a great read.
What fields were selected for the ensembles?
We used Football Outsiders’ NFL-wide rankings for Team Efficiency, both weighted (factoring in late-season performance) and non-weighted; Team Offense & Defense ranks (weighted & non-weighted); Special Teams ranks; the point spread; and the over/under total. We also included historical points scored by both the AFC & NFC teams in past Super Bowls. You can view and clone the full dataset that we built here, although you’ll have to deselect some of the fields if you want to replicate this specific ensemble.
The same caveats as last year pertain — namely that these predictions are based on a mere 25 rows of data and are only using aggregate season rankings for team performance, whereas professional handicapping systems will incorporate much more nuanced data. If you want to get a professional’s point of view, you can check out the Prediction Machine or many other pundits who leverage machine learning to try and beat the odds and/or help out your average wagering enthusiast.
Evaluating our models:
As the data is so limited, it’s not surprising that single models for any outcome evaluated poorly — we had to build 100-model ensembles to achieve evaluations that performed better than the mode or random guessing. Even then, it’s important to note that the hold-out set (using a standard 80/20 train/test split) has only a handful of instances. I was curious to see how my evaluations stacked up, so I used BigML’s handy new interface tweak that allows you to sort evaluations by their performance (I also leveraged the search feature to filter for “superbowl ensemble”):
Cut to the chase: who’s going to win?
Noting our prior caveats regarding sample size and mixed evaluations, here are our predictions against the 100-tree ensembles for each bet (all using plurality weighting):
- NFC (Seattle Seahawks) points: 39.37 ±14.6
- AFC (New England Patriots) points: 19.41 ±6.83
- Outright Winner: Seattle Seahawks (73.13% confidence)
- ATS (Against the Spread) Winner: Seattle Seahawks (66.30% confidence)
- Underdog/Favorite prediction: Underdog (Seahawks) (61.33% confidence)
- Over/under 49 prediction: Over (69.51% confidence)
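For the categorical bets, plurality weighting simply means each of the 100 trees casts a vote. A toy sketch with made-up votes follows; note that BigML’s reported confidence is computed differently, so the vote share below is only a rough analogue:

```python
from collections import Counter

# 100 hypothetical tree votes for the outright-winner bet.
votes = ["Seahawks"] * 73 + ["Patriots"] * 27

def plurality(votes):
    """Return the winning class and its share of the ensemble's votes."""
    winner, count = Counter(votes).most_common(1)[0]
    return winner, count / len(votes)

winner, share = plurality(votes)
print(winner, share)  # Seahawks 0.73
```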
There you have it: Seattle will beat the odds again and in taking home the championship will become the first repeat Super Bowl champions in a decade. It also looks to be a high-scoring affair, although the NFC total is undoubtedly skewed by the 43 points Seattle put up last year.
What to do now?
So should you run to your local betting hall or online sportsbook and bet your life savings on the Seahawks? Of course not — at least not based on this blog post. As stated above, these Super Bowl models have been built with very limited data, and perhaps have been subconsciously biased by the author’s Seahawks fanship. Last but not least, there’s the Beli-cheat / Deflate-gate factor — who knows what’s going to come from the team that has pushed all limits in the name of securing victories?
What we’d love to see you do is leverage BigML to build your own predictive model for the Super Bowl. For example, you could access data on all past games (not just super bowls) and use BigML’s clustering algorithm to find most similar matchups and then make a prediction based on those results. Or, you can simply access more data — factoring in games beyond the super bowl, but also more advanced statistics from outcomes on team trends, match-up trends (e.g., what happens in *any* game when the top-ranked defense faces the 5th ranked offense), individual player match-ups, etc. Are you into the “wisdom of crowds”? Pull social media data and leverage BigML’s text analysis to see what the general public is thinking.
Interested? We’re happy to help you out with your efforts. For starters, we’ll give you a free one-month Pro subscription – simply enter coupon code “XLIX” in the payment form. Stumped? You can always send us email at email@example.com, or join our weekly “Meet the Team” hangout next Wednesday at 9:30 AM Pacific.
No matter if you’re wagering or simply cheering for one team or the other, we hope you enjoy the Super Bowl.
We’re excited for a great visit to Austin, Texas in conjunction with BigML’s sponsorship of the AAAI-15 conference, which will take place during the week of January 25. The AAAI (Association for the Advancement of Artificial Intelligence) conference takes place every year and brings together the brightest minds and viewpoints on artificial intelligence from research, industry and academia. This is the first ever Winter edition of the event, which certainly reflects global growth and awareness of AI, machine learning and related disciplines.
For starters, Poul Petersen will be sitting on the SIGAI (the ACM Special Interest Group on Artificial Intelligence) career panel on Monday the 26th alongside Oren Etzioni from the Allen Institute for Artificial Intelligence, Eric Horvitz of Microsoft Research, and Ben Kuperman of Oberlin College.
From Tuesday through Thursday BigML will be sponsoring and demoing at the AAAI-15 conference itself, and on Tuesday evening we’re very pleased to be a guest of the Austin Data Geeks MeetUp! We’ll additionally be meeting with customers and partners in the area — so be sure to catch us at one of the events listed above and/or ping us at firstname.lastname@example.org to set up a time to chat.
When people first see a demo of BigML, there is often a sense that it is magical. The surprise likely stems from the fact that this isn’t how people are accustomed to computers working; rather than acting like a calculator creating a fixed outcome, the process is able to generalize from patterns in the data and make seemingly sentient predictions.
However, it is important to understand that predictive analytics is not magic. Although the algorithm is learning on a very basic level, it can only extract meaning from the data you give it. It does not have the wealth of intuition that a human has, for better or worse, and consequently the success of the algorithm often hinges on how you engineer the input features.
Let’s consider a very simple learning task — please keep in mind that this is a contrived example to make explaining the problem of feature engineering clear, and does not necessarily represent an actual useful end result itself.
Assume you are working on a navigational system, and at some point in the system you would like a way to predict the principal direction of a highway knowing only its assigned number. For example, if a user wants to go north, and there are two nearby highways, Interstate 5 and Interstate 84, which should they take?
Now, you could use a list of known highways, but this would require you to regularly update the list as new highways are built or removed. Instead, if there were a pattern relating principal direction to highway number, this might be a useful thing for your device to know.
So, let’s take a list of primary interstates in the US and let BigML train a model to predict the principal direction: East-West or North-South. The resulting tree looks like this:
In the highlighted node, you can see that the learning algorithm has discovered that if the highway number is greater than 96, then the highway is principally North-South. And indeed, if we look at the dataset, only two highways match this pattern, 97 and 99; both are North-South, so the pattern is relevant.
However, as you navigate around the tree it becomes obvious that each split is simply creating bounds that eventually isolate a single highway or a small group of highways, at which point the prediction is less of a generalization and more of a truism:
In other words, the model doesn’t seem to be generalizing in a meaningful way from the highway number to the principal direction.
Now, if you are familiar with the US highway numbering system, you might know that there is significance in whether the highway number is even or odd. Let’s re-engineer our dataset to include this property and see if the model changes. We can do this by selecting the “Add Fields to Dataset” option:
We’ll call the new field “isEven” and define it with the JSON s-expression:
[ "=", 0, [ "mod", [ "field", "Highway Number" ], 2 ]]
Reading from the inside brackets, we take the field named “Highway Number” and compute this value mod 2. If it equals 0, then this expression will return True meaning the highway number is an even integer, and False if odd:
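For comparison, the same even/odd feature computed in plain Python over a couple of rows (a sketch, not how BigML evaluates the expression server-side):

```python
# Compute the isEven feature exactly as the s-expression does:
# (Highway Number mod 2) == 0.
rows = [{"Highway Number": 5}, {"Highway Number": 84}]

for row in rows:
    row["isEven"] = (row["Highway Number"] % 2 == 0)

print([(r["Highway Number"], r["isEven"]) for r in rows])
# [(5, False), (84, True)]
```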
Now we re-build the model including this new feature:
And now we get a very simple tree which generalizes to the following rules:
IF isEven = false THEN direction = North-South
IF isEven != false THEN direction = East-West
This is a much more useful generalization! But what is happening here? Why didn’t the machine learning algorithm find this pattern in the first tree?
Remember the first dataset: all we gave the algorithm to learn from was an integer. And the only thing the algorithm knows about integers is that they have a natural order. That’s it. And so, it tried to find a pattern relating the natural order of the integers to the principal highway direction.
As humans, we potentially know a *lot* more about integers: some are squares, some are prime, some are perfect, and some are even. In the second dataset, we added some of this additional information about integers, specifically the even-ness, to the algorithm. By engineering this feature, we gave the algorithm the extra information it needed to find the pattern. In other words:
1) The “Feature Engineering” was adding the even/odd property.
2) The “Machine Learning” was the discovery that the even/odd property determines the principal direction.
The insight here is that a learning algorithm can only discover the patterns that we provide in the data, whether intentionally or accidentally.
In this rather contrived example, it might seem circular. That is, we start with an insight that even/odd has meaning, add that property, and then discover that even/odd has meaning. However, it is important to remember that this is a very simple example. When working with real data you may have hundreds or thousands of features and the patterns will be much more nuanced.
In that real world case, the importance of feature engineering is to use domain specific knowledge and human insight to ensure that the data contains relevant indicators for the prediction task. And, in that case, the beauty of machine learning is that it discovers the relevant patterns and filters out the incorrect human insights.
If you would like to run this example in your own account, here is a little python script which reproduces all the steps in development mode (FREE):
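As a local, API-free sketch of the experiment’s core idea, the raw number matters less than the engineered isEven feature (the interstate sample below is real; directions follow the odd = North-South, even = East-West convention):

```python
# Small real sample of US primary interstates and their principal directions.
data = [
    (5, "North-South"), (10, "East-West"), (15, "North-South"),
    (40, "East-West"), (80, "East-West"), (95, "North-South"),
]

# Feature engineering: add the even/odd property to each record.
engineered = [(num, num % 2 == 0, direction) for num, direction in data]

# The one-rule "model" the second tree found.
def predict(is_even):
    return "East-West" if is_even else "North-South"

accuracy = sum(predict(e) == d for _, e, d in engineered) / len(engineered)
print(accuracy)  # 1.0 on this sample
```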
Arguments missing when calling a function,
misspelling keys setting some advanced option,
accessing attributes your object lacks:
These are a few of my favorite bugs.
Non-closing commas on deep JSON structures,
long late-night hacking that ends up in rapture:
Found a solution, that actually sucks.
These are a few of my favorite bugs.
Wrong file permissions on root system folders,
new model’s methods that break on the olders,
out of range values on sliders and bars:
These are a few of my favorite bugs.
When the job fails,
and the app blocks,
or the cloud turns black,
I simply remember my favorite bugs
and then I don’t feel so bad.
When the job fails,
and the app blocks,
or the cloud turns black,
I simply remember it’s wintermute’s fault
and then I don’t care at all!