Not a day goes by without an important figure from the technology world acknowledging the immense potential of large-scale machine learning applications in virtually every industry. The award-winning researcher and scientist Professor Pedro Domingos of the University of Washington recently noted, “Algorithms increasingly run our lives. They find books, movies, jobs, and dates for us, manage our investments, and discover new drugs.” He then went on to add, “The race is on to invent the ultimate learning algorithm: one capable of discovering any knowledge from data, and doing anything we want, before we even ask.”
Machine learning’s impending impact on jobs that so far were considered exclusive to highly skilled humans will likely remain a controversial yet unavoidable topic for the foreseeable future. Take for example Early Stage Startup Investing. Silicon Valley venture capitalists have had a big hand in funding startups that are strategically leveraging machine learning to radically disrupt old world industries like hospitality and transportation services. This has helped create the much celebrated “Unicorns” that are defining the pace of innovation in our evolving digital consumer economy, where it makes more business sense to optimize the utilization of existing assets than to own and operate them outright.
As brilliant as this strategy has been, the utilization of machine learning to drive decision making in early stage investing itself has remained mysteriously absent. A controversial recent article by Arlo Gilbert succinctly articulates this “dirty secret” of Silicon Valley, which stubbornly remains very human and relationship driven. As Gilbert puts it, “A computer doesn’t worry about its next fund.”
Given the backdrop of a typical exponential technology, it is not too far-fetched to think that this kind of freedom from existing human biases, combined with increasingly powerful predictive models crunching increasingly sophisticated new sources of structured and unstructured data, may force VCs’ hands. They may have to take a stance on whether to embrace automated startup investing sooner than they realize. This also happens to be exactly what Kirk Kardashian of Fortune investigated in his recent article titled “Could algorithms help create a better venture capitalist?”
On the heels of these developments, Telefónica’s global entrepreneurship and innovation network Telefónica Open Future_ (TOF_) and BigML are proudly partnering to build the World’s First Automated Early Stage Investing Platform to help usher in a new way of technology investing. To make this a reality, the alliance will take advantage of the PAPIs conference event series, as it offers the best platform to introduce new predictive applications to the world:
- In all future PAPIs CONNECT events as well as the annual PAPIs conference, there will be a new startup battle section devoted to the best startups actively working on state-of-the-art predictive apps.
- Both the invited startups and the eventual winner will be automatically selected by TOF_ and BigML’s new algorithm. The winning startup will be offered an investment by one of the TOF_ funds, thereby making PAPIs the world’s first event where an algorithm automatically selects the contestants and the winner.
Machine learning innovators gathering in Sydney for this year’s International Conference on Predictive APIs and Apps are uniquely positioned to ride this new wave of change. The event is taking place on 6 & 7 August 2015 and is bringing together leaders from Amazon Machine Learning, BigML, Google Prediction API and Microsoft Azure ML on the same stage. As such, PAPIs event participants and followers who have a machine learning driven startup idea or a relevant ongoing project can now gain additional exposure and perhaps also secure a seed round investment from an established technology player with the foresight to go against the grain in true maverick fashion. As the original sponsor of PAPIs, BigML is looking forward to meeting our future contestants.
So far we are very pleased with the welcoming messages and the genuine interest you have expressed towards BigML’s new European headquarters in Valencia. We would like to capitalize on this and return the favor by announcing that we are organizing the 1st Valencian Summer School in Machine Learning on September 15 and 16 in collaboration with the Universitat Politècnica de València, the Universitat de València, and Las Naves. Besides adding to the stock of quality machine learning specialists in Europe, we think that this event will be a great opportunity for us to learn about your industry context, goals and strategic outlook so we can better shape BigML’s scalable, consumable, and highly programmable cloud-based machine learning platform to your fast evolving needs.
As Kevin Kelly of Wired magazine recently wrote, “The business plans of the next 10,000 startups are easy to forecast: take X and add AI.” We agree with his take wholeheartedly. These startups will be the future Microsofts and Googles and they will rely on Machine Intelligence at an unprecedented rate in doing so. We believe that the rate and magnitude of the innovations on this front will make yesteryear’s information retrieval and decision support systems look pedestrian in comparison.
In that vein, our FREE, invitation-only two-day summer school is perfect for advanced undergraduates as well as graduate students and industry practitioners seeking a quick, practical, and hands-on introduction to Machine Learning. It will also serve as a good introduction to the kind of work that students can expect if they enroll in advanced master’s programs at both the Universitat Politècnica de València and the Universitat de València. The primary goal is not only to introduce basic machine learning concepts, techniques, and tools, but also to share real-world experiences, let the students practice with real datasets, and help them build their first predictive applications. As such, lectures are going to be very intensive and will take place at Las Naves from 7:30am to 9:30pm on both September 15th and September 16th.
If this sounds right up your alley, please register today and be considered for one of 42 available spots at this challenging yet fun event. Applications will be evaluated based on a combination of interests, skills, and motivation. Make sure that you also follow our blog and Twitter account for further announcements on the event specifics such as participating lecturers. We wish you a perfect Valencian summer, where you get to sharpen your machine learning skills!
The first version of BigML’s add-on for Google Sheets has been released! The BigML add-on provides an easy way to fill in the blanks in your spreadsheets using the predictions of models and clusters in BigML. As we explained in a previous post, you can now fill in the columns in your spreadsheet by using existing BigML decision tree models to generate predictions based on the sheet data. Thus, using the add-on you can, for instance, score your sales prospects based on the historic sales records in your Google Sheets. Similarly, you can group your customer data into segments according to the clusters they belong to.
Get the add-on running
The add-on is available at the Chrome Web Store or directly from your Google Sheet by using the Get add-ons item in the Add-ons menu. The BigML add-on appears under the Business Tools category. Just click the +Free button to install it, and a new BigML submenu item will appear under the Add-ons menu.
By accepting the permissions the add-on requests, you allow it to access the data in your Google Sheet and your models and clusters at BigML. The add-on will read the Sheet data and download the BigML models or clusters to Google servers, where all the predictions and cluster labelings will be done. Then the add-on will appear as a sidebar in your Google Sheet. In order to authenticate and use the models and clusters in BigML, you will need to provide your credentials, which will be stored and used from then on (you can check your credentials at BigML at any time). Then you will be ready to start using the add-on.
Using the add-on step by step
Just click on BigML > Start under the Add-ons menu to see the sidebar that will show your models or clusters in BigML.
In BigML you can build your resources in a development environment (no cost involved) or in production. You can also organize your resources in projects. The search form in the add-on works in both environments, allowing you to filter your resources by name or project.
To fill in the blanks in any column of your Google Sheet you will need a model that can predict the field in this column from the contents of the rest of the columns in the same row. Select from the list the one that best fits your data and click on it. A detailed description of the model will appear in the sidebar, and the model will be ready to use in your Google Sheet.
Finally, select the range of rows that you would like to complete and click the Predict button to see the predictions appear. Note that the columns in your selection are expected to match the fields in the model (listed in the model description on your sidebar).
The blank cells will be filled with the predicted values (which in this case must belong to a list of categories), and a confidence rate that ranges from 0 (no confidence) to 1 (total confidence) will be placed in the last available column. If the column to fill has numerical values in it, the associated model will have a numerical objective field and the predicted values will also be numerical. In that case, a new column will be added to show the associated error for the prediction.
Using clusters works in much the same way. In this case, select the range of rows that contains the instances of data you want to segment and two new columns will be added to your Sheet. The first one will contain the label of the segment that the row belongs to (or centroid name). The second one shows the distance from the data in that row to the centroid, or central point, of its segment. You can check this video to see how the add-on works in basic use cases.
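The two columns added for clusters can be pictured with a small sketch. It assumes purely numeric rows and plain Euclidean distance, and the centroid names and values are made up for illustration; the way BigML itself scales fields or handles non-numeric values may well differ.

```python
import math

def nearest_centroid(row, centroids):
    """Return (centroid_name, distance) for the centroid closest to `row`.

    `centroids` maps a hypothetical centroid name to its coordinates.
    Plain Euclidean distance is an assumption made for this sketch.
    """
    best_name, best_dist = None, math.inf
    for name, center in centroids.items():
        dist = math.dist(row, center)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name, best_dist

# Hypothetical centroids and sheet rows (say, age and monthly spend).
centroids = {"Cluster 0": (25.0, 40.0), "Cluster 1": (55.0, 120.0)}
rows = [(27.0, 35.0), (60.0, 110.0)]

# Each row gains two new "columns": the segment label and the distance.
labelled = [row + nearest_centroid(row, centroids) for row in rows]
```

Rows with a large distance sit far from the center of their segment, so they are the ones worth a second look before acting on the label.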
First time BigML-GAS users
To use BigML for the first time, you’ll need to sign up on our web site. As a result, you will land in a development environment with some data sources available to take your first steps in the platform. Still, to start working with your add-on you will need to either:
- create a model from one of the available sources or your own historical data
- clone an existing model from our model gallery.
In order to create your first models you can upload any local or remote CSV file. When uploading from a public Google Sheet, use its export to CSV feature. The corresponding URL can be pasted in the remote URL form and the data in your Google Sheet will be uploaded to BigML.
The data will be transformed into a source, where each column will be described as a field and its type will be inferred from the uploaded contents. Then it only takes one click to generate a dataset, where all statistical information per field is stored. Select the field that you want to fill with predictions (or objective field) and another click will give you a model for your data. This model will be immediately available in your Google Sheets through the BigML add-on.
You can also search the Gallery of models for a model that fits your data. Remember that BigML will use the first row in your selection or range of data as a header row, and the names of the columns there should match the ones in the model you use to fill in the blanks. Once you find a model that suits your needs in the Gallery, you can get it from there by clicking the label at the top-right corner, and it will be cloned in your account, ready to use from your Google Sheets through the add-on.
This is all you need to enrich the information in your Google Sheets using BigML’s add-on. Let BigML bring Machine Learning to your Google Sheets!
So far we are very pleased with the welcoming messages and the genuine interest you have expressed towards BigML’s new European headquarters in Valencia. We would like to capitalize on this and return the favor by announcing that we will be sponsoring three Machine Learning events in the next two weeks. As we revealed yesterday, Professor Geoff Webb of Monash University, who invented the groundbreaking Association Discovery technology ‘Magnum Opus’, is joining BigML as Technical Advisor. This means he will be traveling to Spain to give two lectures on Scaling Log-linear Analysis to Datasets with Thousands of Variables. The first one will take place at Universitat Politècnica de València on July 8, 2015 at 4PM. The second lecture on the same topic will be held at the Artificial Intelligence Research Institute (IIIA-CSIC) in Bellaterra, Barcelona on July 14, 2015 at 12PM.
Very wide datasets are increasingly common in predictive analytics and data mining, thanks to a growing number (and volume) of public and private data sources as well as advances in feature engineering. If you are dealing with such datasets, this lecture will likely help you reconsider your previous assumptions and grow more comfortable scaling your solutions without having to throw away useful insights simply because of computational constraints. If interested, please simply stop by, or drop us a note if you want to know more details. Please be sure to also follow us on our blog and on Twitter for further updates.
Lastly, we’d like to remind you of our upcoming inaugural Machine Learning Valencia Meetup at Las Naves on July 9, 2015, where we will showcase BigML and get to meet and exchange information with the engineering and machine learning community in the city. We are looking forward to shaking hands and making new connections along the way.
Fresh off the news on the opening of our new European headquarters, we are excited to make public that BigML has completed the acquisition of the groundbreaking Association Discovery software Magnum Opus. First released fifteen years ago, and progressively refined since, Magnum Opus has delivered reliable and actionable insights for retailers, financial institutions and numerous scientific applications and embodies the state-of-the-art in the field of association discovery. Consequently, this acquisition is a significant step forward in BigML’s vision to build the world’s premier cloud-based Machine Learning platform including carefully curated, most effective algorithms and data mining techniques that have already proven their mettle on complex real-world predictive analytics problems.
As part of the acquisition, world-renowned expert on Association Discovery and this year’s ACM SIGKDD Sydney Conference program co-chair Geoff Webb has joined BigML as Technical Advisor. Dr. Webb is a Professor of Information Technology Research in the Faculty of Information Technology at Monash University in Melbourne, where he heads the Centre for Data Science. He was editor in chief of the premier data mining journal, Data Mining and Knowledge Discovery, for ten years. He is co-editor of the Springer Encyclopedia of Machine Learning, a member of the advisory board of the Statistical Analysis and Data Mining journal, a member of the editorial board of the Machine Learning journal, and was a foundation member of the editorial board of ACM Transactions on Knowledge Discovery from Data. Dr. Webb is an IEEE Fellow and has received the 2013 IEEE ICDM Service Award and a 2014 Australian Research Council Discovery Outstanding Researcher Award.
Association discovery is one of the most studied tasks in the field of data mining. Stated simply, association mining identifies items that are associated with one another in data. Historically, far more attention has been paid to how to discover associations than to what associations should be discovered. Having observed the shortcomings of the dominant frequent pattern paradigm, Dr. Webb developed the alternative top-k associations approach. Magnum Opus employs the unique k-most-interesting association discovery technique as it allows the user to specify what makes an association interesting and how many associations s/he would like. The available criteria for measuring interest include lift, leverage, strength (also known as confidence), support and coverage. This approach effectively reveals the statistically sound, new and unanticipated core associations in the data whereas most other association discovery tools find so many spurious associations that it is next to impossible to find useful associations amongst the dross. Association mining complements other statistical data mining techniques in a number of ways as it:
- Avoids the problems due to model selection. Most data mining techniques produce a single global model of the data. A problem with such a strategy is that there will often be many such models, all of which describe the available data equally well. Association mining can find all local models rather than a single global model. This empowers the user to select between alternative models on grounds that may be difficult to quantify and therefore hard for a typical statistical system to take into account.
- Scales very effectively to high-dimensional data. The standard statistical approach to categorical association analysis (i.e. log-linear analysis) has complexity that is exponential with respect to the number of variables. In contrast, association mining techniques can typically handle many thousands of variables.
- Concentrates on discovering relationships between values rather than variables. This is a non-trivial distinction. If someone is told that there is an association between gender and some medical condition, they are likely to immediately wish to know which gender is positively associated with the condition and which is not. Association mining goes directly to this question of interest. Further, association between values, rather than variables, can be more powerful (discover weaker relationships) when variables have more than two values.
- Strictly controls the risk of making false discoveries. A serious issue inherent in any attempt to identify associations with classical methods is an extreme risk of false discoveries. These are apparent associations that are in fact only artifacts of the specific sample of data that has been collected. Magnum Opus is the only commercial association discovery software to provide strict statistical control over the risk of making any such errors.
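To make the interest measures mentioned above concrete, here is a minimal sketch that computes coverage, support, strength, lift and leverage for a rule, and then keeps the k rules that rank highest on a chosen measure. The basket data and function names are made up for illustration, only single-item rules are considered, and the statistical tests that guard against false discoveries are left out entirely, so this should not be read as how Magnum Opus works internally.

```python
from itertools import permutations

def association_measures(transactions, lhs, rhs):
    """Interest measures for the rule lhs -> rhs over a list of item sets."""
    n = len(transactions)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    n_rhs = sum(1 for t in transactions if rhs <= t)
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)

    coverage = n_lhs / n            # share of cases matching the left side
    support = n_both / n            # share of cases matching both sides
    p_rhs = n_rhs / n               # share of cases matching the right side
    strength = support / coverage if coverage else 0.0  # a.k.a. confidence
    lift = strength / p_rhs if p_rhs else 0.0
    leverage = support - coverage * p_rhs
    return {"coverage": coverage, "support": support,
            "strength": strength, "lift": lift, "leverage": leverage}

def top_k_rules(transactions, k, measure="leverage"):
    """Rank all single-item rules a -> b by `measure` and keep the best k."""
    items = sorted(set().union(*transactions))
    scored = [((a, b), association_measures(transactions, {a}, {b})[measure])
              for a, b in permutations(items, 2)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical shopping baskets.
baskets = [{"bread", "butter"}, {"bread", "butter"}, {"bread"}, {"milk"}]

measures = association_measures(baskets, {"bread"}, {"butter"})
best = top_k_rules(baskets, k=2)
```

Asking for the top k by a measure the user picks is what keeps the output short: the list never grows with the number of spurious patterns in the data, only with k.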
The BigML product team has already started charting the path to a seamless integration of Magnum Opus capabilities into our platform in 2015. This means effective immediately, we will NOT be offering new Magnum Opus licenses or downloads. Existing Magnum Opus licensees will be supported as usual. Additional blog posts, a lecture series by Dr. Webb and more information on the integration timeline will be provided in the coming weeks so please stay tuned.
We are very happy to announce that BigML will be establishing its European Headquarters in Valencia, Spain. BigML has had a strong European connection since its early days thanks to the founding team members’ origins on the continent (e.g., our co-founder and CEO Francisco J. Martin grew up in Valencia and got his 5-year degree in Computer Science from Universitat Politècnica de València, and half of our team has been working from Valencia or Barcelona). As such, some of the most respected minds in Artificial Intelligence from Spain have been playing a big part in growing the business and making BigML a leader in today’s fast-growing Machine Learning landscape. In furthering our cause we intend to hire up to 15 engineers in the remainder of this year and to strengthen the ties between Valencia and Corvallis, Oregon.
Part of the decision to select Valencia was indeed driven by the fact that BigML already had several team members there who graduated from both Universitat de València and Universitat Politècnica de València. However, our interest in Valencia goes beyond mere convenience as we also have organic ties with other Spanish cities.
Spain has a large (#13 in the world) and diverse economy yet it has traditionally been underrated from a technology startup ecosystem perspective despite some success stories of its own, especially in the e-commerce space. Nevertheless, much of this high-tech activity has been concentrated in Barcelona and Madrid, which also attract technical talent from outside Spain to fuel their growth.
In contrast, as a focal point on the beautiful “Costa de Azahar”, Valencia has a metro area population exceeding 1.5M that is responsible for a very respectable GDP of $52.7 billion. Always a city thriving on trade with the biggest port on the western Mediterranean coast, Valencia is known as the leading automotive industry hub for Spain. Ford’s recently announced injection of 2.3 billion Euros in investment capital into expanding its Valencia operations is a testament to the long-term economic prospects of the city in this critical sector of the economy. Valencia’s strong economic growth over the last decade, spurred by tourism and the construction industry with concurrent development and expansion of telecommunications and transport, has been accompanied by a transition to a more service-oriented economy, where currently 84% of the working population is employed in the service sector. Consistent with this ongoing transformation, Valencia is looking to take the initiative to shift the economic mix towards higher value-added goods and services and, in turn, make waves in high-tech by attracting more tech businesses and tech jobs.
Major ICT companies have realized this and invested in branch offices in Valencia (e.g. HP, IBM). There are also regional SME technological companies specialized in different technological areas such as video game development, 3D, electronics, nanophotonics etc. According to Foundum, Spain’s equivalent of AngelList, Valencia ranks a solid 3rd behind Barcelona and Madrid in terms of startup ecosystem membership. Given its economic standing in the country, its young and well-educated technical talent and ongoing infrastructure investments one can argue Valencia has a lot more upside left in the tank.
Recent years have seen more startup scene momentum with precedents like that of Silicon Valley’s Plug-and-Play Center entering Valencia in 2012 as well as local co-working spaces like Workether. Similarly, VIT Emprende, initiated by the City Council’s Foundation InnDEA, brings together innovative entrepreneurs of Valencia. Its members exchange knowledge, collaborate on R&D, engage in technology transfer, and establish synergies through networking and maintaining contact with prominent entrepreneurs in the field around Valencia. Furthermore, Iker Marcaide (a member of BigML’s Board of Directors) has founded the international education payments outfit peerTransfer in Valencia. The company remains one of the hottest startups in Europe with its engineering team located in Valencia.
Valencia has 2 public and 2 private universities and more than 100,000 university students, including 15,000 ICT-related undergraduate and postgraduate students. So these developments are welcome news for the likes of Universitat Politècnica de València (36,000 students and 3,000 professors) looking to supply new technical talent into the ecosystem.
In terms of tech events, Valencia is the starting point of Campus Party, which went on to become one of the largest technology networking events in the world. In addition, there are regular video game development related events and a notable upcoming Health Informatics event.
Following these good examples, one of our first calls in Valencia will be to reach out to the developer and data scientist communities and inform them about what BigML has to offer. We have recently created a free Meetup group for this purpose and will be hosting a special demo and technical recruitment event on July 9, 2015. Please RSVP asap and reserve your spot. Prior experience with Machine Learning is helpful but not required for membership so long as you have the curiosity to learn and develop your data science skills. With your help, we are looking forward to adding some AI and Machine Learning flavor as a key ingredient in Valencia’s 21st century economic development paella. Hope you can join in the fun.
Ens veiem a València!
The 2nd annual PAPIs.io conference is getting closer. The August 6th & 7th event in Sydney is scheduled right before the KDD conference and it holds the distinction of being the World’s only conference dedicated to Predictive APIs and Apps. As we blogged recently, the conference program is chock-full of interesting sessions from distinguished speakers from a wide spectrum of industries. They will be covering real life examples of predictive applications and the lessons learnt by developers of such apps. In addition, this year’s conference will include a separate technical track with tutorials on tools and APIs for building predictive apps plus a dedicated research track. There is something for you to take away from PAPIs.io 2015 whether you are a business lead looking for innovative predictive use cases that can improve your KPIs, a developer looking to figure out ways to deploy highly scalable predictive applications with ease or a student or academic eager to keep up to date with some of the most coveted machine learning techniques with proven real-world outcomes.
In case you are still on the fence, the organizers of PAPIs have graciously decided to give away some tickets to aid with your plans. If interested, you can submit a short form and be in the running for this $195 value.
We are looking forward to meeting the newly minted members of the PAPIs community and sharing memorable moments in Sydney.
In our prior post we talked about clustering and how you can group your data together into segments by typing a single sentence using BigMLer, BigML’s command line API. The same can be done for Anomaly Detection, so after reading this post you should be able to find the outliers in your data with a single command.
Anomaly Detection is a technique used to identify the instances in your data that do not conform to the general pattern. Depending on the nature of your problem, pinpointing these instances can really make all the difference: they may be the errors in your dataset, which you would like to exclude before building models. They may represent fraudulent transactions in a credit card database, defective products in a manufacturing context, etc. In these cases, separating the wheat from the chaff is paramount for your business or research.
In BigML, we added the anomaly detector to our machine learning kit last year. It is an unsupervised tool, so you don’t need to label your data as normal or abnormal. You just upload it to BigML for the anomaly detector to figure that out for you. For a good example, just check David Gerster’s post about anomaly detection and breast cancer biopsies. Now let’s see how easy it is to build an anomaly detector using BigMLer.
Finding anomalies in your data
BigMLer can build an anomaly detector from any CSV data file just like this:
With this simple command, BigMLer will upload your data and build a source from it, create a dataset summarizing all the statistical information per field, and finally build an anomaly detector. The console will show links to these resources as they are created. Their IDs appear in these links and they are also stored for later use in files under the output directory of the command. You can also load your data from remote repositories:
or even stream it from your standard input
How do you then extract the anomalous instances in your dataset from the anomaly detector that you have just created? And how can you know how anomalous they are?
The information about the anomalous instances in your dataset is stored in the JSON that describes the anomaly detector itself, where a list of the top anomalies is enclosed. This is the information that our web interface displays, namely: the list of the instances with the top anomaly scores. Each instance in the dataset is assigned an anomaly score that ranges from 0 (least anomalous) to 1 (most anomalous).
What do these scores mean? The anomaly detector is built using an iforest, that is, a set of overfitted decision trees grown from samples of your data. The anomaly score is obtained from this iforest by comparing the mean depth of these trees with the actual depth of the node where the instance under test ends up. The rationale behind this procedure is pretty simple: the easier it is to single out an instance, the more anomalous it is, while the average-looking instances that follow the general pattern are hard to tell apart. Thus, the higher the score, the more an instance differs from the general pattern.
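The depth comparison described above can be sketched in a few lines. The code below follows the scoring formula from the original isolation forest paper (a score of 2 raised to minus the mean depth divided by a normalizing constant) on made-up two-dimensional data. It is an illustration of the idea only, and BigML’s actual implementation may differ in its details.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c(n):
    """Average depth of an unsuccessful binary search tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(points, rng, max_depth, depth=0):
    """Grow one isolation tree by splitting a random field at a random value."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    field = rng.randrange(len(points[0]))
    lo = min(p[field] for p in points)
    hi = max(p[field] for p in points)
    if lo == hi:
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[field] < split]
    right = [p for p in points if p[field] >= split]
    return {"field": field, "split": split,
            "left": build_tree(left, rng, max_depth, depth + 1),
            "right": build_tree(right, rng, max_depth, depth + 1)}

def path_length(point, node, depth=0):
    """Depth at which `point` lands, plus an estimate for the unsplit remainder."""
    if "field" not in node:
        return depth + c(node["size"])
    side = "left" if point[node["field"]] < node["split"] else "right"
    return path_length(point, node[side], depth + 1)

def anomaly_score(point, forest, sample_size):
    """Score in (0, 1]: the shallower the mean depth, the closer to 1."""
    mean_depth = sum(path_length(point, tree) for tree in forest) / len(forest)
    return 2.0 ** (-mean_depth / c(sample_size))

# Made-up data: a tight cloud of points plus one obvious outlier.
rng = random.Random(42)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)] + [(8.0, 8.0)]
sample_size = 64
max_depth = math.ceil(math.log2(sample_size))
forest = [build_tree(rng.sample(data, sample_size), rng, max_depth)
          for _ in range(50)]

typical = anomaly_score((0.0, 0.0), forest, sample_size)
outlier = anomaly_score((8.0, 8.0), forest, sample_size)
```

The point at (8, 8) is cut off from the rest after only a few random splits, so its mean depth is small and its score is higher than that of a point in the middle of the cloud.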
The anomaly detectors will by default use an iforest of 128 trees and show the top ten anomalies, but these figures can also be changed using the --forest-size and --top-n options. You can build an anomaly detector from an existing dataset in BigML using its identifier:
In this example, an iforest of 50 trees showing the top twenty anomalies will be created. The --anomaly-seed option is added to ensure that the random sampling is deterministic. Now that we have created and tailored our anomaly detector, how will we use it to improve our datasets and models?
Extracting outliers and anomaly scores
Well, the first use of an anomaly detector is extracting a new dataset that contains only the top anomalous instances:
The --anomaly option refers to the existing anomaly detector ID, and --anomalies-dataset is set to “in” to select only the top anomalies. The opposite case, excluding the top anomalies from the dataset used to create the anomaly detector, is also possible by using --anomalies-dataset out. This can be very useful to get rid of outliers in your dataset, as models built upon cleansed data will most likely perform better.
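The split between top anomalies and the rest amounts to ranking instances by score and cutting the list. Here is a minimal sketch of that idea on made-up rows and scores; the function name and data are hypothetical and it is not BigMLer’s own code.

```python
def split_top_anomalies(scored_rows, top_n):
    """Split (row, score) pairs into the top_n most anomalous and the rest."""
    ranked = sorted(scored_rows, key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n], ranked[top_n:]

# Hypothetical instances paired with their anomaly scores.
scored = [("row-1", 0.41), ("row-2", 0.83), ("row-3", 0.39), ("row-4", 0.67)]

# `anomalous` plays the role of the "in" dataset and `cleansed` of the "out" one.
anomalous, cleansed = split_top_anomalies(scored, top_n=2)
```

Training on the cleansed part only is the step that is expected to help later models, since the most atypical rows no longer pull them away from the general pattern.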
Sometimes you may prefer to see the score assigned to every instance in your dataset when deciding where the threshold for outliers should be. The best option then is to create a batch anomaly score for the anomaly detector training dataset that can later be downloaded as a CSV file, as in:
or stored as a new dataset with an additional column that contains the anomaly score for each instance:
The anomaly detector can be used to score datasets other than its own training dataset. To compute anomaly scores locally on any test data file:
If you use this command, the anomaly detector will be downloaded to your computer, and each instance in your test file will be scored locally. The resulting scores will be stored in the my_dir/anomaly_scores.csv CSV file. Similarly, if you would like to score an existing dataset remotely, you can use the --test-dataset option to set the dataset ID:
As you can see, removing outliers, detecting fraud and improving the quality of your data is just a BigMLer command away. Now it’s your turn to give it a try and join in the fun: we hope you get to bring the power of BigML to your command line before too long!
Building predictive models with machine learning techniques can be very insightful and provide tremendous business value in optimizing resources that are simply impossible to replicate manually or by more traditional statistical methods. It can best add this value when coupled with good data and domain expertise in interpreting the data and the predictions. Predictive modeling is seldom a one-shot exercise, where the first run through the cycle of data wrangling, feature engineering, model building, evaluation and predictions yields perfectly accurate results. Instead, the practitioner has to go through many more iterations with different input data and model configurations in order to minimize the error while steering clear of overfitting and other types of bias. In this regard, the process does require some “art” that must be appreciated for what it is.
Sometimes predictions may be out of whack despite best tuning efforts given available resources. When this happens, what are the steps one must take in order to improve the results?
We’d like to go over some best practices in doing so while conducting a post-mortem of our recent not-so-great Belmont Stakes predictions. As a reminder, last Friday we wrote about our predictions for the past weekend’s Belmont Stakes based on a decision tree ensemble built on BigML. The predictions called for American Pharoah falling short of securing the Triple Crown by placing somewhere in the middle of the pack at Belmont likely because the wear and tear of the first two legs of the Triple Crown would catch up with him against fresher horses. That’s exactly what had happened to California Chrome last year — the 12th contender coming close, but failing at Belmont. Many expert handicappers’ opinions seemed to align with the model so if nothing else we know that we are in good company.
Fast forward to today: unless you have been living under a rock, you have probably heard that American Pharoah defied our predictions and broke the 37-year Triple Crown drought after a solid performance leading the race wire-to-wire. Frosted, which came in second as the model predicted, gave the champion horse some competition as they raced into the last stretch, but ultimately could not respond to American Pharoah’s pace when he accelerated away into the books of thoroughbred racing history. The horse that our model predicted to be the most likely winner (#8 Materiality) was perfectly placed in second position after a mile or so, yet he was eased by his jockey in the last stretch and ended up finishing last. Given that he had the 3rd best odds at the start of the race, few expected such a dismal performance from him.
We included the results of last weekend’s race in our original dataset and ran an anomaly detection task on the resulting post-race dataset. As seen below, last weekend’s results have been assigned fairly high anomaly scores, which point to somewhat unexpected outcomes. However, the scores are not high enough to justify being marked as certain outliers so we would still keep them were they present in the original training set.
Our decision tree ensemble had performed better on our test dataset than the mean benchmark. Specifically, it beat the mean benchmark by approximately 9% and 13% on the Mean Absolute Error and Mean Squared Error measures, respectively. On the other hand, the R Squared (a.k.a. coefficient of determination) value for our ensemble was 0.11. Ideally, one wants the R Squared value to be closer to 1, which means perfect predictive power. However, in domains that involve a human component (as opposed to being driven by forces of nature) this is rarely the case. In those types of domains, R Squared values below 0.5 are pretty common and do not necessarily render the model useless. Thoroughbred racing definitely has a big human component given the horses, breeders, trainers and jockeys involved. With that said, we’d like to be able to improve on the 0.11 figure and to get a better edge over the mean benchmark. This would require a systematic approach in looking for different angles to further iterate on and to calibrate the model accordingly.
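To make these measures concrete, here is a minimal sketch of how Mean Absolute Error, Mean Squared Error and R Squared can be computed and compared against a mean benchmark. The numbers are made up for illustration (they are not our Belmont data), the function names are our own, and the benchmark is assumed to always predict a single average value.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def improvement_over_benchmark(model_error, benchmark_error):
    """Fraction by which the model's error is lower than the benchmark's."""
    return (benchmark_error - model_error) / benchmark_error


# Made-up "lengths behind" values, purely for illustration.
actual = [0.0, 2.0, 4.0, 6.0, 8.0]
model_predictions = [1.0, 2.0, 5.0, 5.0, 7.0]

# Hypothetical mean benchmark: always predict the same average value.
benchmark_value = 4.0
benchmark_predictions = [benchmark_value] * len(actual)

model_mae = mean_absolute_error(actual, model_predictions)
benchmark_mae = mean_absolute_error(actual, benchmark_predictions)
mae_gain = improvement_over_benchmark(model_mae, benchmark_mae)
model_r2 = r_squared(actual, model_predictions)
```

Note that a benchmark which predicts exactly the mean of the evaluated targets scores an R Squared of 0, which is why even a modest positive value such as 0.11 still represents an edge over it.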
Some of our readers jokingly commented on our post stating “apparently the machine has some more learning to do…”. Others mentioned that handicapping remains an art not a science. We still believe models can help provide more informed estimates than subjective opinion not grounded in data. However, we did take the advice to heart as there is some truth in both comments. So we went to work in understanding what caused the deviations knowing full well that predicting real life events such as horse races will remain an inherently difficult problem with many variables factoring in. In conducting our post-mortem analysis we’d like to reference Ahmed El Deeb’s post on ways to improve your predictive model, which coincidentally employs an equine analogy summed up in his graphic below that resembles a horse.
So let’s follow Ahmed’s framework and see where we may have fallen short and which areas it may make sense to apply more energy towards next time.
1) More Data
Our final data sample contained over 250 horses that took part in the Belmont Stakes in the last 25 years. Despite capturing over two decades of events, this is not a large dataset to draw rock solid conclusions from. Perhaps more importantly, even though there were close calls, we did not have a Triple Crown winner in this period for the model to learn from. So it would have been ideal to go back farther. This shortcoming alone could have accounted for the miss had the model at least placed American Pharoah in the front pack of finishers, but instead he was predicted to finish between 3rd and 5th place depending on track condition and odds changes right before the race. How can we explain his impressive victory then?
2) More Features
As we mentioned before, horserace handicapping is a complex undertaking with a long history. Professionals rely on many more signals in making their predictions than we did with our quickly whipped together model. Most notably lacking were:
- Mid-race performances for each horse rather than the race leader alone
- More complete view of past race performances for each horse inclusive of Kentucky Derby prep races
- Trip related variables summarizing whether the thoroughbred had problems during the race, e.g. bumped at the starting gate, stuck in the middle of the pack, loss of momentum due to jockey error etc.
- Pre-race workout performance
- Preferred racing style, e.g. needs the lead, likes to run off pace, closer etc.
- Beyer speed figures
- Post position etc.
With none of these key pieces of data that professional handicappers swear by, the model had to make the best of what we fed it given our time constraints — and it was still able to generate some edge.
3) Feature Selection
While this makes for good advice in situations where there is a fairly low number of observations against a fairly large number of features, we feel that BigML’s ensemble algorithm that was used to generate our model already did a good job of reducing the noise in the data by weighting a subset of the features much more heavily than others. We covered those model favored features in last week’s post so there is no need to repeat them here. If and when we include new independent variables, it would be a good time to re-evaluate features.
4) Regularization
Feature selection is more of a human guided process, where the practitioner shuffles various independent variables and observes whether or not that has a big impact on the resulting error measures (e.g. Root Mean Squared Error). Regularization, on the other hand, achieves the same effect implicitly: the algorithm automatically optimizes information gain (as in the case of decision trees) by using the minimal number of features while at the same time not overly relying on any single one of them. Of the 39 features we fed our model, it favored 10 or so, with Odds being the one it relied on the most. Be that as it may, Odds explained less than half (44%) of the historic deviations in Belmont Stakes relative finishing positions, which we had designated as our target variable. Again, the inclusion of new variables would likely have a significant impact on the model’s favored subset of features if we engaged in further iterations. These could very well yield new rules and relationships to consider.
5) Bagging
In his blog post, Ahmed makes the point that Bagging can do wonders in reducing prediction error variance without any noticeable impact on bias. We are in full agreement with his take and it just so happens that our model was already using the bagging technique (a.k.a. Bootstrap Aggregating) in creating its ensemble of decision trees. He then goes on to explain that Bagging comes at the additional cost of computational intensity. BigML has a solid implementation of its ensembles that takes care of the memory management behind the scenes so the user does not have to deal with the intricacies of hardware optimization. We invite more of our readers to take advantage of our linearly scaling ensemble models and lightning fast predictions.
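For readers curious about what Bootstrap Aggregating does under the hood, below is a minimal sketch of the idea. It is not BigML's implementation: the base learner is a deliberately tiny one-split regression "stump" on a single feature, the data is made up, and every function name is our own invention for illustration.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression stump; returns (threshold, left_mean, right_mean)."""
    overall_mean = sum(ys) / len(ys)
    # Fallback when no split is possible: always predict the overall mean.
    best_sse, best_stump = None, (float("inf"), overall_mean, overall_mean)
    for threshold in sorted(set(xs))[:-1]:  # excluding the max keeps both sides non-empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best_sse is None or sse < best_sse:
            best_sse, best_stump = sse, (threshold, left_mean, right_mean)
    return best_stump


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagged_stumps(xs, ys, n_models, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement) of the data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    return models


def predict_bagged(models, x):
    """Aggregate by averaging the predictions of all the stumps."""
    return sum(predict_stump(m, x) for m in models) / len(models)


# Made-up data: odds (profit per unit staked) vs. lengths behind the winner.
odds = [0.6, 5.0, 6.0, 10.0, 12.0, 15.0, 20.0, 30.0]
lengths_behind = [1.0, 3.0, 2.0, 6.0, 5.0, 9.0, 12.0, 15.0]

models = fit_bagged_stumps(odds, lengths_behind, n_models=25, seed=42)
prediction = predict_bagged(models, 8.0)
```

Each stump sees a slightly different resampling of the data, so the individual predictions vary. Averaging them is what dampens the variance.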
6) Boosting
Boosting relies on training several models successively in trying to learn from the errors of the preceding models. It can decrease bias with minimum impact on variance, but can make for a complex implementation scenario as far as the pipeline required to support it. This methodology is not yet supported by BigML and thus would not make a difference in the case of our Belmont Stakes model.
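As a rough illustration of the "learn from the errors of the preceding models" idea, here is a minimal sketch of one common flavor of boosting for regression, in which each new model is fit to the residuals left by the models before it. The stump learner, the learning rate, the data and all function names are assumptions made for this example only.

```python
def fit_stump(xs, ys):
    """Fit a one-split regression stump; returns (threshold, left_mean, right_mean)."""
    overall_mean = sum(ys) / len(ys)
    # Fallback when no split is possible: always predict the overall mean.
    best_sse, best_stump = None, (float("inf"), overall_mean, overall_mean)
    for threshold in sorted(set(xs))[:-1]:  # excluding the max keeps both sides non-empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best_sse is None or sse < best_sse:
            best_sse, best_stump = sse, (threshold, left_mean, right_mean)
    return best_stump


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted_stumps(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the remaining errors."""
    base = sum(ys) / len(ys)
    residuals = [y - base for y in ys]
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Only part of each correction is applied, which keeps the steps small.
        residuals = [r - learning_rate * predict_stump(stump, x)
                     for x, r in zip(xs, residuals)]
    return base, stumps, learning_rate


def predict_boosted(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up data: odds (profit per unit staked) vs. lengths behind the winner.
odds = [0.6, 5.0, 6.0, 10.0, 12.0, 15.0, 20.0, 30.0]
lengths_behind = [1.0, 3.0, 2.0, 6.0, 5.0, 9.0, 12.0, 15.0]

model = fit_boosted_stumps(odds, lengths_behind)
baseline_mse = mse(lengths_behind, [model[0]] * len(odds))
boosted_mse = mse(lengths_behind, [predict_boosted(model, x) for x in odds])
```

The comparison here is on the training data only, so it shows the bias reduction but says nothing about overfitting. A real evaluation would use a held-out test set.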
7) Different Class of Models
Finally, different types of algorithms can yield different (and at times complementary) results in minimizing overall prediction errors. On the other hand, a comparative approach seeking to run many algorithms in parallel prolongs the solution development cycle and can be utterly uninterpretable. Some Machine Learning researchers who have specialized knowledge of the underlying mathematics of certain algorithms may have a better chance of explaining why algorithm X came up with prediction Y, but many ML practitioners and enthusiasts do not have that level of deeper understanding, leaving them with a steep hill to climb when it comes to dealing with many more algorithms in solving a single predictive use case. In those instances, it may be best to rely on general knowledge as to which algorithms tend to work best on what type of problems and keep the process as simple as possible before involving more algorithms. For example, decision trees and decision forests are known to be very versatile across various problem spaces. This is one of the main reasons BigML chose to offer them first, in addition to their interpretability and scalability advantages.
To recap, this has been a fun learning experience in a brand new domain for us. It was an acceptable first step in the right direction, but serious effort must be undertaken to make it a professional grade system in the future, primarily by incorporating many more features into our model. Given the addition of even more features than the 39 we have had, it may then be worthwhile to do some human-guided feature selection. Subject to data availability and time constraints (we are not a sports betting outfit after all) we may offer some new horse racing insights prior to next year’s Belmont Stakes, barring unforeseeable factors like in-race injuries etc.
In the meantime, we celebrate the mighty American Pharoah who left no doubt as to which horse is the best of his generation. Maybe more importantly, he single-handedly restored our faith in fairy tale endings.
(NOTE: We suggest that you also read our follow up post including the post-mortem analysis of the results from the machine learning model described here.)
This Saturday Americans will witness the 147th Belmont Stakes, thoroughbred racing’s 3rd and final leg of the highly coveted Triple Crown. If you follow horse racing, you will know that no horse since Affirmed in 1978 has been able to claim this elusive prize. In fact, Affirmed was only the 11th Triple Crown winner in the long history of American thoroughbred racing. Since 1978 there have been 13 close calls, where a horse won both the Kentucky Derby and the Preakness Stakes yet failed to repeat the same success in Belmont Park (in Elmont, NY), where the famed Secretariat showcased a historic performance in 1973 and broke the world record for the distance, a record that still stands today.
As was the case during last year’s Triple Crown (one featuring a rare California-bred colt aptly named California Chrome), 2015 has also presented lovers of the “Sport of Kings” with what seems a worthy suitor in American Pharoah. American Pharoah won the prestigious Kentucky Derby by a length in a hard-fought battle ahead of Firing Line. He then carried his form to the Pimlico Race Track to also claim the Preakness Stakes on a rain drenched track pretty much wire-to-wire ahead of serious competition, this time by a comfortable 7 lengths. This has stoked the sports media machine and here we are again, eagerly waiting to crown a new super horse after some 37 years.
But the question remains, can American Pharoah really achieve the feat? If you ask FiveThirtyEight’s Benjamin Morris or Wired Magazine, it remains a tall order for this year’s contender to buck the recent trend and to master the “Test of Champions” on Saturday. Many experts tie these near misses to several factors including:
- the longer race distance of Belmont Stakes (12 furlongs = 1.5 miles = 2400 m.) as compared to the Derby (10 furlongs = 1.25 miles = 2000 m) and Preakness (9.5 furlongs = 1.19 miles = 1900 m), which demands more of a distance horse that contrasts against the more sprint-like first two legs of the Triple Crown
- the grind of having to run three stakes races in five weeks – often against a Belmont field that hasn’t gone through the same gauntlet and thus is more fresh.
Despite these factors, it seems odds-makers are still pinning high hopes on Triple Crown contenders, frequently listing them as even money bets. American Pharoah is no exception at 3/5 morning line odds, with odds for the nearest horse (Frosted) listed as 5/1, more than 8 times the payoff.
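The arithmetic behind that comparison is easy to check. The sketch below parses fractional odds into profit per unit staked, and also derives the win probability the odds imply if one ignores the track's take (a simplifying assumption on our part). The helper names are our own.

```python
from fractions import Fraction


def parse_odds(text):
    """Parse fractional odds such as '3/5' or '5-1' into profit per unit staked."""
    numerator, denominator = text.replace("-", "/").split("/")
    return Fraction(int(numerator), int(denominator))


def implied_probability(odds):
    """Win probability implied by the odds, ignoring the track's take."""
    return 1 / (1 + odds)


favorite = parse_odds("3/5")  # American Pharoah's morning line
rival = parse_odds("5/1")     # Frosted's morning line

# How many times larger the rival's profit is for the same stake.
payoff_ratio = rival / favorite

print(float(payoff_ratio))                  # about 8.33
print(float(implied_probability(favorite)))  # 0.625
print(float(implied_probability(rival)))     # about 0.167
```

In other words, a winning $1 bet at 3/5 returns a profit of 60 cents while the same bet at 5/1 returns $5, and the morning line implicitly treats the favorite as close to four times as likely to win.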
Though we have taken note, we were not fully satisfied with the oft-repeated assumptions and subjective opinions with no grounding in data. Therefore, we thought it may be a fun exercise to construct a predictive model using BigML in order to predict the outcome of this weekend’s race among the 8 entrants. As we go into the details of what we found out, please be forewarned that we are NOT professional handicappers here. While we hope you enjoy the informational aspect of this post, we urge you NOT to bet your life savings on any of the predictions below. (If you are not particularly interested in how this analysis was conducted, you may prefer to jump to the conclusion section or else read on!)
THE HANDICAPPING DATA
Thoroughbred racing records are pretty well kept and go back a long time. However the format has evolved over time and because of its non-digital origins, online services that offer such data in a reliable way are somewhat limited. For this analysis, we used publicly available race result sheets for all Triple Crown races (Kentucky Derby, Preakness Stakes, Belmont Stakes) in the last 25 years.
So, what data do we have after all? Below are the horse and race related variables we used to build a predictive model:
- RACE YEAR
- RACE NAME
- HORSE NAME
- HORSE PEDIGREE
- FINISH RANK
- FINISH TIME
- TRACK CONDITION (i.e. Fast, Good, Sloppy)
- 1st SPLIT (The time of the leading horse after 2 furlongs or 400m.)
- 2nd SPLIT (The time of the leading horse after 4 furlongs or 800m.)
- 3rd SPLIT (The time of the leading horse after 6 furlongs or 1200m.)
- 4th SPLIT (The time of the leading horse after 8 furlongs or 1600m.)
- ODDS (Morning line odds for each horse to run the Belmont Stakes)
- LENGTHS BEHIND (# of horse lengths this horse was behind as compared to the race winner.)
- % of RACE RECORD TIME
- AVERAGE SPEED
- POINTS (Proprietary measure of the horse’s race performance as compared to the best ever recorded time for that distance and track.)
Horse race handicapping is pretty much a science these days and there are a number of businesses set up on this premise. Admittedly, there are many more pieces of data available that we chose to leave out due to time constraints. One could argue that a number of those could potentially influence the race predictions, e.g. Equipment, Post Position, Sire etc. However, the list above does capture a good level of detail on the previous Triple Crown performances by these elite horses. As such, these variables should intuitively be linked to the results we have observed over the last 25 years.
All in all, we looked at 729 distinct horses having raced in at least one of the 74 Triple Crown races and filtered it down to our final dataset containing 256 horses having participated in the Belmont Stakes since 1991.
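The filtering step itself is simple. We did our data munging in pandas, but the sketch below uses only the Python standard library so that it runs anywhere. The records, column names and horse names are hypothetical stand-ins, not rows from our actual dataset.

```python
# Hypothetical records: one row per horse per race (column names are assumptions).
results = [
    {"race_year": 1989, "race_name": "Belmont Stakes", "horse_name": "Horse A"},
    {"race_year": 1995, "race_name": "Kentucky Derby", "horse_name": "Horse B"},
    {"race_year": 1995, "race_name": "Belmont Stakes", "horse_name": "Horse B"},
    {"race_year": 2003, "race_name": "Belmont Stakes", "horse_name": "Horse C"},
    {"race_year": 2003, "race_name": "Preakness Stakes", "horse_name": "Horse D"},
]


def belmont_entries(rows, first_year=1991):
    """Keep only Belmont Stakes rows from first_year onwards."""
    return [row for row in rows
            if row["race_name"] == "Belmont Stakes" and row["race_year"] >= first_year]


# Distinct horses across all races vs. those that ran the Belmont in the window.
all_horses = {row["horse_name"] for row in results}
belmont_rows = belmont_entries(results)
belmont_horses = {row["horse_name"] for row in belmont_rows}

print(len(all_horses), len(belmont_horses))  # 4 2
```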
MODELING & RESULTS
BigML’s decision trees can unearth instant insights that are hidden in your data by analyzing numerous input variables as they relate to your target variable. In this case, we are trying to predict LENGTHS BEHIND for Belmont Stakes 2015. So that becomes our “Target Variable”. The input or independent variables are all the others listed above, inclusive of the 2015 Kentucky Derby and Preakness Stakes that took place in the last 5 weeks. Once the data munging process was over (in this case we used pandas), we uploaded our “Machine Learning ready” data file (i.e., data was in tabular format) to BigML and started analyzing it using BigML’s built-in Anomaly Detector, Clustering and Decision Tree algorithms. To be able to compare and contrast different model configurations, we performed several iterations of the model building process, evaluating model performance at the end of each iteration and either keeping or dumping the original model based on the results. In most cases, the whole process took less than a minute or two, and just like that we had our models correlating 25 years’ worth of racing data to this Saturday’s expected outcomes. We finally settled on a Decision Tree ensemble model (with Bagging) containing 250 decision trees. This particular model handily beat the random picks benchmark, and was also able to comfortably beat (by over 12%) a more educated benchmark utilizing the historic average lengths back figure for Belmont Stakes. All of this benchmarking was done without a line of coding, instead relying on BigML’s powerful model evaluation feature.
The nifty Model Summary Report feature of our decision tree has revealed the following as the most influential input variables on predicting Belmont Stakes performance with the predictive power expressed as %s listed in parentheses after each:
- ODDS — BELMONT (22%)
- JOCKEY (22%)
- TRAINER (13%)
- HORSE PEDIGREE RELATED (10%)
- RACE YEAR (4%)
- LENGTHS BEHIND — KENTUCKY DERBY (4%)
- HORSE PEDIGREE RELATED (4%)
- AVERAGE SPEED — KENTUCKY DERBY (3%)
- POINTS — KENTUCKY DERBY (3%)
- % of RACE RECORD TIME — PREAKNESS (3%)
ANALYSIS & CONCLUSION
Let’s start by stating that the interpretation here is not written in stone, but rather based on a general understanding and the recent trends in the sport of American thoroughbred racing. Furthermore, none of this is meant to entice you, the reader, into gambling on this weekend’s race. And none of it should be treated as professional handicapping advice by any measure.
With that said, it is somewhat expected and relieving that the betting odds are the foremost predictor for this weekend’s race. If it did not show up at all, we’d be both surprised and worried about the racing fans relying on the experts’ and fellow racing fans’ “collective intelligence”. However, the model tells us that the odds are not all that they are cracked up to be (at least for the Belmont Stakes). After all, the odds explain far less than half the variation in results and thus cannot be exclusively trusted in this case.
Next up are the Jockey and Trainer of the horse. These point to certain experienced teams having an edge in Belmont either because they are based there or because they have a history of success and/or a higher ability to attract great horses. In a way, this reminds us of the recruitment halo effect that programs and coaches like John Calipari of University of Kentucky NCAA basketball fame have come to enjoy in other competitive arenas. In other words, success begets more success.
There is also a large chunk of the handicapping world that puts a lot of emphasis on a horse’s pedigree. There are very detailed accounts of the progeny going back 9 or 10 generations deep just to be able to tell if a given horse carries the genes to run longer distances or whether it is better suited for sprint racing. Similarly, experts claim that these signals can determine how well the thoroughbred will perform on turf vs. dirt surfaces. It is a bit of an art mixed in with some science, so we won’t go into details here. Although these factors did not weigh in as heavily as we expected, they do have a say in the predictions, which hopefully somewhat justifies the crazy high stud fees some thoroughbreds are able to secure for their breeders.
At first, it was surprising to us that the Race Year also mattered. We suspect this may have something to do with the aforementioned evolution caused by the in-breeding with an emphasis on speed vs. stamina. Indeed, it appears that we have gotten to a point where most of the first rate horses never run a race as long as the Belmont Stakes in their entire racing career anymore. This suggests older years are more likely to present us with the so called “Stayers” that pack the kind of stamina an ideal thoroughbred should possess at the starting gates of Belmont Stakes.
The remaining variables center around the Kentucky Derby performance of the horse. Many believe that asking the modern thoroughbred to race against top competition 3 times in 5 weeks is too much, perhaps even dangerous. That is why skipping the Preakness to have a better shot at Belmont has become a very popular strategy in recent years for the trainers and owners of these man-made modern speed demons. In this regard, our model may be pointing to the efficacy of this approach with a number of performance measures from the first leg of the Triple Crown. In contrast, there is only a single Preakness performance related measure, further pointing to the lesser middle child status of Maryland’s Preakness.
PREDICTING THE 2015 BELMONT STAKES
Given the rough model construct we just covered, which horse(s) seem to have the best chance to reach the finish line first and make history? Will American Pharoah be a living legend or will some no name horse spoil the day for horse racing fans once again?
Without further ado, given the field of 8 entrants, the model predicts the finishing positions as follows:
- Win: MATERIALITY (6-1)
- Place: FROSTED (5-1)
- Show: MADE FOR LUCKY (12-1)
- KEEN ICE (20-1)
- AMERICAN PHAROAH (3-5)
- FRAMMENTO (30-1)
- TALE OF VERVE (15-1)
- MUBTAAHIJ (10-1)
As the prediction goes, American Pharoah will end up being one more false hope as the next Triple Crown champion, finishing somewhere in the middle of the table and disappointing his fans despite a great overall Triple Crown racing performance. Some experts claim that given this field, American Pharoah is likely to jump ahead early as he did in the Preakness and dictate the pace from then on in the hopes that he can steal the race. In that scenario, it seems that our model is predicting that first Materiality and Frosted, and later in the stretch Made for Lucky and Keen Ice, can close in on Pharoah’s tired legs, having enjoyed a much longer and less stressful preparation before this final “Test of Champions”. In addition, when we look at the confidence intervals accompanying this most likely scenario, they mostly point to American Pharoah finishing anywhere from 3rd to 5th, never seriously threatening the favored Materiality.
We hope this inspires you to grab a cool drink and catch the Belmont Stakes live or on TV or online this weekend. Horse races can be exhilarating events for a wide audience from youngsters to the retired. These beautiful animals are true athletes in the sense that they can move their 1000 pound masses at speeds approaching 40 miles per hour (65 km/h) across a dirt track, all the while dealing with equally talented equine rivals looking in their eyes stride by stride. We sure hope thoroughbred racing in America gets to recapture its former glory, with races beyond the Triple Crown becoming cherished events too. Who knows? Perhaps Machine Learning gets to play a small part in that!