How to Improve Your Predictive Model: A Post-mortem Analysis
Building predictive models with machine learning techniques can be very insightful and provide tremendous business value in optimizing resources that are simply impossible to replicate manually or by more traditional statistical methods. It can best add this value when coupled with good data and domain expertise in interpreting the data and the predictions. Predictive modeling is seldom a one-way street, where the first run through the cycle of data wrangling, feature engineering, model building, evaluation and predictions yields perfectly accurate results. This requires the practitioner to go through many more iterations with different input data and model configurations in order to minimize the error while steering clear from overfitting and other types of bias. In this regard, the process does require some “art” that must be appreciated for what it is.
Sometimes predictions may be out of whack despite best tuning efforts given available resources. When this happens, what are the steps one must take in order to improve the results?
We’d like to go over some best practices in doing so while conducting a post-mortem of our recent not-so-great Belmont Stakes predictions. As a reminder, last Friday we wrote about our predictions for the past weekend’s Belmont Stakes based on a decision tree ensemble built on BigML. The predictions called for American Pharoah falling short of securing the Triple Crown by placing somewhere in the middle of the pack at Belmont likely because the wear and tear of the first two legs of the Triple Crown would catch up with him against fresher horses. That’s exactly what had happened to California Chrome last year — the 12th contender coming close, but failing at Belmont. Many expert handicappers’ opinions seemed to align with the model so if nothing else we know that we are in good company.
Fast forward to today, unless you were living under a rock, you probably heard that American Pharoah defied our predictions and broke the 37 year Triple Crown drought after a solid performance leading the race wire-to-wire. Frosted, which came in second as the model predicted gave the champion horse some competition as they raced into the last stretch, but ultimately could not respond to American Pharoah’s pace when it accelerated away into the books of thoroughbred racing history. The horse that was predicted to be the most likely winner per our model (#8 Materiality) was perfectly placed in second position after a mile or so yet he was eased by his jockey in the last stretch thereby ending up with the last place finish. Given his 3rd best odds at the start of the race, this dismal performance was not expected from him by many.
We included the results of last weekend’s race in our original dataset and ran an anomaly detection task on the resulting post-race dataset. As seen below, last weekend’s results have been assigned fairly high anomaly scores, which point to somewhat unexpected outcomes. However, the scores are not high enough to justify being marked as certain outliers so we would still keep them were they present in the original training set.
Our decision tree ensemble had performed better on our test dataset as compared to the mean benchmark. Specifically, the mean benchmark was beaten by approximately 9% and 13% respectively based on Mean Absolute Error and Mean Squared Error measures. On the other hand, the R Squared (a.k.a. RMSE) value for our ensemble was 0.11. Ideally, one wants the R Squared value to be closer to 1, which means perfect predictive power. However, in domains that involve a human component (as opposed to driven by forces of nature) this is hardly the case. In those type of domains, R Squared values below 0.5 are pretty common and do not necessarily render the model useless. Thoroughbred racing definitely has a big human component given the horses, breeders, trainers and jockeys involved. With that said, we’d like to be able to improve on the 0.11 figure and to get a better edge over the mean benchmark. This would require a systematic approach in looking for different angles to further iterate on and to calibrate the model accordingly.
Some of our readers jokingly commented on our post stating “apparently the machine has some more learning to do…”. Others mentioned that handicapping remains an art not a science. We still believe models can help provide more informed estimates than subjective opinion not grounded in data. However, we did take the advice to heart as there is some truth in both comments. So we went to work in understanding what caused the deviations knowing full well that predicting real life events such as horse races will remain an inherently difficult problem with many variables factoring in. In conducting our post-mortem analysis we’d like to reference Ahmed El Deeb’s post on ways to improve your predictive model, which coincidentally employs an equine analogy summed up in his graphic below that resembles a horse.
So let’s follow Ahmed’s framework and see where we may have fallen short and which areas it may make sense to apply more energy towards next time.
1) More Data
Our final data sample contained over 250 horses that took place in Belmont Stakes in the last 25 years. Despite capturing over two decades of events, this is not a large dataset to draw rock solid conclusions from. Perhaps more importantly, even though there were close calls, we did not have a Triple Crown winner in this period for the model to learn from. So it would have been ideal to go back farther. This shortcoming alone would have pointed out to American Pharaoh having a good shot at the Triple Crown had the model included him in the front pack of the finishers, but instead he was predicted to finish between 3rd and 5th places depending on track condition and odds changes right before the race. How can we explain his impressive victory then?
2) More Features
As we mentioned before, horserace handicapping is a complex undertaking with a long history. Professionals rely on many more signals in making their predictions than we did with our quickly whipped together model. Most notably lacking were:
- Mid-race performances for each horse rather than the race leader alone
- More complete view of past race performances for each horse inclusive of Kentucky Derby prep races.
- Trip related variables summarizing whether the thoroughbred had problems during the race e.g. Bumped at the starting gate, stuck in the middle of the pack, loss of momentum due to jockey error etc.
- Pre-race workout performance
- Preferred racing style e.g. Needs the lead, likes to run off pace, closer etc.
- Beyer speed figures
- Post position etc.
With none of these key pieces of data that professional handicappers swear by, the model had to make the best of what we fed it given our time constraints — and it was still able to generate some edge.
3) Feature Selection
While this makes for a good advice in situations where there are fairly low number of observations against a fairly large number of features, we feel that BigML’s ensemble algorithm that was used to generate our model already did a good job of reducing the noise in the data by weighting a subset of the features much heavier than others. We covered those model favored features in last week’s post so no need to repeat here. If and when we include new independent variables, it would be a good time re-evaluate features.
Feature selection is more of a human guided process, where the practitioner shuffles various independent variables and observes whether or not that had a big impact on the resulting error measures (e.g. Root Mean Squared Error). Regularization on the other hand, achieves the same effect implicitly by the algorithm automatically optimizing information gain (as in the case of decision trees) by way of using the minimal amount of features while at the same time not overly relying on any single one of them. Of the 39 features we fed our model, it favored 10 or so with Odds being the one it relied on the most. Be that as it may, Odds only explained less than half (44%) of the historic deviations in Belmont Stakes relative finishing positions, which we had designated as our target variable. Again, the inclusion of new variables would likely have significant impact on the model’s favored subset of features if we engaged in further iterations. These could very well yield new rules and relationships to consider.
In his blog post, Hasan makes the point that Bagging can do wonders in reducing prediction error variance without any noticeable impact on bias. We are in full agreement with his take and it just so happens that our model was already using the bagging technique (a.k.a. Bootstrap Aggregating) in creating its ensemble of decision trees. He then goes on to explain that Bagging comes at the additional cost of computational intensity. BigML has a solid implementation of its ensembles that takes care of the memory management behind the scenes so the user does not have to deal with the intricacies of hardware optimization. We invite more of our readers take advantage of our linearly scaling ensemble models and lightning fast predictions.
Boosting relies on training several models successively in trying to learn from the errors of the preceding models. It can decrease bias with minimum impact on variance, but can make for a complex implementation scenario as far as the pipeline required to support it. This methodology is not yet supported by BigML and thus would not make a difference in the case of our Belmont Stakes model.
7) Different Class of Models
Finally, different types of algorithms can yield different (and at times complementary) results in minimizing overall prediction errors. On the other hand, a comparative approach seeking to run many algorithms in parallel prolongs the solution development cycle and can be utterly uninterpretable. Some Machine Learning researchers who have specialized knowledge on the underlying mathematics of certain algorithms may have a better chance to explain why algorithm X may have come up with prediction Y, but many ML practitioners and enthusiasts do not have that level of deeper understanding — leaving them with a steep hill to navigate when it comes to dealing with many more algorithms in solving a singe predictive use case. In those instances, it may be best to rely on general knowledge as to which algorithms tend to work best on what type of problems and keep the process as simple as possible before involving more algorithms. For example, decision trees and decision forests are known to be very versatile across various problem spaces. This is one of the main reasons BigML chose to offer them first in addition to their interpretability and scalability advantages.
To recap, this has been a fun learning experience in a brand new domain for us. It was an acceptable first step in the right direction, but serious effort must be undertaken to make it a professional grade system in the future — primarily by incorporating many more features into our model. Given the addition of even more features than the 39 we have had, it may then be worthwhile to do some human-guided feature selection. Subject to data availability and time constraints (we are not a sports betting outfit after all) we may offer some new horse racing insights prior to next year’s Belmont Stakes barring unforeseeable factors like in race injuries etc.
In the meanwhile, we celebrate the mighty American Pharoah who left no doubt as to which horse is the best of his generation. Maybe more importantly, he single-handedly restored our faith in fairy tale endings.