
Fresh off the news of the opening of our new European headquarters, we are excited to announce that BigML has completed the acquisition of the groundbreaking Association Discovery software Magnum Opus. First released fifteen years ago and progressively refined since, Magnum Opus has delivered reliable, actionable insights for retailers, financial institutions and numerous scientific applications, and embodies the state of the art in the field of association discovery. This acquisition is a significant step forward in BigML's vision of building the world's premier cloud-based Machine Learning platform: a carefully curated set of the most effective algorithms and data mining techniques that have already proven their mettle on complex real-world predictive analytics problems.

As part of the acquisition, world-renowned expert on Association Discovery and this year’s ACM SIGKDD Sydney Conference program co-chair Geoff Webb has joined BigML as Technical Advisor. Dr. Webb is a Professor of Information Technology Research in the Faculty of Information Technology at Monash University in Melbourne, where he heads the Centre for Data Science. He was editor-in-chief of the premier data mining journal, Data Mining and Knowledge Discovery, for ten years. He is co-editor of the Springer Encyclopedia of Machine Learning, a member of the advisory board of the Statistical Analysis and Data Mining journal, a member of the editorial board of the Machine Learning journal, and was a foundation member of the editorial board of ACM Transactions on Knowledge Discovery from Data. Dr. Webb is an IEEE Fellow and has received the 2013 IEEE ICDM Service Award and a 2014 Australian Research Council Discovery Outstanding Researcher Award.

Association discovery is one of the most studied tasks in the field of data mining. Stated simply, association mining identifies items that are associated with one another in data. Historically, far more attention has been paid to how to discover associations than to which associations are worth discovering. Having observed the shortcomings of the dominant frequent-pattern paradigm, Dr. Webb developed the alternative top-k associations approach. Magnum Opus employs this unique k-most-interesting association discovery technique, which allows the user to specify what makes an association interesting and how many associations they would like. The available criteria for measuring interest include lift, leverage, strength (also known as confidence), support and coverage. This approach reveals the statistically sound, new and unanticipated core associations in the data, whereas most other association discovery tools find so many spurious associations that it is next to impossible to spot the useful ones amongst the dross. Association mining complements other statistical data mining techniques in a number of ways as it:

• Avoids the problems due to model selection. Most data mining techniques produce a single global model of the data. A problem with this strategy is that there will often be many such models, all of which describe the available data equally well. Association mining can instead find all local models rather than a single global model, empowering the user to select between alternatives on grounds that a typical statistical system may find difficult to quantify.
• Scales very effectively to high-dimensional data.  The standard statistical approach to categorical association analysis (i.e. log-linear analysis) has complexity that is exponential with respect to the number of variables. In contrast, association mining techniques can typically handle many thousands of variables.
• Concentrates on discovering relationships between values rather than variables. This is a non-trivial distinction. If someone is told that there is an association between gender and some medical condition, they are likely to immediately wish to know which gender is positively associated with the condition and which is not. Association mining goes directly to this question of interest. Further, association between values, rather than variables, can be more powerful (discover weaker relationships) when variables have more than two values.
• Strictly controls the risk of making false discoveries.  A serious issue inherent in any attempt to identify associations with classical methods is an extreme risk of false discoveries. These are apparent associations that are in fact only artifacts of the specific sample of data that has been collected. Magnum Opus is the only commercial association discovery software to provide strict statistical control over the risk of making any such errors.
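To make the interest measures listed above concrete, here is a minimal Python sketch (not Magnum Opus code; the basket data is invented for illustration) that computes support, coverage, confidence, lift and leverage for a single candidate rule A → B:

```python
def rule_metrics(transactions, antecedent, consequent):
    """Standard interest measures for the association rule antecedent -> consequent."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)                  # contain A
    n_b = sum(1 for t in transactions if consequent <= t)                  # contain B
    n_ab = sum(1 for t in transactions if (antecedent | consequent) <= t)  # contain both
    support = n_ab / n                         # P(A and B)
    coverage = n_a / n                         # P(A)
    confidence = n_ab / n_a                    # P(B | A), a.k.a. strength
    lift = confidence / (n_b / n)              # P(B | A) / P(B)
    leverage = support - coverage * (n_b / n)  # P(A, B) - P(A) * P(B)
    return {"support": support, "coverage": coverage,
            "confidence": confidence, "lift": lift, "leverage": leverage}

# Toy market-basket data: each transaction is a set of purchased items
baskets = [{"bread", "butter"}, {"bread", "butter", "milk"},
           {"bread"}, {"milk"}, {"bread", "milk"}]
metrics = rule_metrics(baskets, {"bread"}, {"butter"})
# lift > 1 indicates bread buyers are more likely than average to also buy butter
```

A k-most-interesting discovery tool searches the space of such rules and returns the k rules that score best on whichever of these measures the user selects.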

The BigML product team has already started charting the path to a seamless integration of Magnum Opus capabilities into our platform in 2015. This means that, effective immediately, we will NOT be offering new Magnum Opus licenses or downloads. Existing Magnum Opus licensees will be supported as usual. Additional blog posts, a lecture series by Dr. Webb and more information on the integration timeline will follow in the coming weeks, so please stay tuned.

We are very happy to announce that BigML will be establishing its European headquarters in Valencia, Spain. BigML has had a strong European connection since its early days thanks to the founding team members’ origins on the continent (e.g., our co-founder and CEO Francisco J. Martin grew up in Valencia and earned his 5-year degree in Computer Science from Universitat Politècnica de València, and half of our team has been working from Valencia or Barcelona). As such, some of the most respected minds in Artificial Intelligence from Spain have been playing a big part in growing the business and making BigML a leader in today’s fast-growing Machine Learning landscape. In furthering our cause, we intend to hire up to 15 engineers in the remainder of this year and to strengthen the ties between Valencia and Corvallis, Oregon.

Part of the decision to select Valencia was indeed driven by the fact that BigML already had several team members there who graduated from both Universitat de València and Universitat Politècnica de València. However, our interest in Valencia goes beyond mere convenience as we also have organic ties with other Spanish cities.

Spain has a large (13th in the world) and diverse economy, yet it has traditionally been underrated as a technology startup ecosystem despite some success stories of its own, especially in the e-commerce space. Much of this high-tech activity has been concentrated in Barcelona and Madrid, which also attract technical talent from outside Spain to fuel their growth.

In contrast, as a focal point on the beautiful “Costa del Azahar”, Valencia has a metro area population exceeding 1.5M and a very respectable GDP of $52.7 billion. A city that has always thrived on trade, with the biggest port on the western Mediterranean coast, Valencia is known as Spain’s leading automotive industry hub. Ford’s recently announced injection of 2.3 billion Euros into expanding its Valencia operations is a testament to the long-term economic prospects of the city in this critical sector. Valencia’s strong economic growth over the last decade, spurred by tourism and construction alongside the development and expansion of telecommunications and transport, has been accompanied by a transition to a more service-oriented economy, in which 84% of the working population is currently employed.

Consistent with this ongoing transformation, Valencia is taking the initiative to shift its economic mix towards higher value-added goods and services, and in turn to make waves in high-tech by attracting more tech businesses and tech jobs. Major ICT companies have realized this and invested in branch offices in Valencia (e.g., HP and IBM). There are also regional SME technology companies specialized in areas such as video game development, 3D, electronics and nanophotonics. According to Foundum, Spain’s equivalent of AngelList, Valencia ranks a solid 3rd behind Barcelona and Madrid in terms of startup ecosystem membership. Given its economic standing in the country, its young and well-educated technical talent and its ongoing infrastructure investments, one can argue Valencia has a lot more upside left in the tank. Recent years have brought more startup scene momentum, with precedents like Silicon Valley’s Plug-and-Play Center entering Valencia in 2012 as well as local co-working spaces like Workether.
Similarly, VIT Emprende, initiated by the City Council’s InnDEA Foundation, brings together innovative entrepreneurs of Valencia. Its members exchange knowledge, collaborate on R&D, engage in technology transfer, and establish synergies by networking and maintaining contact with prominent entrepreneurs in the field around Valencia. Furthermore, Iker Marcaide (a member of BigML’s Board of Directors) founded the international education payments outfit peerTransfer in Valencia; the company remains one of the hottest startups in Europe, with its engineering team located in the city.

Valencia has 2 public and 2 private universities and more than 100,000 university students, including 15,000 ICT-related undergraduate and postgraduate students. These developments are thus welcome news for the likes of Universitat Politècnica de València (36,000 students and 3,000 professors) looking to supply new technical talent into the ecosystem. In terms of tech events, Valencia is the birthplace of Campus Party, which went on to become one of the largest technology networking events in the world. In addition, there are regular video game development events and a notable upcoming Health Informatics event.

Following these good examples, one of our first calls in Valencia will be to reach out to the developer and data scientist communities and let them know what BigML has to offer. We have recently created a free Meetup group for this purpose and will be hosting a special demo and technical recruitment event on July 9, 2015. Please RSVP asap and reserve your spot. Prior experience with Machine Learning is helpful but not required for membership, so long as you have the curiosity to learn and develop your data science skills. With your help, we are looking forward to adding some AI and Machine Learning flavor as a key ingredient in Valencia’s 21st-century economic development paella. Hope you can join in the fun. Ens veiem a València!
The 2nd annual PAPIs.io conference is getting closer. The August 6th & 7th event in Sydney is scheduled right before the KDD conference, and it holds the distinction of being the world’s only conference dedicated to Predictive APIs and Apps. As we blogged recently, the conference program is chock-full of interesting sessions from distinguished speakers across a wide spectrum of industries. They will be covering real-life examples of predictive applications and the lessons learnt by the developers of such apps. In addition, this year’s conference will include a separate technical track with tutorials on tools and APIs for building predictive apps, plus a dedicated research track. There is something for you to take away from PAPIs.io 2015 whether you are a business lead looking for innovative predictive use cases that can improve your KPIs, a developer looking to deploy highly scalable predictive applications with ease, or a student or academic eager to keep up to date with some of the most coveted machine learning techniques with proven real-world outcomes. In case you are still on the fence, the organizers of PAPIs have graciously decided to give away some tickets to aid with your plans. If interested, you can submit a short form to be in the running for this $195 value.

We are looking forward to meeting the newly minted members of the PAPIs community and sharing memorable moments in Sydney.

In our prior post we talked about clustering and how you can group your data into segments by typing a single sentence using BigMLer, BigML‘s command-line tool. The same can be done for Anomaly Detection, so after reading this post you should be able to find the outliers in your data with a single command.

Anomaly Detection is a technique used to identify the instances in your data that do not conform to the general pattern. Depending on the nature of your problem, pinpointing these instances can make all the difference: they may be errors in your dataset that you would like to exclude before building models, fraudulent transactions in a credit card database, defective products in a manufacturing context, etc. In these cases, separating the wheat from the chaff is paramount for your business or research.

In BigML, we added the anomaly detector to our machine learning kit last year. It is an unsupervised tool, so you don’t need to label your data as normal or abnormal. You just upload it to BigML for the anomaly detector to figure that out for you. For a good example, just check David Gerster’s post about anomaly detection and breast cancer biopsies. Now let’s see how easy it is to build an anomaly detector using BigMLer.

# Finding anomalies in your data

BigMLer can build an anomaly detector from any CSV data file just like this:
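A minimal sketch of the command (the file path and output directory are placeholders):

```shell
# Upload the CSV, creating a source, a dataset and an anomaly detector
bigmler anomaly --train data/my_file.csv --output-dir my_dir
```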

With this simple command, BigMLer will upload your data and build a source from it, create a dataset summarizing all the statistical information per field, and finally build an anomaly detector. The console will show links to these resources as they are created. Their IDs appear in these links and they are also stored for later use in files under the output directory of the command. You can also load your data from remote repositories:
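For instance, assuming a publicly reachable CSV URL (the URL below is illustrative):

```shell
# Build the anomaly detector from a remote data file
bigmler anomaly --train https://static.bigml.com/csv/iris.csv
```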

or even stream it from your standard input
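A sketch of the streaming form (the piped file name is a placeholder):

```shell
# Stream the training data to BigMLer through a pipe
cat data/my_file.csv | bigmler anomaly --train
```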

How do you then extract the anomalous instances in your dataset from the anomaly detector that you have just created? And how can you know how anomalous they are?

The information about the anomalous instances in your dataset is stored in the JSON that describes the anomaly detector itself, which includes a list of the top anomalies. This is the information that our web interface displays: the list of instances with the top anomaly scores. Each instance in the dataset is assigned an anomaly score that ranges from 0 (least anomalous) to 1 (most anomalous).

What do these scores mean? The anomaly detector is built using an iforest (isolation forest), that is, a bunch of overfitted decision trees grown from samples of your data. The anomaly score is obtained from this iforest by comparing the mean depth of these trees with the real depth of the node where the instance under test is classified. The rationale behind this procedure is pretty simple: the easier it is to single out an instance, the more anomalous it is, while average-looking instances that follow the general pattern are hard to tell apart from each other. Thus, the higher the score, the more dissimilar an instance is from the general pattern.
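To sketch the arithmetic, here is the standard isolation-forest scoring formula from the literature (BigML's exact implementation may differ in detail): `avg_depth` is the average depth at which the forest isolates the instance, and `c(n)` normalizes it by the expected depth for a sample of size n.

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Expected path length of an unsuccessful search in a tree built on n points."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of the harmonic number H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_depth, n):
    """Score in (0, 1): shallow isolation depth means a score near 1 (anomalous)."""
    return 2.0 ** (-avg_depth / c(n))

# An instance isolated after very few splits in a 256-point sample scores high,
# while one needing an average number of splits scores around 0.5
```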

An anomaly detector will by default use an iforest of 128 trees and show the top ten anomalies, but these figures can be changed using the --forest-size and --top-n options, respectively. You can also build an anomaly detector from an existing dataset in BigML using its identifier:
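A sketch combining these options (the dataset ID and seed string are placeholders):

```shell
# Build from an existing dataset with a custom forest size, top-n and seed
bigmler anomaly --dataset dataset/53b1f71437204f5ac3000000 \
                --forest-size 50 --top-n 20 --anomaly-seed "my seed"
```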

In this example, an iforest of 50 trees showing the top twenty anomalies will be created. The --anomaly-seed option is added to ensure that the sample’s random picks are deterministic. Now that we have created and tailored our anomaly detector, how can we use it to improve our datasets and models?

# Extracting outliers and anomaly scores

Well, the first use of an anomaly detector is extracting a new dataset that contains only the top anomalous instances:
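A sketch of such a command (the anomaly detector ID is a placeholder):

```shell
# Create a new dataset containing only the top anomalous instances
bigmler anomaly --anomaly anomaly/53b1f71437204f5ac3000001 \
                --anomalies-dataset in
```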

The --anomaly option refers to the existing anomaly detector ID, and --anomalies-dataset is set to in to select only the top anomalies. The opposite case, excluding the top anomalies from the dataset used to create the anomaly detector, is also possible by using --anomalies-dataset out. This can be very useful for getting rid of outliers in your dataset, as models built upon cleansed data will most likely perform better.

Sometimes you may prefer to see the score assigned to every instance in your dataset when deciding where the threshold for outliers should be. The best option then is to create a batch anomaly score for the anomaly detector training dataset that can later be downloaded as a CSV file, as in:
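A sketch of the batch scoring command (the IDs are placeholders, and the --remote/--output options are assumed from BigMLer's usual batch prediction syntax):

```shell
# Remotely score the detector's own training dataset; the batch anomaly
# score can then be downloaded as a CSV file
bigmler anomaly --anomaly anomaly/53b1f71437204f5ac3000001 \
                --test-dataset dataset/53b1f71437204f5ac3000000 \
                --remote --output my_dir/anomaly_scores.csv
```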

or stored as a new dataset with an additional column that contains the anomaly score for each instance:
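A sketch of that variant (IDs are placeholders; the --to-dataset and --no-csv flags are assumed from BigMLer's batch prediction options):

```shell
# Store the batch anomaly score as a new dataset with an extra score column
bigmler anomaly --anomaly anomaly/53b1f71437204f5ac3000001 \
                --test-dataset dataset/53b1f71437204f5ac3000000 \
                --remote --to-dataset --no-csv
```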

The anomaly detector can be used to score datasets other than its own training dataset. To compute anomaly scores locally on any test data file:
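A sketch of the local scoring command (the detector ID and file paths are placeholders):

```shell
# Download the detector and score a local test file instance by instance
bigmler anomaly --anomaly anomaly/53b1f71437204f5ac3000001 \
                --test my_test.csv --output my_dir/anomaly_scores.csv
```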

If you use this command, the anomaly detector will be downloaded to your computer, and each instance in your test file will be scored locally. The resulting scores will be stored in the my_dir/anomaly_scores.csv file. Similarly, if you would like to score an existing dataset remotely, you can use the --test-dataset option to set the dataset ID:
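For example (IDs are placeholders; --remote is assumed for server-side scoring):

```shell
# Score an existing remote dataset instead of a local file
bigmler anomaly --anomaly anomaly/53b1f71437204f5ac3000001 \
                --test-dataset dataset/53b1f71437204f5ac3000002 --remote
```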

As you can see, removing outliers, detecting fraud and improving the quality of your data is just a BigMLer command away. Now it’s your turn to give it a try and join in the fun: we hope you get to bring the power of BigML to your command line before too long!

Building predictive models with machine learning techniques can be very insightful and can provide tremendous business value by optimizing resources in ways that are simply impossible to replicate manually or with more traditional statistical methods. It adds the most value when coupled with good data and domain expertise in interpreting the data and the predictions. Predictive modeling is seldom a one-way street in which the first run through the cycle of data wrangling, feature engineering, model building, evaluation and prediction yields perfectly accurate results. Good results require the practitioner to go through many more iterations with different input data and model configurations in order to minimize the error while steering clear of overfitting and other types of bias. In this regard, the process does involve some “art” that must be appreciated for what it is.

Sometimes predictions may be out of whack despite best tuning efforts given available resources. When this happens, what are the steps one must take in order to improve the results?

We’d like to go over some best practices for doing so while conducting a post-mortem of our recent not-so-great Belmont Stakes predictions. As a reminder, last Friday we wrote about our predictions for the past weekend’s Belmont Stakes based on a decision tree ensemble built on BigML. The predictions called for American Pharoah falling short of securing the Triple Crown by placing somewhere in the middle of the pack at Belmont, likely because the wear and tear of the first two legs of the Triple Crown would catch up with him against fresher horses. That’s exactly what had happened to California Chrome last year: the 13th contender to come close, but fail, at Belmont. Many expert handicappers’ opinions seemed to align with the model, so if nothing else we know that we are in good company.

Fast forward to today: unless you were living under a rock, you probably heard that American Pharoah defied our predictions and broke the 37-year Triple Crown drought with a solid wire-to-wire performance. Frosted, which came in second as the model predicted, gave the champion some competition as they raced into the final stretch, but ultimately could not respond to American Pharoah’s pace as he accelerated away into the books of thoroughbred racing history. The horse our model predicted as the most likely winner (#8 Materiality) was perfectly placed in second position after a mile or so, yet he was eased by his jockey in the final stretch, ending up with a last-place finish. Given his 3rd-best odds at the start of the race, few expected such a dismal performance from him.

We included the results of last weekend’s race in our original dataset and ran an anomaly detection task on the resulting post-race dataset. As seen below, last weekend’s results were assigned fairly high anomaly scores, which point to somewhat unexpected outcomes. However, the scores are not high enough to mark them as certain outliers, so we would still have kept them had they been present in the original training set.

Our decision tree ensemble had performed better on our test dataset than the mean benchmark. Specifically, the mean benchmark was beaten by approximately 9% and 13% based on the Mean Absolute Error and Mean Squared Error measures, respectively. On the other hand, the R-squared (coefficient of determination) value for our ensemble was 0.11. Ideally, one wants the R-squared value to be closer to 1, which means perfect predictive power. However, in domains that involve a human component (as opposed to those driven by forces of nature) this is hardly ever the case. In such domains, R-squared values below 0.5 are pretty common and do not necessarily render the model useless. Thoroughbred racing definitely has a big human component given the horses, breeders, trainers and jockeys involved. With that said, we’d like to improve on the 0.11 figure and get a better edge over the mean benchmark. This requires a systematic approach: looking for different angles to further iterate on and calibrating the model accordingly.
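For reference, the three measures cited above can be computed as in this generic sketch (the example numbers are made up, not our racing data):

```python
def evaluate(actual, predicted):
    """Mean Absolute Error, Mean Squared Error and R-squared for a regression."""
    n = len(actual)
    mean_y = sum(actual) / n
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    ss_res = n * mse                                 # residual sum of squares
    ss_tot = sum((a - mean_y) ** 2 for a in actual)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                       # 1 = perfect, 0 = mean benchmark
    return mae, mse, r2

# A model that always predicts the mean of the target gets r2 = 0,
# which is why beating the mean benchmark is the minimal bar to clear
```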

Some of our readers jokingly commented on our post that “apparently the machine has some more learning to do…”. Others mentioned that handicapping remains an art, not a science. We still believe models can provide more informed estimates than subjective opinion not grounded in data. However, we did take the advice to heart, as there is some truth in both comments. So we went to work on understanding what caused the deviations, knowing full well that predicting real-life events such as horse races will remain an inherently difficult problem with many variables factoring in. In conducting our post-mortem analysis, we’d like to reference Ahmed El Deeb’s post on ways to improve your predictive model, which coincidentally employs an equine analogy summed up in a graphic that resembles a horse.

So let’s follow Ahmed’s framework and see where we may have fallen short and which areas it may make sense to apply more energy towards next time.

## 1) More Data

Our final data sample contained over 250 horses that took part in the Belmont Stakes over the last 25 years. Despite capturing over two decades of events, this is not a large dataset from which to draw rock-solid conclusions. Perhaps more importantly, even though there were close calls, we did not have a Triple Crown winner in this period for the model to learn from, so it would have been ideal to go back further. Had the model been trained on data including actual Triple Crown winners, it might have placed American Pharoah in the front pack of finishers and pointed to him having a good shot at the Triple Crown; instead he was predicted to finish between 3rd and 5th place, depending on track condition and odds changes right before the race. How can we explain his impressive victory then?

## 2) More Features

As we mentioned before, horserace handicapping is a complex undertaking with a long history. Professionals rely on many more signals in making their predictions than we did with our quickly whipped-together model. Most notably lacking were:

• Mid-race performances for each horse rather than the race leader alone
• More complete view of past race performances for each horse inclusive of Kentucky Derby prep races.
• Trip-related variables summarizing whether the thoroughbred had problems during the race, e.g. bumped at the starting gate, stuck in the middle of the pack, loss of momentum due to jockey error, etc.
• Pre-race workout performance
• Preferred racing style, e.g. needs the lead, likes to run off the pace, closer, etc.
• Beyer speed figures
• Post position etc.

With none of these key pieces of data that professional handicappers swear by, the model had to make the best of what we fed it given our time constraints — and it was still able to generate some edge.

## 3) Feature Selection

While this makes for good advice in situations where there is a fairly small number of observations against a fairly large number of features, we feel that the BigML ensemble algorithm used to generate our model already did a good job of reducing the noise in the data by weighting a subset of the features much more heavily than others. We covered those model-favored features in last week’s post, so no need to repeat them here. If and when we include new independent variables, it would be a good time to re-evaluate the features.

## 4) Regularization

Feature selection is more of a human-guided process, in which the practitioner shuffles various independent variables and observes whether that has a big impact on the resulting error measures (e.g. Root Mean Squared Error). Regularization, on the other hand, achieves a similar effect implicitly: the algorithm optimizes its objective (information gain, in the case of decision trees) while using a minimal number of features and not overly relying on any single one of them. Of the 39 features we fed our model, it favored 10 or so, with Odds being the one it relied on the most. Be that as it may, Odds explained less than half (44%) of the historic deviations in Belmont Stakes relative finishing positions, which we had designated as our target variable. Again, the inclusion of new variables would likely have a significant impact on the model’s favored subset of features if we engaged in further iterations. These could very well yield new rules and relationships to consider.

## 5) Bagging

In his blog post, Ahmed makes the point that Bagging can do wonders in reducing prediction error variance without any noticeable impact on bias. We are in full agreement with his take, and it just so happens that our model was already using the bagging technique (a.k.a. Bootstrap Aggregating) in creating its ensemble of decision trees. He then goes on to explain that Bagging comes at the additional cost of computational intensity. BigML has a solid implementation of ensembles that takes care of the memory management behind the scenes, so the user does not have to deal with the intricacies of hardware optimization. We invite more of our readers to take advantage of our linearly scaling ensemble models and lightning-fast predictions.

## 6) Boosting

Boosting relies on training several models successively, each trying to learn from the errors of the preceding models. It can decrease bias with minimal impact on variance, but can make for a complex implementation as far as the pipeline required to support it. This methodology is not yet supported by BigML and thus would not make a difference in the case of our Belmont Stakes model.

## 7) Different Class of Models

Finally, different types of algorithms can yield different (and at times complementary) results in minimizing overall prediction errors. On the other hand, a comparative approach that runs many algorithms in parallel prolongs the solution development cycle and can be utterly uninterpretable. Some Machine Learning researchers with specialized knowledge of the underlying mathematics of certain algorithms may have a better chance of explaining why algorithm X came up with prediction Y, but many ML practitioners and enthusiasts do not have that level of understanding, leaving them with a steep hill to climb when it comes to juggling many more algorithms to solve a single predictive use case. In those instances, it may be best to rely on general knowledge of which algorithms tend to work best on which types of problems and to keep the process as simple as possible before involving more algorithms. For example, decision trees and decision forests are known to be very versatile across various problem spaces. This is one of the main reasons BigML chose to offer them first, in addition to their interpretability and scalability advantages.

## SUMMARY

To recap, this has been a fun learning experience in a brand new domain for us. It was an acceptable first step in the right direction, but serious effort must be undertaken to make it a professional-grade system in the future, primarily by incorporating many more features into our model. With the addition of even more features than the 39 we already had, it may then be worthwhile to do some human-guided feature selection. Subject to data availability and time constraints (we are not a sports betting outfit, after all), we may offer some new horse racing insights prior to next year’s Belmont Stakes, barring unforeseeable factors like in-race injuries.

In the meanwhile, we celebrate the mighty American Pharoah who left no doubt as to which horse is the best of his generation. Maybe more importantly, he single-handedly restored our faith in fairy tale endings.

(NOTE: We suggest that you also read our follow up post including the post-mortem analysis of the results from the machine learning model described here.)

This Saturday Americans will witness the 147th Belmont Stakes, thoroughbred racing’s 3rd and final leg of the highly coveted Triple Crown. If you follow horse racing, you will know that no horse since Affirmed in 1978 has been able to claim this elusive prize. In fact, Affirmed was only the 11th Triple Crown winner in the long history of American thoroughbred racing. Since 1978 there have been 13 close calls, where a horse won both the Kentucky Derby and the Preakness Stakes yet failed to repeat the same success at Belmont Park (in Elmont, NY), where the famed Secretariat showcased a historic performance in 1973 and broke the world record for the distance, a record that still stands today.

As was the case during last year’s Triple Crown campaign (one featuring a rare California-bred colt aptly named California Chrome), 2015 has also presented lovers of the “Sport of Kings” with what seems a worthy contender in American Pharoah. American Pharoah won the prestigious Kentucky Derby by a length in a hard-fought battle ahead of Firing Line. He then carried his form to Pimlico to also claim the Preakness Stakes on a rain-drenched track pretty much wire-to-wire ahead of serious competition, this time by a comfortable 7 lengths. This has stoked the sports media machine and here we are again, eagerly awaiting to crown a new super horse after some 37 years.

But the question remains: can American Pharoah really achieve the feat? If you ask FiveThirtyEight’s Benjamin Morris or Wired Magazine, it remains a tall order for this year’s contender to buck the recent trend and master the “Test of Champions” on Saturday. Many experts tie these near misses to several factors, including:

• the longer race distance of the Belmont Stakes (12 furlongs = 1.5 miles ≈ 2400 m) as compared to the Derby (10 furlongs = 1.25 miles ≈ 2000 m) and the Preakness (9.5 furlongs = 1.19 miles ≈ 1900 m), which demands more of a distance horse, in contrast to the more sprint-like first two legs of the Triple Crown
• the grind of having to run three stakes races in five weeks, often against a Belmont field that hasn’t gone through the same gauntlet and is thus fresher.

Despite these factors, it seems odds-makers still pin high hopes on Triple Crown contenders, frequently listing them as even-money bets. American Pharoah is no exception at 3/5 morning line odds, with the nearest horse (Frosted) listed at 5/1, more than 8 times the payoff.

Though we have taken note, we were not fully satisfied with the oft-repeated assumptions and subjective opinions with no grounding in data. Therefore, we thought it would be a fun exercise to construct a predictive model using BigML to predict the outcome of this weekend’s race among the 8 entrants. As we go into the details of what we found, please be forewarned that we are NOT professional handicappers. While we hope you enjoy the informational aspect of this post, we urge you NOT to bet your life savings on any of the predictions below. (If you are not particularly interested in how this analysis was conducted, you may prefer to jump to the conclusion section; otherwise read on!)

## THE HANDICAPPING DATA

Thoroughbred racing records are pretty well kept and go back a long time. However, the format has evolved over time, and because of its non-digital origins, online services that offer such data in a reliable way are somewhat limited.  For this analysis, we used publicly available race result sheets for all Triple Crown races (Kentucky Derby, Preakness Stakes, Belmont Stakes) from the last 25 years.

So, what data do we have after all?  Below are the horse and race related variables we used to build a predictive model:

• RACE YEAR
• RACE NAME
• HORSE NAME
• HORSE PEDIGREE
• JOCKEY
• TRAINER
• FINISH RANK
• FINISH TIME
• TRACK CONDITION (e.g., Fast, Good, Sloppy)
• SCRATCHED?
• 1st SPLIT (The time of the leading horse after 2 furlongs or 400m.)
• 2nd SPLIT (The time of the leading horse after 4 furlongs or 800m.)
• 3rd SPLIT (The time of the leading horse after 6 furlongs or 1200m.)
• 4th SPLIT (The time of the leading horse after 8 furlongs or 1600m.)
• ODDS (Morning line odds for each horse to run the Belmont Stakes)
• LENGTHS BEHIND (# of horse lengths this horse was behind as compared to the race winner.)
• % of RACE RECORD TIME
• AVERAGE SPEED
• POINTS (Proprietary measure of the horse’s race performance as compared to the best ever recorded time for that distance and track.)

Horse race handicapping is pretty much a science these days, and there are a number of businesses set up on this premise. Admittedly, there are many more pieces of data available that we chose to leave out due to time constraints.  One could argue that a number of those (e.g., equipment, post position, sire) could potentially influence the race predictions. However, the list above does capture a good level of detail on the previous Triple Crown performances by these elite horses. As such, these variables should intuitively be linked to the results we have observed over the last 25 years.

All in all, we looked at 729 distinct horses that raced in at least one of the 74 Triple Crown races, and filtered them down to our final dataset of 256 horses that had participated in the Belmont Stakes since 1991.

## MODELING & RESULTS

BigML’s decision trees can unearth instant insights hidden in your data by analyzing numerous input variables as they relate to your target variable.  In this case, we are trying to predict LENGTHS BEHIND for the 2015 Belmont Stakes, so that becomes our “Target Variable”.  The input or independent variables are all the others listed above, inclusive of the 2015 Kentucky Derby and Preakness Stakes that took place in the last 5 weeks.  Once the data munging process was over (in this case we used pandas), we uploaded our “Machine Learning ready” data file (i.e., data in tabular format) to BigML and started analyzing it using BigML’s built-in Anomaly Detector, Clustering and Decision Tree algorithms.  To compare and contrast different model configurations, we performed several iterations of the model building process, evaluating model performance at the end of each iteration and either keeping or dumping the original model based on the results.  In most cases, the whole process took less than a minute or two, and just like that we had our models correlating 25 years’ worth of racing data with this Saturday’s expected outcomes.  We finally settled on a Decision Tree ensemble (with bagging) containing 250 decision trees.  This particular model handily beat the random-picks benchmark, and also comfortably beat (by over 12%) a more educated benchmark using the historic average lengths-behind figure for the Belmont Stakes.  All of this benchmarking was done without a line of code, relying instead on BigML’s powerful model evaluation feature.

The nifty Model Summary Report feature of our decision tree revealed the following as the most influential input variables in predicting Belmont Stakes performance, with the predictive power expressed as a percentage in parentheses after each:

1. ODDS — BELMONT (22%)
2. JOCKEY (22%)
3. TRAINER (13%)
4. HORSE PEDIGREE RELATED (10%)
5. RACE YEAR (4%)
6. LENGTHS BEHIND — KENTUCKY DERBY (4%)
7. HORSE PEDIGREE RELATED (4%)
8. AVERAGE SPEED — KENTUCKY DERBY (3%)
9. POINTS — KENTUCKY DERBY (3%)
10. % of RACE RECORD TIME — PREAKNESS (3%)

## ANALYSIS & CONCLUSION

Let’s start by stating that the interpretation here is not set in stone, but rather based on a general understanding of the recent trends in the sport of American thoroughbred racing.  Furthermore, none of this is meant to entice you, the reader, into gambling on this weekend’s race, and none of it should be treated as professional handicapping advice by any measure.

With that said, it is somewhat expected, and reassuring, that the betting odds are the foremost predictor for this weekend’s race.  If they had not shown up at all, we’d be both surprised and worried about racing fans relying on the “collective intelligence” of experts and fellow fans.  However, the model tells us that the odds are not all they are cracked up to be (at least for the Belmont Stakes).  After all, the odds explain far less than half the variation in results and thus cannot be exclusively trusted in this case.

Next up are the horse’s Jockey and Trainer.  These point to certain experienced teams having an edge at Belmont, either because they are based there or because they have a history of success and/or a greater ability to attract great horses.  In a way, this reminds us of the recruiting halo effect that programs and coaches like John Calipari of University of Kentucky NCAA basketball fame have achieved in other competitive arenas.  In other words, success begets more success.

There is also a large chunk of the handicapping world that puts a lot of emphasis on a horse’s pedigree.  There are very detailed accounts of a horse’s lineage going back 9 or 10 generations, compiled just to be able to tell whether a given horse carries the genes to run longer distances or is better suited for sprint racing.  Similarly, experts claim that these signals can determine how well the thoroughbred will perform on turf vs. dirt surfaces.  It is a bit of an art mixed in with some science, so we won’t go into details here.  Although these factors did not weigh in as heavily as we expected, they do have a say in the predictions, which hopefully somewhat justifies the sky-high stud fees some thoroughbreds are able to secure for their breeders.

At first, it was surprising to us that the Race Year also mattered.  We suspect this may have something to do with the breed’s evolution, driven by breeding that emphasizes speed over stamina.  Indeed, it appears that we have gotten to a point where most first-rate horses never run a race as long as the Belmont Stakes in their entire racing careers.  This suggests earlier years were more likely to present us with the so-called “Stayers” that pack the kind of stamina an ideal thoroughbred should possess at the starting gates of the Belmont Stakes.

The remaining variables center around the horse’s Kentucky Derby performance.  Many believe that asking the modern thoroughbred to race against top competition 3 times in 5 weeks is too much, perhaps even dangerous.  That is why, in recent years, skipping the Preakness to have a better shot at the Belmont has become a very popular strategy for the trainers and owners of these man-made modern speed demons.  In this regard, our model may be pointing to the efficacy of this approach, with a number of performance measures from the first leg of the Triple Crown making the list.  In contrast, there is only a single Preakness-related measure, further underscoring the lesser “middle child” status of Maryland’s Preakness.

## PREDICTING THE 2015 BELMONT STAKES

Given the rough model construct we just covered, which horse(s) seem to have the best chance to reach the finish line first and make history?  Will American Pharoah become a living legend, or will some no-name horse spoil the day for horse racing fans once again?

Without further ado, given the field of 8 entrants, the model predicts the finishing positions as follows:

1. Win: MATERIALITY (6-1)
2. Place: FROSTED (5-1)
3. Show: MADEFROMLUCKY (12-1)
4. KEEN ICE (20-1)
5. AMERICAN PHAROAH (3-5)
6. FRAMMENTO (30-1)
7. TALE OF VERVE (15-1)
8. MUBTAAHIJ (10-1)

As the prediction goes, American Pharoah will end up being one more false hope as the next Triple Crown champion, finishing somewhere in the middle of the table and disappointing his fans despite a great overall Triple Crown campaign.  Some experts claim that, given this field, American Pharoah is likely to jump ahead early as he did in the Preakness and dictate the pace from then on, in the hopes that he can steal the race.  In that scenario, our model is predicting that first Materiality and Frosted, and later in the stretch Madefromlucky and Keen Ice, can close in on Pharoah’s tired legs, having enjoyed a much longer and less stressful preparation before this final “Test of Champions”.  In addition, when we look at the confidence intervals accompanying this most likely scenario, they mostly point to American Pharoah finishing anywhere from 3rd to 5th, never seriously threatening the favored Materiality.

We hope this inspires you to grab a cool drink and catch the Belmont Stakes live, on TV, or online this weekend.  Horse races can be exhilarating events for a wide audience, from youngsters to the retired.  These beautiful animals are true athletes in the sense that they can move their 1,000-pound frames at speeds approaching 40 miles per hour (64 km/h) across a dirt track, all the while dealing with equally talented equine rivals looking them in the eye stride by stride.  We sure hope thoroughbred racing in America recaptures its former glory and becomes cherished beyond the Triple Crown races.  Who knows?  Perhaps Machine Learning gets to play a small part in that!

As we have blogged about before, PAPIs.io 2015 is taking place in Sydney this year on the 6th and 7th of August, right before the KDD conference. As a founding member and sponsor, BigML is looking forward to this year’s event. PAPIs.io is unique in that it has been able to bring together data scientists, developers and practitioners, from large tech companies to leading startups and prominent educational institutions around the globe, to discuss all aspects of Predictive APIs and Predictive Apps. The very hands-on and interactive agenda is centered on addressing the challenges of building real-world predictive applications based on a growing number of Predictive APIs that are making Machine Learning more and more accessible to developers. As a bonus, this year’s event will also introduce a technical track. Our enthusiasm is only elevated further after seeing today’s preliminary conference program announcement, which exhibits great diversity in terms of the speakers and the topics to be covered.

Here are some preliminary program highlights:

• Big Wins with Small Data: PredictionIO in Ecommerce (David Jones, Resolve Digital)
There’s a lot of noise about big data and cutting-edge algorithm optimisations. Returning to the basics, this presentation shows you might not need as much data as you think to get real-world benefits. Learn about machine learning in ecommerce, PredictionIO and how we used off-the-shelf, well-implemented algorithms to get a 71% increase in revenue with an online wine retailer.
• Open Sourcing a Predictive API (Alex Housley, Seldon)
After operating for three years as a “black box” predictive API, Seldon recently open-sourced its entire predictive stack. Alex will talk about Seldon’s journey from closed to open: the challenges and pitfalls, architectural considerations, case studies, changes to business models, and new opportunities for partnership across the full stack – between both open and closed technology providers.
• Deploying Predictive Models with the Actor Framework (Brian Gawalt, Upwork)
Build a better, faster, more efficient predictive API with the Actor model of programming. Latency, logging, full utilization are all easily handled with this framework. Upwork (formerly Elance-oDesk) freelancer availability model — anticipating who’s looking for work right now — is now a real-time service, without costly or complicated build-out of our stack or our datacenter, thanks to the Actor model.
• Protocols and Structures for Inference: A RESTful API for Machine Learning (James Montgomery, University of Tasmania)
Diversity in machine learning APIs works against realizing machine learning’s full potential by making it difficult to compose multiple algorithms. This paper introduces the Protocols and Structures for Inference (PSI) service architecture and specification for presenting learning algorithms and data as RESTful web resources that are accessible via a common but flexible and extensible interface. This is joint work with Dr. Mark Reid of the Australian National University and NICTA and Dr. Barry Drake of Canon Information Systems Research Australia.
• Large scale predictive analytics for anomaly detection (Nicolas Hohn, Guavus Inc.)
The focus will be on anomaly detection for network data streams, where the aim is to predict a distribution of future values and flag unlikely situations. Challenges both in terms of data science and engineering will be discussed, such as the accuracy, robustness and scalability of the prediction API. An example of a production deployment will also be discussed.
• AzureML: Anatomy of a machine learning service (Sharat Chikkerur, Microsoft)
Describing AzureML: a web service enabling software developers and data scientists to build predictive applications. The talk will outline the design principles, system design and lessons learned in building such a system.
• Building Machine Learning Models for Predictive Maintenance Applications (Yan Zhang, Microsoft)
This talk introduces the landscape and challenges of predictive maintenance applications in the industry, illustrates how to formulate (data labeling and feature engineering) the problem with three machine learning models (regression, binary classification, multi-class classification), and showcases how the models can be conveniently trained and compared with different algorithms in Azure ML.

There will also be a panel discussion moderated by Mark Reid of ANU/NICTA.

If you are also attending KDD and want to kill two birds with one stone while you will have travelled all the way Down Under, there is no better alternative than attending PAPIs.io 2015 and rubbing shoulders with some of the most notable movers and shakers in the Machine Learning world in a more cozy and comfortable setting. You can follow subsequent announcements on PAPIs.io on Twitter. Hope to see you all in Sydney!

It’s been a while since we last wrote about the latest changes in our command line tool, BigMLer. In the meantime, two unsupervised learning approaches have been added to the BigML toolset: Clusters and Anomaly Detectors. Clusters are useful to group together instances that are similar to each other and dissimilar to those in other groups, according to their features. Anomaly Detectors, by contrast, try to reveal which instances deviate from the global pattern. Clusters and anomaly detectors can be used in market segmentation and fraud detection, respectively. Unlike trees and ensembles, they don’t need your training data to contain a field that you must previously label. Rather, they work from scratch, which is why they’re called unsupervised models.

In this post, we’ll see how easily you can build a cluster from your data, and a forthcoming post will do the same for anomaly detectors. Using the command line tool BigMLer, these machine learning techniques will easily pop out of their shells, and you will be able to use them either from the cloud or from your own computer, no latencies involved.

# Clustering your data

There are many scenarios where you can get new insights from clustering your data. Customer segmentation might be the most popular one. Being able to identify which instances share similar characteristics has always been a coveted feature, whether it be for your marketing campaigns, to identify groups with loan default risk, or for health diagnosis. Clusters do this job by defining a distance between your data instances based on the values of their features. The distance must ensure that similar instances are closer than different ones. In BigML, all kinds of features contribute to this distance: numeric features, obviously, but categorical and text features are taken into account as well. Using this distance, the instances in your dataset that are closer together and more distant from the rest of the points are grouped into a cluster. The central points of these clusters, called centroids, give us the average features of the group. The user can optionally label each cluster with a descriptive name based on the features of the centroid. These labels can be used as a cluster-based class name, which can be thought of as assigning a label to each instance of the cluster dataset. Later, you could build a new global dataset with a new column holding the label of the cluster each instance is assigned to, and in this sense, the label would be a kind of category. Then, when new data appears, you can tell which cluster it belongs to by checking which centroid it is closest to.
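That last step, assigning new data to the closest centroid, can be sketched in plain Python (a toy Euclidean version; BigML’s actual distance also folds in categorical and text features):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two numeric feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_centroid(point, centroids):
    """Return the label of the centroid closest to `point`."""
    return min(centroids, key=lambda label: euclidean(point, centroids[label]))

# Hypothetical centroids from a two-feature customer-segmentation cluster
centroids = {
    "Cluster 1": [1.0, 2.0],
    "Cluster 2": [8.0, 9.0],
}
print(nearest_centroid([2.0, 3.0], centroids))  # Cluster 1
```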

So, how can BigMLer help cluster your data? Just type:
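A minimal invocation, sketched from BigMLer’s documented cluster subcommand (the file name is a placeholder):

```shell
bigmler cluster --train my_data.csv
```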

and the process begins: first your data is uploaded to BigML‘s servers, and BigML infers the fields that your source contains, their types, the missing values these fields might have, and builds a source object. This source is then summarized into a dataset, where all statistical information is computed. After that, the dataset is used to look for the clusters in your data and a cluster object is created. The command’s output shows the steps of the process and the IDs for the created resources. Those resources are also stored in files in the output directory.

The default clustering algorithm is G-means, but what does that mean? We talked about it extensively in a prior post, but basically the underlying algorithm (k-means) groups your data around a small number of instances (the number is usually denoted by k) by computing the distance of the rest of the points in the dataset to those instances. Each point in the dataset is grouped around the closest of the k instances selected in the beginning. This process results in k clusters, and their centroids (the central point of each group) are used as the starting set of instances for a new iteration of the same algorithm. This carries on until there is no more improvement in the separation among the chosen clusters. G-means, a more advanced algorithm, eliminates the need for the user to divine what the value of k must be: it compares the results found using different values of k and chooses the one that yields the most Gaussian-shaped clusters. That’s the algorithm used in BigMLer if no k value is specified, but you can always set the k of your choice in the command if you please:
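A plausible form of that command, assuming BigMLer’s documented --k option (the file name is a placeholder):

```shell
bigmler cluster --train my_data.csv --k 3
```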

In this case, the created cluster will have exactly 3 centroids, while in the first case the final cluster count is determined automatically. The process of picking the initial instances as seeds is random, so you must specify a seed using the --cluster-seed modifier in your command if you want to ensure deterministic results. To run the clustering from your created dataset with a seed, use the dataset id as the starting point:
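A sketch of such a command, assuming BigMLer’s documented --dataset, --cluster-seed and --test options (the dataset id and seed below are placeholders):

```shell
bigmler cluster --dataset dataset/53b1f71437203f5ac30004ed \
                --cluster-seed my_seed --test my_new_instances
```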

This command will create a new cluster from the data enclosed in the existing dataset, reproducible in a deterministic way. It will also generate predictions for the new data in the my_new_instances file, finding which centroid each instance is closest to. But then, how do you know which instances are grouped together?

# Profiling your clustered data

To know more about the characteristics of the obtained set of clusters, you can create a dataset for each of them with --cluster-datasets. Refer to your recently created cluster using its id:
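A sketch of that command (the cluster id is a placeholder; passing an empty --cluster-datasets value to request every centroid’s dataset is our assumption, so check bigmler’s help for the exact form):

```shell
bigmler cluster --cluster cluster/53b1f71437203f5ac30004f0 --cluster-datasets ""
```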

and you’ll obtain a new dataset per centroid. Each of them will contain the instances of your original dataset associated with that centroid. If you are not interested in all of the groups, you can choose which ones to generate by using the names of the associated centroids.
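For instance, a command along these lines (the cluster id is a placeholder):

```shell
bigmler cluster --cluster cluster/53b1f71437203f5ac30004f0 \
                --cluster-datasets "Cluster 1,Cluster 2"
```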

will only generate the datasets associated with the clusters labeled Cluster 1 and Cluster 2. The generated datasets will show the statistical information associated with the instances that fall into each cluster-defined class.

Can you learn more from your cluster? Well, yes! You can also see which features are most helpful in separating the clustered datasets, so that you can infer the attributes that “define” your clusters. To do that, create a model for each clustered dataset that you are interested in using --cluster-models:
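A sketch of that command (the cluster id is a placeholder):

```shell
bigmler cluster --cluster cluster/53b1f71437203f5ac30004f0 \
                --cluster-models "Cluster 1,Cluster 2"
```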

How does BigMLer build a tree model for a particular cluster, let’s say Cluster 1? By adding a new field to every instance in your dataset indicating whether or not that instance belongs to the cluster. This brings an additional advantage: each model has a related field importance histogram that shows the importance each field has in the classification. Knowing which fields are important to classify new data as related to a centroid can give you insight into what features define the cluster.

And these are basically the magic sentences you must know to identify the groups of instances your data contains, profile each group, and find the features that make them alike and different from the others. Similarly, in a forthcoming post, we’ll talk about another set of commands that will help you pinpoint the outliers in your data using Anomaly Detectors. Stay tuned!

Following up our post announcing the availability of BigMLKit, we are now going to introduce the BigMLKit API and present a sample app that can be used as a playground to experiment with BigMLKit.

As already mentioned in the previous post, BigMLKit brings the capability of “one-click-to-predict” to iOS and OS X developers. This is accomplished through the notion of task, which is basically a sequence of steps. Each step has traditionally required a certain amount of work such as preparing the data, calling BigML’s REST API, waiting for the operation to complete, collecting the right data to prepare the  next step and so on. BigMLKit takes care of all of this “glue logic” for you in a streamlined manner, while also providing an abstracted way to interact with BigML and build complex tasks on top of our platform.

# BigMLKit Classes

BigMLKit’s classes can be grouped into three categories:

• Foundation
• Tasks
• Configuration.

## Foundation

Everything in BigML is associated with resources, such as datasets, clusters, sources, etc. A resource’s identity in BigMLKit is defined through a name and a UUID (universally unique identifier), which are encapsulated in the BMLResourceProtocol protocol. A concrete implementation of BMLResourceProtocol will additionally provide more properties and/or methods according to the specific application that is being built. If you want to be able to filter available resources locally, you will possibly need to use Core Data and define a model whose entities contain the attributes you want to filter on. On the other hand, if you only want to support remote behavior for your entities, then their UUID is enough information for the REST API to handle them.

BigMLKit defines three basic types to build resource UUIDs:

• BMLResourceType
• BMLResourceUuid
• BMLResourceFullUuid.

The three types are typedef’ed NSStrings. According to how the BigML REST API identifies resources, a BMLResourceFullUuid is made up of a BMLResourceType and a BMLResourceUuid joined by a slash, e.g. “model/de305d54-75b4-431b-adb2-eb6b9e546014”. The class BMLResourceUtils defines convenience methods to extract a BMLResourceType or BMLResourceUuid from a BMLResourceFullUuid.

## Tasks and Workflows

Tasks and Workflows are what makes BigMLKit useful.

A workflow is a collection of BigML operations. It can be as simple as a single call to BigML’s REST API or it can include multiple steps, e.g., when creating a sequence of BigML resources starting with a dataset and ending with a prediction.

BigMLKit provides several classes to define and use tasks and workflows, as detailed below.

BMLWorkflow is an abstract base class that is used to build composite workflows combining lower-level workflows together. The simplest form of BMLWorkflow is a BMLWorkflowTask, which corresponds to a single-step workflow. BigMLKit provides several BMLWorkflowTask-derived classes that represent basic operations that the BigML REST API allows you to execute:

• BMLWorkflowTaskCreateSource
• BMLWorkflowTaskCreateDataset
• BMLWorkflowTaskCreateModel
• BMLWorkflowTaskCreateCluster
• BMLWorkflowTaskCreatePrediction.

BMLWorkflowTaskSequence is a higher-level workflow that is able to execute a sequence of workflows.

BMLWorkflowTaskContext provides the context for task execution, where input, output, and intermediate results can be stored. The context also acts as a monitor for remote operations: it will poll BigML API to check a resource state progress and handle it according to its semantics. The storage mechanism is exposed through an NSMutableDictionary. The association key/value is an implementation detail of the workflows that use the context to carry through their operation. A context also hosts a connector object, which is responsible for handling the communication with BigML through its API interface. Currently, the connector object is an instance of ML4iOS.

## Configuration

BigML’s REST API offers a lot of options to configure the available machine learning algorithms. For each resource type, BigMLKit provides a plist file that describes which options are available and what their type is, so a program can easily handle them, e.g., to display a list of available options or allowing users to set values for them. There are three main classes at play here:

• BMLWorkflowTaskConfiguration, which allows for collecting all options in a common place and accessing them in an organized way; e.g., by getting all option definitions, or their values, etc.
• BMLWorkflowTaskConfigurationOption, which is the atomic option. This basically provides a way to set whether the options should be effectively used in a given execution of the workflow, and to retrieve the current option value.
• BMLWorkflowConfigurator, which is a container for all the BMLWorkflowTaskConfiguration instances associated with a user session. A configurator can be shared across multiple executions of the same workflow, or even different workflows.

In many cases, it is enough to use BigML’s default values for configuration options so there is no need to tweak them. The topic of BigMLKit configuration will be explored in a further post.

# Running a Workflow

Running a workflow requires two preliminary steps:

• creating the workflow
• creating and setting up the context for its execution.

As mentioned above, at the moment BigMLKit provides single-operation workflows and a task sequence workflow, but you can easily implement any kind of specific workflow that you might need. Creating and setting up a context is workflow specific. You can see an example in the sample app introduced below.

Once the preliminary steps are done, you can run a workflow by calling the runInContext:completionBlock: method. The completion block will be called at the end of the workflow execution with an NSError argument in case an error occurred.

# BigMLKitSampleApp

BigMLKitSampleApp is a simple iOS app that shows how you can integrate BigMLKit into your apps. The sample app is available on GitHub and allows you to create a prediction from a data file, which is used to train a model. All the steps from datasource creation up to model creation are executed remotely on BigML servers, while the prediction step is executed locally based on the calculated model and does NOT require access to BigML services. Three sample source files are provided: iris.csv, diab.csv, and wines.csv.

To keep the sample app code simple enough, it defaults to creating a decision tree, although BigMLKit also provides support for training clusters, and, in the near future, anomaly detectors and other Machine Learning algorithms provided by BigML. Furthermore, the app uses static data files to train the models, but in a real application you could as easily read the data to train your models from iOS HealthKit and/or ResearchKit, or you could use HealthKit/ResearchKit data to make predictions based on an existing reference model.

To understand how BigMLKit is integrated into the app, you can inspect the BMLPredictionViewController class, and in particular its two methods called setupFromModel: and startWorkflow.

The setupFromModel: method is called whenever the app delegate detects that the user tapped on any of the three available source files. On the other hand, startWorkflow is responsible for enabling the UI to provide the user with some visual feedback about the workflow being executed. It also handles the display of workflow results. In greater detail:

• When a tap on a resource file is detected, the app delegate stores the current source file in the shared view model, then calls setupFromModel:.
• setupFromModel: creates a new workflow, properly initializes a context, and finally calls startWorkflow.
• startWorkflow will update the UI and then it will start the workflow and provide a callback.
• The callback, if no errors are found, will use the workflow results (available in the workflow context) to build a prediction form, so the user can try different combinations of input arguments to make new predictions.

This is all that is required! As you can see, BigMLKit makes it really straightforward to run a simple workflow and use the power of machine learning in your apps, and we hope that you will find great applications for our technology and create extensions to BigMLKit that will make it even more convenient.

If you have any questions on how to get started with BigMLKit, feel free to contact us at info@bigml.com.

This is the first post in a series of statistics primers to inaugurate the arrival of BigML’s new advanced statistics feature. Depending on your background, the theory portion of this post may cover ideas you already understand. If that’s the case, go ahead and skip to how to access these stats in BigML. Today’s topic is Benford’s law, which can be applied to detect irregularities in numeric data. It applies to collections of numeric values that satisfy the following criteria:

1. They have a wide distribution, spanning several orders of magnitude.
2. They are generated by some “natural” process, rather than, say, arbitrarily chosen by a human.

Given that those conditions are met, Benford’s law states that the first significant digits (FSDs) will be distributed in a very specific pattern. In other words, we can take each of the digits from 1 to 9 and look at the relative proportion with which they appear in the first significant position among values in the data (e.g. the FSDs for the values 122.4, -54.01, and 0.0048 are 1, 5, and 4 respectively). If these proportions match the ones predicted by Benford’s law, then we can be assured that our data satisfy the two criteria. Otherwise, the data may have been tampered with, or may simply cover too narrow a range for Benford’s law to apply. If we denote by p_d the proportion of the data in which the digit d is in the first significant position, Benford’s law states that these proportions will take on the following values:

$p_d = \log_{10}(d+1) - \log_{10}(d)$
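The expected proportions are easy to compute directly from this formula:

```python
import math

# Benford's law FSD proportions p_d = log10(d+1) - log10(d) for d = 1..9
benford = [math.log10(d + 1) - math.log10(d) for d in range(1, 10)]

# The proportions sum to 1 (the sum telescopes to log10(10) - log10(1)),
# with digit 1 at about 0.301 and digit 9 at about 0.046.
```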

In the plots that follow, p1 through p9 are drawn as the green line. We see that 1 should be the FSD in about 30% of the data while 9 should only be about 6% of the FSDs. The first two plots are examples of numeric data which conform to Benford’s law. The Fibonacci numbers and US county populations both satisfy the criteria given above. The gray bars denote the relative proportions of FSDs in the data.
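As a quick illustration (not the exact dataset behind the plots), we can tally the FSDs of the first 500 Fibonacci numbers and see that the leading-digit proportions land close to Benford’s predictions:

```python
from collections import Counter

# Generate the first 500 Fibonacci numbers.
fibs = [1, 1]
while len(fibs) < 500:
    fibs.append(fibs[-1] + fibs[-2])

# Tally first significant digits (leading character of the integer).
counts = Counter(int(str(f)[0]) for f in fibs)
proportions = {d: counts[d] / len(fibs) for d in range(1, 10)}

# proportions[1] comes out near log10(2) ~ 0.301, as Benford's law predicts.
```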

The next two plots are examples of non-conforming data. The first example is data from the ubiquitous Iris dataset. Although it is undeniably a natural dataset, it fails the first criterion, since its values span only the narrow range from 4 to 8 cm. The second example is an instance of fraudulent data. As chronicled in the State of Arizona v. Wayne James Nelson (CV92-18841), Mr. Nelson, a manager in the Arizona state treasurer’s office, attempted to embezzle nearly \$2 million through bogus vendor payments. Since Nelson started small and worked his way up to larger amounts, the values do satisfy the first criterion. However, as all the amounts were artificially invented, the second criterion is not satisfied and the final FSD distribution is very far from the one given by Benford’s law, with the digits 1-6 being too scarce and 7-9 being much more common than expected.

The last of these examples highlights the potential usefulness of this phenomenon in detecting suspicious numbers, and indeed there are many documented cases where fraudulent data have been exposed through application of Benford’s law. Multiple analyses of results from the 2009 Iranian presidential elections have used Benford’s law to provide statistical evidence suggesting vote rigging. A post-mortem Benford’s law analysis of the accounts for several bankrupt US municipalities revealed inconsistent figures, which could be indicative of the fiscal dishonesty that led to the municipalities’ financial ruin. A team of German economists applied a Benford’s law analysis to the accounting statistics reported by European Union member and candidate nations during the years leading up to the 2010 EU sovereign debt crisis. They found that the numbers released by Greece showed the highest degree of deviation from the expected Benford’s law distribution. As Greek national debt was one of the main drivers of the crisis, this suggests that the Greek government was fudging the numbers to hide its fiscal instability. Interestingly, while researching this topic we found that the Greek source data for this analysis is now conspicuously absent from the EUROSTAT website.

## Testing Benford’s Law

Having seen that deviation from Benford’s law can be a useful indicator of anomalous data, we are left with the question of actually quantifying that deviation.  This brings us to the topic of statistical hypothesis testing, in which we seek to confirm or reject some hypotheses about a random process, given a finite number of observations from that process. For the purposes of our current discussion, the random process in question is the population from which our numeric data are drawn, and the hypotheses we consider are as follows:

H0 (null hypothesis): The population’s FSD distribution conforms to Benford’s Law

H1 (alternate hypothesis): The population’s FSD distribution is different from Benford’s Law

Depending on the outcome of the test, we either accept the null hypothesis, or reject it in favor of the alternate hypothesis. In the latter case, we may have grounds for applying more scrutiny to the values, as failure to fit Benford’s law can be a sign of questionable data. The second piece of a statistical test is the significance level, which is closely tied to the p-value. In statistics, the results we obtain are not concrete facts; rather, our conclusions come with some level of certainty less than 100%. The precise definition of the p-value is rather nuanced, but we can think of it as a measure of how extreme the calculated test statistic is, under the assumption that the null hypothesis is true; the significance level is the threshold at which we judge that extremity sufficient to reject the null hypothesis. The workflow of a statistical test is thus as follows:

1. Calculate a test statistic from the sample data, using the method prescribed for the specific test.
2. Choose a desired significance level, which determines a critical value for the test statistic.
3. If the calculated statistic is greater than the critical value, then the null hypothesis is rejected at the chosen significance level. Otherwise, the null hypothesis is accepted.

For Benford’s Law hypothesis testing, commonly employed tests are Pearson’s chi-square goodness-of-fit test, and the Cho-Gaines d statistic. Let’s work these tests out using our four example datasets.

### Chi-Square Goodness-of-Fit Test

This is a general-purpose test for verifying whether data are distributed according to any given distribution. The test statistic is computed from counts rather than proportions. Let $\hat{p}_d$ be the observed proportion of digit d in the data’s FSD distribution, and $p_d$ be the expected Benford’s law proportion defined previously. For a data set containing N observations, the observed and expected frequencies are given by $O_d = N\hat{p}_d$ and $E_d = Np_d$ respectively. The chi-square statistic is defined as follows:

$\chi^2 = \sum_{d=1}^9 \frac{(O_d - E_d)^2}{E_d}$

The critical value for this test comes from a chi-square distribution with (9-1) = 8 degrees of freedom. For a significance level of 0.01, we get a critical value of 20.09. If the value of $\chi^2$ is greater than this value, then we can reject a fit to Benford’s law with 99% certainty. In the Nelson check fraud dataset, we have the following observed frequencies:

$O_1,\dotsc,O_9 = [1, 1, 0, 0, 0, 0, 3, 9, 8]$

In other words, 1 was the first significant digit in one of the entries, while 9 was the FSD in 8 entries. For this 22 point dataset, our expected Benford’s law frequencies are:

$E_1,\dotsc,E_9 = [ 6.622 , 3.874 , 2.749, 2.132, 1.742,1.473, 1.276, 1.125, 1.007]$

Computing the chi-square statistic is a simple matter of plugging in the values:

$\chi^2 = \frac{(1-6.622)^2}{6.622} + \frac{(1-3.874)^2}{3.874} + \dotsb + \frac{(8-1.007)^2}{1.007} = 121.0169$

The obtained value is greater than the critical value, so we can indeed say that the fraudulent check data do not fit Benford’s Law. Iris, our other non-conforming dataset, also produces a chi-square statistic larger than the critical value (506.3930), while the Fibonacci and US county datasets produce values less than the critical value (0.1985 and 10.6314 respectively).
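The arithmetic above is easy to reproduce. A minimal sketch in Python, using the Nelson digit counts:

```python
import math

# Observed FSD counts for the 22 Nelson check amounts (digits 1..9)
observed = [1, 1, 0, 0, 0, 0, 3, 9, 8]
N = sum(observed)  # 22

# Expected Benford frequencies E_d = N * (log10(d+1) - log10(d))
expected = [N * (math.log10(d + 1) - math.log10(d)) for d in range(1, 10)]

# Chi-square statistic: sum of (O_d - E_d)^2 / E_d over the nine digits
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# chi2 comes out to about 121.02, far above the 0.01-level
# critical value of 20.09, so the fit to Benford's law is rejected.
```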

### Cho-Gaines d

For small sample sizes, the chi-square test can encounter difficulty in discriminating between data which do and do not fit Benford’s Law. The Cho-Gaines d statistic is an alternative test which is formulated to be less sensitive to sample size. It is defined as follows:

$d = \sqrt{N \sum_{d=1}^9 (\hat{p}_d - p_d)^2}$

For a significance level of 0.01, the critical value for d is 1.569. The values of d for our example data are 0.114, 1.066, 7.124, and 2.789 for the Fibonacci, US Counties, Iris, and Nelson datasets respectively. The first two values are less than the critical value, whereas the last two are greater, thus producing a result which is consistent with the chi-square test and visual comparison of the FSD distributions. Unlike the chi-square statistic, whose critical values come from a well-characterized distribution, the critical values for Cho-Gaines d are obtained from Monte Carlo simulations and are only available for a few select significance levels. This means it is not possible to compute an exact p-value for an arbitrary value of d, which represents a tradeoff compared to the chi-square test.
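The Cho-Gaines d computation can be sketched just as compactly, again using the Nelson digit counts from the chi-square example:

```python
import math

def cho_gaines_d(observed):
    """Cho-Gaines d statistic for FSD counts over digits 1..9:
    d = sqrt(N * sum((p_hat_d - p_d)^2))."""
    n = sum(observed)
    benford = [math.log10(d + 1) - math.log10(d) for d in range(1, 10)]
    return math.sqrt(n * sum((o / n - p) ** 2
                             for o, p in zip(observed, benford)))

# Nelson check data: d comes out to about 2.79, above the
# 0.01-level critical value of 1.569, so conformance is rejected.
nelson = [1, 1, 0, 0, 0, 0, 3, 9, 8]
```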

## Wrap-Up

In this post, we’ve explored first significant digit analysis with Benford’s Law. This straightforward concept, when combined with simple statistical tests, can be a useful indicator for rooting out anomalous numeric data. Benford’s law analysis is one of the many statistical analysis tools that are being incorporated into BigML. So stay tuned for a follow-up post on how to perform this handy task and more on BigML.