Getting the NDA out of the Way with Machine Learning

Today’s edition of our blog post series, written by the speakers at the upcoming 2ML event, covers how the Dutch company JuriBlox B.V. applies ML to make the lives of legal professionals easier. This is a great example of how automating tedious and repetitive tasks with Machine Learning saves time for legal professionals, who often need to process Non-Disclosure Agreements (NDAs). It’s a well-defined task among many other productivity-boosting applications transforming the legal industry.

This guest post is written by Arnoud Engelfriet, Founder at JuriBlox B.V. and author of the paper Creating an Artificial Intelligence for NDA Evaluation. To hear the full story, we invite you to join his session at 2ML on May 8-9.

Everyone in business has seen dozens of NDAs, also called Non-Disclosure Agreements, or confidentiality agreements. After ordering coffee, signing an NDA prior to negotiations or discussions is the most common act in business. For most businesspeople, an NDA is very much a standard document. However, from a legal perspective, that couldn’t be more wrong. Carefully reading NDAs for pitfalls is key. But who has time to do that for each NDA? Well, our Machine Learning Lawyer: NDA Lynn.

Actually, both businesspeople and lawyers are right. NDAs are routinely used to cover confidential exchange of any business information, from prototype designs to customer lists or proposals for new business ventures. Most NDAs are not reviewed as carefully as attorneys would recommend: it takes a significant amount of time and legal expertise to get an NDA just right and negotiated down to the last issue. So NDAs are routine documents that are perceived as standard, but in fact, they are custom documents that are unique.

Until the legal world comes up with a standard NDA for the whole world, the best course of action is to review each NDA carefully prior to signing. However, due to the high costs associated with a legal review, many businesspeople are somewhat hesitant to go this route. And for lawyers, the problem with reviewing an NDA is that it is mostly scanning for deviations from boilerplate text, which is extremely boring even for people whose job it is to review boring prose.

So, here we have a legal document that needs careful reviewing. The review consists of looking for standard patterns and deviations from the standard, and each review should produce the right output. Does that sound like a job for Machine Learning to you? Well, this is the premise on which NDA Lynn was built.

A challenge for Machine Learning on text documents is that text is unstructured. Legal documents do have clauses, headings and so on, but they are hard to recognize by a computer. Therefore, a two-step approach was used. In the first step, a Machine Learning model was developed to identify whether individual sentences in a document belong to one of some twenty-plus legal categories (e.g. purpose of NDA, duration of confidentiality, security obligations). Sentences in the same category can then be treated as a clause on that topic.

In the second step, a Machine Learning model specific to each category is deployed to determine the “flavor” or type of the clause in the category. For example, is this security clause strict, relaxed or intermediate? With those flavors, it becomes possible to judge the NDA: if you are providing information, it’s bad to have a relaxed security clause, as that creates a risk that the information ends up in the wrong hands without the clause having been violated.

With these models, the remaining functionality of NDA Lynn follows quickly. A document is received on the website, its text is extracted and each sentence is fed to the first model. Sentences in the same category are grouped, and for each group the appropriate “flavor” ML model is employed. Finally, a simple lookup table is used to determine the outcome: if the customer is giving information and the security clause is relaxed, that is no good. As a result, NDA Lynn can judge any NDA (well, in English) with only one question to be answered in advance: are you providing or getting information, or both?
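The flow described above (classify sentences, group them into clauses, classify each clause's flavor, then consult a lookup table) can be sketched as follows. This is only an illustration of the data flow: the keyword rules stand in for trained models, and every function name, category keyword, and lookup entry other than the "providing + relaxed security" case is a hypothetical assumption, not NDA Lynn's actual implementation.

```python
from collections import defaultdict

# Hypothetical stand-ins for the trained models. Real models would be
# learned from tagged NDAs; these keyword rules only show the data flow.
def classify_sentence(sentence):
    """Step 1: assign a sentence to a legal category (stub)."""
    text = sentence.lower()
    if "secure" in text or "safeguard" in text:
        return "security"
    if "year" in text or "terminate" in text:
        return "duration"
    return "other"

def classify_flavor(category, clause_text):
    """Step 2: assign a flavor to a whole clause (stub)."""
    text = clause_text.lower()
    if category == "security":
        return "relaxed" if "reasonable" in text else "strict"
    if category == "duration":
        return "indefinite" if "perpetual" in text else "fixed"
    return "standard"

# Lookup table: (role, category, flavor) -> verdict.
# Only the first entry comes from the text; the rest are illustrative.
VERDICTS = {
    ("providing", "security", "relaxed"): "bad",
    ("providing", "security", "strict"): "good",
    ("receiving", "security", "relaxed"): "good",
}

def review_nda(sentences, role):
    """Group sentences into clauses, then judge each clause for the role."""
    clauses = defaultdict(list)
    for sentence in sentences:
        clauses[classify_sentence(sentence)].append(sentence)
    report = {}
    for category, group in clauses.items():
        flavor = classify_flavor(category, " ".join(group))
        verdict = VERDICTS.get((role, category, flavor), "neutral")
        report[category] = (flavor, verdict)
    return report
```

Keeping the verdicts in a plain table, separate from the models, mirrors the idea that the judgment can be tuned without retraining anything.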

Surprisingly, the actual Machine Learning part is not that hard. The ensemble and neural network models of BigML proved very flexible and effective, and the easy-to-use interface made it a short point-and-click exercise to turn a training dataset into a complete model, ready to go. We could even use the models offline, giving lightning-fast responses for each document.

Oh, did I mention those datasets? Any ML model is only as good as the data you put into it. That meant having to manually tag a lot of NDAs: what type of sentence is this, security or duration? And for each group of sentences (each clause), what flavor does it have? And yes, that is a lot of tagging. The current dataset comprises over 1200 Non-Disclosure Agreements, each between 30 and 60 sentences in length. But the effort was worth it: NDA Lynn now performs sentence classification with 94% accuracy, and its flavor models perform on average well over 90% too.

We just released our Business Edition, so it’s time to leverage the power of NDA Lynn to create your own NDA-reading lawyerbot for your organization. The Business Edition allows you to tune the lookup table: what is good, what is bad and where do you draw the line at the not-really-ok-but-not-dealbreaker-type clauses? Moreover, you get NDA document management, you can have the reviews sent to your company lawyer for a manual check and there’s even an API to connect Lynn to your e-mail or document management system. How’s that for the future of legal work?

Want to know more about how ML is used in the Legal Profession?

Join the #2ML18 event on May 8-9 in Madrid, Spain. Get your ticket today so you can meet all the speakers as well as the BigML and Barrabé teams, the co-organizers of 2ML. For more details about the agenda and other activities, please visit the event page. We hope to see you there!

Fireside Chat with Tom Dietterich

BigML’s Chief Scientist, Emeritus Professor Tom Dietterich was recently interviewed by Eric Horvitz as part of Microsoft Research’s Fireside Chat series with distinguished thought leaders in the fields of Machine Learning, artificial intelligence, and cloud computing.

The interview goes into the details of Professor Dietterich’s early interest in Machine Learning as well as the history and evolution of the discipline: starting with the period of “known knowns” when fully deterministic systems ruled the roost, going on with the “known unknowns” period introducing probabilistic reasoning, and culminating in today’s “unknown unknowns” period, which focuses on systems where the computer doesn’t have a full model of the real-world phenomenon it’s trying to predict.

Other interesting topics covered include:

  • What do kangaroos have to do with BMW?
  • Ideas on how to go from “narrow AI” to broader AI
  • Ethical challenges of today’s AI systems
  • How “differentiable programming” is marrying corporate interest with a new set of tools in the current AI landscape
  • Which research areas are likely to surprise us in 8 to 10 years?


How Machine Learning is Disrupting the Accounting Industry

The second edition of 2ML: Madrid Machine Learning is fast approaching. In less than a month, on May 8-9, in Madrid, Spain, 2ML will bring together hundreds of decision makers, technology professionals, and other industry practitioners that aim to stay current with the latest in Machine Learning. During the two-day program, attendees will get a full understanding of how ML has evolved to where it is today and where the industry is headed, both from a technical and business perspective. To demonstrate the importance and benefits of adopting Machine Learning, distinguished international experts will present how Machine Learning is currently applied in different business areas, such as: cinema, sports, legal services, marketing, human resources, finance, and investments, among others.

To learn the key insights directly from each of these experts, we invite you to attend 2ML Madrid Machine Learning. In the coming days, we will post a series of blog posts authored by the speakers of #2ML18 as a warm up for the main event. Today’s post, written by Jorge Pascual, CEO at anfix, covers how ML is disrupting the accounting industry.

The accounting and finance functions of any business have traditionally been a gray world, full of gray-suited gentlemen who did things in black and white (e.g., record transactions in double entry books). These support functions garner even less attention in small and medium-sized enterprises, when compared to sales or marketing.

Although it is true that the starting point of small-business accounting has been tracking transactions that take place in the company, regular costs and revenue estimates, things have substantially changed over the last few years. Currently, the accounting industry is immersed in a moment of great transformation that will change the lives of entrepreneurs working at small and medium-sized companies, including of course, the Accounting Advisors.

The main change underway in this sector is the adoption of the cloud as the preferred means to record business transactions. This move, made by thousands of entrepreneurs and companies, has created a large data repository. And that is where predictive models begin to provide valuable lessons. The repository is also ripe for Machine Learning algorithms to return actionable insights with further added value without compromising the confidentiality of the underlying information.

What exactly is meant by added value? The potential applications of ML in the accounting and financial world are practically infinite. On one hand, the entire management can be automated. With the right ML techniques, when we receive a new invoice, we can predict the expense account that this invoice belongs to. Before ML, this process could only be done by an accountant who was in control, remembered the entire accounting plan, and knew what each account was for. Now, this can be an automatic process. Additionally, if we receive invoices or other documents on paper, we can also use ML to extract information about the content, and again, we can automate the accounting process. These two simple examples currently make up approximately 80% of the work done by accountants or Accounting Advisors. The automation of these two tasks alone has already completely changed the working model, making a tremendous impact on the industry.

Management is only the first area within accounting where ML has a large impact. Once you have the accounting information, a very wide range of possibilities appears. The obvious next step is the generation of taxes from this data. However, the process of generating these tax models is usually not a linear process and it depends on the (legal) “creativity” that is applied by humans. Here again, ML can help humans apply the criteria that best serve the goal we want to achieve (pay less tax, improve the company’s financial performance, etc.).

Another possibility that is somewhat related has to do with financial forecasting statements. We can predict future data based on past data, for example, we can find out in advance if we will run out of money at some point in time, which will be an insight found by our data, not by our intuitions. Here is where the Accounting Advisors can provide solutions to the companies they support, beyond the company accounting itself. For instance, the right tool can predict a situation where the company will need funding in the short, mid, or long term. Knowing this information, the company can then anticipate the negotiations with a bank to get a loan under good conditions, as the bank probably won’t pressure the company that much at this stage, so the company can get a better deal.
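To make the "will we run out of money" idea concrete, here is a minimal sketch that fits a straight-line trend to past month-end cash balances and projects when it reaches zero. A linear trend is a deliberately simple stand-in chosen for illustration, not the forecasting method of any particular tool, and the function names and balances are hypothetical.

```python
def fit_line(ys):
    """Ordinary least-squares fit of y = a + b*x for x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def months_until_cash_runs_out(balances):
    """Months after the last observation at which the fitted trend hits zero.

    Returns None when the trend is flat or rising. Needs at least two
    observations.
    """
    intercept, slope = fit_line(balances)
    if slope >= 0:
        return None
    zero_at = -intercept / slope  # x position where the line crosses zero
    return max(0.0, zero_at - (len(balances) - 1))

# Hypothetical month-end balances, falling by 5,000 per month.
runway = months_until_cash_runs_out([50_000, 45_000, 40_000, 35_000])
print(runway)  # 7.0
```

A real forecast would account for seasonality and known future invoices, but even this crude projection shows how an early warning can come from the data rather than from intuition.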

Machine Learning will completely transform the Accounting Advisory industry. Currently, Accounting Advisors spend 80% of their time processing data, which can be done by machines. If ML reduces this amount of time to only 10%, these professionals only need to supervise what the machines do for them, which allows them to use the freed-up 70% of their time to focus on other activities that either result in more value to their customers or help them find new customers.

To conclude, by changing this way of working, Accounting Advisors will also have to change the business model and the entire century-old accounting industry with it. Currently, this sector charges for each invoice entered, which will likely become meaningless with ML in the mix. The second big beneficiary of this transformation will be the entrepreneurs themselves. ML allows you to predict all kinds of scenarios; today we have just named a few but there are many more. Knowing these situations, entrepreneurs will be able to make smarter and better-thought-out decisions.

Want to know more about how ML is disrupting the accounting industry?

Join the #2ML18 event on May 8-9 in Madrid, Spain. Get your ticket today so you can meet all the speakers as well as the BigML and Barrabé teams, the co-organizers of 2ML. For more details about the agenda and other activities, please visit the event page.

Finding Sense in March Madness with Machine Learning

If you are one of the approximately 54 million Americans who filled out a bracket to predict the NCAA Men’s Basketball tournament this year, odds are that your bracket was no longer perfect within the first 24 hours of the tournament, and substantially off track by the end of the opening weekend. Correctly predicting all 63 games (ignoring the 4 play-in games) is infamously difficult, with a probability that ranges from 1 in 9.2 quintillion to a much more manageable 1 in 128 billion, depending on who is counting. With such low odds, it is no surprise that Warren Buffett famously offered a $1 billion prize to anyone who correctly picks every winner.
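The 9.2 quintillion figure is what you get by treating each of the 63 games as a fair coin flip. The short sketch below reproduces it; the helper `perfect_bracket_odds` is a hypothetical name added for illustration, and it assumes every game is picked correctly with the same independent probability.

```python
GAMES = 63  # games in the main bracket, ignoring the 4 play-in games

# If every game were a fair coin flip, each of the 2**63 possible
# brackets would be equally likely.
possible_brackets = 2 ** GAMES
print(f"{possible_brackets:,}")  # 9,223,372,036,854,775,808

def perfect_bracket_odds(p_correct, games=GAMES):
    """Return N such that the chance of a perfect bracket is 1 in N,
    assuming each pick is right independently with probability p_correct."""
    return 1 / (p_correct ** games)
```

Any picker who is right more often than a coin flip gets shorter odds, which is why published estimates differ so widely depending on the assumed per-game accuracy.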

Not familiar with how the NCAA basketball tournament works? Now is a good time to pause and check out this guide.


This year, the tournament once again lived up to its “March Madness” moniker with a number of heavy favorites losing in the first two rounds of the tournament. The chaos was headlined by an unprecedented upset of #1 overall seed Virginia in the first round by long shot UMBC, and also claimed #1 Xavier, #2 North Carolina, and #2 Cincinnati as victims. With so many brackets officially busted, we decided to investigate how a Machine Learning approach would perform on predicting the remainder of the tournament.

Data Collection and Feature Engineering

In nearly all data analysis projects, data acquisition and wrangling constitute the greatest challenge and time demand. Fortunately, a well-structured data set of NCAA basketball games, extending back to the 1985 season, has been compiled by Kaggle and Kenneth Massey. While this data was not in a format that could be considered “Machine Learning-ready”, it did provide substantial raw material to engineer features. Our approach was to represent each team as a list of these engineered features, such that each past basketball game could be represented by the combination of two lists (one for each team), and an objective field consisting of the result of the game from the perspective of the first team listed. Because so many of our features relied on data that was only collected going back to the 2003 season, we limited our final data set to 118 total features acquired from the most recent 15 seasons.
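As a rough illustration of that representation, the sketch below combines two per-team feature records into one training row with the objective expressed from the first team's perspective. The feature names (`seed`, `avg_score_diff`), the `a_`/`b_` prefixes, and the `result` field are hypothetical choices for this example, not the actual column names in the data set.

```python
def build_game_row(team_a, team_b, a_won):
    """Combine two per-team feature dicts into one training row.

    Features are prefixed so the two teams stay distinguishable; the
    objective is the result from the first team's perspective.
    """
    row = {f"a_{name}": value for name, value in team_a.items()}
    row.update({f"b_{name}": value for name, value in team_b.items()})
    row["result"] = "win" if a_won else "loss"
    return row

# Hypothetical feature values for two teams in one past game.
row = build_game_row(
    {"seed": 1, "avg_score_diff": 12.5},
    {"seed": 16, "avg_score_diff": 3.0},
    a_won=True,
)
```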

March Madness features table

The features used in this investigation belonged to several different categories:

  • Team performance: e.g., home/away wins, wins in close games, longest win streak, scoring differential, etc.
  • In-game statistics: e.g., free-throw percentage, three-point field goal percentage, average three-point field goal attempts, average rebounding difference, etc.
  • Ranking information: e.g., RPI, tournament seeding, Whitlock, Pomeroy, Sagarin, etc.

Model Training and Evaluation

Operating under the assumption that each season can be considered independently, we trained and compared four distinct supervised Machine Learning algorithms offered by BigML using historical NCAA tournament and regular season data going back to 2003. Using the Python bindings, a unique cross-validation approach was implemented in which the results for each tournament were predicted using games from all other tournaments as training data. Given that 15 seasons of training data was available, the resulting evaluation was analogous to a 15-fold cross validation, which is visualized in the boxplot below. Default parameters were used for each of the four algorithms investigated: random decision forests, boosted trees, logistic regression, and deepnets.
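The season-by-season evaluation scheme can be sketched in plain Python as below. This does not use the BigML Python bindings mentioned above; `train_fn` and `predict_fn` are hypothetical callables standing in for whatever model is being compared, and each game is assumed to be a dict with `season` and `result` keys.

```python
def leave_one_season_out(games):
    """Yield (held_out_season, train_rows, test_rows) for each season."""
    seasons = sorted({game["season"] for game in games})
    for season in seasons:
        train = [g for g in games if g["season"] != season]
        test = [g for g in games if g["season"] == season]
        yield season, train, test

def cross_validate(games, train_fn, predict_fn):
    """Return accuracy per held-out season.

    train_fn(rows) -> model; predict_fn(model, row) -> predicted label.
    """
    scores = {}
    for season, train, test in leave_one_season_out(games):
        model = train_fn(train)
        correct = sum(predict_fn(model, g) == g["result"] for g in test)
        scores[season] = correct / len(test)
    return scores
```

With 15 seasons this yields 15 scores per algorithm, which is the kind of per-fold spread a boxplot summarizes.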

March Madness model comparison

While these algorithms performed similarly to one another, we ultimately decided to apply our deepnets implementation since it had the smallest variance season-to-season and the greatest minimum performance, that is, it rarely performed poorly relative to the other methods.

March Madness deepnet model feature importance

Top 20 features used in the final deepnet model. Chart was created by downloading the CSV of field importances from BigML.

When investigating the field importance of the model, interestingly, team seed was not the most important feature, although the seeds of both teams ranked in the top 20. This indicates that our model will return predictions that consider more than simply the seeding in the tournament. The top four features were quite consistent: the average scoring differential and margin of victory for the respective teams. This result suggests that how many points you score relative to the competition is perhaps a greater indicator of future wins than simply whether or not you have won in the past. Accordingly, teams with many blowout wins likely perform better than teams that have equivalent records but have won by narrower margins. The absence of “close wins/losses” among the top features indicates that close games may be decided more often by chance than by determination.

Finally, a number of different ranking systems, including RPI, WLK (Whitlock), WOL (Wolfe), and POM (Pomeroy), were found among the top features. It should be noted that while each of these systems uses a different methodology to rank teams, they are very highly correlated with one another as well as the seeding in the tournament.  If interested, you can check out an exhaustive list of these different ranking systems and how they compare.

Deepnet Bracket Prediction

Filling out an NCAA tournament bracket certainly does not require an algorithm, and several popular heuristics exist which enjoy varying degrees of success. While some combination of gut instinct, alma mater loyalty, and even mascot preference inform most bracket decisions, the default method of selection is simply choosing the lower seed. While this pattern largely held in our machine learning-generated predictions, the efficacy of this method breaks down after the initial rounds pass and teams become more equally matched. In a typical bracket contest, these are also the rounds worth the greatest amount of points.

In addition to picking the winner of each match-up according to our model, we have also color-coded the probability of each team winning. The intensity of color represents the probability of the result, with upsets being colored in red and pale colors indicating low confidence in an outcome.

BigML's bracket for March Madness

While conservative overall in its predictions, our model does not always choose the lower seed. In the East and Midwest, Villanova and Duke are expected to advance to the Final Four, although both Elite 8 match-ups are predicted with confidence not much higher than a coin flip. In the South, our model prefers #9 Kansas State over lower-ranked Kentucky, although Nevada is picked to advance to the semi-finals. Finally, in the West Region, our deepnet model has considerable confidence in Gonzaga and Michigan advancing, and prefers Michigan overall. Our projected championship game is a match-up between Big Ten tournament champion Michigan and perennial contender Duke, with the Blue Devils emerging victorious on April 2 for their 6th national title.

Tournament Simulations

While predicting the discrete outcome of a match-up is a compelling exercise, the frequency of upsets in the NCAA tournament reminds us that even very rare events inevitably occur. The next step was to explore the probabilities returned by our model in greater detail.

Rather than simply assuming the higher-probability result would always occur, we can instead simulate each game as an “unfair” coin-flip, according to the match-up probabilities returned by our model. That is, if Villanova is likely to defeat West Virginia with 73% probability, there is still a considerable chance (27%) that West Virginia may advance. By simulating tournament games in this manner, we can introduce upsets into our predictions according to how likely they are to occur. In the end, we simulated the remaining games of the tournament 10,000 times. The results are summarized in the table below.

The probability of each team advancing to the remaining rounds of the NCAA tournament according to 10,000 simulations.
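The simulation idea can be sketched as follows, assuming a single-elimination bracket whose size is a power of two. The `win_prob` callable is a hypothetical stand-in for the match-up probabilities returned by a trained model, and the function names are illustrative.

```python
import random
from collections import Counter

def simulate_bracket(teams, win_prob, rng):
    """Play one single-elimination bracket and return the champion.

    teams: list in bracket order (length must be a power of two).
    win_prob(a, b): probability that team a beats team b.
    """
    alive = list(teams)
    while len(alive) > 1:
        next_round = []
        for a, b in zip(alive[0::2], alive[1::2]):
            # "Unfair" coin flip weighted by the model's probability.
            next_round.append(a if rng.random() < win_prob(a, b) else b)
        alive = next_round
    return alive[0]

def title_odds(teams, win_prob, n_sims=10_000, seed=0):
    """Estimate each team's title probability from repeated simulations."""
    rng = random.Random(seed)
    wins = Counter(simulate_bracket(teams, win_prob, rng) for _ in range(n_sims))
    return {team: wins[team] / n_sims for team in teams}
```

Counting how often each team reaches a given round, rather than only the title, would produce a per-round table like the one described above.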

Because the probabilities of events in the later rounds of the tournament are compound probabilities, we see results that at first glance may not seem consistent with the bracket produced above. For instance, although Villanova is favored over Purdue in a head-to-head match-up, Purdue* still has the highest probability of winning the entire tournament. This is a reflection of two factors:

  1. Purdue being more likely to advance out of the Sweet 16 round than Villanova.
  2. Differences in results with the other teams in the tournament that could be faced in the Final Four rounds.

*Unfortunately, our model does not have the sophistication to factor in injuries to key players at this point, nor was it updated with data from the first two rounds of the tournament. Many experts agree that Purdue’s chances have taken a significant hit following the injury to 7-foot-2 center Isaac Haas, unless Purdue engineers can save the day.
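The compound-probability effect can also be worked out exactly on a toy example. The sketch below uses a four-team region and entirely made-up probabilities (they are not the model's outputs): team X is favored 55% head-to-head over team Y, yet Y is more likely to win the region because its first game is easier.

```python
def region_win_prob(p_first, p_opp_first, p_vs_opp, p_vs_other):
    """Probability that a team wins a four-team region.

    p_first:     P(team wins its first game)
    p_opp_first: P(a given potential opponent wins the other first game)
    p_vs_opp:    P(team beats that opponent)
    p_vs_other:  P(team beats the other potential opponent)
    """
    second_game = p_opp_first * p_vs_opp + (1 - p_opp_first) * p_vs_other
    return p_first * second_game

# Hypothetical numbers: X beats Y 55% of the time head-to-head,
# but X only wins its first game 60% of the time versus 90% for Y.
x = region_win_prob(0.60, 0.90, 0.55, 0.70)
y = region_win_prob(0.90, 0.60, 0.45, 0.65)
print(round(x, 3), round(y, 3))  # 0.339 0.477
```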

Final Thoughts

While relying on a Machine Learning model to predict a bracket is far from a sure strategy, it can provide an alternative method to make interesting and compelling picks. By emphasizing the probability of the events, rather than the discrete outcomes, we can get a better sense of the frequency of upsets. Ultimately, the only way to know who will win the tournament is to play the games.


Rethinking the Legal Profession in the Age of ML

By now, Machine Learning is soundly in the public domain as its wide impact is being felt across many industries around the world as they go through digital transformations. Although the spearheading ML applications have come from the usual suspects such as Internet companies and software firms, the waves of automation and data-driven decision making have recently been crashing on the shores of the Legal Services industry (article in Spanish).

A typical law firm in the Western world employs tens or even hundreds of attorneys specializing in different practice areas e.g., intellectual property, corporate, civil, criminal, constitutional law. The business of legal services remains perhaps the very definition of a human-driven industry essentially relying on increasing the employee count to be able to scale to higher revenues. Such growth no doubt may present some efficiencies, but there’s no evidence of strong network effects letting few players dominate the market. So it becomes even more important to make the best use of your expensive human resources to succeed in this highly fragmented industry full of niche players.

What’s more, the legal profession is historically known as quite conservative in its business practices since it is educated on precedent and is less forgiving towards experimentation and failure. However, a combination of factors sweeping the industry is pushing more firms to reconsider this stance. For starters, clients are demanding faster, more intuitive and accessible legal advice delivered over multiple channels and geographies. In addition, billable hours for less sophisticated commodity aspects such as research or project management are being scrutinized more closely as opposed to reasoning and judgment.

In their 2018 predictions, the Legal Institute For Forward Thinking outlines that AI will be a ticket for admission as a driver of consistent, high-quality client experience. This suggests leading law firms will have to be run more like other companies with an emphasis on operational efficiency. Those who are left behind will have to make do with less profitable clients and a shrinking client base.

How can ML make a difference?

Digitalization is the norm in today’s business environment, which means detailed data on legal evidence, contracts, legislation, and jurisprudence are all available in easily accessible digital formats. However, the bigger challenge remains in making sense of this data deluge, which is where most law firms have been struggling to keep up. Unsurprisingly, a lot of them are turning to technology to be able to deal with it without having to multiply human experts on their payroll.

The typical legal practice tasks involve reviewing and generating documents, discovering useful associations and understanding the motivation and behavior of the parties involved in a legal dispute. State-of-the-art Machine Learning techniques that work with unstructured data have a high degree of applicability in these tasks, in turn reducing the burden of excessive paperwork. For example, contract specifics like parties involved, payment terms, or start and end dates can be automatically extracted and mapped for faster due diligence or anomaly detection.

On the other hand, legal firms share administrative challenges with many other firms, such as human resources management, pricing, forecasting or customer relationship management. By some accounts, over 50% of partner and associate time is being spent on such administrative tasks. The more efficiently these peripheral activities and their underlying processes run, the more profitable the firm becomes as it leaves more resources to be creatively deployed towards new specializations and differentiated service offerings. As you may have come to suspect, with a little human expert help, Machine Learning can connect many of these dots better than humans alone can.

These opportunities do not merely represent forward-looking statements and wishful thinking either. As the old adage goes: the future is here, it’s just unevenly distributed. In fact, we’re already witnessing ML being successfully introduced into more sub-domains of law with use cases ranging from automated jurisprudence aids and predicting judicial decisions to predicting the success of claims. In all three examples, AI systems performed as well as, if not better than, collections of human experts. Are these the Google DeepMind moments of the legal industry? Time will tell.

Predictive Apps on BigML

As for BigML, thanks to our engagement with a leading North American law firm, we have been able to implement a solution to help predict (in detail) future legal services demand, associated resource requirements and optimal pricing for new matters by analyzing more than a decade’s worth of invoices and other expense reports. The resulting system provided partners and administrators unprecedented insights into cost drivers by matter type, jurisdiction, litigation team structure, and other case-specific factors. None of this could be replicated even by the most experienced members of the firm.

Other BigML customers in the legal space also keep adding to the creative ways ML innovations are deployed in the legal industry. For instance, NDA Lynn recently launched its automated NDA checker service, initially training its models on hundreds and then thousands of variations of Non-Disclosure Agreements. This collection of data produced interesting patterns that can serve as early warning signs for NDA Lynn customers looking to address any undue risks before agreeing to the terms stated in their NDA.

NDA Lynn

This simple, narrow-AI example will likely find its way to many other types of contracts over time as digital data samples increase in size and the need to manage risks in a quantifiable way mounts in today’s ultra-competitive legal marketplace. As such, leading-edge law firms see the need to add many more ML-powered micro-services capabilities to their next generation IT platforms making lawyering more efficient, accurate, and less labor intensive. If this trend stays in place, CTO or CDO jobs in law firms may be a hotter commodity than they’ve been perceived so far by top-notch technical professionals, further attracting the best and brightest young lawyers feeling right at home working with ML-driven systems.

It should be a fun ride to see how it all unfolds and whether one of the oldest industries can pass its test against technology!

Predicting Air Pollution in Madrid

Air pollution is a tremendous problem in big cities, where health issues and traffic restrictions are continuously increasing. The concentration of Nitrogen Dioxide (NO2) is commonly used to determine the level of pollution. In Madrid, Spain, there are several stations in different parts of the city that are constantly collecting the NO2 levels. My colleague, Jaime Boscá, and I applied BigML to see if we could accurately predict air pollution in Madrid.

Air Pollution Map Spain

NO2 view from European satellite Sentinel-5P (Photo: ESA)

A set of alerts based on the NO2 levels (shown in the table below) has been defined to monitor and avoid high pollution levels.


Madrid government air pollution alert states

These alerts trigger measures that subsequently enforce traffic restrictions for Madrid citizens. The main problem is that these levels of NO2 are usually reached at the end of the day and the traffic restriction measures take effect the next day. Therefore, the population affected has only a few hours to rearrange their means of transport for the following day. These measures have drawn much criticism of the local government. Predicting such alerts would help warn the population in advance so they have more time to reschedule their transportation plans.


Traffic restrictions due to air pollution are common in European big cities like Madrid or Paris (Photo: AFP)

Is it possible to predict which days will have pollution alerts?

Our goal is to predict a pollution alert (YES/NO) in advance by 1, 4, and 7 days. A pollution alert means that one of the previous alert levels has been reached.

Data collection

To address Madrid’s air pollution problem, we used three main data sources about the city:

  • Air quality data: has been gathered for years and is available for multiple air measuring stations gathering NO2 levels on an hourly basis. 
  • Weather data: information available daily about temperature, rain, and wind.
  • Historical traffic data: detailed traffic load information available online for main streets and highways around Madrid.

The data used was collected from 2013 to 2017. To simplify the problem, we limited the analysis to zone 1 (shown below) as it includes most of the Madrid city area, which has the greatest number of air stations.


Map of air and weather stations in Madrid zone 1

Data transformations

Both the weather information and the pollution alert statuses are available daily. That’s why data has been represented with daily granularity: each sample (or instance) provides information for a given day. Therefore, aggregated weather and air information is included as additional features per day.
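As a rough sketch of that daily aggregation, the snippet below collapses hourly NO2 readings into one record per day. The input layout (ISO timestamp strings paired with values) and the output field names `no2_avg` and `no2_max` are assumptions made for illustration; the actual station files may be organized differently.

```python
from collections import defaultdict
from datetime import datetime

def daily_no2_summary(hourly_readings):
    """Aggregate hourly readings into per-day average and maximum.

    hourly_readings: iterable of (ISO timestamp string, NO2 value).
    Returns {date: {"no2_avg": float, "no2_max": float}}.
    """
    by_day = defaultdict(list)
    for timestamp, value in hourly_readings:
        by_day[datetime.fromisoformat(timestamp).date()].append(value)
    return {
        day: {"no2_avg": sum(values) / len(values), "no2_max": max(values)}
        for day, values in by_day.items()
    }
```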

We also considered traffic conditions and the predictions of traffic in our model. In order to include traffic predictions, we used another model to predict Madrid traffic, which was implemented in BigML using features such as weekdays and holidays. The evaluation results have been promising, allowing us to use BigML traffic batch prediction results as features in our model for predicting air pollution. In the same way, temperature predictions were also modeled and used as features.

Predicting air pollution is a challenge. How many days in advance could we anticipate obtaining an acceptable prediction? We tried three different predictions: 1, 4 and 7 days in advance. Each prediction uses a different time window for the same features.
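One way to picture the three horizons is to pair the features known on a given day with the alert flag observed some days later. The sketch below does that for a list of daily boolean alerts; the function name and the sample data are hypothetical, and the exact windowing used in the original datasets is not described in the post.

```python
def shift_target(alerts, horizon):
    """Pair each day's index with the alert flag `horizon` days later.

    alerts: booleans ordered by day, one per calendar day.
    The last `horizon` days are dropped because their future label
    is not known yet.
    """
    return [(i, alerts[i + horizon]) for i in range(len(alerts) - horizon)]


# Made-up alert history covering nine consecutive days.
alerts = [False, False, True, False, False, False, False, True, False]
targets = {h: shift_target(alerts, h) for h in (1, 4, 7)}
```

Longer horizons leave fewer usable training rows, since more days at the end of the series have no known label.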

Feature engineering

Most datasets can be enriched with extra features derived from existing data. In our case, we can use time-based information such as feature values for a previous date or number of days since an event happened. We used the following features:

  • NO2 averages and maximum values.
  • Maximum, minimum and average temperatures.
  • Rain, wind and traffic information.
  • Traffic predictions.
  • Number of days since the last alert.
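The "number of days since the last alert" feature can be sketched as a single pass over the daily alert flags. This is a minimal illustration with a hypothetical function name. It assumes that only alerts strictly before the current day are counted, and that days before the first alert are left as unknown.

```python
def days_since_last_alert(alerts):
    """For each day, count the days elapsed since the most recent earlier alert."""
    result = []
    last_alert_index = None
    for i, is_alert in enumerate(alerts):
        # Use only past alerts so the feature does not leak the current label.
        result.append(None if last_alert_index is None else i - last_alert_index)
        if is_alert:
            last_alert_index = i
    return result


print(days_since_last_alert([False, True, False, False, True, False]))
# prints [None, None, 1, 2, 3, 1]
```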


The datasets used, including all features, are available on the BigML Gallery:

Data exploration

The colored table previously mentioned shows the 3 air pollution alert levels defined in Madrid: “prior notice”, “notice” and “alert”. Within the five years of available data, only “prior notice” and “notice” alerts occurred; the red “alert” level was never reached. The distribution of pollution alerts is also far from balanced because, luckily, few alerts are raised: fewer than 100 “notice” and “prior notice” states were observed in total.

That’s why we decided to group alerts and create a boolean objective field to predict whether or not a pollution alert will be raised.
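Grouping the alert levels into a single boolean objective can be sketched as follows. The "none" label for alert-free days and the sample sequence are hypothetical; only the three level names come from the text.

```python
# The three alert levels defined in Madrid.
ALERT_LEVELS = {"prior notice", "notice", "alert"}


def to_boolean_objective(daily_states):
    """Map each day's alert state to True (any alert level) or False."""
    return [state in ALERT_LEVELS for state in daily_states]


# Made-up daily states; "none" is an assumed label for days without an alert.
states = ["none", "prior notice", "none", "notice", "none", "none"]
objective = to_boolean_objective(states)
positive_rate = sum(objective) / len(objective)
```

Checking `positive_rate` on the real data is a quick way to quantify how imbalanced the objective field is.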

From our analysis, we can see that NO2 levels are directly related to air pollution (shown in the visualization below). We can also see that significant rain and wind have an impact on NO2 levels.


NO2, traffic load, wind and rainfall visualization

In the graph above, maximum daily total precipitation is shown in blue and maximum wind gust speed in orange. Traffic load is shown in green and the average NO2 level in grey. In general, high wind speeds and abundant precipitation seem to correlate with lower NO2 levels, as do low traffic loads.

The BigML scatter plot graphs below support this correlation. The following graph displays the relationship between a boolean indicating whether rainfall exceeded 15 mm during the last 3 days and the average level of NO2. We can observe that all cases with rain over 15 mm correspond to NO2 levels under 55 µg/m3.


3 days rainfall over 15mm correlation to average NO2 level

The next graph displays the correlation between the average daily maximum wind speed over the past 3 days and the NO2 level. When this average is over 20 km/h, NO2 stays under 50 µg/m3.


3 days wind maximum speed correlation to average NO2 level
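The two 3-day features plotted above could be derived with a simple rolling window. This sketch makes two assumptions the post does not spell out: the window ends on the current day, and the rain flag compares the rainfall summed over the window against 15 mm. The function names and sample values are hypothetical.

```python
def rolling_window(values, size):
    """Yield, for each day, the values of the `size` days ending on that day."""
    for i in range(len(values)):
        yield values[max(0, i - size + 1): i + 1]


def heavy_rain_flags(daily_rain_mm, threshold_mm=15.0, window=3):
    # Assumption: rainfall is summed over the window before thresholding.
    return [sum(w) > threshold_mm for w in rolling_window(daily_rain_mm, window)]


def avg_max_wind(daily_max_wind_kmh, window=3):
    # Average of the daily maximum wind speeds inside the window.
    return [sum(w) / len(w) for w in rolling_window(daily_max_wind_kmh, window)]


rain = [0.0, 10.0, 6.0, 0.0, 0.0, 0.0]   # made-up daily rainfall in mm
wind = [10.0, 20.0, 30.0, 10.0]          # made-up daily max wind in km/h
rain_flags = heavy_rain_flags(rain)
wind_avg = avg_max_wind(wind)
```

The first days of the series use a shorter window because fewer than three days of history exist.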


Predictive modeling involves evaluating models and comparing results to select the appropriate algorithms and their specific parameters. Initially, we tried the different algorithms available in BigML that are suitable for classification (models, logistic regression, ensembles, and deepnets). Ensembles gave the best results (see the comparison of all models in the evaluation below). Using the WhizzML script SMACdown, we could automatically search the parameter settings for ensembles.


Modeling strategy


Initially, the dataset is split chronologically: data from 2013 to 2016 is used for training, and 2017 data is used as a test set for evaluations. The evaluation criterion is the Area Under the Curve (AUC) of the ROC curve, which graphically represents the trade-off between recall and specificity for classification problems.

Since we have a very imbalanced dataset (days with alerts are very few compared to days without), we need to balance the model by applying a probability threshold. The threshold was chosen to minimize the False Negatives (days predicted as alert-free that actually had an alert) without incurring too many False Positives (days predicted as having an alert that did not actually have one). We compared all the available models using the BigML comparison tool to ensure we selected the best performing model.
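The chronological split and the effect of a probability threshold can be sketched in a few lines. This is only an illustration, not the BigML implementation: the row layout, the function names, and the sample probabilities are hypothetical.

```python
def chronological_split(rows, first_test_year=2017):
    """Train on everything before `first_test_year`, test on the rest."""
    train = [r for r in rows if r["year"] < first_test_year]
    test = [r for r in rows if r["year"] >= first_test_year]
    return train, test


def confusion_counts(probabilities, actuals, threshold):
    """Count TP/FP/FN/TN when predicting an alert if probability >= threshold."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for probability, actual in zip(probabilities, actuals):
        predicted = probability >= threshold
        if predicted and actual:
            counts["tp"] += 1
        elif predicted:
            counts["fp"] += 1
        elif actual:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


rows = [{"year": 2013}, {"year": 2016}, {"year": 2017}]
train, test = chronological_split(rows)

# Made-up alert probabilities and true outcomes for five test days.
probabilities = [0.9, 0.4, 0.3, 0.2, 0.1]
actuals = [True, True, False, False, False]
at_default = confusion_counts(probabilities, actuals, 0.5)
at_lowered = confusion_counts(probabilities, actuals, 0.27)
```

In this toy data, lowering the threshold from 0.5 to 0.27 removes the one False Negative at the cost of one extra False Positive, which is the trade-off described above.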

Below we can find the field importance graph for the ensemble used in the evaluation. The most important field is the number of stations with a NO2 measurement over 150 µg/m3 the day before, followed by the NO2 average range the day before, and the NO2 maximum range over the 5 previous days. Fields representing traffic prediction, rainfall, and wind also appear in the top 10.


BigML field importance graph: 1 day prediction ensemble

We can see in the figure below the different evaluations for predictions 1 day in advance. The boosted ensemble of 300 iterations (represented in orange below) gave the highest ROC AUC (0.8781).


BigML evaluations comparison tool: 1 day prediction

Once we have selected the best model by looking at the AUC metric, we need to look at the recall and precision of that model to select the optimal threshold before making predictions. The recall is the number of true positives over the number of positive instances, while the precision is the number of true positives over the number of positive predictions. The image below displays a BigML evaluation for the 1-day prediction with the probability threshold set to 27%. The model correctly predicted 14 out of 19 actual alerts, resulting in a 73.68% recall. It also incorrectly flagged 16 days that had no alert, which gives a precision of 46.67% (14 out of 30 positive predictions).

BigML ensemble evaluation: 1 day prediction
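The reported figures can be checked with the standard definitions of precision and recall. The counts below are taken from the text (14 alerts caught, 5 missed, 16 false alarms); the function name is hypothetical.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from confusion matrix counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# 14 of 19 actual alerts predicted, plus 16 alert-free days flagged by mistake.
precision, recall = precision_recall(
    true_positives=14, false_positives=16, false_negatives=19 - 14
)
print(f"precision={precision:.2%} recall={recall:.2%}")
# prints precision=46.67% recall=73.68%
```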

The chart below shows the recall and precision for the three predictions performed: 1, 4, and 7 days in advance.


Precision and recall results

As expected, the further in advance we try to predict, the lower the performance. Nevertheless, making pollution alert predictions even one day in advance would already benefit citizens in their daily lives, as they are currently warned only a few hours in advance.

Taking this use case a step further, predicting pollution levels accurately and sufficiently in advance could even enable us to reduce high pollution levels, one city after another. Insights from Machine Learning aren’t meant to simply remain as additional information about our world – they are meant to be put to good use and improve people’s lives, in our businesses, societies, and beyond.

2ML Madrid Machine Learning: Keep your Business Ahead of the Competition

This week we saw the power of Machine Learning in action when BigML deepnets accurately predicted the winners of the major award categories of the Oscars 2018. The movie industry is simply one visible example where Machine Learning can be applied, but there are many more business-oriented real-world use cases that will be shown at the second edition of 2ML Madrid Machine Learning, to be held in Madrid, Spain, on May 8-9.

Barrabé and BigML are bringing to Madrid the second edition of the annual series of 2ML events where hundreds of decision makers, technology professionals, and other industry practitioners will gather to learn and discuss the latest developments in the fast-changing world of Machine Learning. 2ML is a game-changer event that helps you keep your business competitive, raises awareness about the best ways to integrate Machine Learning capabilities into your business processes, analyzes the current analytics landscape in leading industries, and showcases the impact that Machine Learning is already having in finance, the legal sector, marketing, sales, human resources, sports, social enterprises, and more.

Encouraged by the success of 2ML17, we are ready to continue raising awareness of the key role that Machine Learning plays in the transformation of sectors representing a wide swath of global economic activity.


Want to know more about #2ML18? 

Discover the impact that Machine Learning is going to have on your business while receiving valuable insights from innovators and early adopters that can help keep your company ahead of the competition. Don’t miss 2ML 2018’s jam-packed agenda presenting some of the brightest minds in the Machine Learning field today. Join us at #2ML18 on May 8-9 in Madrid, Spain, and be sure to purchase your ticket before March 28 to get a 30% discount!

2018 Oscars Predictions Proved Right: 6 out of 6!

Last night, Hollywood stars were looking stunning on the red carpet for the 2018 Oscars Ceremonies. BigML’s Machine Learning algorithm was also on point. Our predictions for the 2018 Oscar Winners were correct, 6 out of 6!

BigML's 2018 Oscars Predictions

The BigML deepnets model accurately predicted the winners of the major award categories: best picture, best director, best actress, best actor, best supporting actress, and best supporting actor. The notable improvement from our 2017 Oscar Predictions is thanks to the powerful capabilities of our deepnets model, which is one of the top performing algorithms across different platforms.

2018 Oscars Predictions Results

Movies bring people together, from families getting cozy on their couches, to individuals sharing their stories with complete strangers across the globe. For this reason, the entertainment industry is an exciting area to apply Machine Learning, as seen in the outpouring reactions to BigML’s 2018 Oscars Predictions.

Thanks to everyone who has commented and joined the conversation! To mention a few, check out Enrique Dans’s blog recap, KDnuggets tweets, and the article by El País Retina (in Spanish). Head to BigML’s Twitter to see many more.

Can’t wait for next year’s Oscars! In the meantime, we look forward to many other cool applications of Machine Learning. Have ideas to share with BigML? We’d love to hear them at

Predicting the 2018 Oscar Winners

After the success of last year’s Oscar winner predictions, we are excited to announce this year’s predictions. This year we also count on our powerful BigML deepnets and their automatic optimization option, which makes them one of the best performing algorithms across platforms.

This year, there is a clear favorite, The Shape of Water with 13 nominations, but this doesn’t mean we are not witnessing fierce competition among a wide set of high-quality independent films with stunning performances. However, models don’t care much about this, as they don’t merely follow critics’ opinions. Instead, they search for patterns based on the films that won in the past and make predictions for this year’s nominees. Ok… what data exactly?

The Data

Theoretically, models get better with more observations. Therefore, this year we are keeping all the previous data and features we had brought together for last year’s predictions. This amounts to a total of 1,183 movies from 2000 to 2017, where each film has 100+ features including:


The only major change in the data this year was the removal of the full user reviews from IMDB since they didn’t prove to be important last year and the effort to obtain them is relatively high.

The Models

As before, we train a separate model per award category. For a change, this year we’ll use deepnets, BigML’s deep neural networks, instead of the ensembles that we used last year. Using BigML deepnets with their unique first-class automatic optimization option (“Automatic Network Search”) is the simplest and safest way to ensure that we are building a top performing classifier. Each model takes around 30 minutes to train, since it trains dozens of different networks in the background, but it is time well spent, as the resulting model is very likely to beat others you’d configure via trial and error.


When the deepnet is created, we can easily inspect the most important features of the model and the impact of each of them on predictions. For example, in the case of Best Picture, we can find several awards like the Critics’ Choice Awards, the Online Film and Television Awards, the Hollywood Film Awards, and the BAFTA Awards among the top predictors. To offset the fact that deep neural networks tend to be hard to interpret, BigML offers a unique deepnet visualization, the Partial Dependence Plot, to analyze the marginal impact of various fields on your predictions.


To ensure our model is a good classifier, we trained it on the movies from 2000 to 2012 and then evaluated it on the movie data from 2013 to 2016. For all award categories, we obtained a ROC AUC over 0.98, which means that the models were able to predict the winners for four consecutive years (2013 to 2016) with few mistakes. For example, see below the confusion matrix for the Best Actress model, which correctly predicts the winner in 3 of the 4 test years.


The Predictions

Without further ado, let’s predict the 2018 winners! Drum roll, please…

For each category, you can find the winner as well as the scores predicted by the model for the rest of nominees.

The Shape of Water is the heavy favorite for the Best Picture with a 91 score. However, the model also gives a respectable chance of awarding the prized statue to Three Billboards Outside Ebbing, Missouri with a 68 score.


For the Best Director category, the model doesn’t have doubts. Guillermo del Toro is the likely winner with a score of 75 and no other nominee comes close.


Similarly, for Best Actress, there seems to be little competition. Frances McDormand is undoubtedly the favorite with a 99 score. Far behind, we can find Margot Robbie with a score of 5.


Gary Oldman is predicted by the model as the winner for Best Actor with a score of 88 for his amazing transformation into Winston Churchill in Darkest Hour. However, he will need to subdue Timothée Chalamet, the up-and-comer from Call Me By Your Name, who shows a score of 72 to win according to the model. Another strong rival is the consummate professional Daniel Day-Lewis, with a score of 51 to win the award for his role in Paul Thomas Anderson’s film, Phantom Thread.


Among the five nominees for the Best Supporting Actress, the model favors Allison Janney for her role in I, Tonya with a 64 score.


The Best Supporting Actor category has more competition. However, Sam Rockwell seems the clear favorite for his role in Three Billboards Outside Ebbing, Missouri with a 95 score. That said, Willem Dafoe also has a decent chance for his performance in The Florida Project with a score of 61.


This wraps up our 2018 Oscar predictions. So it’s time to grab your popcorn and favorite drink and see who the real winners are this Sunday, March 4th. That is, unless Jimmy Kimmel and company come up with another jaw-dropping snafu to mess up our models, even if only for a Hollywood minute!

The BigML Dashboard Released in Chinese

新年快乐!Happy Chinese New Year!

It’s only fitting that we release the BigML Dashboard in Chinese at this time of the year.

Since its very beginning, BigML has strived to make Machine Learning Beautifully Simple for Everyone (机器学习美观简单人人用). Today our journey reached another milestone by allowing over 1 billion people to use the BigML platform in their native language.

Top 20 Languages

You can change the website language on BigML by using the selector highlighted in the image below. While the web interface will appear different, all the BigML functionalities remain the same.

When you sign up, your BigML username will still have to be alphanumeric, but you can use Chinese in your “Full Name”. After logging on, all Dashboard features are identical to the English version. You can create and manage all resources and workflows the same way.

Dataset View

You can watch this video to check out the BigML Dashboard in Chinese:

Over time, we will make improvements such as providing more documentation and tutorials in Chinese, and integrating the BigML Blog and Events pages. Furthermore, our internationalization will continue with support for more languages. Let us know what languages you would like to see on BigML next.

读万卷书,行万里路。For hundreds of years, this phrase has been the motto of Chinese intellectuals. It literally means reading tens of thousands of books, and walking tens of thousands of miles. We think it now applies to BigML: Learn millions of volumes of data and travel millions of miles, to reach every corner of the world.
