
Door-to-Door Data Delivery with External Data Sources


A common step when working with BigML is extracting data from a database or document repository for uploading as a BigML Data Source. Have you ever wished you could skip that step and create a BigML Data Source directly from your data store? Well, now you can!

Both the BigML Dashboard and the API allow you to provide connection information along with a table or query specifying the data you wish to extract. BigML will then connect to your data store and create the Data Source in BigML’s server.

Importing External Data with the BigML Dashboard

In the Dashboard, go to the Sources tab and you will see a new database icon with a dropdown for external sources as shown here:


Choose your desired data store and you will have the opportunity to select a connector to a particular instance. Or, you can create a new connector by providing the necessary information. This can vary according to the data store. Here we see the Create New Connector dialog for MySQL:

Once you have selected your connector, you will be presented with the tables and views (where applicable) from your data store. Here you have two options. First, you can simply select one or more tables and immediately import them into your BigML account as Data Sources. Each table will be imported into a separate source.

If you’d like to first take a look at a bit of the data from a given table you can click on it for a preview. That way you can remind yourself of the columns and see some sample data before importing. Here we see a preview of a table containing the well-known Iris data set:

Sometimes the simplest table import is too blunt an instrument. That’s where your second option comes in — the ability to select the exact data you want by writing an SQL select statement. If you only wish to import a subset of columns, for example, the query can be as simple as

       select sepal_width, petal_width, species from iris

The preview button will verify that the query is valid in your data store and show you the initial result set, allowing you to confirm your intentions before importing into a BigML Data Source. Be assured you can take advantage of your data store’s full query language. A more advanced example, below, shows a select statement with both a join and a group-by clause. The typically normalized data has one table with school district information and another containing individual teacher statistics. Here we are creating a Data Source with information about school districts inclusive of the average teacher salary in each district:
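A query of that shape can be sketched as follows. This runs against an in-memory SQLite database purely for illustration; the `districts` and `teachers` table and column names are hypothetical, not taken from the original example, and your own data store's SQL dialect may differ slightly:

```python
import sqlite3

# Hypothetical normalized schema: one table of districts, one of teachers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE districts (id INTEGER PRIMARY KEY, name TEXT, num_schools INTEGER);
    CREATE TABLE teachers  (id INTEGER PRIMARY KEY, district_id INTEGER, salary REAL);
    INSERT INTO districts VALUES (1, 'North', 12), (2, 'South', 8);
    INSERT INTO teachers  VALUES (1, 1, 52000), (2, 1, 48000), (3, 2, 61000);
""")

# A join plus group-by that denormalizes district info together with the
# average teacher salary per district -- the kind of SELECT you could hand
# to BigML instead of importing a single table.
query = """
    SELECT d.name, d.num_schools, AVG(t.salary) AS avg_salary
    FROM districts d
    JOIN teachers t ON t.district_id = d.id
    GROUP BY d.id, d.name, d.num_schools
    ORDER BY d.id
"""
rows = conn.execute(query).fetchall()
for name, num_schools, avg_salary in rows:
    print(name, num_schools, avg_salary)
```

Each row of the result set becomes one row of the resulting Data Source, exactly as if you had uploaded a flat file with those columns.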

Importing External Data via the BigML API

As is the case with most BigML features, external data sources can be utilized via the API. Again, this is done by providing connection information along with either a table, for a simple import, or a custom query for more control. Here’s an example using curl that imports a “sales” table as a BigML Data Source. (See the BigML documentation for how to construct your BIGML_AUTH string.)

    curl "https://bigml.io/source?${BIGML_AUTH}" \
      -X POST \
      -H 'content-type: application/json' \
      -d '{"external_data": {
                "source": "sqlserver",
                "connection": {
                    "host": "",
                    "port": 1433,
                    "database": "biztel",
                    "user": "autosource",
                    "password": "********"
                },
                "tables": "sales"}}'

In the case of the API, you have a few ways to control the imported data without resorting to a query. You can specify which fields to include or exclude, as well as limit the number of records to import. You can also specify an offset, along with an ordering to put that offset in context. All of this is explained in detail in our API docs.
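Put together, such a request body might be built up like the sketch below. Note that the argument names here ("fields", "offset", "ordering", "limit") are illustrative placeholders; check the API docs for the authoritative spelling before using them:

```python
import json

# Sketch of a request body that narrows a table import without a custom
# query. The field-selection and row-limiting argument names below are
# assumptions for illustration -- defer to the BigML API documentation.
payload = {
    "external_data": {
        "source": "mysql",
        "connection": {
            "host": "db.example.com",   # hypothetical host
            "port": 3306,
            "database": "biztel",
            "user": "autosource",
            "password": "********",
        },
        "tables": "sales",
        # Keep only two columns, skip the first 1000 rows in id order,
        # and cap the import at 5000 records.
        "fields": ["amount", "region"],
        "offset": 1000,
        "ordering": "id",
        "limit": 5000,
    }
}
body = json.dumps(payload)
```

Sending the request is then just a matter of POSTing `body` to the source endpoint, as in the curl example shown earlier.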

Ready to Get More From Your Data?

We hope that being able to import directly from your external data stores further simplifies getting the most out of your data with BigML. Currently supported are the MySQL, Postgres, and Microsoft SQL Server relational databases, as well as the Elasticsearch analytics engine. If you have suggestions for other data stores to support, please let us know. To learn more about External Data Sources, please visit our release page, where you’ll find all the relevant documentation.


Fundación ONCE and BigML Join Forces to Create the World’s First Accessible Machine Learning Platform and Drive New Applications

BigML, the leading Machine Learning platform, and Fundación ONCE, a Spanish foundation with a long history of improving the quality of life of people with disabilities, have agreed to jointly evolve the BigML platform so that all people, with or without disabilities, can effectively use Machine Learning as they build predictive applications facilitating more accessibility for citizens.

The strategic alliance will promote the creation of new applications that increase the capacities of professionals with cognitive or physical challenges. This collaboration will enable the adoption of Machine Learning among all types of professionals. As such, BigML will have access to the extensive experience of Fundación ONCE to make its platform more accessible and inclusive. In return, BigML will help Fundación ONCE to train its employees and collaborators and will support them in the process of creating their own Machine Learning applications. The ultimate goal is to create the world’s first and foremost accessible Machine Learning platform that will result in new smart applications making a positive impact in a variety of industries and businesses.

Machine Learning is already penetrating many corporations and institutions that apply it daily to business cases in which the large volume of data makes it impossible for humans to make decisions efficiently. Examples of Machine Learning applications in the real world include a wide array of use cases: predicting customer demand for a service or product, anticipating when a machine is going to need a repair, detecting fraudulent transactions, increasing energy savings, improving customer support, and predicting illnesses, among others.

This agreement aims to ensure that these types of projects can just as easily be developed by people with disabilities. Therefore, this initiative will play a key role in promoting Machine Learning while providing access to equal employment opportunities for disabled professionals. To facilitate the inclusion of professionals with disabilities, Fundación ONCE and BigML will co-organize conferences, congresses, seminars, and training activities that will help improve occupational know-how, entrepreneurship, and overall gainful employment.

BigML’s CEO Francisco Martín points out, “Our mission for more than 9 years has been to democratize Machine Learning and make it easy and accessible to everyone. This new alliance allows us to count on ONCE’s vast experience to make BigML much more inclusive by accelerating the development of new high-impact applications. The entire BigML team is very excited and motivated with the collaboration.”

Jose Luis Martínez Donoso, CEO of Fundación ONCE, states “this agreement places Fundación ONCE in a leading position within an important field that will allow us to advance the inclusion of people with disabilities. Innovation and new technologies must take into account people with disabilities so that they are not excluded, and further opportunities are generated for their full social inclusion.”

Introducing New Data Connectors and BigML Dashboard Enhancements

We’re excited to share that we have just released a trio of new capabilities to the BigML platform. In this post, we’ll give a quick introduction to them, followed by two more blog posts early next week that will dive deeper with some examples of how you can best utilize the new features. Without further ado, here they are.

New Data Connectors

BigML Data Connectors

Every Machine Learning project starts with data and data can come from many sources. This is especially true for complex enterprise computing environments. Naturally, many BigML users look to import data directly from external databases to streamline Machine Learning workflows. No sweat, BigML now supports MySQL, SQL Server, Elasticsearch, and Spark SQL in addition to PostgreSQL.

Both the BigML Dashboard and the API allow you to establish a connector to your data store by providing relevant connection and authentication information, which are encrypted and stored in BigML for future access. BigML can then connect to your data store and immediately create the ‘Source’ in its server(s). You have the option to import data from individual tables or to do it selectively via custom queries by specifying data spanning multiple tables. Moreover, in an organization setting, administrators can easily create connectors to be used by other members of the same organization.

API Request Preview Configuration Option

API Request Review

As a rule of thumb, anything you create on the BigML Dashboard, you can replicate with the BigML API. Now, BigML has added the ability to preview an API request as part of the configuration of unsupervised and supervised models — also available for Fusions on the Dashboard. This handy feature visually shows the user how to create a given resource programmatically including the endpoint for the REST API call as well as the corresponding JSON file that specifies the arguments to be configured.

WhizzML Scripting Enhancements in BigML Dashboard

WhizzML Inputs

When you use WhizzML scripts, some inputs may be set as mandatory while others are optional. You may also provide default values for inputs. You can specify all of this in the corresponding JSON metadata files. Now, you can also do it on the BigML Dashboard when inputs are resources such as Sources, Datasets, and Models. BigML provides checkboxes for users to easily toggle those inputs between mandatory and optional. Similarly, users have the option to provide default values for those inputs or leave them empty in the BigML Dashboard.
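As a rough sketch of what that metadata looks like, here is a minimal script definition with one mandatory resource input (no default supplied) and one optional input carrying a default. The shape follows the usual BigML script metadata, but treat the specifics as illustrative and defer to the API docs:

```python
import json

# Minimal sketch of WhizzML script metadata: an input without a default
# is mandatory; supplying a "default" makes it optional.
script = {
    "name": "sample a dataset",
    "source_code": '(create-dataset {"origin_dataset" dataset "sample_rate" rate})',
    "inputs": [
        {"name": "dataset",            # mandatory: no default supplied
         "type": "dataset-id",
         "description": "Dataset to sample"},
        {"name": "rate",               # optional: has a default value
         "type": "number",
         "default": 0.8,
         "description": "Sampling rate"},
    ],
}

# The inputs the Dashboard would mark as required:
mandatory = [i["name"] for i in script["inputs"] if "default" not in i]
print(json.dumps(mandatory))
```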

Want to know more about these features?

If you have any questions or would like to find out how the above features work, please visit the release page. It includes useful links to the BigML Dashboard and API documentation already and we’ll add the links for the upcoming blog posts as we publish them.

Claire, the Smart B2B Marketplace for the Food Industry powered by BigML

The food trading industry is one of those green-fields, where digitalization hasn’t really taken off in full force. Against this backdrop, about a year ago, Ramón Sánchez-Ocaña and Angeles Vitorica, Co-founders of Claire Global, contacted BigML with an idea that could turn the industry upside down. After more than 20 years as owners of a food trading business, they had all the domain knowledge about how to best achieve digitization in their industry and provide a valuable service to their peers. Now, their collaboration with BigML is helping them turn their ideas into reality.

Claire Global

Claire is a marketplace devoted to the B2B food trading business. It is purpose-built to implement Machine Learning-driven solutions to optimize the buying and selling processes that are core to the wide-reaching global food industry. Since its launch in January 2020, the marketplace has been in ‘Open BETA’ with interest from a diverse set of companies. Check out the introductory video below (or this one in Spanish) to find out more about this innovative project pushing the envelope as far as B2B marketplaces can go.

The project is focused on increasing customer conversions and, most importantly, customer engagement. This becomes possible by adding valuable new functionality to the platform while actively supporting a highly heterogeneous group of user personas. All this to promote more activity on the platform and capture all the relevant inputs from customers, products, and transactions facilitated. This data, in turn, allows the team to implement the Machine Learning capabilities that add further value to the platform and its users.

Here are some of the most interesting optimization Machine Learning use cases we are exploring in this next-generation commerce environment:

  • Automated Product Recommendations: Selections based on customer data (e.g., prior transactions, web navigation patterns, user segmentation) and product data (e.g., product attributes, product similarities, purchase history) are key in making personalized offers to customers. By better “knowing” your users, you can provide them with highly relevant B2B information to further enhance their purchasing experience. This results in more repeat usage and customer loyalty over time. 
  • Optimal Pricing Suggestions for Sellers: Finding the optimal price point is a common problem in retail due to the high amount of parameters that can be considered when doing so, e.g., competitive dynamics, customer feedback, seasonality, current demand. There are also a wide variety of pricing strategies that can be chosen depending on the objectives of the retailer. For instance, maximizing profitability, accessing a new market, implementing dynamic pricing, etc. The use of predictive models for price optimization is quite attractive to cover all these possible different pricing scenarios.
  • Stock Management for Buyers and Sellers: Historical sales data can be very useful in order to extract sales trends and seasonality effects. This, together with some external data such as upcoming events or geographical location, can provide a producer with the best information on how to distribute its products among different warehouse locations according to the quantities it will sell in different areas. Supermarkets or hotel chains can also benefit by predicting the optimal quantity of a certain product they must acquire per location.
  • Anomaly Detection: This is a key technique for tasks such as flagging suspicious customer behavior to prevent fraud and checking for data inconsistencies by spotting pricing mistakes and other data integrity issues that would otherwise go unnoticed.

In short, we enthusiastically invite those of you in the food industry to actively participate in this new and exciting digital endeavor!

Machine Learning in Construction: Predicting Oil Temperature Anomalies in a Tunnel Boring Machine

Today, we continue our series of blog posts highlighting presentations from the 2nd Edition of Seville Machine Learning School (MLSEV). You may read the first post about the ‘6 Challenges of Machine Learning’ here.

One of the very interesting real-world use case presentations during the event was that of Guillem Ràfales from SENER. Founded in 1956 in Spain, SENER is a multi-national private engineering and technology group active in a set of diverse industrial activities such as construction, energy, environment, aerospace, infrastructure and transport, renewables, power, oil & gas, and marine.

SENER Projects

SENER’s Tunnel Construction Projects

Under its construction activities, SENER has successfully completed 19 large scale tunnel boring projects amounting to 80 kilometers of urban tunnels and a total of 224 kilometers of tunnels in the last 20 years. A great example is the high-speed railway service project in Barcelona. SENER delivered the 5.25 km segment near Gaudi’s architectural masterpiece Basílica de la Sagrada Família, a UNESCO World Heritage site.

Tunnel Boring BCN

Select technical specs of SENER’s project in Barcelona, Spain

Tunnel Boring Machines (TBMs) are used to perform rock-tunneling excavation by mechanical means. The main bearing of a TBM is the mechanical core of the colossal machine. It enables the cutter head to turn and transmits the machine’s torque to the terrain. It is critical to keep the bearing properly lubricated at all times, often to the tune of 5,000 liters of oil. One of the ways to monitor TBM performance is to analyze the physical and chemical properties of the lubricant oil at regular intervals.


The operational benefits of applying Machine Learning and advanced analytics in the context of TBMs can be summed up as avoiding unnecessary wear, costly equipment breakdowns, and overall suboptimal performance that may result in cost and project delivery overruns.


With this consideration in mind, BigML has worked closely with SENER engineering teams to build models that predict variations in the gear oil temperature of their TBMs. The two main objectives of the project were to:
  • understand how various internal TBM parameters are related to temperature changes
  • try and predict such temperature changes to avoid machinery wear or failure

The team worked on a large dataset from a past SENER project that contained hundreds of measurements internal to TBM operations, sampled every 10 seconds. Some of the key measurements included torque, speed, pressure, and chamber material attributes. The fact that notable oil temperature variations tend to take place gradually and infrequently added to the overall challenge in the form of a highly unbalanced dataset. Despite this, BigML’s feature engineering and algorithmic learning resources were put to great use. The team was able to uncover key insights with the help of Association Discovery during the data exploration phase, followed by Anomaly Detection and Classification modeling that ultimately helped SENER technicians isolate an important subset of instances where oil temperature increases could be anticipated in advance. The entire custom workflow was captured in the BigML platform for traceability, easy re-training, and automation purposes, as seen in the plot below.

BigML for Construction

Custom BigML workflow for the SENER project.

If you’ve hung around thus far, it’s time to take a more in-depth look into this exciting project pushing the limits of Machine Learning-powered smart applications in the field of Construction Engineering. The end-to-end Machine Learning process underlying this endeavor was managed and presented by BigML’s own Guillem Vidal. Now, please click on the YouTube video below and/or access the slides on our SlideShare channel:

Do you have a Construction Engineering challenge?

Depending on your specific needs, BigML provides expert help and consultation services in a wide range of formats, from customized assistance to turn-key smart application delivery, all built on BigML’s market-leading Machine Learning platform. Do not hesitate to reach out to us anytime to discuss your specific use case.


MLSEV Conference Videos: ‘Six Challenges of Machine Learning’

We really enjoyed virtually hosting thousands of business professionals, developers, analysts, academics, and students during the two jam-packed days of training last week as part of the Seville Machine Learning School. To us, it was one more piece of evidence that Machine Learning is a global phenomenon that will keep positively impacting all kinds of industries as the world economy recovers from the effects of Novel Coronavirus.

As promised during the event, below are useful links to the material covered during MLSEV for your review and self-study as well as related pointers for follow up actions you can take.



One of the cornerstones of MLSEV was BigML Chief Scientist Professor Tom Dietterich’s presentation on the State of the Art in Machine Learning. Professor Dietterich specifically talked about the Six Challenges in Machine Learning, providing the historical perspective for each point as well as the present-day state of affairs as it applies to advances in research. These six challenges are:

  1. Generalization
  2. Feature Engineering
  3. Explanation and Uncertainty
  4. Uncertainty Quantification
  5. Run-time Monitoring
  6. Application-Specific Metrics


The video above is a must-watch to find out more on each topic and get caught up with some of the best new ideas the Machine Learning research community has been able to offer in recent years. For brevity, we’d like to single out Explanation and Uncertainty, which is near and dear to our hearts at BigML.

ML Explanation and Uncertainty

In order for interpretability to make a difference, we need to refine the context in which explanations are needed. In many cases, that means understanding the end-user persona who will consume the explanations of the model and its predictions. Sometimes the persona is a Machine Learning engineer worried about higher-order concepts like model performance metrics and overfitting. Other times, the end-user may be a frontline worker or end-consumer looking for a simple cue (e.g., a recommendation for a similar product that’s in stock). A world-class smart application should be able to discern between the two scenarios while satisfying the needs of both sets of users.

We urge you to watch the rest of the video and find more about these key topics. One more thing…in the coming weeks, we will be covering other focal topics and themes from MLSEV as part of this blog series, so stay tuned!

The 2nd Edition of Seville Machine Learning School gathers thousands virtually

Due to the COVID-19 (Novel Coronavirus) pandemic, we’re living through unprecedented times that have put life on hold practically everywhere on earth. This very much applies to business gatherings such as conventions, symposiums, training programs, and conferences as well. The 2nd Edition of our Machine Learning School in Seville was one of many such events threatened with outright cancellation. As the BigML team, we had to think on our feet and react quickly to the turn of events. We promptly decided to deliver the content virtually, for FREE, instead of telling the registrants, “Sorry folks, we’ll see you next time.” To be honest, we weren’t sure how the virtual event would be received given that people had a whole new set of priorities.

To our delight, thousands more than originally planned for responded positively to the changes, as if to say, “We won’t let this public health crisis keep us from our longer-term business and career goals!” As we got close to the event, we ended up with nearly 2,500 registrations from 89 different countries, representing all continents but Antarctica! Over 900 businesses from a diverse set of industries and close to 500 educational institutions made up our body of registrants. Overall, we observed a healthy 60%/40% split between business and academia, respectively.

MLSEV Registrations

When the event started on March 26 at 9 AM Central European Time, we were pleased to host nearly 1,600 attendees live. They were happy to give us a shout-out on Twitter too, seemingly enjoying the fact that they were making the best out of their situation. Considering that CET is not the most convenient time zone in many other parts of the world, like North America, this was quite amazing!

MLSEV Tweets

MLSEV attendees sharing their experiences on Twitter

The high and participatory level of attendance from Day 1 fortunately carried over to Day 2, thanks to our distinguished mix of speakers, ranging from BigML’s Chief Scientist, one of the fathers of Machine Learning, Professor Tom Dietterich, to BigML customers (and partners) presenting real-life use cases, as well as our experienced instructors delving into state-of-the-art techniques on the BigML platform.

MLSEV Speakers

MLSEV Speakers connecting with the audience

Virtual Conferences are Here to Stay

Given this fresh experience in putting together our first virtual conference, we have a feeling this may be the wave of the future in a post-COVID-19 world that may drastically alter business travel habits and further limit opportunities to make contact in shared physical spaces. While one can easily make a case that human interactions online are not the same as in real life, we must also recognize that there is a different set of advantages to being virtual. Virtual events are perhaps best described not as perfect substitutes in that sense, but rather as adjacent branches of the same tree. For those aspiring to organize virtual events in the future, here are a few pointers to take into account:

  • Coronavirus or not, life goes on. It pays off to have a parallel virtual event delivery plan even if your event is sticking with the good old co-location format.
  • Time zone differences are absolutely key for virtual conferences given spread-out speakers and attendees. Try and find the best balance based on the expected geographical center of gravity of your ideal audience.
  • Practice makes perfect. It’s best to schedule multiple dry-runs with each speaker prior to the event.
  • Have a ‘Plan B’ in case connection issues surface, and be ready to shuffle content around to avoid leaving people twiddling their thumbs and waiting for something to happen. Experienced moderators are key to carrying out those transitions smoothly.
  • Make it (nearly) FREE! Unless your business model revolves around selling event tickets, it’s better to convey your message to a larger audience that has taken the time to register (for FREE) for your event. Attention is the most valuable currency in a world where content is constantly doubling.
  • Count on word of mouth to spread the message more so than traditional marketing channels. If your value proposition is strong enough people will show up.
  • Hands-on experiences beat dry, theoretical presentations online, as most people can follow the steps involved in a virtual demo session (e.g., a Machine Learning industry use case) free of distractions in their home office or other personal space, provided that you give them access to the necessary tools. We made it a point to mention that attendees could take advantage of the FREE-tier BigML subscription at the beginning of the event.
  • Overcommunicate. This applies equally before, during and after the event. Tools like Slack, Mailchimp, your blog and social channels help make up for the lack of physical contact.
  • Simulate the real world as appropriate. We put together four parallel “Meet the Speaker” Google Meet sessions in between regularly scheduled presentation/demo sessions to simulate the coffee breaks in physical spaces and they turned out pretty popular.
  • The all too familiar linear narrative is broken in the online world so it’s best to embrace non-linearity by breaking your video and/or slide contents into digestible pieces and sharing them online shortly after the event.

Are you planning a Machine Learning themed event in 2020?

Let us know about your idea, and we’d be happy to collaborate on it to better serve your audience.

Meanwhile, stay safe and carry on!

Machine Learning Benchmarking: You’re Doing It Wrong

I’m not going to bury the lede: Most machine learning benchmarks are bad.  And not just kinda-sorta nit-picky bad, but catastrophically and fundamentally flawed. 

ML Benchmarking

TL;DR: Please, for the love of statistics, do not trust any machine learning benchmark that:

  1. Does not split training and test data
  2. Does not use the identical train/test split(s) across each algorithm being tested
  3. Does not do multiple train/test splits
  4. Uses fewer than five different datasets (or appropriately qualifies its conclusions)
  5. Uses fewer than three well-established metrics to evaluate the results (or appropriately qualifies its conclusions)
  6. Relies on one of the services / software packages being tested to compute quality metrics
  7. Does not use the same software to compute all quality metrics for all algorithms being tested
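Rules 1 through 3 can be sketched in a few lines: generate the train/test splits once, with fixed seeds, and hand the identical splits to every algorithm being compared. The model names below are placeholders, not real implementations:

```python
import random

def split_indices(n_rows, test_fraction, seed):
    """Return (train, test) index lists; the seed makes the split repeatable."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    cut = int(n_rows * (1 - test_fraction))
    return idx[:cut], idx[cut:]

n_rows, n_replications = 100, 5

# One list of splits, generated up front, shared by every algorithm:
splits = [split_indices(n_rows, 0.2, seed) for seed in range(n_replications)]

# Each algorithm under test is trained and evaluated on these same splits,
# never on splits of its own making.
for algorithm in ["model_a", "model_b"]:          # placeholder names
    for train, test in splits:
        pass  # train `algorithm` on rows `train`, evaluate on rows `test`
```

Because the seeds are fixed, anyone re-running the benchmark reproduces exactly the same splits, and no algorithm gets a luckier draw than another.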

Feel free to hurl accusations of bias my way. After all, I work at a company that’s probably been benchmarked a time or two. But these are rules I learned the hard way before I started working at BigML, and you should most certainly adopt them and avoid my mistakes.

Now let’s get to the long version.

Habits of Mind

The term “data scientist” has had its fair share of ups and downs over the past few years. It can be at once a label for a person whose technical skills are in high demand and a code word for an expensive charlatan. Just the same, I find it useful, not so much as a description of a skill set, but as a reminder of one quality you must have in order to be successful when trying to extract value from data. You must have the habits of mind of a scientist.

What do I mean by this? Primarily, I mean the intellectual humility necessary to be one’s own harshest critic. To treat any potential success or conclusion as spurious and do everything possible to explain it away as such. Why? Because often that humility is the only thing between junk science and a bad business decision. If you don’t expose the weaknesses of your process, putting it into production surely will.

Few places make this more obvious than the benchmarking of machine learning software, algorithms, and services, where weak processes seem to be the rule rather than the exception. Let’s start with a benchmarking fable.

A Tale Of Two Coders

Let’s say you are the CEO of a software company composed of you and two developers. You just got funding to grow to 15. Being the intrepid data scientist that you are, you gather some data on your two employees.

First, you ask each of them a question: “How many lines of code did you write today?”

“About 200.” says one.

“About 300.” says the other.

You lace your fingers and sit back in your chair with a knowing smile, confident you have divined which is the more productive of the two employees. To uncover the driving force behind this discrepancy, you examine the resumes of the two employees. “Aha!” you say to yourself, the thrill of discovery coursing through your veins. “The superior employee is from New Jersey and the other is from Rhode Island!” You promptly go out and hire 12 people from New Jersey, congratulating yourself the entire time on your principled, data-driven hiring strategy.

Of course, this is completely crazy. I hope that no one in their right mind would actually do this. Anyone who witnessed or read about such a course of action would understand how weak the drawn conclusions are.

And yet I’ve seen a dozen benchmarks of machine learning software that make at least one of the same mistakes.  These mistakes generally fall into one of three categories that I like to think of as the three-legged stool for good benchmarking: 

  • Replications: the number of times each test is replicated to account for random chance
  • Datasets: the number and nature of the datasets you use for testing
  • Metrics: the way you measure the result of the test

Let’s visit these in reverse order with our fable in mind.

3 legged stool

#3: Metrics

Probably the biggest problem most developers would have with the above story is the use of “lines of code generated” as a metric to determine developer quality. These people aren’t wrong: Basically everyone concludes that it is a terrible metric.

I wish that people doing ML benchmarks would bring this level of care to their metric choices. For instance, how many of the people who regularly report results in terms of area under an ROC curve (AUC) are aware that there is research showing the metric to be mathematically incoherent? Or that when you compare models using the AUC, you’ll often get results that are opposite those given by other, equally established metrics? There isn’t a broad mathematical consensus on the validity of the AUC in general, but the arguments against it are sound, so if you’re making decisions based on AUC, you should at least be aware of some of the counter-arguments and see if they make sense to you.

And the decision to use or not use an individual metric isn’t without serious repercussions. In my own academic work prior to joining BigML, I found that, in a somewhat broad test of datasets and classifiers, I could choose a metric that would make a given classifier in a paired comparison seem better than the other in more than 40% of possible comparisons (out of tens of thousands)!  The case where all metrics agree with one another is rarer than you might think, and when they don’t agree the result of your comparison hinges completely on your choice of metric.
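A toy example makes the disagreement concrete. The probabilities below are hand-crafted so that one model wins on threshold accuracy while the other wins on AUC, using only stdlib Python:

```python
# Hand-crafted case where two common metrics disagree about which model
# is better: model A wins on 0.5-threshold accuracy, model B on AUC.
y = [1, 1, 1, 0, 0, 0]
probs_a = [0.6, 0.6, 0.3, 0.4, 0.4, 0.9]     # decent at the threshold, poor ranking
probs_b = [0.45, 0.45, 0.45, 0.4, 0.4, 0.4]  # perfect ranking, all below threshold

def accuracy(y, probs, threshold=0.5):
    """Fraction of examples classified correctly at the given threshold."""
    return sum((p > threshold) == bool(t) for t, p in zip(y, probs)) / len(y)

def auc(y, probs):
    """Fraction of (positive, negative) pairs ranked correctly (ties count half)."""
    pos = [p for t, p in zip(y, probs) if t]
    neg = [p for t, p in zip(y, probs) if not t]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(accuracy(y, probs_a), auc(y, probs_a))  # A: higher accuracy (4/6), lower AUC (4/9)
print(accuracy(y, probs_b), auc(y, probs_b))  # B: lower accuracy (3/6), perfect AUC (9/9)
```

Which model is “better” here depends entirely on which metric you trust, which is exactly why a benchmark reporting only one of them can be misleading.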

The main way out of this is to be either more specific or less specific about your choice of metric. You might make the former choice in cases where you have a very good idea of what your actual, real-world loss or cost function is. You might, for example, know the exact values of a cost matrix for your predictions. In this case, you can just use that as your metric, and it doesn’t matter whether this metric is good or bad in general; it’s by definition perfect for this problem.

If you don’t know the particulars of your loss function in advance, another manner of dealing with this problem is to test multiple metrics. Use three or four or five different common metrics and make sure they agree at least on which algorithm is better. If they don’t, you might be in a case where it’s too close to call unless you’re more specific about what you want (that is, which metric is most appropriate for your application).
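The multi-metric check can be sketched with scikit-learn. This is a minimal illustration of my own (the dataset and pair of models are arbitrary stand-ins, not from any particular benchmark): score two models with several common metrics and see whether every metric names the same winner.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, log_loss

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {"logistic": LogisticRegression(max_iter=1000),
          "forest": RandomForestClassifier(random_state=0)}
scores = {}
for name, model in models.items():
    prob = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    pred = (prob > 0.5).astype(int)
    scores[name] = {"accuracy": accuracy_score(y_te, pred),
                    "auc": roc_auc_score(y_te, prob),
                    "f1": f1_score(y_te, pred),
                    # negate log loss so higher is better for every metric
                    "neg_log_loss": -log_loss(y_te, prob)}

# The comparison is only clear-cut if every metric picks the same model.
winners = {metric: max(scores, key=lambda n: scores[n][metric])
           for metric in scores["logistic"]}
print(winners)
```

If `winners` doesn’t map every metric to the same model name, you’re in exactly the too-close-to-call situation described above.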

But there’s an even worse and more subtle problem with the scenario above. Notice that the CEO doesn’t independently measure the lines of code that each developer is producing. Instead, he simply asks them to report it. Again, an awful idea.  How do you know they’re counting in just the same way? How do you know they worked on things that were similarly difficult? How do you know neither of them is lying outright?

Metrics we use to evaluate machine learning models are comparatively well defined, but there are still corner cases all over the place. To take a simple example, when you compute the accuracy of a model, you usually do so with respect to some threshold on the model’s probability prediction. If the threshold is 0.5, then the logic is something like “if the predicted probability is greater than 0.5, predict true; if not, predict false”. But depending on the software, you might get “greater than or equal to” instead. If you’re relying on different sources to report metrics, you might hit these differences, and they might well matter.
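A toy illustration of my own, with the predicted probabilities deliberately placed at the boundary, shows how much this convention can move the number:

```python
probs = [0.3, 0.5, 0.5, 0.8]   # two predictions sit exactly at 0.5
truth = [0, 1, 1, 1]

pred_gt = [int(p > 0.5) for p in probs]    # strictly greater than
pred_ge = [int(p >= 0.5) for p in probs]   # greater than or equal

def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

print(accuracy(pred_gt, truth))  # 0.5 -- boundary cases predicted false
print(accuracy(pred_ge, truth))  # 1.0 -- boundary cases predicted true
```

Real models rarely emit probabilities of exactly 0.5, but rounding, discretized scores, and small test sets make boundary collisions more common than you’d hope.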

It almost goes without saying, but the fix here is just consistency, and ideally objectivity. When you compare models from two different sources, make sure the tools you use for evaluation are the same, and ideally not ones provided by either of the sources being tested. It’s a pain, yes, but if you’re comparing weights there’s just no way around buying your own scale. There are plenty of open-source reference implementations of almost any metric you can think of. Use one.

#2: Datasets

For the sake of argument, though, let’s assume that you have a good metric for measuring developer productivity. You’re still only measuring performance on the one thing each of your developers did yesterday! What if they’re writing python, and you’re hiring a javascript developer? What if you’re hiring a UI designer? What if you’re hiring a sales rep? Do you really think that the rules for finding a successful python developer will generalize so far?

Generalization can be a dangerous business. Those who have practiced machine learning for long enough know this from hard experience. Which is why it’s infuriating to see someone test a handful of algorithms on one or two or three datasets and then make a statement like, “As you can see from these results, algorithm X is to be preferred for classification problems.”

No, that’s not what I can see at all. What I see is that (assuming you’ve done absolutely everything else correctly), algorithm X performed better than the tested alternatives on one or two or three other problems. You might be tempted to think this is better than nothing, but depending on what algorithm you fancy you can *almost always* find a handful of datasets that show “definitively” that your chosen algorithm is the state of the art. In fact, go ahead and click on “generate abstract” on my ML benchmarking page to do exactly this!

This might seem unbelievable, but the reality is that supervised classification problems, though they seem similar enough on the surface, can be quite diverse mathematically. Dimensionality, decision boundary complexity, data types, label noise, class imbalance, and many other things make classification an incredibly rich problem space. Algorithms that succeed spectacularly with a dozen features fail just as spectacularly with a hundred. There’s a reason people still turn to logistic regression in spite of the superior performance of random forests and/or deep learning in the majority of cases: It’s because there are still a whole lot of datasets where logistic regression is just as good and tons faster. The “best thing” simply always has and always will depend on the dataset to which the thing is applied.

The solution here, as with metrics, is to be either more or less specific. If you basically know the shape and characteristics of every machine learning problem you’ll face in your job, and you have a reasonably large collection of datasets lying around that is nicely representative of your problem space, then yes, you can use these to conduct a benchmark that will tell you the best sort of algorithm for this subset of problems.

If you want to know the best thing generally, you’ll have to do quite a bit more work. My benchmark uses over fifty datasets and I’m still not comfortable enough with its breadth to say that I’ve really uncovered anything that could be said about machine learning problems as a whole (besides that it’s breathtakingly easy to find exceptions to any proposed rule). And even if rules could be found, for how long would they hold? The list of machine learning use cases and their relative importance grows and changes every day. The truth about machine learning today isn’t likely to be the truth tomorrow.

#1: Replications

Finally, and maybe most obviously: The entire deductive process in the fable above is based on only a single day of data from two employees. Even the most basic mathematical due diligence would tell you that you can’t learn anything from so few examples.

Yet there are benchmarks out there that try to draw conclusions from a single training/test split on a single dataset. Making decisions based on a point estimate of performance derived from a not-that-big test set is a problem for statistical reasons that aren’t even all that deep, which makes it a shame that single-holdout competitions of the sort that happen on Kaggle implicitly train novice practitioners to do exactly this.

How do you remedy this?  The blog post above suggests some simple statistical tests you can do based on the number of examples in the test set, which is fine and good and way, way better than nothing.  When you’re evaluating algorithms or frameworks or collections of parameter settings rather than the individual models they produce, however, there are more sources of randomness than just the data itself.  There are, for example, things like random seeds, initializations, and the order in which the data is presented to the algorithm.  Tests based on the dataset don’t account for “luck” with those model-based aspects of training.

There aren’t any perfect ways around this, but you can get a good part of the way there by doing a lot of train/test splits (several runs of cross-validation, for example), and varying the randomized parts of training (seed, data ordering, etc.) with each split.  After you’ve accumulated the results, you might be tempted to average together these results and then choose the algorithm with the higher average, but this obscures the main utility of doing multiple estimates, which is that you get to know something about the distribution of all of those estimates.
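The procedure above can be sketched as follows (a minimal illustration of my own, assuming scikit-learn and a synthetic dataset; the model is an arbitrary stand-in): repeat many train/test splits, vary the seed controlling both the split and the randomized parts of training each round, and keep the full distribution of scores rather than just the average.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, random_state=0)

scores = []
for seed in range(20):
    # A new split AND a new model seed each round, so both data luck and
    # training luck contribute to the spread we observe.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=seed)
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    scores.append(accuracy_score(y_te, model.fit(X_tr, y_tr).predict(X_te)))

scores = np.array(scores)
print(f"mean={scores.mean():.3f}  std={scores.std():.3f}  "
      f"range=[{scores.min():.3f}, {scores.max():.3f}]")
```

The standard deviation and range are the point of the exercise; the mean alone throws away the information you went to all that trouble to collect.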

Suppose, for example, you have a dataset of 500 points. You do five 80%/20% training/test splits of the data, and measure the performance on each split with two different algorithms (of course, you’re using the exact same five splits for each algorithm, right?):

Algorithm 1: [0.75, 0.9, 0.7, 0.85, 0.9].  Average = 0.820

Algorithm 2: [0.73, 0.84, 0.91, 0.74, 0.89].  Average = 0.822

Sure, the second algorithm has better average performance, but given the swings from split to split, this performance difference is probably just an artifact of the overall variance in the data. Stated another way, it’s really unlikely that two algorithms are going to perform identically on every split, so one or the other of them will almost certainly end up being “the winner”. But is it just a random coin flip to decide who wins? If the split-to-split variance is high relative to the difference in performance, it gives us a clue that it might be.

Unfortunately, even if a statistical test shows that the groups of results are significantly different, this is still not enough by itself to declare that one algorithm is better than another (this would be abuse of statistical tests for several reasons).  However, the converse should be true: If one algorithm is truly better than another in any reasonable sense, it should certainly survive this test.
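With the five paired scores from the example, such a test takes only a few lines. This sketch uses SciPy’s Wilcoxon signed-rank test, one reasonable choice of paired test, though not the only one:

```python
from scipy.stats import wilcoxon

algo1 = [0.75, 0.90, 0.70, 0.85, 0.90]
algo2 = [0.73, 0.84, 0.91, 0.74, 0.89]

# A paired test on the per-split differences asks whether the observed
# gap could plausibly be chance given the split-to-split variance.
stat, p_value = wilcoxon(algo1, algo2)
print(f"p-value = {p_value:.3f}")
```

A large p-value here means the test found no evidence of a difference, not that the two algorithms perform the same; and per the caveat above, even a small p-value is necessary, not sufficient, for declaring a winner.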

What, then, if this test fails?  What can we say about the performance of the two models?  This is where we have to be very careful. It’s tempting to dismiss the results by saying, “Neither, they’re about the same”, but the more precise answer is that our test didn’t give evidence that the performance of either one was better than the other.  It might very well be the case that one is better than the other and we just don’t have the means (the amount of data, the resources, the time, etc.) to do a test that shows it properly. Or perhaps you do have those means and you should avail yourself of them.  Beware the trap, however, of endless fiddling with modeling parameters on the same data.  For lots of datasets, real performance differences between algorithms are both difficult to detect and often too small to be important.

For me, though, the more interesting bit of this analysis is again the variance of the results. Above we have a mean performance of 0.82, with a range of 0.7 to 0.9.  That result is quite different from a mean performance of 0.82 with a range of 0.815 to 0.823. In the former case, you’d go to production with a good bit of uncertainty around the actual expected performance. In the latter, you’d expect the performance to be much more stable.  I’d say it’s a fairly important distinction, and one you can’t possibly see with a single split.

There are many cases in which you can’t know with any reasonable level of certainty if one algorithm is better than another with a single train/test split. Unless you have some idea of the variance that comes with a different test set, there’s no way to say for sure if the difference you see (and you will very likely see a difference!) might be a product of random chance.

Emerge with The Good Stuff

I get it. I’m right there with you. When I’m running tests, I want so badly to get to “the truth”. I want the results of my tests to mean something, and it can be so, so tempting to quit early. To run one or two splits on a dataset and say, “Yeah, I get the idea.” To finish as quickly as possible with testing so you can move on to the very satisfying phase of knowing something that others don’t.

But as with almost any endeavor in data mining, the landscape of benchmarking is littered with fool’s gold. There are so very many tests one can do that are meaningless, where the results are quite literally worth less than doing no test at all. Only if one is careful about the procedure, skeptical of results, and circumscribed in one’s conclusions is it possible to sort out the rare truth from the myriad of ill-supported fictions.

Changes to the 2020 Machine Learning School in Seville (VIRTUAL CONFERENCE)

Like most event organizers, we have been monitoring the progress of COVID-19 in Spain and around the world on a daily basis. Given the proximity of our Machine Learning School and the questions we have been receiving, we’d like to share some important changes to the event.

We have been trying to collect data about how the virus is evolving, but because the number of tests is not published, it’s difficult to build a reliable model that goes beyond a time series. In any case, even if the current impact in Seville is low as temperatures are already high, the number of infected people in Spain is likely to increase significantly until the measures recently put in place take full effect. Your well-being is our utmost concern. Therefore, beyond the evolution of the disease in our geographical area, we must also consider the potential regulations or individual company guidelines that may be put in place right before, during, or following our event. Our interpretation is that the most common recommendation is to generally avoid traveling.

Virtual Conference

As such, we have decided that ML School Seville 2020 will take place according to the planned schedule, but only virtually. The lectures will be delivered via live webinars from different parts of the world to our registrants in many different locales. The conference will include a number of expert moderators who will compile questions during the sessions, and you will also have the opportunity to talk with our presenters LIVE. In addition, during coffee breaks, you can attend smaller sessions (maximum 15 people) with the speakers. As usual, MLSchool session materials will be made available on the SlideShare platform.

Moreover, we will be issuing refunds shortly to all our attendees so their participation will effectively be free of charge.

A special thank you to our speakers and those who have been steadfastly assisting our organization. We are doing our best to make this virtual conference experience a satisfying one given the circumstances.  As they say, when life gives you lemons, make virtual lemonade! With that spirit in mind, we ask you to lend us a hand in passing on the news and inviting friends or colleagues who may be interested in getting better at Machine Learning from the comfort of their home or office.

You will receive a message shortly with the relevant instructions on how to connect to the event on the mornings of March 26 & 27.


~The BigML Team

Registration Open for 2nd Edition of Machine Learning School in The Netherlands: July 7-10, 2020

We’re happy to share that we are launching the Second Edition of our Machine Learning School in The Netherlands in collaboration with the Nyenrode Business School and BigML Partner, INFORM.  This edition follows in the footsteps of the well-attended First Edition and will take place on July 7-10, 2020 in Breukelen, near Amsterdam. It’s the ideal learning setting for professionals who wish to solve real-world problems by applying Machine Learning in a hands-on manner. This includes analysts, business leaders, industry practitioners, and anyone looking to boost their team’s productivity by leveraging the power of automated data-driven decision making.

Dutch ML School 2020

The four-day #DutchMLSchool kicks off with its Machine Learning for Executives program on Day 1 and continues with a jam-packed introductory two-day course optimized to teach the basic Machine Learning concepts and techniques that are applicable in a multitude of industries. On Day 4, the curriculum concludes with a hands-on Working with the Masters track that allows attendees to put into practice the techniques taught in the days prior by implementing an end-to-end use case starting with raw data.


Nyenrode Business University, Straatweg 25, 3621 BG Breukelen. See the map here.


4-day event: July 7-10, 2020 from 8:30 AM to 5:30 PM CEST.


Please follow this Eventbrite link to order your tickets. Act today to take advantage of our early bird rates and save. We recommend that you register soon since space is limited and the event may sell out quickly.


Lecturer details can be accessed here and the full agenda will be published as the event nears.

Beyond Machine Learning

In addition to the core sessions of the course, we wish to get to know all attendees better. As such, we’re organizing the following activities for you and will be sharing more details shortly:

  • Genius Bar. A one-on-one appointment to help you with your questions regarding your business, use cases, or any ideas related to Machine Learning. If you’re coming to the Machine Learning School and would like to share your thoughts with the BigML Team, be sure to book your 30-minute slot by contacting us at
  • International Networking. Meet the lecturers and attendees during the breaks. We expect local business leaders and other experts coming from European locales as well as from other countries around the globe.

We look forward to your participation!

