
PreSeries at 4YFN: a Good Beginning for a Great Year Ahead

PreSeries, the joint venture between Telefónica Open Future_ and BigML, has been invited to join the exclusive Four Years From Now (4YFN) event. 4YFN is the startup-focused platform of the Mobile World Congress that enables investors and corporations to connect with chosen entrepreneurs to launch new ventures together. As part of this year’s 4YFN agenda, we will also be presenting the fourth edition of the Artificial Intelligence Startup Battle, which takes place on Tuesday, February 28, on the main stage of 4YFN in Barcelona, Spain. More than 500 technologists and decision makers will witness the power of the PreSeries Machine Learning algorithms, which predict the probability of success of any startup, even at an early stage. There won’t be any humans involved in deciding the winner: PreSeries AI is the sole jury, as we have showcased in previous battles.


If you plan on going to 4YFN, do not hesitate to come by the PreSeries booth at the Telefónica Open Future_ stand, where we’ll be ready to answer all your questions regarding our technology, product, new features, etc., from today Monday February 27 until Wednesday March 1.

Meet the contenders!

The four contenders will each give a four-minute pitch to the audience, followed by PreSeries asking them a number of questions in order to assign them a score between 0 and 100. At this point, you may be wondering who will be competing in the battle, so let’s get to know the contenders!

Action AI

Action.ai, from London, develops smart chatbots that revolutionize interaction. They transform business processes and user experiences by changing the way people accomplish tasks in both their personal and professional lives. Action.ai’s technology enables industry-leading services to be launched without huge capex or in-house expertise in AI or chatbots.



Descifra, from Mexico City, is an online service that helps businesses to understand the characteristics of the markets around them through easy to grasp charts, tables, and maps. They advise other companies on where to build their business by taking into consideration the level of competition and the market characteristics in a certain geographic area.

People.io, also from London, gives people ownership of their data to enable the next phase in the evolution of human connectivity. Users of People.io earn credits each time they take an action, such as answering a question or connecting a new data source like their email or bank accounts. As the type and amount of data associated with a given user increases, People.io starts to match users with relevant brands and advertisers. Each time a user is “matched”, they receive a brief update from the advertiser, which offers the user extra credits for viewing or engaging with their content. However, at no point in the whole process does a brand or advertiser get direct access to the user’s profile data.



Pixoneye, from London and Tel Aviv, is a unique company with the ability to analyze the untapped power of users’ mobile photo galleries on behalf of their clients. In doing so, the company provides best-in-market real-time behavioral understanding and targeting capabilities for their clients’ ads, offers and services. The Pixoneye team consists of some of the leading minds in computer vision and deep learning, and has recently been named one of the five AI companies to watch out for in 2017.

These four contesting startups will be showcased on Tuesday February 28, at the main stage of 4YFN. For those that can’t make it to the live event, our subsequent blog posts will share the results of the fourth edition of the AI Startup Battle. Good luck to all contenders!

About PreSeries events and battles

PreSeries was born in March 2016 and officially started its journey of re-imagining early-stage technology venture investing by taking an unorthodox, AI-driven approach, showcased via the world premiere of the Artificial Intelligence Startup Battle at the PAPIs Connect conference in Valencia. Thanks to the continuation of AI Startup Battles in Boston and Brazil, as well as our collaboration at WIRED London 2016, the startup community has a new-found appreciation of how Machine Learning can be utilized to better allocate venture capital.


Be sure to follow the upcoming PreSeries battles in 2017 as they will be announced on the BigML events page, among other venues. These will include the battles that we will present at PAPIs ‘17 and PAPIs Connect. For more details, please stay tuned by following us on: LinkedIn, Google+, Facebook, or Twitter. The countdown starts now!

Predicting the 2017 Oscar Winners

Machine Learning is accelerating its transition from academia to industry. We see more and more media outlets reporting about it, but most of the time they focus exclusively on the final results and not on all the human-powered tasks that happen behind the scenes and that really make the magic possible. So for most people Machine Learning continues to be some sort of elusive magic. We were recently approached by One, the Vodafone-sponsored section of El País, to explain how Machine Learning works and, after giving it some thought, we decided to explain it using a simple example in a domain everyone is familiar with. As the 89th annual Academy of Motion Picture Arts and Sciences awards ceremony draws near and movie fans all over the world are getting ready for their office pools, we couldn’t resist the temptation to take a stab at predicting the 2017 Oscars by applying some BigML-powered Machine Learning, courtesy of our own Teresa Álvarez and Cándido Zuriaga.

Of course, picking Oscar winners remains a favorite pastime for many people every winter. As usual, there’s no shortage of opinions ranging from those of movie critics to POTUS Donald J. Trump to established media outlets that publish more data-driven analysis. One thing that many of these crystal balls have in common is the fact that none of them give the reader access to the underlying data, logic or models. Time for us to change that for the better!

Caveat Emptor

No model is perfect, so before we go ahead and reveal our picks, a word of caution is in order.

  • Our main objective with this exercise is to demonstrate the process that is usually followed in order to make a prediction by using Machine Learning.

  • The Oscars nominees and winners are selected by vote of the Academy members. To properly model this problem, we should also model the Academy members and all the factors that influence how they pick and choose their favorite movies.  However, we have restricted our effort to publicly available information about the movies and not about the Academy members.

  • Furthermore, the nature of the problem itself is ever evolving. The Academy is not a monolithic structure and the body of membership, the rules that apply to the nomination, and voting processes are subject to change over the years. A great example of that is the recent introduction of a new batch of Academy members in response to the complaints on the lack of diversity. So past behavior is not always the best predictor of future behavior.

  • Finally, tastes change. One can argue that “The Academy” has stronger roots in tradition than other movie industry awards, but we can’t deny that what works in one era is not guaranteed to translate to another without any changes. In our digital age, it is no longer a pipe dream to imagine a small-budget art house movie released with the right timing at the right festivals riding a big wave of word-of-mouth promotion on social media to finally steal the show from the blockbusters that major movie studios bankroll. The times they are a-changin’!

So let’s begin!

1. Problem Definition and Context Understanding

Stating the predictive problem is the most critical step in any Machine Learning workflow, as it totally shapes the rest of our solving process. Predicting an Oscar winner can be modeled as a classification task, that is, we need to create a predictive model that given a movie released in 2016 will output ‘yes’ when it predicts that the movie will win the Oscar and ‘no’ otherwise.  In predicting this year’s Oscar winners, we decided to limit our predictions to only 8 out of the 24 awarded categories.
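As a minimal sketch of this framing, the snippet below trains a tiny yes/no classifier. It uses scikit-learn and invented field names and values purely for illustration; the actual project was built on BigML.

```python
# Minimal sketch of the classification framing: each row is one nominated
# movie, and the objective field "winner" is 'yes' or 'no'.
# Feature names and values here are invented toy data.
from sklearn.ensemble import RandomForestClassifier

# toy rows: [golden_globe_win, bafta_win, imdb_rating]
X_train = [[1, 1, 8.3], [0, 0, 7.1], [1, 0, 7.9], [0, 1, 8.0], [0, 0, 6.5]]
y_train = ["yes", "no", "no", "yes", "no"]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# predict for a hypothetical 2016 nominee
print(model.predict([[1, 1, 8.1]])[0])
```

The same framing is repeated once per award category, so eight such models are needed in total.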

The next step is to collect and prepare some data about movies and what made them winners in those categories in the past, as well as those same attributes for 2016 movies.  The more context and business understanding of the problem you have, the more prepared you are to decide what data to collect. A couple of business insights guided our data collection process:

  • Of the approximately 600 films nominated since 2000, 62% are from the USA, with an average budget of $50M, more than 3 times the typical European budget and 20 times that of Latin American productions.

Nominated Movies Distribution

  • The budget amount is correlated with subsequent income from the movie, but it does not seem to be strongly correlated to winning an Oscar. Moreover, for the analyzed period, the difference in the average budget between films that win Oscars and those that don’t varies wildly. So we are not expecting budget to be a significant factor in our models.

2. Data collection and data transformations

In virtually all Machine Learning projects, the most time consuming task is collecting and structuring data. In our case, due to time constraints, we anticipate that we have left out a lot of data that could be very valuable in making predictions. For example: actresses and actors with previous nominations or awards, or the number of Oscars previously received by the nominated director, scriptwriter, etc. It is also very important to select how far back in time your training data should go; not going far enough might mean missing something useful, but going too far back is going to pick up patterns that are probably no longer relevant (in the business we sometimes call this bad practice “doing archeology”). We decided to use movies between the years 2000 and 2016.

For that period of time, we have compiled a dataset that combines:

  • Movie metadata such as genre, year, budget etc. as well as user ratings and reviews from IMDb for the 50 most popular movies of each year
  • Each year’s nominations and winners of 20 key industry awards, including The Academy Awards, Golden Globes, BAFTA, Screen Actors Guild, Critics Choice, Directors Guild, Producers Guild, Art Directors Guild, Writers Guild, Costume Designers Guild, Online Film Television Association, People’s Choice, London Critics Circle, American Cinema Editors, Hollywood Film, Austin Film Critics Association, Denver Film Critics Society, Boston Society of Film Critics, New York Film Critics Circle, and Los Angeles Film Critics Association.

An added complexity is that, for data like ratings and reviews, it is difficult to determine if they were impacted by the fact that the movie was nominated for an Oscar or not. In other words, we don’t have the ability to reconstruct the exact timeline of our data’s construction.

We must also note that despite our best efforts to cleanse the dataset, there may still be some inaccuracies in the data itself. The final dataset that we have compiled is a wide one with a fairly small number of rows, due to the nature of the problem (after all, this is a once-a-year event with a different set of contestants each year). This makes models prone to noise and overfitting, even though our choice of ensembles as the algorithm mitigates this risk to some extent. You can get access to our input dataset via this BigML shared link or via BigML’s dataset gallery here.


3. Data Exploration

A good warm-up exercise in any predictive task is a visual perusal of the data. One such fishing expedition using BigML’s Association Discovery capabilities netted some interesting associations:

  • Nominees for best film are usually dramas and biographies and seldom action films. Among the winners, we did not find a strong correlation with genre since the nominees already belong to a tight group of genres.

  • When using Association Discovery to find the most important correlations between the Oscars and other awards, we saw especially notable correlations with the Golden Globes or the Critics Choice awards, among a few others.

  • As shown in the scatter plot below, the movies with the highest box office or the most votes do not always win the Oscar for Best Picture.
  • Something similar happens with user and critics reviews.

4. Feature Engineering

Most datasets can be enriched with some extra features, derived from existing ones, that can increase the predictive power of the data. In our case, since the unstructured movie reviews can be challenging to analyze, we ran all the IMDb user review data through a Topic Model analysis, which automatically discovers a set of topics that can be used to characterize each data row. Then, for each movie, the set of topics and their associated probabilities were added as new features to our dataset. All told, our Machine Learning-ready dataset is composed of 256 different fields for 1,152 movies produced since 2000. Once the dataset is ready, the modeling and evaluation tasks become easy-peasy-lemon-squeezy with BigML.

Oscars Topic Model
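The idea of turning review text into topic-probability columns can be sketched as follows. Here scikit-learn’s `LatentDirichletAllocation` stands in for BigML’s Topic Models, and the reviews and topic count are toy values, not the real project data.

```python
# Sketch: derive topic-probability features from review text, one row
# per movie, one column per discovered topic. Reviews are invented.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "a moving drama with stunning performances",
    "great action sequences but a thin plot",
    "the musical numbers are joyful and the acting superb",
]

counts = CountVectorizer().fit_transform(reviews)
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# each row is a probability distribution over topics, ready to be
# appended to the movie's other feature columns
topic_features = lda.fit_transform(counts)
print(topic_features.shape)
```

Each movie’s topic distribution is then joined onto its metadata and award columns before modeling.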

5. Modeling

Usually predictive modeling involves comparing and selecting the appropriate classification algorithms and their specific parameters. Most of this process can be fully automated, although you need to be aware of the hype around full-automation. In our case, after a few tries and given our limited historical data and the need to avoid overfitting, we opted for tree ensembles over a decision tree or logistic regression.  So we created 8 separate binary classification models (one per award category) with the objective field (the column we want to predict) being “winner”.
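The "one binary ensemble per category" setup can be sketched like this, again with scikit-learn standing in for BigML's ensembles and with placeholder categories and toy rows rather than the real 8 categories and 256 fields.

```python
# Sketch: train one binary ensemble per award category, each with
# objective field "winner". Categories, features and labels are toy data.
from sklearn.ensemble import RandomForestClassifier

categories = ["best_picture", "best_director"]  # 8 categories in the real project
X = [[1, 0], [0, 1], [1, 1], [0, 0]]            # toy feature rows
y_by_category = {
    "best_picture": ["yes", "no", "no", "no"],
    "best_director": ["no", "yes", "no", "no"],
}

models = {
    c: RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y_by_category[c])
    for c in categories
}
```

At prediction time, each category’s model is queried independently with the 2016 nominees for that category.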

6. Evaluation

To assess the predictive impact of each group of variables per category (e.g., metadata, ratings, reviews to unveil the Best Picture winner), we took a stepwise approach, where we made different predictions based on different ensembles built on different subsets of our dataset. This approach showed us the contribution of each type of data to the final combined prediction, and helped guide our efforts to pull together more data on certain aspects when needed. For instance, discovering that focusing on award data yields better results translated into the collection of even more historical award data for better final predictions.

To evaluate our classification models, we used the period between 2000-12 as the training period, and 2013-15 as the test period. We then input the data for 2016 nominees to the already validated ensembles to arrive at our final predictions.

Evaluation Process
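The time-based split described above can be sketched in a few lines; the `movies` list here is an invented stand-in for the real dataset.

```python
# Sketch of the time-based evaluation split: train on 2000-2012,
# test on 2013-2015, then predict the 2016 nominees.
movies = [(2005, "row_a"), (2013, "row_b"), (2016, "row_c"), (2010, "row_d")]

train = [m for m in movies if 2000 <= m[0] <= 2012]   # fit the ensembles here
test = [m for m in movies if 2013 <= m[0] <= 2015]    # measure performance here
predict = [m for m in movies if m[0] == 2016]         # final predictions

print(len(train), len(test), len(predict))
```

Splitting by year rather than at random keeps the evaluation honest: the model is never scored on award seasons it was allowed to see during training.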

Evaluation results demonstrated that combining all available variables yields the best results by essentially reducing the False Positives, while maintaining a very high True Positive hit rate.


7. The Predictions: Drum rolls please…

So let’s finally see what our models found out!

It seems everyone’s sentimental favorite musical La La Land will have a big prize to show for the record 14 nominations it received this year. When we delve into the major drivers behind this prediction, we observe that the Critics Choice nomination and award it received along with its other award performances (with the Producers Guild, Screen Actors Guild and BAFTA especially sticking out) explain why La La Land is the favorite with a pretty high F-measure to boot.

However, if we dig deeper and look at how non-award data have played into the predictions we see a different picture, where IMDb user reviews are favoring Fences, and award nominations are accentuating the small budget wonder Moonlight.  But when actual award winnings are added and all the factors are combined, La La Land emerges as the #1 pick.  Can we be in for a surprise Sunday night?  Maybe so, but it would have to be one epic upset, so the chances are rather low.

Best Movie Oscar Prediction

Now that we have lifted the veil of mystery regarding the most anticipated award of the night, let’s quickly recap the remaining categories.

No big surprises here with Damien Chazelle expected to pick up the Best Director award consistent with his success in the awards circuit—especially Directors Guild.


Best Actress will go to Emma Stone carrying the tally for La La Land even higher. Again, Golden Globe and Screen Actors Guild pickups are the biggest forces pushing her nomination forward.

This year’s Best Actor category is a close call between Casey Affleck (Manchester by the Sea) and Denzel Washington (Fences), thus the lower confidence prediction here. Screen Actors Guild award going to Denzel Washington definitely makes a difference for him, but Casey Affleck has picked up more awards in total and, per our model, that seems to help him get a slight edge.  We’ll only find out on Sunday whether The Academy’s judgement was at all clouded by the personal troubles of Mr. Affleck despite what most agree was a stellar acting performance.

Best Supporting Actress 2017

Viola Davis (Fences) is the popular favorite for Best Supporting Actress thanks mainly to her performance in BAFTA, Screen Actors Guild and New York Critics Circle awards.


Best Supporting Actor is another category with a number of viable choices. Our pick is Mahershala Ali (Moonlight). Most critics seem to be highlighting Mahershala Ali and Dev Patel as the favorites in this category as well. Mahershala Ali is somewhat handicapped by the model since he wasn’t able to pick up the Golden Globe for this category, which translates into a rather low confidence value for this prediction. Interestingly, the Nocturnal Animals actor Aaron Taylor-Johnson won the Golden Globe, but he is not even a nominee for the Oscars. In fact, in one of our preliminary models we made the mistake of attributing the Golden Globe win to Michael Shannon, and since he has been nominated for Best Supporting Actor, our model predicted him as the winner. This was a helpful reminder of how careful you need to be, not only in collecting and cleaning the right data for training your models, but also in making sure you input the right data at prediction time.

Best Original Screenplay is also very much in play with La La Land and Manchester by the Sea vying for the award.  La La Land seems to have a very slight edge on the back of its BAFTA success, but don’t count out Manchester by the Sea just yet.

Finally, Best Adapted Screenplay will likely go to Arrival since the movie did quite well within the award circuit for the category.

Lessons Learned

Besides being a fun undertaking, this exercise has been further testament to the power and importance of working with the right dataset and paying due attention to feature engineering. Being able to construct the best features remains the biggest return on time invested, especially in the presence of a solid Machine Learning platform like BigML where:

  • Some of the most versatile algorithms ever invented are offered via an intuitive interface (as well as a thorough API).
  • Scalability concerns are abstracted away for the end-user to concentrate on the analytical task at hand.
  • Flexible deployment options make it a breeze to operationalize chosen models working with the right data.

At the end of the day, feature engineering is the reflection of true expertise in a given domain into the models you build.

We hope you will find these predictions useful as you grab a glass of wine and follow Jimmy Kimmel kicking off the ceremonies on Sunday. Good luck with your picks!

Source: Zuberoa Marcos


STATS4TRADE Unlocks the Power of Machine Learning for Investors

The investment industry is an extremely competitive one, where fund managers work hard to demonstrate a strong track record that beats their respective benchmarks in order to justify their fees and to partake in the profits from their assets under management. For retail investors, recent years have been characterized by a significant shift towards passive instruments such as index funds, at the expense of actively managed funds that have struggled to justify higher expense ratios against a backdrop of volatile markets and the easy-money policies that interventionist central banks worldwide manufactured in response to the Great Recession.

Meanwhile, the abundance of financial market data has given birth to a new wave of startups looking to put Machine Learning to good use in order to create a sustainable market edge at lower cost. One such exciting company is STATS4TRADE, out of France. We have caught up with the Founder and CEO of STATS4TRADE to see how his company is innovating with advanced analytics.



BigML: Congrats on launching your startup Jean-Marc. Can you tell us what was the motivation behind founding STATS4TRADE?

Jean-Marc Guillard: It really starts with my conviction that the financial services industry is faced with drastic change in the coming years and actively-managed equity funds are not immune to that. Investors are rightfully questioning high fees in the face of continued poor performance compared to passive funds with much lower fees. Similar to the disruptive changes now occurring in the transport industry, active-fund managers must contemplate an “Uber-ization” of their business model with software driving innovation to provide investors promised returns at lower cost.

BigML: I understand that active managers are between a rock and a hard place, but what’s wrong with the good old buy-and-hold?

Jean-Marc Guillard: Consider putting your money into a diversified index fund and waiting decades. Would such a traditional buy-and-hold approach yield decent returns with low volatility? Normally yes – but beware. This strategy can yield poor results with rather high volatility for some indices. For example just consider the performance of French CAC 40 over the past two decades.

The CAC 40 index is a diverse, weighted stock price average of France’s 40 largest public companies, including such internationally famous names as Airbus, L’Oreal and Michelin. As a result, the CAC 40 should serve as an ideal index for the risk-averse small investor in France with a buy-and-wait strategy over the long run. But the performance of the CAC 40 between 1990 and 2015 has been a dismal 3.3% without dividends and 6.5% with dividends. Moreover, these returns came with rather high volatility, upward of 22%.
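For readers curious how an annualized figure like that 3.3% is typically derived, the standard compound-growth formula is sketched below; the index levels used are invented for illustration, not the actual CAC 40 values.

```python
# Annualized return from start and end index levels over a holding period.
# Levels here are hypothetical; the formula is the standard CAGR.
start_level, end_level, years = 1000.0, 2250.0, 25

annualized = (end_level / start_level) ** (1 / years) - 1
print(f"{annualized:.1%}")
```

Note how a headline gain of more than double over 25 years still compounds to only a few percent per year.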

Overall we strongly believe that a buy-and-hold strategy is absolutely valid for the risk averse small investor – especially if one considers the cost of active funds. However the recent advent of no fee brokerages like Robinhood in the United States and Deziro in Europe offers investors the ability to actively manage their own investments at costs approaching those of index funds. We want to encourage this democratization process by offering investors an objective way to automatically select stocks that yields better results while bypassing high fees.


BigML: That’s very interesting. How is STATS4TRADE’s approach to this problem different? How can the risk averse small investor earn decent returns – say in the range of 6-9% including all fees – with low volatility over shorter time periods than decades?

Jean-Marc Guillard: STATS4TRADE is uniquely positioned to help investors navigate this coming change. With the aid of Machine Learning and cloud computing technologies, we offer investors a new approach for selecting stocks and making buy/sell decisions – a data-driven approach that not only yields consistently better-than-index performance but also minimizes volatility and decreases operational costs while protecting capital.

Our trading applications leverage the power of BigML’s Machine Learning tools and allow investors, both private and professional, the opportunity not only to select but also to simulate different investment strategies based on short-term price forecasts. Once an investor has selected a strategy corresponding to her particular risk profile, the application automatically provides daily buy/sell signals for trading on no-fee platforms like Robinhood and Deziro.

Of course, none of this is magic and our approach is not without its limitations. For example, anyone who expects to get rich quickly will be sorely disappointed, for no forecast is completely accurate. Normally one needs about six months to begin seeing the benefits of our method. Nonetheless the message is clear: data-driven approaches like ours ultimately yield better results at lower cost. The results certainly speak for themselves!

BigML: Thanks for the detailed explanation.  Can you also tell a bit about specifically how Machine Learning comes into play?

Jean-Marc Guillard: Someone once said that predicting the future is a fool’s errand. We agree. However, one can still use statistics to estimate the likelihood of future events based on past data and an underlying statistical model. In fact, statistical methods have been used extensively for years in activities like consumer research, weather forecasting and, of course, finance. In our case we use Machine Learning methods powered by BigML to estimate the probability of short-term price movements of selected securities and indices, currencies and commodities. Namely, we aim to identify underlying statistical patterns for a given security, basket of securities, or an index and thereby accurately forecast upcoming movements in price.

BigML: What made you choose to build your models on the BigML platform?

Jean-Marc Guillard: A big part of it was the drastically faster iterative experimentation the BigML Dashboard enables, which in turn allowed us to achieve faster time-to-market. One usually doesn’t know what the final Machine Learning workflow will look like when setting out to explore the large hypothesis space that a complex problem like ours requires. So it is essential that the tools you use afford very quick and easy iterative exploration. BigML excels on this front.

In addition, the automation options made available on the BigML platform let us decrease ongoing operational costs to a minimum level that can compete with passive index funds, while further differentiating us from actively-managed funds that rely on manual processes. Lastly, we have had phenomenal support from the BigML team throughout our evaluation, exploration and implementation phases.

BigML: Thanks Jean-Marc. It is very impressive to see how you have been able to ramp up your Machine Learning efforts in such a limited time period despite constrained resources. We hope stories like yours inspire many more startups to realize that they too can turn their data and know-how into sustainable competitive advantages.

For our readers’ benefit, a downloadable PDF version of the STATS4TRADE case study is also available.

Machine Learning Automation: Beware of the Hype!

There’s a lot of buzz lately around “Automating Machine Learning”.  The general idea here is that the work done by a Machine Learning engineer can be automated, thus freeing potential users from the tyranny of needing to have specific expertise.

Presumably, the ultimate goal of such automations is to make Machine Learning accessible to more people.  After all, if a thing can be done automatically, that means anyone who can press a button can do it, right?


Maybe not.  I’m going to make a three-part argument here that “Machine Learning Automation” is really just a poor proxy for the true goal of making Machine Learning usable by anyone with data.  Furthermore, I think the more direct path to that goal is via the combination of automation and interactivity that we often refer to in the software world as “abstraction”.  By understanding what constitutes a powerful Machine Learning abstraction, we’ll be in a better position to think about the innovations that will really make Machine Learning more accessible.

Automation and Interaction

I had the good fortune to attend NIPS in Barcelona this year.  In particular, I enjoyed the (in)famous NIPS workshops, in which you see a lot of high quality work out on the margins of Machine Learning research.  The workshops I attended while at NIPS were each excellent, but were, as a collection, somewhat jarringly at odds with one another.

In one corner, you had the workshops that were basically promising to take Machine Learning away from the human operator and automate as much of the process as possible.  Two of my favorites:

  • Towards an Artificial Intelligence for Data Science – What it says on the box, basically trying to turn Machine Learning back around on itself and learn ways to automate various phases of the process.  This included an overview of an ambitious multi-year DARPA program to come up with techniques that automate the entire model building pipeline from data ingestion to model evaluation.
  • Bayesopt – This is a particular subfield in Machine Learning, where we try to streamline the optimization of any parameterized process that you’d usually figure out via trial and error.  The central learning task is, given all of the parameter sets you’ve tried so far, trying to choose the next one to evaluate so that you have the best shot at finding the global maximum.  Of course, Machine Learning algorithms themselves are parameterized processes tuned by trial and error, so these techniques can be used on them.  My own WhizzML script, SMACdown, is a toy version of one of these techniques that does exactly this for BigML ensembles.
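A toy version of this "choose the next parameter set from what you've seen so far" loop can be sketched as follows. In the spirit of SMAC (and hence SMACdown), it uses a random-forest surrogate; the objective function and all settings are invented for illustration.

```python
# Toy sequential model-based optimization: fit a surrogate to the
# (parameters, score) pairs tried so far, then evaluate the candidate
# the surrogate predicts to be best.
import random
from sklearn.ensemble import RandomForestRegressor

def objective(x):          # stand-in for an expensive evaluation
    return -(x - 0.3) ** 2  # peaks at x = 0.3

random.seed(0)
tried = [[random.random()] for _ in range(5)]   # initial random probes
scores = [objective(x[0]) for x in tried]

for _ in range(10):
    surrogate = RandomForestRegressor(n_estimators=25, random_state=0)
    surrogate.fit(tried, scores)
    candidates = [[random.random()] for _ in range(50)]
    best = max(candidates, key=lambda c: surrogate.predict([c])[0])
    tried.append(best)
    scores.append(objective(best[0]))

best_x = tried[scores.index(max(scores))][0]
print(round(best_x, 2))
```

Real Bayesian optimization adds an acquisition function that trades off exploration against exploitation, but the skeleton — surrogate, candidates, pick, evaluate, repeat — is the same.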

In the other corner, you had several workshops on how to further integrate people into the Machine Learning pipeline, either by inserting humans into the learning process or finding more intuitive ways of showing them the results of their learning.

  • The Future of Interactive Learning Machines – This workshop featured a panoply of human-in-the-loop learning settings, from humans giving suggestions to Machine Learning algorithms to machine-learned models trying to teach humans.  There was, in particular, an interesting talk on using reinforcement learning to help teachers plan lessons for children, which I’ll reference below.
  • Interpretable Machine Learning for Complex Systems – This workshop featured a number of talks on ways to allow humans to better understand what a classifier is doing, why it makes the predictions it does, and how best to understand what data the classifier needs to do its job better.

So what is going on here?  It seems like we want Machine Learning to be automatic . . . but we also want to find ways to keep people closely involved?  It is a strange pair of ideas to hold at the same time.  Of course, people want things automated, but why do they want to stay involved, and how do those two goals co-exist?

A great little call-and-response on this topic happened between two workshops as I attended them.  Alex Wiltschko from Twitter gave an interesting talk on using Bayesian parameter optimization to optimize the performance of their Hadoop jobs (among other things) and he made a great point about optimization in general:  If there’s a way to “cheat” your objective, so that the objective increases without making things intuitively “better”, the computer will find it.  This means you need to choose your objective very carefully so the mathematical objective always matches your intuition.  In his case, this meant a lot of trial and error, and a lot of consultations with the people running the Hadoop cluster.

An echo and example came from the other side of the “interactivity divide”, in the workshop on interactive learning.  Emma Brunskill had put together a system that optimized the presentation of tutorial modules (videos, exercises, and so on) being presented to young math students.  The objective the system was trying to optimize was something like the performance on a test at the end of the term.  Simple enough, right?  Except that one of the subjects being taught was particularly difficult.  So difficult that few of the tutorial modules managed to improve the students’ scores.  The optimizer, sensing this futility, promptly decided not to bother teaching this subject at all.  This answer is of course unsatisfying to the human operator; the curriculum should be a constraint on the optimization, not a result of it.

Crucially though, there’s no way the computer could know this is the case without the human operator telling it so.  And there’s no way for the human to know that the computer needs to know this unless the human is in the loop.
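A toy version of the curriculum anecdote shows both the failure and the fix (all module names and score gains below are hypothetical): a greedy optimizer maximizing expected score gain under a time budget drops the hard subject entirely, while treating the curriculum as a constraint forces it back in.

```python
# Toy curriculum optimizer. Each module is (name, subject, time_cost,
# estimated score gain); "algebra" is the hard subject with tiny gains.
modules = [
    ("fractions_video", "fractions", 1, 8.0),
    ("fractions_quiz",  "fractions", 1, 6.0),
    ("geometry_video",  "geometry",  1, 7.0),
    ("algebra_video",   "algebra",   1, 0.5),  # hard subject: tiny gain
    ("algebra_quiz",    "algebra",   1, 0.4),
]
BUDGET = 3  # only three modules fit in the term

def greedy(required_subjects=()):
    chosen, spent = [], 0
    # First satisfy the curriculum constraint: one module per subject.
    for subj in required_subjects:
        best = max((m for m in modules if m[1] == subj), key=lambda m: m[3])
        chosen.append(best)
        spent += best[2]
    # Then fill the remaining budget by score gain alone.
    for m in sorted(modules, key=lambda m: -m[3]):
        if m not in chosen and spent + m[2] <= BUDGET:
            chosen.append(m)
            spent += m[2]
    return sorted(m[0] for m in chosen)

print(greedy())  # ['fractions_quiz', 'fractions_video', 'geometry_video']
print(greedy(required_subjects=("fractions", "geometry", "algebra")))
```

With the constraint in place the optimizer still does useful work, ordering everything else by expected gain; it just can no longer “solve” the problem by deleting part of the curriculum.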

Herein lies the tension between interactivity and automation.

On one hand, people want and need many of the tedious and unnecessary details around Machine Learning to be automated away; often such details require expertise and/or capital to resolve appropriately and end up as barriers to adoption.

On the other, people still want and need to interact with Machine Learning so they can understand what the “Machine” has learned and steer the learning process towards a better answer if the initial one is unsatisfactory.  Importantly, we don’t need to invoke a luddite-like mistrust of technology to explain this point of view.  The reality is that people should be suspicious of the first thing a Machine Learning algorithm spits out, because the numerical objective that the computer is trying to optimize often does not match the real-world objective.  Once the human and machine agree precisely on the nature of the problem, Machine Learning works amazingly well, but it sometimes takes several rounds of interaction to generate an agreement of the necessary precision.

Said another way, we don’t need Machine Learning that is “automatic”.  We need Machine Learning that is comfortable and natural for humans to operate.  Automating away laborious details is only a small part of this process.

If this sounds familiar to those of you in the software world, it’s because software finds itself in this exact spot all the time.

From Automation to Abstraction

In the software world, we often speak in terms of abstractions.  A good software library or programming language will hide unnecessary details from the user, exposing only the modes of interaction necessary to operate the software in a natural way.  We say that the library or language is a layer of abstraction over the underlying software.

For those of you unfamiliar with the concept, consider the C programming language.  In C, we can write a statement like this:

x = y + 3;

The C compiler converts this operation to machine code, which requires knowing where in memory the x and y variables live, loading those values into registers, loading the binary value for “3” into a register, summing the values into another register, storing the result back into the memory reserved for x, and so on.

The language hides machine code and registers from us so we can think in terms of operators and variables, the primitives of higher level problems.  Moreover, it exposes an interface (mathematical expressions, functions, structs, and so on) that allows us to operate the layer underneath in a way that’s more useful and natural than if we worked on the layer directly.  In this sense, the C language is a very good abstraction:  It hides many of the things we’re almost never concerned about, and exposes the relevant functionality in an easier-to-use way.
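Python offers the same kind of layering, and its standard dis module lets you peel back one layer yourself: a single assignment compiles down to explicit load/add/store steps, much like the register shuffling the C compiler performs. A quick sketch (exact opcode names vary by Python version):

```python
import dis

def f(y):
    x = y + 3
    return x

# Disassemble to see the load/add/store steps the language hides.
dis.dis(f)

# Collect the opcode names; exact names vary across Python versions
# (e.g., BINARY_ADD in older releases, BINARY_OP in newer ones).
names = {instr.opname for instr in dis.get_instructions(f)}
print(sorted(names))
```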

It’s helpful to think about abstractions in the same way we think about compression algorithms.  They can be “strong”, so that they hide a lot of details, or “weak” so they hide few.  They can also be “very lossy”, so that they expose a poor interface, up to “lossless”, where the interface exposed can do everything that the hidden details can do.  The devil of creating a good abstraction is rather the same as creating a good compression algorithm:  You want to hide as many unimportant details from your users as possible, while hiding as little as possible that those same users want to see or use.  The C language as an abstraction over machine code is both quite strong (hides virtually all of the details of machine code from the user) and near-lossless (you can do the vast majority of things in C that are possible directly via machine code).

The astute reader can likely see the parallel to our view of Machine Learning: we have the same sort of tension between hiding drudgeries and complexities while still providing useful modes of interaction between tool and user.  Where, then, does “Machine Learning Automation” stand on our invented scale of abstractions?

Automations Are Often Lossy and Weak Abstractions

The problem (as I see it) with some of the automations on display at NIPS (and indeed in the industry at large) is that they are touted using the language of abstraction.  There are often claims that such software will “automate data science” or “allow non-experts to use Machine Learning”, or the like.  This is exactly what you might say about the C language: that it “automates machine code generation” or “allows people who don’t know assembly to program”, and you would be right.

As an example of why I find this a bit disingenuous, consider using Bayesian parameter optimization to tune the parameters of Machine Learning algorithms, one of my favorite newish techniques.  It’s a good idea, people in the Machine Learning community generally love it, and it certainly has the power to produce better models from existing software.  But how good of an abstraction is it, on the basis of drudgery avoided and the quality of the interface exposed?

Put another way, suppose we implemented some of these parameter optimizations on top of, say, scikit-learn (and some people have).  Now suppose there’s a user who wants to use this on data she has in a CSV file to train and deploy a model.  Here’s a sample of the other details she’s worried about:

  1. Installing Python
  2. How to write Python code
  3. Loading a CSV in Python
  4. Encoding categorical / text / missing values
  5. Converting the encoded data to appropriate data structures
  6. Understanding something about how the learned model makes its predictions
  7. Writing prediction code around the learned model
  8. Writing/maintaining some kind of service that will make predictions on-demand
  9. Getting a sense of the learned model’s performance

Of course, things get even more complicated at scale, as is their wont:

  1. Get access to / maintain a cluster
  2. Make sure that all cluster nodes have the necessary software
  3. Load your data onto the cluster
  4. Write cluster specific software
  5. Deal with cluster machine / job limitations (e.g., lack of memory)

This is what I mean when I say Machine Learning automations are often weak abstractions:  They hide very few details and provide little in the way of a useful interface.  They simply don’t usually make realistic Machine Learning much easier to use.  Sure, they prevent you from having to hand-fit maybe a couple dozen parameters, but the learning algorithm is already fitting potentially thousands of parameters.  In that context, automated parameter tuning, or algorithm selection, or preprocessing doesn’t seem like it’s the thing that suddenly makes the field accessible to non-experts.
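To put numbers on “a couple dozen parameters versus thousands”, here is roughly the entirety of what such a tuner hides, sketched as plain random search over two hyperparameters of a hypothetical train-and-score routine (a real Bayesian optimizer picks its next trial more cleverly, but it hides the same small interface):

```python
import random

random.seed(0)

# Stand-in for an expensive train-and-validate cycle: two knobs in,
# one validation score out. A real learner behind a call like this is
# itself fitting thousands of internal parameters.
def train_and_score(depth, min_leaf):
    return -(depth - 8) ** 2 - (min_leaf - 4) ** 2  # toy score, peak at (8, 4)

def tune(n_trials=200):
    best, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {"depth": random.randint(1, 20),
                  "min_leaf": random.randint(1, 20)}
        score = train_and_score(**params)
        if score > best_score:
            best, best_score = params, score
    return best

print(tune())  # lands near {'depth': 8, 'min_leaf': 4}
```

The point is how little is hidden: two knobs and a loop, while everything else on the lists above (data wrangling, encoding, deployment, serving, clusters) stays on the user’s plate.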

In addition, the abstraction is also “lossy” under our definition above; it hides those parameters, but usually doesn’t provide any sort of natural way for people to interact with the optimization.  How good is the solution?  How well does that match the user’s internal notion of “good”?  How can you modify it to do better?  All of those questions are left unanswered.  You are expected to take the results on faith.  As I said earlier, that might not be a good idea.

A Better Path Forward

So why am I picking on Bayesian parameter optimization?  I said that I think it’s awesome and I really do.  But I don’t buy that it’s going to be the thing that brings the masses to Machine Learning.  For that, we’re going to need proper abstractions; layers that hide details like those above from the user, while providing novel and useful ways to collaborate with the algorithm.

This is part of the reason we created WhizzML and Flatline, our DSLs for Machine Learning workflows and feature transformation.  Yes, you do have to learn the languages to use them.  But once you do, you realize that the languages are strong abstractions over the concerns above.  Hardware, software, and scaling issues are no longer any concern as everything happens on BigML-maintained infrastructure.  Moreover, you can interact graphically with any resources you create via script in the BigML interface.

The goal of making Machine Learning easier to use by anyone is a good one.  Certainly, there are a lot of Machine Learning sub-tasks that could bear automating, and part of the road to more accessible Machine Learning is probably paved with “one-click” style automations.  I would guess, however, that the larger part is paved with abstractions; ways of exposing the interactions people want and need to have with Machine Learning in an intuitive way, while hiding unnecessary details.  The research community is right to want both automation and interactivity: If we’re clever and careful we can have it both ways!

Reflecting on 2016 to Guide BigML’s Journey in 2017

2016 proved a whirlwind year for BigML, with substantial growth in users, customers, and the team, riding on the realization by businesses and experts alike that Machine Learning has transformational power in a new economy where data is abundant but actionable insights have not kept pace with improvements in storage, computational power, and costs. When things happen so fast, it can be a challenge to stop and reflect on milestones and achievements, so below are the highlights of what made 2016 a special year for BigML.


Releases and Product Updates

In 2016, BigML users were greeted by many new capabilities that they were asking for. As a result, the platform is now more mature and versatile than ever.  Logistic Regression (Summer 2016 Release) and Topic Modeling (Fall 2016 Release) techniques beefed up existing supervised and unsupervised learning resources, while Workflow Automation with WhizzML (Spring 2016 Release) gave the platform a whole new dimension that can deliver huge productivity boosts to any analytics team in the form of reduced solution maintenance and mitigated model management risks.

Aside from those, we made many smaller but noteworthy improvements to the toolset including but not limited to: Scriptify, Objective-C bindings, Swift SDK and BigML for Alexa.

Events, Certifications and Awards

2016 saw BigML represented at 4YFN, Machine Learning Prague, PAPIs 2016 and PAPIs Connect, Legal Management Forum, IEEE Rock Stars of Emerging Tech, WIRED, Mobey Day, Data Wrangling Automation and other industry events around the world, with a flattering reception and genuine enthusiasm that keeps pushing the team to innovate. Most notably, we have created a new and very hands-on BigML Certification Program that teaches participants how to solve practical real-life Machine Learning problems.  The next wave starts on January 19th, 2017!

After conducting the 2nd Valencian Summer School in Machine Learning, followed by a special lecture by BigML advisor Professor Geoff Webb, BigML gave its first Brazilian Summer School in Machine Learning in São Paulo. Look for more education events to follow in 2017, as BigML has joined forces with CICE in Madrid to take its educational efforts to the next level and capitalize on the great hunger for Machine Learning among developers, analysts and scientists.


Although the biggest award for us is the compliments we receive from our users and customers, in 2016 we were also pleased to be recognized by DIA Barcelona for the best advanced analytics for insurance companies and by Zapier for BigML for Google Sheets.

Popular Posts of 2016

Some of the Machine Learning veterans on our team also made time to share their career experiences across multiple posts that were well-received.


As a reprise, here is a good selection to revisit for those who would like to gain new perspectives on the current market landscape and what has worked in real-life situations, straight from the horse’s mouth.

Looking Forward to 2017

Now that awareness of Machine Learning in general, and cloud-born Machine Learning platforms in particular, has reached a critical threshold, our go-to-market strategy will double down on communicating positive examples to the entire community rather than having to explain “Why Machine Learning Matters” to the uninitiated.  In that regard, we must also thank Google, Apple, Uber, Airbnb, Facebook, Amazon, and Microsoft for putting Machine Learning squarely in the business lexicon.


In 2017, we also intend to intensify our educational efforts that promote learning by doing, while expanding the breadth and depth of capabilities to enable Agile Machine Learning at any organization in any industry. A big part of this will manifest itself through our active participation in technology events. We are kicking off the year with a trio of events, where BigML speakers will be on stage:

  • Anomaly Detection: Principles, Benchmarking, Explanation, and Theory

    Anomaly detection algorithms are widely applied in data cleaning, fraud detection, and cybersecurity. This talk will begin by defining various anomaly detection tasks and then focus on unsupervised anomaly detection. It will present a benchmarking study comparing eight state-of-the-art methods. Then it will discuss methods for explaining anomalies to experts and incorporating expert feedback into the anomaly detection process. The talk will conclude with a theoretical (PAC-learning) framework for formalizing a large family of anomaly detection algorithms based on discovering rare patterns.

    Speaker: Thomas G. Dietterich, Co-Founder and Chief Scientist.

  • FiturtechY

    FiturtechY is an event organized by the Instituto Tecnológico Hotelero (ITH), where innovation and technology meet to improve the tourism industry. FiturtechY will host four forums to discuss different topics: business, destinations, sustainability, and trends. BigML will be presenting at the #techYnegocio forum, the meeting point for professionals who seek to learn about the latest tools helping to revolutionize the tourism industry.

    Speaker: Dario Lombardi, VP of Predictive FinTech.

  • Computer-Supported Cooperative Work and Social Computing

    CSCW is the premier venue for presenting research in the design and use of technologies that affect groups, organizations, communities, and networks. Bringing together top researchers and practitioners from academia and industry, CSCW explores the technical, social, material, and theoretical challenges of designing technology to support collaborative work and life activities.

    Speaker: Poul Petersen, Chief Infrastructure Officer.

We also intend to put together the first BigML User Conference later in the year. So stay tuned for further event updates.

We hope this post gave a good crash-course tour (especially for those of you who have recently joined BigML) of what’s been happening in our neck of the woods. Powered by your support, we’re hungrier than ever to bring to the market the best Machine Learning software platform there ever was. We’d also highly encourage you to take a look at our 2017 predictions, which will guide our roadmap in the remainder of the year.  As always, be sure to reach out to us with your ideas no matter how crazy they seem!

10 Offbeat Predictions for Machine Learning in 2017

As each year wraps up, experts pull their crystal balls from their drawers and start peering into them for a glimpse of what’s to come in the next one. At BigML, we have been following such clairvoyance carefully this past holiday season to compare and contrast with our own take on what 2017 has in store, which may come across as quite unorthodox to some experts out there.

Enterprise Machine Learning Predictions Nobody is Talking About

For the TL;DR crowd, our crystal ball is showing us a cloudy (no pun intended) 2017 Machine Learning market forecast with some sunshine behind the clouds for good measure. To put it more directly, enterprises need to look beyond the AI hype for practical ways to incorporate Machine Learning into their operations. This starts with the right choice of internal platform that will help them build on smaller, low hanging fruit type projects that leverage their proprietary datasets. In due time, those projects add up to create positive feedback effects that ultimately not only introduce decision automation on the edges, but help agile Machine Learning teams transform their industries.

Jumping back to our regularly scheduled programming, let’s start with a quick synopsis of the road traveled so far:

  • Machine Learning is already set on an irreversible path to becoming impactful (VERY impactful) on how we’ll do our jobs across many sectors, eventually touching the whole economy.

  • Machine Learning Use Cases by Industry

  • But digesting, adopting and profiting from 36 years of Machine Learning advances and best practices has been a very bumpy ride that few businesses have managed to navigate so far.

  • There are many “New Experts” who read a couple of books or take a few online classes and are suddenly in a position to “alter” things just because they have access to cheap capital. While top technology companies have been “collecting” as much experienced Machine Learning talent as possible to get ready for the up-and-coming AI economy, other businesses are at the mercy of Machine Learning-newbie investors and inexperienced recent graduates with unicorn ambitions. It is wishfully assumed that versatile, affordable and scalable solutions based on a magical new algorithm will materialize out of these ventures.

  • In 2017, we suspect that the ecosystem is going to start converging around the right approach, albeit after some otherwise avoidable roadkill.

Before we get to the specific predictions, we must note that 2016 was special in that it presented a watershed event: for the first time in history, the planet’s Top 5 most valuable companies are all technology companies. All five share the common traits of large-scale network effects, highly data-centric company cultures and new economic value-added services built atop sophisticated analytics. What’s more, they have been heavily publicizing their intent to make Machine Learning the fulcrum of their future evolution. With the addition of revenue-generating unicorns like Uber and Airbnb, the dominance of the tech sector, which will benefit immensely from the wholesale digitization of the world economy, is likely to continue in the coming years.

Changing of the Guard?

However, the trillion dollar question is how legacy companies (i.e., non-tech firms with rich data plus smaller technology companies) can counteract and become an integral part of the newly forming value chains to be able to not only survive, but thrive in the remainder of the decade. Today, these firms are stuck with rigid rear view mirror business intelligence systems and archaic workstation-based traditional statistical systems running simplistic regression models that fail to capture the complexity of many real life predictive use cases.

At the same time, they sit on growing heaps of hard to replicate proprietary datasets that go underutilized. The latest McKinsey Global Institute report named The Age of Analytics: Competing in a Data-driven World reveals that less than 30% of the potential of modern analytics technologies outlined in their 2011 report has been realized — not even counting the new opportunities made possible by the advent of the same technologies in the last five years. To make matters worse, the progress looks very unbalanced across industries (i.e., as low as 10% in U.S. Healthcare vs. up to 60% in the case of Smartphones) at a time analytics prowess is correlated with competitive differentiation more than ever.

Machine Learning Industry Adoption

Even if it may be hidden behind polished marketing speak pushed by major vendors and research firms (e.g., “Cognitive Computing”, “Machine Intelligence” or even doomsday-like “Smart Machines”), the Machine Learning genie is without a doubt out of the bottle, as its wide-ranging potential across the enterprise has already made it part of the business lexicon. This newfound appetite for all things Machine Learning means many more legacy firms and startups will begin their Machine Learning journeys in 2017. The smart ones will separate themselves from the pack by learning from others’ mistakes. Nonetheless, some old bad habits are hard to kick cold turkey, so let’s dive in with some gloomier predictions and end on a higher note:


    “Big Data” soul searching leads to the gates of #MachineLearning.

    Tweet: PREDICTION #1: “Big Data” soul searching leads to the gates of #MachineLearning. @BigMLcom

  • The soul searching in the “Big Data” movement will continue as experts recognize the level of technical complexity that aspiring companies must navigate to piece together useful “Big Data” solutions that fit their needs. At the end of the day, “Big Data” is tomorrow’s data, nothing more. The recent removal of the “Big Data” entry from the Gartner Hype Cycle is further testament to the same realization. All this will only hasten the pivot to analytics, and specifically to Machine Learning, as the center of attention, so as to recoup the sunk costs of those projects via customer-facing smart applications. Moreover, the much maligned practice of sampling remains a great tool to rapidly explore the new predictive use cases that will support such applications.
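The point about sampling is easy to demonstrate with a synthetic experiment (all data is generated below): a trivial threshold model fit on a 10% sample lands very close to the one fit on the full data, which is exactly why sampling is such a cheap way to scope a new predictive use case.

```python
import random

random.seed(42)

# Synthetic data: one numeric feature; the label is 1 when the feature
# exceeds 5, with noise flipping 10% of the labels.
def make_row():
    x = random.uniform(0, 10)
    y = int(x > 5)
    if random.random() < 0.1:
        y = 1 - y
    return x, y

full = [make_row() for _ in range(10000)]
sample = random.sample(full, 1000)  # a 10% sample

def best_threshold(rows):
    # grid-search the cutoff that maximizes accuracy on the given rows
    def acc(t):
        return sum((x > t) == bool(y) for x, y in rows) / len(rows)
    return max((t / 10 for t in range(1, 100)), key=acc)

t_full, t_sample = best_threshold(full), best_threshold(sample)
print(t_full, t_sample)  # both land near the true cutoff of 5.0
```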

Big Data vs. Machine Learning Trends


    VCs investing in algorithm-based startups are in for a surprise.

    Tweet: PREDICTION #2: VCs investing in algorithm-based startups are in for a surprise. @BigMLcom

  • The education process of VCs will continue, albeit slowly and through hard lessons. They will keep investing in algorithm-based startups with marketable academic founder resumes, while perpetuating myths and creating further confusion, e.g., portraying Machine Learning as synonymous with Deep Learning, or completely misrepresenting the differences between Machine Learning algorithms and machine-learned models, or between model training and predicting from trained models1. A deeper understanding of the discipline, with the proper historical perspective, will remain elusive in the majority of the investment community, which is on the lookout for quick blockbuster hits. On a slightly more positive note, a small subset of the VC community seems to be waking up to the huge platform opportunity Machine Learning presents.

    Benedict Evans on Machine Learning


    #MachineLearning talent arbitrage will continue at full speed.

    Tweet: PREDICTION #3: #MachineLearning talent arbitrage will continue at full speed. @BigMLcom

  • The media frenzy around AI and Machine Learning will continue at full steam, as exemplified by Rocket AI-type parties where young academics are courted and ultimately funded by the aforementioned investors. The ensuing portfolio companies will find it hard to compete on algorithms, as few algorithms are really widely useful in practice, although some do slightly better than others for very niche problems. Most will be cast as brides at shotgun weddings with corporate development teams looking to beef up on Machine Learning talent strictly for internal initiatives. In some nightmare scenarios, the acquirers will have no clear analytics charter, yet they will be in a frantic hunt to grab headlines and generate the illusion that they too are on the AI/Machine Learning bandwagon.

Machine Learning Talent Arbitrage


    Top-down #MachineLearning initiatives built on PowerPoint slides will end with a whimper.

    Tweet: PREDICTION #4: Top-down #MachineLearning initiatives built on PowerPoint slides will end with a whimper. @BigMLcom

  • Legacy company executives who opt for expensive help from consulting companies in forming their top-down analytics strategy and/or making complex “Big Data” technology components work together, before doing their homework on low hanging predictive use cases, will find that actionable insights and game-changing ROI are hard to show. This is partially due to the requirement of having the right data architecture and flexible computing infrastructure already in place, but more importantly, outperforming 36 years of collective achievements by the Machine Learning community with some novel approach is simply a tall order, regardless of how cheap computing has become.

Top Down Data Science Consulting Fail


    #DeepLearning commercial success stories will be few and far between.

    Tweet: PREDICTION #5: #DeepLearning commercial success stories will be few and far between. @BigMLcom

  • Deep Learning’s notable research achievements, such as the AlphaGo challenge, will continue generating media interest. Nevertheless, its advances in certain practical use cases, such as speech recognition and image understanding, will be the real drivers for it to find a proper spot in the enterprise Machine Learning toolbox alongside other proven techniques. Interpretability issues, a dearth of experienced specialists, reliance on very large labeled training datasets and significant computational resource provisioning will limit mass corporate adoption in 2017. In its current form, think of it as the polo of Machine Learning techniques: a fun time perhaps, one that will let you rub elbows with the rich and famous, provided that you can afford a well-trained horse, the equestrian services and upkeep, the equipment and a pricey club membership to go along with those; not quite an Olympic sport, though. So, short of a significant research breakthrough in the unsupervised flavors of Deep Learning, most legacy companies experimenting with it are likely to conclude that they can get better results faster by paying more attention to areas like Reinforcement Learning and bread-and-butter Machine Learning techniques such as ensembles.

Deep Learning Hype


    Exploration of reasoning and planning under uncertainty will pave the way to new #MachineLearning heights.

    Tweet: PREDICTION #6: Exploration of reasoning and planning under uncertainty will pave the way to new #MachineLearning heights. @BigMLcom

  • Of course, Machine Learning is only a small part of AI. More attention to research, and to the resulting applications from startups, in the fields of reasoning and planning under uncertainty, not only learning, will help cover truly new ground beyond the better understood pattern recognition. Not surprisingly, Facebook’s Mark Zuckerberg reached similar conclusions in his assessment of the state of AI/Machine Learning after spending nearly a year coding his intelligent personal assistant “Jarvis”, loosely modeled after its namesake in the Iron Man series.

Mark Zuckerberg's Jarvis AI


    Humans will still be central to decision making despite further #MachineLearning adoption.

    Tweet: PREDICTION #7: Humans will still be central to decision making despite further #MachineLearning adoption. @BigMLcom

  • Some businesses will see early shoots of faster, evidence-based decision making powered by Machine Learning; however, humans will still be central to the decision making. Early examples of smart applications will emerge in certain industry pockets, adding to the uneven distribution of capabilities due to differences in regulatory frameworks, innovation management approaches, competitive pressures, end-customer sophistication and demand for higher-quality experiences, as well as conflicting economic incentives in some value chains. Despite the talk about the upcoming singularity and robots taking over the world, cooler heads in the space point out that it will take a while to create truly intelligent systems. In the meanwhile, businesses will slowly learn to trust models and their predictions as they realize that algorithms can outperform humans in many tasks.

Human vs. Machine Intelligence


    Agile #MachineLearning will quietly take hold beneath the cacophony of AI marketing speak.

    Tweet: PREDICTION #8: Agile #MachineLearning will quietly take hold beneath the cacophony of AI marketing speak. @BigMLcom

  • A more practical and agile approach to adopting Machine Learning will quietly take hold next year. Teams of doers, not afraid to get their hands dirty with unruly yet promising corporate data, will completely bypass the “Big Data” noise and carefully pick low hanging predictive problems that they can solve with well-proven algorithms in the cloud, using smaller sampled datasets with a favorable signal-to-noise ratio. As they build confidence in their abilities, the desire to deploy what they have built into production, and to add more use cases, will mount. No longer bound by data access issues or complex, hard-to-deploy tools, these practitioners will not only start improving their core operations but also start thinking about predictive use cases with higher risk-reward profiles that can serve as enablers of brand new revenue streams.

Lean, Agile Data Science Stack


    MLaaS platforms will emerge as the “AI-backbone” for enterprise #MachineLearning adoption by legacy companies.

    Tweet: PREDICTION #9: MLaaS platforms will emerge as the “AI-backbone” for enterprise #MachineLearning adoption by legacy companies. @BigMLcom

  • MLaaS platforms will emerge as the “AI Backbone”, accelerating the adoption of Agile Machine Learning practices. Consequently, commercial Machine Learning will get cheaper and cheaper thanks to a new wave of applications built on MLaaS infrastructure. Cloud Machine Learning platforms in particular will democratize Machine Learning by

    • significantly lowering costs by eliminating complexity or front-loaded vendor contracts
    • offering preconfigured frameworks that package the most effective algorithms
    • abstracting the complexities of infrastructure setup and management from the end user
    • providing easy integration, workflow automation and deployment options through REST APIs and bindings.
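As a sketch of what the last two bullets mean in practice, the snippet below assembles the requests for a hypothetical source → dataset → model → prediction workflow. The endpoint, resource ids and field names are all invented for illustration; they are not any particular vendor’s actual API:

```python
import json

API = "https://mlaas.example.com/v1"  # hypothetical MLaaS endpoint

# A typical REST workflow: each step would consume the resource id
# returned by the previous call. The ids here are placeholders.
steps = [
    ("POST", API + "/sources", {"remote": "https://example.com/churn.csv"}),
    ("POST", API + "/datasets", {"source": "source/abc123"}),
    ("POST", API + "/models", {"dataset": "dataset/def456",
                               "objective_field": "churned"}),
    ("POST", API + "/predictions", {"model": "model/ghi789",
                                    "input_data": {"tenure": 3, "plan": "basic"}}),
]

for method, url, payload in steps:
    print(method, url, json.dumps(payload))
```

Four HTTP calls stand in for the entire cluster-and-plumbing checklist; that substitution is the abstraction being sold.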

Machine Learning Platforms for Developers


    Data Scientists or not, more Developers will introduce #MachineLearning into their companies.

    Tweet: PREDICTION #10: Data Scientists or not, more Developers will introduce #MachineLearning into their companies. @BigMLcom

  • 2017 will be the year when developers start carrying the Machine Learning banner, easing the talent bottleneck for thousands of businesses that cannot compete with the Googles of the world in attracting top research scientists with over a decade of experience in AI/Machine Learning (experience which, in any case, doesn’t automagically translate into smart business applications that deliver business value). Developers will rapidly build and scale such applications on MLaaS platforms that abstract painful details (e.g., cluster configuration and administration, job queuing, monitoring and distribution) that are better kept underground in the plumbing. Developers just need a well-designed and well-documented API; they no more need to know what Information Gain or the Wilson Score is to solve a predictive use case with a decision tree than they need to know what an LR(1) parser is to compile and execute their Java code.

Developer-driven Machine Learning

We are still in the early innings of “The Age of Analytics”, so there is much more to feel excited about than to dwell on from the bruises of past false starts. Here’s to keeping calm and carrying on with this exciting endeavor that will take business as we know it through a storm by perfecting the alchemy between mathematics, software and management best practices. Happy 2017 to you all!

1: The A16Z presenter seems to think every self-driving car has to learn what a stop sign is by itself, reinventing the wheel many times over, instead of relying on tons of historical sensor data from an entire fleet of such vehicles. In reality, few Machine Learning use cases (e.g., handwriting recognition) require a continuously trained algorithm.

Fourth Edition of the Startup Battle at 4YFN

Four Years From Now, the startup business platform of Mobile World Congress that enables startups, investors and corporations to connect and launch new ventures together, goes to Barcelona, Spain, from February 27 to March 1, 2017. We could not think of a better context to run the fourth edition of our series of Startup Battles.

Telefónica has invited PreSeries, the joint venture between Telefónica Open Future_ and BigML, to participate at the 4YFN event and showcase its early stage venture investing platform on the main stage on February 28 in front of an audience of over 500 technologists. In a nutshell, PreSeries provides insights and many other metrics to help investors make objective, data-driven decisions in investing in tomorrow’s leading technology companies.


In rapid-fire execution mode, Valencia was the first city to witness the world première of the Artificial Intelligence Startup Battle, on March 15, 2016. On October 12, the PreSeries Team travelled to Boston to celebrate the second edition at the PAPIs ’16 conference. Less than two months later, we celebrated the third edition in São Paulo as part of BSSML16. The fourth edition of our series of startup battles will be hosted in Barcelona, Spain. The distinguished audience and press members in Catalonia will discover how an algorithm can predict the success of a startup without any human intervention.

To recap the process, five startups from the Wayra Academy, Telefónica’s startup accelerator, will present their projects on stage through five-minute, to-the-point pitches. Afterwards, PreSeries will ask each contender a number of questions in order to provide a score between 0 and 100. The startup with the highest score wins the battle. Participating in the battle is key for the startups involved, as it gives them excellent exposure to potential corporate sponsors, strategic partners, and the venture investment community. Stay tuned for future announcements, where we will reveal the contenders of the fourth edition of our Startup Battle, as it may just prove to be the most competitive one so far.

Even Easier Machine Learning for Every Day Tasks


Recently, the “Machine Learning for Everyday Tasks” post, which suddenly rose to the top of Hacker News, drew our attention. In that post, Sergey Obukhov, a software developer at the San Francisco-based startup Mailgun, tries to debunk the myth that Machine Learning is a hard task:

I have always felt like we can benefit from using machine learning for simple tasks that we do regularly.

This certainly rings true to our ears at BigML, since we aim to democratize Machine Learning by providing developers, scientists, and business users powerful higher-level tools that package the most effective algorithms that Machine Learning has produced in the last two decades.

In this post, we are going to show how much easier it is to solve the same problem tackled in the Hacker News post by using BigML. To this end, we have created a test dataset with similar properties to the one used in the original post so that we can replicate the same steps with analogous results.

Predicting processing time

The objective of this analysis is predicting how long it will take to parse HTML quotations embedded within e-mail conversations. Most messages are processed in a very short time, while some of them take much longer. Identifying those lengthier messages in advance is useful for several purposes, including load-balancing and giving more precise feedback to users.

Our analysis is based on a CSV file containing a number of fictitious track records of our system performance when handling email messages:

HTML length, Tag count, Processing Time

We would like to classify a given incoming message as either slow or fast given its length and tag count based on previously collected data.

Finding a definition for slow and fast through clustering

The first step in our analysis is defining what slow and fast actually mean. The approach in the original post is clustering, which identifies groups of relatively homogeneous data points. Ideally, we would hope that this algorithm is able to collect all slow executions in one cluster and all fast executions in another.

In the original post, the author has written a small program to calculate the optimal number of clusters. Then he uses that number as a parameter to actually build the clusters.

The task of estimating the number of clusters to look for is so common that BigML provides a ready-made WhizzML script that does exactly that: Compute best K-means Clustering. Alternatively, BigML also provides the G-means algorithm for clustering, which is able to automatically identify the optimal number of clusters. For our analysis, we will use the Compute best K-means Clustering script, following these steps:

  1. Create a dataset in BigML from your CSV file
  2. Execute the Compute best K-means clustering script using that dataset.

We can carry out those steps in a variety of ways, including:

  • Using the BigML Dashboard, which makes it really easy to investigate a problem and build a Machine Learning solution for it in a point-and-click manner.
  • Writing a program that uses BigML REST API and the proper bindings that BigML provides for a number of popular languages, such as Python, Node.js, C#, Java, etc.
  • Using bigmler, a command-line tool, which makes it easier to automate ML tasks.
  • Using WhizzML, a server-side DSL (Domain Specific Language) that makes it possible to extend the BigML platform with your custom ML workflows.

We are going to use the BigML bindings for Python as follows:

from bigml.api import BigML

api = BigML()

source = api.create_source('./post-data.csv')
dataset = api.create_dataset(source)
api.ok(dataset)
print("dataset: " + dataset['resource'])

execution = api.create_execution('script/57f50fb57e0a8d5dd200729f',
                                 {"inputs": [
                                     ["dataset", dataset['resource']],
                                     ["k-min", 2],
                                     ["k-max", 10],
                                     ["logf", True],
                                     ["clean", True]]})
api.ok(execution)
best_cluster = execution['object']['execution']['result']
print("cluster: " + best_cluster)

The result tells us that we have:

  • Two clusters (green and orange) that contain definitely slow instances.



  • The blue cluster includes the majority of instances, both fast and not-so-fast, as its statistical distribution in the cluster detail panel indicates:


Seemingly, our threshold to distinguish fast tasks from slow tasks points to the green cluster.

At this point, the original post gives up on using clustering as a means to determine a sensible threshold, and reverts to plotting time percentiles against tag count. Luckily for them, the percentile distribution shows a nice jump at the 78th percentile, but in general this kind of analysis may not always yield such obvious distributions. As a matter of fact, detecting such abnormalities can be even harder with multidimensional data.

BigML makes it very simple to further investigate the properties of the above green cluster. We can simply create a dataset including only the data instances belonging to that cluster and then build a simple model to better understand its characteristics:

centroid_dataset = api.create_dataset(best_cluster, {'centroid': '000000'})
api.ok(centroid_dataset)

centroid_model = api.create_model(centroid_dataset)
api.ok(centroid_model)
print("model: " + centroid_model['resource'])

This, in turn, produces the following model:


If you inspect the properties of the tree nodes, you can see that the tree quickly splits into two subtrees, with all nodes on the left-hand subtree having processing times lower than 14.88 sec, and all nodes on the right-hand subtree having processing times greater than 16.13 sec.


This suggests that a good choice for the threshold between fast and slow can be approximately 15.5 sec.

If we follow along the same steps as in the original post and apply the percentile analysis to our data instances here, we arrive at the following distribution:


This distribution clearly starts growing faster between the 88th and 89th percentiles, confirming our choice of threshold:


To summarize, we have found a comparable result by applying a much more generalizable analysis approach.
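The percentile-jump check described above can also be done programmatically. Here is a minimal plain-Python sketch; the timings are synthetic stand-ins for the real dataset (so the jump lands at a slightly different percentile than in our data):

```python
import random

random.seed(0)
# Synthetic processing times: 900 fast tasks plus a slow tail of 100 (illustrative only).
times = [random.uniform(0.1, 5) for _ in range(900)] + \
        [random.uniform(16, 60) for _ in range(100)]

def percentile(values, p):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = int(round(p / 100.0 * len(ordered)))
    return ordered[max(0, min(len(ordered) - 1, rank - 1))]

# The largest jump between consecutive percentiles hints at the fast/slow boundary.
jumps = {p: percentile(times, p + 1) - percentile(times, p) for p in range(50, 99)}
threshold_percentile = max(jumps, key=jumps.get)
```

With the synthetic data above, the largest jump shows up at the 90th percentile, right where the fast values end and the slow tail begins.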

Feature engineering

With the proper threshold identified, we can mark all data instances with running times lower than 15.5 as fast and the rest as slow. This is another task that BigML can tackle easily via its built-in feature engineering capabilities on BigML Dashboard:


Alternatively, we do the same in Python:

extended_dataset = api.create_dataset(dataset, {
    "new_fields": [
        {"field": '(if (< (f "time") 15.5) "fast" "slow")',
         "name": "processing_speed"}]})
api.ok(extended_dataset)
print("dataset: " + extended_dataset['resource'])

Which produces the following dataset:


Predicting slow and fast instances

Once we have all our instances labeled as fast or slow, we can finally build a model to predict whether an unseen instance will be fast or slow to process. The following code creates the model from our extended_dataset:

extended_dataset = api.create_dataset(dataset, {
    "excluded_fields": ["000002"],
    "new_fields": [
        {"field": '(if (< (f "time") 15.5) "fast" "slow")',
         "name": "processing_speed"}]})
api.ok(extended_dataset)

final_model = api.create_model(extended_dataset)
api.ok(final_model)
print("model: " + final_model['resource'])

Notice that we excluded the original time field from our model, since we are now relying on our new feature that tells apart the fast instances from slow ones. This step yields the following result that shows a nice split between fast and slow instances at around 9,589 tags (let’s call this _MAX_TAGS_COUNT):


Admittedly, our example here is pretty trivial. As was the case in the original post, our prediction boils down to this conditional:

_MAX_TAGS_COUNT = 9589

def html_too_big(s):
    return s.count('<') > _MAX_TAGS_COUNT

But, what if our dataset were more complex and/or the prediction involved more intricate calculations? This is another situation, where using a Machine Learning platform such as BigML provides an advantage over an ad-hoc solution. With BigML, predicting is just a matter of calling another function provided by our bindings:

from bigml.model import Model

final_model = "model/583dd8897fa04223dc000a0c"
prediction_model = Model(final_model)
prediction1 = prediction_model.predict({
    "html length": 3000,
    "tag count": 1000})
prediction2 = prediction_model.predict({
    "html length": 30000,
    "tag count": 500})

What’s more, predictions are fully local, which means no access to BigML servers is required!


Machine Learning can be used to solve everyday programming tasks. There are certainly different ways to do that, including tools like R and various Python libraries. However, those options have a steeper learning curve to master the details of the algorithms inside as well as the glue code to make them work together. One must also take into account the need to maintain and keep alive such glue code that can result in considerable technical debt.

BigML, on the other hand, provides practitioners all the tools of the trade in one place in a tightly integrated fashion. BigML covers a wide range of analytics scenarios including initial data exploration, fully automated custom Machine Learning workflows, and production deployment of those solutions on large-scale datasets. A BigML workflow that solves a predictive problem can be easily embedded into a data pipeline, which unlike R or Python libraries does not require any desktop computational resources and can be reproduced on demand.

Predicting the Publication Year of NIPS Papers using Topic Modeling

The Neural Information Processing Systems (NIPS) conference is one of the most important events in Machine Learning. It receives hundreds of papers from researchers all over the world each year. On the occasion of the NIPS conference held in Barcelona last week, Kaggle published a dataset containing all NIPS papers between 1987 and 2016.

We found it an excellent opportunity to put BigML’s latest addition in practice: Topic Models.

Assuming that paper topics evolve gradually over the years, our main goal in this post will be to predict the decade in which the papers were published by using their topics as inputs. Then, by examining the resulting model, we can get a rough idea of which research topics are popular now, but not in the past, and vice-versa.

We will accomplish this in four steps: first, we will transform the data into a Machine Learning-ready format; second, we will create a Topic Model and inspect the results; then, we will add the resulting topics as input fields; finally, we will build the predictive model using the decade as the objective field.

1. The Data

We start by uploading the CSV file “papers” to BigML. BigML creates a source while automatically recognizing the field types and showing a sample of the instances so we can check that the data has been processed correctly. As you can see below, the source contains the title, the authors, the year, the abstracts and the full text for each paper.


Notice that BigML supports compressed files such as .zip files so we don’t need to decompress the source file first. Moreover, BigML automatically performs some text analysis that also aids Topic Models (e.g., tokenization, stemming, stop words and case sensitivity cleaning) so you don’t need to worry about any text pre-processing. You can read more about the text options for Topic Models  here.
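To get a rough feel for what this kind of text pre-processing does (BigML’s actual pipeline is more sophisticated and configurable), a toy tokenizer with a stop-word filter might look like this; the stop-word list here is a tiny illustrative stub:

```python
import re

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # tiny illustrative list

def preprocess(text):
    """Lowercase the text, tokenize on runs of letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

preprocess("The Encoding of Natural Stimuli")
# → ['encoding', 'natural', 'stimuli']
```

The point is that none of this has to be written by hand: the platform applies it automatically when building a Topic Model.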

When the source is ready, we can create a dataset, which is a structured version of your data interpretable by a Machine Learning model. We do this by using the 1-click option shown below.

1-click dataset.png

Since we want to predict the publication decade of the NIPS papers, we need to transform the “year” into a categorical field. This field will include four different categories: 80s, 90s, 2000s and 2010s. We can easily do so by clicking the option “Add fields to dataset”.


Then we need to select “Lisp s-expression” and use the Flatline editor to calculate the decade using the “year” field. We will not cover all the steps to create a field using the Flatline editor, but you can find a detailed explanation in Section 9.1.7 of the datasets document.


The formula we need to insert contains several “If…” statements to group years into decades:

(if (< (f "year") 1990) "80s" (if (< (f "year") 2000) "90s" (if (< (f "year") 2010) "2000s" "2010s")))
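For readers more comfortable with Python than Flatline, the expression is equivalent to the following function (shown purely to clarify the logic; the actual field is computed server-side by Flatline):

```python
def decade(year):
    """Plain-Python equivalent of the Flatline decade expression."""
    if year < 1990:
        return "80s"
    if year < 2000:
        return "90s"
    if year < 2010:
        return "2000s"
    return "2010s"
```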

When the new field is created, we can find it as the last field in the dataset. By mousing over the histogram we can see the different decades:


2. Discovering the Topics Underlying the NIPS Papers

Creating a Topic Model in BigML is very easy. You can either use the 1-click option or you can configure the parameters. To discover the topics for the NIPS papers, we are going to configure the following  parameters:

  • Number of top terms: by default, BigML shows the top 10 terms per topic. We prefer to set a higher limit this time, up to 30 terms, so we have more terms from which to glean the topic themes.
  • Bigrams: we include bigrams in the Topic Model vocabulary since we expect the NIPS reports to show a high number of them, e.g., neural networks, reinforcement learning or computer vision.
  • Excluded terms: we exclude terms such as numbers and variables since they are not significant in delimiting the papers’ thematic boundaries over time and can generate some noise.

topic model conf.png

When the Topic Model is created, you can inspect the topic terms using two different visualizations: the topic map and the term chart. See both in the images below.



You can see the resulting Topic Model and play with the associated BigML interactive visualizations here!

The discovered topics provide a nice overview of most of the major subtopics in Machine Learning research, and we’ve renamed them to make them readable at a glance.  In the “north” of the topic model map, we have topics related to Bayesian and Probabilistic modeling, along with Text/Speech processing and computer vision, which represent domains where those techniques are popular. In the “south”, we get the topics that are heavily tilted towards matrix mathematics, including PCA and the specification of multivariate Gaussian probabilities. In the “west” we have supervised learning and optimization, with topics containing theorem proving along with various occurrences of numbers in this quadrant. In the “east” we have two rather isolated topics corresponding to data structures, specifically trees and graphs. Finally, in the center of the map, we have topics that occur across every discipline:  General AI terms (like “robot”), people talking about the real-world domain that they’re working in, and acknowledgements for collaborators and funding.

With the topics discovered, let’s try to predict the topic distribution for a new document.  A good way to visually analyze the Topic Model predictions is to use BigML Topic Distributions. You can use the corresponding option within the 1-click menu:

topic dist.png

A form containing the fields used to create the Topic Model will be displayed so you can insert any text and get the topic distributions.

We input the following data for our first Topic Distribution:

  • Title: “Deep Learning Models of the Retinal Response to Natural Scenes”
  • Abstract: “A central challenge in sensory neuroscience is to understand neural computations and circuit mechanisms that underlie the encoding of ethologically relevant, natural stimuli. (..) Here we demonstrate that deep convolutional neural networks (CNNs) capture retinal responses to natural scenes nearly to within the variability of a cell’s response, and are markedly more accurate than linear-nonlinear (LN) models and Generalized Linear Models (GLMs). (…) the injection of latent noise sources in intermediate layers enables our model to capture the sub-Poisson spiking variability observed in retinal ganglion cells. (..) Overall, this work demonstrates that CNNs not only accurately capture sensory circuit responses to natural scenes, but also can yield information about the circuit’s internal structure and function.”

The resulting topics, in the order of importance, include: Human Visual System (22.15%), Neurobiology (19.89%), Neural Networks (10.77%), Human Cognition (8.72%), Computer Vision (6.96%),  Noise (5.17%) among others with lower probabilities. You can see the resulting probabilities histogram in the image below:

topic dist 1.png

After making several predictions for different papers, we’re pretty confident that the predictions map fairly well with the judgements a human expert might make. Give it a try for yourself with this Topic Model link!

3. Including Topics as Input Fields

At this point, we know that the resulting topics are consistent and the model  satisfactorily calculates the different Topic Distributions for the papers. Now, we can try using the topic distribution to predict when the paper was written.

In order to incorporate the different Topic Distributions for all the papers in the dataset, we need to click on the “Batch Topic Distribution” option and select the dataset that contains the field “decade” (the field we created in the first step above).

topic dist 2.png

When the Batch Topic Distribution is created, we can find the resulting dataset containing all the topic distributions as fields.

topic dist 3.png

4. Predicting a Paper’s Decade

Finally, we are ready to build a model to predict any paper’s decade using the topics as inputs.

We first need to split the dataset into training and testing subsets randomly. In this case, we are going to use 80% of the dataset to build a Logistic Regression. For this, we remove all fields except the topics and the paper abstract and select the decade as the objective field.
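On the Dashboard, this split is a 1-click operation; conceptually it boils down to something like the following plain-Python sketch (not BigML’s internal implementation):

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows deterministically, then split at the cut point."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

papers = list(range(1000))              # stand-ins for the dataset rows
train, test = train_test_split(papers)  # 800 training rows, 200 testing rows
```

Fixing the seed makes the split reproducible, so evaluations of different models are comparable.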

BigML visualizations for Logistic Regression allow us to interpret the most influential topics for predicting the decade. By selecting a topic for the x-axis of the Logistic Regression chart, we can distinguish the topics whose prevalence evolves over time from the more stable ones. The fluctuating topics will be better predictors of the decades than the steady topics, which will be mostly irrelevant for our supervised model.

For example, we can see in the image below that as the probability for the topic “Circuits/Hardware” increases, it is more likely to appear in the papers from the 80’s and 90’s than in the papers from the 21st century. Therefore, it can be an important topic in determining which decade a paper was written.

LR circuits.png

The topic “Support Vector Machines”, for example, tends to be very frequent in papers from the 2000s, while it is less probable in other decades.

LR SVM.png

Other topics like “Small numbers” (which includes all the numbers found in the papers) or “Probabilistic Distributions” tend to have a stable probability throughout the decades. You can observe this in the image below, where the graph lines are pretty flat, i.e., the predicted probabilities for the decades do not change with the topic probabilities.

The results seem to fit our expectations nicely, but to objectively measure the overall predictive power of this model, we need to evaluate it using the remaining 20% of the data.

The Logistic Regression evaluation shows around 80% accuracy, which is not bad. However, after trying other classification models, we find out that the best performing model is a Bagging ensemble of 300 trees, which achieves an accuracy of 84%. You can see its confusion matrix below.

confusion matrix.png

Here, we see that the most difficult decade to predict is the 80s, very likely due to the smaller number of papers (57 in total) in the sample as compared to the other decades.
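Accuracy itself is simply the fraction of instances on the confusion matrix’s diagonal. A small sketch of that computation, with a made-up two-class matrix rather than the actual four-decade numbers:

```python
def accuracy(confusion):
    """Overall accuracy: correctly predicted instances (diagonal) over all predictions."""
    correct = sum(confusion[label][label] for label in confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total

# Made-up counts: outer keys are actual classes, inner keys are predicted classes.
matrix = {
    "90s":   {"90s": 138, "2000s": 27},
    "2000s": {"90s": 5,   "2000s": 30},
}
accuracy(matrix)  # → 0.84
```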

To improve the model performance further, we can try some more feature engineering, adding fields such as the length of the text, the authors, the number of papers per author, or extracted entities like the university or country of publication.

We encourage you to delve into this fun dataset, and let us know of ways to improve it. If you haven’t got a BigML account yet, it’s high time you got one. Sign up now, it’s free!

Brazilian Summer School in Machine Learning and AI Startup Battle in the Books!

The BigML Team travelled to São Paulo, Brazil, to conduct another edition of our series of Machine Learning schools as part of our ongoing educational activities to help democratize Machine Learning, not only across industries and job functions but also across geographies. Although this was not our first such experience, we keep being positively surprised by the enthusiasm with which these hands-on training sessions are received. The Brazilian Summer School in Machine Learning was no exception in this regard.

It’s safe to say that Brazilian techies responded to our call in a big way. We received more than 450 applications from 9 different countries to join this event. However, due to space and travel/visa issues, we could only accept a maximum of 202 attendees, representing 6 different states in Brazil (173 from São Paulo, 11 from Minas Gerais, 10 from Rio de Janeiro, 4 from Santa Catarina, 2 from Paraná, and 2 from Rio Grande do Sul). There was a full house at the VIVO Auditorium in São Paulo. Check out all the pictures on Flickr, Google+, and Facebook.


The two-day program was packed with topics such as supervised and unsupervised learning techniques, feature engineering, how to get your data ready for Machine Learning, as well as automating Machine Learning workflows.

Artificial Intelligence Startup Battle

Following in the footsteps of the inaugural AI Startup Battle that took place in Valencia, Spain, on March 15, 2016, and the second edition in Boston on October 12, 2016, BSSML16 closed with the third edition of the Artificial Intelligence Startup Battle. Brazilian media outlets covered the battle, since it was the first time in history that Brazil held a contest where the jury was a Machine Learning algorithm that predicted the probability of success of an early-stage startup. The four competitors (Dataholics, PayGo Energy, Prognoos, and Sppin-Kapputo) and Saffe Payments took to the stage, although Saffe Payments did not participate in the competition because they are already part of the Wayra academy.


Contenders of the AI Startup Battle at the BSSML16. From left to right: Guilherme Paiva, Co-Founder and CEO of Sppin-Kapputo; Mark O’Keefe, Co-Founder of PayGo Energy; Daniel Mendes, Founder and CEO of Dataholics; Raul Magno, Co-Founder and CEO of Prognoos; and Renato Valente, Country Manager of Telefónica Open Future_ in Brazil.

The PreSeries Machine Learning algorithm interviewed the contenders until it had enough information to provide a score between 0 and 100. The winning company, with a score of 92.33, was Prognoos, a startup from São Paulo that has built an artificial intelligence platform applying e-commerce user interaction and browsing data to personalize the buying experience through its proprietary algorithm. This startup is being invited to Telefónica Open Future_’s accelerator to enjoy access to the Wayra Academy (for up to six months) and to Wayra services and its network of contacts, e.g., training, coaching, a global network of talent, as well as the opportunity to reach many Telefónica enterprises in Brazil and abroad. After six months, the winning company will be evaluated and may apply for a full Wayra acceleration process, including a convertible note loan of up to USD $50,000 (in exchange for a possible 7 to 10% equity stake).


Prognoos, winner of the AI Startup Battle at the BSSML16, represented by Raul Magno, its Co-Founder and CEO (left). Renato Valente, Country Manager of Telefónica Open Future_ in Brazil (right).

Second place went to PayGo Energy, from Nairobi, Kenya, with a score of 71.90; they seek to democratize LPG (Liquefied Petroleum Gas, or Propane) for the 2.9 billion people worldwide who lack access to clean cooking fuel. Third place went to Dataholics, from São Paulo, with a score of 39.72; they focus on detecting the products and services that fit a given consumer profile based on social media and demographic information. Finally, fourth place went to Sppin-Kapputo, from Belo Horizonte, Brazil, with a score of 28.14; this company is an information broker that uses Machine Learning to help real estate investors and construction companies make better decisions through analytics tools and prediction models that evaluate the impact of construction on a real estate market.

At the end of the event, BigML’s CEO and President of PreSeries, Francisco J. Martin, highlighted: “We already knew from the growing number of active BigML users in Brazil that the region holds tremendous potential due to an abundance of young and hungry to learn minds as well as world class academics in Machine Learning and AI. This week was further testament that geographic barriers are no longer strong enough to prevent the spread of innovative and ambitious ideas that consider not only their local market but the whole world as their target audience for their data driven smart applications.”

The next edition of our Machine Learning schools and AI Startup Battles will take place soon, so stay tuned for new announcements on Twitter (@bigmlcom) and other social media channels: LinkedIn, Facebook, and Google+.

Looking forward to seeing you again in future editions of our Machine Learning training events and AI Startup Battles around the world.
