
STATS4TRADE Unlocks the Power of Machine Learning for Investors

The investment industry is extremely competitive: fund managers work hard to build track records that beat their benchmarks in order to justify their fees and share in the profits from their assets under management. For retail investors, recent years have been marked by a significant shift toward passive instruments such as index funds, at the expense of actively managed funds that have struggled to justify higher expense ratios against a backdrop of volatile markets and the easy-money policies central banks worldwide adopted in response to the Great Recession.

Meanwhile, the abundance of financial market data has given birth to a new wave of startups looking to put Machine Learning to good use to create a sustainable market edge at lower cost. One such exciting company is STATS4TRADE out of France. We caught up with the CEO and Founder of STATS4TRADE to see how his company is innovating with advanced analytics.

STATS4TRADE

 

BigML: Congrats on launching your startup Jean-Marc. Can you tell us what was the motivation behind founding STATS4TRADE?

Jean-Marc Guillard: It really starts with my conviction that the financial services industry faces drastic change in the coming years, and actively managed equity funds are not immune to it. Investors are rightfully questioning high fees in the face of continued poor performance compared to passive funds with much lower fees. Similar to the disruptive changes now occurring in the transport industry, active-fund managers must contemplate an “Uber-ization” of their business model, with software driving innovation to deliver investors the promised returns at lower cost.

BigML: I understand that active managers are between a rock and a hard place, but what’s wrong with the good old buy-and-hold?

Jean-Marc Guillard: Consider putting your money into a diversified index fund and waiting decades. Would such a traditional buy-and-hold approach yield decent returns with low volatility? Normally yes – but beware. This strategy can yield poor results with rather high volatility for some indices. For example, just consider the performance of the French CAC 40 over the past two decades.

The CAC 40 index is a diversified, weighted stock price average of France’s 40 largest public companies, including such internationally famous names as Airbus, L’Oreal and Michelin. As a result, the CAC 40 should serve as an ideal index for the risk-averse small investor in France following a buy-and-hold strategy over the long run. Yet the performance of the CAC 40 between 1990 and 2015 has been a dismal 3.3% without dividends and 6.5% with dividends. Moreover, these returns came with rather high volatility, upward of 22%.

Overall, we strongly believe that a buy-and-hold strategy is absolutely valid for the risk-averse small investor – especially if one considers the cost of active funds. However, the recent advent of no-fee brokerages like Robinhood in the United States and Deziro in Europe offers investors the ability to actively manage their own investments at costs approaching those of index funds. We want to encourage this democratization by offering investors an objective way to automatically select stocks that yields better results while bypassing high fees.

CAC40 vs. STATS4TRADE

BigML: That’s very interesting. How is STATS4TRADE’s approach to this problem different? How can the risk-averse small investor earn decent returns – say in the range of 6-9% including all fees – with low volatility over time periods shorter than decades?

Jean-Marc Guillard: STATS4TRADE is uniquely positioned to help investors navigate this coming change. With the aid of Machine Learning and cloud computing technologies, we offer investors a new approach for selecting stocks and making buy/sell decisions – a data-driven approach that not only yields consistently better-than-index performance but also minimizes volatility and decreases operational costs while protecting capital.

Our trading applications leverage the power of BigML’s Machine Learning tools and allow investors both private and professional the opportunity to not only select but also simulate different investment strategies based on short-term price forecasts. Once an investor has selected a strategy corresponding to her particular risk-profile, the application automatically provides daily buy/sell signals for trading on no-fee platforms like Robinhood and Deziro.

Of course none of this is magic, and our approach is not without its limitations. For example, anyone expecting to become rich quickly will be sorely disappointed, for no forecast is completely accurate. Normally one needs about six months to begin seeing the benefits of our method. Nonetheless the message is clear: data-driven approaches like ours ultimately yield better results at lower cost. The results certainly speak for themselves!

BigML: Thanks for the detailed explanation.  Can you also tell a bit about specifically how Machine Learning comes into play?

Jean-Marc Guillard: Someone once said that predicting the future is a fool’s errand. We agree. However, one can still use statistics to estimate the likelihood of future events based on past data and an underlying statistical model. In fact, statistical methods have been used extensively for years in activities like consumer research, weather forecasting and, of course, finance. In our case, we use Machine Learning methods powered by BigML to estimate the probability of short-term price movements of selected securities, indices, currencies and commodities. Namely, we aim to identify underlying statistical patterns for a given security, basket of securities, or index and thereby accurately forecast upcoming movements in price.
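The general idea can be illustrated with a minimal sketch. The code below is emphatically not STATS4TRADE’s actual model; it is a toy stand-in, using made-up function names and synthetic data, that builds lagged-return features from a price series and fits a tiny logistic regression to estimate the probability that the next move is up.

```python
import math

def lagged_features(prices, n_lags=3):
    """Turn a price series into (lagged returns, did-it-go-up-next?) rows."""
    returns = [(b - a) / a for a, b in zip(prices, prices[1:])]
    return [(returns[t - n_lags:t], returns[t] > 0)
            for t in range(n_lags, len(returns))]

def train_logistic(rows, epochs=200, lr=0.1):
    """Fit a tiny logistic regression by stochastic gradient descent."""
    n = len(rows[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, up in rows:
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - (1.0 if up else 0.0)      # gradient of log-loss w.r.t. z
            b -= lr * g
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w, b

def prob_up(model, lags):
    """Estimated probability that the next move is up."""
    w, b = model
    z = b + sum(wi * xi for wi, xi in zip(w, lags))
    return 1.0 / (1.0 + math.exp(-z))

# Toy usage on a synthetic momentum series (not real market data).
prices = [100.0]
for step in range(120):
    up_phase = (step // 10) % 2 == 0          # ten up days, then ten down
    prices.append(prices[-1] * (1.01 if up_phase else 0.99))
model = train_logistic(lagged_features(prices))
```

On a persistent-trend series like this, the model can pick up the momentum pattern; on real market data the signal is of course far weaker, which is exactly why months of live use are needed before the benefits show.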

BigML: What made you choose to build your models on the BigML platform?

Jean-Marc Guillard: A big part of it was the drastically faster iterative experimentation the BigML Dashboard enables, which in turn allowed us to achieve a faster time-to-market. One usually doesn’t know what the final Machine Learning workflow will look like when setting out to explore the large hypothesis space a complex problem like ours requires. So it is essential that the tools you use afford very quick and easy iterative exploration. BigML excels on this front.

In addition, the automation options available on the BigML platform let us decrease ongoing operational costs to a minimum level that can compete with passive index funds, further differentiating us from actively managed funds that rely on manual processes. Lastly, we have had phenomenal support from the BigML team throughout our evaluation, exploration and solution implementation phases.

BigML: Thanks Jean-Marc. It is very impressive to see how you have been able to ramp up your Machine Learning efforts in such a limited time period despite constrained resources. We hope stories like yours inspire many more startups to realize that they too can turn their data and know-how into sustainable competitive advantages.

For our readers’ benefit, a downloadable PDF version of the STATS4TRADE case study is also available.

Machine Learning Automation: Beware of the Hype!

There’s a lot of buzz lately around “Automating Machine Learning”.  The general idea here is that the work done by a Machine Learning engineer can be automated, thus freeing potential users from the tyranny of needing to have specific expertise.

Presumably, the ultimate goal of such automations is to make Machine Learning accessible to more people.  After all, if a thing can be done automatically, that means anyone who can press a button can do it, right?

robot-vs-human

Maybe not.  I’m going to make a three-part argument here that “Machine Learning Automation” is really just a poor proxy for the true goal of making Machine Learning useable by anyone with data.  Furthermore, I think the more direct path to that goal is via the combination of automation and interactivity that we often refer to in the software world as “abstraction”.  By understanding what constitutes a powerful Machine Learning abstraction, we’ll be in a better position to think about the innovations that will really make Machine Learning more accessible.

Automation and Interaction

I had the good fortune to attend NIPS in Barcelona this year.  In particular, I enjoyed the (in)famous NIPS workshops, in which you see a lot of high quality work out on the margins of Machine Learning research.  The workshops I attended while at NIPS were each excellent, but were, as a collection, somewhat jarringly at odds with one another.

In one corner, you had the workshops that were basically promising to take Machine Learning away from the human operator and automate as much of the process as possible.  Two of my favorites:

  • Towards an Artificial Intelligence for Data Science – What it says on the box, basically trying to turn Machine Learning back around on itself and learn ways to automate various phases of the process.  This included an overview of an ambitious multi-year DARPA program to come up with techniques that automate the entire model building pipeline from data ingestion to model evaluation.
  • Bayesopt – This is a particular subfield in Machine Learning, where we try to streamline the optimization of any parameterized process that you’d usually figure out via trial and error.  The central learning task is, given all of the parameter sets you’ve tried so far, trying to choose the next one to evaluate so that you have the best shot at finding the global maximum.  Of course, Machine Learning algorithms themselves are parameterized processes tuned by trial and error, so these techniques can be used on them.  My own WhizzML script, SMACdown, is a toy version of one of these techniques that does exactly this for BigML ensembles.
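To make that sequential decision concrete, here is a toy caricature of the loop at the heart of Bayesian optimization: given the parameter values tried so far, pick the next one by balancing the surrogate’s predicted score against uncertainty. A real implementation (SMACdown included) fits a proper surrogate model such as a Gaussian process or a random forest; the nearest-neighbor “surrogate” and distance-based exploration bonus below are crude stand-ins chosen for brevity.

```python
import random

def toy_sequential_opt(f, n_iter=30, n_candidates=200, kappa=0.5, seed=0):
    """Caricature of Bayesian optimization over one parameter in [0, 1].

    Surrogate: the observed score of the nearest already-tried point.
    Acquisition: surrogate score plus an exploration bonus that grows with
    distance from tried points (a crude stand-in for model uncertainty).
    """
    rng = random.Random(seed)
    x0 = rng.random()
    tried = [(x0, f(x0))]                      # (parameter, observed score)
    for _ in range(n_iter - 1):
        def acquisition(x):
            nearest_x, nearest_y = min(tried, key=lambda t: abs(t[0] - x))
            return nearest_y + kappa * abs(nearest_x - x)
        candidates = [rng.random() for _ in range(n_candidates)]
        next_x = max(candidates, key=acquisition)
        tried.append((next_x, f(next_x)))
    return max(tried, key=lambda t: t[1])      # best (parameter, score) found

# Toy usage: the "unknown" objective peaks at x = 0.7.
best_x, best_y = toy_sequential_opt(lambda x: -(x - 0.7) ** 2)
```

Early on the exploration bonus dominates and the loop spreads evaluations across the space; as good regions are found, exploitation clusters evaluations around them. That trade-off, not the particular surrogate, is the essence of the technique.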

In the other corner, you had several workshops on how to further integrate people into the Machine Learning pipeline, either by inserting humans into the learning process or finding more intuitive ways of showing them the results of their learning.

  • The Future of Interactive Learning Machines – This workshop featured a panoply of human-in-the-loop learning settings, from humans giving suggestions to Machine Learning algorithms to machine-learned models trying to teach humans.  There was, in particular, an interesting talk on using reinforcement learning to help teachers plan lessons for children, which I’ll reference below.
  • Interpretable Machine Learning for Complex Systems – This workshop featured a number of talks on ways to allow humans to better understand what a classifier is doing, why it makes the predictions it does, and how best to understand what data the classifier needs to do its job better.

So what is going on here?  It seems like we want Machine Learning to be automatic . . . but we also want to find ways to keep people closely involved?  It is a strange pair of ideas to hold at the same time.  Of course, people want things automated, but why do they want to stay involved, and how do those two goals co-exist?

A great little call-and-response on this topic happened between two workshops as I attended them.  Alex Wiltschko from Twitter gave an interesting talk on using Bayesian parameter optimization to optimize the performance of their Hadoop jobs (among other things) and he made a great point about optimization in general:  If there’s a way to “cheat” your objective, so that the objective increases without making things intuitively “better”, the computer will find it.  This means you need to choose your objective very carefully so the mathematical objective always matches your intuition.  In his case, this meant a lot of trial and error, and a lot of consultations with the people running the Hadoop cluster.
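Wiltschko’s point about cheatable objectives is easy to reproduce outside of Hadoop tuning. In the hypothetical sketch below (class imbalance rather than cluster scheduling), a “model” that ignores its input entirely wins on raw accuracy, while a more carefully chosen objective, balanced accuracy, exposes the cheat.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)

def balanced_accuracy(preds, labels):
    """Mean per-class recall: a constant predictor can no longer cheat."""
    recalls = []
    for c in set(labels):
        idx = [i for i, t in enumerate(labels) if t == c]
        recalls.append(sum(preds[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# 95 "normal" examples, 5 "fraud" examples, and a lazy model that never
# predicts fraud: it scores well without having learned anything.
labels = [0] * 95 + [1] * 5
lazy_preds = [0] * 100
```

Here `accuracy(lazy_preds, labels)` comes out at 0.95 while `balanced_accuracy(lazy_preds, labels)` drops to 0.5: same predictions, but only the second objective matches our intuition of “better”.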

An echo and example came from the other side of the “interactivity divide”, in the workshop on interactive learning.  Emma Brunskill had put together a system that optimized the presentation of tutorial modules (videos, exercises, and so on) being presented to young math students.  The objective the system was trying to optimize was something like the performance on a test at the end of the term.  Simple enough, right?  Except that one of the subjects being taught was particularly difficult.  So difficult that few of the tutorial modules managed to improve the students’ scores.  The optimizer, sensing this futility, promptly decided not to bother teaching this subject at all.  This answer is of course unsatisfying to the human operator; the curriculum should be a constraint on the optimization, not a result of it.

Crucially though, there’s no way the computer could know this is the case without the human operator telling it so.  And there’s no way for the human to know that the computer needs to know this unless the human is in the loop.

Herein lies the tension between interactivity and automation.

On one hand, people want and need many of the tedious and unnecessary details around Machine Learning to be automated away; often such details require expertise and/or capital to resolve appropriately and end up as barriers to adoption.

On the other, people still want and need to interact with Machine Learning so they can understand what the “Machine” has learned and steer the learning process towards a better answer if the initial one is unsatisfactory.  Importantly, we don’t need to invoke a luddite-like mistrust of technology to explain this point of view.  The reality is that people should be suspicious of the first thing a Machine Learning algorithm spits out, because the numerical objective that the computer is trying to optimize often does not match the real-world objective.  Once the human and machine agree precisely on the nature of the problem, Machine Learning works amazingly well, but it sometimes takes several rounds of interaction to generate an agreement of the necessary precision.

Said another way, we don’t need Machine Learning that is “automatic”.  We need Machine Learning that is comfortable and natural for humans to operate.  Automating away laborious details is only a small part of this process.

If this sounds familiar to those of you in the software world, it’s because we deal with this all the time.

From Automation to Abstraction

In the software world, we often speak in terms of abstractions.  A good software library or programming language will hide unnecessary details from the user, exposing only the modes of interaction necessary to operate the software in a natural way.  We say that the library or language is a layer of abstraction over the underlying software.

For those of you unfamiliar with the concept, consider the C programming language.  In C, we can write a statement like this:

x = y + 3;

The C compiler converts this statement to machine code, which requires knowing where in memory the x and y variables are stored, loading those values into registers, loading the binary value for “3” into a register, summing the values into a new register, storing the result back to x’s memory location, and so on.

The language hides machine code and registers from us so we can think in terms of operators and variables, the primitives of higher level problems.  Moreover, it exposes an interface (mathematical expressions, functions, structs, and so on) that allows us to operate the layer underneath in a way that’s more useful and natural than if we worked on the layer directly.  In this sense, the C language is a very good abstraction:  It hides many of the things we’re almost never concerned about, and exposes the relevant functionality in an easier-to-use way.

It’s helpful to think about abstractions in the same way we think about compression algorithms.  They can be “strong”, so that they hide a lot of details, or “weak” so they hide few.  They can also be “very lossy”, so that they expose a poor interface, up to “lossless”, where the interface exposed can do everything that the hidden details can do.  The devil of creating a good abstraction is rather the same as creating a good compression algorithm:  You want to hide as many unimportant details from your users as possible, while hiding as little as possible that those same users want to see or use.  The C language as an abstraction over machine code is both quite strong (hides virtually all of the details of machine code from the user) and near-lossless (you can do the vast majority of things in C that are possible directly via machine code).

The astute reader can likely see the parallel to our view of Machine Learning: we have the same sort of tension here between hiding drudgeries and complexities while still providing useful modes of interaction between tool and user.  Where, then, does “Machine Learning Automation” stand on our invented scale of abstractions?

Automations Are Often Lossy and Weak Abstractions

The problem (as I see it) with some of the automations on display at NIPS (and indeed in the industry at large) is that they are touted using the language of abstraction.  There are often claims that such software will “automate data science” or “allow non-experts to use Machine Learning”, or the like.  This is exactly what you might say about the C language; that it “automates machine code generation” or “allows people who don’t know assembly to program”, and you would be right.

As an example of why I find this a bit disingenuous, consider using Bayesian parameter optimization to tune the parameters of Machine Learning algorithms, one of my favorite newish techniques.  It’s a good idea, people in the Machine Learning community generally love it, and it certainly has the power to produce better models from existing software.  But how good of an abstraction is it, on the basis of drudgery avoided and the quality of the interface exposed?

Put another way, suppose we implemented some of these parameter optimizations on top of, say, scikit-learn (and some people have).  Now suppose there’s a user who wants to use this on data she has in a CSV file to train and deploy a model.  Here’s a sample of some of the other details she’s worried about:

  1. Installing Python
  2. How to write Python code
  3. Loading a CSV in Python
  4. Encoding categorical / text / missing values
  5. Converting the encoded data to appropriate data structures
  6. Understanding something about how the learned model makes its predictions
  7. Writing prediction code around the learned model
  8. Writing/maintaining some kind of service that will make predictions on-demand
  9. Getting a sense of the learned model’s performance

Of course, things get even more complicated at scale, as is their wont:

  1. Get access to / maintain a cluster
  2. Make sure that all cluster nodes have the necessary software
  3. Load your data onto the cluster
  4. Write cluster specific software
  5. Deal with cluster machine / job limitations (e.g., lack of memory)

This is what I mean when I say Machine Learning automations are often weak abstractions:  They hide very few details and provide little in the way of a useful interface.  They simply don’t usually make realistic Machine Learning much easier to use.  Sure, they prevent you from having to hand-fit maybe a couple dozen parameters, but the learning algorithm is already fitting potentially thousands of parameters.  In that context, automated parameter tuning, or algorithm selection, or preprocessing doesn’t seem like it’s the thing that suddenly makes the field accessible to non-experts.

In addition, the abstraction is also “lossy” under our definition above; it hides those parameters, but usually doesn’t provide any sort of natural way for people to interact with the optimization.  How good is the solution?  How well does that match the user’s internal notion of “good”?  How can you modify it to do better?  All of those questions are left unanswered.  You are expected to take the results on faith.  As I said earlier, that might not be a good idea.

A Better Path Forward

So why am I beating on Bayesian parameter optimization?  I said that I think it’s awesome and I really do.  But I don’t buy that it’s going to be the thing that brings the masses to Machine Learning.  For that, we’re going to need proper abstractions; layers that hide details like those above from the user, while providing them with novel and useful ways to collaborate with the algorithm.

This is part of the reason we created WhizzML and Flatline, our DSLs for Machine Learning workflows and feature transformation.  Yes, you do have to learn the languages to use them.  But once you do, you realize that the languages are strong abstractions over the concerns above.  Hardware, software, and scaling issues are no longer any concern as everything happens on BigML-maintained infrastructure.  Moreover, you can interact graphically with any resources you create via script in the BigML interface.

The goal of making Machine Learning easier to use by anyone is a good one.  Certainly, there are a lot of Machine Learning sub-tasks that could bear automating, and part of the road to more accessible Machine Learning is probably paved with “one-click” style automations.  I would guess, however, that the larger part is paved with abstractions; ways of exposing the interactions people want and need to have with Machine Learning in an intuitive way, while hiding unnecessary details.  The research community is right to want both automation and interactivity: If we’re clever and careful we can have it both ways!

Reflecting on 2016 to Guide BigML’s Journey in 2017

2016 has proven a whirlwind year for BigML, with substantial growth in users, customers and the team, riding on the realization by businesses and experts alike that Machine Learning has transformational power in a new economy where data is abundant but actionable insights have not kept pace with improvements in storage, computational power and cost. When things happen so fast, it can be a challenge to stop and reflect on milestones and achievements. So below are the highlights of what made 2016 a special year for BigML.

img_universities

Releases and Product Updates

In 2016, BigML users were greeted by many new capabilities that they were asking for. As a result, the platform is now more mature and versatile than ever.  Logistic Regression (Summer 2016 Release) and Topic Modeling (Fall 2016 Release) techniques beefed up existing supervised and unsupervised learning resources, while Workflow Automation with WhizzML (Spring 2016 Release) gave the platform a whole new dimension that can deliver huge productivity boosts to any analytics team in the form of reduced solution maintenance and mitigated model management risks.

Aside from those, we made many smaller but noteworthy improvements to the toolset including but not limited to: Scriptify, Objective-C bindings, Swift SDK and BigML for Alexa.

Events, Certifications and Awards

2016 has seen BigML represented at 4YFN, Machine Learning Prague, PAPIs 2016 and PAPIs Connect, Legal Management Forum, IEEE Rock Stars of Emerging Tech, WIRED, Mobey Day, Data Wrangling Automation and other industry events around the world, with a flattering reception and genuine enthusiasm that keeps pushing the team to innovate. Most notably, we have created a new and very hands-on BigML Certification Program that teaches participants how to solve practical, real-life Machine Learning problems.  The next wave starts on January 19th, 2017!

After conducting the 2nd Valencian Summer School in Machine Learning, followed by a special lecture by BigML advisor Professor Geoff Webb, BigML gave its first Brazilian Summer School in Machine Learning in São Paulo. Look for more education events to follow in 2017, as BigML has joined forces with CICE in Madrid to take its educational efforts to the next level and capitalize on the great hunger for Machine Learning among developers, analysts and scientists.

img_certifications

Although the biggest award for us is the praise we receive from our users and customers, in 2016 we were also pleased to be recognized by DIA Barcelona for best advanced analytics for insurance companies and by Zapier for BigML for Google Sheets.

Popular Posts of 2016

Some of the Machine Learning veterans on our team also made time to share their career experiences over multiple well-received posts.

img_blog

As a reprise, here is a good selection to revisit for those who would like to gain new perspectives on the current market landscape and what has worked in real-life situations, straight from the horse’s mouth.

Looking Forward to 2017

Now that awareness of Machine Learning in general, and cloud-born Machine Learning platforms in particular, has reached a critical threshold, our go-to-market strategy will double down on communicating positive examples to the entire community rather than explaining “Why Machine Learning Matters” to the uninitiated.  In that regard, we must also thank Google, Apple, Uber, Airbnb, Facebook, Amazon and Microsoft for putting Machine Learning squarely in the business lexicon.

img_events2

In 2017, we also intend to intensify our educational efforts that promote learning by doing, while expanding the breadth and depth of capabilities to enable Agile Machine Learning at any organization in any industry. A big part of this will manifest itself through our active participation in technology events. We are kicking off the year with a trio of events, where BigML speakers will be on stage:

  • Anomaly Detection: Principles, Benchmarking, Explanation, and Theory

    Anomaly detection algorithms are widely applied in data cleaning, fraud detection, and cybersecurity. This talk will begin by defining various anomaly detection tasks and then focus on unsupervised anomaly detection. It will present a benchmarking study comparing eight state-of-the-art methods. Then it will discuss methods for explaining anomalies to experts and incorporating expert feedback into the anomaly detection process. The talk will conclude with a theoretical (PAC-learning) framework for formalizing a large family of anomaly detection algorithms based on discovering rare patterns.

    Speaker: Thomas G. Dietterich, Co-Founder and Chief Scientist.

  • FiturtechY

    FiturtechY is an event organized by the Instituto Tecnológico Hotelero (ITH), where innovation and technology meet to improve the tourism industry. FiturtechY will host four forums to discuss different topics: business, destiny, sustainability, and trends. BigML will be presenting at #techYnegocio forum, the meeting point for those professionals who seek to learn the latest tools that help revolutionize the tourism industry.

    Speaker: Dario Lombardi, VP of Predictive FinTech.

  • Computer-Supported Cooperative Work and Social Computing

    CSCW is the premier venue for presenting research in the design and use of technologies that affect groups, organizations, communities, and networks. Bringing together top researchers and practitioners from academia and industry, CSCW explores the technical, social, material, and theoretical challenges of designing technology to support collaborative work and life activities.

    Speaker: Poul Petersen, Chief Infrastructure Officer.

We also intend to put together the first BigML User Conference later in the year. So stay tuned for further event updates.

Hope this post gave a good crash course (especially for those of you who have recently joined BigML) on what’s been happening around our neck of the woods. Powered by your support, we’re hungrier than ever to bring to the market the best Machine Learning software platform there ever was. We’d also highly encourage you to take a look at our 2017 predictions, which will guide our roadmap in the remainder of the year.  As always, be sure to reach out to us with your ideas no matter how crazy they seem!

10 Offbeat Predictions for Machine Learning in 2017

As each year wraps up, experts pull their crystal balls from their drawers and start peering into them for a glimpse of what’s to come in the next one. At BigML, we have been following such clairvoyance carefully this past holiday season to compare and contrast with our own take on what 2017 will have in store, which may come across as quite unorthodox to some experts out there.

Enterprise Machine Learning Predictions Nobody is Talking About

For the TL;DR crowd, our crystal ball is showing us a cloudy (no pun intended) 2017 Machine Learning market forecast with some sunshine behind the clouds for good measure. To put it more directly, enterprises need to look beyond the AI hype for practical ways to incorporate Machine Learning into their operations. This starts with the right choice of internal platform, one that helps them build on smaller, low-hanging-fruit projects that leverage their proprietary datasets. In due time, those projects add up to create positive feedback effects that not only introduce decision automation on the edges, but ultimately help agile Machine Learning teams transform their industries.

Jumping back to our regularly scheduled programming, let’s start with a quick synopsis of the road traveled so far:

  • Machine Learning is already set on an irreversible path toward becoming impactful (VERY impactful) on how we do our jobs across many sectors, eventually touching the whole economy.

Machine Learning Use Cases by Industry

  • But digesting, adopting and profiting from 36 years of Machine Learning advances and best practices has been a very bumpy ride that few businesses have managed to navigate so far.

  • There are many “New Experts” who read a couple of books or take a few online classes and are suddenly in a position to “alter” things just because they have access to cheap capital. While top technology companies have been “collecting” as much experienced Machine Learning talent as possible to get ready for the up-and-coming AI economy, other businesses are at the mercy of Machine Learning-newbie investors and inexperienced recent graduates with unicorn ambitions. It is wishfully assumed that versatile, affordable and scalable solutions based on a magical new algorithm will materialize out of these ventures.

  • In 2017, we suspect the ecosystem is going to start converging around the right approach, albeit after some otherwise avoidable roadkill.

Before we get to the specific predictions, we must note that 2016 was special in that it presented a watershed event: for the first time in history, the planet’s five most valuable companies are all technology companies. All five share the common traits of large-scale network effects, highly data-centric company cultures and new economic value-added services built atop sophisticated analytics. What’s more, they have been heavily publicizing their intent to make Machine Learning the fulcrum of their future evolution. With the addition of revenue-generating unicorns like Uber and Airbnb, the dominance of the tech sector, which stands to benefit immensely from the wholesale digitization of the world economy, is likely to continue in the coming years.

Changing of the Guard?

However, the trillion-dollar question is how legacy companies (i.e., non-tech firms with rich data, plus smaller technology companies) can counteract and become an integral part of the newly forming value chains so as to not only survive, but thrive in the remainder of the decade. Today, these firms are stuck with rigid, rear-view-mirror business intelligence systems and archaic, workstation-based traditional statistical systems running simplistic regression models that fail to capture the complexity of many real-life predictive use cases.

At the same time, they sit on growing heaps of hard-to-replicate proprietary datasets that go underutilized. The latest McKinsey Global Institute report, The Age of Analytics: Competing in a Data-driven World, reveals that less than 30% of the potential of modern analytics technologies outlined in their 2011 report has been realized, and that is without counting the new opportunities made possible by the advent of the same technologies in the last five years. To make matters worse, progress looks very unbalanced across industries (i.e., as low as 10% in U.S. Healthcare vs. up to 60% in the case of Smartphones) at a time when analytics prowess is correlated with competitive differentiation more than ever.

Machine Learning Industry Adoption

Even if it may be hidden behind the polished marketing speak pushed by major vendors and research firms (e.g., “Cognitive Computing”, “Machine Intelligence” or even the doomsday-like “Smart Machines”), the Machine Learning genie is without a doubt out of the bottle, as its wide-ranging potential across the enterprise has already made it part of the business lexicon. This newfound appetite for all things Machine Learning means many more legacy firms and startups will begin their Machine Learning journeys in 2017. The smart ones will separate themselves from the bunch by learning from others’ mistakes. Nonetheless, some old bad habits are hard to kick cold turkey, so let’s dive in with some gloomier predictions and end on a higher note:

  • PREDICTION #1:

    “Big Data” soul searching leads to the gates of #MachineLearning.

    Tweet: PREDICTION#1: “Big Data” soul searching leads to the gates of #MachineLearning. @BigMLcom https://ctt.ec/a1fa8+

  • The soul searching in the “Big Data” movement will continue as experts recognize the level of technical complexity that aspiring companies must navigate to piece together useful “Big Data” solutions that fit their needs. At the end of the day, “Big Data” is tomorrow’s data but nothing else. The recent removal of the “Big Data” entry from the Gartner Hype Cycle is further testament to the same realization. All this will only hasten the pivot to analytics, and specifically to Machine Learning, as the center of attention so as to recoup the sunk costs from those projects via customer-facing smart applications. Moreover, the much-maligned sampling remains a great tool to rapidly explore new predictive use cases that will support such applications.

Big Data vs. Machine Learning Trends

  • PREDICTION #2:

    VCs investing in algorithm-based startups are in for a surprise.

    Tweet: PREDICTION #2: VCs investing in algorithm-based startups are in for a surprise. @BigMLcom https://ctt.ec/r3SnA+

  • The education process of VCs will continue, albeit slowly and through hard lessons. They will keep investing in algorithm-based startups with marketable academic founder résumés, while perpetuating myths and creating further confusion, e.g., portraying Machine Learning as synonymous with Deep Learning, or completely misrepresenting the differences between Machine Learning algorithms and Machine-learned models, or between model training and predicting from trained models1. A deeper understanding of the discipline with the proper historical perspective will remain elusive in the majority of the investment community, which is on the lookout for quick blockbuster hits. On a slightly more positive note, a small subset of the VC community seems to be waking up to the huge platform opportunity Machine Learning presents.

    Benedict Evans on Machine Learning

  • PREDICTION #3:

    #MachineLearning talent arbitrage will continue at full speed.

    Tweet: PREDICTION #3: #MachineLearning talent arbitrage will continue at full speed. @BigMLcom https://ctt.ec/8Q43c+

  • The media frenzy around AI and Machine Learning will continue at full steam, as exemplified by Rocket AI-type parties, where young academics will be courted and ultimately funded by the aforementioned investors. The ensuing portfolio companies will find it hard to compete on algorithms, as few algorithms are really widely useful in practice, although some do slightly better than others on very niche problems. Most will be cast as brides at shotgun weddings with corporate development teams looking to beef up on Machine Learning talent strictly for internal initiatives. In some nightmare scenarios, the acquirers will have no clear analytics charter, yet they will be in a frantic hunt to grab headlines to generate the illusion that they too are on the AI/Machine Learning bandwagon.

Machine Learning Talent Arbitrage

  • PREDICTION #4:

    Top-down #MachineLearning initiatives built on PowerPoint slides will end with a whimper.

    Tweet: PREDICTION #4: Top down #MachineLearning initiatives built on Powerpoint slides will end with a whimper. @BigMLcom https://ctt.ec/_I589+

  • Legacy company executives who opt for getting expensive help from consulting companies in forming their top-down analytics strategy and/or making complex “Big Data” technology components work together, before doing their homework on low-hanging predictive use cases, will find that actionable insights and game-changing ROI are hard to show. This is partially due to the requirement to have the right data architecture and flexible computing infrastructure already in place, but more importantly, outperforming 36 years of collective achievements by the Machine Learning community with some novel approach is just a tall order, regardless of how cheap computing has become.

Top Down Data Science Consulting Fail

  • PREDICTION #5:

    #DeepLearning commercial success stories will be few and far between.

    Tweet: PREDICTION #5: #DeepLearning commercial success stories will be few and far in between. @BigMLcom https://ctt.ec/8f0ac+

  • Deep Learning’s notable research achievements, such as the AlphaGo challenge, will continue generating media interest. Nevertheless, its advances in certain practical use cases, such as speech recognition and image understanding, will be the real drivers for it to find a proper spot in the enterprise Machine Learning toolbox alongside other proven techniques. Interpretability issues, a dearth of experienced specialists, its reliance on very large labeled training datasets and significant computational resource provisioning will limit mass corporate adoption in 2017. In its current form, think of it as the Polo of Machine Learning techniques: a fun time perhaps, one that will let you rub elbows with the rich and famous, provided that you can afford a well-trained horse, the equestrian services and upkeep, the equipment and a pricey club membership to go along with those; nevertheless, not quite an Olympic sport. So short of a significant research breakthrough in the unsupervised flavors of Deep Learning, most legacy companies experimenting with Deep Learning are likely to conclude that they can get better results faster if they pay more attention to areas like Reinforcement Learning and bread-and-butter Machine Learning techniques such as ensembles.

Deep Learning Hype

  • PREDICTION #6:

    Exploration of reasoning and planning under uncertainty will pave the way to new #MachineLearning heights.

    Tweet: PREDICTION #6: Exploration of reasoning and planning under uncertainty will pave the way to new #MachineLearning heights. @BigMLcom https://ctt.ec/1GAi3+

  • Of course, Machine Learning is only a small part of AI. More attention to research, and to the resulting applications from startups, in the fields of reasoning and planning under uncertainty, and not only learning, will help cover truly new ground beyond the better understood pattern recognition. Not surprisingly, Facebook’s Mark Zuckerberg has reached similar conclusions in his assessment of the state of AI/Machine Learning after spending nearly a year coding his intelligent personal assistant “Jarvis”, loosely modeled after its namesake in the Iron Man series.

Mark Zuckerberg's Jarvis AI

  • PREDICTION #7:

    Humans will still be central to decision making despite further #MachineLearning adoption.

    Tweet: PREDICTION #7: Humans will still be central to decision making despite further #MachineLearning adoption. @BigMLcom https://ctt.ec/iBjl8+

  • Some businesses will see early shoots of faster, evidence-based decision making powered by Machine Learning; however, humans will still be central to the decision making. Early examples of smart applications will emerge in certain industry pockets, adding to the uneven distribution of capabilities due to differences in regulatory frameworks, innovation management approaches, competitive pressures, end customer sophistication and demand for higher quality experiences, as well as conflicting economic incentives in some value chains. Despite the talk about the upcoming singularity and robots taking over the world, cooler heads in the space point out that it will take a while to create truly intelligent systems. In the meantime, businesses will slowly learn to trust models and their predictions as they realize that algorithms can outperform humans in many tasks.

Humans vs. Machine Intelligence

  • PREDICTION #8:

    Agile #MachineLearning will quietly take hold beneath the cacophony of AI marketing speak.

    Tweet: PREDICTION #8: Agile #MachineLearning will quietly take hold beneath the cacophony of AI marketing speak. @BigMLcom https://ctt.ec/5eO8B+

  • A more practical and agile approach to adopting Machine Learning will quietly take hold next year. Teams of doers not afraid to get their hands dirty with unruly yet promising corporate data will completely bypass the “Big Data” noise and carefully pick low-hanging predictive problems that they can solve with well-proven algorithms in the cloud, using smaller sampled datasets with a favorable signal-to-noise ratio. As they build confidence in their abilities, the desire to deploy what they have built into production, as well as to add more use cases, will mount. No longer bound by data access issues or complex, hard-to-deploy tools, these practitioners will not only start improving their core operations but also start thinking about predictive use cases with higher risk-reward profiles that can serve as enablers of brand new revenue streams.

Lean, Agile Data Science Stack

  • PREDICTION #9:

    MLaaS platforms will emerge as the “AI-backbone” for enterprise #MachineLearning adoption by legacy companies.

    Tweet: PREDICTION #9: MLaaS platforms will emerge as the “AI-backbone” for enterprise #MachineLearning adoption by legacy companies. @BigMLcom https://ctt.ec/6RuU9+

    MLaaS platforms will emerge as the “AI Backbone” accelerating the adoption of Agile Machine Learning practices. Consequently, commercial Machine Learning will get cheaper and cheaper thanks to a new wave of applications built on MLaaS infrastructure. Cloud Machine Learning platforms in particular will democratize Machine Learning by

    • significantly lowering costs by eliminating complexity or front-loaded vendor contracts
    • offering preconfigured frameworks that package the most effective algorithms
    • abstracting the complexities of infrastructure setup and management from the end user
    • providing easy integration, workflow automation and deployment options through REST APIs and bindings.

Machine Learning Platforms for Developers

  • PREDICTION #10:

    Data Scientists or not, more Developers will introduce #MachineLearning into their companies.

    Tweet: PREDICTION #10: Data Scientists or not, more Developers will introduce #MachineLearning into their companies. @BigMLcom https://ctt.ec/LCRKX+

  • 2017 will be the year when developers start carrying the Machine Learning banner, easing the talent bottleneck for the thousands of businesses that cannot compete with the Googles of the world in attracting top research scientists with over a decade of experience in AI/Machine Learning. Besides, such experience doesn’t automagically translate into smart business applications that deliver business value. Developers will start rapidly building and scaling such applications on MLaaS platforms that abstract away painful details (e.g., cluster configuration and administration, job queuing, monitoring, distribution etc.) that are better kept underground in the plumbing. Just as developers need a well-designed and well-documented API rather than knowledge of what an LR(1) Parser is to compile and execute their Java code, they don’t need to know what Information Gain or the Wilson Score are to solve a predictive use case based on a decision tree.

Developer-driven Machine Learning

We are still in the early innings of “The Age of Analytics”, so there is much more to feel excited about than to dwell on bruises from past false starts. Here’s to keeping calm and carrying on with this exciting endeavor that will take business as we know it through a storm by perfecting the alchemy between mathematics, software and management best practices. Happy 2017 to you all!

1: The A16Z presenter seems to think every self-driving car has to learn what a stop sign is by itself, thus reinventing the wheel many times over instead of relying on tons of historical sensor data from an entire fleet of such vehicles. In reality, few Machine Learning use cases require a continuously trained model (e.g., handwriting recognition).

Fourth Edition of the Startup Battle at 4YFN

Four Years From Now, the startup business platform of Mobile World Congress that enables startups, investors and corporations to connect and launch new ventures together, goes to Barcelona, Spain, from February 27 to March 1, 2017. We could not think of a better context to run the fourth edition of our series of Startup Battles.

Telefónica has invited PreSeries, the joint venture between Telefónica Open Future_ and BigML, to participate in the 4YFN event and showcase its early stage venture investing platform on the main stage on February 28 in front of an audience of over 500 technologists. In a nutshell, PreSeries provides insights and many other metrics to help investors make objective, data-driven decisions when investing in tomorrow’s leading technology companies.

battle_-4yfn

In rapid-fire execution mode, Valencia was the first city to witness the world premiere of the Artificial Intelligence Startup Battle on March 15, 2016. On October 12, the PreSeries Team travelled to Boston to celebrate the second edition at the PAPIs ‘16 conference. Less than two months later, we celebrated the third edition in São Paulo as part of BSSML16. The fourth edition of our series of startup battles will be hosted in Barcelona, Spain. The distinguished audience and press members in Catalonia will discover how an algorithm is able to predict the success of a startup without any human intervention.

To recap the process, five startups from the Wayra Academy, Telefónica’s startup accelerator, will present their projects on stage through five-minute, to-the-point pitches. Afterwards, PreSeries will ask each contender a number of questions in order to produce a score between 0 and 100. The startup with the highest score wins the battle. The opportunity to participate in the battle is key for the startups, as it gives them excellent exposure to potential corporate sponsors, strategic partners and the venture investment community. Stay tuned for future announcements, where we will reveal the contenders of the fourth edition of our Startup Battle, as it may just prove to be the most competitive one so far.

Even Easier Machine Learning for Every Day Tasks


Recently, the “Machine Learning for Everyday Tasks” post caught our attention as it suddenly rose to the top of Hacker News. In that post, Sergey Obukhov, a software developer at the San Francisco-based startup Mailgun, tries to debunk the myth that Machine Learning is a hard task:

I have always felt like we can benefit from using machine learning for simple tasks that we do regularly.

This certainly rings true to our ears at BigML, since we aim to democratize Machine Learning by providing developers, scientists, and business users with powerful higher-level tools that package the most effective algorithms Machine Learning has produced in the last two decades.

In this post, we are going to show how much easier it is to solve the same problem tackled in the Hacker News post by using BigML. To this end, we have created a test dataset with similar properties to the one used in the original post so that we can replicate the same steps with analogous results.

Predicting processing time

The objective of this analysis is predicting how long it will take to parse HTML quotations embedded within e-mail conversations. Most messages are processed in a very short time, while some of them take much longer. Identifying those lengthier messages in advance is useful for several purposes, including load-balancing and giving more precise feedback to users.

Our analysis is based on a CSV file containing a number of fictitious track records of our system performance when handling email messages:

HTML length, Tag count, Processing Time

We would like to classify a given incoming message as either slow or fast given its length and tag count based on previously collected data.

Finding a definition for slow and fast through clustering

The first step in our analysis is defining what slow and fast actually mean. The approach in the original post is clustering, which identifies groups of relatively homogeneous data points. Ideally, we would hope that this algorithm is able to collect all slow executions in one cluster and all fast executions in another.

In the original post, the author has written a small program to calculate the optimal number of clusters. Then he uses that number as a parameter to actually build the clusters.
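The clustering idea can be illustrated with a minimal, self-contained sketch in plain Python (hypothetical data; the actual BigML clustering works on multi-dimensional data and can estimate the number of clusters for you): a one-dimensional 2-means pass over processing times that separates fast executions from slow ones.

```python
# Minimal 1-D k-means sketch (k=2) separating fast from slow
# processing times. Illustrative only; BigML's clustering handles
# the general multi-dimensional case.

def two_means_1d(values, iterations=20):
    """Cluster numbers into two groups around two centroids."""
    centroids = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # Assign each value to its nearest centroid.
            nearest = 0 if abs(v - centroids[0]) <= abs(v - centroids[1]) else 1
            groups[nearest].append(v)
        centroids = [sum(g) / len(g) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids, groups

# Hypothetical processing times in seconds: mostly fast, a few slow.
times = [0.2, 0.4, 0.3, 0.5, 0.1, 18.0, 22.5, 19.3]
centroids, groups = two_means_1d(times)
print(sorted(groups[0]))  # the "fast" cluster
print(sorted(groups[1]))  # the "slow" cluster
```

With clearly separated data like this, the two centroids settle on the fast and slow groups after a few iterations.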

The task of estimating the number of clusters to look for is so common that BigML provides a ready-made WhizzML script that does exactly that: Compute best K-means Clustering. Alternatively, BigML also provides the G-means algorithm for clustering, which is able to automatically identify the optimal number of clusters. For our analysis, we will use the Compute best K-means Clustering script, following these steps:

  1. Create a dataset in BigML from your CSV file
  2. Execute the Compute best K-means clustering script using that dataset.

We can carry out those steps in a variety of ways, including:

  • Using the BigML Dashboard, which makes it really easy to investigate a problem and build a Machine Learning solution for it in a point-and-click manner.
  • Writing a program that uses BigML REST API and the proper bindings that BigML provides for a number of popular languages, such as Python, Node.js, C#, Java, etc.
  • Using bigmler, a command-line tool, which makes it easier to automate ML tasks.
  • Using WhizzML, a server-side DSL (Domain Specific Language) that makes it possible to extend the BigML platform with your custom ML workflows.

We are going to use the BigML bindings for Python as follows:

import webbrowser
from bigml.api import BigML

api = BigML()

source = api.create_source('./post-data.csv')
api.ok(source)
dataset = api.create_dataset(source)
api.ok(dataset)
print("dataset: " + dataset['resource'])
execution = api.create_execution('script/57f50fb57e0a8d5dd200729f',
                                 {"inputs": [
                                     ["dataset", dataset['resource']],
                                     ["k-min", 2],
                                     ["k-max", 10],
                                     ["logf", True],
                                     ["clean", True]
                                 ]})
api.ok(execution)
best_cluster = execution['object']['execution']['result']
webbrowser.open("https://bigml.com/dashboard/" + best_cluster)

The result tells us that we have:

  • Two clusters (green and orange) that contain definitely slow instances.

post-2

post-3

  • The blue cluster includes the majority of instances, both fast and not-so-fast, as its statistical distribution in the cluster detail panel indicates:

post-1

Seemingly, our threshold to distinguish fast tasks from slow tasks points to the green cluster.

At this point, the original post gives up on using clustering as a means to determine a sensible threshold, and reverts to plotting time percentiles against tag count. Luckily for them, the percentile distribution shows a nice bubbling up at the 78th percentile, but in general this kind of analysis may not always yield such obvious distributions. As a matter of fact, detecting such abnormalities can be even harder with multidimensional data.
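To make the percentile idea concrete, here is a small sketch (plain Python with hypothetical data, not the original post's code) that scans consecutive percentiles for the largest jump, which is one way to locate such a "bubbling up" point automatically:

```python
import statistics

# Hypothetical processing times: mostly fast, with slow outliers.
times = [0.5] * 88 + [20.0] * 12

# 100-quantiles give the 1st..99th percentile cut points.
pct = statistics.quantiles(times, n=100)

# Find the pair of consecutive percentiles with the largest jump.
jumps = [(pct[i + 1] - pct[i], i + 1) for i in range(len(pct) - 1)]
biggest_jump, at_percentile = max(jumps)
print(at_percentile)  # the percentile where the distribution takes off
```

Note that this one-dimensional trick only works when the distribution has an obvious knee, which is exactly the limitation pointed out above.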

BigML makes it very simple to further investigate the properties of the above green cluster. We can simply create a dataset including only the data instances belonging to that cluster and then build a simple model to better understand its characteristics:

centroid_dataset = api.create_dataset(best_cluster, { 'centroid' : '000000' })
api.ok(centroid_dataset)

centroid_model = api.create_model(centroid_dataset)
api.ok(centroid_model)

webbrowser.open("https://bigml.com/dashboard/" + centroid_model['resource'])

This, in turn, produces the following model:

post-4

If you inspect the properties of the tree nodes, you can see that the tree quickly splits into two subtrees, with all nodes in the left-hand subtree having processing times lower than 14.88 sec and all nodes in the right-hand subtree having processing times greater than 16.13 sec.

post-5

This suggests that a good choice for the threshold between fast and slow can be approximately 15.5 sec.

If we follow the same steps as in the original post and apply the percentile analysis to our data instances, we arrive at the following distribution:

percentiles

This distribution clearly starts growing faster between the 88th and 89th percentiles, confirming our choice of threshold:

table

To summarize, we have found a comparable result by applying a much more generalizable analysis approach.

Feature engineering

With the proper threshold identified, we can mark all data instances with running times lower than 15.5 as fast and the rest as slow. This is another task that BigML can tackle easily via its built-in feature engineering capabilities on BigML Dashboard:

post-10

Alternatively, we can do the same in Python:

extended_dataset = api.create_dataset(dataset, {
    "new_fields" : [
        { "field" : '(if (< (f "time") 15.5) "fast" "slow")',
          "name" : "processing_speed" }]})

webbrowser.open("https://bigml.com/dashboard/" + extended_dataset['resource'])

Which produces the following dataset:

post-11
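For quick local experimentation, the same labeling rule can be sketched in plain Python (a hypothetical helper mirroring the Flatline expression above, not part of the BigML bindings):

```python
# Local sketch of the Flatline rule: (if (< (f "time") 15.5) "fast" "slow")
FAST_SLOW_THRESHOLD = 15.5  # seconds, derived from the cluster analysis

def processing_speed(time_in_seconds):
    """Label an instance the same way the new dataset field does."""
    return "fast" if time_in_seconds < FAST_SLOW_THRESHOLD else "slow"

print(processing_speed(3.2))   # -> fast
print(processing_speed(18.7))  # -> slow
```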

Predicting slow and fast instances

Once we have all our instances labeled as fast or slow, we can finally build a model to predict whether an unseen instance will be fast or slow to process. The following code creates the model from our extended_dataset:

extended_dataset = api.create_dataset(dataset, {
    "excluded_fields" : ["000002"],
    "new_fields" : [
        { "field" : '(if (< (f "time") 15.5) "fast" "slow")',
          "name" : "processing_speed" }]})
api.ok(extended_dataset)
webbrowser.open("https://bigml.com/dashboard/" + extended_dataset['resource'])

Notice that we excluded the original time field from our model, since we now rely on our new feature that distinguishes fast instances from slow ones. This step yields the following result, which shows a nice split between fast and slow instances at around 9,589 tags (let’s call this _MAX_TAGS_COUNT):

post-12

Admittedly, our example here is pretty trivial. As was the case in the original post, our prediction boils down to this conditional:

_MAX_TAGS_COUNT = 9589  # the split point found by the model above

def html_too_big(s):
    return s.count('<') > _MAX_TAGS_COUNT

But what if our dataset were more complex and/or the prediction involved more intricate calculations? This is another situation where using a Machine Learning platform such as BigML provides an advantage over an ad-hoc solution. With BigML, predicting is just a matter of calling another function provided by our bindings:

from bigml.model import Model

final_model = "model/583dd8897fa04223dc000a0c"
prediction_model = Model(final_model)
prediction1 = prediction_model.predict({
    "html length": 3000,
    "tag count": 1000})
prediction2 = prediction_model.predict({
    "html length": 30000,
    "tag count": 500})

What’s more, predictions are fully local, which means no access to BigML servers is required!

Conclusion

Machine Learning can be used to solve everyday programming tasks. There are certainly different ways to do that, including tools like R and various Python libraries. However, those options have a steeper learning curve: one must master the details of the algorithms inside as well as the glue code to make them work together. One must also take into account the need to maintain such glue code, which can result in considerable technical debt.

BigML, on the other hand, provides practitioners with all the tools of the trade in one place, in a tightly integrated fashion. BigML covers a wide range of analytics scenarios including initial data exploration, fully automated custom Machine Learning workflows, and production deployment of those solutions on large-scale datasets. A BigML workflow that solves a predictive problem can be easily embedded into a data pipeline, which, unlike R or Python libraries, does not require any desktop computational resources and can be reproduced on demand.

Predicting the Publication Year of NIPS Papers using Topic Modeling

The Neural Information Processing Systems (NIPS) conference is one of the most important events in Machine Learning. It receives hundreds of papers from researchers all over the world each year. On the occasion of the NIPS conference held in Barcelona last week, Kaggle published a dataset containing all NIPS papers between 1987 and 2016.

We found it an excellent opportunity to put BigML’s latest addition in practice: Topic Models.

Assuming that paper topics evolve gradually over the years, our main goal in this post will be to predict the decade in which the papers were published by using their topics as inputs. Then, by examining the resulting model, we can get a rough idea of which research topics are popular now, but not in the past, and vice-versa.

We will accomplish this in four steps: first, we will transform the data into a Machine Learning-ready format; second, we will create a Topic Model and inspect the results; then, we will add the resulting topics as input fields; finally, we will build the predictive model using the decade as the objective field.

1. The Data

We start by uploading the CSV file “papers” to BigML. BigML creates a source while automatically recognizing the field types and showing a sample of the instances so we can check that the data has been processed correctly. As you can see below, the source contains the title, the authors, the year, the abstracts and the full text for each paper.

source.png

Notice that BigML supports compressed files such as .zip files, so we don’t need to decompress the source file first. Moreover, BigML automatically performs some text analysis that also aids Topic Models (e.g., tokenization, stemming, stop word and case sensitivity handling) so you don’t need to worry about any text pre-processing. You can read more about the text options for Topic Models here.

When the source is ready, we can create a dataset, which is a structured version of your data interpretable by a Machine Learning model. We do this by using the 1-click option shown below.

1-click dataset.png

Since we want to predict the publication decade of the NIPS papers, we need to transform the “year” into a categorical field. This field will include four different categories: 80s, 90s, 2000s and 2010s. We can easily do so by clicking the option “Add fields to dataset”.

add_fields.png

Then we need to select “Lisp s-expression” and use the Flatline editor to calculate the decade using the “year” field. We will not cover all the steps to create a field using the Flatline editor, but you can find a detailed explanation in Section 9.1.7 of the datasets document.

flatline.png

The formula we need to insert contains several “If…” statements to group years into decades:

(if (< (f "year") 1990) "80s" (if (< (f "year") 2000) "90s" (if (< (f "year") 2010) "2000s" "2010s")))
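For reference, the nested Flatline conditionals above map to ordinary Python as follows (a local sketch for clarity; the actual field is computed by BigML):

```python
# Python equivalent of the nested Flatline "if" expression above.
def decade(year):
    if year < 1990:
        return "80s"
    if year < 2000:
        return "90s"
    if year < 2010:
        return "2000s"
    return "2010s"

print(decade(1987))  # -> 80s
print(decade(2016))  # -> 2010s
```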

When the new field is created, we can find it as the last field in the dataset. By mousing over the histogram, we can see the different decades:

dataset2.png

2. Discovering the Topics Underlying the NIPS Papers

Creating a Topic Model in BigML is very easy. You can either use the 1-click option or you can configure the parameters. To discover the topics for the NIPS papers, we are going to configure the following parameters:

  • Number of top terms: by default, BigML shows the top 10 terms per topic. We prefer to set a higher limit this time, up to 30 terms, so we have more terms from which to glean the topic themes.
  • Bigrams: we include bigrams in the Topic Model vocabulary since we expect the NIPS papers to contain many of them, e.g., neural networks, reinforcement learning or computer vision.
  • Excluded terms: we exclude terms such as numbers and variables since they are not significant in delimiting the papers’ thematic boundaries over time and can generate some noise.
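To see what including bigrams means in practice, here is a tiny tokenization sketch (plain Python; BigML's own text analysis also handles stemming, stop words and more):

```python
# Minimal sketch: extract unigrams and bigrams from a snippet of text.
def unigrams_and_bigrams(text):
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

print(unigrams_and_bigrams("deep neural networks"))
# -> ['deep', 'neural', 'networks', 'deep neural', 'neural networks']
```

Treating "neural networks" as a single vocabulary term lets the Topic Model pick up phrases that individual words would miss.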

topic model conf.png

When the Topic Model is created, you can inspect the topic terms using two different visualizations: the topic map and the term chart. See both in the images below.

topics

chart-bar

You can see the resulting Topic Model and play with the associated BigML interactive visualizations here!

The discovered topics provide a nice overview of most of the major subtopics in Machine Learning research, and we’ve renamed them to make them readable at a glance. In the “north” of the topic model map, we have topics related to Bayesian and probabilistic modeling, along with text/speech processing and computer vision, which represent domains where those techniques are popular. In the “south”, we get the topics that are heavily tilted towards matrix mathematics, including PCA and the specification of multivariate Gaussian probabilities. In the “west”, we have supervised learning and optimization, with topics containing theorem-proving terms along with various occurrences of numbers in this quadrant. In the “east”, we have two rather isolated topics corresponding to data structures, specifically trees and graphs. Finally, in the center of the map, we have topics that occur across every discipline: general AI terms (like “robot”), people talking about the real-world domain that they’re working in, and acknowledgements for collaborators and funding.

With the topics discovered, let’s try to predict the topic distribution for a new document.  A good way to visually analyze the Topic Model predictions is to use BigML Topic Distributions. You can use the corresponding option within the 1-click menu:

topic dist.png

A form containing the fields used to create the Topic Model will be displayed so you can insert any text and get the topic distributions.

We input the following data for our first Topic Distribution:

  • Title: “Deep Learning Models of the Retinal Response to Natural Scenes”
  • Abstract: “A central challenge in sensory neuroscience is to understand neural computations and circuit mechanisms that underlie the encoding of ethologically relevant, natural stimuli. (..) Here we demonstrate that deep convolutional neural networks (CNNs) capture retinal responses to natural scenes nearly to within the variability of a cell’s response, and are markedly more accurate than linear-nonlinear (LN) models and Generalized Linear Models (GLMs). (…) the injection of latent noise sources in intermediate layers enables our model to capture the sub-Poisson spiking variability observed in retinal ganglion cells. (..) Overall, this work demonstrates that CNNs not only accurately capture sensory circuit responses to natural scenes, but also can yield information about the circuit’s internal structure and function.”

The resulting topics, in order of importance, include: Human Visual System (22.15%), Neurobiology (19.89%), Neural Networks (10.77%), Human Cognition (8.72%), Computer Vision (6.96%), and Noise (5.17%), among others with lower probabilities. You can see the resulting probability histogram in the image below:

topic dist 1.png

After making several predictions for different papers, we’re pretty confident that the predictions map fairly well to the judgments a human expert might make. Give it a try for yourself with this Topic Model link!
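Under the hood, BigML’s Topic Models are built on Latent Dirichlet Allocation (LDA). For readers curious about what is happening when a topic distribution is inferred, here is a toy collapsed Gibbs sampler in pure Python. This is a didactic sketch of the technique, not BigML’s actual implementation; the corpus, hyperparameters (`alpha`, `beta`), and iteration count are all illustrative.

```python
# Toy collapsed Gibbs sampler for LDA -- a sketch of the technique behind
# BigML Topic Models, not BigML's implementation.
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})          # vocabulary size
    doc_topic = [defaultdict(int) for _ in docs]   # n(d, k)
    topic_word = [defaultdict(int) for _ in range(n_topics)]  # n(k, w)
    topic_count = [0] * n_topics                   # n(k)
    assignments = []
    # Random initial topic assignment for every word occurrence
    for d, doc in enumerate(docs):
        z = []
        for w in doc:
            t = rng.randrange(n_topics)
            z.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_count[t] += 1
        assignments.append(z)
    # Gibbs sweeps: resample each word's topic from its conditional
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assignments[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_count[t] -= 1
                # P(topic k) ∝ (n(d,k)+alpha) * (n(k,w)+beta) / (n(k)+V*beta)
                weights = [(doc_topic[d][k] + alpha) *
                           (topic_word[k][w] + beta) / (topic_count[k] + V * beta)
                           for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_count[t] += 1
    # Smoothed per-document topic distributions
    dists = []
    for d, doc in enumerate(docs):
        total = len(doc) + n_topics * alpha
        dists.append([(doc_topic[d][k] + alpha) / total for k in range(n_topics)])
    return dists

# Tiny made-up corpus standing in for the NIPS abstracts
docs = [
    "neural network learns weights".split(),
    "bayesian inference probabilistic model".split(),
    "network weights gradient descent".split(),
    "probabilistic bayesian graphical model".split(),
]
for dist in lda_gibbs(docs, n_topics=2):
    print([round(p, 2) for p in dist])  # one probability per topic, summing to 1
```

On a corpus this small the topics are noisy, but the same machinery, scaled up and carefully engineered, is what produces the topic distributions shown in the histogram above.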

3. Including Topics as Input Fields

At this point, we know that the resulting topics are consistent and the model satisfactorily calculates the different Topic Distributions for the papers. Now, we can try using the topic distributions to predict when each paper was written.

In order to incorporate the different Topic Distributions for all the papers in the dataset, we need to click on the “Batch Topic Distribution” option and select the dataset that contains the field “decade” (the field we created in the first step above).

topic dist 2.png

When the Batch Topic Distribution is created, we can find the resulting dataset containing all the topic distributions as fields.

topic dist 3.png

4. Predicting a Paper’s Decade

Finally, we are ready to build a model to predict any paper’s decade using the topics as inputs.

We first need to randomly split the dataset into training and testing subsets. In this case, we are going to use 80% of the dataset to build a Logistic Regression. For this, we remove all fields except the topics and the paper abstract, and select the decade as the objective field.
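The random split itself is a one-click operation in the BigML Dashboard; conceptually it amounts to something like the following stdlib-only sketch (the rows and field names here are made up for illustration, not taken from the actual dataset):

```python
# A minimal sketch of a random 80/20 train/test split, analogous to the
# one-click dataset split described above (rows and fields are illustrative).
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy stand-ins for the papers: a decade label plus a topic probability
rows = [{"decade": 1980 + 10 * (i % 4), "Support Vector Machines": i / 100}
        for i in range(100)]
train, test = train_test_split(rows)
print(len(train), len(test))  # 80 20
```

Fixing the seed matters: it guarantees that the same 20% of papers is held out when we later evaluate competing models against each other.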

BigML visualizations for Logistic Regression allow us to interpret which topics are most influential in predicting the decade. By selecting a topic for the x-axis of the Logistic Regression chart, we can distinguish the topics whose prevalence evolves over time from the more stable ones. The fluctuating topics will be better predictors of the decade than the steadier topics, which will be mostly irrelevant to our supervised model.

For example, we can see in the image below that as the probability of the topic “Circuits/Hardware” increases, the paper is more likely to be from the 80’s or 90’s than from the 21st century. Therefore, it can be an important topic in determining the decade in which a paper was written.

LR circuits.png

The topic “Support Vector Machines”, for example, tends to be very frequent in papers from the 2000’s, while it is less probable in other decades.

LR SVM.png

Other topics, like “Small numbers” (which includes all the numbers found in the papers) or “Probabilistic Distributions”, tend to have a stable probability throughout the decades. You can observe this in the image below, where the graph lines are pretty flat, i.e., the predicted probabilities for the decades do not change as the topic probabilities vary.

The results match our expectations nicely, but to objectively measure the overall predictive power of this model, we need to evaluate it on the remaining 20% of the data.

The Logistic Regression evaluation shows around 80% accuracy, which is not bad. However, after trying other classification models, we find that the best performing model is a Bagging ensemble of 300 trees, which achieves an accuracy of 84%. You can see its confusion matrix below.

confusion matrix.png

Here, we see that the most difficult decade to predict is the 80’s, very likely due to the smaller number of papers (57 in total) in the sample compared to the other decades.
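Reading the overall accuracy off a multi-class confusion matrix is simple: sum the diagonal (correct predictions) and divide by the total number of test rows. The matrix values below are made up for illustration (chosen to roughly echo the reported 84% accuracy and the small 80’s class), not the actual evaluation numbers:

```python
# Overall accuracy from a multi-class confusion matrix.
# The matrix values are illustrative, not the actual evaluation results.
def accuracy(confusion):
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# rows = actual decade, columns = predicted decade (80's, 90's, 2000's)
confusion = [
    [33, 14, 10],   # 80's: the smallest class, hence the weakest diagonal
    [8, 185, 22],   # 90's
    [4, 22, 202],   # 2000's
]
print(accuracy(confusion))  # 0.84
```

A per-class recall (each diagonal entry divided by its row sum) makes the 80’s weakness explicit: with few training examples, the model tends to confuse those papers with adjacent decades.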

To improve the model’s performance further, we can try some more feature engineering, such as adding the length of the text, the authors, the number of papers per author, or various extracted entities like the university or country of publication.

We encourage you to delve into this fun dataset and let us know about ways to improve it. If you haven’t got a BigML account yet, it’s high time you got one today. Sign up now, it’s free!

Brazilian Summer School in Machine Learning and AI Startup Battle in the Books!

The BigML Team has travelled to São Paulo, Brazil, to conduct another edition of our series of Machine Learning schools as part of our ongoing educational activities to help democratize Machine Learning not only across industries and job functions but also across geographies. Although this was not our first such event, we keep being positively surprised by the enthusiasm with which these hands-on training sessions are received. The Brazilian Summer School in Machine Learning was no exception in this regard.

It’s safe to say that Brazilian techies responded to our call in a big way. We received more than 450 applications from 9 different countries to join this event. However, due to space and travel/visa constraints, we could only accept a maximum of 202 attendees, representing 6 different states in Brazil (173 from São Paulo, 11 from Minas Gerais, 10 from Rio de Janeiro, 4 from Santa Catarina, 2 from Paraná, and 2 from Rio Grande do Sul). There was a full house at the VIVO Auditorium in São Paulo. Check out all the pictures on Flickr, Google+, and Facebook.

group

The two-day program was packed with topics such as supervised and unsupervised learning techniques, feature engineering, how to get your data ready for Machine Learning, as well as automating Machine Learning workflows.

Artificial Intelligence Startup Battle

Following in the footsteps of the inaugural AI Startup Battle that took place in Valencia, Spain, on March 15, 2016, and the second edition in Boston on October 12, 2016, BSSML16 closed with the third edition of the Artificial Intelligence Startup Battle. Brazilian media outlets covered the battle, since it was the first time in history that Brazil held a contest where the jury was a Machine Learning algorithm that predicted the probability of success of early stage startups. The four competitors (Dataholics, PayGo Energy, Prognoos, and Sppin-Kapputo) and Saffe Payments took to the stage, although Saffe Payments did not participate in the competition because they are already part of the Wayra academy.

_s3x3051

Contenders of the AI Startup Battle at the BSSML16. From left to right: Guilherme Paiva, Co-Founder and CEO of Sppin-Kapputo; Mark O’Keefe, Co-Founder of PayGo Energy; Daniel Mendes, Founder and CEO of Dataholics; Raul Magno, Co-Founder and CEO of Prognoos; and Renato Valente, Country Manager of Telefónica Open Future_ in Brazil.

The PreSeries Machine Learning algorithm interviewed the contenders until it had enough information to provide a score between 0 and 100. The winning company, with a score of 92.33, was Prognoos, a startup from São Paulo that has built an artificial intelligence platform that uses e-commerce user interaction and browsing data to personalize the buying experience through its proprietary algorithm. This startup has been invited to Telefónica Open Future_’s accelerator to enjoy access to the Wayra Academy (for up to six months) and to Wayra services and network of contacts, e.g., training, coaching, a global network of talent, as well as the opportunity to reach many Telefónica enterprises in Brazil and abroad. After six months, the winning company will be evaluated and may apply for a full Wayra acceleration process, including a convertible note loan of up to USD $50,000 (in exchange for a possible 7 to 10% equity stake).

_s3x3029

Prognoos, winner of the AI Startup Battle at the BSSML16, represented by Raul Magno, its Co-Founder and CEO (left). Renato Valente, Country Manager of Telefónica Open Future_ in Brazil (right).

Second place went to PayGo Energy, from Nairobi, Kenya, with a score of 71.90; they seek to democratize LPG (Liquefied Petroleum Gas or Propane) for the 2.9 billion people worldwide who lack access to clean cooking fuel. Third place went to Dataholics, from São Paulo, with a score of 39.72; they focus on providing a solution to detect the products and services that fit a given consumer profile based on their social media and demographic information. And finally, fourth place went to Sppin-Kapputo, from Belo Horizonte, Brazil, with a score of 28.14; this company is an information broker that uses Machine Learning to allow real estate investors and construction companies to make better decisions by relying on analytics tools and prediction models that evaluate the impact of construction on a real estate market.

At the end of the event, BigML’s CEO and President of PreSeries, Francisco J. Martin, highlighted: “We already knew from the growing number of active BigML users in Brazil that the region holds tremendous potential due to an abundance of young, hungry-to-learn minds as well as world class academics in Machine Learning and AI. This week was further testament that geographic barriers are no longer strong enough to prevent the spread of innovative and ambitious ideas that consider not only their local market but the whole world as the target audience for their data-driven smart applications.”

The next edition of our Machine Learning schools and AI Startup Battles will take place soon, so stay tuned for new announcements on Twitter (@bigmlcom) and other social media channels: LinkedIn, Facebook, and Google+.

Looking forward to seeing you again in future editions of our Machine Learning training events and AI Startup Battles around the world.

Brazilian AI Startup Battle: Meet the Contenders!

As a thriving innovation hub, São Paulo’s economic impact can be felt all across Brazil, South America and even the world. The city that is colloquially known as Sampa or Terra da Garoa (Land of Drizzle) will also be the host to the third installment of the AI Startup Battle powered by PreSeries and Telefónica Open Future_.

battle_brazil

The AI Startup Battle is a one-of-a-kind startup contest, where five early-stage ventures compete on stage and are judged and ranked by Artificial Intelligence (AI). The AI is completely autonomous: no human intervention compromises the bias-free algorithm. How does it work? The contenders first pitch their startup and are then asked questions by the AI live on stage. All the information gathered is then processed by the AI to generate a score from 0 to 100, which represents the startup’s long-term estimated likelihood of success. Paving the way for more quantifiable early-stage investing, previous editions of the battle were held in Valencia, Spain and Boston, USA as part of the PAPIs conferences.

The impending edition of the AI Startup Battle will be part of the Brazilian Summer School in Machine Learning 2016 (BSSML16). The contest is taking place on December 9 at the VIVO auditorium in São Paulo and is part of a series of Machine Learning courses organized by BigML in collaboration with VIVO and Telefónica Open Future_. BSSML16 is a two-day course for industry practitioners, advanced undergraduates, as well as graduate students seeking a fast-paced, practical, and hands-on introduction to Machine Learning. The Summer School will also serve as the ideal introduction to the kind of work that students can expect if they enroll in advanced Machine Learning masters.

Meet the contenders of the 1st Brazilian edition

PayGo Energy

screen-shot-2016-12-06-at-1-03-59-pm

PayGo Energy from Nairobi, Kenya believes in equal opportunities and seeks to democratize LPG (Liquefied Petroleum Gas or Propane) for the 2.9 billion people worldwide who lack access to clean cooking fuel. PayGo Energy allows families to purchase gas in small amounts, making LPG more affordable than ever. They operate on a pay-as-you-go basis. Their micro-payment structure critically aligns with existing consumer spending habits to overcome current cost barriers and enable access to a stable supply of cooking gas.

Dataholics

screen-shot-2016-12-06-at-1-05-50-pm

DATAHOLICS from São Paulo, Brazil provides a solution to detect the products and services that fit a given consumer profile based on his/her social media and demographic information. This rich data significantly improves the targeting of direct marketing campaigns, product recommendations for e-commerce, market research, database enrichment and the generation of highly qualified trade leads.

Prognoos

screen-shot-2016-12-06-at-1-18-32-pm

Prognoos has built an artificial intelligence platform with a very low operational cost. Their first product presents a unique browsing experience with matchmaking algorithms. It uses e-commerce user interaction and browsing data to personalize the buying experience through its proprietary algorithm, ensuring a good match between the ideal product and the right customer with the aim of decreasing the churn rate.

Kapputo

screen-shot-2016-12-06-at-1-15-21-pm

Kapputo is an information broker that uses Big Data and Machine Learning to allow real estate investors and construction companies to make better decisions by relying on analytics tools and prediction models that evaluate the impact of construction on a real estate market.

Saffe

Screen Shot 2016-12-07 at 5.07.56 PM.png

What if you could pay with just a selfie? Saffe Payments is a mobile payment app that leverages world-class facial recognition technology to make your life easier and more secure.

Good luck to all participants, and stay tuned for the results of Brazil’s first ever AI Startup Battle!

BigML Fall 2016 Release Webinar Video is Here!

Thank you to all webinar attendees for the active feedback and questions about BigML’s Fall 2016 Release that includes Topic Models, our latest resource that helps you find thematically related terms in your text data. Our implementation of the underlying Latent Dirichlet Allocation (LDA) technique, one of the most popular probabilistic methods for topic modeling tasks, is now available from the BigML Dashboard and API. As is the case for any Machine Learning workflow, you can also automate your Topic Model workflows with WhizzML!

If you missed the webinar, it’s not a problem. You can now watch the complete session on the BigML YouTube channel.

Please visit our dedicated Fall 2016 Release page for more resources, including:

  • The Topic Models documentation to learn how to create, interpret and make predictions from the BigML Dashboard and the BigML API.

  • The series of six blog posts that explain Topic Models step by step, starting with the basics and wrapping up with the mathematical insights of the LDA algorithm.

Many thanks for your time and attention. We are looking forward to bringing you our next release!
