
Announcing BigML’s SDK for Swift


The last month of 2015 has seen two significant announcements concerning the Swift programming language. On the one hand, Apple has finally fulfilled its promise to make Swift’s source code available under an open source license; on the other, Swift has surpassed Objective-C on the TIOBE index. Thus we are especially proud to announce our SDK for Swift, bigml-swift, which evolves and supersedes the old BigMLKitConnector library.


Our BigML SDK for Swift will provide all iOS, OS X, watchOS, and tvOS developers with the possibility of integrating the BigML platform into their apps while also benefitting from Swift’s type safety, modern syntax, and performance.

The main features that BigML SDK for Swift provides can be divided into two areas:

  • Remote resource processing: BigML SDK exposes BigML’s REST API through a higher-level Swift API that will make it easier for you to create, retrieve, update, and delete remote resources. Supported resources are:

    • data sources
    • datasets
    • models
    • clusters
    • anomalies
    • ensembles
    • predictions
    • centroids
    • anomaly scores.
  • Local resource processing: BigML SDK allows you to mix local and remote distributed processing in a seamless and transparent way. You will be able to download your remote resources (e.g. a cluster) and then apply supported algorithms to them (e.g. calculate the nearest centroid for your input data). This is one definite advantage that BigML offers in comparison to competing services, which mostly bind you into either using their remote services or doing everything locally. BigML’s SDK for Swift combines the benefits of both approaches by making it possible to use the power of a cloud solution and to enjoy the flexibility and transparency of local processing right when you need it. The following is a list of currently supported algorithms that BigML’s SDK for Swift provides:

    • Model predictions
    • Ensemble predictions
    • Clustering
    • Anomaly detection.

A dive into BigML’s Swift API

The BMLConnector class is the workhorse of all remote processing: it allows you to create, delete, and get remote resources of any supported type. When instantiating it, you should provide your BigML account credentials and specify whether you want to work in development or production mode:

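// credentials plus development/production mode, as described above
let connector = BMLConnector(username:"your-username-here",
    apiKey:"your-api-key-here",
    mode:BMLMode.Production)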

Once your connector is instantiated, you can use it to create a data source from a local CSV file:

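var myDatasource : BMLMinimalResource!
let filePath : String = ...
// wrap the local file in a resource, then ask the connector to create a source from it
let source = BMLMinimalResource(name:"My Data source",
     type: BMLResourceType.File,
     uuid: filePath)
connector.createResource(BMLResourceType.Source,
     name: "testCreateDatasource",
     options: [:],
     from: source) { (resource, error) -> Void in
          if let resource = resource {
              myDatasource = resource
          }
}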

As you can see, BMLConnector’s createResource provides a strongly typed interface, which allows you to specify the type of resource you want to create, its name, a set of options, and the resource that should be used to create it.

BigML’s SDK for Swift API is entirely asynchronous and relies on completion blocks, where you will get the resource that has been created, if any, or the error that aborted the operation as applicable. The resource you will receive in the completion block is an instance of the BMLMinimalResource type, which conforms to the BMLResource protocol.

The BMLResource protocol encodes the most basic information that all resources share: a name, a type, a UUID, the resource’s current status, and a JSON object that describes all of its properties. You are supposed to create your own custom class that conforms to the BMLResource protocol and that best suits your needs e.g., it might be a Core Data class that allows you to persist your resource to a local cache. Of course you are welcome to reuse our BMLMinimalResource implementation as you wish.

In a pretty similar way you can create a dataset from the data source just created:

connector.createResource(BMLResourceType.Dataset,
     name: "My first dataset",
     options: [:],
     from: myDatasource) { (resource, error) -> Void in
         //-- your processing of the dataset here
}

If you know the UUID of an existing resource of a given type and want to retrieve it from BigML, you can use BMLConnector’s getResource method:

self.connector!.getResource(BMLResourceType.Model, uuid: modelId) {
     (resource, error) -> Void in

     if let model = resource {
         //-- your processing of the model here
     }
}

Local algorithms

The most exciting part of BigML’s SDK for Swift is surely its support for a collection of the most widely used ML algorithms such as model prediction, clustering, anomaly detection etc. What is even more exciting is that the family of algorithms that BigML’s SDK for Swift supports is constantly growing!

As an example, say that you have a model in your BigML account and that you want to use it to make a prediction based on some data that you have. This is a two-step process:

  • retrieve the model from your account, as shown above, with getResource;
  • use BigML’s SDK for Swift to calculate a prediction locally.

The second step can be executed inside of the completion block that you pass to getResource. This could look like the following:

self.connector!.getResource(BMLResourceType.Model, uuid: modelId) {
    (resource, error) -> Void in
    if let model = resource {

        // build a local model from the downloaded JSON and predict with it
        let pModel = Model(jsonModel: model.jsonDefinition)
        let prediction = pModel.predict([
                "sepal width": 3.15,
                "petal length": 4.07,
                "petal width": 1.51],
            options: ["byName" : true])
        //-- your processing of the prediction here
    }
}


The prediction object returned is a dictionary containing the value of the prediction and its confidence. In similar ways, you can calculate the nearest centroid, or do anomaly scoring.
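To give a feel for what a nearest-centroid calculation involves, here is a minimal standalone sketch. It does not use the SDK: `Centroid`, `squaredDistance` and `nearestCentroid` are hypothetical names, the inputs are assumed to be purely numeric, and plain squared Euclidean distance stands in for whatever distance BigML clusters actually use. It is written in current Swift syntax rather than Swift 2.

```swift
// Hypothetical stand-in for one centroid of a downloaded cluster; not an SDK type.
struct Centroid {
    let name: String
    let center: [String: Double]
}

// Squared Euclidean distance over the centroid's fields.
// Assumption: a field missing from the input is treated as 0.
func squaredDistance(_ input: [String: Double], _ centroid: Centroid) -> Double {
    var total = 0.0
    for (field, value) in centroid.center {
        let diff = (input[field] ?? 0.0) - value
        total += diff * diff
    }
    return total
}

// Returns the closest centroid, or nil when the list is empty.
func nearestCentroid(to input: [String: Double], among centroids: [Centroid]) -> Centroid? {
    return centroids.min { squaredDistance(input, $0) < squaredDistance(input, $1) }
}

// Made-up centroids for illustration only.
let centroids = [
    Centroid(name: "Cluster 0", center: ["petal length": 1.5, "petal width": 0.25]),
    Centroid(name: "Cluster 1", center: ["petal length": 4.3, "petal width": 1.3]),
]
let nearest = nearestCentroid(to: ["petal length": 4.07, "petal width": 1.51],
                              among: centroids)
print(nearest?.name ?? "none")
```

A real cluster would also need to handle categorical and text fields and any field scaling, which this sketch deliberately leaves out.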

Practical Info

The BigML SDK for Swift is compatible with Swift 2.1.1. You can fork BigML’s SDK for Swift from BigML’s GitHub account and send us your PRs. As always, let us know what you think about it and how we can improve it to better suit your requirements!

PAPIs Connect 2016 – Call for Proposals on track!

PAPIs Connect, Europe’s first artificial intelligence event for decision makers, is coming to Valencia, Spain, on March 14-15, 2016.

PAPIs (the International Conference on Predictive APIs and Apps) started in November 2014. It also organizes the PAPIs Connect event series, which focuses on how Data, Machine Learning and Artificial Intelligence are used by companies to create predictive apps that solve real problems in business and everyday life.

For the first time in history, this edition of PAPIs Connect will feature a startup battle in which the contestants and winners will be chosen completely without human intervention thanks to our special and unique jury: an algorithm! This competition will be powered by the new technology Telefonica Open Future and BigML are developing together in order to rapidly discover the best startups actively working on state-of-the-art predictive apps. The aim is to use a multitude of signals to pinpoint and support these innovators at a very early stage, given high confidence in their likelihood of success.

In addition to the startup battle, decision makers, CEOs, CTOs, developers, data scientists and technology leaders from around the world will come to Valencia to discuss real-life cases where they have used Artificial Intelligence and Machine Learning to create value from data. Furthermore, the timing of the event offers participants extra inspiration, since it conveniently coincides with the world-famous Valencian Fallas!


As we are wrapping up 2015, now is the time for our Call for Proposals. PAPIs Connect is looking for creative speakers like you! Please submit your proposal before January 8, 2016, 11:59pm CET. Here are some topics we would love to hear about next March in Valencia:

  • Deep Learning: demystifying enterprise use cases, business impact, how-tos for business leaders and developers
  • AI / predictive technology for good causes
  • AI for improved decision making
  • Societal impact of AI
  • Enterprise predictive apps as competitive weapons
  • Introduction to APIs for decision makers
  • Automatic Data Science / Machine Learning
  • Benchmarks and comparisons between available technical tools and solutions (commercial as well as open source)
  • Open Source: projects that make it easier to process data, to create and experiment with predictive models, or to operationalize those with predictive APIs.

Do you think you might be a good candidate? Then this is your opportunity to join the proceedings: it is a great chance to share your story with an audience of like-minded innovators and strategic management practitioners and, in turn, to receive support and valuable community feedback.

About PAPIs

The PAPIs conference series started in November 2014 in Barcelona and continued in Paris in May 2015 and Sydney in August 2015. So far more than 500 attendees from 25+ different countries have enjoyed discussions and presentations about predictive apps and their impact in business and real life. In 2016 PAPIs will take place in Boston, US; the dates and exact location will be announced shortly. Stay tuned to book your flight to Boston!

Exploring 250,000+ Movies with Association Discovery

Hot on the heels of our Fall 2015 Release webinar including our Association Discovery (aka Association Rule Learning) implementation, we wanted to give this new capability a spin on our blog in order to get our readers warmed up. It is worth noting that there are many potential use cases of this technique including promotional pricing or bundling of items that are closely related, market basket analysis, Web usage mining, intrusion detection, and bioinformatics to analyze public genomic and proteomic databases among others.

Generally speaking, anytime you are challenged with uncovering statistically significant relations between variables with potentially thousands or even millions of different values, you will find Association Discovery handy as it does a great job in weeding out spurious associations to let you concentrate on the interesting ones. In a prior post, we covered some of the ways our proprietary Association Discovery algorithm differs from typical statistical approaches to categorical association analysis (e.g. log-linear analysis) so we won’t go into the same details here.

Actor and Director Associations

To quickly demonstrate how it works, we used the Home Theatre Info dataset containing movie metadata information on over 250,000 DVDs offered in North America. As usual, we completed some basic table joins and data wrangling in order to be able to feed this raw DVD data to BigML in a Machine Learning ready format. We’ll save you the gory details here and instead concentrate on the results. However, we would encourage you to download our User Guide and follow the end-to-end process later on.

The first exercise we ran uses a simpler subset of our dataset consisting of only the Director-Actor pairs. In our example, the Association Discovery task we ran on the Director-Actor pairs quickly found the top 100 association rules involving 69 different variable-value pairs (e.g. Actor ID = 72342).

Looking at the visual above (which we manually augmented with some movie images), we can see the top associations between Actors and/or Directors. Even though the dataset had separate fields for Actor and Director IDs the Association Discovery algorithm was able to identify different combinations of rules involving Actor vs. Actor, Actor vs. Director, and Director vs. Director. The Actors are automatically marked in orange and directors in blue. There are some interesting World Cinema associations that were discovered without any supervision:

  • The most prolific network of collaboration is in the bottom left, which points to Japanese Director Daisuke Nishio’s Dragonball Z creative team, including many voice-over artists bringing his anime characters to life.
  • A lesser-known relationship between the middleweight boxer turned low budget film actor Dick Miller and the prolific engineer turned independent film producer/director Roger Corman known for his horror flicks. IMDB cites “Miller settled in Los Angeles in the mid-1950s, where he was noticed by producer/director Roger Corman, who cast him in most of his low-budget films, usually playing unlikeable sorts, such as a vacuum-cleaner salesman in Not of This Earth (1957).”
  • Also of note is the unique self-referential relationship between actor/director Clint Eastwood and himself. Talk about a Do-It-Yourself guy!

Just like that, the history of chaotic moviemaking dating back to the silent era comes into focus, pointing to some of the strongest collaborations from its past that have soundly “beaten the odds” of randomness. Those of you who are more experienced Machine Learning practitioners may be thinking “So what, we had a bunch of paired data that…I could just run a ‘group by’ SQL query against that same table to arrive at similar results.” That is indeed true, but even with this simple case you get the added benefit of association strength measures and a network visualization that surfaces patterns that may not be apparent at a glance in a long list of database query results.
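For comparison, the plain ‘group by’ approach mentioned above amounts to counting how often each pair occurs. The following is a minimal sketch on made-up Director-Actor rows (the IDs are invented for illustration and do not come from the DVD dataset); it yields raw counts only, with none of the strength measures an association rule carries.

```swift
// Toy (director ID, actor ID) rows; the IDs are made up for illustration.
let pairs: [(Int, Int)] = [
    (1, 10), (1, 10), (1, 11), (2, 10), (1, 10), (2, 12),
]

// Roughly: SELECT director, actor, COUNT(*) FROM pairs GROUP BY director, actor
var counts: [String: Int] = [:]
for (director, actor) in pairs {
    counts["\(director)-\(actor)", default: 0] += 1
}

// Sort by descending count, breaking ties by key so the order is deterministic.
let ranked = counts.sorted {
    $0.value != $1.value ? $0.value > $1.value : $0.key < $1.key
}
for (key, count) in ranked {
    print(key, count)
}
```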

Once we turned our attention to our main dataset, which combines the data fields that our warm-up exercise left out, the highest-leverage association rules that floated to the top consistently involved the Genre, Studio and Rating fields. These rules show significant “Lift”, which justifies taking them seriously. For example, the first rule stating that whenever the Genre is “Anime” the Rating is “MA13” has a lift score of approximately 26. This means that this pairing occurs about 26 times more often in our dataset than would be expected if Genre and Rating were independent. It is also worth noticing that the second rule in the result set is the reverse of the first rule with the same Leverage and Lift values as expected. Yet the Coverage, Support and Confidence scores are different because those are calculated with respect to the “Antecedent”, which is different in each case. For a quick explanation of the Association Discovery terminology, you can refer to the related documentation.
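To make these measures concrete, here is a minimal sketch that computes coverage, confidence, lift and leverage from raw counts using their common textbook definitions. The counts are toy numbers chosen so that lift lands near the value quoted above; they are not taken from the DVD dataset, and `RuleMeasures` and `measures` are hypothetical names rather than part of BigML’s API.

```swift
// Textbook rule measures computed from raw counts (toy numbers, not the real dataset).
struct RuleMeasures {
    let coverage: Double    // fraction of rows matching the antecedent
    let confidence: Double  // fraction of antecedent rows that also match the consequent
    let lift: Double        // confidence divided by the consequent's overall frequency
    let leverage: Double    // joint frequency minus the frequency expected under independence
}

func measures(total: Int, antecedent: Int, consequent: Int, both: Int) -> RuleMeasures {
    let n = Double(total)
    let pA = Double(antecedent) / n
    let pC = Double(consequent) / n
    let pBoth = Double(both) / n
    let confidence = pBoth / pA
    return RuleMeasures(coverage: pA,
                        confidence: confidence,
                        lift: confidence / pC,
                        leverage: pBoth - pA * pC)
}

// Assumed toy counts: 1000 titles, 30 Anime, 25 rated MA13, 20 that are both.
let forward = measures(total: 1000, antecedent: 30, consequent: 25, both: 20)  // Anime -> MA13
let reverse = measures(total: 1000, antecedent: 25, consequent: 30, both: 20)  // MA13 -> Anime

print("lift:", forward.lift, reverse.lift)
print("leverage:", forward.leverage, reverse.leverage)
print("confidence:", forward.confidence, reverse.confidence)
```

With these definitions, swapping antecedent and consequent leaves lift and leverage unchanged, while coverage and confidence change because they are measured relative to the antecedent.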

Looking at the results of our second exercise, we found strong associations that we were able to confirm from our prior knowledge, which serve as proof points:

  • Director ID 732 associated with the movie studio Dreamworks is none other than Mr. Steven Spielberg.
  • Director ID 3974 associated with Nickelodeon is Chris Gifford – the creator of the hugely popular children’s cartoon series ‘Dora the Explorer’.
  • PBS (Public Broadcasting Service) is strongly associated with the directorial godfather to many in the documentary genre, Ken Burns as well as Stephen Ives (known for The American Experience series among others).
  • It was also very plausible that the Pokemon creator Masamitsu Hidaka and Dave Filoni of Star Wars anime fame are tied with the ‘General Audiences’ movie rating.

This analysis also raises the challenge of identifying what makes an “interesting” association. There are many academic papers written on this seemingly simple question, so it is more complex than it appears. For the sake of our exercise, we would leave this as a judgment the subject-matter expert has to make. If it is useful in his/her setting, then it is good to go. If not, just skip to the next association.

The key idea here is that Association Discovery is your best friend when you have the challenge of finding non-spurious relationships in heaps of data that include many variables, each involving many values of its own. If you think of the alternative as exhaustively searching BILLIONS of variable-value pairs and their relationships with one another, then you come to appreciate how much time and effort this methodology saves.

As all our users have simultaneously gotten access to Association Discovery with this week’s launch, we hope that you give it a spin and let us know what you think by sending us a note with your ideas and feedback.

BigML’s Fall 2015 Release and Webinar: Association Discovery, Logistic Regression, Correlations and More!

We are getting close to the end of fall, but before the official winter welcome, we are very excited to share with you a bevy of key improvements to the BigML platform. Hungry to discover how BigML is evolving? To increase your Machine Learning appetite, here is a short description of what we will explain in full detail during our upcoming Webinar, on Tuesday, December 15, 2015 at 10:00AM US Pacific Time (Portland, Oregon / GMT -08:00). Sign up and reserve your spot today.

Association Discovery & Logistic Regression:


BigML is the first Machine Learning service offering Association Discovery on the cloud. Our newest addition to the toolbox, Association Discovery, can be used for many different tasks such as market basket analysis, web usage patterns, intrusion detection, fraud detection, or bioinformatics. Since acquiring Magnum Opus, BigML’s team has been working hard to integrate its unique algorithms, which were developed by Professor Geoff Webb of Monash University. With Association Discovery you can pinpoint hidden relations between values of your variables in high-dimensional datasets with just one click. In this webinar, you’ll learn how to visualize key relationships, export rules, and use BigML’s API to program your own discovery workflows.


BigML’s latest version also offers best-in-class Logistic Regression via our REST API. This absolute workhorse can help you solve many classification problems. Logistic Regression is a new service also included in our Python and Node.js bindings, so you can easily create models in the cloud, then download these models to your application for fast, local predictions. Logistic Regression serves as an excellent benchmark in many use cases.

Partial Dependence Plots, Statistical Tests and More:


Now you can better analyze and visualize the impact that a set of selected fields has on ensemble predictions, which improves their interpretability. Partial Dependence Plots (PDP) can be used for both classification and regression ensembles. BigML provides a two-way PDP where you can select the fields you want for both axes. You can access this feature from an ensemble as well as from our Labs section.


BigML’s Fall 2015 Release also includes new exploratory data analysis tools to help you understand the statistical nature of the numeric fields of your dataset. Through our REST API and also via our Python, Node.js and C# bindings, you’ll be able to perform statistical tests for normality as well as for fraud and outlier detection. These tests allow you to check whether the distribution of the values of a numeric field follows certain statistical properties.


Selecting the right features for your Machine Learning model can be a hard task. Our REST API now helps you by providing advanced statistics to find correlations between your dataset fields, which allows you to select better predictors for your models. The Correlations feature is also available in our Python, Node.js and C# bindings.


It’s time to easily create and validate your Flatline expressions in a friendly inline editor. Flatline is a LISP-like language that can help you engineer new features and filter your datasets in countless ways, so you can get higher quality predictors. Flatline is open sourced by BigML and can be found on GitHub.

Plus More Goodies…

Interested in what you’ve read so far? That’s not all! We will also showcase other updates on our call. Be sure to reserve your webinar spot before long as space is limited!

Machine Learning with Python at PyConES 2015

Since we established our new European Headquarters in Valencia last July, we have been carrying the Machine Learning banner at all major tech events in the city.  This time we were proud sponsors of the third edition of PyConES, the marquee Python event held in Spain. During the weekend event, almost 400 developers attended the conference.  Many stopped by the BigML stand to pick up a BigML t-shirt and most stuck around to find out how they can up their Machine Learning game while working in their favorite software development ecosystem.


Mercè Martín Prats, our VP of Insights and Applications, gave a very popular Machine Learning crash course, which is further evidence of the subject’s meteoric rise in developer communities.  This is music to our ears as we have been on this bandwagon since our inception four years ago.  We truly believe that the days when Machine Learning will be an integral skill set for any developer worth their salt are not too far off.  After all, Machine Learning is another way to elegantly solve real-life problems by way of leveraging ever cheaper computing power. So, why exclude capable techies willing to learn and instead opt to limit its use to a select few academically minded researchers? It just does not make any sense.

During her presentation, Mercè dove right into the Python code, demonstrating the steps to follow in using BigML’s REST API from within your Python environment, which lets you remotely manage BigML’s Machine Learning resources programmatically. Also among the topics she covered was BigMLer, our command line tool built on our Python bindings. With BigMLer you can automate even complex Machine Learning workflows with a few lines of code. It is poetry in minimalism!

We would like to thank the community for the great questions covering critical aspects such as service architecture, scalability and the model optimization controls, in addition to all their positive feedback. We are looking forward to adding to the Python cause in future iterations of the event. If you would like to contribute to BigML’s Python projects or if you missed the chance to grab a t-shirt at the event, just give us a holler on Twitter and we will send one your way right away.

From Big Blue’s Predictive Analytics to Machine Learning with BigML

Within 24 hours of turning in my IBM badge, laptop and signed exit papers, I found myself on a plane to Buenos Aires followed by Melbourne, and Sydney for conferences and client meetings representing a company I was not even officially working for yet. I am now one of the most recent additions to the BigML team and my motivation to give up my IBM career and to make a fresh start with a startup did not come up as a sudden urge.  Rather, it was a gradual process of observing the sea change in the marketplace.


Having interfaced with many analytics organizations as part of my tenure at IBM, it is my conviction that we have entered a new era, where the democratization of machine learning is allowing organizations large and small to add repeatable statistical rigor to all kinds of processes that up to now have been predominantly influenced by human bias, e.g. candidate identification and the interview process (HR), predicting vacation rental prices, athlete profiling by scouts, sizing and pricing complex services projects, and optimizing crop yields and farming operations. No doubt all of those business profiles, including the guy predicting vacation rental prices, will one day utilize machine learning, without having to reinvent themselves as hackers, that is.

Speed, Deployability, and Costs

Like most digital technologies, Machine Learning is in the process of becoming automated and commoditized with BigML, Amazon, Microsoft, and Google leading the charge.

The business drivers?  There are many:

  • Transforming a manual set of processes into a single fluid one by leveraging easy-to-use services
  • Lowering the complexity and cost of building and deploying predictive models
  • Increasing business performance by applying machine learning in daily operations to speed up the time-to-market of more and more data-driven decisions.

With tools like BigML, in a fraction of the time that it takes to install and configure any statistical software package like R, SAS or SPSS, you can create an online account, load your source data, train, test, and boom! You have built a predictive model that helps you score all your present and future data. Best of all, the same process template can be applied to any of your functional areas (marketing, sales, risk, compliance, maintenance etc.), accelerating the scale of data-driven actions in the whole company.

Oh, and did I mention exportable models? They can be shared with anybody in your organization, run remotely on IoT devices or other high-value assets (e.g. cell towers, manufacturing equipment, infrastructure pipes, etc.), and they support the most popular runtime environments such as Python, C++, Java, Node.js and more. That’s right, you can create models, then export them to anybody or practically anything and have them run locally on a single machine or send them to a million machines at no additional cost. Yes, democratization of advanced analytics is literally here.

All this means the traditional enterprise software vendor approach (SAS, SAP, IBM etc.) of selling large bundles of software that include components/modules that sit unused is slowly seeing the end of days. Users want innovation, not the re-stitching together of 20+ year old platforms buttressed with spend-heavy traditional marketing to keep up with the budding tools that have adopted cloud-based, distributed, machine-learning-driven approaches from birth. On top of that, over the last decade companies have become much more comfortable working with innovative startup providers that offer a SaaS model built on IaaS and PaaS substrates and that do not shy away from passing the savings from low overhead costs on to their customers. Since the playbook of best practices needed to operate a cloud-born company is common knowledge these days, we will likely witness the product selection bias tipping increasingly away from the incumbents.

Machine Learning made “beautiful”?

When I first heard BigML’s motto, my thought was “Who would associate Machine Learning with beauty?” However, after seeing audience reactions ranging from “Wow!” and “Very cool” to, from time to time, “This is like IBM Watson and Tableau put together”, I have come to appreciate the effort the BigML team has put into enabling a highly streamlined and understandable machine learning workflow that is capable of demystifying mathematically complex Machine Learning concepts for even the uninitiated. Beautifully simple indeed. It has finally occurred to me that living in an era ushered in by Mr. Steve Jobs has gotten us all spoiled with much higher expectations from any and all products that we come to touch. Naturally, there is no reason why the same should not apply to Machine Learning software.

While such meaningful progress is being made towards making machine learning more usable and understandable for a broader set of technical and business users, today’s typical practitioners are mainly Developers, Data Scientists, and “Data Wranglers”, the unsung heroes of any analytics project. Which brings us to an inconvenient reality: There are just not enough of these well-balanced teams of “Practitioners” in the market to meet the exploding demand. So what are companies to do at a time when Machine Learning is supposed to take center stage in their irreversible digital evolution?

If you are tasked with building a Data Science team, I recommend getting started with assigning tasks to Business Analysts and Data Wranglers. They should clearly prioritize and formulate the business problem, extract, transform and get ready the relevant data sources for predictive modeling. Most practitioners agree that up to 80% of the time and effort is spent with those stages of the process. In parallel, you can aim to find a person with “practical” experience in the area of machine learning. WARNING – if during your hiring process you come across a Data Scientist whose career highlight is an exotic algorithm he has been working on for the last 2 years, say THANK YOU and RUN the other way.  That’s more of a Machine Learning Researcher profile than a practitioner. Frankly many companies don’t need that level of specialization in a field that has been around for over half a century with many proven techniques and approaches already productized and available as RESTful API end points – just add data.

As far as BigML is concerned, we are Data Scientist agnostic. We recognize that a “practical” and well aligned Data Scientist can empower a broader team with relevant Machine Learning knowledge, allowing them to more efficiently explore problems that matter to the business. Equally important is a scalable, programmable, and easy to use MLaaS (Machine Learning as a Service) tool that lets capable Developers and in-house SMEs build, test and deploy predictive use cases that can learn and get better and better over time. The results speak for themselves as far as the business impact is concerned, regardless of whether one can derive the mathematical formulae that gave rise to the Random Decision Forest algorithm. Bootstrapping is the call of the day, which is only fitting in an assume-nothing, test-everything, let-the-data-be-your-guide Lean Startup world.

Admittedly, some of these steps are easier written here than actually delivered, so as expected, companies will need to do their homework and identify the differences between these MLaaS providers with care.  Well, I already did my due diligence, and now I am sprinting with BigML.

BigML and The Polytechnic University of Valencia join forces to promote Machine Learning

We are happy to announce that BigML and The Polytechnic University of Valencia (UPV) have signed an agreement creating a new University-Business chair in order to support new Machine Learning research and to promote the use of Machine Learning technologies at UPV. The official signature ceremony of the agreement took place on the 20th of October at The Polytechnic University of Valencia Rector’s office with representatives from both parties.

As part of the agreement BigML will collaborate with UPV on organizing lectures, teaching activities, idea competitions, conferences, courses, seminars, pre-doctoral and post-doctoral grants, internships and more. These activities are expected to aid the students in maximizing the knowledge transfer from BigML while creating exciting career opportunities for the graduates where they can apply their newly acquired technical skills.

On the research front, the business chair plans to multiply Machine Learning research projects coming out of Valencia, Spain (and Europe at large through UPV connections). Promoting hi-tech entrepreneurship in advanced analytics and software disciplines in the region has been one of the primary motivations behind our recent decision to locate our European headquarters in Valencia so we are very glad that UPV has agreed to join forces in this worthy endeavor.

The Future Impact of Machine Learning

We are happy to announce a special full day event bringing together some of the best minds in machine learning and business management to discuss The Future Impact of Machine Learning in depth.  The event will take place at Las Naves in Valencia on October 20, 2015. Registration is free and by invitation only. To apply and reserve your place for this unique event, be sure to fill out the event form ASAP as available spots will not last long.



Where will the Machine Learning revolution lead The World? Will Machine Learning create more jobs than it destroys? These remain some of the most controversial topics for technology pundits and world leaders alike to discuss either on Bloomberg TV or at venues like World Economic Forum’s annual meeting at Davos. They are fair questions in that we are observing a greater momentum in Artificial Intelligence based technologies and products launched into the new connected world economy: think Siri, Nest, Cortana, Google Now, the self-driving car and military-grade drones. In fact, Google’s complete reorganization into an Alphabet soup of project oriented entities may have been a bit confusing to some, but it was undertaken to a large extent to prepare for this new AI future. There are now serious plans even about creating ships that can cross the sea without humans on board. So much for the Captain Phillips sequel!

This is a very complex subject and we are collectively barely scratching the surface. As a result, we need many more informed debates about the economic impact of new technologies such as Machine Learning. We are hopeful that the Las Naves gathering will help move this crucial debate further with participation from:

Ramon López de Mántaras will be covering the historical perspective of AI. Dr. Dietterich will inform the audience about recent advances and breakthroughs with real-world stories, as well as the long-term risks and opportunities he sees. Finally, Enrique Dans will be highlighting the expected business impact of AI, making it a multidimensional discussion of the topic. Once again, registration for this event is free and by invitation only. To apply and reserve your spot, be sure to fill out this event form. If you would like to promote the event among your circle, we also encourage you to download the poster in English, Spanish or Valencian.

Looking for Connections in Your Data – Correlation Coefficients

This is part of an ongoing statistics-related blog post series in anticipation of BigML’s upcoming statistical tests resource. The previous post was about fraud detection with Benford’s Law. In this post, we will explore the topic of correlation, and how it can help you in designing and applying machine learning models.

Bill Nye - Unrelated but Related

Consider the following…

Put in plain terms, correlation is a measure of how strongly one variable depends on another. Consider a hypothetical dataset containing information about professionals in the software industry. We might expect a strong relationship between age and salary, since senior project managers will tend to be paid better than young pup engineers. On the other hand, there is probably a very weak, if any, relationship between shoe size and salary. Correlations can be positive or negative. Our age and salary example is a case of positive correlation. Individuals with a higher age would also tend to have a higher salary. An example of negative correlation might be age compared to outstanding student loan debt: typically older people will have more of their student loans paid off.

Correlation can be an important tool for feature engineering in building machine learning models. Predictors that are uncorrelated with the objective variable are probably good candidates to trim from the model (shoe size is not a useful predictor for salary). In addition, if two predictors are strongly correlated with each other, then we only need to use one of them (in predicting salary, there is no need to use both age in years and age in months). Taking these steps means that the resulting model will be simpler, and simpler models are easier to interpret.
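To make these two trimming steps concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the data is made up, the function name prune_predictors and both thresholds are hypothetical choices of ours, and correlation is measured with Pearson’s coefficient, which is described in the next paragraph.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def prune_predictors(columns, target, min_target_r=0.2, max_pair_r=0.95):
    """Return the predictor names worth keeping.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    y = columns[target]
    # Step 1: drop predictors that are nearly uncorrelated with the objective.
    relevant = [
        name for name in columns
        if name != target and abs(pearson_r(columns[name], y)) >= min_target_r
    ]
    # Step 2: keep only one predictor out of each strongly correlated pair.
    kept = []
    for name in relevant:
        if all(abs(pearson_r(columns[name], columns[k])) < max_pair_r for k in kept):
            kept.append(name)
    return kept

# Hypothetical data: salary in thousands, age_months is just 12 * age_years.
data = {
    "age_years":  [25, 32, 41, 29, 50, 38],
    "age_months": [300, 384, 492, 348, 600, 456],
    "shoe_size":  [9, 9, 8, 8, 9, 8],
    "salary":     [60, 75, 98, 70, 120, 88],
}

print(prune_predictors(data, "salary"))
```

With this toy data, shoe size is dropped in the first step and age in months in the second, since it carries the same information as age in years. Which member of a redundant pair survives depends only on column order here; a real workflow would choose more deliberately.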

There are many measures of correlation, but by far the most widely used one is Pearson’s Product-Moment coefficient, or Pearson’s r. Given a collection of paired (x,y) values, Pearson’s coefficient produces a value between -1 and +1 to quantify the strength of linear dependence between the variables x and y. A value of +1 means that all the (x,y) points lie exactly on a line with positive slope, and conversely, a value of -1 means that all of the points lie exactly on a line with negative slope. A Pearson’s coefficient of 0 means that there is no linear relationship between the two variables, although a non-linear one may still exist. To see this visually, we can look at plots of our hypothetical data, and the Pearson’s coefficient computed from them.
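For readers who prefer code to prose, the coefficient can be sketched in a few lines of plain Python. This is our own illustrative implementation of the standard formula (covariance divided by the product of the standard deviations), with made-up sample values, not code taken from BigML.

```python
import math

def pearson_r(xs, ys):
    """Pearson's product-moment correlation for paired samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # The 1/n factors in the covariance and the variances cancel out.
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Points exactly on a line with positive slope, then negative slope.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))

# y = x^2 on a symmetric range: y clearly depends on x, but not linearly.
xs = [-2, -1, 0, 1, 2]
print(pearson_r(xs, [x * x for x in xs]))
```

The last case is worth a second look: y is completely determined by x, yet the coefficient comes out at zero because the relationship has no linear component.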

An example of positive correlation

Age vs. Salary

An example of negative correlation

Age vs. Student Loan Debt

An example of uncorrelated data

Shoe Size vs. Salary

We observe that the sign of the correlation coefficient matches the slope of the line of best fit. Note, however, that the magnitude of the coefficient does not depend on the slope of the line, but only on how well the points fit the line. Also notice from our Age vs. Debt plot that the relationship between the two variables is closer to exponential than linear. Despite there being a clear relationship between the two variables, a good straight-line fit cannot be made, and so the resulting correlation coefficient is smaller than one might expect. To elaborate on how non-linear data can confound Pearson’s r, we turn our attention to Anscombe’s Quartet. This is a set of four cleverly constructed datasets which have the same Pearson’s correlation coefficient, as well as other summary statistics, yet are significantly dissimilar when seen on a graph.
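The point about slope versus fit is easy to check numerically. The sketch below uses synthetic data of our own (two perfect lines with very different slopes, plus a steep line with a deterministic zig-zag added as stand-in noise) and a small helper implementing the standard Pearson formula.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = list(range(10))

# Two perfect straight lines: one almost flat, one very steep.
shallow = [0.1 * x for x in xs]
steep = [100.0 * x for x in xs]

# Same steep trend, but each point is pushed alternately up and down by 300.
noisy_steep = [100.0 * x + (300 if i % 2 == 0 else -300) for i, x in enumerate(xs)]

r_shallow = pearson_r(xs, shallow)
r_steep = pearson_r(xs, steep)
r_noisy = pearson_r(xs, noisy_steep)

print(f"shallow line: r = {r_shallow:.3f}")
print(f"steep line:   r = {r_steep:.3f}")
print(f"noisy steep:  r = {r_noisy:.3f}")
```

Both clean lines score essentially +1 despite a thousand-fold difference in slope, while the noisy series scores noticeably lower even though its underlying trend is just as steep.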

Anscombe’s Quartet with Pearson’s r and Spearman’s rho coefficients

In cases such as this, Spearman’s rank-correlation coefficient, or Spearman’s rho, may be a good alternative measure. Spearman’s rho quantifies how monotonic the relationship between the two variables is, i.e. “Does an increase in x usually result in an increase in y?” (technically, it is equivalent to computing Pearson’s r for a rank-transformed version of the data). We can see that while the four Anscombe datasets have nearly identical Pearson’s r, Spearman’s rho does a good job of discriminating between them.
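The rank-transform definition translates directly into code. The following is a minimal sketch rather than a reference implementation: the helper names are ours, ties receive the average of the ranks they span (the usual convention), and the exponential data is synthetic.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values equal to values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson's r computed on the rank-transformed data."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# A perfectly monotonic but strongly non-linear relationship.
xs = list(range(1, 11))
ys = [math.exp(x) for x in xs]

print(f"Pearson's r:    {pearson_r(xs, ys):.3f}")
print(f"Spearman's rho: {spearman_rho(xs, ys):.3f}")
```

Because y always increases with x, the ranks of the two variables match exactly and Spearman’s rho reaches 1, while Pearson’s r is pulled well below 1 by the curvature.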

Correlation coefficients are a useful tool for exploring relationships within your data. Now that you have been introduced to the topic of correlations, we invite you to explore it further with BigML’s new correlations resource, and start exploring your data!

Why I Left My Bank Executive Job to Join BigML

“If you want to hear my honest opinion, what you’re about to do looks like a desperate move to me.” This was the reaction of a good friend and mentor of mine when he heard that I was about to accept BigML’s offer to join them as V.P. of Business Development. Mind you, he has spent most of his career working for large international companies at the C-level. To be fair to him, his first conclusion was based on a five-minute phone introduction to BigML and my future role, while I was on the TGV train to Paris and he was trying to catch a plane in London. He continued: “Mobile payments are finally about to hit the mass market and there’s a huge buzz about all the FinTech topics you’ve been working on for years, and instead of cashing out you want to join a small startup in, what was the name again? Predictive analytics?”

I should have gone differently about explaining my motivation in exchanging a comfortable life and a well-paid FinTech job. While I am proud to have headed a great team at one of the most innovative Swiss banks that afforded me the opportunity to frequently exchange experiences with other FinTech professionals as a speaker and as a board director of Mobey Forum, I’m leaving it behind for three things:

  • The opportunity to actively participate in this unique moment in time, when predictive analytics technologies are starting to change the world as we know it
  • A company that is perfectly positioned to ride this big wave by offering Machine Learning as a Service (MLaaS) to enable not just one but a multitude of new business use cases
  • A team that is very passionate about making machine learning simple, beautiful and easily applicable into predictive apps and services for all interested parties out there including developers, students, analysts and business experts.

Let me further explain.

Predictive Analytics in Business Context

Coincidentally, my predictive analytics story began around the time BigML was founded in 2011. At that time, I had initiated the card-linked offers program for my bank in Switzerland. It was in essence pretty similar to what some startups in the U.S. had been offering, but with a twist. Instead of using only simple descriptive data analytics focusing on a selection of customers that had shown a certain behavior in the last X months, we focused our energy on developing a service based on predictive models that could tell us what the customer was likely to do next. This predictive capability was based on a huge amount of data our customers had given us permission to mine. This quickly became our unique selling proposition in the market and turned me and my colleagues into believers.

So what is so exciting about making predictions about unknown events? It boils down to the ability to learn from large datasets rather than following preset business rules, which is a huge improvement in the business world. The majority of the existing business rules within our companies are part of the heritage of the pre-digital world and have been transferred to the digital age without anyone questioning their present purpose. Meanwhile, we have access to technologies in the area of modeling, machine learning and data mining that enable us to build automated, self-improving processes to better adapt our companies to the present environment. In this context, being able to predict unknown outcomes will make the difference between acting and reacting, between smart and (to state it politely) not-so-smart, and between the winners and losers of tomorrow’s competitive landscape. Those who don’t count cards will ultimately leave the blackjack table penniless.

And how about those who specialize in counting cards? Kevin Kelly from Wired magazine wrote: “In fact, the business plans of the next 10,000 startups are easy to forecast: Take X and add AI.” If we take the FinTech space as an example of how many opportunities lie in front of new players who will benefit from the inability of banks to unlock the full potential of digitization, just imagine what kind of potential lies across different industries when it comes to artificial intelligence and predictive apps. Some claim smart apps will have an even bigger impact on our businesses and private lives than the introduction of mobile computing had. As in the case of my good friend from the beginning of this post, being a senior manager and running a multi-million euro business operation doesn’t automatically put you in a position to understand the potential of this technology and the forces at work, which are bound to redefine how we run our day-to-day businesses.

Why BigML?

Granted, the idea of machine learning in the cloud is no longer brand new. Last week Computerworld UK published a list of “7 cloud tools to harness artificial intelligence for your business”. And the participant list is impressive: Google, Microsoft, Amazon, Alibaba (the announcement was made last week – “huānyíng” guys), IBM and two startups. One of them is BigML. But back in 2011, only two of those seven companies were at the starting line: Google and BigML.

BigML’s early start in introducing large-scale machine learning to the masses has turned into a healthy mix of subscription customers and corporate engagements, including hybrid installations and professional services in a variety of industry segments and predictive use cases. Increasing demand for BigML’s expertise in this field has also resulted in high-profile strategic joint ventures such as the partnership with Telefónica Open Future_ to build an automated platform for assessing early-stage technology startup investment opportunities. There has never been a better time to be in machine learning, and it will likely remain a central area of business innovation for the next 5 years as per Gartner’s newly published hype cycle curve. We have got work to do!

The Team

What struck me right away when I met the team in Barcelona last year was the fact that many of the people working for the company have followed our energetic CEO into a new venture for the second or third time. This is indisputable proof of trust and leadership qualities. Like in the lyrics of that Bob Marley song: “You can fool some people sometimes, but you can’t fool all the people all the time.” Take my first conversation with JAO, BigML’s CTO. He wasted no time coming straight to the point: “Don’t consider starting here by lining up a bunch of integration projects as a success. I want our developers to continue developing and improving our predictive platform rather than running a bunch of short-term snowflake projects and leaving it to stagnate.” Bang! This was music to my ears. A company run by product guys who aim first and foremost to make a great product. So no more writing about it for me, let’s start!
