
The Importance of Feature Engineering



When people first see a demo of BigML, there is often a sense that it is magical. The surprise likely stems from the fact that this isn’t how people are accustomed to computers working; rather than acting like a calculator creating a fixed outcome, the process is able to generalize from patterns in the data and make seemingly sentient predictions.

However, it is important to understand that predictive analytics is not magic, and although the algorithm is learning on a very basic level, it can only extract meaning from the data you give it. It does not have the wealth of intuition that a human has, for better or worse, and consequently the success of the algorithm often hinges on how you engineer the input features.

Let’s consider a very simple learning task — please keep in mind that this is a contrived example to make explaining the problem of feature engineering clear, and does not necessarily represent an actual useful end result itself.

Assume you are working on a navigational system, and at some point in the system you would like a way to predict the principal direction of a highway knowing only its assigned number. For example, if a user wants to go north, and there are two nearby highways, Interstate 5 and Interstate 84, which should they take?

Now, you could use a list of known highways, but this would require you to regularly update the list as new highways are built or removed. Instead, if there were a pattern relating principal direction to highway number, this might be a useful thing for your device to know.

So, let’s take a list of primary interstates in the US and let BigML train a model to predict the principal direction: East-West or North-South. The resulting tree looks like this:

Click on the image to interact with the model


In the highlighted node, you can see that the learning algorithm has discovered that if the highway number is greater than 96, then the highway is principally North-South. And indeed, if we look at the dataset there are only two highways that match this pattern, 97 and 99; both are North-South, so the pattern is relevant.

However, as you navigate around the tree it becomes obvious that each split is simply creating bounds that eventually isolate a single highway or a small group of highways, at which point the prediction is less of a generalization and more of a truism:

Click on the image to interact with the model


In other words, the model doesn’t seem to be generalizing in a meaningful way from the highway number to the principal direction.

Now if you are familiar with the US highway numbering system, then you might know that there is significance in whether the highway number is even or odd. Let’s re-engineer our dataset to include this property and see if the model changes. We can do this by selecting the “Add Fields to Dataset” option:

Add Fields to Dataset

We’ll call the new field “isEven” and define it with the JSON s-expression:

[ "=", 0, [ "mod", [ "field", "Highway Number" ], 2 ]]

Reading from the inside brackets, we take the field named “Highway Number” and compute this value mod 2. If it equals 0, then this expression will return True meaning the highway number is an even integer, and False if odd:

JSON s-expression to select even/odd
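The same check is easy to express outside BigML as well. Here is a minimal Python equivalent of the expression above (the function name is ours, for illustration only):

```python
def is_even(highway_number: int) -> bool:
    """Python twin of the Flatline expression
    ["=", 0, ["mod", ["field", "Highway Number"], 2]]."""
    return highway_number % 2 == 0

# Interstate 5 (odd, North-South) vs Interstate 84 (even, East-West):
print(is_even(5), is_even(84))  # → False True
```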

Now we re-build the model including this new feature:

Click on the image to interact with the model in BigML


And now we get a very simple tree which generalizes to the following rules:

IF isEven = false THEN
        direction = North-South
IF isEven != false THEN
        direction = East-West

This is a much more useful generalization! But what is happening here? Why didn’t the machine learning algorithm find this pattern in the first tree?

Remember the first dataset: all we gave the algorithm to learn from was an integer. And the only thing the algorithm knows about integers is that they have a natural order. That’s it. And so, it tried to find a pattern relating the natural order of the integers to the principal highway direction.

As humans, we potentially know a *lot* more about integers: some are squares, some are prime, some are perfect, and some are even. In the second dataset, we added some of this additional information about integers, specifically the even-ness, to the algorithm. By engineering this feature, we gave the algorithm the extra information it needed to find the pattern. In other words:

1) The “Feature Engineering” was adding the even/odd property.
2) The “Machine Learning” was the discovery that the even/odd property determines the principal direction.

The insight here is that a learning algorithm can only discover the patterns that we provide in the data, either intentionally or accidentally.

In this rather contrived example, it might seem circular. That is, we start with an insight that even/odd has meaning, add that property, and then discover that even/odd has meaning. However, it is important to remember that this is a very simple example. When working with real data you may have hundreds or thousands of features and the patterns will be much more nuanced.

In that real world case, the importance of feature engineering is to use domain specific knowledge and human insight to ensure that the data contains relevant indicators for the prediction task. And, in that case, the beauty of machine learning is that it discovers the relevant patterns and filters out the incorrect human insights.

If you would like to run this example in your own account, here is a little python script which reproduces all the steps in development mode (FREE):
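(The script itself is not reproduced in this excerpt. Below is a hedged sketch of what it does, using the bigml Python bindings; the CSV filename and the credentials handling are placeholders, not the original script.)

```python
import os

# Flatline expression for the engineered even/odd feature.
IS_EVEN = '["=", 0, ["mod", ["field", "Highway Number"], 2]]'

def build_direction_model(api, csv_path="highways.csv"):
    """Upload the interstate list, add the isEven field, and train a model."""
    source = api.create_source(csv_path)
    dataset = api.create_dataset(source)
    # Extend the dataset with the new derived field.
    extended = api.create_dataset(dataset, {
        "new_fields": [{"name": "isEven", "field": IS_EVEN}]})
    return api.create_model(extended)

if os.environ.get("BIGML_USERNAME") and os.environ.get("BIGML_API_KEY"):
    from bigml.api import BigML  # requires the bigml package and credentials
    model = build_direction_model(BigML())
```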

My favorite bugs


Arguments missing when calling a function,
mispelling keys setting some advanced option,
accessing attributes your object lacks:
These are a few of my favorite bugs.

Non-closing commas on deep JSON structures,
long late-night hacking that ends up in rapture:
Found a solution, that actually sucks.
These are a few of my favorite bugs.

Wrong file permissions on root system folders,
new model’s methods that break on the olders,
out of range values on sliders and bars:
These are a few of my favorite bugs.

When the job fails,
and the app blocks,
or the cloud turns black,
I simply remember my favorite bugs
and then I don’t feel so bad.


When the job fails,
and the app blocks,
or the cloud turns black,
I simply remember it’s wintermute’s fault
and then I don’t care at all!

my favorite bugs

NB: wintermute is our backend’s codename

How to build a Predictive Lead Scoring App

Predictive Lead Scoring is a crucial task to maximize the efforts of any sales organization.

There are a few applications on the market today, like Fliptop, KXEN, or Infer, that already allow you to score your sales leads. These offerings have validated both the importance and the market appetite for predictive scoring solutions.

However, you may want to have greater flexibility in choosing your CRM system, or perhaps you want to build your own predictive model to do the scoring, or you might want to integrate the scoring process within related services in your organization. In this post, I’m going to show you how to build an application to score your leads using Salesforce, Talend Open Studio, and BigML—three great tools working together to build a flexible predictive solution for a common business problem!

To complement this post I’ve created a complete step-by-step tutorial that will guide you on the implementation for this use case. No matter if you are a developer, a business analyst or a data scientist, this tutorial is made for you :-). At the end of each section of this post, you will find references to the related parts in the tutorial.

The Example

To illustrate the post, I am going to use a fictitious company named AllYouCanBuy that uses Salesforce and wants to prioritize their sales leads automatically. The objective is to provide AllYouCanBuy’s sales team with an automated solution that provides a panel like the one below where leads can be sorted by priority score. Each lead should be automatically labeled with a score so that the top priority leads (green bars on the picture) represent the leads with higher confidence of becoming customers.



To implement a fully automated solution, we need to accomplish the following  tasks:

  1. Automatically generate scores for sales leads.  I’ll solve this task with Machine Learning. We’ll use historical data on which leads converted into customers to predict future scores. We need an engine or service that allows us to programmatically build predictive models and generate predictions. Obviously, I am going to use BigML but another machine learning package could be used in a similar fashion.
  2. Automatically extract historical data from the CRM and return it with new scores. I’ll solve this task using an ETL tool. ETL tools are great helpers in the integration of disparate sources of data and services. They help Extract data from a multitude of sources including Salesforce, Transform it using a programmable toolkit with many pre-built functions and techniques, and finally Load the results into another system or service. For this example, I am going to use Talend Open Studio, from the Talend Company.

The following picture shows a high-level architecture of the predictive lead scoring solution:

Next I’m going to describe each of the high-level components of the architecture in a little more detail, but first let me tell you more about Predictive Lead Scoring.

Predictive Lead Scoring

Predictive Lead Scoring is one of the most active fields today within the set of problems that can be solved using Machine Learning algorithms, and therefore with the use of BigML.

This technique seeks to improve the results obtained by sales teams during the qualification stages of their leads, helping them to predict leads that have a higher probability of success. This will improve sales results by focusing efforts first on the most important leads rather than on those with lower chances of success—thereby helping teams organize their time and work. When one considers the cost associated with lead qualification it is very logical to spend time looking for ways to optimize the process.

Imagine a company that buys a database of 5,000 new leads. Without a tool that allows them to set priorities, they would be calling from first to last in the list without having any idea what will happen.  Ad hoc intuition would be the guiding light on lead prioritization. What a waste of time, right?

However, with solutions like we are detailing here, you can solve this kind of business challenge quite easily—helping your sales team be more effective through the power of machine learning (as opposed to gut instinct).


AllYouCanBuy (our fictitious company) fortunately has been operating for some time and has been storing data about prior leads using  Salesforce. Their data includes the following types of custom fields:

  • Input fields: fields that hold information about each lead (city, sector, number of interactions, etc.). These are the fields that we will use as input data for making predictions. This set of features can be as rich as you want and can use both specific data about the lead as well as data about the interactions with the lead.
  • Fact field: this is a field that the sales team has been using to label whether a lead became a new customer or not. This is the label (in supervised learning parlance) that will be used to train and evaluate the predictive models. In a more sophisticated application, this field could be automatically computed.
  • Output fields: these are the fields that we will use to store the output of the model, the confidence of the prediction returned by the model, and a priority field (the lead score). We compute the priority field based on the output of the model and its confidence. For example, leads with a ‘true’ prediction and a high confidence will become our top priority and leads with a ‘false’ prediction and high confidence will become our low priority leads.   Priority fields will allow for a more user-friendly representation of the predictions.
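As an illustration of that last bullet, here is one simple way to fold a prediction and its confidence into a single sortable priority score (a toy formula of our own, not the tutorial's exact computation):

```python
def lead_priority(converted: bool, confidence: float) -> float:
    """Collapse (prediction, confidence) into one score in [0, 1]:
    a confident 'true' lands near 1 (top priority),
    a confident 'false' lands near 0 (low priority)."""
    return 0.5 + confidence / 2 if converted else 0.5 - confidence / 2

# A confident 'true' outranks an uncertain 'true', which outranks any 'false'.
print(round(lead_priority(True, 0.9), 2))  # → 0.95
```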

In the tutorial, you can read more on how to create a Developer Account, how to create custom fields on the Lead object, and how to customize Lead objects in Salesforce.


BigML not only provides a great set of 1-click functions and visualizations for predictive modeling but also a REST API to programmatically run diverse sophisticated predictive modeling workflows.

To simplify the first version of our predictive lead scoring app, I am going to create the model directly in BigML and use it to make predictions. In a second iteration, I’ll automate the model creation too. So basically, we only need to export Salesforce data to a CSV file, upload the file to BigML and let it do the data modeling. I will use BigML’s 1-Click Ensemble to create a very robust predictive model ready to make predictions in a matter of seconds.


In the tutorial, you will find more details on how to create an account in BigML and how to create a predictive model in BigML.

Talend Open Studio

Talend Open Studio provides an extensible, high-performance, open source set of tools to access, transform and integrate data from any business system in real time or batch to meet both operational and analytical data integration needs. It has more than 800 connectors and can help you integrate almost any data source. The broad range of use cases addressed includes: massive-scale integration (big data/NoSQL), ETL for business intelligence and data warehousing, data synchronization, data migration, data sharing, data services, and now predictions!

We will use Talend Open Studio not only to perform the required data transformations but also to communicate with Salesforce and BigML. The transformations here are only simple metadata mappings between the respective output and input data of the two services, since in our case all the information about AllYouCanBuy’s leads comes from the same place: Salesforce. However, the transformations can be more sophisticated for more complex applications—in the real world, you may want to pull information from other internal and external sources that can help create richer predictive models.

Talend allows you to use a high-level visual component to design complex ETL processes without writing a single line of code!  You can see what the Talend ETL process looks like below:


BigML has developed a Talend Component named tBigMLPredict that you can download here and incorporate in your own installation of Talend. This component will help you make predictions with a predictive model you have previously created in BigML.

Once installed, this component will be visible in the Palette of components, inside the Business Intelligence > BigML Components category.


The tBigMLPredict component allows us to set the following configuration parameters:

Configuration of the tBigMLPredict component

In the tutorial, you can read more on how to download and install Talend Open Studio,  how to download and configure the BigML Talend Components, how to design the integration job in Talend, and how to execute it in Talend Open Studio.


We have outlined in this blog (and the tutorial + related documents!) how you can orchestrate a flow that automatically scores sales leads in Salesforce using Talend and BigML.  It shouldn’t be difficult to create similar flows for other CRM services or using other ETL platforms–please let us know which ETL and CRM tools we should work on next!

I hope this post inspired you to start building your own predictive lead scoring application or more sophisticated predictive flows.

Introducing: Magic Data Goggles!

Bad things happen, but thankfully they tend to happen rarely. For example, you’d expect a small fraction of network traffic to be hackers, and a minority of patients to have a serious disease. (I was going to add that we expect a small percentage of credit card transactions to be fraud, but these days that feels a bit optimistic.) We obviously want to identify and avert these rare bad events, and anomaly detection—which BigML just launched last week—is a powerful way to achieve this.

In the disease category, there’s a well-known dataset of breast cancer biopsies from University of Wisconsin Hospitals, including measurements from each biopsy and the result of “benign” or “malignant”. Of course, you can use BigML to train a highly accurate predictive model on this labeled data, but that’s almost too easy. So here’s a challenge: what if we remove the labels of “benign” and “malignant”? Can we still find useful patterns in the data?

BigML makes it simple to create a dataset with only the measurements and not the “benign” or “malignant” labels. We then train an anomaly detector on this unlabeled data, and BigML displays the 10 biopsies with the highest “outlier-ness”:


That’s interesting, but I want more insight into what makes a biopsy anomalous. To do this, I create anomaly scores for the entire dataset, give each biopsy a label of “high” or “low” (with “high” defined as the top third of anomaly scores), then train a model to predict this new label. (I’m working on a video for David’s Corner that shows how this all takes just a few mouse clicks—which is exactly what we expect from BigML!)
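BigML does all of this in a few clicks; for readers who want to replay the trick offline, here is a sketch of the same label-then-model idea using scikit-learn's IsolationForest (not BigML's detector) on the related breast-cancer dataset that ships with scikit-learn — so the exact numbers will differ from the post's:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # y: 0 = malignant, 1 = benign

# 1. Score every biopsy without looking at the labels (lower = more anomalous).
scores = IsolationForest(random_state=0).fit(X).score_samples(X)

# 2. Tag the top third of anomaly scores as "high".
high = scores <= np.quantile(scores, 1 / 3)

# 3. Train an ordinary model to explain the new high/low label.
explainer = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, high)

# How well does the unsupervised label alone track malignancy?
print(round(float(np.mean(high == (y == 0))), 2))
```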

This new model finds a striking pattern: most high-anomaly biopsies have “uniformity of cell size” greater than 2. Of the 231 high-anomaly biopsies in the entire dataset, a whopping 207 (almost 90%) are covered by this single rule. A higher “uniformity of cell size” means (unintuitively) that the size is less uniform, which is a feature of cancer cells, so experts would conclude that this pattern is worth investigating further.

And they would be right. Because if we let BigML use the labels of benign or malignant, it tells us that biopsies with “uniformity of cell size” higher than 2 are almost always malignant. Think about that: the anomaly detector, having no idea which examples are actually malignant, still managed to figure out that this cell size attribute is important, and specifically that it’s important when it’s greater than 2.

Another way to see the power of anomaly detection is to predict the outcome of a biopsy using only the high/low anomaly attribute. This correctly predicts the result 89% of the time, and detects 83% of malignant biopsies. Again, not bad considering the anomaly detector has no idea which examples are actually benign or malignant!


Finally, we can simply compare the histogram of anomaly scores for malignant and benign biopsies. This clearly shows how well the anomaly score lines up with the biopsy results!


Hopefully I’ve conveyed how insanely useful anomaly detection can be for finding patterns in unlabeled data, especially if you expect the data to contain a highly interesting (and often unwelcome) smaller class. This is particularly useful for large datasets where it is not feasible to label all the “bad” examples: millions of credit card transactions, for example, or billions of network events.

Moreover, you expect your adversary—fraudsters, hackers or even cancer—to change tactics over time. Because anomaly detection doesn’t require you to know exactly what you’re looking for, it can pick up on new types of attacks and warn you that something weird is going on.

Anomaly detection is like magic goggles for your data, helping you find patterns in a completely automated and unsupervised way. Of course, it’s not really magic: we’re just reaping the benefits of assuming, correctly, that there is a minority class to be found. And as long as we have adversaries, that will continue to be a good assumption.

For serious data enthusiasts, here are the ingredients for this analysis:

BigML Down Under—Introducing

Machine Learning for everyone also means Machine Learning everywhere. This month we will get a little closer. We are crossing the Pacific and launching BigML in Australia and New Zealand. From today on, BigML users in Australia and New Zealand can enjoy BigML on a dedicated regional site with identical functionality, running directly on local cloud-based infrastructure. In addition, we’re very excited to detail a unique alliance that BigML has launched with a leading data intelligence company in the region, GCS Agile.


While BigML makes it easy for you to build predictive models and perform a variety of machine learning tasks, there are many other related activities that are required for enterprise-grade deployment of machine learning solutions. For example, data transformations, feature engineering, finding the best modeling and prediction strategies, and measuring the impact are key to maximizing the power of machine-learned models. GCS Agile’s data intelligence team is uniquely qualified to support this type of holistic approach to machine learning, and BigML is pleased to announce that our companies have entered into a strategic alliance.


GCS Agile is comprised of seasoned leaders who have extensive experience providing data-driven solutions to leading companies in diverse sectors such as telecommunications, finance, and government.  BigML will rest at the heart of GCS Agile’s data intelligence practice.  GCS Agile and BigML teams will  work together to bring BigML private deployments to mid-sized and big companies in Australia and New Zealand.

In addition, the GCS Agile team has been busy organizing a series of public events where you will be able to meet face to face with GCS Agile and BigML leadership team members. If you want to know more please contact the GCS Agile team at

Below is a partial list of public events:

  • Cloud-based Machine Learning [Open to everyone]
Swinburne University of Technology in Melbourne

Tuesday, 14 October 2014, 10am – 4pm

  • Why data needs to be considered a strategic asset by organisations in 2015? [By invite only]
RACV: 501 Bourke Street, Melbourne

Wednesday, 15 October 2014, 5:30pm – 8:00pm

  • Data Science Melbourne
Inspire 9, Level 1, 41 Stewart Street, Richmond

Thursday, October 16, 2014 

Stay tuned for further announcements from GCS Agile and BigML. In the interim, if you are in Australia or New Zealand don’t wait to give the new site a spin…


BigML Late Summer Release: Anomaly Detection and More!

Before your sunburns subside and the leaves begin to turn from green to brown, the BigML team is excited to share our Late Summer Release which includes a bunch of new functionality to empower many new predictive applications.

Headlining this release is Anomaly Detection, which can help automate a number of  predictive tasks for fraud detection, security, quality control, diagnoses and more.  Also included in the release are support for model clusters, missing splits, client-side predictions and more!


For starters, we’re excited to announce that BigML now allows you to create top-performing Anomaly Detectors automatically in just one click, or programmatically via BigML’s REST API.

An anomaly detector is a predictive model that can help identify the instances within a dataset that do not conform to a regular pattern. This can be useful for tasks like data cleansing, identifying unusual instances, or, given a new data point, deciding whether a model is competent to make a prediction. Thus anomaly detectors are not only critical tools for fraud detection, medical diagnosis, or defect prevention; they also do a great job removing outliers, which in turn helps increase the performance of other modeling tasks.

When you create a new anomaly detector, it automatically returns an anomaly score for the top n most anomalous instances. The newly created anomaly detector can later be used to create anomaly scores for new data points or batch anomaly scores for all the instances of a dataset.

BigML anomaly detectors are built using an unsupervised anomaly detection technique that helps isolate those instances that are unusual, and you do not need to explicitly label each instance in your dataset as “normal” or “abnormal.” We’ll be explaining the technique further in blog posts soon and also will be layering in added functionality (e.g., the ability to work with text fields). You can get started today with just about any dataset—and as always, you can work for free in BigML’s Development Mode!
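For API users, the workflow described above might look like the following sketch with the bigml Python bindings (the resource id and the input point are placeholders):

```python
import os

def anomaly_workflow(api, dataset_id, new_point):
    # Build the detector; ask it to report the 10 most anomalous instances.
    anomaly = api.create_anomaly(dataset_id, {"top_n": 10})
    api.ok(anomaly)
    # Score one new data point ...
    score = api.create_anomaly_score(anomaly, new_point)
    # ... or every instance of a dataset in one batch call.
    batch = api.create_batch_anomaly_score(anomaly, dataset_id)
    return score, batch

if os.environ.get("BIGML_USERNAME") and os.environ.get("BIGML_API_KEY"):
    from bigml.api import BigML
    anomaly_workflow(BigML(), "dataset/<your-dataset-id>", {"some field": 1})
```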


Model Clusters

Now you can automatically create a dataset and model for each cluster. This will not only help you better understand the cluster, but you can also use model clusters to classify new instances.

To use this functionality, be sure to click the “create model clusters” option when configuring your cluster. Then, if you want to build a model from one of your clusters, simply hit ‘shift’ on any cluster and then choose “create a model of this cluster” from beneath the right-hand summary box.

And, voila—you’ll have a new model comprised of your cluster’s data, which you can then interpret to find key patterns associated with whether or not data is likely to be within that cluster. Check this section out to learn how to use this feature via BigML’s API.
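Through the API, the same option is a single cluster-creation argument. A hedged sketch (the argument names follow BigML's cluster API; the resource id is a placeholder):

```python
import os

# "model_clusters" is the API-side counterpart of the configure-panel option.
CLUSTER_ARGS = {"k": 8, "model_clusters": True}

def cluster_with_models(api, dataset_id):
    cluster = api.create_cluster(dataset_id, CLUSTER_ARGS)
    api.ok(cluster)  # wait until the cluster (and its models) are ready
    return cluster

if os.environ.get("BIGML_USERNAME") and os.environ.get("BIGML_API_KEY"):
    from bigml.api import BigML
    cluster_with_models(BigML(), "dataset/<your-dataset-id>")
```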

Missing Splits

Because cleaning up data is hard and you won’t always have every input field handy at prediction time, we have built a new option to create models whose predicates explicitly deal with missing values.

To leverage this capability, go into the “Configure” subpanel when configuring your model, and click on the “missing splits” icon as follows:

The model that is created will look the same as before, but now you can see new predicates that directly check for missing values. See the example in the picture below.
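On the API side, the same switch is a single model argument. A sketch (the argument name follows BigML's model API; the resource id is a placeholder):

```python
import os

# API-side counterpart of clicking the "missing splits" icon.
MODEL_ARGS = {"missing_splits": True}

def model_with_missing_splits(api, dataset_id):
    model = api.create_model(dataset_id, MODEL_ARGS)
    api.ok(model)
    return model

if os.environ.get("BIGML_USERNAME") and os.environ.get("BIGML_API_KEY"):
    from bigml.api import BigML
    model_with_missing_splits(BigML(), "dataset/<your-dataset-id>")
```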


Online Predictions

New client-side predictions make it easier than ever to explore the influence of each field in your models, ensembles or clusters. Whereas you previously had to rebuild predictions for each set of variables, you can now simply change your fields’ inputs and see the predicted output change in realtime!  In addition, the prediction form also includes the relative importance of each field so you can quickly select / de-select them for your predictions:

Some added benefits of online predictions are that they’re free to use–both for pay-as-you-go customers, and also for anyone predicting against a model that’s been shared in BigML’s gallery and/or through a private link.

Also, we are open sourcing the related Javascript libraries so that you can easily leverage this functionality to build very powerful and dynamic apps and web services.

Faster Ensembles

As you already know, ensembles provide greater generalization than single decision trees, and BigML makes it easy for you to tap into this functionality with just a few mouse clicks.

With our latest release, BigML’s ensembles now run much faster than before–meaning that you can more quickly build fully actionable ensembles to underpin predictive analyses and applications. Basically, we have reengineered the way ensembles deal with all the data processing that BigML needs before creating a model.

And More…

You’ll also notice a bunch of UI and workflow improvements, which we’re constantly bringing into production. These typically have a “new” image next to them. One of the features that we like most is the new option to automatically generate a dataset from the output of a batch process. That is, when you request a batch prediction, a batch centroid, or a batch anomaly score you can optionally request to build a new dataset with the results. This is particularly useful to implement iterative flows where you use the output (prediction, centroid, or score) as an additional input for building another model. We’ll elaborate more on this in a future blog post.
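A sketch of that iterative pattern with the bigml Python bindings (the "output_dataset" argument name follows BigML's batch-prediction API; resource ids are placeholders):

```python
import os

def batch_predict_to_dataset(api, model_id, dataset_id):
    # Ask BigML to materialize the batch results as a brand-new dataset,
    # ready to feed the next model in an iterative flow. The same flag
    # applies to batch centroids and batch anomaly scores.
    batch = api.create_batch_prediction(model_id, dataset_id,
                                        {"output_dataset": True})
    api.ok(batch)
    return batch

if os.environ.get("BIGML_USERNAME") and os.environ.get("BIGML_API_KEY"):
    from bigml.api import BigML
    batch_predict_to_dataset(BigML(), "model/<id>", "dataset/<id>")
```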

The Late Summer Release features are available immediately–simply log into your account and get started today! And be sure to let us know your feedback both on these features, and on what you’d like to see next!

Interested in seeing these new features in action?  Check out the archived video and slides from our Late Summer Release launch webinar.

PAPIs’14 – The First International Conference on Predictive APIs and Apps

In 2006, Strands, my previous company, organized and sponsored a Late Summer School on Recommender Systems in Bilbao. With a little help from our friends and a grant from the Basque Country Government, we invited a number of experts and promising Ph.D. students to a series of workshops to discuss the Present and Future of Recommender Systems. Before the workshops were over, a few of the attendees, including Professor John Riedl, were already talking about building an international conference on the wonderful interactions we were experiencing (some pictures here). One year later, the First ACM Recommender Systems conference (RecSys) was held in Minneapolis. In a few weeks, RecSys will get to its 8th edition. It’s incredibly rewarding to see how companies like Netflix, Linkedin, Google, Yahoo!, Baidu, IBM, Comcast, and Facebook picked up the baton and currently sponsor what we started 8 years ago. Today, we announce that we’re launching a similar endeavor to create a new community on Predictive APIs and Applications.

Early this year, I electronically met Louis Dorard and Ali Syed. It turned out that we were all on the same mission to democratize machine learning with our respective companies. We also shared the vision that a predictive world will be a much better world. Louis and I later met in San Francisco. Shortly thereafter, we started discussing the need for a world-wide practical community that gathers annually to discuss the latest technical advancements and challenges on Predictive APIs and Applications. I am thus very happy to announce that PAPIs 2014 – The First International Conference on Predictive APIs and Apps will be held in Barcelona on November 17-18. We even have plans for repeating the conference in Sydney right before KDD next year and afterwards bringing it to the US in 2016, or wherever the community takes it.

Over the last few weeks, we’ve been busy building up a highly technical and diverse Program Committee that will help select the first presenters at PAPIs 2014:

Erick Alphonse (Deloitte), Sébastien Arnaud (OpinionLab), Richard Benjamins (Telefonica), Misha Bilenko (Microsoft), Jason Brownlee (Machine Learning Mastery), Natalino Busa (ING), Eric Chen (NTT Innovation), Mike Cossy (IBM), Beau Cronin (Salesforce), Ricard Gavaldà (UPC), Andrés González (CleverTask), Matthew Grover (Walmart), Harlan Harris (Education Advisory Board), Jeroen Janssens (YPlan), Benedikt Koehler (, Maite López (UB), Gideon Mann (Bloomberg), Jordi Nin (Barcelona Supercomputing Center), Mark Reid (ANU), Juan Antonio Rodriguez (IIIA), Marc Torrens (Strands), Jordi Torres (Barcelona Supercomputing Center), and Zygmunt Zając (FastML).

A first group of companies like BigML, CleverTask, Codole, Dataiku, GCS Agile, Persontyle, Strands, and Taiger, and much larger ones like Microsoft and IBM, are helping us propel PAPIs and create the community. If your company is interested in helping us in this endeavor please contact us here.

Problem-First Approach

We want PAPIs to become an open forum for technologists and researchers on distributed, large-scale machine learning services and developers of real-world predictive applications. We aim at seeding highly technical discussions on the way common and uncommon predictive problems are being solved. However, we want PAPIs to be an eminently hands-on conference.

One of the features that characterized the RecSys community was its original mix of researchers and practitioners. However, it had a clear imbalance on the academic side. In the 2009 edition in New York City, I gave a provocative opening keynote arguing that too much emphasis was given to new algorithms rather than to other under-represented topics like data, evaluations, and interfaces that were essential to build real-world recommender systems. I still remember the (positive and negative) reactions of many attendees. In my opinion, an incredible group of talented scientists were not paying that much attention to the crucial problems. A couple of years ago, Domonkos Tikk sent me a note that said:

I often cite your provocative statement on the 5% contrib of algorithms to the success of recsys. 3y ago I completely disagreed. Now I see that it might be 10 or 15%, but there are other major factors too. Well, we are also more in the industry than in academia…

In this first edition, we will focus on the pragmatic issues and challenges that companies in the trenches must face to make predictive APIs and applications a reality, and we will add academic tracks in future editions, once we understand those issues better.

So if you are working on an interesting Predictive API or Application and want to show the rest of the world your new advancements, or discuss the challenges that you are facing, please send us your proposal.

Predictive APIs and Applications cover a wider area of application than Recommender Systems. Therefore, their impact on our everyday lives will be orders of magnitude higher and will affect more industries than we can now imagine. So please don’t miss the opportunity to join this nascent community early on.

See you all at PAPIs in Barcelona!!!

Enhancing your Web Browsing Experience with Machine Learned Models (part II)

In the first part of this series, we saw how to insert machine-learned model predictions into a webpage on-the-fly using a content script browser extension which grabs the relevant pieces of the webpage and feeds them into a BigML actionable model. At the conclusion of that post, I noted that actionable models were very straightforward to use, but their static nature would lead to lots of effort in maintaining the browser extension if the underlying model were constantly updated with new data. In this post, we’ll see how to use BigML’s powerful API to both keep your models up to date, and write a browser extension that stays current with your latest models.

Staying up to date

In keeping with our previous post, we will be working on a model to predict the repayment status of loans on the microfinance site Kiva. In the last post, we used a model that was trained nearly one year ago on a snapshot of the Kiva database.  Hundreds of new loans become available every day, which means this model is starting to look a bit long in the tooth. We’ll start off by learning a fresh model from the latest Kiva database snapshot, and in doing so, we’ll get to use some of the new features we’ve added to BigML over the last year, such as objective weighting and text analysis. However, the Kiva database snapshot is several gigabytes of JSON-encoded data, much of which won’t be relevant to our model, so we really don’t want to relearn from a new snapshot several months down the line when our hot new model becomes stale. Fortunately, we can avoid this situation using multi-datasets. Periodically, say once a month, we’ll grab the newest loan data from Kiva and build a BigML dataset. With multi-datasets, we can take our monthly dataset, our original snapshot dataset, and all the previous monthly datasets, concatenate them together and learn a model from the whole thing. With the BigML API, this all happens behind the scenes, and all we need to do is supply a list of the individual datasets.  We can do all of this with a little Python script. Here is the main loop:
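A minimal sketch of that main loop, with the actual BigML and Kiva HTTP calls abstracted behind illustrative helper callables (`create_base` and `create_monthly` are assumed names, not the original script's):

```python
# Sketch only: `existing_datasets` stands in for the BigML API's dataset
# listing, and `create_base` / `create_monthly` for helpers that build a
# dataset from a full Kiva snapshot or an incremental monthly update.

def build_model_request(dataset_ids, name="kiva-model"):
    # A multi-dataset model: BigML concatenates every dataset in the
    # `datasets` list behind the scenes before training a single model.
    return {"datasets": dataset_ids, "name": name}

def main_loop(existing_datasets, create_base, create_monthly):
    # Filter the account's datasets by our naming convention.
    kiva_ids = [d["resource"] for d in existing_datasets
                if d.get("name") == "kiva-data"]
    if not kiva_ids:
        # First run: build a base dataset from a full Kiva snapshot.
        kiva_ids = [create_base()]
    else:
        # Later runs: append one monthly incremental dataset.
        kiva_ids.append(create_monthly())
    return build_model_request(kiva_ids)
```

The returned body is what gets POSTed to the model-creation endpoint; everything else is bookkeeping.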

Every dataset created by this script will have “kiva-data” as its name, so we can use the BigML API to list the available datasets and filter by that name. If none show up, then we know we need to create a base dataset from a Kiva snapshot; otherwise, we’ll use the Kiva API to create an incremental update dataset. In either case, we then proceed to create a new model using multi-datasets. All we need to do is pass a list of our desired dataset resource IDs. We assign the name “kiva-model” to the model so that it can be easily found by our browser extension. We employ a few other minor tricks in our script, such as avoiding throttling in the Kiva API. You can check out the whole thing at this handy git repo.

API-powered browser extension

Our first browser extension was a content script that fired whenever we navigated to a Kiva loan page. It would grab the loan information, feed it to the model, and then use jQuery to insert a status indicator into the webpage’s DOM tree. Our new extension won’t be very different with regards to loan data scraping and DOM manipulation; the main difference will be how the model predictions are generated. Whereas our first extension featured an actionable model, i.e. a series of nested if-else statements, our new extension will perform predictions with a live BigML model, using the RESTful API. To interface with the model, our extension now needs to know about our BigML credentials. To store them, we create a simple configuration page for the extension, consisting of two input boxes and a Save button.

Behind the scenes, we have a little bit of JavaScript to store and fetch our credentials using the API.

Whenever we save a new set of credentials, we want to look up the most recent Kiva loans model found in that BigML account. However, this isn’t the only context in which we want to fetch models. For instance, it would be a good idea to get the latest model every time the web browser starts up. We’ll implement our model fetching procedure inside an event page, so it can be accessed from any part of the extension.

At the bottom of the event page, we see that it listens for the browser startup event, and for any message with the greeting “fetchmodel”, to fire the model fetching procedure. We also see that it listens for the onInstalled event to open up the configuration page for the first time. With this extra infrastructure in place, we are ready to make our modifications to the content script. Here is the script in its entirety:

Compared to the previous version of the extension, the bulk of the differences lie in the predictStatus function. Rather than evaluating a hard-coded nested if-then-else structure, the loan data is POSTed to the BigML prediction URL with AJAX, and most of the DOM manipulation has been refactored to occur in the AJAX callback function. One perk we get from using a live model is that we get a confidence value along with our prediction of the loan status. We’ve used that to add a nice little meter alongside our status indicator icon.
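Outside the browser, the same request is easy to reproduce. Here is a Python sketch of what the extension's AJAX call assembles, assuming BigML's standard prediction endpoint (the model id and credentials are placeholders):

```python
import json

BIGML_PREDICTION_URL = "https://bigml.io/prediction"

def build_prediction_request(model_id, loan_fields, username, api_key):
    """Assemble the POST the content script makes: the scraped loan
    fields become the prediction's input_data."""
    url = "%s?username=%s;api_key=%s" % (
        BIGML_PREDICTION_URL, username, api_key)
    body = json.dumps({"model": model_id, "input_data": loan_fields})
    return url, body

# The JSON that comes back carries both the predicted loan status and
# the confidence value used to draw the meter.
```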


Better browsing through big data

You can grab the source code for this Chrome browser extension here, where it is also available as a Greasemonkey user script for Firefox. We hope that this post, and its prequel, have been able to show the relative ease with which BigML models can be incorporated into content scripts and browser extensions. Beyond this example, there are a myriad of potential applications on the internet where machine learned models can be used to provide a richer and more informed web browsing experience. It’s now up to you to use what you’ve learned here to go out and realize those applications!

Using Machine Learning to Gain Divine Insights into Kindle Ratings

I recently came across a 1995 Newsweek article titled “Why the Web Won’t be Nirvana” in which author and astronomer Cliff Stoll posited the following:

How about electronic publishing? Try reading a book on disc. At best, it’s an unpleasant chore: the myopic glow of a clunky computer replaces the friendly pages of a book. And you can’t tote that laptop to the beach. Yet Nicholas Negroponte, director of the MIT Media Lab, predicts that we’ll soon buy books and newspapers straight over the Internet. Uh, sure.

Well, it turns out that Mr. Stoll was slightly off the mark (not to mention his bearish predictions on e-commerce and virtual communities). Electronic books have been a revelation, with the Kindle format being far and away the most popular. In fact, over 30% of books are now purchased and read electronically. And Kindle readers are diligent about providing reviews and ratings for the books that they consume (and of course the helpful prod at the end of each book doesn’t hurt).

So that got us thinking: are there hidden factors in a Kindle book’s data that are impacting its rating? Luckily, it’s easy to grab this data for analysis, and we did exactly that: pulling down over 58,000 Kindle reviews, which we could quickly import into BigML for more detailed analysis.


My premise going into the analysis was that the author and the words in the book’s description, along with the length of the book, would have the greatest impact on the number of stars that a book receives in its rating.  Let’s see what I found out after putting this premise (and the data) to the machine learning test via BigML…


We uploaded over 58,000 Kindle reviews, capturing URL, title, author, price, save (whether or not the book was saved), pages, text description, size, publisher, language, text-to-speech enabled (y/n), x-ray enabled (y/n), lending enabled (y/n), number of reviews and stars (the rating).

This data source includes text, numeric, and categorical fields.  To optimize the text processing for authors, I selected “Full Terms only,” as I don’t think that first names would have any bearing on the results.

kindle author full term
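For readers driving BigML through the API rather than the dashboard, the same choice corresponds (to the best of my knowledge) to a term_analysis setting on the source's author field. A sketch of the update body, with an illustrative field id:

```python
def full_terms_update(field_id):
    # Body for a PUT to the source resource: analyze the field as
    # whole terms ("John Grisham") rather than individual tokens.
    # The field id you pass is source-specific; "000002" below is
    # only an example.
    return {"fields": {field_id: {"term_analysis":
                                  {"token_mode": "full_terms_only"}}}}
```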

I then created a dataset with all of the fields, and from this view, I can see that several of the key fields have some missing values:


Since I am most interested in seeing how the book descriptions impact the model, I decide to filter my dataset so that only the instances that contain descriptions will be included.  BigML makes it easy to do this by simply selecting “filter dataset”:

filter dataset

and from there I can choose which fields to filter, and how I’d like them to be filtered.  In this case I selected “if value isn’t missing” so that the filtered dataset will only include instances where those fields have complete values:
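In plain code, the dashboard filter amounts to keeping only the rows where the chosen fields are present; a minimal stand-alone sketch:

```python
def keep_complete(rows, required=("description",)):
    """Mimic the "if value isn't missing" filter: keep only instances
    where every required field has a value."""
    return [row for row in rows
            if all(row.get(f) not in (None, "") for f in required)]
```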


And just like that, I now have a new dataset with roughly 50,000 rows of data called “Amazon Kindle – all descriptions” (you can clone this dataset here). I then take a quick look at the tag cloud for description, which is always interesting:

word cloud amzn

In the above image we see generic book-oriented terms like “book,” “author,” “story” and the like coming up most frequently – but we also see terms like “American,” “secret,” and “relationship” which may end up influencing ratings.

Building my model:

My typical approach is to build a model and see if there are any interesting patterns or findings.  If there are, I’ll then go back and do a training/test split on my dataset so I can evaluate the strength of said model.  For my model, I tried various iterations of the data (this is where BigML’s subscriptions are really handy!).  I’ll spare you the gory details of my iterations, but for the final model I used the following fields:  price, pages, description, lending, and number of reviews. You can clone the model into your own dashboard here.

What we immediately see in looking at the tree is a big split at the top, based on description, with the key word being “god”.  By hovering over the nodes immediately following the root node, we see that any book whose description contains “god” has a predicted rating of 4.46 stars:

god ratings

while those without “god” in the description have a rating of 4.27:

no god

Going back to the whole tree, I selected the “Frequent Interesting Patterns” option in order to quickly see which patterns in the model are most relevant, and in the picture below we see six branches with confident, frequent predictions:

freq patterns

The highest predicted value is on the far right (zoomed below), where we predict 4.57 stars for a book that contains “god” in the description (but not “novel” or “mystery”), costs more than $3.45 and has lending enabled.


Conversely, the prediction with the lowest rating does not contain “god,” “practical,” “novel” or several other terms, is over 377 pages, cannot be loaned, and costs between $8.42 and $11.53:


Looking through the rest of the tree, you can find other interesting splits on terms like “inspired” and “practical” as well as the number of pages that a book contains.

Okay, so let’s evaluate

Evaluating your model is an important step as data most certainly can lie (or mislead), so it is critical to test your model to see how strong it truly is.  BigML makes this easy: with a single step I can create a training/test split (80/20), which will enable me to build a model with the same parameters from the training set, and then evaluate that against the 20% hold-out set. (You can read more about BigML’s approach to evaluations here).
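Under the hood, a one-click 80/20 split is nothing exotic; a sketch of the idea:

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle once with a fixed seed, then hold out the last 20%
    of instances for evaluation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]
```

The model never sees the held-out rows during training, which is what makes the evaluation honest.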

The results are as follows:


You can see that we have some lift over decisions based on the mean or on random guessing, albeit a moderate one.  Just for kicks, I decided to see how a 100-model Ensemble would perform, and as you’ll see below we have improvement across the board:
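Via the API, growing such an ensemble is a one-field change from a single model; a sketch of the request body (number_of_models is BigML's parameter for the ensemble size, and the dataset id is a placeholder):

```python
def build_ensemble_request(dataset_id, n_models=100):
    # Same dataset as the single model, but BigML bags n_models
    # decision trees and averages their predictions.
    return {"dataset": dataset_id, "number_of_models": n_models}
```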



This was largely a fun exercise, but it demonstrates that machine learning and decision trees can be informative beyond the predictive models that they create.  By simply mousing through the decision tree I was able to uncover a variety of insights on what datapoints lend themselves to positive or less positive Kindle book reviews.  From a practical standpoint, a publisher could build a similar model and factor it into its decision-making before green-lighting a book.

Of course my other takeaway is that if I want to write a highly-rated Kindle title on Amazon, it’s going to have something to do with God and inspiration.

PS – in case you didn’t catch the links in the post above, you can view and clone the original dataset, filtered dataset and model through BigML’s public gallery.

Did Germany Cheat in World Cup 2014?

This is a guest post by Andy Thurai (@AndyThurai). Andy has held multiple technology and business development leadership roles with large enterprise companies including Intel, IBM, BMC, CSC, Netegrity and Nortel for the past 20+ years. He’s a strategic advisor to BigML.

Now that I got your attention about Germany’s unfair advantage in the World Cup, I want to talk about how they used analytics to their advantage to win the World Cup—in a legal way.


I know the first thing that comes to everyone’s mind talking about unfair advantage is either performance-enhancing drugs (baseball & cycling) or SpyCam (football, NFL kind). Being a Patriots fan, it hurts to even write about SpyCam, but there are ways a similar edge can be gained without recording the opposing coaches’ signals or play calling.

It looks like Germany did a similar thing, legally, and had a virtual 12th man on the field all the time. For those who don’t follow football (the soccer kind) closely, it is played with 11 players on the field.

So much has been spoken about Big Data, Analytics and Machine Learning from the technology standpoint. But the World Cup provided us all with an outstanding use case on the application of those technologies.

SAP (a German company) collaborated with the German Football Association to create Match Insights, analytics software dubbed the ultimate solution for football/soccer. The project allows teams not only to analyze their own players, but also to learn about their opponents in order to create a game plan. The goal of this co-innovation between SAP and the coaches of the German national football team was to build an innovative solution that enhanced on-field performance leading up to and throughout the World Cup.

Considering that in 10 minutes, 10 players touching the ball can produce as many as 7 million data points, imagine how many data points are created in one game. When you analyze multiple games played by a specific opponent, tape study and napkin cheat sheets aren’t effective anymore. By effectively harnessing insights from this massive collection of data points, the German team had a leg up heading into the World Cup.

The proof was in the pudding when Germany used this program to thump Brazil 7-1 in the World Cup and then went on to triumph in the finals against Argentina. In their win against Brazil, during a 3-minute stretch in the first half, Germany scored 3 goals while Brazil owned the ball 52% of the time. If you watched the game, Germany’s teamwork was apparent; what was not apparent is that the software provided information on possession time, when to pass, whom to hold the ball against vs. whom to pass against, etc. The Defensive Shadow analysis portion of the software shows teams exactly how to beat the opponent’s defensive setup based on a specific opponent’s alignment and movement.

In a recent article, Sophie Curtis of The Telegraph explains a tactic used by the Germans: reducing their average possession time from 3.4 seconds in the 2010 World Cup to 1.1 seconds in 2014. This not only confused the defenders, but also left them uncertain whom to defend and quickly tired them out from chasing their opponents.

If you watched the match closely, the passes and ball advancement all seemed to be executed with clinical precision. I thought they were just well coached or figured it out during the game, but apparently they had outside help which decided the win – before they ever set foot on the field. While the same advantage can be gained by watching tapes (or videos), the software’s uncanny prediction ability goes far beyond traditional mechanisms by converting all game data and player skills to actionable directives that can be implemented on the field. I don’t think you can make a more powerful statement than thumping a popular, soccer crazy host nation which fielded a decent football team. I doubt even the legendary Pele would have done any better against Big Data analytics.

The Match Insights tool is exclusive to the German team right now, but according to SAP they have plans to sell it more broadly in the future.  In the meantime, it is wise to choose Germany if you are betting with your friends.

Big Data is generally a massive collection of data points. Most companies that I deal with think just by collecting data, securing and storing it for later analysis, they are doing “Big Data.” However, they are missing the key element of creating real-time actionable intelligence so they can make decisions on the fly about their processes. Most companies either don’t realize this, or are not set up to do this. This is where companies like BigML can add a lot of value. BigML is a machine learning company (best of breed in my mind) which helps you do exactly that.


For example, in a recent blog post, Ravi Kankanala of Xtream IT Labs talks about how they were able to predict opening weekend box office revenues for a new movie using BigML’s predictive modeling. The point that caught my attention was that they made 200,000 predictions with 90%+ accuracy. This means their clients can look at these results and decide the best time to release their movie (day of the week, day of the month and the month) rather than treating it as a guessing game. This insight also helps studios segment and concentrate their marketing campaigns to improve box office results.

In a blog post published 3 days before the Super Bowl (on Jan 31, 2014), Andrew Shikiar used BigML to predict a Seattle win with an uncanny 76% accuracy. In the same article he also predicted Seattle covering the spread with 72% accuracy. But if you were watching the pundits on ESPN, they were split 50-50 even an hour before the game, with every one of them yapping about the best offense ever going against a solid Seattle defense, so the results were unpredictable.

While Moneyball and Billy Beane may have introduced the public to this concept of shaping a sport based on statistical analysis, that approach was based on how overall statistical numbers can be applied to a team’s roster composition. But these precision statistics by BigML can help teams adjust every pitch (as in baseball), every throw (as in football), and every pass (as in soccer).

And the goodness doesn’t stop there. The BigML team is busy developing technology that underpins a growing number of predictive initiatives, including:

  1. Churn rate analysis: predicting which customers are thinking of discontinuing subscription services, and the best way to target them with exclusive campaigns, promotions, or incentives to retain them before they switch.
  2. Predicting fraudster behavior amongst apparently normal customers.
  3. Predicting future life events based on changing shopping patterns or suggest different shopping patterns based on life event/style changes.

The key takeaway here is that you can do more than just collect data and store it: with the right strategy and software you can get meaningful insights. If you gain an edge over your competitors, you can win your business World Cup too.

It sure pays to have an edge!
