
BigML 2015 Winter Release Webinar Video is Here

We had a great turnout for our 2015 Winter Release webinar earlier this week. Poul Petersen, BigML’s CIO, demoed many features including G-Means Clustering, Dynamic Scatterplot Visualization, BigML Projects, Google Integration and the Sample Service — all recently deployed in the BigML platform. Thank you to all those who attended!

In case you missed it, you can view the webinar in its entirety below and you can also check it out on the BigML Training page, where it will be posted shortly along with our other video resources.

As always, continue to stay up to date with all things BigML and machine learning on our blog or our Twitter feed. We’ll see you on our next webcast. Meanwhile, keep on predicting!

BigML in Texas: AAAI 2015

This past week BigML was a proud sponsor of the 29th Annual AAAI Conference in Austin, TX. The conference drew over 1,200 attendees from across the AI community: researchers, practitioners, scientists, and engineers. The program was filled with incredible presentations, discussions and professionals passionate about what they do. The main conference hall was buzzing at all hours of the day, and BigML was right in the middle of it, giving live demonstrations of our powerful platform and discussing all facets of machine learning with students, professors, and seasoned AI professionals alike.

Aside from the generous servings of coffee and donuts, we are very fortunate to have interacted with such a talented community! Read more about our adventures below!

ACM SIGAI Career Panel

At one point in our lives we’ve all been there: searching for a job in an often precarious job market, writing and re-writing resumes, submitting applications, exhausting our personal networks and hoping for that phone call back. It can be a daunting and humbling experience, but once you receive that job offer, all the hard work you put in comes full circle. Once on the other side of the desk, so to speak, it’s easy to forget that experience. However, it’s always important to give back and put yourself in a position where you can help those who are sitting exactly where you once were.

BigML had the opportunity to do just that this week. Poul Petersen, CIO of BigML, was selected to be a panelist for the ACM SIGAI Career Panel in Austin. Poul was one of the experts seated alongside Peter Clark from the Allen Institute for AI, Eric Horvitz from Microsoft Research, and Ben Kuperman from Oberlin College; each brought their own unique perspective to the table. The goal of the event was for the panelists to engage with the 40+ senior PhD students and postdoctorates in order to address their questions and concerns about the AI job market, applying for the right position, career planning and positioning for success. All too often workshops such as these are overlooked, and we are proud to have participated in helping the next generation of AI professionals take the appropriate steps toward their dream jobs! Below are a few quick insights regarding the size and importance of AI in the job market:

  • There are more than 26,000 computer and information science researchers in the US
  • Artificial Intelligence is growing by 20% annually
  • Since 2012, more than 170 AI dedicated start-ups have entered the market
  • VC’s pledged $188M to AI companies in Q4 2014
  • Top U.S. salaries are reported in California, Texas, the Northwest, and the Northeast (MA to VA) with an overall median base salary of $105k.

BigML Meet-Up: Austin Data Geeks!

Not only is the AI scene ingrained in the city, but the technology industry itself has also found a home in Austin. In recent years, Austin has been home to four start-ups that were either acquired for, or reached public offerings of, more than $1 billion each. In addition, in Q1 2014 Austin-based start-ups received more than $350 million in VC funding. This city is bursting with opportunity.


So, it’s no wonder that we were thrilled to participate in a MeetUp in such a vibrant and active community. The amount of talent and excitement around technology made it a perfect fit for us. BigML sat down with the Austin Data Geeks at Rackspace to discuss our platform. The meet-up drew 60+ attendees from Austin-area technology companies, start-ups and all-around tech enthusiasts! The session was very interactive, and Poul Petersen, CIO of BigML, gave a presentation showcasing how our intuitive interface can help shorten the time to actionable insights, even for the most hardcore data wranglers out there.

We left Austin with smiles on our faces — and stomachs full of Texas BBQ — after a fantastic week of immersing ourselves in the AI world. We look forward to all that’s coming our way and you should too. BigML has plenty to offer and we are adding new features to our platform every day. Join us on February 11 for our 2015 Winter Release webinar and see what’s new!

If you haven’t yet experienced BigML, we’d like to remind you that you can try it for FREE with tasks under 16MB in development mode. If you would like to add capacity, keep in mind that we always offer plans for as low as $15/month with the 50% special discount for students, public researchers, or NGOs.

We look forward to circling back with all of you, and until then, keep predicting!

BigML 2015 Winter Release and Webinar: Sample Service, G-means, Projects, Labs and More!

BigML is kicking off 2015 with many great new capabilities included in our Winter 2015 Release, which we’ll be sharing with you in a webinar on February 11 at 9:00 Pacific / 17:00 GMT.

If you are located in Australia, New Zealand or elsewhere in the Pacific Rim, we’ll be co-hosting a separate webinar along with our alliance partner GCS Agile.  It will be on February 12 at Noon Australia Eastern Daylight Time / 4AM GMT.  Please register for our ANZ/APAC webinar to reserve your seat.

Some key highlights from the release include: 

Sample Service

BigML’s new Sample Service provides fast access to datasets kept in an in-memory cache, which enables a variety of sampling, filtering and correlation techniques. We have leveraged the Sample Service to create a Dynamic Scatterplot visualization that we’ve released into BigML Labs, and which we’ll showcase on the webinar.
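For API-minded readers, here is a minimal sketch of how a sample might be created and queried through the BigML Python bindings; the dataset ID and query-string parameters are hypothetical, so treat this as an illustration rather than a definitive recipe:

from bigml.api import BigML

api = BigML()  # credentials read from BIGML_USERNAME / BIGML_API_KEY

# Build an in-memory sample backed by an existing dataset (hypothetical ID)
sample = api.create_sample('dataset/54c25fb1f0a5ea5fc0000000')
api.ok(sample)  # wait until the sample is cached and ready

# Query the cached sample, e.g. ask for 10 rows of selected fields
# (field IDs and querystring are assumptions for illustration)
sample = api.get_sample(sample, "rows=10&row_fields=000001,000002")
print(sample['object']['sample']['rows'])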

G-Means Clusters

This latest addition to BigML’s unsupervised learning algorithms is ideal for when you may not know how many clusters you wish to build from your dataset.
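As a rough illustration (resource IDs hypothetical), the difference shows up in the API as which parameter you pass when creating a cluster: k fixes the number of clusters up front, while critical_value lets G-means discover the number of clusters from the data:

from bigml.api import BigML

api = BigML()
dataset = 'dataset/54c25fb1f0a5ea5fc0000000'  # hypothetical ID

# K-means: you must choose the number of clusters yourself
kmeans = api.create_cluster(dataset, {"k": 5})

# G-means: the algorithm picks the number of clusters itself;
# critical_value controls how readily clusters are split
gmeans = api.create_cluster(dataset, {"critical_value": 5})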

BigML Projects

We’re happy to introduce Projects to help you organize your machine learning resources. On the webinar we’ll show you how to create a project from a new data source and how to manage your associated tasks and workflows.

Google Integration

With the Winter Release, you’ll now be able to add sources to BigML through Google Cloud Storage and Google Drive, similar to our prior integrations with Dropbox and Azure Data Marketplace. You can also now log into BigML using your Google ID.

BigML Labs

Our team is constantly working on innovative applications built on top of BigML’s API. We’re now unveiling several of these in early access through BigML Labs. Join us to see two of our latest applications, codenamed BigML GAS and BigML X, in action.

And More

We’ve also made many UI tweaks, API bindings updates, BigMLer enhancements and general improvements that we’ll highlight in the webinar as we show off the Winter Release.

Once again, webinar space is limited, so please register today! 

So you want to predict the Super Bowl again?

As faithful BigML blog readers will recall, last year we used the BigML platform to build a variety of analyses to predict the outcome of Super Bowl XLVIII.  You can read full details here, but a short summary is this:

  • The Denver Broncos were favored to beat the Seattle Seahawks by 2.5 points
  • BigML predicted the Seahawks not just to cover the point spread, but also to win the game outright
  • BigML also predicted that Seattle would win by the exact score of 43-8

Well, the last point isn’t true but the first two are.  We weren’t the only machine learning or analytics vendor aiming to pick the Super Bowl.  SAP took a crack at it (perhaps they should stick to futbol), as did a Microsoft researcher, as did Facebook.  The difference is that our prediction was not only bolder (picking Seattle not just against the spread but to win outright), but also accurate.

But rather than gloating at being 1-0, let’s advance to 2015 and take another crack at the Big Game.

Our data source and approach:
We again used team rankings data from Football Outsiders, which features their DVOA (Defense-adjusted Value Over Average) system.  I like DVOA as it’s a more advanced way of judging a team’s performance. Football Outsiders’ short summary of DVOA is “DVOA measures a team’s efficiency by comparing success on every single play to a league average based on situation and opponent” — so rather than just looking at raw statistics, we have a much more nuanced basis. This is especially critical since we’re working with a narrow range of features in addition to a ridiculously small number of data points.

If you’re interested in learning more about the DVOA approach (and in getting a snapshot of the type of thinking behind advanced sports analytics in general), this summary from Football Outsiders is a great read.

What fields were selected for the ensembles?
We used Football Outsiders’ NFL-wide rankings for Team Efficiency (both weighted, which factors in late-season performance, and non-weighted), Team Offense & Defense ranks (weighted & non-weighted), Special Teams ranks, the point spread, and the over/under total.  We also included historical points scored by both the AFC & NFC teams in past Super Bowls.  You can view and clone the full dataset that we built here, although you’ll have to deselect some of the fields if you want to replicate this specific ensemble.

The same caveats as last year pertain — namely that these predictions are based on a mere 25 rows of data and are only using aggregate season rankings for team performance, whereas professional handicapping systems will incorporate much more nuanced data.  If you want to get a professional’s point of view, you can check out the Prediction Machine or many other pundits who leverage machine learning to try and beat the odds and/or help out your average wagering enthusiast.

Evaluating our models:
As the data is so limited, it’s not surprising that single models for any outcome evaluated poorly — we had to build 100-model ensembles to achieve evaluations that performed better than the mode or random guessing.  Even then, it’s important to note that the hold-out set (using a standard 80/20 train/test split) only has a handful of instances.  I was curious to see how my evaluations stacked up, so I used BigML’s handy new interface tweak that allows you to sort evaluations by their performance (I also leveraged the search feature to filter for “superbowl ensemble”):

[Screenshot: evaluations sorted by performance, filtered for “superbowl ensemble”]
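If you want to replicate this workflow programmatically, here is a hedged sketch using the BigML Python bindings; the dataset ID is hypothetical, and the seed and split choices are just one reasonable way to do an 80/20 evaluation:

from bigml.api import BigML

api = BigML()
dataset = 'dataset/54c8a9b2f0a5ea5fc0000001'  # hypothetical Super Bowl dataset

# Deterministic 80/20 split: same sample_rate and seed, with
# out_of_bag=True selecting the complementary 20% for testing
train = api.create_dataset(dataset, {"sample_rate": 0.8, "seed": "superbowl"})
test = api.create_dataset(dataset, {"sample_rate": 0.8, "seed": "superbowl",
                                    "out_of_bag": True})

# A 100-model ensemble on the training split, evaluated on the hold-out
ensemble = api.create_ensemble(train, {"number_of_models": 100})
evaluation = api.create_evaluation(ensemble, test)
api.ok(evaluation)
print(evaluation['object']['result'])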

Cut to the chase:  who’s going to win?
Noting our prior caveats regarding sample size and mixed evaluations, here are our predictions against the 100-tree ensembles for each bet (all using plurality weighting):

  • NFC (Seattle Seahawks) points: 39.37 ±14.6
  • AFC (New England Patriots) points: 19.41 ±6.83
  • Outright Winner: Seattle Seahawks (73.13% confidence)
  • ATS (Against the Spread) Winner: Seattle Seahawks (66.30% confidence)
  • Underdog/Favorite prediction: Underdog (Seahawks) (61.33% confidence)
  • Over/under 49 prediction: Over (69.51% confidence)

There you have it:  Seattle will beat the odds again and in taking home the championship will become the first repeat Super Bowl champions in a decade.  It also looks to be a high-scoring affair, although the NFC total is undoubtedly skewed by the 43 points Seattle put up last year.

What to do now?
So should you run to your local betting hall or online sportsbook and bet your life savings on the Seahawks?  Of course not — at least not based on this blog post.  As stated above, these Super Bowl models have been built with very limited data, and perhaps have been subconsciously biased by the author’s Seahawks fanship.  Last but not least, there’s the Beli-cheat / Deflate-gate factor — who knows what’s going to come from the team that has pushed all limits in the name of securing victories?

What we’d love to see you do is leverage BigML to build your own predictive model for the Super Bowl.  For example, you could access data on all past games (not just Super Bowls) and use BigML’s clustering algorithm to find the most similar matchups and then make a prediction based on those results, as sketched below.  Or you could simply access more data: factoring in games beyond the Super Bowl, but also more advanced statistics on team trends, match-up trends (e.g., what happens in *any* game when the top-ranked defense faces the 5th-ranked offense), individual player match-ups, etc.  Are you into the “wisdom of crowds”?  Pull social media data and leverage BigML’s text analysis to see what the general public is thinking.
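As a starting point for the clustering idea, a sketch along these lines might work; the dataset and input fields are hypothetical placeholders for whatever historical game data you assemble:

from bigml.api import BigML

api = BigML()

# Cluster all past games by team-strength features (hypothetical dataset ID)
cluster = api.create_cluster('dataset/allpastgames', {"k": 8})
api.ok(cluster)

# Find which cluster of historical games this year's matchup resembles,
# then study how the games in that cluster actually turned out
centroid = api.create_centroid(cluster,
                               {"favorite_defense_rank": 1,
                                "underdog_offense_rank": 5})  # hypothetical fields
print(centroid['object']['centroid_name'])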

Interested?  We’re happy to help you out with your efforts. For starters, we’ll give you a free one-month Pro subscription – simply enter coupon code “XLIX” in the payment form.  Stumped?  You can always send us email at info@bigml.com, or join our weekly “Meet the Team” hangout next Wednesday at 9:30 AM Pacific.

No matter if you’re wagering or simply cheering for one team or the other, we hope you enjoy the Super Bowl.

Everything’s Big(ML) in Texas!

AAAI-15

We’re excited for a great visit to Austin, Texas in conjunction with BigML’s sponsorship of the AAAI-15 conference, which will take place during the week of January 25. The AAAI (Association for the Advancement of Artificial Intelligence) conference takes place every year and brings together the brightest minds and viewpoints on artificial intelligence from research, industry and academia.  This is the first-ever Winter edition of the event, which certainly reflects the global growth and awareness of AI, machine learning and related disciplines.

For starters, Poul Petersen will be sitting on the SIGAI (the ACM Special Interest Group on Artificial Intelligence) career panel on Monday the 26th alongside Oren Etzioni from the Allen Institute for Artificial Intelligence, Eric Horvitz of Microsoft Research, and Ben Kuperman of Oberlin College.

From Tuesday through Thursday BigML will be sponsoring and demoing at the AAAI-15 conference itself, and on Tuesday evening we’re very pleased to be a guest of the Austin Data Geeks MeetUp!  We’ll additionally be meeting with customers and partners in the area — so be sure to catch us at one of the events listed above and/or ping us at info@bigml.com to set up a time to chat.

The Importance of Feature Engineering



When people first see a demo of BigML, there is often a sense that it is magical. The surprise likely stems from the fact that this isn’t how people are accustomed to computers working; rather than acting like a calculator creating a fixed outcome, the process is able to generalize from patterns in the data and make seemingly sentient predictions.

However, it is important to understand that predictive analytics is not magic, and although the algorithm is learning on a very basic level, it can only extract meaning from the data you give it. It does not have the wealth of intuition that a human has, for better or worse, and consequently the success of the algorithm often hinges on how you engineer the input features.

Let’s consider a very simple learning task — please keep in mind that this is a contrived example to make explaining the problem of feature engineering clear, and does not necessarily represent an actual useful end result itself.

Assume you are working on a navigational system, and at some point in the system you would like a way to predict the principal direction of a highway knowing only its assigned number. For example, if a user wants to go north, and there are two nearby highways, Interstate 5 and Interstate 84, which should they take?

Now, you could use a list of known highways, but this would require you to regularly update the list as new highways are built or removed. Instead, if there were a pattern relating principal direction to highway number, this might be a useful thing for your device to know.

So, let’s take a list of primary interstates in the US and let BigML train a model to predict the principal direction: East-West or North-South. The resulting tree looks like this:

Click on the image to interact with the model

In the highlighted node, you can see that the learning algorithm has discovered that if the highway number is greater than 96, then the highway is principally North-South. And indeed, if we look at the dataset, there are only two highways that match this pattern, 97 and 99; both are North-South, so the pattern is relevant.

However, as you navigate around the tree it becomes obvious that each split is simply creating bounds that eventually isolate a single highway or a small group of highways, at which point the prediction is less of a generalization and more of a truism:

Click on the image to interact with the model

In other words, the model doesn’t seem to be generalizing in a meaningful way from the highway number to the principal direction.

Now, if you are familiar with the US highway numbering system, you might know that there is significance in whether the highway number is even or odd. Let’s re-engineer our dataset to include this property and see if the model changes. We can do this by selecting the “Add Fields to Dataset” option:

Add Fields to Dataset

We’ll call the new field “isEven” and define it with the JSON s-expression:

[ "=", 0, [ "mod", [ "field", "Highway Number" ], 2 ]]

Reading from the innermost brackets, we take the field named “Highway Number” and compute its value mod 2. If it equals 0, the expression returns True, meaning the highway number is an even integer, and False if odd:

JSON s-expression to select even/odd

Now we re-build the model including this new feature:

Click on the image to interact with the model in BigML

And now we get a very simple tree which generalizes to the following rules:

IF isEven = false THEN
        direction = North-South
IF isEven != false THEN
        direction = East-West

This is a much more useful generalization! But what is happening here? Why didn’t the machine learning algorithm find this pattern in the first tree?

Remember the first dataset: all we gave the algorithm to learn from was an integer. And the only thing the algorithm knows about integers is that they have a natural order. That’s it. And so, it tried to find a pattern relating the natural order of the integers to the principal highway direction.

As humans, we potentially know a *lot* more about integers: some are squares, some are prime, some are perfect, and some are even. In the second dataset, we added some of this additional information about integers, specifically the even-ness, to the algorithm. By engineering this feature, we gave the algorithm the extra information it needed to find the pattern. In other words:

1) The “Feature Engineering” was adding the even/odd property.
2) The “Machine Learning” was the discovery that the even/odd property determines the principal direction.

The insight here is that a learning algorithm can only discover patterns that are present in the data we provide, whether we put them there intentionally or accidentally.

In this rather contrived example, it might seem circular. That is, we start with an insight that even/odd has meaning, add that property, and then discover that even/odd has meaning. However, it is important to remember that this is a very simple example. When working with real data you may have hundreds or thousands of features and the patterns will be much more nuanced.

In that real world case, the importance of feature engineering is to use domain specific knowledge and human insight to ensure that the data contains relevant indicators for the prediction task. And, in that case, the beauty of machine learning is that it discovers the relevant patterns and filters out the incorrect human insights.


If you would like to run this example in your own account, here is a little Python script which reproduces all the steps in development mode (FREE):

highways.py
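If you prefer to see the moving parts, here is a minimal sketch of what such a script might do with the BigML Python bindings; the file name, the objective field ID, and the exact form of the new_fields expression are assumptions based on the example above:

from bigml.api import BigML

api = BigML(dev_mode=True)  # development mode, i.e. FREE

# Source and dataset from the list of primary interstates (hypothetical CSV)
source = api.create_source('highways.csv')
dataset = api.create_dataset(source)

# Extend the dataset with the engineered isEven feature, passing the
# same s-expression as above as a string
extended = api.create_dataset(dataset, {
    "new_fields": [{
        "name": "isEven",
        "field": '["=", 0, ["mod", ["field", "Highway Number"], 2]]'}]})

# Re-build the model; "000002" stands in for the field ID of the
# principal-direction column (hypothetical)
model = api.create_model(extended, {"objective_field": "000002"})
api.ok(model)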

My favorite bugs


Arguments missing when calling a function,
mispelling keys setting some advanced option,
accessing attributes your object lacks:
These are a few of my favorite bugs.

Non-closing commas on deep JSON structures,
long late-night hacking that ends up in rapture:
Found a solution, that actually sucks.
These are a few of my favorite bugs.

Wrong file permissions on root system folders,
new model’s methods that break on the olders,
out of range values on sliders and bars:
These are a few of my favorite bugs.

When the job fails,
and the app blocks,
or the cloud turns black,
I simply remember my favorite bugs
and then I don’t feel so bad.

(repeat)

When the job fails,
and the app blocks,
or the cloud turns black,
I simply remember it’s wintermute’s fault
and then I don’t care at all!


NB: wintermute is our backend’s codename

How to build a Predictive Lead Scoring App

Predictive Lead Scoring is a crucial task to maximize the efforts of any sales organization.

There are a few applications on the market today, like Fliptop, KXEN, or Infer, that already allow you to score your sales leads. These offerings have validated both the importance of and market appetite for predictive scoring solutions.

However, you may want to have greater flexibility in choosing your CRM system, or perhaps you want to build your own predictive model to do the scoring, or you might want to integrate the scoring process within related services in your organization. In this post, I’m going to show you how to build an application to score your Salesforce.com leads using Talend Open Studio and BigML—three great tools working together to build a flexible predictive solution for a common business problem!

To complement this post, I’ve created a complete step-by-step tutorial that will guide you through the implementation of this use case. No matter if you are a developer, a business analyst or a data scientist, this tutorial is made for you :-). At the end of each section of this post, you will find references to the related parts of the tutorial.

The Example

To illustrate the post, I am going to use a fictitious company named AllYouCanBuy that uses Salesforce and wants to prioritize their sales leads automatically. The objective is to provide AllYouCanBuy’s sales team with an automated solution that provides a panel like the one below where leads can be sorted by priority score. Each lead should be automatically labeled with a score so that the top priority leads (green bars on the picture) represent the leads with higher confidence of becoming customers.

[Screenshot: leads sorted by priority score in Salesforce]

To implement a fully automated solution, we need to accomplish the following  tasks:

  1. Automatically generate scores for sales leads.  I’ll solve this task with Machine Learning. We’ll use historical data on which leads converted into customers to predict future scores. We need an engine or service that allows us to programmatically build predictive models and generate predictions. Obviously, I am going to use BigML but another machine learning package could be used in a similar fashion.
  2. Automatically extract historical data from the CRM and return it with new scores. I’ll solve this task using an ETL tool.  ETL tools are great helpers for integrating disparate sources of data and services. They help Extract data from a multitude of sources including Salesforce, Transform it using a programmable toolkit with many pre-built functions and techniques, and finally Load the results into another system or service. For this example, I am going to use Talend Open Studio, from Talend.

The following picture shows a high-level architecture of the predictive lead scoring solution:

Next I’m going to describe each of the high-level components of the architecture in a little more detail, but first let me tell you a bit more about Predictive Lead Scoring.

Predictive Lead Scoring

Predictive Lead Scoring is one of the most active fields today within the set of problems that can be solved using Machine Learning algorithms, and therefore with the use of BigML.

This technique seeks to improve the results obtained by sales teams during the qualification stages of their leads, helping them predict which leads have a higher probability of success. This improves sales results by focusing efforts first on the most important leads rather than on those with lower chances of success, thereby helping teams organize their time and work. When one considers the cost associated with lead qualification, it is very logical to spend time looking for ways to optimize the process.

Imagine a company that buys a database of 5,000 new leads. Without a tool that allows them to set priorities, they would be calling from first to last on the list without any idea of what to expect.  Ad hoc intuition would be the guiding light for lead prioritization. What a waste of time, right?

However, with solutions like we are detailing here, you can solve this kind of business challenge quite easily—helping your sales team be more effective through the power of machine learning (as opposed to gut instinct).

Salesforce.com

salesforce

AllYouCanBuy (our fictitious company) fortunately has been operating for some time and has been storing data about prior leads using  Salesforce. Their data includes the following types of custom fields:

  • Input fields: fields that hold information about each lead (city, sector, number of interactions, etc.). These are the fields that we will use as input data for making predictions. This set of features can be as rich as you want and can include both specific data about the lead and data about your interactions with the lead.
  • Fact field: this is the field that the sales team has been using to label whether a lead became a new customer or not. This is the label (in supervised learning parlance) that will be used to train and evaluate the predictive models. In a more sophisticated application, this field could be computed automatically.
  • Output fields: these are the fields that we will use to store the output of the model, the confidence of the prediction returned by the model, and a priority field (the lead score). We compute the priority field based on the output of the model and its confidence, as sketched just after this list. For example, leads with a ‘true’ prediction and a high confidence will become our top priority, and leads with a ‘false’ prediction and high confidence will become our low-priority leads. Priority fields allow for a more user-friendly representation of the predictions.
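To make the priority computation concrete, here is a tiny, hypothetical sketch of the kind of mapping described above; the 0-100 scale and the example values are assumptions, not part of the tutorial:

def lead_priority(predicted_conversion, confidence):
    """Map a true/false conversion prediction and its confidence to a
    0-100 priority score: a confident 'true' scores near 100 (top
    priority), a confident 'false' near 0 (low priority)."""
    if predicted_conversion:
        return round(50 + 50 * confidence)
    return round(50 - 50 * confidence)

print(lead_priority(True, 0.86))   # 93 -> call this lead first
print(lead_priority(False, 0.90))  # 5  -> safely deprioritized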

In the tutorial, you can read more on how to create a Salesforce.com developer account, how to create custom fields on the Lead object, and how to customize Lead objects in Salesforce.

BigML

BigML not only provides a great set of 1-click functions and visualizations for predictive modeling, but also a REST API to programmatically run sophisticated predictive modeling workflows.

To simplify the first version of our predictive lead scoring app, I am going to create the model directly in BigML and use it to make predictions. In a second iteration, I’ll automate the model creation too. So basically, we only need to export Salesforce data to a CSV file, upload the file to BigML and let it do the data modeling. I will use BigML’s 1-Click Ensemble to create a very robust predictive model ready to make predictions in a matter of seconds.
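For those who prefer the API route over the web interface, a hedged equivalent with the BigML Python bindings might look like this; the CSV name and the lead fields are hypothetical:

from bigml.api import BigML

api = BigML()

# The CSV exported from Salesforce (hypothetical file name)
source = api.create_source('leads_export.csv')
dataset = api.create_dataset(source)

# Rough programmatic counterpart of the 1-Click Ensemble
ensemble = api.create_ensemble(dataset)
api.ok(ensemble)

# Score a new lead (hypothetical input fields)
prediction = api.create_prediction(ensemble,
                                   {"City": "Austin", "Sector": "Retail"})
print(prediction['object']['output'], prediction['object']['confidence'])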


In the tutorial you will find more details on how to create an account in BigML and how to create a predictive model in BigML.

Talend Open Studio

Talend Open Studio provides an extensible, high-performance, open source set of tools to access, transform and integrate data from any business system, in real time or batch, to meet both operational and analytical data integration needs. It has more than 800 connectors and can help you integrate almost any data source. The broad range of use cases addressed includes massive-scale integration (big data/NoSQL), ETL for business intelligence and data warehousing, data synchronization, data migration, data sharing, data services, and now predictions!!!

We will use Talend Open Studio not only to perform the required data transformations but also to communicate with Salesforce and BigML. In our case with AllYouCanBuy, the transformations are only simple metadata mappings between the respective output and input data of the two services, since all the information about our leads comes from the same place: Salesforce. However, the transformations can be more sophisticated for more complex applications; in the real world, you may want to pull in information from other internal and external sources that can help create richer predictive models.

Talend allows you to use a high-level visual component to design complex ETL processes without writing a single line of code!  You can see what the Talend ETL process looks like below:

[Screenshot: the Talend job design]

BigML has developed a Talend Component named tBigMLPredict that you can download here and incorporate in your own installation of Talend. This component will help you make predictions with a predictive model you have previously created in BigML.

Once installed, this component will be visible in the Palette of components, inside the Business Intelligence > BigML Components category.


The tBigMLPredict component allows us to set the following configuration parameters:

Configuration of the tBigMLPredict component

In the tutorial, you can read more on how to download and install Talend Open Studio,  how to download and configure the BigML Talend Components, how to design the integration job in Talend, and how to execute it in Talend Open Studio.

Summary

We have outlined in this blog (and the tutorial + related documents!) how you can orchestrate a flow that automatically scores sales leads in Salesforce using Talend and BigML.  It shouldn’t be difficult to create similar flows for other CRM services or using other ETL platforms–please let us know which ETL and CRM tools we should work on next!

I hope this post has inspired you to start building your own predictive lead scoring application, or even more sophisticated predictive flows!

Introducing: Magic Data Goggles!

Bad things happen, but thankfully they tend to happen rarely. For example, you’d expect a small fraction of network traffic to be hackers, and a minority of patients to have a serious disease. (I was going to add that we expect a small percentage of credit card transactions to be fraud, but these days that feels a bit optimistic.) We obviously want to identify and avert these rare bad events, and anomaly detection—which BigML just launched last week—is a powerful way to achieve this.

In the disease category, there’s a well-known dataset of breast cancer biopsies from University of Wisconsin Hospitals, including measurements from each biopsy and the result of “benign” or “malignant”. Of course, you can use BigML to train a highly accurate predictive model on this labeled data, but that’s almost too easy. So here’s a challenge: what if we remove the labels of “benign” and “malignant”? Can we still find useful patterns in the data?

BigML makes it simple to create a dataset with only the measurements and not the “benign” or “malignant” labels. We then train an anomaly detector on this unlabeled data, and BigML displays the 10 biopsies with the highest “outlier-ness”:

[Screenshot: the 10 biopsies with the highest anomaly scores]

That’s interesting, but I want more insight into what makes a biopsy anomalous. To do this, I create anomaly scores for the entire dataset, give each biopsy a label of “high” or “low” (with “high” defined as the top third of anomaly scores), then train a model to predict this new label. (I’m working on a video for David’s Corner that shows how this all takes just a few mouse clicks—which is exactly what we expect from BigML!)
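For the curious, here is a hedged sketch of that workflow via the BigML Python bindings; the dataset ID, the score threshold, and the output field name are assumptions for illustration:

from bigml.api import BigML

api = BigML()
biopsies = 'dataset/542bb2c0f0a5ea5fc0000002'  # hypothetical unlabeled dataset

# Train an anomaly detector on the unlabeled biopsies
anomaly = api.create_anomaly(biopsies)
api.ok(anomaly)

# Score every biopsy, writing the scores out as a new dataset
batch = api.create_batch_anomaly_score(anomaly, biopsies,
                                       {"all_fields": True,
                                        "output_dataset": True})
api.ok(batch)
scored = batch['object']['output_dataset_resource']

# Label high-scoring rows "high" (0.6 is a hypothetical cutoff for the
# top third), then train a model to explain the label
labeled = api.create_dataset(scored, {
    "new_fields": [{
        "name": "anomaly_label",
        "field": '["if", [">", ["field", "score"], 0.6], "high", "low"]'}]})
model = api.create_model(labeled)  # the new label, as the last field,
                                   # becomes the default objective (assumption)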

This new model finds a striking pattern: most high-anomaly biopsies have “uniformity of cell size” greater than 2. Of the 231 high-anomaly biopsies in the entire dataset, a whopping 207 (almost 90%) are covered by this single rule. A higher “uniformity of cell size” means (unintuitively) that the size is less uniform, which is a feature of cancer cells, so experts would conclude that this pattern is worth investigating further.

And they would be right. Because if we let BigML use the labels of benign or malignant, it tells us that biopsies with “uniformity of cell size” higher than 2 are almost always malignant. Think about that: the anomaly detector, having no idea which examples are actually malignant, still managed to figure out that this cell size attribute is important, and specifically that it’s important when it’s greater than 2.

Another way to see the power of anomaly detection is to predict the outcome of a biopsy using only the high/low anomaly attribute. This correctly predicts the result 89% of the time, and detects 83% of malignant biopsies. Again, not bad considering the anomaly detector has no idea which examples are actually benign or malignant!


Finally, we can simply compare the histogram of anomaly scores for malignant and benign biopsies. This clearly shows how well the anomaly score lines up with the biopsy results!

[Histogram: anomaly scores for malignant vs. benign biopsies]

Hopefully I’ve conveyed how insanely useful anomaly detection can be for finding patterns in unlabeled data, especially if you expect the data to contain a highly interesting (and often unwelcome) smaller class. This is particularly useful for large datasets where it is not feasible to label all the “bad” examples: millions of credit card transactions, for example, or billions of network events.

Moreover, you expect your adversary—fraudsters, hackers or even cancer—to change tactics over time. Because anomaly detection doesn’t require you to know exactly what you’re looking for, it can pick up on new types of attacks and warn you that something weird is going on.

Anomaly detection is like magic goggles for your data, helping you find patterns in a completely automated and unsupervised way. Of course, it’s not really magic: we’re just reaping the benefits of assuming, correctly, that there is a minority class to be found. And as long as we have adversaries, that will continue to be a good assumption.


For serious data enthusiasts, here are the ingredients for this analysis:

BigML Down Under—Introducing BigML.com.au

Machine Learning for everyone also means Machine Learning everywhere. This month we get a little closer: we are crossing the Pacific and launching BigML in Australia and New Zealand. From today on, BigML users in Australia and New Zealand can enjoy BigML at https://bigml.com.au. This site will have identical functionality to https://bigml.com, only it will run directly on local cloud-based infrastructure. In addition, we’re very excited to detail a unique alliance that BigML has launched with a leading data intelligence company in the region, GCS Agile.

BigML

While BigML makes it easy for you to build predictive models and perform a variety of machine learning tasks, there are many other related activities that are required for enterprise-grade deployment of machine learning solutions. For example, data transformations, feature engineering, finding the best modeling and prediction strategies, and measuring the impact are key to maximizing the power of machine-learned models. GCS Agile’s data intelligence team is uniquely qualified to support this type of holistic approach to machine learning, and BigML is pleased to announce that our companies have entered into a strategic alliance.

GCS Agile

GCS Agile is composed of seasoned leaders with extensive experience providing data-driven solutions to leading companies in sectors as diverse as telecommunications, finance, and government.  BigML will rest at the heart of GCS Agile’s data intelligence practice.  The GCS Agile and BigML teams will work together to bring BigML private deployments to mid-sized and big companies in Australia and New Zealand.

In addition, the GCS Agile team has been busy organizing a series of public events where you will be able to meet face to face with GCS Agile and BigML leadership team members. If you want to know more, please contact the GCS Agile team at info@gcsagile.com.au.

Below is a partial list of public events:

  • Cloud-based Machine Learning [open to everyone]
    Swinburne University of Technology, Melbourne
    Tuesday, 14 October 2014, 10am – 4pm

  • Why does data need to be considered a strategic asset by organisations in 2015? [by invite only]
    RACV, 501 Bourke Street, Melbourne
    Wednesday, 15 October 2014, 5:30pm – 8:00pm

  • Data Science Melbourne MeetUp
    Inspire 9, Level 1, 41 Stewart Street, Richmond
    Thursday, 16 October 2014

Stay tuned for further announcements from GCS Agile and BigML. In the interim, if you are in Australia or New Zealand don’t wait to give the new site a spin…

