Skip to content

Introducing: Magic Data Goggles!

by on October 3, 2014

Bad things happen, but thankfully they tend to happen rarely. For example, you’d expect a small fraction of network traffic to be hackers, and a minority of patients to have a serious disease. (I was going to add that we expect a small percentage of credit card transactions to be fraud, but these days that feels a bit optimistic.) We obviously want to identify and avert these rare bad events, and anomaly detection—which BigML just launched last week—is a powerful way to achieve this.

In the disease category, there’s a well-known dataset of breast cancer biopsies from University of Wisconsin Hospitals, including measurements from each biopsy and the result of “benign” or “malignant”. Of course, you can use BigML to train a highly accurate predictive model on this labeled data, but that’s almost too easy. So here’s a challenge: what if we remove the labels of “benign” and “malignant”? Can we still find useful patterns in the data?

BigML makes it simple to create a dataset with only the measurements and not the “benign” or “malignant” labels. We then train an anomaly detector on this unlabeled data, and BigML displays the 10 biopsies with the highest “outlier-ness”:

ScreenShot006

That’s interesting, but I want more insight into what makes a biopsy anomalous. To do this, I create anomaly scores for the entire dataset, give each biopsy a label of “high” or “low” (with “high” defined as the top third of anomaly scores), then train a model to predict this new label. (I’m working on a video for David’s Corner that shows how this all takes just a few mouse clicks—which is exactly what we expect from BigML!)

This new model finds a striking pattern: most high-anomaly biopsies have “uniformity of cell size” greater than 2. Of the 231 high-anomaly biopsies in the entire dataset, a whopping 207 (almost 90%) are covered by this single rule. A higher “uniformity of cell size” means (unintuitively) that the size is less uniform, which is a feature of cancer cells, so experts would conclude that this pattern is worth investigating further.

And they would be right. Because if we let BigML use the labels of benign or malignant, it tells us that biopsies with “uniformity of cell size” higher than 2 are almost always malignant. Think about that: the anomaly detector, having no idea which examples are actually malignant, still managed to figure out that this cell size attribute is important, and specifically that it’s important when it’s greater than 2.

Another way to see the power of anomaly detection is to predict the outcome of a biopsy using only the high/low anomaly attribute. This correctly predicts the result 89% of the time, and detects 83% of malignant biopsies. Again, not bad considering the anomaly detector has no idea which examples are actually benign or malignant!

Screen Shot 2014-10-02 at 10.10.28 PM

Finally, we can simply compare the histogram of anomaly scores for malignant and benign biopsies. This clearly shows how well the anomaly score lines up with the biopsy results!

ano-hist

Hopefully I’ve conveyed how insanely useful anomaly detection can be for finding patterns in unlabeled data, especially if you expect the data to contain a highly interesting (and often unwelcome) smaller class. This is particularly useful for large datasets where it is not feasible to label all the “bad” examples: millions of credit card transactions, for example, or billions of network events.

Moreover, you expect your adversary—fraudsters, hackers or even cancer—to change tactics over time. Because anomaly detection doesn’t require you to know exactly what you’re looking for, it can pick up on new types of attacks and warn you that something weird is going on.

Anomaly detection is like magic goggles for your data, helping you find patterns in a completely automated and unsupervised way. Of course, it’s not really magic: we’re just reaping the benefits of assuming, correctly, that there is a minority class to be found. And as long as we have adversaries, that will continue to be a good assumption.


For serious data enthusiasts, here are the ingredients for this analysis:

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: