Anomaly Detection in a Single Command Line
In our prior post we talked about clustering and how you can group your data together into segments by typing a single sentence using BigMLer — BigML‘s command line API. The same can be done for Anomaly Detection, so after reading this post you should be able to find the outliers in your data on a single command line.
Anomaly Detection is a technique used to identify the instances in your data that do not conform to the general pattern. Depending on the nature of your problem, pinpointing these instances can really make all the difference: they may be the errors in your dataset, which you would like to exclude before building models. They may represent fraudulent transactions in a credit card database, defective products in a manufacturing context, etc. In these cases, separating the wheat from the chaff is something paramount for your business or research.
In BigML, we added the anomaly detector to our machine learning kit last year. It is an unsupervised tool, so you don’t need to label your data as normal or abnormal. You just upload it to BigML for the anomaly detector to figure that out for you. For a good example, just check David Gerster’s post about anomaly detection and breast cancer biopsies. Now let’s see how easy it is to build an anomaly detector using BigMLer.
Finding anomalies in your data
BigMLer can build an anomaly detector from any CSV data file just like this:
With this simple command, BigMLer will upload your data and build a source from it, create a dataset summarizing all the statistical information per field, and finally build an anomaly detector. The console will show links to these resources as they are created. Their IDs appear in these links and they are also stored for later use in files under the output directory of the command. You can also load your data from remote repositories:
or even stream it from your standard input
How do you then extract the anomalous instances in your dataset from the anomaly detector that you have just created? And how can you know how anomalous they are?
The information about the anomalous instances in your dataset is stored in the JSON that describes the anomaly detector itself, where a list of the top anomalies is enclosed. This is the information that our web interface displays, namely: the list of the instances with the top anomaly scores. Each instance in the dataset is assigned an anomaly score that ranges from
0 (least anomalous) to
1 (most anomalous).
What do these scores mean? The anomaly detector is built using an iforest, that is, a bunch of overfitted decision trees grown from samples of your data. The anomaly score is obtained from this iforest by comparing the medium depth of these trees with the real depth of the node where the instance under test is classified. The rational behind this procedure is pretty simple: the easier it is to single out an instance the more anomalous it is, while the average looking instances that follow the general pattern are hard to tell from each other. Thus, the higher the score, the more an instance is dissimilar from the general pattern.
The anomaly detectors will by default use an iforest of 128 trees and show the top ten anomalies, but these figures can also be changed using the
--forest-size options. You can build an anomaly detector from an existing dataset in BigML using its identifier:
In this example, an iforest of 50 trees showing the top twenty anomalies will be created. The
--anomaly-seed option is added to ensure that the sample’s random pick are deterministic. Now that we have created and tailored our anomaly detector, how will we use it to improve our datasets and models?
Extracting outliers and anomaly scores
Well, the first use of an anomaly detector is extracting a new dataset that contains only the top anomalous instances:
--anomaly option refers to the existing anomaly detector ID, and the
--anomalies-dataset is set to
in to select only the top anomalies. The opposite case, excluding the top anomalies from the dataset used to create the anomaly detector, is also possible by using
--anomalies-dataset out. This can be very useful to get rid of outliers in your dataset, as models built upon cleansed data will most likely perform better.
Sometimes you may prefer to see the score assigned to every instance in your dataset when deciding where the threshold for outliers should be. The best option then is to create a batch anomaly score for the anomaly detector training dataset that can later be downloaded as a CSV file, as in:
or stored as a new dataset with an additional column that contains the anomaly score for each instance:
The anomaly detector can be used to score datasets other than its own training dataset. To compute anomaly scores locally on any test data file:
If you use this command, the anomaly detector code will be downloaded to you computer, and each instance in your test file will be scored locally. The resulting scores will be stored in the my_dir/anomaly_scores.csv CSV file. Similarly, if you would like to score an existing dataset remotely, you can use the
--test-dataset option to set the dataset ID:
As you can see, removing outliers, detecting fraud and improving the quality of your data is just a BigMLer command away. Now it’s your turn to give it a try and join in the fun: we hope you get to bring the power of BigML to your command line before too long!