Using Anomaly Detectors to Assess Covariate Shift

BigML first discussed some time ago how the performance of a predictive model can suffer when the model is applied to new data generated from a different distribution than the data used to train the model. Machine Learning practitioners commonly identify two types of data variation that can cause problems for predictive models. The first, Covariate Shift, refers to differences between the distributions of the data fields used as predictors in the training and production datasets. The other type of variation, Dataset Shift, denotes changes in the joint distribution of the predictors and the predicted data fields between the training and production datasets. A recent blog post showed how to implement one technique for detecting both types of data shift using WhizzML.

The introduction of Anomaly Detectors in BigML provides yet another means of detecting, with WhizzML, data shifts that can affect the performance of predictive models. As background for the simple technique we describe next, you can read more about anomaly detection in a previous BigML blog post. In a nutshell, an anomaly detector is an isolation forest (iforest) of over-fitted decision trees. Anomalous data items are outliers relative to the rest of the dataset and are therefore isolated at shallower depths in the decision trees. The depth at which an item is isolated, compared with the average depth across the decision trees, is converted to an anomaly score that ranges from 0 (least anomalous) to 1 (most anomalous).
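
For intuition, the standard isolation forest score of Liu et al. converts the expected isolation depth into that 0 to 1 range as shown below; we include it only for reference, since the exact scoring used by BigML is an implementation detail:

s(x, n) = 2^{-E[h(x)] / c(n)}

Here h(x) is the depth at which item x is isolated in a single tree, E[h(x)] is the average of that depth over all the trees, and c(n) is a normalizing constant equal to the average path length expected for a dataset of n items. Items isolated at unusually shallow depths receive scores close to 1.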

BigML provides two anomaly detector functions in WhizzML useful for building a data shift detector:

  • (create-and-wait-anomaly …): We can use this function to build an anomaly detector object from the same training dataset we use to build a predictive model.
  • (create-and-wait-batchanomalyscore …): Once we have built an anomaly detector, we can use this function to apply that anomaly detector to a production dataset.

There are some features of the (create-and-wait-batchanomalyscore …) function that are useful for our purpose.   When this function is applied to the input production dataset, it creates a Batch Anomaly Score object and an output Dataset object that includes every row of the input production dataset object with an added score field containing the anomaly score for that row.  The batch anomaly score function also adds summary metadata to the output dataset metadata that we can use to compute the desired dataset shift measure.

The BigML WhizzML script gallery includes an example Anomaly Shift Estimate script that demonstrates how to use the anomaly detector functions to create a dataset shift measure. In the rest of this post, we describe the component functions in this demonstration WhizzML script. The script can be used as-is, or you can use the component functions as starting points for custom WhizzML scripts.

A Few Helper Functions to Get Started

To begin, we recall that predictive models are typically built by learning a model from a training subset of the source data.  The data shift detection script starts with a simple WhizzML function (sample-dataset …) that allows us to select a subset of the training dataset:

(define (sample-dataset dst-id rate oob seed)
  (create-and-wait-dataset {"sample_rate" rate
                            "origin_dataset" dst-id
                            "out_of_bag" oob
                            "seed" seed}))

This minimal helper function primarily illustrates the few parameters one would likely want to use to select a subset of an input dataset. In a WhizzML script customized for your application, you may want to use other parameters of the (create-and-wait-dataset …) function, which is described in the BigML Dataset documentation.
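
To illustrate how this helper behaves, the following hypothetical snippet (the dataset ID and seed string are placeholders) draws an 80% sample of a dataset and its complementary 20% out-of-bag sample; because both calls share the same seed, the sampling is deterministic and the two samples do not overlap:

(let (src-id "dataset/<your-dataset-id>"   ;; placeholder for a real dataset ID
      ;; 80% of the rows, sampled deterministically with the given seed
      in-bag-id (sample-dataset src-id 0.8 false "my-seed")
      ;; the complementary 20% of the rows left "out of the bag" by the same seed
      out-of-bag-id (sample-dataset src-id 0.8 true "my-seed"))
  [in-bag-id out-of-bag-id])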

The script also includes a minimal WhizzML helper function (anomaly-evaluation …) to apply an anomaly detector to every row of the production dataset:

(define (anomaly-evaluation anomaly-id dst-id)
  (create-and-wait-batchanomalyscore {"anomaly" anomaly-id
                                      "dataset" dst-id
                                      "all_fields" true
                                      "output_dataset" true }))

Again, we use just those parameters of the (create-and-wait-batchanomalyscore …) function needed to apply the “anomaly” detector identified by anomaly-id to the input “dataset” identified by dst-id. Setting both the “output_dataset” and “all_fields” parameters to true requests the creation of an output dataset that includes all fields of the input dst-id plus an anomaly score for each row. The Batch Anomaly Score documentation describes the full set of parameters you might find useful in your own WhizzML scripts.

The script includes one last WhizzML helper function (avg-anomaly …) that uses the summary metadata the (create-and-wait-batchanomalyscore …) function adds to the output dataset it creates. From that metadata it computes an average anomaly score, a measure of how anomalous the input dataset is relative to the training data used to build the anomaly detector:

(define (avg-anomaly evdst-id)
  (let (evdst (fetch evdst-id)
        score-field (get-in evdst ["objective_field" "id"])
        sum (get-in evdst ["fields" score-field "summary" "sum"])
        population (get-in evdst ["fields" score-field "summary" "population"]))
    (/ sum population)))

There are a few details worth noting here. We must first fetch the output dataset evdst identified by evdst-id. The metadata associated with evdst includes an “objective_field” map whose “id” entry identifies the score-field holding the anomaly results we need. Using that score-field value, we can access the “summary” sub-map inside the “fields” sub-map of the metadata, where we find the sum of the anomaly scores over all rows in the dataset and the population count of rows in the dataset. We return the quotient of these two quantities, the average anomaly score for the entire dataset, as our measure of data shift. The Dataset Properties section of the Dataset documentation provides more information about the properties we describe here, as well as other properties you might find useful in a custom WhizzML script.
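
For orientation, the slice of the fetched output dataset map that (avg-anomaly …) reads looks roughly like the sketch below; the field ID “000005” and the numeric values are made up for this illustration:

;; Illustrative fragment of the map returned by (fetch evdst-id);
;; the field ID "000005" and the numbers are hypothetical.
{"objective_field" {"id" "000005"}
 "fields" {"000005" {"summary" {"sum" 31.7
                                "population" 100}}}}

With these values, (avg-anomaly …) would return 31.7 / 100 = 0.317.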

Using Our Helper Functions to Do Anomaly Scoring

We now can combine these minimal helper functions into a single function that computes an anomaly score for an entire production dataset relative to a training dataset.

(define (anomaly-measure train-dst train-exc prod-dst prod-exc seed clean)
  (let (traino-dst (sample-dataset train-dst 0.8 false seed)
        prodo-dst (sample-dataset prod-dst 0.8 true seed)
        anomaly (create-and-wait-anomaly {"dataset" traino-dst
                                          "excluded_fields" train-exc})
        ev-id (anomaly-evaluation anomaly prodo-dst)
        evdst-id (get-in (fetch ev-id) ["output_dataset_resource"])
        score (avg-anomaly (wait evdst-id)))
      (if clean
        (prog (delete evdst-id)
              (delete ev-id)
              (delete anomaly)
              (delete prodo-dst)
              (delete traino-dst)))
      score))

In summary, this (anomaly-measure …) function:

  1. Creates samples of both datasets (traino-dst,  prodo-dst)
  2. Creates an anomaly detector from the training sample (anomaly)
  3. Applies the anomaly detector to the production sample to create a batch score (ev-id)
  4. Computes the average anomaly score for the entire production sample (score)

This function also includes several details you might handle differently in your own WhizzML scripts. The “train-exc” parameter is a WhizzML list of fields in the training dataset that should be ignored when the anomaly detector is created. The “prod-exc” input parameter is ignored here, since the contents of the “train-exc” input parameter determine which fields the anomaly detector ignores in the production dataset.

In addition to these input parameters, there are some internal details of the function that should be noted. The (anomaly-evaluation …) function returns the ID of a Batch Anomaly Score object, identified here by ev-id; the metadata map for this object includes a property “output_dataset_resource” that contains the BigML ID evdst-id of the output dataset created by the batch anomaly score function. Note that the BigML backend creates the batch anomaly score object before the output dataset object is complete. We must use the (wait …) function or an equivalent operation to ensure the dataset referenced by “output_dataset_resource” is available before we attempt to access the anomaly score information we need in the output dataset metadata.

Finally, our (anomaly-measure …) function includes some housekeeping features to support the higher-level functions in our WhizzML script. You might find similar features useful in your own scripts. The “seed” input string parameter passed to the (sample-dataset …) functions causes deterministic, and therefore repeatable, sampling. Specifying the “clean” input as true causes the function to delete the intermediate working objects it creates before returning the average anomaly score. This can be helpful when one repeatedly computes the average anomaly score on a sequence of pairs of subsets of the training and production datasets.

As just suggested, in practice we would likely want to repeatedly sample the training and production datasets and compute a sequence of average anomaly scores from that sequence of samples. The (anomaly-loop …) function in the script does exactly this, in a form that illustrates how you could easily add other computations or logging in your own custom WhizzML scripts:

(define (anomaly-loop train-dst train-exc prod-dst prod-exc seed niter clean logf)
  (loop (iter 1
         scores-list [])
    (if logf
      (log-info "Iteration " iter))
    (let (score (anomaly-measure train-dst train-exc prod-dst prod-exc (str seed " " iter) clean)
          scores-list (append scores-list score))
      (if logf
        (log-info "Iteration " iter scores-list))
      (if (< iter niter)
        (recur (+ iter 1)
                scores-list)
        scores-list))))

This function just calls the (anomaly-measure …) function “niter” times and returns the resulting sequence of average anomaly scores.  Note that the input parameters include the “clean” boolean parameter specifying whether the intermediate objects created by each use of the (anomaly-measure …) function should be preserved or deleted.  Finally, this function illustrates how we can use logging features on the BigML platform to log results from the sequence of (anomaly-measure …) calls under control of the “logf” boolean input parameter.

Next in our script we call the (anomaly-loop …) function inside a wrapper function:

(define (anomaly-measures train-dst train-exc prod-dst prod-exc seed niter clean logf)
  (let (values (anomaly-loop train-dst train-exc prod-dst prod-exc seed niter clean logf))
    values))

Although this function could be eliminated in our script, you might find a similar function useful in your own custom WhizzML script as the place for adding additional computations on the sequence of average anomaly scores returned by the (anomaly-loop …) function.
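
For example, a variant of this wrapper (our own illustration, not part of the gallery script) might also log the range of the per-sample scores before returning them; a minimal sketch, assuming the (anomaly-loop …) function defined above:

;; Hypothetical variant of (anomaly-measures ...) that also logs the range
;; of the per-sample average anomaly scores before returning the list.
(define (anomaly-measures-with-range train-dst train-exc prod-dst prod-exc seed niter clean logf)
  (let (values (anomaly-loop train-dst train-exc prod-dst prod-exc seed niter clean logf)
        low (reduce (lambda (a b) (if (< a b) a b)) (head values) values)
        high (reduce (lambda (a b) (if (> a b) a b)) (head values) values))
    (if logf
      (log-info "Anomaly score range: " low " to " high))
    values))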

The final high-level function in our script computes our final single numeric measure of data shift.  In our script, this is simply the average of the sequence of average anomaly scores returned by the (anomaly-measures …) function:

(define (anomaly-estimate train-dst train-exc prod-dst prod-exc seed niter clean logf)
  (let (values (anomaly-measures train-dst train-exc prod-dst prod-exc seed niter clean logf)
        sum (reduce + 0 values)
        cnt (count values))
    (/ sum cnt)))

In a custom WhizzML script, one could combine the (anomaly-estimate …) function and the (anomaly-measures …) function by simply replacing the call to (anomaly-measures …) with a call to (anomaly-loop …). If one doesn’t need to access the list of scores, one could also pull the contents of the (anomaly-loop …) function into this function. On the other hand, you might need to use the list of scores from (anomaly-measures …) directly in your own WhizzML scripts, rather than just computing the average of the average anomaly scores in that list.
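
As a sketch of the first suggestion, a combined version (again, our own illustration rather than the gallery script) could look like this:

;; Hypothetical combined version that calls (anomaly-loop ...) directly
;; and averages the resulting scores.
(define (anomaly-estimate train-dst train-exc prod-dst prod-exc seed niter clean logf)
  (let (values (anomaly-loop train-dst train-exc prod-dst prod-exc seed niter clean logf))
    (/ (reduce + 0 values) (count values))))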

Finally, the example in the WhizzML script gallery concludes with the definition required to use the script in the BigML Dashboard:

(define result (anomaly-estimate train-dst train-exc prod-dst prod-exc seed niter clean logf))

This definition also demonstrates how you would call the top-level (anomaly-estimate …) function directly in your own WhizzML scripts. Thanks to WhizzML’s composability, using Anomaly Detectors to detect covariate shift is more convenient than ever. We hope you get a chance to give it a spin and let us know how it goes!
