A Simple Machine Learning Method to Detect Covariate Shift

2014-01-07T11:46:33+00:00

I like the way you explain.

It’s a nice method, however I have the following remarks:

For every batch of production data you need to build a separate model, and you need to keep you training data available for doing so. That is/can be burdensome. Also, it doesn’t work on streaming data.

So, I’m more concerned with new cases that lie outside the (multivariate) domain of the training set than inside. So, scoring a dataset with only men while the model is trained on both men and women shouldn’t be a problem. The reverse is an issue: if I build a model on people between 20-30 and I start scoring it on people of all ages, I’m likely in trouble. So, this is what I want to detect, and I’d do so with a one-class SVM or the anomaly detection mode of a RF. In those cases you can store the model for multiple production batches and in the streaming case you can take a running mean on the outlier-ness of cases.

Reply

2014-01-07T19:54:09+00:00

Olav,

Thanks for your comment.

That’s right. You need to build a new model/ensemble “each time”. It’s not that complex if you do everything on the cloud 🙂 Check this gist out: https://gist.github.com/aficionado/8304590.

Keeping all the data around is something that you need to do anyways. To implement the method you don’t need to use all the data but just build representative samples of your historical and new data. With streaming data, it becomes again a matter of choosing a right sample size and the right window size. In the example above, a day.

Anyway, I think that method can be applied in general to multitude of domains as a quick indication. What you propose makes a lot of sense to implement a specific anomaly detector. Actually, we’re working on a version of Isolation Forests http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf that will help implement anomaly detectors as you suggest.

Thanks again for your comment.

Reply

Pingback: Using Anomaly Detectors to Assess Covariate Shift | The Official Blog of BigML.com

2017-05-25T13:45:00+00:00

Hello,

I really enjoyed your tutorial, so much so that I’m using your method for some experiments.

I verified at the end of the article that the author indicated that this method is not recent, already being used previously. I looked for some published sources that used this method but I did not find it.

Is it possible to indicate any source that uses the method?

Reply

2017-05-25T17:47:46+00:00

Hi Daniel, thanks for your comment. There’re books like this https://mitpress.mit.edu/books/dataset-shift-machine-learning that deal with this problem from a more theoretical point of view. Never found a formal reference to the specific method and as I mentioned at the time of writing it, the first thing that came to my memory was a conversation with Tom Dietterich. We published another post explaining how do it the check using WhizzML https://blog.bigml.com/2016/06/06/whizzml-tutorial-ii-covariate-shift/ and another one using anomaly detectors https://blog.bigml.com/2016/06/21/using-anomaly-detectors-to-assess-covariate-shift/.

Reply

Pingback: Pytolemaic — A Toolbox for Model Quality Analysis – Data Science Austria

	#!/bin/bash

	# Simple test to detect covariate shift.
	# Requires:
	# BIGML_AUTH set up with your BIGML_USERNAME and API_KEY
	# curl: http://curl.haxx.se/
	# jq: http://stedolan.github.io/jq/

	BIGML_DOMAIN=bigml.io # Set it up to your own BigML VPC domain

	FINISHED=5
	ERROR=-1
	WAIT=5
	TOTAL_WAIT=600

	SEED=SEED
	TRAINING_SAMPLE_RATE=0.8
	PRODUCTION_SAMPLE_RATE=0.8
	EVALUATION_SAMPLE_RATE=0.8
	OBJECTIVE_FIELD_NAME=Origin

	TRAINING_DATA=https://gist.github.com/aficionado/7743748/raw/3f1f1d5bd09c296e099344a80539103e4fa90756/titanic_train.csv
	PRODUCTION_DATA=https://gist.github.com/aficionado/7743752/raw/db2f5b38bc290b4defc68fe0865f23c16e1e6b7f/titanic_test.csv
	NAME=Titanic
	MAX_TRIALS=10
	TRAINING_FILTER=true
	PRODUCTION_FILTER=true
	EXCLUDED_FIELDS=[]

	# Uncomment to induce a covariate shift
	TRAINING_FILTER='(= (f Sex) male)'
	PRODUCTION_FILTER='(!= (f Sex) male)'

	# Uncomment to exclude discriminative fields
	#EXCLUDED_FIELDS=[\"000001\"]

	function wait_resource {
	# Waits for resources to finish as their creation is asynchronous
	ID=$1
	COUNTER=0
	STATUS=$(curl -s "https://$BIGML_DOMAIN/$ID?$BIGML_AUTH" \
	\| jq ".status.code")
	while [ "$STATUS" -ne "$FINISHED" ] && [ "$STATUS" -ne "$ERROR" ] &&
	[ "$COUNTER" -lt "$TOTAL_WAIT" ]; do
	sleep $WAIT
	let COUNTER++
	STATUS=$(curl -s https://$BIGML_DOMAIN/$ID?$BIGML_AUTH \
	\| jq ".status.code")
	done

	if [ "$STATUS" -eq "$ERROR" ]; then
	echo "Detected a failure waiting for $ID"
	exit 1
	fi
	}

	# Create training and production sources
	TRAINING_SOURCE=$(curl -s "https://$BIGML_DOMAIN/source?$BIGML_AUTH" \
	-X POST -H "content-type: application/json" \
	-d '{"remote": "'"$TRAINING_DATA"'", "name": "'"$NAME"' Training"}' \
	\| jq -r ".resource")

	PRODUCTION_SOURCE=$(curl -s "https://$BIGML_DOMAIN/source?$BIGML_AUTH" \
	-X POST -H "content-type: application/json" \
	-d '{"remote": "'"$PRODUCTION_DATA"'", "name": "'"$NAME"' Production"}' \
	\| jq -r ".resource")

	# Create training and production datasets
	wait_resource $TRAINING_SOURCE
	TRAINING_DATASET=$(curl -s "https://$BIGML_DOMAIN/dataset?$BIGML_AUTH" \
	-X POST -H "content-type: application/json" \
	-d '{"source": "'"$TRAINING_SOURCE"'"}' \
	\| jq -r ".resource")

	wait_resource $PRODUCTION_SOURCE
	PRODUCTION_DATASET=$(curl -s "https://$BIGML_DOMAIN/dataset?$BIGML_AUTH" \
	-X POST -H "content-type: application/json" \
	-d '{"source": "'"$PRODUCTION_SOURCE"'"}' \
	\| jq -r ".resource")

	wait_resource $TRAINING_DATASET
	wait_resource $PRODUCTION_DATASET

	TRIALS=0
	AVG_PHI=0
	while [ "$TRIALS" -lt "$MAX_TRIALS" ]; do
	let TRIALS++

	# Filter training and production datasets and label them with a new field
	LABELED_TRANING_DATASET=$(curl -s \
	"https://$BIGML_DOMAIN/dataset?$BIGML_AUTH" \
	-X POST -H "content-type: application/json" \
	-d '{"origin_dataset": "'"$TRAINING_DATASET"'",
	"lisp_filter": "'"$TRAINING_FILTER"'",
	"new_fields": [{"field": "Training",
	"name": "'"$OBJECTIVE_FIELD_NAME"'"}],
	"sample_rate": '"$TRAINING_SAMPLE_RATE"'}' \
	\| jq -r ".resource")

	LABELED_PRODUCTION_DATASET=$(curl -s \
	"https://$BIGML_DOMAIN/dataset?$BIGML_AUTH" \
	-X POST -H "content-type: application/json" \
	-d '{"origin_dataset": "'"$PRODUCTION_DATASET"'",
	"lisp_filter": "'"$PRODUCTION_FILTER"'",
	"new_fields": [{"field": "Production",
	"name": "'"$OBJECTIVE_FIELD_NAME"'"}],
	"sample_rate": '"$PRODUCTION_SAMPLE_RATE"'}' \
	\| jq -r ".resource")

	wait_resource $LABELED_TRANING_DATASET
	wait_resource $LABELED_PRODUCTION_DATASET

	# Compute sample rate sizes to make sure that the input dataset for the
	# model is balanced
	TRAINING_INSTANCES=$(curl -s \
	"https://$BIGML_DOMAIN/$LABELED_TRANING_DATASET?$BIGML_AUTH" \
	\| jq -r ".rows")
	PRODUCTION_INSTANCES=$(curl -s \
	"https://$BIGML_DOMAIN/$LABELED_PRODUCTION_DATASET?$BIGML_AUTH" \
	\| jq -r ".rows")

	if [ $TRAINING_INSTANCES -gt $PRODUCTION_INSTANCES ]; then
	SAMPLE_RATE=$(echo "$PRODUCTION_INSTANCES/$TRAINING_INSTANCES" \| bc -l)
	TRAINING_SAMPLE_RATE=$(printf '%.4f\n' $SAMPLE_RATE)
	PRODUCTION_SAMPLE_RATE=1
	else
	SAMPLE_RATE=$(echo "$TRAINING_INSTANCES/$PRODUCTION_INSTANCES" \| bc -l)
	TRAINING_SAMPLE_RATE=1
	PRODUCTION_SAMPLE_RATE=$(printf '%.4f\n' $SAMPLE_RATE)
	fi

	# The target of the new model will be the label (Training / Production)
	OBJECTIVE_FIELD=$(curl -s \
	"https://$BIGML_DOMAIN/$LABELED_TRANING_DATASET?$BIGML_AUTH;prefix=$OBJECTIVE_FIELD_NAME" \
	\| jq -r ".fields \| keys[0]")

	# Create a model using just a sample of the data
	MODEL=$(curl -s "https://$BIGML_DOMAIN/model?$BIGML_AUTH" \
	-X POST -H "content-type: application/json" \
	-d '{"datasets": ["'"$LABELED_TRANING_DATASET"'",
	"'"$LABELED_PRODUCTION_DATASET"'"],
	"sample_rates": {"'"$LABELED_TRANING_DATASET"'":
	'"$TRAINING_SAMPLE_RATE"',
	"'"$LABELED_PRODUCTION_DATASET"'":
	'"$PRODUCTION_SAMPLE_RATE"'},
	"objective_field": "'"$OBJECTIVE_FIELD"'",
	"sample_rate": '"$EVALUATION_SAMPLE_RATE"',
	"seed": "'"$SEED"'",
	"name": "'"$NAME"' - Covariate Shift?",
	"excluded_fields": '"$EXCLUDED_FIELDS"'}' \
	\| jq -r ".resource")

	wait_resource $MODEL

	# Create an evaluation using the other part of the data (out_of_bag=true)
	EVALUATION=$(curl -s "https://$BIGML_DOMAIN/evaluation?$BIGML_AUTH" \
	-X POST -H "content-type: application/json" \
	-d '{"datasets": ["'"$LABELED_TRANING_DATASET"'",
	"'"$LABELED_PRODUCTION_DATASET"'"],
	"sample_rates": {"'"$LABELED_TRANING_DATASET"'":
	'"$TRAINING_SAMPLE_RATE"',
	"'"$LABELED_PRODUCTION_DATASET"'":
	'"$PRODUCTION_SAMPLE_RATE"'},
	"sample_rate": '"$EVALUATION_SAMPLE_RATE"',
	"seed": "'"$SEED"'",
	"out_of_bag": true,
	"model": "'"$MODEL"'",
	"name": "'"$NAME"' - Covariate Shift?"}' \
	\| jq -r ".resource")

	wait_resource $EVALUATION

	PHI=$(curl -s "https://$BIGML_DOMAIN/$EVALUATION?$BIGML_AUTH" \
	\| jq -r ".result.model.average_phi")
	AVG_PHI=$(echo "$AVG_PHI + $PHI" \| bc -l)
	printf '%.4f\n' $PHI
	done

	AVG_PHI=$(echo "$AVG_PHI / $TRIALS" \| bc -l)
	printf 'AVG_PHI: %.4f\n' $AVG_PHI

A Simple Machine Learning Method to Detect Covariate Shift

The Basic Idea

Testing for Covariate Shift in a few API calls

Recap

6 comments

Leave a comment Cancel reply

The Basic Idea

Testing for Covariate Shift in a few API calls

Recap

Share this:

Relacionado

Leave a comment Cancel reply