Get your training and test sets in just one click

Testing or evaluating a predictive model involves using the model to generate predictions for a test set and then computing a number of performance measures or metrics, such as accuracy, precision, and recall. These metrics estimate how well the model will perform when making predictions for instances that weren't used to train it. You can read more about them in our post on how to perform automatic evaluations in BigML. But where can you find a test set that is representative of the instances the predictive model will face in production?

A traditional approach is to split the available data into two disjoint sets, one for training and one for testing. If your programming skills are up to the task you could use, say, Weka, scikit-learn, R, or SQL to perform the split. But, as we explain in this post, BigML offers a much easier way to perform this task automatically with just one click or, if you use our API, a couple of HTTP requests.

1-click Train | Test dataset splits

On the surface, splitting the instances of a dataset into training and test sets seems easy: just take, say, the first 80% as the training set and the rest as the test set. But what if the dataset is ordered according to non-random criteria? Imagine that the instances are ordered by the field representing the class that you want to predict. Then, almost certainly, the test set will not be a fair representation of the instances in the training set, or of the new ones in production for that matter. For example, say that you have an equally balanced dataset with 5 classes, ordered by class. Taking the first 80% of the instances as your training set, the test set will contain instances of just one class and the training set only instances of the other four. You'd be testing the model on instances of a class it never saw during training, so the performance measures won't reflect the model's real performance. This is why random sampling is important when splitting your data into training and test sets.
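The hazard is easy to reproduce. Here is a minimal sketch in plain Python (hypothetical data, not BigML's implementation) showing how a naive head/tail split fails on class-ordered data, and how a seeded shuffle fixes it:

```python
import random

# Hypothetical data: 100 instances, 5 equally balanced classes, ordered by class.
instances = [(i, cls) for cls in "ABCDE" for i in range(20)]

# Naive split: first 80% for training, the rest for testing.
naive_train, naive_test = instances[:80], instances[80:]
print({cls for _, cls in naive_test})  # prints {'E'}: a class the model never saw

# Seeded random split: shuffle first, then cut at 80%.
shuffled = instances[:]
random.Random("my seed").shuffle(shuffled)
train, test = shuffled[:80], shuffled[80:]
print(len({cls for _, cls in test}))  # the test set now mixes several classes
```

Because the shuffle is seeded, the split is also reproducible: rerunning it with the same seed yields the same partition.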

When it comes to random sampling, you could use the Clojure library that we open-sourced a few months ago, but if you don't know Clojure there's no need to worry: in BigML's interface, splitting a dataset into training and test sets is now one click away.

Now, in your dashboard, from the dataset listings or from an individual dataset view, you have a new menu option to create training and test sets in just one click. By default, the split is 80/20: 80% of the instances in the dataset will be used to create a new dataset suffixed with "training" and the remaining 20% will be used to create a new dataset suffixed with "test".

To get the same training/test split from BigML's API, two requests are needed. Both requests must share exactly the same values for the following arguments:

  • origin_dataset: a newly added argument that specifies the dataset to split.
  • seed (e.g., "my seed"): the seed used to initialize the random number generator that shuffles the rows. It can be any string, and the same seed always gives rise to the same shuffling.
  • sample_rate: a number between 0 and 1 that represents the fraction of instances to sample for the training set (e.g., 0.7 for a 70/30 split).

The first request creates the training set. We have updated the dataset creation REST method to accept a new argument named origin_dataset that specifies the dataset to split. You also need to specify the sample_rate to use (e.g., 0.8) and a seed (e.g., "my seed"). The seed is essential to guarantee that in the second call the random generator is initialized to exactly the same state, ensuring that the training and test sets are complementary.

curl "https://bigml.io/dataset?$BIGML_AUTH" \
     -X POST \
     -H "content-type: application/json" \
     -d '{"origin_dataset": "dataset/518d0568925ded7ea40004a2",
          "sample_rate": 0.8,
          "seed": "my seed",
          "name": "Training set"}'

The second request generates the test set. Notice that the sample_rate needs to be the same as in the training set creation (i.e., 0.8, not 0.2). The key difference is the out_of_bag flag, which needs to be set to true. This creates a new dataset with all the remaining (out-of-bag) instances that weren't used to build the training set.

curl "https://bigml.io/dataset?$BIGML_AUTH" \
     -X POST \
     -H "content-type: application/json"  \
     -d '{"origin_dataset": "dataset/518d0568925ded7ea40004a2",
          "sample_rate": 0.8,
          "seed": "my seed",
          "out_of_bag": true,
          "name": "Test set"}'
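To see why both calls must share the same seed and sample_rate, here is a minimal sketch in Python (an illustration of the idea, not BigML's actual sampler): seeding the generator fixes which rows are drawn, so the out-of-bag selection is the exact complement of the in-bag one.

```python
import random

def sample_rows(n_rows, sample_rate, seed, out_of_bag=False):
    """Deterministically select row indices. The seed fixes the random
    draws, so out_of_bag=True returns exactly the remaining rows."""
    rng = random.Random(seed)
    picks = [rng.random() < sample_rate for _ in range(n_rows)]
    return [i for i, picked in enumerate(picks) if picked != out_of_bag]

train = sample_rows(1000, 0.8, "my seed")                  # ~80% of the rows
test = sample_rows(1000, 0.8, "my seed", out_of_bag=True)  # the other ~20%

# The two selections are disjoint and together cover every row.
assert set(train).isdisjoint(test)
assert sorted(train + test) == list(range(1000))
```

Change the seed or the sample_rate in the second call and the complement property breaks: some rows would land in both sets or in neither.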

Once you have your training and test sets ready, you can use the training set to create a model or ensemble and the test set to evaluate it. If you are satisfied with the results, then remember to use the original dataset (the one with all your data) to create the model or ensemble that you'll release to production. In our next post, we'll show you how to split a dataset from BigML's command line.

7 comments

  1. With the latest release, users can configure the sample rate used in each split from the web interface.
