Retraining Machine Learning Models

Nowadays, our models deal with streams of information. They need to be updated when they are fed new data that differs significantly from the data used to train them. This is why Machine Learning tools need to provide solutions that are not only accurate, but also scalable, repeatable and ready to be swiftly integrated into any information system. In other words, a Machine Learning tool shouldn’t necessarily look for the perfect immutable model, but rather for flexible solutions that can be retrained easily and brought to production in the shortest time as circumstances shift.

Today we’ll present a new tool that will help you retrain your existing models as new data arrives. A new bigmler retrain sub-command has been added to BigMLer, the command-line tool to automate workflows in BigML.

Why and how to retrain?

The typical scenario to build a Machine Learning model involves some trial and error. This can be done manually or automatically, using scripts like best feature selection or SMACDown, but some tasks are meant to be repeated, for example: uploading your data, identifying the fields important to your model, evaluating the predictions or checking the model performance. Because your business or problem domain may change, your models will need to be retrained using new data eventually. But how do we do that?

Well, you need to realize that when you train a model in the first place, your data goes through several preparation steps: setting a type for each field, transforming the field values, adding new fields or filtering rows. The model can rarely work on the raw data, so to add more data to an existing model, all these transformations need to be re-run first.

In BigML, once a model (or any other resource) is built, it is immutable. None of its essential properties can be changed (only non-essential attributes, like its name or description, can be updated). The model keeps the information about which resources were used to build it and which configuration choices were made to get it. That ensures that all the steps that led to your actual model, that is, the model-building workflow, can be deduced. By simply applying this workflow to new data you can get a retrained model.

Using BigMLer to retrain your models

Let’s assume you train a model from a sample of data and want to rebuild it with more data later. To build the initial model, you can use a BigMLer command such as:

bigmler --train data/diabetes_sample.csv \
--tag best_diabetes \
--output-dir ./initial_model

where you provide the file diabetes_sample.csv that contains the first sample of data and add a unique tag to the generated resources. Tags in BigML act like keywords that can be used in searches to retrieve your resources. In the command, we added a best_diabetes tag to our model so that we can use it as a reference when we decide to retrain it.

In that case, the call to retrain our model using BigMLer would be:

bigmler retrain --add diabetes_new_data.csv \
--model-tag best_diabetes \
--output-dir accumulative_retrain

The process that BigMLer will execute for you consists of:

  • Uploading the data in diabetes_new_data.csv to the platform.
  • Retrieving the workflow that was used to build the last model with the tag best_diabetes.
  • Adding the uploaded data to the dataset used in that model.
  • Generating a new model from the merged data by using the same configuration arguments: the retrained model.
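
Because the retrain call is a single command, it is easy to wrap in a small script for reuse. The following is only a minimal sketch, not part of BigMLer: the retrain function, the DRY_RUN switch and the paths in the crontab comment are hypothetical, and only the flags already shown above are used.

```shell
#!/bin/sh
# Hypothetical wrapper that a scheduler such as cron could call, e.g. with
# a crontab line like (03:00 on the first day of every month):
#   0 3 1 * * /path/to/retrain.sh /path/to/new_data.csv

retrain() {
    new_data="$1"   # file with the new rows
    tag="$2"        # tag of the model to retrain
    out_dir="$3"    # directory where BigMLer stores its files

    # With DRY_RUN=1 (the default in this sketch) the command is only
    # printed, so the script can be checked without calling BigML.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo bigmler retrain --add "$new_data" --model-tag "$tag" --output-dir "$out_dir"
    else
        bigmler retrain --add "$new_data" --model-tag "$tag" --output-dir "$out_dir"
    fi
}

retrain "${1:-diabetes_new_data.csv}" best_diabetes accumulative_retrain
```

Setting DRY_RUN=0 in the environment makes the script actually run the command, which requires BigMLer to be installed and your credentials to be configured.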

Every time you run a bigmler retrain command on a certain model, a new model resource is generated. BigMLer uses the directory set in --output-dir to store files that contain the information about the resources generated in this process. It also prints a report in the console showing the evolution of the process. The URL that will retrieve the latest retrained model can be seen at the bottom of this report. You can use this URL in any application that needs to use the newest model available to make predictions.

Following the example, you would use a query that ends with the parameters

$BIGML_AUTH;limit=1;full=yes;tags=best_diabetes

to retrieve the latest retrained model. The $BIGML_AUTH environment variable stands for the credentials information (username=[my_username];api_key=[api_key]).

The output of this query is a JSON document with an objects attribute that contains a list with a single element: the model information. To use the latest retrained model in your information systems, you could pass this information to a local model object like the ones we offer in any of our bindings. The local model provides a function that generates the model’s predictions.
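
If you retrain several models, each with its own tag, the query parameters can be assembled by a small helper. This is just a sketch under stated assumptions: latest_model_query is a hypothetical name, the credentials are placeholders, and the function builds only the parameter part of the query, which you would append to the models URL printed in the BigMLer report.

```shell
#!/bin/sh
# Hypothetical helper: build the query parameters described in the text
# for a given tag. BIGML_AUTH is expected to hold
# "username=[my_username];api_key=[api_key]".
latest_model_query() {
    tag="$1"
    # limit=1 and full=yes are the values used in the example query.
    printf '%s;limit=1;full=yes;tags=%s\n' "$BIGML_AUTH" "$tag"
}

# Placeholder credentials, for illustration only.
BIGML_AUTH='username=my_username;api_key=my_api_key'
latest_model_query best_diabetes
```

Keeping the tag as the only argument means the same helper can serve every model you retrain, as long as each one was created with a distinct tag.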

Retraining by windowing

Other scenarios involve retraining your models periodically with a subset of the most recent data. An example could be a sales model that is updated monthly and considers only a limited number of months of data. In this case, your data will need to be uploaded incrementally, but it shouldn’t be aggregated. Instead, each month’s data needs to be kept in a separate dataset. We can easily achieve this by setting the number of datasets to be used in the model construction in the --window-size option:

bigmler retrain --add diabetes_12.csv \
--window-size 3 \
--ensemble-tag best_diabetes \
--output-dir windowed_retrain

This will now upload your data, build a new dataset with it and use the most recent datasets, as many as specified in --window-size, to rebuild an ensemble.
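
To make the windowing idea concrete, here is a toy sketch of which datasets end up in a window of size 3. It only illustrates the selection logic and is not how BigMLer implements it; the last_n function and the dataset names are hypothetical.

```shell
#!/bin/sh
# Hypothetical illustration of a sliding window: print the last N of the
# given items, one per line, keeping their original order.
last_n() {
    n="$1"
    shift
    # Drop items from the front until at most N remain.
    while [ "$#" -gt "$n" ]; do
        shift
    done
    for item in "$@"; do
        printf '%s\n' "$item"
    done
}

# Four monthly datasets, window of three: the oldest one is left out.
last_n 3 diabetes_09 diabetes_10 diabetes_11 diabetes_12
```

With these arguments the sketch prints diabetes_10, diabetes_11 and diabetes_12, so the oldest month drops out of the window as soon as a new one arrives.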

Referring by ID

Of course, the use of tags is just a human-friendly mechanism. Adding a tag is not strictly needed to retrain a model. Since each resource in BigML has its own unique resource ID, you can also use that as a reference in the bigmler retrain command:

bigmler retrain --add diabetes_new_data.csv \
--id cluster/5a186f1d92527304c200077b \
--output-dir accumulative_cluster

Consequently, the workflow that built the resource with the given ID (a cluster in this case) will be reproduced, so a new cluster will be built on the consolidated data.
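
The resource ID in the example has the shape type/identifier, so a script can read the resource type from the ID itself. The sketch below assumes that shape holds for the IDs you pass in; resource_type and resource_key are hypothetical helper names.

```shell
#!/bin/sh
# Split a resource ID of the form "type/identifier" (the shape of the
# example ID above) into its two parts.
resource_type() { printf '%s\n' "${1%%/*}"; }   # text before the first "/"
resource_key()  { printf '%s\n' "${1#*/}"; }    # text after the first "/"

id='cluster/5a186f1d92527304c200077b'

# Derive an output directory name from the resource type.
echo "accumulative_$(resource_type "$id")"
```

For the ID above this yields accumulative_cluster, the directory name used in the command.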

As we’ve just explained, thanks to BigMLer, retraining your Machine Learning models is as simple as running a single command. This process can be triggered by any local scheduler available on your machines. To boot, you’ll be able to use the latest retrained model thanks to a permanent URL. It’s a really simple setup for a production-ready Machine Learning system. Your turn: just give it a try!
