Scriptify: 1-Click Reification of Complex Machine Learning Workflows
Real-world Machine Learning is not just about applying an algorithm to a dataset. It is a workflow involving a sequence of steps such as adding new features, sampling, removing anomalies, applying a few algorithms in cascade, and stacking a few others. The exact steps are often arrived at during iterative experiments performed by the practitioner. In other words, when it comes to the real-life Machine Learning process, not everything is as automatic as various business media may make you believe.
Usually, one starts by playing around a bit with the data to assess its quality and to get more familiar with it. Then, a significant amount of time is spent engineering features, configuring models, evaluating them, and iterating or combining resources to improve results. Finally, when the right workflow is found, traceability and replicability become must-have requirements for bringing the workflow to a production environment. Without those attributes, one can’t ensure that errors are eliminated, or that workflows can be rerun and otherwise improved by everyone in a workgroup.
You are probably asking, “That all sounds great, but how does one achieve this without creating even more complexity?” That’s precisely why today BigML has launched a new game-changing feature, which can create a WhizzML script for any Machine Learning workflow in a single click: Scriptify.
Auto-scripting workflows to generate your resources
With this update, all BigML resources (datasets, models, evaluations, etc.) have a new menu option named Scriptify that automatically generates a script able to regenerate that resource (or to generate a similar one if it is run with different inputs that share the same field structure). BigML resources have always been downloadable as JSON objects, where the properties specified by the user when creating or updating the resource are stored as attributes. This white-box approach is crucial to ensure that everything you do with BigML is traceable, reusable and replicable. As a result, we can use our recently launched scripting language, WhizzML, to inspect any existing resource in BigML and easily extract the entire workflow needed to generate it.
Let’s start by explaining an example use case. Say you created an evaluation some time ago, but you don’t remember which model was evaluated, whether it was a balanced model or not, or which test dataset was used. No problem! This information is already stored in each of the resources. You don’t need to track it down manually or document it in a separate repo. Clicking the “Scriptify your evaluation” link in the actions menu of your evaluation view screen will unravel it and swiftly generate a script that can reproduce the original evaluation.
This new script will now be available as another resource. Like any other WhizzML script, you will be able to execute it to recreate your evaluation on demand.
To create a resource in BigML, you usually provide the starting resource (e.g., if you want to build a model, you’ll need to decide from which dataset) and some configuration. The corresponding Scriptify action retrieves this information, and recursively does the same for all resources used throughout your entire workflow.
Following the example in the previous section, first the evaluation is analyzed to find out the IDs of both the model that was evaluated and the dataset used as test dataset, since these are the origin resources for the evaluation. Then, each of them is analyzed recursively to find out the origin resources that were used to build them. The model was built from a dataset and the test dataset from a source. Finally, it turns out that the dataset used to build the model was built from the same source as the test dataset by using an 80%-20% split. In general, any Scriptify call will bubble up through the hierarchy of parent objects until it finds the source object for every resource involved in the analysis. Because the Scriptify process needs to explore every object in the hierarchy, it stops if any intermediate resource has been deleted. For each of the resources, Scriptify extracts the attributes used in the create and/or update calls and generates a new script containing the WhizzML code able to replicate them.
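The traversal described above can be pictured with a short Python sketch. This is only an illustration of the idea, not BigML's implementation: the resource IDs, the `origins` and `args` keys, and the `creation_order` helper are all hypothetical, and the resources live in a plain dictionary instead of being fetched from the API.

```python
# Hypothetical in-memory stand-in for the stored resource JSON.
# IDs and the "origins"/"args" keys are invented for this sketch.
RESOURCES = {
    "source/1": {"origins": [],
                 "args": {"remote": "s3://bigml-public/csv/diabetes.csv"}},
    "dataset/train": {"origins": ["source/1"], "args": {"split": "80%"}},
    "dataset/test": {"origins": ["source/1"], "args": {"split": "20%"}},
    "model/1": {"origins": ["dataset/train"], "args": {}},
    "evaluation/1": {"origins": ["model/1", "dataset/test"], "args": {}},
}


class MissingResourceError(LookupError):
    """Raised when a resource in the hierarchy is no longer available."""


def creation_order(resource_id, resources, ordered=None):
    """Return resource IDs with every origin listed before its dependents."""
    if ordered is None:
        ordered = []
    if resource_id in ordered:
        # Already visited through another branch (e.g. a shared source).
        return ordered
    if resource_id not in resources:
        # Mirrors the rule that a deleted intermediate resource stops the walk.
        raise MissingResourceError(f"{resource_id} has been deleted")
    for origin in resources[resource_id]["origins"]:
        creation_order(origin, resources, ordered)
    ordered.append(resource_id)
    return ordered


steps = creation_order("evaluation/1", RESOURCES)
```

Here `steps` lists the shared source first and the evaluation last, which is the order in which a generated script would need to recreate the resources. Removing any intermediate entry from `RESOURCES` makes the walk fail, just as a deleted resource stops Scriptify.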
In the script generated for this example, all the resources derive from the data in a file named s3://bigml-public/csv/diabetes.csv, which was initially uploaded to build a source object. This URL is kept as an input for the script, but you can change it if need be. In a production environment, you periodically need to repeat the same process on new data. With this script, you only need to provide the new URL as input to rebuild the evaluation on new data, using the same procedure and configurations.
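As a rough illustration of how a script input with a default value lets the same workflow run on new data, consider the Python sketch below. The input name `source-url`, the `resolve_inputs` helper, and the replacement URL are hypothetical; only the default URL comes from the example above.

```python
# Default inputs captured when the workflow was scripted.
# The input name "source-url" is an assumption made for this sketch.
DEFAULT_INPUTS = {"source-url": "s3://bigml-public/csv/diabetes.csv"}


def resolve_inputs(overrides=None):
    """Merge caller-supplied inputs over the defaults, rejecting unknown names."""
    inputs = dict(DEFAULT_INPUTS)
    for name, value in (overrides or {}).items():
        if name not in inputs:
            raise KeyError(f"unknown script input: {name}")
        inputs[name] = value
    return inputs


# Rerun on the original data, then on a (hypothetical) new file.
original_run = resolve_inputs()
new_data_run = resolve_inputs({"source-url": "s3://example-bucket/new-data.csv"})
```

Everything downstream of the source stays untouched, so the split, model, and evaluation are rebuilt with the same configuration.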
Scriptify as the building block for automation
Another interesting property of scripts in BigML is that you can modify them to create new scripts. The “create a new script using this one” link opens an editor screen, where you can modify the code in the original script. Following the example, if you find out that your model was not balanced and you want to try the evaluation on a balanced model, you can do so by adding the balance_objective flag to the model creation call attributes.
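Conceptually, that change amounts to adding one key to the attributes passed when the model is created. The Python sketch below shows the idea on a plain dictionary; the `with_balanced_objective` helper and the `name` attribute are hypothetical, and in the generated script the edit would be made in the WhizzML code itself.

```python
def with_balanced_objective(model_args):
    """Return a copy of the model creation attributes with balancing enabled."""
    updated = dict(model_args)  # copy, so the original attributes stay intact
    updated["balance_objective"] = True
    return updated


# Hypothetical attributes extracted from the original, unbalanced model.
original_args = {"name": "diabetes model"}
balanced_args = with_balanced_objective(original_args)
```

Working on a copy keeps the original attributes available, which makes it easy to compare the balanced and unbalanced evaluations side by side.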
Clicking the validate button checks your changes and points out any errors. You can also change the outputs of the script. This view searches all the possible outputs in the script and offers you a list to select from. In this case, we set the model ID to be returned as output in addition to the original evaluation ID.
As you can see, you don’t really need to know WhizzML to start scripting your workflows. Rather, you can create any workflow on BigML’s Dashboard and let Scriptify magically transform it into WhizzML code! You can even make it a 1-click automation option on your menu. Finally, this code can also be modified and composed to create, share, and execute new workflows in a scalable, parallelizable and reproducible way. Now it’s your turn to transform your custom Machine Learning resources into blazingly fast, traceable and reproducible workflows ready for production environments!