Automating Machine Learning Workflows

Machine Learning (ML) services are quickly becoming a taken-for-granted part of the software developer’s toolbox, whatever the domain. These days, databases and networking are standard components of almost any non-trivial application, so easily integrated that almost no special expertise is required. We expect Machine Learning to become, in the very near future, a similar layer in the software stack.

This commoditization of ML services has so far been driven by service-oriented platforms such as BigML, which provide a key ingredient of the process: abstraction. Simple, easy-to-use REST APIs hide away not only the details of the sophisticated algorithms underlying the services at hand, but also the complexities of scaling those computations, both over CPU cycles and over input data volumes. A programmer no longer needs to be an expert in distributed systems or in the fine details of implementing ML algorithms in order to incorporate intelligent behaviour into her applications.

Or does she? The answer is, well, sometimes. Machine Learning platforms excel at hiding away the complexities of distributed computation, and they already provide powerful building blocks in the algorithmic department: decision trees, ensembles, random forests, anomaly detectors, automated feature engineering and many other pieces. Sometimes, those are enough to cover the needs of the end user. Quite often, however, solving a non-trivial problem with machine learning requires combining many of those primitives and algorithms into a complex workflow.

A Fresh Approach to Automation

Machine learning workflows are iterative most of the time, and typically involve the creation of many intermediate datasets, models, evaluations and predictions, all of them dynamically interwoven. There are many examples of such workflows, and we have covered several of them in this very blog, from basic ones that automate repetitive tasks (e.g., create a dataset, filter it by removing anomalies, then cluster it to label instances, and finally model the result and make predictions) to sophisticated algorithms that enhance our machine learning arsenal (e.g., feature selection or hyperparameter optimization techniques).

So, quite often, the predictive components of end-user applications are workflows that must be built on top of our machine learning platforms using a general-purpose language such as Python or Java and its bindings to the platform’s simple and nice API. That is, we write client-side programs that combine server-side API calls to solve our problem.

While effective, this approach to automation destroys a good deal of the simplicity and abstraction that those APIs had worked very hard to provide. In particular, automation based on client-side programming against remote APIs presents, among others, the following drawbacks:

  • Complexity: Coding the workflow in a standard programming language means worrying about details outside of the problem domain and not directly related to the machine learning task.
  • Reuse: REST APIs such as BigML’s provide a uniform, language-agnostic interface for solving machine learning problems; the moment we codify a workflow in a specific programming language, however, we lose that interoperability. For instance, workflows written in Python won’t be reusable by the Node.js or Java communities.
  • Scalability: Complex client-side workflows are potentially very hard to optimize, and they lose the automatic parallelism and scalability offered by sophisticated server-side services. Furthermore, there are performance penalties inherent to the client-server paradigm: asynchronously pipelining a remote workflow introduces extra intermediate delays just to check the status of each created resource before moving forward in the workflow.
  • Client-server brittleness: Client-side programs need to handle non-trivial error conditions caused by network failures. When any step in a complex workflow can potentially fail, handling errors and resuming from interruptions without losing valuable work already performed becomes a big challenge.

In short, the underlying REST API already provides an abstraction layer over computing resources and the nitty-gritty details of many algorithms, but client-side workflows re-introduce inessential details into our workflow specifications.

We need a way to recover, for whole workflows, the nice properties that ML platforms already give us for each of the individual steps a workflow is made of.

In the software engineering field, we have long known a very effective cure for the kind of problems outlined above. To raise the abstraction level of our workflow descriptions and free them from inessential detail, we must express our solution in a new language that is closer to the domain at hand, i.e., machine learning and feature engineering. In other words, we need a domain-specific language, or DSL, for machine learning workflows.

If you are a BigML user, chances are that you are already familiar with another DSL the platform offers for feature synthesis and dataset transformations, namely Flatline. With that domain-specific language, we offer a solution to the problem of specifying, via mostly declarative symbolic expressions, new fields synthesized from existing ones in a dataset that is traversed using a sliding window. The DSL also lets you name things and use those names inside your expressions, an archetypal means of abstraction.

In Flatline, we talk only about the domain we are in, namely features (old and new) and their combination and composition. Note also that the Flatline language is based on expressions describing the new features we are creating, rather than on a procedural sequence of instructions for synthesizing those features.
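For instance (a hedged one-line illustration of ours, not taken from this post; the field name is made up), a Flatline expression computing the row-over-row difference of a numeric field could read as follows, with the -1 shift reaching back one row in the sliding window:

(- (field "sales") (field "sales" -1))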

Now, we want to be able to formulate machine learning workflows with a similarly expressive DSL. And that is precisely what WhizzML provides: a language specifically tailored to the domain of machine learning workflows, containing primitives for all the tasks available via BigML’s API (such as model creation, evaluation or dataset transformation) and providing powerful means of composition, iteration, recursion and abstraction.

An Example WhizzML Workflow

To give you a quick flavor of what WhizzML syntax looks like, here’s a very simple workflow. It generates a batchcentroid dataset from a source, encapsulating all the steps needed: creating a dataset, creating a cluster, creating a batchcentroid, and extracting the resulting dataset from it:

(define (one-click-batch-centroid src-id)
  ;; Build the dataset and the cluster, then run a batchcentroid
  ;; over the same dataset, asking for an output dataset.
  (let (ds-id (create-and-wait "dataset" {"source" src-id})
        cl-id (create-and-wait "cluster" {"dataset" ds-id
                                          "default_numeric_value" "median"})
        bc-id (create-and-wait "batchcentroid" {"cluster" cl-id
                                                "dataset" ds-id
                                                "output_dataset" true
                                                "all_fields" true}))
    ;; Return the id of the dataset generated by the batchcentroid
    (get (fetch bc-id) "output_dataset_resource")))

(define dataset-id (one-click-batch-centroid source-id))

Here, we can already recognize some of the qualities we have been discussing: the keywords define and let introduce new names for different artifacts: the identifiers of the dataset, cluster and batchcentroid that are created (ds-id, cl-id and bc-id), a function, one-click-batch-centroid, that encapsulates the workflow, and the identifier of the resulting dataset, dataset-id, which will be the workflow’s output. We also see how primitives for the creation of BigML resources are readily available (create-and-wait and friends) and take care of their time evolution (in this case, by waiting for their successful completion). All the steps of the workflow are thus trivially rendered.

Admittedly, careful use of high-level language bindings such as Python’s can attain similar levels of readability for workflows as simple as this example. However, as you’ll see in other examples, WhizzML’s readability scales better when the complexity of the workflow increases.
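To hint at how that scaling works, here is a minimal sketch of an iterative workflow (our own illustration, not from this post: the train-id and test-id datasets, the candidate sizes and the average_phi metric path are all assumptions, and a classification problem is presumed). It trains one ensemble per candidate size, evaluates each against a holdout dataset, and returns the best-scoring one:

(define (evaluate-size train-id test-id n)
  (let (e-id (create-and-wait "ensemble" {"dataset" train-id
                                          "number_of_models" n})
        ev-id (create-and-wait "evaluation" {"ensemble" e-id
                                             "dataset" test-id}))
    ;; Pair each ensemble with its phi score from the evaluation
    {"ensemble" e-id
     "phi" (get-in (fetch ev-id) ["result" "model" "average_phi"])}))

(define (best-ensemble train-id test-id sizes)
  (let (results (map (lambda (n) (evaluate-size train-id test-id n)) sizes))
    ;; Keep the candidate with the highest phi coefficient
    (get (reduce (lambda (best r)
                   (if (> (get r "phi") (get best "phi")) r best))
                 (head results)
                 (tail results))
         "ensemble")))

(define best-id (best-ensemble train-id test-id [8 16 32]))

Written against a remote API with client-side bindings, the same loop would interleave polling and error handling with the actual logic at every step.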

Moreover, there are other immediate advantages to using a WhizzML script to express your workflow, even in this simple case. For one, it is automatically cross-language and available from any development platform. Even more importantly, it is executable as a single unit fully on the server side, inside BigML’s system: there, we not only avoid the inefficiency of all the network calls needed by, say, a Python script, but also automatically parallelize the job. In that way, we recover one of the big advantages of Machine Learning as a Service: the scalability of the available machine learning tasks, which is mostly lost for composite workflows when using the traditional, bindings-based solution.

The Power of Reusability

Another very important advantage of using WhizzML instead of client-side scripts written in your favorite programming language is that code written in WhizzML becomes a reusable server-side resource. Specifically, you have three kinds of resources at your disposal:

  • Library: A collection of WhizzML definitions that can be imported by other libraries or scripts. Libraries therefore do not directly describe executable workflows, but rather provide new building blocks (in the form of new WhizzML procedures, such as one-click-batch-centroid above) to be (re)used in other workflows, as in the sketch after this list.
  • Script: A WhizzML script (like the one we saw in the previous section) has executable code and an optional list of parameters, settable at run time, that describe an actual workflow. Scripts can use any of the WhizzML primitives and standard library procedures. They can also import libraries and use their exported definitions. When creating a script resource, users can provide a list of outputs in its metadata, in the form of a list of variable names defined in the script’s source code.
  • Execution: When a user creates a script, its syntax is checked and it gets pre-compiled and readied for execution. Each script execution is described by a new BigML resource, with a dynamic status (from queued to started to running to finished) and a series of actual outputs. When creating an execution, besides the identifier of the script to be run, users will typically provide actual values for the script’s free parameters (just as you provide argument values when calling a function in any programming language). It is also possible to pipe together several scripts in a single execution.
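To make that division of labor concrete, here is a hedged sketch of how the pieces could fit together (our paraphrase of the resource descriptions above; the metadata details in the comments are illustrative, not literal API fields):

;; Library source, stored as its own REST resource: for instance, the
;; one-click-batch-centroid definition from the previous section.

;; Script source, created with the library's id listed in its imports
;; metadata and with source-id declared as an input parameter:
(define dataset-id (one-click-batch-centroid source-id))

An execution of that script would then supply the actual value for source-id and, with dataset-id listed among the script’s outputs, expose the resulting dataset id once it reaches the finished status.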

Workflow scripts written in WhizzML are reusable server-side resources, which makes WhizzML’s procedural abstraction capabilities especially powerful. New procedures can be made available, as part of a library (just another kind of REST resource), to other scripts. We have thus gained the ability to define new machine learning primitives, climbing even higher up our abstraction ladder, without losing the performance or scalability benefits of a service-oriented platform such as BigML.

To learn more about WhizzML, jump right over to its home page and read the other posts in this series!
