Programmatically Fill in Missing Values in Your Dataset with WhizzML

For new WhizzML developers, WhizzML’s power as a full-blown functional programming language can sometimes obscure the relationship between WhizzML and the BigML Machine Learning platform. At BigML, we refer to WhizzML as a functional programming language for orchestrating workflows on the BigML platform. To help elucidate how WhizzML and the platform interact, this post describes an example script from the WhizzML script gallery that fills in missing data values in a BigML Dataset object.

The BigML developer documentation provides one view into the Machine Learning functions the BigML platform makes available to users. This functionality can be accessed through multiple programming methods including the BigML REST API, the downloadable BigMLer command line tool, BigML bindings for all popular programming languages, and now as WhizzML functions. There is an important difference between the first three programming methods (REST API, BigMLer, and bindings) and WhizzML: Solutions using the first three methods run on other user platforms, increasing the volume of data and metadata transfers between user platforms and the BigML platform. Production WhizzML scripts run on the BigML platform, eliminating data transport costs and leveraging parallelism and performance optimizations for BigML Machine Learning on large datasets.

WhizzML and Flatline

When using WhizzML to orchestrate workflows, you might quickly come up against an additional subtlety: to realize the full potential of WhizzML, WhizzML functions should not themselves process the data in datasets, but only orchestrate the execution of BigML Machine Learning functions in the BigML API. However, your ML application might need to process dataset data in unique ways. For example, the API to create a BigML Cluster object from a BigML Dataset object includes an argument “default_numeric_value” that selects a single kind of numeric value (“mean”, “median”, “minimum”, “maximum”, or “zero”, the first four computed on a per-column basis) to fill all missing values in every column considered in the clustering operation.
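To make that limitation concrete, here is a minimal plain-Python sketch of what one fill strategy applied to every column amounts to. It is not BigML code: the helper name fill_with_strategy, the row layout, and the sample values are assumptions made for illustration, and on the platform the equivalent work happens in the backend.

```python
from statistics import mean, median

# One strategy is shared by every column, mirroring the single choice
# that "default_numeric_value" offers. All names here are illustrative.
STRATEGIES = {
    "mean": mean,
    "median": median,
    "minimum": min,
    "maximum": max,
    "zero": lambda values: 0,
}

def fill_with_strategy(rows, strategy):
    """Return copies of rows (dicts) with None replaced column by column."""
    compute = STRATEGIES[strategy]
    columns = list(rows[0].keys())
    # One default per column, computed from the values that are present.
    defaults = {
        col: compute([r[col] for r in rows if r[col] is not None])
        for col in columns
    }
    return [
        {col: defaults[col] if r[col] is None else r[col] for col in columns}
        for r in rows
    ]

rows = [
    {"rate": 50.0, "age": 30},
    {"rate": None, "age": 40},
    {"rate": 70.0, "age": None},
]
filled = fill_with_strategy(rows, "mean")
```

Every column is forced through the same strategy here. Choosing the mean for one column and a fixed constant for another is exactly the flexibility this interface does not offer.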

It is easy to conceive of applications where you might need more flexibility in filling missing values in a dataset. We don’t want to do this by processing the data in WhizzML itself, because we couldn’t leverage all the performance benefits the BigML platform provides for handling datasets. This is where we can turn to Flatline in WhizzML. Flatline is a row-oriented processing language for datasets in the BigML platform itself. The BigML developer tools include a Flatline editor for directly applying Flatline operations to datasets, but we can also use Flatline directly in WhizzML.

The WhizzML script Clean Data Fill can be found in the BigML WhizzML GitHub repository. It is an example of how we can use WhizzML and Flatline to fill in missing values in a dataset by using default values supplied in a map to a WhizzML function. We can’t cover all of the Flatline operations and use cases here, so in our example we’ll just show how to apply the Flatline function:

(all-with-defaults <field-designator-0> <field-value-0>
                   <field-designator-1> <field-value-1>
                     ...
                   <field-designator-n> <field-value-n>)

to modify a dataset. We do this by using the built-in WhizzML (flatline …) function and a macro-like structure to fill in the field information:

(flatline "(all-with-defaults @{{fargs}})")

where fargs is a WhizzML list that includes field key/value (field names/values) pairs as sequential entries. Our example script does this by using four WhizzML functions, three of which are actually functions to simplify the task of specifying default values for the fourth function that does the real work by using Flatline.
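The effect of splicing fargs into the expression can be approximated in plain Python. This is only a sketch of the idea: the helper names render_literal and build_expression are hypothetical, the second field “Age” is invented, and the exact way WhizzML renders each value inside the Flatline string is an assumption here (double-quoted strings, bare numbers).

```python
import json

def render_literal(value):
    """Render a Python value as a Flatline-style literal (assumed format)."""
    if isinstance(value, str):
        return json.dumps(value)  # double-quoted, with escapes
    return repr(value)            # numbers are written bare

def build_expression(fargs):
    """Splice a flat [name, value, name, value, ...] list into the form."""
    if len(fargs) % 2 != 0:
        raise ValueError("fargs must hold name/value pairs")
    body = " ".join(render_literal(v) for v in fargs)
    return "(all-with-defaults " + body + ")"

# "Age" is a made-up second field used only for illustration.
expr = build_expression(["Employment Rate", 0.0, "Age", 35])
```

The point of the flat list is that the pairs land in the expression in order, so each field designator is immediately followed by its default value.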

Specifying Default Values for Missing Dataset Values

Arguably the most burdensome task we have to undertake is to build a map of default values to fill in the missing values in a dataset. Three of the four functions in our example WhizzML script, (extract-meta …), (extract-meta-func …), and (generate-configmap …), implement our illustrative approach for doing this. Before discussing these functions, note that the metadata for a BigML Dataset object includes two properties, “input_fields” and “fields”, that provide the metadata items we need to build our default value map. The “input_fields” property is a list of field (column) IDs in the dataset, e.g.:

{ ...
 :input_fields
 ["000000"
  "000001"
   ...
  "000007"],
   ...
}

The “fields” property is a dictionary of summary information for each field keyed on the IDs in the “input_fields” property, e.g.:

{ ...
 :fields
 {:000000
  { ...
   :datatype "double",
   :name "Employment Rate",
   :optype "numeric",
   :summary
   { ...
    :mean 58.35941,
    :median 58.00162,
    :minimum 29.96302,
    :maximum 83.55616,
     ... }}
    ... }
  ...
}

The first three functions in our script process the “input_fields” and “fields” properties of the input dataset metadata to generate a template map for specifying default values to the function that fills the missing values in the dataset.

The first function, (extract-meta …), is a helper function that accepts as its input parameter the submap for a single field, that is, the value stored under a field ID such as :000000 in the “fields” property:

(define (extract-meta mpi) 
  (let (mpis (get mpi "summary")
        mpos {"mean" (get mpis "mean")
              "median" (get mpis "median")
              "minimum" (get mpis "minimum")
              "maximum" (get mpis "maximum")})
    {"datatype" (get mpi "datatype") 
     "name" (get mpi "name")
     "optype" (get mpi "optype")
     "summary" mpos}))

The function extracts and returns just the contents we need for the corresponding field entry in our default value map:

{:datatype "double",
 :name "Employment Rate",
 :optype "numeric",
 :summary
 {:mean 58.35941,
  :median 58.00162,
  :minimum 29.96302,
  :maximum 83.55616}}

This map provides the minimum information you might find useful about the type and contents of a column.
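For readers more at home in Python, the same trimming step can be mirrored on a plain dictionary. This is a sketch rather than a translation of the script: the extra keys in the sample input are placeholders added to show what gets dropped, and returning None for a statistic that is absent is a choice made for this example, not a statement about what WhizzML’s (get …) returns.

```python
SUMMARY_KEYS = ("mean", "median", "minimum", "maximum")

def extract_meta(field):
    """Keep only the type information and four summary statistics."""
    summary = field.get("summary", {})
    return {
        "datatype": field.get("datatype"),
        "name": field.get("name"),
        "optype": field.get("optype"),
        # Statistics that are absent come back as None in this sketch.
        "summary": {key: summary.get(key) for key in SUMMARY_KEYS},
    }

field = {
    "datatype": "double",
    "name": "Employment Rate",
    "optype": "numeric",
    "extra_property": True,  # placeholder for metadata we do not need
    "summary": {
        "mean": 58.35941,
        "median": 58.00162,
        "minimum": 29.96302,
        "maximum": 83.55616,
        "extra_statistic": 3,  # placeholder, dropped by extract_meta
    },
}
meta = extract_meta(field)
```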

The next function (extract-meta-func …) is a factory function that returns a lambda function suitable for use in a WhizzML (reduce fn {…} [..]) function.

(define (extract-meta-func ds)
  (let (fields (get ds "fields"))
    (lambda (mp id)
      (let (mpi (get fields id)
            mpo (extract-meta mpi))
        (assoc mp id mpo)))))

This function creates a closure that captures the contents of the “fields” property of the dataset metadata map supplied as the “ds” input parameter (the map returned by fetching the dataset, not the dataset ID itself). The returned lambda function (lambda (mp id) …) accepts a partial metadata map “mp” and a column “id” (from the “input_fields” property of the dataset metadata map) as input parameters. It returns a new version of the input map augmented with the submap returned by the (extract-meta …) function for the column specified by “id”.
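The factory-plus-reduce pattern is easy to try outside the platform. The following Python sketch mirrors it on an in-memory dictionary standing in for the fetched dataset metadata. The second field, “Age”, and its statistics are made up for illustration, and the Python names are not part of the script.

```python
from functools import reduce

def extract_meta(field):
    """Trim one field's metadata to the parts the template needs."""
    keys = ("mean", "median", "minimum", "maximum")
    summary = field.get("summary", {})
    return {
        "datatype": field.get("datatype"),
        "name": field.get("name"),
        "optype": field.get("optype"),
        "summary": {k: summary.get(k) for k in keys},
    }

def extract_meta_func(ds):
    """Return a reducer step that closes over ds["fields"]."""
    fields = ds["fields"]

    def step(mp, field_id):
        # Build a new map instead of mutating mp.
        return {**mp, field_id: extract_meta(fields[field_id])}

    return step

# Stand-in for fetched dataset metadata; "Age" is a made-up second field.
ds = {
    "input_fields": ["000000", "000001"],
    "fields": {
        "000000": {"datatype": "double", "name": "Employment Rate",
                   "optype": "numeric",
                   "summary": {"mean": 58.35941, "median": 58.00162,
                               "minimum": 29.96302, "maximum": 83.55616}},
        "000001": {"datatype": "int8", "name": "Age", "optype": "numeric",
                   "summary": {"mean": 35.0, "median": 34.0,
                               "minimum": 18, "maximum": 65}},
    },
}

step = extract_meta_func(ds)
partial = step({}, "000000")                     # one application
template = reduce(step, ds["input_fields"], {})  # folded over every field ID
```

Because the step function only sees the accumulated map and one ID at a time, the closure is what gives it access to the per-field metadata without passing the whole dataset through the reduce.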

Our third function, (generate-configmap …), fetches the dataset metadata and repeatedly applies the function returned by (extract-meta-func …), once for each field ID in “input_fields”, to build up a template map for supplying default values to our dataset:

(define (generate-configmap dataset-id)
  (let (ds (fetch dataset-id)
        flds (get ds "input_fields")
        metafn (extract-meta-func ds))
    (reduce metafn {} flds)))

The result is a WhizzML map keyed by field ID, with one submap per column holding the minimum metadata for that column. For each field, we can then add a property “default” to its submap to specify the value that should be plugged into the rows of the dataset that are missing a value in that column:

{:000000
 {:datatype "double",
  :name "Employment Rate",
  :optype "numeric",
  :summary
  {:mean 58.35941,
   :median 58.00162,
   :minimum 29.96302,
   :maximum 83.55616},
  :default 0.0}
    ... }
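How the “default” entries get chosen is up to the application. As one possible approach, the Python sketch below fills them in from a per-field policy. The policy itself (median for numeric fields, a fixed label otherwise), the helper names, and the categorical “Region” field are all assumptions made for illustration.

```python
import copy

def choose_default(meta):
    """Illustrative policy: median for numeric fields, a label otherwise."""
    if meta["optype"] == "numeric":
        return meta["summary"]["median"]
    return "unknown"

def with_defaults(template, choose):
    """Return a copy of the template with a "default" added per field."""
    result = copy.deepcopy(template)  # leave the template untouched
    for meta in result.values():
        meta["default"] = choose(meta)
    return result

# "Region" is a made-up categorical field used only for illustration.
template = {
    "000000": {"datatype": "double", "name": "Employment Rate",
               "optype": "numeric",
               "summary": {"mean": 58.35941, "median": 58.00162,
                           "minimum": 29.96302, "maximum": 83.55616}},
    "000001": {"datatype": "string", "name": "Region",
               "optype": "categorical",
               "summary": {"mean": None, "median": None,
                           "minimum": None, "maximum": None}},
}
config = with_defaults(template, choose_default)
```

Swapping in a different choose function changes the policy per field, which is the flexibility a single dataset-wide strategy cannot provide.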

Filling Missing Dataset Values with Flatline

Once we have a map that explicitly specifies the default values for the columns of our dataset, we can use the fourth function in the example WhizzML script (fill-missing …) to create a new dataset with all missing values in the source dataset specified by “dataset-id” replaced with the default values in the “dflt-mp” map:

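(define (fill-missing dataset-id dflt-mp)
  ;; frdce appends one field's name and default value to the running list
  (let (frdce (lambda (lst itm)
                (let (dkey (get itm "name")
                      dval (get itm "default"))
                  (append (append lst dkey) dval)))
        ;; fargs is the flat list [name-0 default-0 name-1 default-1 ...]
        fargs (reduce frdce [] (values dflt-mp)))
    (log-info fargs)
    ;; Build the new dataset entirely from the Flatline expression
    (create-and-wait-dataset {"origin_dataset" dataset-id
                              "all_fields" false
                              "new_fields" [{"fields" (flatline "(all-with-defaults @{{fargs}})")}]})))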

This function first declares a function frdce that is used in a WhizzML (reduce …) function to extract a WhizzML list fargs of sequential per-column name-value pairs.
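The flattening step is the part most easily checked by hand. This plain-Python mirror of the reduce assumes a defaults map shaped like the one above. The second field, “Age”, is invented for the example, and the submaps are trimmed to the two keys the step reads.

```python
from functools import reduce

def to_fargs(dflt_mp):
    """Flatten {id: {"name": ..., "default": ...}} into [name, default, ...]."""
    def step(lst, item):
        # Append the field name, then its default, to the running list.
        return lst + [item["name"], item["default"]]

    return reduce(step, dflt_mp.values(), [])

# Trimmed defaults map; "Age" is a made-up second field.
dflt_mp = {
    "000000": {"name": "Employment Rate", "default": 0.0},
    "000001": {"name": "Age", "default": 35},
}
fargs = to_fargs(dflt_mp)
```

Note that the field IDs are discarded at this point: only the names and defaults survive, so the names act as the field designators in the resulting expression.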

The heart of our example (fill-missing …) function is the WhizzML (create-and-wait-dataset …) function, which creates a modified copy of the source dataset with our default values inserted. According to the BigML API documentation for the Dataset arguments used to extend a dataset, a false value for the “all_fields” argument specifies that none of the input fields of the source dataset should be passed directly to the new dataset. The “new_fields” argument specifies new fields that should be added to the new dataset by using Flatline.

Our example function uses a form of the “new_fields” argument, [{“fields” (flatline …)}], whose single WhizzML map specifies values for all of the fields in the new dataset with one Flatline expression. The (flatline …) function accepts a single string argument that is passed to the BigML backend at execution time. The string argument “(all-with-defaults @{{fargs}})” in turn incorporates a WhizzML macro form, where fargs is the WhizzML list of sequential per-column name-value pairs built earlier in the function. When the (flatline …) function is executed, WhizzML expands the string argument with the value of fargs. The (create-and-wait-dataset …) function then passes the resulting string, as the value of the “new_fields” argument, to the BigML backend along with the other arguments. The BigML platform backend generates the new dataset with the default values inserted by using whatever optimizations it can.
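After expansion, the arguments handed to the backend reduce to a small map. The Python sketch below assembles that map from an already-expanded expression string. The helper name dataset_args is hypothetical, the dataset ID is a placeholder, and the sketch stops short of making any API call.

```python
def dataset_args(origin_dataset, flatline_expr):
    """Assemble the argument map described above (illustrative only)."""
    return {
        "origin_dataset": origin_dataset,
        "all_fields": False,  # carry no source fields over unchanged
        # A single expression produces every field of the new dataset.
        "new_fields": [{"fields": flatline_expr}],
    }

args = dataset_args(
    "dataset/<id>",  # placeholder, not a real dataset ID
    '(all-with-defaults "Employment Rate" 0.0)',
)
```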

A Final Comment

Our WhizzML script is primarily intended as an example of how you can use WhizzML and Flatline to process datasets on the BigML platform backend. In your applications, you may want to compute default values in other ways or perform other data manipulations. For instance, you may want to compute default values on a row-basis by using Flatline rather than on a column-basis. Most data manipulations can be accomplished by using WhizzML and Flatline, but some computations may be harder to implement than others. We will take up other ways to use WhizzML and Flatline to facilitate Machine Learning tasks in subsequent WhizzML demonstration scripts and blog posts.
