WhizzML Tutorial I: Automated Dataset Transformation


I hope you’re as excited as I am to start using WhizzML to automate BigML workflows! (If you don’t know what WhizzML is yet, I suggest you check out this article first.) In this post, we’ll write a simple WhizzML script that automates a dataset transformation process.

As those of you who have dealt with datasets in a production environment know, sometimes a field is missing so much data that we want to ignore it altogether. Luckily, BigML will detect useless fields like this and ignore them automatically when we create a predictive model. But what if we want to specify the required “completeness” of a field ourselves? For example, what if we only want to include fields that have values for more than 95% of the rows?

We can use WhizzML!

Let’s do it! Refer to the WhizzML Reference Guide if you need help along the way. Also, the source code can be found in this GitHub Gist.

We want to write a function that, given a dataset and a specified threshold (e.g., 0.95), returns a new dataset containing only the fields that are populated above that threshold (here, more than 95% of the rows). Our top-level function is defined below.

filtered-dataset

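Here is a sketch of it, assuming the dataset creation arguments "origin_dataset" (build from an existing dataset) and "excluded_fields" (fields to leave out):

    ;; Sketch: build a new dataset from dataset-id, leaving out the fields
    ;; that excluded-fields (defined later) flags as too sparse.
    (define (filtered-dataset dataset-id threshold)
      (create-dataset {"origin_dataset" dataset-id
                       "excluded_fields" (excluded-fields dataset-id threshold)}))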

Hey, slow down!

Ok. Let’s take it step-by-step. We define a new function called filtered-dataset that takes two arguments: our starting dataset, dataset-id, and a threshold (e.g., 0.95).

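In sketch form:

    (define (filtered-dataset dataset-id threshold)
      ...)  ;; body filled in below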

What do we want this function to do? We want it to return a new dataset, hence:

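Something like:

    (define (filtered-dataset dataset-id threshold)
      (create-dataset ...))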

But we don’t just want any old dataset, we want one based off our old dataset:

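Roughly, assuming "origin_dataset" is the creation argument that points at the existing dataset:

    (define (filtered-dataset dataset-id threshold)
      (create-dataset {"origin_dataset" dataset-id}))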

And we also want to exclude some fields from our old dataset!

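Assuming the "excluded_fields" creation argument is where that list of fields goes:

    (define (filtered-dataset dataset-id threshold)
      (create-dataset {"origin_dataset" dataset-id
                       "excluded_fields" ...}))  ;; but which fields?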

Ah, but which fields do we want to exclude? We can let a new function called excluded-fields figure that out for us. But for now, all we need to know is that this new function (excluded-fields) takes two arguments: our original dataset and our specified threshold.

The line above becomes:

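So the create-dataset call reads:

    (create-dataset {"origin_dataset" dataset-id
                     "excluded_fields" (excluded-fields dataset-id threshold)})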

As we progress, keep in mind that we want this new function (excluded-fields) to return a list of field names (e.g., ["field_1" "field_2" "field_3"]).

Great! We have defined our base function. Now we have to tell our new function, excluded-fields, how to produce the list we want.

excluded-fields

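A sketch of it (present-percent is a helper we will define shortly):

    (define (excluded-fields dataset-id threshold)
      (let (data (fetch dataset-id)
            all-field-names (get data "input_fields")
            total-rows (get data "rows"))
        (filter (lambda (field-name)
                  (> threshold (present-percent data field-name total-rows)))
                all-field-names)))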

Wow what?
You can use that code for reference, but don’t be intimidated. We’ll go over each piece. First we define the function, declaring its two arguments: our original dataset, and the threshold we want to use.

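In sketch form:

    (define (excluded-fields dataset-id threshold)
      ...)  ;; body filled in below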

Before we write any more code, let’s talk about the meat of this function. We want to look at all the fields (columns) of this dataset and find the ones that are missing too much data. We’ll keep the names of these “bad” fields so that we can exclude them from our new dataset. To do this, we can use the function filter. It takes two arguments, a predicate (a predicate is like a test) and a list, and returns a new list of the elements that satisfy the predicate. In our case, the predicate is that the field has less than 95% of its rows populated.

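Roughly:

    (filter <predicate> <list>)  ;; both placeholders filled in below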

The predicate should be a function that either evaluates to true or false based on each element of the list we pass to it. If the predicate returns true, then that element of the list is kept. Otherwise, it is thrown out.

We can define the predicate function using lambda. A lambda is like any other function definition: we have to tell it the name of the thing we are passing into it

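Something like:

    (lambda (field-name) ...)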

and also tell it what we are going to do with that thing.

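With a placeholder for the part we have not built yet:

    (lambda (field-name)
      (> threshold <percent-of-data-that-the-field-has>))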

In our case, we are checking to see if the threshold is greater than the percentage of data present. We will keep the field names that do not have enough data (because, remember, these are the fields that will be excluded from our new dataset). Two things are still missing from our filter:

  1. all-field-names
  2. <percent-of-data-that-the-field-has>

How do we get these? The first isn’t too difficult because BigML Datasets have this information readily available. We just have to “fetch” it from BigML first.

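That is just:

    (fetch dataset-id)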

and then specify which value we want to get.

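Something like:

    (get (fetch dataset-id) "input_fields")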

Nice. To figure out what percent of the rows are populated for a specific field, we get to… Define a new function! But before we do that, let’s talk about some things we skipped over in our excluded-fields function. Here it is again, for convenience.

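Same sketch as before:

    (define (excluded-fields dataset-id threshold)
      (let (data (fetch dataset-id)
            all-field-names (get data "input_fields")
            total-rows (get data "rows"))
        (filter (lambda (field-name)
                  (> threshold (present-percent data field-name total-rows)))
                all-field-names)))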

What is let?

let is the method for declaring local variables in WhizzML.

  1. We set the value of data to the result of (fetch dataset-id).
  2. We set the value of all-field-names to the result of (get data "input_fields")
  3. We set the value of total-rows to the result of (get data "rows"). (We didn’t talk about this yet. It’s one of the values we need to pass to the present-percent function)

let is useful for a couple of reasons in this function. First, we use data twice, so we avoid writing (fetch dataset-id) twice. Second, naming these variables at the top of the function makes the rest much easier to read and comprehend!

So to wrap up this excluded-fields function, let’s talk through what it does again.
First, it declares the local variables we’ll need. Then, it takes the list of all-field-names and filters it with a function that checks each field’s “present percent” of data points. We keep the names of the fields whose present percent falls below the threshold, since these are exactly the fields we want to exclude from our new dataset. Cool! Now we’ll go over that present-percent function.

present-percent

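A sketch, leaning on a missing-count helper we will define next:

    (define (present-percent data field-name total-rows)
      (let (fields (get data "fields"))
        (- 1 (/ (missing-count field-name fields) total-rows))))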

Ah. Not so bad. To calculate the percentage of data points that are present in a given field, we need a few things:

  1. The big collection of data from our dataset (data).
  2. The name of the field we are inspecting (field-name).
  3. The total number of rows in our dataset (total-rows).

We’ll set another local variable using let and call it fields. This is another object containing data about each of the fields. We’ll be using it below.

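Something like:

    (let (fields (get data "fields"))
      ...)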

Then, we divide the missing-count from the field by the total-rows. This gives us a “missing percent”.

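Roughly:

    (/ (missing-count field-name fields) total-rows)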

We subtract the “missing percent” from 1 and that gives us the “present percent”!

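Which gives us:

    (- 1 (/ (missing-count field-name fields) total-rows))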

But “missing-count” is another function!

Yes it is!

missing-count
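A sketch, assuming each field’s summary stores the count under "missing_count":

    (define (missing-count field-name fields)
      (get (get (get fields field-name) "summary") "missing_count"))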

 

missing-count takes two arguments: first, the name of the field we are inspecting (field-name); second, the fields object we mentioned earlier, which holds a bunch of information about each of the dataset’s fields. To get the count of missing rows of data in the field, we do this:

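Nested get calls, in sketch form:

    (get (get (get fields field-name) "summary") "missing_count")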

It lets us access an inner value (e.g., 10) from a data object structured like so:

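For illustration only (the names and counts here are made up), the fields object is shaped roughly like this:

    {"field_1" {"summary" {"missing_count" 10
                           ...}
                ...}
     "field_2" {...}
     ...}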

And… That’s it! We have now written all the pieces to make our filtered-dataset function work! All together, the code should look like this:

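Here is the full sketch, with the assumed resource keys noted in the comments:

    ;; Sketch of the full script. The "origin_dataset" and "excluded_fields"
    ;; creation arguments, and the "summary"/"missing_count" keys inside each
    ;; field descriptor, are the names assumed here.

    ;; Number of missing rows for one field, read from the fields object.
    (define (missing-count field-name fields)
      (get (get (get fields field-name) "summary") "missing_count"))

    ;; Fraction of rows that are populated for one field.
    (define (present-percent data field-name total-rows)
      (let (fields (get data "fields"))
        (- 1 (/ (missing-count field-name fields) total-rows))))

    ;; Names of the fields whose present percent falls below the threshold.
    (define (excluded-fields dataset-id threshold)
      (let (data (fetch dataset-id)
            all-field-names (get data "input_fields")
            total-rows (get data "rows"))
        (filter (lambda (field-name)
                  (> threshold (present-percent data field-name total-rows)))
                all-field-names)))

    ;; Create a new dataset from dataset-id without the sparse fields.
    (define (filtered-dataset dataset-id threshold)
      (create-dataset {"origin_dataset" dataset-id
                       "excluded_fields" (excluded-fields dataset-id threshold)}))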

And we can run it like this:

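With a placeholder standing in for your own dataset id:

    (filtered-dataset "dataset/<your-dataset-id>" 0.95)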

And get a result like "dataset/574317c346522fcd53000102": a new dataset without those sparse fields. I can add this script to my BigML Dashboard and run it with one click. Or I can put it in a library and incorporate it into a more advanced workflow. Awesome!

Stay tuned for more blog posts like this that will help you get started automating your own Machine Learning workflows and algorithms.
