Automating Data Transformations with WhizzML and the BigML Python Bindings

This is the fifth post in our six-part series about BigML’s new release: Data Transformations. This time we focus on the Data Preparation step that precedes any Machine Learning project, in which a new and improved dataset is created from an existing one.

[Figure: CRISP-DM diagram]

The data preparation phase is key to achieving good performance with your predictive models. Moreover, a wide variety of operations may be needed, since data rarely arrive ready to use or already containing the fields required to build Machine Learning models. Aware of that, in 2014 BigML introduced Flatline, a DSL specifically designed for data transformations. Over the years, Flatline has grown and the number of operations it supports has increased. In this release, we improved its sliding window operations and added the ability to use a subset of SQL, which enables a new range of transformations such as joins, aggregations, or adding rows to an existing dataset.

In this blog post, we will learn step-by-step how to automate these data transformations programmatically using WhizzML, BigML’s Domain Specific Language for Machine Learning automation, and the official Python Bindings.

Adding Rows: Merge Datasets

When you want to add data to an existing dataset that is already on the platform, you can use the following code. This is useful, for example, when data are collected in periods or when the same kind of data comes from different sources.

;; creates a dataset by merging two existing datasets
(define merged-dataset
  (create-dataset {"origin_datasets"
                   ["dataset/5bca3fb3421aa94735000003"
                    "dataset/5bcbd2b5421aa9560d000000"]}))

The equivalent code in Python is:

# merge all the rows of two datasets
from bigml.api import BigML

api = BigML()  # credentials are read from the environment

api.create_dataset(
    ["dataset/5bca3fb3421aa94735000003",
     "dataset/5bcbd2b5421aa9560d000000"]
)

As we saw in previous posts, the BigML API is mostly asynchronous, which means that the call returns the ID of the new dataset before its creation is complete; the analysis of the fields and their summaries continues after the code snippet has run. You can use the directive “create-and-wait-dataset” to make sure that the datasets have been fully merged:

;; creates a dataset from two existing datasets and
;; once it's completed its ID is saved in the merged-dataset variable
(define merged-dataset
  (create-and-wait-dataset {"origin_datasets"
                            ["dataset/5bca3fb3421aa94735000003"
                             "dataset/5bcbd2b5421aa9560d000000"]}))

The equivalent code in Python is:

# merge all the rows of two datasets and store the ID of the
# new dataset in merged_dataset variable
merged_dataset = api.create_dataset(
    ["dataset/5bca3fb3421aa94735000003",
     "dataset/5bcbd2b5421aa9560d000000"]
)
api.ok(merged_dataset)

When you merge datasets, you can set several parameters, which are described in the Multi-datasets section of the API documentation. For instance, we can configure a merged dataset in WhizzML by setting the sample rates, using the same pattern of <property_name> and <property_value> pairs we used in the first example.

;; creates a dataset from two existing datasets
;; setting the sample rate for each one
;; once it's completed its ID is saved in the merged-dataset variable
(define merged-dataset
  (create-and-wait-dataset {"origin_datasets"
                            ["dataset/5bca3fb3421aa94735000003"
                             "dataset/5bcbd2b5421aa9560d000000"]
                            "sample_rates"
                            {"dataset/5bca3fb3421aa94735000003" 0.6
                             "dataset/5bcbd2b5421aa9560d000000" 0.8}}))

The equivalent code in Python is:

# creates a merged dataset specifying the sample rate of
# each of the original datasets
merged_dataset = api.create_dataset(
    ["dataset/5bca3fb3421aa94735000003", "dataset/5bcbd2b5421aa9560d000000"],
    {
        "sample_rates": {
            "dataset/5bca3fb3421aa94735000003": 0.6,
            "dataset/5bcbd2b5421aa9560d000000": 0.8
        }
    }
)
api.ok(merged_dataset)

Denormalizing Data: Join Datasets

Data are commonly stored in relational databases, following the normal-form paradigm to avoid redundancy. Nevertheless, for Machine Learning workflows, data need to be denormalized.

BigML now lets you perform this process in the cloud as part of your workflow, codified in WhizzML or with the Python Bindings, using Structured Query Language (SQL) expressions. See below how it works, assuming we have two different datasets in BigML that we want to put together, both sharing an `employee_id` field whose field ID is 000000:

;; creates a new dataset by joining two datasets
(define joined-dataset
  (create-dataset {"origin_datasets"
                   ["dataset/5bca3fb3421aa94735000003"
                    "dataset/5bcbd2b5421aa9560d000000"]
                   "origin_dataset_names"
                   {"dataset/5bca3fb3421aa94735000003" "A"
                    "dataset/5bcbd2b5421aa9560d000000" "B"}
                   "sql_query"
                   "SELECT A.* FROM A LEFT JOIN B ON A.`000000` = B.`000000`"}))

The equivalent code in Python is:

# creates a new dataset by joining two datasets
api.create_dataset(
    ["dataset/5bca3fb3421aa94735000003", "dataset/5bcbd2b5421aa9560d000000"],
    {
        "origin_dataset_names": {
            "dataset/5bca3fb3421aa94735000003": "A",
            "dataset/5bcbd2b5421aa9560d000000": "B"
        },
        "sql_query":
            "SELECT A.* FROM A LEFT JOIN B ON A.`000000` = B.`000000`"
    }
)
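Once the join finishes, you may want to inspect the columns of the resulting dataset. Below is a minimal helper sketch for that; the `"object" -> "fields"` structure matches the dataset JSON returned by the bindings, and the API calls are commented out because they require BigML credentials.

```python
# Helper sketch: list the column names of a finished BigML dataset.

def field_names(resource):
    """Return the sorted field names of a BigML dataset resource."""
    fields = resource["object"]["fields"]
    return sorted(f["name"] for f in fields.values())

# With credentials configured, it would be used like this:
#   from bigml.api import BigML
#   api = BigML()
#   joined = api.get_dataset("dataset/5bca3fb3421aa94735000003")
#   api.ok(joined)
#   print(field_names(joined))
```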

Aggregating Instances

Using SQL opens up a wide range of operations on your data, such as selections, value transformations, and row groupings, among others. For instance, in some situations we need to compute statistics over groups of rows defined by the value of a specific field. This transformation is commonly known as aggregation, and the SQL keyword for it is GROUP BY. See below how to use it in WhizzML, assuming we manage a dataset with company data where field 000001 is the department and field 000005 is the employee ID.

;; creates a new dataset aggregating the instances
;; of the original one by the field 000001
(define aggregated-dataset
  (create-dataset {"origin_datasets"
                   ["dataset/5bcbd2b5421aa9560d000000"]
                   "origin_dataset_names"
                   {"dataset/5bcbd2b5421aa9560d000000" "DS"}
                   "sql_query"
                   "SELECT `000001`, COUNT(`000005`) FROM DS GROUP BY `000001`"}))

The equivalent code in Python is:

# creates a new dataset aggregating the instances
# of the original one by the field 000001
api.create_dataset(
    ["dataset/5bcbd2b5421aa9560d000000"],
    {
        "origin_dataset_names": {"dataset/5bcbd2b5421aa9560d000000": "DS"},
        "sql_query":
            "SELECT `000001`, COUNT(`000005`) FROM DS GROUP BY `000001`"
    }
)

You can use field names in the queries, but field IDs are preferred to avoid ambiguities. You can also define aliases for the new fields with the SQL keyword AS, placed after the corresponding operation. Note that SQL lets you perform far more complex operations than the ones demonstrated in this post.
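As an illustration, the aggregation from the previous section could use AS aliases so the output columns get readable names. This is a sketch: the alias names "department" and "headcount" are our own, and the API calls are commented out because they require BigML credentials.

```python
# Arguments for an aggregation whose output columns are renamed with
# AS aliases (the alias names are illustrative, not from the dataset).
args = {
    "origin_dataset_names": {"dataset/5bcbd2b5421aa9560d000000": "DS"},
    "sql_query": ("SELECT `000001` AS department, "
                  "COUNT(`000005`) AS headcount "
                  "FROM DS GROUP BY `000001`")
}

# With credentials configured:
#   from bigml.api import BigML
#   api = BigML()
#   aggregated = api.create_dataset(
#       ["dataset/5bcbd2b5421aa9560d000000"], args)
#   api.ok(aggregated)
```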

Want to know more about Data Transformations?

If you have any questions or you would like to learn more about how Data Transformations work, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow, as well as the full webinar recording.
