Introduction to Data Transformations

Posted by

In this post, we’ll do a quick introduction to Data Transformations before we move on to the remainder of our series of 6 blog posts (including this one) to give you a detailed perspective of what’s behind the new capabilities. Today’s post explains the basic concepts that will be followed by an example use case. Then, there will be three more blog posts focused on how to use Data Transformations through the BigML DashboardAPI, and WhizzML for automation. Finally, we will complete this series of posts with a nice example of how to prepare data for Machine Learning.

Understanding Data Transformations

Transforming your data is one of the most important, yet time-consuming and difficult tasks in any Machine Learning workflow. Of course, “data transformations” is a loaded phrase and entire books are authored on the topic. When mentioned in the Machine Learning context, what we mean is a collection of actions that can be performed compositionally on your input data to make it more responsive to various modeling tasks — if you will, these are the methods to optimally prepare or pre-process your data.

data transformations

To remind, BigML already offers several automatic data preparation options (missing values treatment, categorical fields encoding, date-time fields expansion, NLP capabilities, and even a full domain-specific language for feature generation in Flatline) as well as useful dataset operations such as sampling, filtering, and the addition of new fields. Despite those, we’ve been looking to add more capabilities for full-fledged feature engineering within the platform.

Well, the time has come! This means the powerful set of supervised and unsupervised learning techniques we’ve built from scratch over the last 7 years all stand to benefit from data better prepared to make the most of them. Without further adieu, let’s see what goodies made it to this release:

Aggregating Datasets

  • Aggregating instances: at times aggregating highly granular data at higher levels can be necessary. When that happens, you can group your instances by a given field and perform various operations on the other fields. For example, you may want to aggregate sales figures by product and perform further operations on the resulting dataset before applying Machine Learning techniques such as Time Series.
  • Joining datasets: if your data comes from different sources, in multiple datasets you need to join said datasets by defining a join field. For instance, imagine you have a dataset containing user profile information such as account creation date, age, sex, country, and another dataset that contains transactions that belong to those users with critical fields like transaction date, payment type, amount and more. If you’d rather have all those fields in a single dataset, you can join those datasets based on a common field such as customer_id.
  • Merging datasets: if you have multiple datasets to process to create with the same fields, then you may want to concatenate those before you continue your workflow. Take for example a situation where daily files of sensor data need to be collated into a single monthly file before you can proceed. This would be a breeze with the new merge capability built into the BigML Dashboard.
  • Sliding windows: Creating new features using sliding windows is one of the most common feature engineering techniques in Machine Learning. It is usually applied to frame time series data using previous data points as new input fields to predict the next time window data points. For instance, in predicting hospital re-admissions, we may want to break the healthcare data into weekly windows and see how those weekly signals are correlated with the likelihood of re-admission in the following weeks after the patient is released from the hospital.
  • SQL support: This is big! BigML now supports all the operations from PostgreSQL, which means you have the full power of SQL at your disposal through the BigML REST API. You can choose between writing a free-form SQL query or use the JSON-like formulas that the BigML API supports. You can also easily see the associated SQL queries that created a given dataset and even apply them to other datasets — more on those in the subsequent blog posts.

NOTE: Keep this under the wraps for now, but before you know it, the Dashboard will be supporting other capabilities such as ordering instances, removing duplicate instances, and more!

Want to know more about Data Transformations?

If you have any questions or you would like to learn more about how Data Transformations work, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow, as well as the full webinar recording.

One comment

  1. Reblogged this on BLACK BOX PARADOX and commented:
    Data preparation is a key task in any Machine Learning workflow, but it’s often one of the most challenging and time-consuming parts. Read below about BigML’s new data transformation capabilities that make work faster and easier than ever before.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s