
Introduction to Principal Component Analysis: Dimensionality Reduction Made Easy

BigML’s upcoming release on Thursday, December 20, 2018, will bring our latest resource to the platform: Principal Component Analysis (PCA). In this post, we’ll give a quick introduction to PCA before moving on to the remainder of this series of 6 blog posts (including this one), which offers a detailed perspective of what’s behind the new capabilities. Today’s post explains the basic concepts and will be followed by an example use case. Then, there will be three more blog posts focused on how to use PCA through the BigML Dashboard, the API, and WhizzML for automation. Finally, we will complete this series of posts with a technical view of how PCAs work behind the scenes.

Understanding Principal Component Analysis

Many datasets in fields as varied as bioinformatics, quantitative finance, portfolio analysis, or signal processing can contain an extremely large number of variables that may be highly correlated, resulting in sub-optimal Machine Learning performance. Principal Component Analysis (PCA) is one technique that can be used to transform such a dataset in order to obtain uncorrelated features, or as a first step in dimensionality reduction.

PCA Introduction

Because PCA transforms the variables in a dataset without accounting for a target variable, it can be considered an unsupervised Machine Learning method suitable for exploratory analysis of complex datasets. When used for dimensionality reduction, it also helps reduce supervised model overfitting, as fewer relationships between variables remain to be considered after the process. To support this, the principal components yielded by a PCA transformation are ordered by the amount of variance each explains in the original dataset, so the practitioner can decide how many of the new component features can be eliminated from a dataset while preserving most of the original information contained in it.

Even though they are all grouped under the same umbrella term (PCA), BigML’s implementation incorporates multiple factor analysis techniques under the hood, rather than only the standard PCA implementation. Specifically:

  • Principal Component Analysis (PCA): BigML utilizes this option if the input dataset contains only numerical data.
  • Multiple Correspondence Analysis (MCA): this option is available if the input dataset contains only categorical data.
  • Factorial Analysis of Mixed Data (FAMD): this option is applied if the input dataset contains both numeric and categorical fields.

In the case of items and text fields, data is processed using a bag-of-words approach allowing PCA to be applied. Because of this nuanced approach, BigML can handle categorical, text, and items fields in addition to numerical data in an automatic fashion that does not require manual intervention by the end user.
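As a rough illustration of the bag-of-words idea (not BigML’s internal code), the following Python sketch turns a handful of short texts into a purely numeric term-count matrix, which is the kind of representation a PCA-style technique can then consume:

# Toy bag-of-words illustration: each text becomes a row of term counts,
# i.e. a purely numeric matrix suitable for PCA-style techniques.
from collections import Counter

texts = [
    "great service and great food",
    "slow service",
    "food was great",
]

# the vocabulary is the set of all terms seen across the texts
vocabulary = sorted({term for text in texts for term in text.split()})

# one row per text, one column per vocabulary term
matrix = [[Counter(text.split())[term] for term in vocabulary] for text in texts]

print(vocabulary)
for row in matrix:
    print(row)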

Principal Component Analysis (PCA)

Want to know more about PCA?

If you would like to learn more about Principal Component Analysis and see it in action on the BigML Dashboard, please reserve your spot for our upcoming release webinar on Thursday, December 20, 2018. Attendance is FREE of charge, but space is limited so register soon!

Principal Component Analysis (PCA): Dimensionality Reduction!

The new BigML release is here! Join us on Thursday, December 20, 2018, at 10:00 AM PST (Portland, Oregon. GMT -08:00) / 07:00 PM CET (Valencia, Spain. GMT +01:00) for a FREE live webinar to discover the latest addition to the BigML platform. We will be showcasing Principal Component Analysis (PCA), a key unsupervised Machine Learning technique used to transform a given dataset in order to yield uncorrelated features and reduce dimensionality. PCA is most commonly applied in fields with high dimensional data including bioinformatics, quantitative finance, and signal processing, among others.

 

Principal Component Analysis (PCA), available on the BigML Dashboard, API and WhizzML for automation as of December 20, 2018, is a statistical technique that transforms a dataset defined by possibly correlated variables (whose noise negatively affects the performance of your model) into a set of uncorrelated variables, called principal components. This technique is used as the first step in dimensionality reduction, especially for those datasets with a large number of variables, which helps improve the performance of supervised models due to noise reduction. As such, PCA can be used in any industry vertical as a preprocessing technique in the data preparation phase of your Machine Learning projects.
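As a rough sketch of that preprocessing step, the snippet below uses the BigML Python bindings to build a PCA from an existing dataset and then project that dataset onto its principal components. It assumes the bindings expose PCA through create_pca and create_batch_projection methods analogous to other resources, and the dataset ID is a placeholder, so please check the bindings documentation for the exact options available at release time.

# Hedged sketch: PCA as a preprocessing step with the BigML Python bindings.
# Assumes create_pca / create_batch_projection are exposed like other
# resources; the dataset ID below is a placeholder.
from bigml.api import BigML

api = BigML()  # credentials are read from the environment

origin_dataset = "dataset/5bca3fb3421aa94735000003"  # hypothetical ID

pca = api.create_pca(origin_dataset)  # fit the principal components
api.ok(pca)

# project the original rows onto the components and store the result as a
# new dataset; output_dataset is assumed to behave as for other batch
# resources (batch predictions, batch centroids, etc.)
projected = api.create_batch_projection(pca, origin_dataset,
                                        {"output_dataset": True})
api.ok(projected)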

 

BigML PCA is distinct from other implementations of the PCA algorithm: our Machine Learning platform lets you transform many different data types in an automatic fashion that does not require manual configuration. That is, BigML’s unique approach can handle numeric and non-numeric data types, including text, categorical, and items fields, as well as combinations of different data types. To do so, BigML PCA incorporates multiple factor analysis techniques, specifically Multiple Correspondence Analysis (MCA) if the input contains only categorical data, and Factorial Analysis of Mixed Data (FAMD) if the input contains both numeric and categorical fields.

 

When we work with high dimensional datasets, we often face the challenge of extracting the discriminative information in the data while removing the fields that only add noise and make it difficult for the algorithm to achieve the expected performance. PCA is ideal for these situations. While a PCA transformation maintains the dimensions of the original dataset, it is typically applied with the goal of dimensionality reduction. Reducing the dimensions of the feature space is one way to help reduce supervised model overfitting, as there are fewer relationships between variables to consider. The principal components yielded by a PCA transformation are ordered by the amount of variance each explains in the original dataset. Scree plots of the variance explained by each component, together with the cumulative variance explained, are one way to determine appropriate thresholds for how many of the new features can be eliminated from a dataset while preserving most of the original information.
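For intuition on how such thresholds behave, here is a small standalone NumPy sketch (independent of BigML) that computes the variance explained by each component via an SVD and keeps just enough components to reach a 95% cumulative threshold; the 95% figure is only an example value:

# Standalone illustration (not BigML code): PCA via SVD and a cumulative
# explained-variance threshold to decide how many components to keep.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                            # toy data: 200 rows, 10 fields
X[:, 5:] = X[:, :5] + 0.1 * rng.normal(size=(200, 5))     # make half the fields redundant

Xc = X - X.mean(axis=0)                                   # center the data
_, singular_values, _ = np.linalg.svd(Xc, full_matrices=False)

explained = singular_values ** 2 / np.sum(singular_values ** 2)
cumulative = np.cumsum(explained)                         # the values a scree-style plot shows

n_components = int(np.searchsorted(cumulative, 0.95)) + 1 # smallest count reaching 95%
print(explained.round(3), n_components)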

Want to know more about PCA?

Please join our free, live webinar on Thursday, December 20, 2018, at 10:00 AM PT.  Register today as space is limited! Stay tuned for our next 6 blog posts that will gradually present PCA and how to benefit from it using the BigML platform.

Note: In response to user inquiries, we are including links here to the datasets featured in the two images above showing the BigML Dashboard. For the first, we filtered a subset of AirBNB data fields available on Inside AirBNB, and for the second, the Arrhythmia diagnosis dataset is available in the BigML Gallery. We hope you enjoy exploring the data on your own!

Preparing Data for Machine Learning with BigML

At BigML we’re well aware that data preparation and feature engineering are key steps for the success of any Machine Learning project. A myriad of splendid tools can be used for the data massaging needed before modeling. However, in order to simplify the iterative process that leads from the original data to an ML-ready dataset, our platform has recently added more data transformation capabilities. By using SQL statements, you can now aggregate instances, remove duplicates, join and merge datasets, and create new features from your existing fields. Combining these new abilities with Flatline, the existing transformation language, and the platform’s out-of-the-box automation and scalability will greatly help solve any real Machine Learning problem.
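As a quick taste of these SQL-based transformations, here is a hedged sketch (with a hypothetical dataset ID) that removes duplicate rows through the Python bindings with a SELECT DISTINCT query; joins and aggregations appear later in this post:

# Hedged sketch: de-duplicating a dataset with a SQL query through the
# BigML Python bindings; the dataset ID is a placeholder.
from bigml.api import BigML

api = BigML()

deduplicated = api.create_dataset(
    ["dataset/5bca3fb3421aa94735000003"],   # hypothetical origin dataset
    {"origin_dataset_names": {"dataset/5bca3fb3421aa94735000003": "A"},
     "sql_query": "SELECT DISTINCT * FROM A"})   # keep one copy of each row
api.ok(deduplicated)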

Data Transformations with BigML

The data: San Francisco Restaurants

Some time ago we wrote a post describing the kind of transformations needed to go from a bunch of CSV files, containing information about the inspections of restaurants and food businesses in San Francisco, to a dataset ready for Machine Learning. The data was published by San Francisco’s Department of Public Health and was structured in four different files:

  • businesses.csv: a list of restaurants or businesses in the city.
  • inspections.csv: inspections of some of the previous businesses.
  • violations.csv: law violations detected in some of the previous inspections.
  • ScoreLegend.csv: a legend to describe score ranges.

The post described how to build a dataset that could be used to do Machine Learning with them using MySQL. Let’s now compare how you could do the same using BigML’s newly added transformations.

Uploading the data

As explained in that post, the first thing that you need to do to use this data in MySQL is to define the structure of the tables where you will upload it, so you need to look at the contents of each column and assign the correct type after a detailed inspection of each CSV file. This means writing commands like this one for every CSV:

create table business_imp (business_id int, name varchar(1000), address varchar(1000), city varchar(1000), state varchar(100), postal_code varchar(100), latitude varchar(100), longitude varchar(100), phone_number varchar(100));

and some more to upload the data to the tables:

load data local infile '~/SF_Restaurants/businesses.csv' into table business_imp fields terminated by ',' enclosed by '"' lines terminated by '\n' ignore 1 lines (business_id,name,address,city,state,postal_code,latitude,longitude,phone_number);

and creating indexes to be able to do queries efficiently:

create index inx_inspections_businessid on inspections_imp (business_id);

The equivalent in BigML would be just dragging and dropping the CSVs into your Dashboard:

As a result, BigML infers for you the types associated with every column detected in each file. In addition, the assigned types are totally focused on the way the information will be treated by the Machine Learning algorithms. Thus, in the inspections table we see that the Score will be treated as a number, the type as a category, and the date is automatically separated into its year, month, and day components, which are the ones meaningful in ML processes.

We just need to verify the inferred types in case we want some data to be interpreted differently. For instance, the violations file contains a description text that includes information about the date the violation was corrected.

$ head -3 violations.csv
"business_id","date","description"
10,"20121114","Unclean or degraded floors walls or ceilings [ date violation corrected: ]"
10,"20120403","Unclean or degraded floors walls or ceilings [ date violation corrected: 9/20/2012 ]"

Depending on how you want to analyze this information, you can decide to leave it as it is, and contents will be parsed to produce a bag of words analysis, or set the text analysis properties differently and work with the full contents of the field.

term_analysis

As you see, so far BigML has taken care of most of the work: defining the fields in every file, their names, the types of information they contain, and parsing datetimes and text. The only remaining task we can think of now is taking care of the description field, which in this case combines information about two meaningful features: the real description and the date when the violation was corrected.

Now that the data dictionary has been checked, we can just create one dataset per source by using the 1-click dataset action.

1-c-dataset

Transforming the description data

The same transformations described in the above-mentioned post can now be applied using the BigML Dashboard. The first one is removing the [date violation corrected: …] substring from the violation’s description field. In fact, we can go further and use that string to create a new feature: the days it took for the violation to be corrected.

add_field

This kind of transformation was already available in BigML thanks to Flatline, our domain-specific transformation language.

editor

Using a regular expression, we can create a clean_description field by removing the date violation part:


(replace (f "description")
            "\\[ date violation corrected:.*?\\]"
            "")

Previewing the results of any transformation we define is easier than ever thanks to our improved Flatline Editor.

clean_description

By doing so, we discover that the new clean_description field is assigned a categorical type because its contents are not free text but a limited range of categories.

clean_desc

The second field is computed using the datetime capabilities of Flatline. The expression to compute the days it took to correct the violation is:


(/ (- (epoch (replace (f "description")
                      ".*\\[ date violation corrected: (.*?) \\]"
                      "$1") "MM/dd/YYYY")
      (epoch (f "date") "YYYYMMdd"))
   (* 1000 24 60 60))

day_to_correction

where we parse the date in the original description field and subtract the date the violation was registered on. The difference is stored in the new days_to_correction feature, to be used in the learning process.
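If you want to sanity-check that logic outside of Flatline, here is a hedged Python sketch that reproduces the same computation for a single row, assuming the description format shown above:

# Hedged local check of the days_to_correction logic (not Flatline):
# extract the correction date from the description and subtract the
# date the violation was registered on.
import re
from datetime import datetime

description = ("Unclean or degraded floors walls or ceilings "
               "[ date violation corrected: 9/20/2012 ]")
violation_date = "20120403"

match = re.search(r"\[ date violation corrected: (.*?) \]", description)
if match and match.group(1).strip():
    corrected = datetime.strptime(match.group(1).strip(), "%m/%d/%Y")
    registered = datetime.strptime(violation_date, "%Y%m%d")
    days_to_correction = (corrected - registered).days
    print(days_to_correction)  # prints 170 for this example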

Getting the ML-ready format

We’ve been working on a particular field of the violations table so far, but if we are to use that table to solve any Machine Learning problem that predicts some property about these businesses, we need to join all the available information into a single dataset. That’s where BigML‘s new capabilities come in handy, as we now offer joins, aggregations, merging, and duplicate removal operations.

In this case, we need to join the businesses table with the rest, and we realize that inspections and violations use the business_id field as the join key, so a regular join is possible. The join will keep all businesses, and every business can have zero or multiple related rows in the other two tables. Let’s join businesses and inspections:

join

Now, to have a truly ML-ready dataset, we still need to meet one requirement: the dataset needs a single row for every item we want to analyze. In this case, that means a single row per business. However, joining the tables has created multiple rows per business, one per inspection. We’ll need to apply some aggregations: counting inspections, averaging scores, etc.

aggreg
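The same kind of aggregation can also be expressed through the API. A hedged sketch, with a hypothetical dataset ID and the field names used above, might look like this, counting inspections and averaging the Score per business:

# Hedged sketch: aggregating the joined businesses + inspections dataset
# per business with a GROUP BY query; the dataset ID is a placeholder and
# field names are used instead of field IDs for readability.
from bigml.api import BigML

api = BigML()

business_inspections = "dataset/5bca3fb3421aa94735000003"   # hypothetical ID

aggregated = api.create_dataset(
    [business_inspections],
    {"origin_dataset_names": {business_inspections: "A"},
     "sql_query": ("SELECT `business_id`, COUNT(*) AS num_inspections, "
                   "AVG(`Score`) AS avg_score FROM A GROUP BY `business_id`")})
api.ok(aggregated)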

The same should be done for the violations table, where again each business can have multiple violations. For instance, we can aggregate the days that it took to correct the violations and the types of violation per business.

viol_aggr

And now, use a right join to add this information to every business record.

viol_join

Finally, the ScoreLegend table just provides a list of categories that can be used to discretize the scores into sensible ranges. We can easily add that to the existing table with a simple select * from A, B expression plus a filter to select the rows whose avg_score value falls between the Minimum_Score and Maximum_Score of each legend. In this case, we’ll use the more full-fledged API capabilities through the Python bindings.

# applying the sql query to the business + inspections + violations
# dataset and the ScoreLegend
from bigml.api import BigML
api = BigML()
# business_ml_ready and score_legend hold the IDs of the previously
# created datasets
legend_dataset = api.create_dataset( \
    [business_ml_ready,
     score_legend],
    {"origin_dataset_names": {
        business_ml_ready: "A",
        score_legend: "B"},
     "sql_query": "select * from A, B"})
api.ok(legend_dataset)

# filtering the rows where the score matches the corresponding legend
ml_dataset = api.create_dataset(\
    legend_dataset,
    {"lisp_filter": "(<= (f \"Minimum_Score\")" \
                    " (f \"avg_score\")" \
                    " (f \"Maximum_Score\"))"})
api.ok(ml_dataset)

With these transformations, the final dataset is at last Machine Learning ready and can be used to cluster restaurants into similar groups, find anomalous restaurants, or classify them according to their average score ranges. We can also generate new features, like the distance to the city center or the rate of violations per inspection, that can help to better describe the patterns in the data. Here’s the Flatline expression needed to compute the distance of each restaurant to the center of San Francisco using the Haversine formula:


(let (pi 3.141592
      lon_sf (/ (* -122.431297 pi) 180)
      lat_sf (/ (* 37.773972 pi) 180)
      lon (/ (* (f "longitude") pi) 180)
      lat (/ (* (f "latitude") pi) 180)
      dlon (- lon lon_sf)
      dlat (- lat lat_sf)
      a (+ (pow (sin (/ dlat 2.0)) 2) (* (cos lat_sf) (cos lat) (pow (sin (/ dlon 2.0)) 2)))
      c (* 2 (/ (sqrt a) (sqrt (- 1 a)))))
 (* 6373 c))
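For cross-checking values locally, here is a standalone Python version of the distance computation using the textbook atan2 form of the haversine formula, which should agree closely with the Flatline expression for distances within the city:

# Standalone haversine distance (in km) to downtown San Francisco, handy
# for checking the Flatline expression locally.
import math

def distance_to_sf(latitude, longitude):
    lat_sf, lon_sf = math.radians(37.773972), math.radians(-122.431297)
    lat, lon = math.radians(latitude), math.radians(longitude)
    dlat, dlon = lat - lat_sf, lon - lon_sf
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat_sf) * math.cos(lat) * math.sin(dlon / 2) ** 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return 6373 * c  # Earth radius in km, matching the Flatline expression

print(distance_to_sf(37.7955, -122.3937))  # a point near the Ferry Building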

For instance, modeling the rating in terms of the name, address, postal code, or distance to the city center could give us information about where to look for the best restaurants.

distance

Trying a logistic regression, we learn that to find a good restaurant, it’s best to move a bit away from the center of San Francisco.

logistic

Having data transformations in the platform has many advantages. Feature engineering becomes an integrated feature, trivial to use, while scalability, automation, and reproducibility are guaranteed, as for any other resource (and are one click away thanks to Scriptify). So don’t be shy and give it a try!

Enterprise Machine Learning More Accessible Than Ever with BigML Lite

Yesterday, millions of shoppers flocked to online sales for Cyber Monday. While this single day of extra savings is exciting, we believe in providing excellent value to our customers year-round. Today, we are delighted to introduce BigML Lite, a new Private Deployment option that makes enterprise Machine Learning more accessible than ever.

BigML Lite for enterprise
At this point, it’s well established that businesses in all industries have the challenge and opportunity to utilize tremendous amounts of data. What isn’t as well accepted yet is that obtaining a company-wide Machine Learning platform is key to enable analysts, developers, and engineers to build robust predictive applications in a timely manner.

A terrific blog post on “Why businesses fail at machine learning” by Cassie Kozyrkov uses a cooking metaphor to explain how companies often make the mistake of trying to build an oven (a Machine Learning platform) instead of baking bread (deriving insights and making predictions from data). Continuing with this metaphor, the majority of data-driven companies are in the business of “making bread”, so there’s no reason to spend resources creating an oven from scratch. BigML, on the other hand, is the “oven maker” of this metaphor. The BigML Team has spent the last 7+ years meticulously building a comprehensive Machine Learning platform that provides instant access to the most effective ML algorithms and high-level workflows.

To help companies focus on what matters most, automating the decision-making process, BigML offers Private Deployments for customers to start building production-ready predictive apps from day one, without having to worry about low-level infrastructure management. Now, BigML provides two options to meet the needs of both small and large scale deployments: BigML Lite and BigML Enterprise.

  • BigML Lite offers a fast-track route for implementing your first use cases. Ideal to get immediate value for startups, small to mid-size enterprises or in a single department of a large enterprise ready to benefit from Machine Learning.
  • BigML Enterprise offers full-scale access for unlimited users and organizations. Ideal for larger enterprises ready for company-wide Machine Learning adoption.

All BigML Private Deployments (Lite or Enterprise) include the following:

  • Unlimited tasks.
  • Regular updates and upgrades of new features and algorithms.
  • Priority access to customized assistance.
  • Ability to run in your preferred cloud provider, ISP, or on-premises.
  • Fully managed or self-managed Virtual Private Cloud (VPC) deployments.

With BigML Lite, your company can obtain the full power of BigML’s platform on a single server at a significantly reduced price. After successfully bringing your initial predictive use cases to production, you can easily upgrade to bigger deployments, auto-scaling to accommodate more users and more data. Along with our Private Deployments, we are happy to guide companies with their projects,  giving personalized support to help your business successfully apply Machine Learning.

Please see our pricing page for more details and contact us at info@bigml.com for any inquiries.

Ready, set, deploy!

K-means--: Finding Anomalies while Clustering

On November 4th and 5th, BigML joined the Qatar Computing Research Institute (QCRI), part of Hamad Bin Khalifa University, to bring a Machine Learning School to Doha, Qatar! We are very excited to have this opportunity to collaborate with QCRI.

During the conference, Dr. Sanjay Chawla discussed his algorithm for clustering with anomalies, k-means--. We thought it would be a fun exercise to implement a variation of it using our domain-specific language for automating Machine Learning workflows, WhizzML.

Applying BigML to ML research

The Algorithm

The usual process for the k-means-- algorithm is as follows. It starts with some dataset, some number of clusters k, and some number of expected outliers l. It randomly picks k centroids and assigns every point of the dataset to the closest of these centroids. So far, it’s just like vanilla k-means. In vanilla k-means, you would now find the mean of each cluster and set that as the new centroid. In k-means--, however, you first find the l points that are farthest from their assigned centroids and filter them from the dataset. The new centroids are computed using only the remaining points. By removing these points as we go, we find centroids that aren’t influenced by the outliers, and thus obtain different (and hopefully better) ones.
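As a rough illustration of that core loop (toy code, not the BigML implementation), one k-means-- style iteration could be sketched in Python like this:

# Toy sketch of one k-means-- iteration: assign points, drop the l points
# farthest from their centroid, then recompute centroids without them.
import numpy as np

def kmeans_minus_minus_step(X, centroids, l):
    # distance of every point to every centroid, then pick the closest
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    assignment = distances.argmin(axis=1)
    closest_distance = distances[np.arange(len(X)), assignment]

    # the l farthest points are treated as outliers for this iteration
    outliers = np.argsort(closest_distance)[-l:]
    keep = np.setdiff1d(np.arange(len(X)), outliers)

    # recompute each centroid using only the kept points assigned to it
    new_centroids = np.array([
        X[keep][assignment[keep] == j].mean(axis=0)
        if np.any(assignment[keep] == j) else centroids[j]
        for j in range(len(centroids))])
    return new_centroids, outliers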

We already have an implementation of k-means in BigML: the cluster resource. But this is not vanilla k-means. Instead of finding the new centroids by averaging all of the points in the cluster, BigML’s implementation works faster by sampling the points and using a gradient descent approach. BigML also picks better initial conditions than vanilla k-means. Rather than losing these benefits, we’ll adapt Chawla’s k-means-- to use a full BigML clustering resource inside the core iteration.

This WhizzML script is the meat of our implementation.  

(define (get-anomalies ds-id filtered-ds k l)
  (let (cluster-id (create-and-wait-cluster {"k" k 
                                             "dataset" filtered-ds})
        batchcentroid-id (create-and-wait-batchcentroid 
                            {"cluster" cluster-id 
                             "dataset" ds-id 
                             "all_fields" true 
                             "distance" true 
                             "output_dataset" true})
        batchcentroid (fetch batchcentroid-id)
        centroid-ds (batchcentroid "output_dataset_resource")
        sample-id (create-and-wait-sample centroid-ds)
        field-id (((fetch centroid-ds) "objective_field") "id") 
        anomalies (fetch sample-id {"row_order_by" (str "-" field-id) 
                                    "mode" "linear"
                                    "rows" l
                                    "index" true}))
    (delete* [batchcentroid-id sample-id])
    {"cluster-id" cluster-id 
     "centroid-ds" centroid-ds
     "instances" ((anomalies "sample") "rows")}))

Let’s examine it line by line. Instead of removing l outliers at each step of the algorithm, we run an entire k-means sequence before removing our anomalies.

cluster-id (create-and-wait-cluster {"k" k "dataset" filtered-ds})

It is then very easy to create a batch centroid with an output dataset that has the distance to the centroid appended.

batchcentroid-id (create-and-wait-batchcentroid {"cluster" cluster-id 
                                                 "dataset" ds-id 
                                                 "all_fields" true 
                                                 "distance" true 
                                                 "output_dataset" 
                                                   true})

To obtain the specific points, we use a BigML sample resource to retrieve the most distant ones.

sample-id (create-and-wait-sample centroid-ds)

We can now find the distance associated with the lth instance, and subsequently filter out of our original dataset all points whose distance is greater than that.

anomalies (fetch sample-id {"row_order_by" (str "-" field-id) 
                            "mode" "linear"
                            "rows" l
                            "index" true}))

We repeat this process until the centroids stabilize, as determined by a threshold on the Jaccard coefficient between the sets of outliers in subsequent iterations of the algorithm, or until we reach some maximum number of iterations set by the user.
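That stopping criterion is just a comparison between two sets of row indices; a quick Python sketch:

# Jaccard similarity between the outlier sets of two consecutive iterations;
# iteration stops once it exceeds a user-chosen threshold (e.g. 0.9).
def jaccard(previous_outliers, current_outliers):
    a, b = set(previous_outliers), set(current_outliers)
    return len(a & b) / len(a | b) if (a | b) else 1.0

print(jaccard([3, 17, 42, 56], [3, 17, 42, 99]))  # 3 shared out of 5 -> 0.6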

 You can find the complete code on GitHub or in the BigML gallery.

The Script In Action

So what happens when we run this script? Let’s try it with the red wine quality dataset. Here is the result when using a k of 13 (chosen using a BigML g-means cluster) and an l of 10.


We can export a cluster summary report and compare it to a vanilla BigML cluster with the same k. As you might expect after removing the outlying points, the average of the centroid standard deviations is smaller for the k-means-- results: 0.00128 versus 0.00152.

What about the points we removed as outliers? Do we know if they were truly anomalous? When we run the wine dataset through a BigML anomaly detector, we can get the top ten anomalies according to an isolation forest. When compared to the ten outliers found by the script, we see that there are six instances in common. This is decent agreement, suggesting that we have indeed removed true outliers.

We hope you enjoyed this demonstration of how BigML can work with researchers to easily custom-make ML algorithms. If you couldn’t join us in Doha this weekend, we hope to see you at the first edition of our Machine Learning School in Seville, Spain, or at other upcoming events!

Where Robotics and Machine Learning Meet

Robotic Process Automation (RPA) and Machine Learning are cutting-edge technologies that are showing an astonishing pace of growth in both capabilities and real-world applications. Having realized the powerful synergy between the data generated by Software Robots and the insights that Machine Learning algorithms can provide for any business, Jidoka and BigML, the leading RPA and Machine Learning companies respectively, join forces in a partnership to provide highly integrated solutions to collective partners and clients.

There are plenty of areas where businesses and developers can benefit from this strategic alliance between RPA and Machine Learning. Take, for instance, a company’s customer care department and its email processing requirements. On one hand, BigML creates a Machine Learning model that predicts the receiver (department or employee) of a given email. On the other hand, Jidoka’s robots automatically carry out all the rule-based tasks that humans historically completed, such as checking if there are new e-mails to be processed, forwarding them to the correct recipients according to BigML’s predictions, and registering the task to address the request.

“Merging RPA and Machine Learning capabilities, we are able to provide enhanced automation solutions to companies that want to lead their digital transformation journey”, said Victor Ayllón, Jidoka’s CEO. “Alliances with companies such as BigML allow us to expand significantly the typology and complexity of automated processes, taking RPA one step further. Intelligent Automation is not a matter of “if” but a matter of “when”.

BigML’s CEO, Francisco Martín, commented: “The imperative for more automation continues unabated in all types of business processes, and it’s only natural that RPA efforts seek to embed more and more Machine Learning models that optimize those processes and perform tasks that until now only highly-trained humans were capable of achieving. Such automation liberates personnel so that they can focus on more strategic tasks. BigML’s partnership with Jidoka will result in much more adaptive systems that will help our customers introduce new avenues of productivity.”

With the shared goals of making businesses more agile, productive, and customer-oriented, Jidoka and BigML jointly offer enhanced capabilities for business process automation. Reducing costs and errors, improving response times, increasing human performance, enabling predictions, and facilitating data-driven decision-making are just some of the benefits businesses will be rewarded with as they adopt Machine Learning-driven Robotics.

Partner with a Machine Learning Platform: Bringing Together Top Tech Companies

The concept of “better together” doesn’t just apply to people or beloved food pairings; it also works quite well for software technologies. With this in mind, we are excited to announce the BigML Preferred Partner Program (PPP), which brings together our comprehensive Machine Learning platform with innovative companies around the world that offer complementary technology and services. Through effective collaboration, we can provide numerous industry-specific solutions to benefit the customers of both companies.

Our Preferred Partner Program comprises three different levels of commitment and incentives:

  • Referral Partners: focus on generating qualified leads by leveraging their business network.
  • Sales Partners: drive product demonstrations and advance qualified leads through the contract stage.
  • Sales & Delivery Partners: facilitate the sales process and see closed deals through the solution deployment stage.

BigML provides a matching amount of training according to the tasks covered at each partnership level, including personalized training sessions, sales and IT team training, and BigML Engineer Certifications. In addition to learning one-on-one from BigML’s Machine Learning experts, partners enjoy a number of additional perks.

Over the past 7 years, BigML has systematically built a consumable, programmable, and scalable Machine Learning platform that is being used by tens of thousands of users around the world to develop end-to-end Machine Learning applications. BigML provides the core components for any Machine Learning workflow, giving users immediate access to analyze their data, build models, and make predictions that are interpretable, programmable, traceable and secure.

With these powerful building blocks, both robust and easy to use, it is more accessible than ever for data-driven companies to build full-fledged predictive applications across industries. We strongly believe that not only end customers can benefit from our platform, but also the tech companies that wish to help in the adoption of Machine Learning and make it accessible and effective for all businesses. Displayed below is a small sampling of services that can be provided atop the BigML platform:

Partner Services atop of BigML Platform

Find more information here and reach out to us at partners@bigml.com. In addition to the three main levels, BigML is also looking for companies interested in Original Equipment Manufacturer (OEM) and technology partnerships, so if that is of interest to your company, please let us know. We look forward to partnering with other innovative companies!

Qatar is Ready for Machine Learning Adoption

141 attendees from 13 countries enjoyed the first edition of our Machine Learning School in Doha, co-organized with the Qatar Computing Research Institute (QCRI), part of Hamad Bin Khalifa University. Machine Learning practitioners, mostly from Qatar but also from Algeria, Egypt, Finland, Mexico, Montenegro, the Republic of Korea, South Africa, Spain, Tunisia, Turkey, the United Kingdom, and the United States, traveled to the capital of Qatar for two days packed with Machine Learning master classes, practical workshops, one-on-one consultations at the Genius Corner, and plenty of chances for networking with data-driven professionals.

The BigML and QCRI teams are very excited and impressed with such a great response, which was felt at the event in our interactions with the audience and the media on November 4-5 at the Marriott Marquis City Center Doha Hotel. MLSD18 was booked to capacity in less than a week. A wide variety of professionals joined the two-day intensive course: 90 attendees represented 12 top universities from the Middle East region, and the remaining 51 attendees came from 29 organizations, including some of the most influential companies in the region.

The lessons were delivered by experienced BigML team members as well as QCRI Scientists. BigML’s CIO Poul Petersen covered the important topic of Machine Learning-Ready Data, together with Dr. Saravanan Thirumuruganathan, Scientist at QCRI. The curriculum also included supervised and unsupervised learning techniques, taught by Dr. Greg Antell, BigML’s Machine Learning Architect, and Dr. Sanjay Chawla, Research Director with the Qatar Computing Research Institute, respectively. The automation section was presented by BigML’s CTO, Dr. José Antonio Ortega (jao). Furthermore, Dr. Mourad Ouzzani, Principal Scientist with QCRI, presented the latest research projects QCRI has been working on. Please check out these photo albums on Google+ and Facebook to view the event’s highlights.

We could not wish for a better start to our Machine Learning Schools series in the Middle East, and we look forward to future editions in the region. Stay tuned for future announcements! For the Machine Learning enthusiasts who could not join MLSD18, we encourage you to register for the first Machine Learning School in Seville, on March 7-8, 2019.

The Data Transformations Webinar Video is Here: Machine Learning-Ready Data!

The latest BigML release brings new Data Transformation capabilities to our Machine Learning platform, crucially alleviating the data engineering bottleneck.

In addition to our previous offerings in the data preparation area, we are now releasing a new batch of transformations that maximize the benefits of the BigML platform’s supervised and unsupervised learning techniques by pre-processing input data more effectively. You can now aggregate instances, join and merge datasets, and apply many other operations, as presented in the official launch webinar video. You can watch it anytime on the BigML YouTube channel.

To learn more about Data Transformations, please visit the release page, where you can find:

  • The slides used during the webinar.
  • The detailed documentation to learn how to use Data Transformations with the BigML Dashboard and the BigML API.
  • The series of blog posts that gradually explain Data Transformations. We start with an introductory post that explains the basic concepts, followed by a use case to understand how to put Data Transformations to use, and then three more posts on how to use Data Transformations through the BigML Dashboard, API, WhizzML, and Python Bindings.

Thanks for watching the webinar! For any queries or comments, please contact the BigML Team at support@bigml.com.

Automating Data Transformations with WhizzML and the BigML Python Bindings


This is the fifth post of our series of six about BigML’s new release: Data Transformations. This time we are focusing on the Data Preparation step, prior to any Machine Learning project, to create a new and improved dataset from an existing one.

CRISP-DM_diagram-bigml

The data preparation phase is key to achieving good performance for your predictive models. Moreover, a wide variety of operations may be needed, since data usually do not come ready or do not include upfront the fields we need to create our Machine Learning models. Being aware of that, in 2014 BigML introduced Flatline, a domain-specific language designed for data transformations. Over the years, Flatline has grown and increased the number of operations that can be performed. Now, in this release, we have improved its sliding window operations and added the ability to use a subset of SQL instructions that add a new range of transformations, such as joins, aggregations, or adding rows to an existing dataset.

In this blog post, we will learn step by step how to automate these data transformations programmatically using WhizzML, BigML’s Domain Specific Language for Machine Learning automation, and the official Python bindings.

Adding Rows: Merge Datasets

When you want to add data to a dataset that already exists in the platform, you can use the following code. This is useful, for example, when data are collected in periods or when the same kind of data comes from different sources.

;; creates a dataset merging two existing datasets
(define merged-dataset
  (create-dataset {"origin_datasets"
                   ["dataset/5bca3fb3421aa94735000003"
                    "dataset/5bcbd2b5421aa9560d000000"]}))

The equivalent code in Python is:

# merge all the rows of two datasets
from bigml.api import BigML

api = BigML()  # credentials are read from the environment
api.create_dataset(
    ["dataset/5bca3fb3421aa94735000003",
     "dataset/5bcbd2b5421aa9560d000000"]
)

As we saw in previous posts, the BigML API is mostly asynchronous, which means that the execution will return the ID of the new dataset before its creation is completed. This implies that the analysis of fields and their summary will continue after the code snippet is executed. You can use the directive “create-and-wait-dataset” to be sure that the datasets have finally merged:

;; creates a dataset from two existing datasets and
;; once it's completed its ID is saved in merged-dataset variable
(define merged-dataset
  (create-and-wait-dataset {"origin_datasets"
                            ["dataset/5bca3fb3421aa94735000003"
                             "dataset/5bcbd2b5421aa9560d000000"]}))

The equivalent code in Python is:

# merge all the rows of two datasets and store the ID of the
# new dataset in merged_dataset variable
merged_dataset = api.create_dataset(
    ["dataset/5bca3fb3421aa94735000003",
     "dataset/5bcbd2b5421aa9560d000000"]
)
api.ok(merged_dataset)

When you merge datasets, you can set several parameters, which you can check in the Multi-datasets section of the API documentation. With that, we can now configure a merged dataset with WhizzML, setting the sample rates and using the same pattern of <property_name> and <property_value> pairs we used in the first example.

;; creates a dataset from two existing datasets
;; setting the percentage of sample in each one
;; once it's completed its ID is saved in merged-dataset variable
(define merged-dataset
  (create-and-wait-dataset {"origin_datasets"
                            ["dataset/5bca3fb3421aa94735000003"
                             "dataset/5bcbd2b5421aa9560d000000"]
                            "sample_rates"
                             {"dataset/5bca3fb3421aa94735000003" 0.6
                              "dataset/5bcbd2b5421aa9560d000000" 0.8}}))

The equivalent code in Python is:

# Creates a merged dataset specifying the rate of each 
# one of the original datasets
merged_dataset = api.create_dataset(
    ["dataset/5bca3fb3421aa94735000003", "dataset/5bcbd2b5421aa9560d000000"],
    {
        "sample_rates": {
            "dataset/5bca3fb3421aa94735000003": 0.6,
            "dataset/5bcbd2b5421aa9560d000000": 0.8
        }
    }
)
api.ok(merged_dataset)

Denormalizing Data: Join Datasets

Data is commonly stored in relational databases, following the normal forms paradigm to avoid redundancies. Nevertheless, for Machine Learning workflows, data need to be denormalized.

BigML now allows you to perform this process in the cloud as part of your workflow, codified in WhizzML or with the Python bindings. For this transformation, we can use Structured Query Language (SQL) expressions. See below how it works, assuming we have two different datasets in BigML that we want to put together, and that both share an `employee_id` field whose field ID is 000000 in each dataset:

;; creates a joined dataset composed by two datasets
(define joined_dataset
  (create-dataset {"origin_datasets"
                   ["dataset/5bca3fb3421aa94735000003"
                    "dataset/5bcbd2b5421aa9560d000000"]
                   "origin_dataset_names"
                   {"dataset/5bcbd2b5421aa9560d000000" "A"
                    "dataset/5bca3fb3421aa94735000003" "B"}
                   "sql_query"
                   "select A.* from A left join B on A.`000000` = B.`000000`"}))

The equivalent code in Python is:

# creates a joined dataset composed by two datasets
api.create_dataset(
    ["dataset/5bca3fb3421aa94735000003", "dataset/5bcbd2b5421aa9560d000000"],
    {
        "origin_dataset_names": {
            "dataset/5bca3fb3421aa94735000003": "A",
            "dataset/5bcbd2b5421aa9560d000000": "B"
        },
        "sql_query":
            "SELECT A.* FROM A LEFT JOIN B ON A.`000000` = B.`000000`"
    }
)

Aggregating Instances

The use of SQL opens up a wide range of operations on your data, such as selections, value transformations, and row groupings, among others. For instance, in some situations we need to collect some statistics from the data, creating groups around the values of a specific field. This transformation is commonly known as aggregation, and the SQL keyword for it is ‘GROUP BY’. See below how to use it in WhizzML, assuming we are managing a dataset with some data from a company, where field 000001 is the department and field 000005 is the employee ID.

;; creates a new dataset aggregating the instances
;; of the original one by the field 000001
(define aggregated_dataset
  (create-dataset {"origin_datasets"
                   ["dataset/5bcbd2b5421aa9560d000000"]
                   "origin_dataset_names"
                   {"dataset/5bcbd2b5421aa9560d000000" "DS"}
                   "sql_query"
                   "SELECT `000001`, count(`000005`) FROM DS GROUP BY `000001`"}))

The equivalent code in Python is:

# creates a new dataset aggregating the instances
# of the original one by the field 000001
api.create_dataset(
    ["dataset/5bcbd2b5421aa9560d000000"],
    {
         "origin_dataset_names": {"dataset/5bcbd2b5421aa9560d000000": "DS"},
         "sql_query":
             "SELECT `000001`, count(`000005`) FROM DS GROUP BY `000001`"
    }
)

It is possible to use field names in the queries, but field IDs are preferred to avoid ambiguities. It is also possible to define aliases for the new fields using the SQL keyword AS after the corresponding operation. Note that with SQL you can also perform more complex operations than the ones we demonstrate in this post.
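For instance, a hedged variant of the previous query that names its output columns with AS aliases (same hypothetical dataset) could look like this:

# Same aggregation, but naming the output fields with AS aliases;
# the dataset ID and field IDs are the placeholders used above.
api.create_dataset(
    ["dataset/5bcbd2b5421aa9560d000000"],
    {
        "origin_dataset_names": {"dataset/5bcbd2b5421aa9560d000000": "DS"},
        "sql_query":
            "SELECT `000001` AS department, COUNT(`000005`) AS num_employees "
            "FROM DS GROUP BY `000001`"
    }
)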

Want to know more about Data Transformations?

If you have any questions or you would like to learn more about how Data Transformations work, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow, as well as the full webinar recording.
