
Practical Recursive Feature Selection

With the Summer 2018 Release, Data Transformations were added to BigML. SQL-style queries, feature engineering with the Flatline editor, and options to merge and join datasets are great, but sometimes they are not enough. If our dataset has hundreds of different columns, it can be hard to handle them all.

We can apply transformations, but to which columns? Which features are useful for predicting our target, and which ones only increase resource usage and model complexity?

Later, with the Fall 2018 Release, Principal Component Analysis (PCA) was added to the platform to help with dimensionality reduction. PCA helps extract the discriminative information in the data while removing those fields that only add noise and make it difficult for the algorithm to achieve the expected performance. However, the PCA transformation yields new variables that are linear combinations of the original fields, which can be a problem if we want to obtain interpretable models.

Recursive Feature Elimination with WhizzML

Feature selection algorithms will help you deal with wide datasets. There are four main reasons to keep only the most useful fields in a dataset and discard the others:

  • Memory and CPU: Useless features consume unnecessary memory and CPU.
  • Model performance: Although a good model should be able to detect which features in a dataset are important, the noise generated by useless fields can confuse the model, so we often obtain better performance when we remove them.
  • Cost: Obtaining data is not free. If some columns are not useful, don’t waste your time and money trying to collect them.
  • Interpretability: Reducing the number of features will make our model simpler and easier to understand.

In this series of four blog posts, we will describe three different techniques that can help us in this task: Recursive Feature Elimination (RFE), Boruta algorithm, and Best-First Feature Selection. These three scripts have been created using WhizzML, BigML’s domain-specific language. In the fourth and final post, we will summarize the techniques and provide guidelines for which are better suited depending on your use case.

Some of you may already know about the Best-First and Boruta scripts since we have offered them in the WhizzML Scripts Gallery. We will provide some details about the improvements we made to those and the new script, RFE.

Introduction to Recursive Feature Elimination (RFE)

In this post, we are focusing on Recursive Feature Elimination. You can find it in the BigML Script Gallery; if you want to know more about this script, visit its info page.

This is a completely new script in WhizzML. RFE starts with all the fields and iteratively creates ensembles, removing the least important field at each iteration. The process is repeated until the number of fields set by the user in advance is reached. One interesting feature of this script is that it can return an evaluation for each possible number of features, which is very helpful for finding the ideal number of features to use.
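To make the idea concrete, here is a minimal Python sketch of the RFE loop using the BigML Python bindings. It is not the actual WhizzML implementation, and it assumes the finished ensemble resource exposes aggregated field importances as [field_id, importance] pairs.

from bigml.api import BigML

api = BigML()

def recursive_feature_elimination(dataset_id, objective_id, n):
    """Minimal sketch of the RFE loop: drop the least important field
    until only n input fields remain."""
    dataset = api.get_dataset(dataset_id)
    fields = [field_id for field_id in dataset["object"]["fields"]
              if field_id != objective_id]
    while len(fields) > n:
        ensemble = api.create_ensemble(
            dataset_id, {"input_fields": fields,
                         "objective_field": objective_id})
        api.ok(ensemble)
        # assumption: the finished ensemble aggregates field importances
        # as [field_id, importance] pairs under "importance"
        importances = ensemble["object"]["importance"]
        weakest = min(importances, key=lambda pair: pair[1])[0]
        fields.remove(weakest)
    return fields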

The script input parameters, which can also be supplied programmatically as shown below, are:

  • dataset-id: input dataset
  • n: final number of features expected
  • objective-id: objective field (target)
  • test-ds-id: test dataset to be used in the evaluations (no evaluations take place if empty)
  • evaluation-metric: metric to be maximized in evaluations (default if empty). Possible classification metrics: accuracy, average_f_measure, average_phi (default), average_precision, and average_recall. Possible regression metrics: mean_absolute_error, mean_squared_error, r_squared (default).
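For instance, an execution of the script can be launched with the Python bindings like this (the script and dataset IDs below are hypothetical placeholders):

from bigml.api import BigML

api = BigML()

# hypothetical IDs: replace with your own copy of the RFE script,
# your dataset, its objective field ID, and a test dataset
execution = api.create_execution(
    "script/5c2f1a604e1727247d000000",
    {"inputs": [["dataset-id", "dataset/5c2f1a604e1727247d000001"],
                ["n", 25],
                ["objective-id", "000000"],
                ["test-ds-id", "dataset/5c2f1a604e1727247d000002"],
                ["evaluation-metric", "average_phi"]]})
api.ok(execution)
# the selected fields (and the evaluations, if a test dataset was given)
# are part of the finished execution resource
print(execution["object"])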

Our dataset: System failures in trucks dataset

This dataset, originally from the UCI Machine Learning Repository, contains information for multiple sensors inside trucks. The dataset consists of trucks with failures and the objective field determines whether or not the failure comes from the Air Pressure System (APS). This dataset will be useful for us for two reasons:

  • It contains 171 different fields, which is a sufficiently large number for feature selection.
  • Field names have been anonymized for proprietary reasons so we can’t apply domain knowledge to remove useless features.

As it is a very big dataset, we will use a sample of it with 15,000 rows.
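If you want to reproduce this setup, a sampled dataset can be created from the full one with the Python bindings. The sketch below uses a hypothetical dataset ID and assumes the original dataset holds 60,000 rows, so a 0.25 sample rate yields roughly 15,000.

from bigml.api import BigML

api = BigML()

# hypothetical dataset ID; assuming the full dataset has 60,000 rows,
# a 0.25 sample rate gives a sample of roughly 15,000 rows
sampled_dataset = api.create_dataset(
    "dataset/5c2f1a604e1727247d000001",
    {"sample_rate": 0.25, "seed": "aps-sample"})
api.ok(sampled_dataset)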

Feature Selection with Recursive Feature Elimination

We will start by applying Recursive Feature Elimination with the following inputs. We use n=1 because we want to obtain the performance of every possible subset of features, from 171 down to 1. If we set a higher n, e.g., 50, the script would stop when it reached that number, so we wouldn't know how smaller subsets of features perform.

Input parameters of RFE execution

After 30 minutes, we obtain an output-features object that contains all the possible subsets of features and their performance. We can use it to create the graph below, from which we can deduce that the optimal number of features is around 25; from 25 features on, the performance is stable.

Evaluation score as a function of the number of features

Try it yourself with this Python script

Now that we know that we should obtain around 25 features, let’s run the script again to find out which are the optimal 25. This time, as we don’t need to perform evaluations, we won’t pass the test dataset to the script execution.

The script needs 20 minutes to finish the execution. The 25 most important fields that RFE returns are:

"bs_000", "cn_004", "cs_002","cn_000", "dn_000", "ay_008", "ba_005",    
"ee_005", "bj_000", "az_000", "al_000", "am_0", "ay_003", "ci_000", 
"ba_007",  "aq_000", "ag_002", "ee_007", "ck_000", "bc_000", "ay_005", 
"ba_002", "ee_000", "cm_000", "ai_000"

From the script execution, we can obtain a filtered dataset with these 25 fields. The ensemble associated with this filtered dataset has a phi coefficient of 0.815. The phi coefficient of the ensemble that uses the original dataset was only a bit higher, 0.824. That sounds good!
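The filtered dataset is available as one of the script's outputs, but you could also build it yourself from the field list above. Here is a hedged sketch with the Python bindings, using a hypothetical dataset ID and assuming the objective field is named "class":

from bigml.api import BigML

api = BigML()

selected_fields = ["bs_000", "cn_004", "cs_002", "cn_000", "dn_000",
                   "ay_008", "ba_005", "ee_005", "bj_000", "az_000",
                   "al_000", "am_0", "ay_003", "ci_000", "ba_007",
                   "aq_000", "ag_002", "ee_007", "ck_000", "bc_000",
                   "ay_005", "ba_002", "ee_000", "cm_000", "ai_000"]

# hypothetical dataset ID; "class" is the assumed name of the objective field
origin = api.get_dataset("dataset/5c2f1a604e1727247d000001")
name_to_id = {field["name"]: field_id
              for field_id, field in origin["object"]["fields"].items()}

# keep the objective field plus the 25 fields selected by RFE
filtered_dataset = api.create_dataset(
    origin, {"input_fields": [name_to_id[name]
                              for name in selected_fields + ["class"]]})
api.ok(filtered_dataset)

ensemble = api.create_ensemble(filtered_dataset,
                               {"objective_field": "class"})
api.ok(ensemble)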

As we have seen, Recursive Feature Elimination is a simple but powerful feature selection script with only a few parameters, serving as a very useful way to get an idea of which features are actually contributing to our model. In the next post, we will see how we can achieve similar results using Boruta. Stay tuned!

Machine Learning meets Social Good to tackle Data Quality Challenges for Enterprises

BigML partners with WorkAround to provide datasets tagged and cleaned by skilled refugees. WorkAround, a crowdsourcing platform for refugees and displaced people, partners with BigML, the leading Machine Learning Platform accessible for everyone, to give more economic opportunities to end users.

In a world of increasing automation, it is easy to forget the human work that goes into making Machine Learning happen. Quality data is the linchpin to accurate outcomes from Machine Learning algorithms, but finding providers that can deliver clean and accurate data is challenging. However, WorkAround makes this possible while working with skilled refugees and displaced people who are otherwise unable to work due to government restrictions, lack of access to banking, and other barriers. With this partnership, BigML customers will enjoy the benefits of having their data cleaned and tagged without the burden of having to perform these tasks by themselves, thus being able to dedicate more time to other strategic tasks.

“I started WorkAround because aid is not a sustainable solution for anyone to move forward,” says Wafaa Arbash, WorkAround’s co-founder and CEO, who watched frustrated as many of her fellow Syrians fled conflict only to be left with few options for employment in host communities, despite having higher education and previous work experience. “People don’t need handouts, they need economic opportunities.”

Although the 1951 UN Refugee Convention signed by 144 countries grants refugees the right to work, the reality is that most host countries block or severely limit local access to jobs. “WorkAround basically saved my life,” said Oro Mahjoob, a WorkArounder since July of 2017, “it gave me a chance to work and earn enough to pay my rent with only having an internet connection and a device.”

Francisco Martin, BigML’s CEO emphasized: “BigML is excited to offer more ways to ensure high-quality data is made available for a variety of Machine Learning tasks executed on our platform. Our mission of democratizing Machine Learning is further extended to cover data preparation thanks to our partnership with WorkAround all the while contributing to a worthy social cause.”

Principal Component Analysis Webinar Video: Dimensionality Reduction Made Easy!

BigML has brought Principal Component Analysis (PCA) to the platform. PCA is a key unsupervised Machine Learning technique used to transform a given dataset in order to yield uncorrelated features and reduce dimensionality. PCA fundamentally transforms a dataset defined by possibly correlated variables into a set of uncorrelated variables, called principal components. When used for dimensionality reduction, these principal components often allow improvements in the results of supervised modeling tasks by reducing overfitting as there remain fewer relationships to consider between variables after the process.

Additionally, BigML PCA’s unique implementation lets you transform many different data types automatically without requiring you to configure it manually. That is, BigML PCA can handle numeric and non-numeric data types, including text, categorical, items fields, as well as combinations of different data types. PCA is ideal for domains with high dimensional data including bioinformatics, quantitative finance, and signal processing, among others.

Now you can easily create BigML PCAs through the BigML Dashboard benefiting from intuitive visualizations, via the REST API if you prefer to work programmatically, or via WhizzML and a wide range of bindings for automation. To see how, please watch the launch webinar video on the BigML YouTube channel.

For further learning about Principal Component Analysis, please visit the release page, where you can find:

  • The slides used during the webinar.
  • The detailed documentation to learn how to use PCA with the BigML Dashboard and the BigML API.
  • The series of blog posts that gradually explain PCA. We start with an introductory post that explains the basic concepts, followed by a use case that shows how to apply dimensionality reduction with PCA to cancer data, and three more posts on how to use and interpret Principal Component Analysis through the BigML Dashboard, API, as well as WhizzML and the Python Bindings.

Thanks for watching the webinar and for your positive feedback! As usual, your comments are always welcome, feel free to contact the BigML Team at support@bigml.com.

Recapping BigML’s 2018 in Numbers

Another year has gone by in a hurry in the Machine Learning world of BigML. 2018 saw the interest in Machine Learning from all industries continually get stronger. Not only are we seeing an increase in the level of awareness and sophistication towards productive business applications running existing processes more efficiently, but we’re also witnessing novel use cases turning Machine Learning into an all together strategic asset. When things happen so fast, one can sometimes find it a challenge to stop and reflect on milestones and achievements. So below are the highlights of what made 2018 another special year for us.

bigml_summary_2018

In 2018, the BigML platform crossed the 80,000 registrations mark worldwide, adding to our milestones since our inception in 2011. Our users keep making a difference in their workplaces, government agencies, and educational institutions all over the map, putting the BigML platform to use in the most creative ways.

5 Major Releases + 23 Enhancements

2018 saw a wealth of new features added to BigML, opening up many more compositional workflows involving the platform's collective resources and their corresponding API endpoints.

  • The year started out with the Operating Thresholds and Organizations capabilities being launched. Operating Thresholds let you better adjust the tradeoff between false positives and false negatives in your classification models while Organizations help assign different roles and privileges to different users in workgroups respectively.
  • The OptiML release followed, complementing an already impressive array of supervised learning resources for tackling classification and regression problems by applying Bayesian Parameter Optimization.
  • Our Fusions release gave the platform a whole new dimension, allowing users to easily mix and match multiple models into an ensemble regardless of whether the underlying algorithms are different.
  • Keeping the momentum alive we launched Data Transformations, a much-requested collection of new features that let users manipulate and pre-process their data into a Machine Learning-ready format even without any SQL expertise.
  • Finally, Principal Component Analysis (PCA) marks the year-end addition that extends BigML with practical dimensionality reduction functionality adaptable to all types of input data, e.g., categorical, numeric, and text.

releases_2018

Aside from major releases, we made 23 smaller but noteworthy improvements to the BigML platform including, but not limited to: Feature Engineering with the Flatline Editor, Sliding Windows, SQL in the BigML API, New Text Analysis Options, Prediction Explanation, and many more. You can find a full list of enhancements on our What’s New page in case you’d like to try out the ones you may have missed.

As usual, we’ve also kept BigML Tools updated to make sure insights from BigML resources find their way to different platforms. One such example is the newest version of our Predict App for Zapier, which allows you to easily automate your Machine Learning workflows without any coding by importing your data in real-time from the most popular web apps.

Making Enterprise Machine Learning accessible with BigML Lite

We also need to drop a special mention for BigML Lite, which is the latest addition to BigML’s product lineup.

BigML Lite offers a fast-track route for implementing your first use cases. It is ideal for startups, small to mid-size enterprises, or a single department of a large enterprise ready to benefit from Machine Learning with their first predictive use cases. Now, for as low as $10,000/year, any of these scenarios can be realized as a stepping stone to company-wide Machine Learning initiatives with a larger scope.

BigML Dashboard in Chinese

Part of being a global company, as seen in our body of users representing 182 countries across 6 continents, is recognizing the need to customize the user experience to match local expectations.

In 2018, as a result of growing demand in Asia, BigML has taken a giant step by translating the BigML Dashboard to Chinese. You can expect further local customization options down the road as we strive to delight our global users.

498,849 Code Changes through 225 Deploys!

All our releases and new features were made possible by some serious heavy lifting from our product development group, which makes up a large percentage of our 33 FTE strong BigML Team. Just to give a glimpse of the level of non-stop activity, our team updated our backend codebase 282,786 times, our API codebase 41,413 times, and our Web codebase 174,650 times. These improvements and additions were carried out through 225 production deployments dotting the entire year.

code_lines_2018

58 Events in 5 Continents

It’s hard to match the excitement of connecting with BigMLers during real-life events to hear their stories and receive their feedback. In 2018, we continued the tradition of organizing and delivering Machine Learning schools with VSSML18 and added to the rotation the Machine Learning School in Doha 2018 with record attendance. Next year, Seville will also be part of the ML education circuit with MLSEV.

On the industry-specific events side, the 2ML event held in Madrid, Spain, was fruitful, so we intend to continue this collaboration with Barrabes in the 2019 edition. To kick off the new year, we’ll make a stop in Edinburgh, Scotland in January to share our experiences in automating decision making for businesses.

schools_2018

171 Ambassador Applications, 7 Certification Batches

Our Education Program saw continued growth in 2018, with 171 new applicants joining to help promote Machine Learning on their campuses, for a total of 265 since we launched the program. BigML Ambassadors span the globe and include students as well as educators. To boot, as part of BigML’s Internship Program, we hired 3 interns who made valuable contributions. We are thrilled to see continually increasing interest in gaining hands-on Machine Learning experience, having received hundreds of applications for internships and full-time positions throughout the year.

learnml_2018

Moreover, in its second year of existence, our team of expert Machine Learning instructors completed 7 rounds of BigML’s Certification Program passing on their deep expertise to newly minted BigML Engineers.

BigML Preferred Partner Program

Last but not least is the new BigML Preferred Partner Program that we announced in the last quarter of 2018. With 3 levels and different rewards, it enables new synergies covering multiple sales and project delivery collaboration scenarios.

Our first announcement on that front involved Jidoka, a leader in RPA (Robotic Process Automation), and earlier this week we have revealed our partnership with SlicingDice, which offers a highly performant all-in-one data warehouse service that will in the near future come with built-in Machine Learning capabilities. Stay tuned for more partnership announcements in the new year as we’re in discussion with an exciting lineup of new BigML partners that will each make a difference in their respective industry verticals.

84 Blog Posts (so far)

We’ve kept our blog running on all cylinders throughout the year. 84 new posts were added to our blog, which has long been recognized as one of the Top 20 Machine Learning blogs. Below is a selection of posts that were popular on our social media channels in case you’re interested in catching up with some Machine Learning reading during the holidays.

blogposts_2018

Looking Forward to 2019

Hope this post gave a good tour of what’s been happening around our neck of the woods. We fully intend to continue carrying the Machine Learning for everyone flag in the new year. As part of our commitment to democratize Machine Learning by making it simple and beautiful for everyone, we will be sharing more of our insights, customer success stories, and all the new features we will bring you with each new release in 2019. As always, thanks for being part of BigML’s journey!

Principal Component Analysis: Technical Overview

This past week we’ve been blogging about BigML’s new Principal Component Analysis (PCA) feature. In this post, we will continue on that topic and discuss some of the mathematical fundamentals of PCA, and reveal some of the technical details of BigML’s implementation.

A Geometric Interpretation

Let’s revisit our old friend the iris dataset. To simplify visualizations, we’ll just look at two of its variables: petal length and sepal length. Imagine now that we take some line in this 2-dimensional space and we sum the perpendicular distances between this line and each point in the dataset. As we rotate this line around the center of the dataset, the value of this sum changes. You can see this process in the following animation.

What we see here is that the value of this sum reaches a minimum when the line is aligned with the direction in which the dataset is most “stretched out”. We call a vector that points in this direction the first “principal component” of the dataset. It is plotted on the left side of the figure below so you don’t need to make yourself dizzy trying to see it in the animated plot.

The number of principal components is equal to the dimensionality of the dataset, so how do we find the second one for our 2D iris data? We first need to cancel out the influence of the first one: from each data point, we subtract its projection along the first principal component. As this is the final principal component, we are left with all the points in a neat line, so we don’t need to go spinning another vector around. This is the result shown on the right side of the figure below.

pca

The importance of a principal component is usually measured by its “explained variance”. Our example dataset has a total variance of 3.80197. After we subtract the first principal component, the total variance is reduced to 0.14007. The explained variance for PC1 is therefore:

\frac{3.80197 - 0.14007}{3.80197} \approx 0.96315

This process can be generalized to data with higher dimensions. For instance, starting with a 3D point cloud, subtracting the first principal component leaves all the points in a plane, and the process for finding the second and third components is the same as what we just discussed. However, for even a moderate number of dimensions, the process becomes extremely difficult to visualize, so we need to turn to some linear algebra tools.

Covariance Decomposition

Given a collection of data points, we can compute its covariance matrix. This is a quantification of how each variable in the dataset varies both individually and together with other variables. Our 2D iris dataset has the following covariance matrix:

\Sigma = \begin{bmatrix} 3.11628 & 1.27432 \\ 1.27432 &  0.68569 \end{bmatrix}

Here, sepal length and petal length are the first and second variables respectively. The numbers on the main diagonal are their variances. The variance of sepal length is several times that of petal length, which we can also see when we plot the data points. The value on the off-diagonal is the covariance between the two variables; a positive value here indicates that they are positively correlated.

Once we have the covariance matrix, we can perform eigendecomposition on it to obtain a collection of eigenvectors and eigenvalues.

                     \mathbf{e}_1 = [ 0.91928 , 0.39361]  \qquad \lambda_1 = 3.66189

                     \mathbf{e}_2 = [0.39361, -0.91928] \qquad \lambda_2 = 0.14007

We can now make a couple observations. First, the eigenvectors are essentially identical to the principal component directions we found using our repeated line fits. Second, the eigenvalues are equal to the amount of explained variance for the corresponding component. Since covariance matrices can be constructed for any number of dimensions, we now have a method that can be applied whatever size problem we wish to analyze.
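This decomposition is easy to reproduce with a numeric library. The short sketch below runs numpy's symmetric eigendecomposition on the covariance matrix quoted above (the values are taken from this post rather than recomputed from the raw iris data):

import numpy as np

# covariance matrix of the 2D iris example (sepal length, petal length)
sigma = np.array([[3.11628, 1.27432],
                  [1.27432, 0.68569]])

# eigh is appropriate for symmetric matrices; it returns eigenvalues in
# ascending order with the corresponding eigenvectors as columns
eigenvalues, eigenvectors = np.linalg.eigh(sigma)

# the largest eigenvalue and its eigenvector give the first principal component
pc1 = eigenvectors[:, -1]
explained_variance = eigenvalues[-1] / eigenvalues.sum()
print(pc1)                 # approx. [0.91928, 0.39361] (up to sign)
print(explained_variance)  # approx. 0.96315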

Missing Values

When we move beyond the realm of toy datasets like iris, one of the challenges we encounter is how to deal with missing values, as the calculation of the covariance matrix does not admit those. One strategy could be to simply drop all the rows in your dataset that contain missing data. With a high proportion of missing values however, this may lead to an unacceptable loss of data. Instead, BigML’s implementation employs the unbiased estimator described here. First, we form a naive estimate by replacing all missing values with 0 and computing the covariance matrix as normal. Call this estimate \tilde{\Sigma}, and let \mathrm{diag}(\tilde{\Sigma}) be the same matrix with the off-diagonal elements set to 0. We also need to compute a parameter \delta, which is the proportion of values which are not missing. This is easily derived from the dataset’s field summaries. The unbiased estimate is then calculated as:

\Sigma^{(\delta)} = (\delta^{-1} - \delta^{-2})\mathrm{diag}(\tilde{\Sigma}) + \delta^{-2}\tilde{\Sigma}

With our example data, we randomly erase 100 values. Since we have 150 2-dimensional datapoints, that gives us \delta = 2/3. The naive and corrected missing value covariance matrix estimates are:

\tilde{\Sigma}=\begin{bmatrix} 2.25964 & 0.51222  \\ 0.51222 & 0.44881 \end{bmatrix} \qquad \Sigma^{(\delta)} = \begin{bmatrix}3.38946 & 1.15251 \\ 1.15251 & 0.67322\end{bmatrix}

The corrected covariance matrix is clearly much nearer to the “true” values.
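As a rough illustration of the correction, here is a minimal numpy sketch of the estimator, assuming the data has already been mean-centered so that missing entries can simply be set to 0:

import numpy as np

def corrected_covariance(X):
    """Delta-corrected covariance estimate for data with missing values.

    X: (n, d) array with np.nan marking missing entries; assumed to be
    already mean-centered, as described above."""
    delta = 1.0 - np.isnan(X).mean()       # proportion of observed values
    X0 = np.nan_to_num(X, nan=0.0)         # naive estimate: missing -> 0
    sigma_naive = X0.T @ X0 / X0.shape[0]
    diag = np.diag(np.diag(sigma_naive))
    # Sigma_delta = (1/delta - 1/delta^2) * diag + (1/delta^2) * Sigma_naive
    return (delta**-1 - delta**-2) * diag + delta**-2 * sigma_naive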

Non-Numeric Variables

PCA is typically applied to numeric-only data, but BigML’s implementation incorporates some additional techniques which allow for the analysis of datasets containing other data types. With text and items fields, we explode them into multiple numeric fields, one per item in their tag cloud, where the value is the term count.

Categorical fields are slightly more involved. BigML uses a pair of methods called Multiple Correspondence Analysis (MCA) and Factorial Analysis of Mixed Data (FAMD). Each field is expanded into multiple 0/1 valued fields y_1 \ldots y_k, one per categorical value. If p is the mean of each such field, then we compute the shifted and scaled value using one of the two expressions below:

x_{MCA} = \frac{y - p}{J\sqrt{Jp}} \qquad x_{FAMD} = \frac{y - p}{\sqrt{p}}

If the dataset consists entirely of categorical fields, the first equation is used, and J is the total number of fields. Otherwise, we have a mixed datatype dataset and we use the second equation. After these transformations are applied, we then feed the data into our usual PCA solver.
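As an illustration only (not BigML's internal code), the sketch below expands a single categorical column into indicator fields and applies the MCA or FAMD scaling defined above:

import numpy as np

def encode_categorical(column, categories, all_categorical, total_fields):
    """Expand one categorical column into 0/1 indicator fields and apply
    the MCA (all-categorical dataset) or FAMD (mixed dataset) scaling."""
    y = np.array([[1.0 if value == cat else 0.0 for cat in categories]
                  for value in column])
    p = y.mean(axis=0)                       # mean of each indicator field
    if all_categorical:
        J = total_fields                     # total number of fields (MCA)
        return (y - p) / (J * np.sqrt(J * p))
    return (y - p) / np.sqrt(p)              # FAMD scaling for mixed data

# toy usage: a single categorical field in a mixed-type dataset of 3 fields
x = encode_categorical(["red", "blue", "red"], ["red", "blue"],
                       all_categorical=False, total_fields=3)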

We hope this post has shed some light on the technical details of Principal Component Analysis, as well as highlighted the unique technical features BigML brings to the table. To put your new-found understanding to work, head over to the BigML Dashboard and start running some analyses!

Want to know more about PCA?

For further questions and reading, please remember to visit the release page and join the release webinar on Thursday, December 20, 2018. Attendance is FREE of charge, but space is limited so register soon!

Bringing Automated Machine Learning to the All-in-One Data Warehouse

SlicingDice and BigML partner to bring the very first Data Warehouse embedding Automated Machine Learning. This All-in-One solution will provide guardrails for thousands of organizations struggling to keep up with insight discovery and decision automation goals.

Due to the accelerating data growth in our decade, the focus for all businesses has naturally turned to data collection, storage, and computation. Nevertheless, what companies really need is to get useful insights from their data to ultimately automate decision-making processes. This is where Machine Learning can add tremendous value. By applying the right Machine Learning techniques to solve a specific business problem, any company can increase their revenue, optimize resources, improve processes, and automate manual, error-prone tasks. With this vision in mind, BigML and SlicingDice, the leading Machine Learning platform and the unique All-in-One data solution company respectively, are joining forces to provide a more complete and uniform solution that helps businesses get to the desired insights hidden in their data much faster.

BigML and SlicingDice’s partnership embodies the strong commitment from both companies to bring powerful solutions to data problems in a simple fashion. SlicingDice offers an All-in-One cloud-based Data Solution that is easy to use, fast and cost-effective for any data challenge. Thus, the end customers do not need to spend excessive time configuring, pre-processing, and managing their data. SlicingDice will provide Machine Learning-ready datasets for companies to start working on their Machine Learning projects through the BigML platform, which will be seamlessly integrated into the SlicingDice site. As a result, thousands of organizations can make the best of their data by having it all organized and accessible from one platform, to solve and automate Classification, Regression, Time Series Forecasting, Cluster Analysis, Anomaly Detection, Association Discovery, and Topic Modeling tasks thanks to the underlying BigML platform.

This integration is ideal for tomorrow’s data-driven companies that have large volumes of data and need to manage it carefully in a cost-effective manner. Take, for example, a large IoT project that needs to deploy over 150 thousand sensors distributed in several regions with billions of insertions per day. Normally, this type of data could be very costly and difficult to manage and maintain, but with SlicingDice and BigML’s integrated approach, data and process complexities are abstracted while risks are mitigated. The client can then not only visualize all this data in real-time business dashboards for hundreds of users located in different geographical areas but also apply Machine Learning to truly start automating decision making, with a very accessible solution, clearly optimizing their resources in a traceable and proven way.

Francisco Martín, BigML’s CEO shared, “It is critically important to acknowledge the costly challenges enterprises face when having to prepare their data for Machine Learning and automate Machine Learning workflows by themselves. By joining forces with SlicingDice, we aim to drastically simplify such initiatives. Our joint customers will be able to focus on becoming truly data-driven businesses with agile and adaptable decision-making capabilities able to meet the ever-shifting competitive and demand-driven dynamics in their respective industries.”

SlicingDice’s CEO, Gabriel Menegatti, wishes “to enable any and every company to be able to tackle their data challenges using simpler tools, which delivers value to them faster. We wanted to offer companies a data solution that is comprehensive, fast and cheap. We built that. Now, by leveraging BigML’s technology, those companies can take their analytics to the next level, using Machine Learning to take the next step in their data journeys. We’re sure companies can seize this opportunity and ensure data is treated as an asset and not just an operational bottleneck.”

Automating Principal Component Analysis


Today’s post is the fifth one in our series of blog posts about BigML’s unique Principal Component Analysis (PCA) implementation, the latest resource added to our platform. PCA belongs to the Data Preparation phase described in the CRISP-DM methodology, as it involves the creation of a new dataset based on an existing one.

As mentioned in BigML’s previous release, data preparation is a key part of any Machine Learning project, where a large number of operations are often required to get the best out of your data. Now, with PCA available in the BigML Dashboard, the API, and also WhizzML and the bindings for automation, you will be able to transform your data and, this time, achieve dimensionality reduction by removing certain features from your dataset. Let’s dive in to learn how to automate BigML PCA with WhizzML and our Python Bindings. If you are new to WhizzML and would like to start automating your own Machine Learning workflows, we invite you to read this blog post to get started.

automating_pca

Creating a PCA

First of all, we are going to create a PCA from an existing dataset. We assume we want to reduce the number of features in this dataset, translating it from its original form to another with fewer dimensions. The WhizzML code to do just that, without specifying any parameter in the configuration, looks like this:

;; creates a PCA with default configuration
(define my-pca
  (create-pca {"dataset" "dataset/5bcbd2b5421aa9560d000000"}))

The equivalent code using the BigML Python Bindings is:

from bigml.api import BigML

api = BigML()
my_pca = api.create_pca("dataset/5bcbd2b5421aa9560d000000")

This is the simplest way to create a PCA from a dataset, but there are some parameters that users can configure as needed. Now, let’s see the configuration option to replace missing numeric values when creating a PCA:

;; creates a PCA setting the default numeric values
(define my-pca
  (create-pca {"dataset" "dataset/5bcbd2b5421aa9560d000000"
    "default_numeric_value" "median"}))

And the equivalent code in our Python Bindings:

from bigml.api import BigML

api = BigML()
args = {"default_numeric_value": "median"}
my_pca = api.create_pca("dataset/5bcbd2b5421aa9560d000000", args)

This has been a simple example of how to add arguments during the PCA creation. You could similarly set many other values, for instance, the name of the new resource. For a complete list of the parameters available for PCA configuration please check the API documentation.

Creating a new projection

We have seen how PCA translates the data from one space to another, which is why we talk about “projections”. So, let’s assume we have a set of inputs and we want to apply the result of the PCA to them. Following the proper syntax, our set of data should be passed as input_data, which is an object with pairs of field IDs and values in WhizzML.

;; creates a projection for the input data
(define my-projection
  (create-projection {"pca" "pca/5bcbd2b5421aa9560d000001"
    "input_data" {"000000" 3 "000001" "London"}))

And the equivalent code for the Python Bindings passes a dictionary, where the key is the field ID (or the field name) and the value is the value of the field.

from bigml.api import BigML

api = BigML()
input_data = {"000000": 3, "0000001": "London"}
my_projection = api.create_projection("pca/5bcbd2b5421aa9560d000001",
                                      input_data)

Creating batch projections

Once you have created your PCA, it’s very likely you’ll want to apply the same transformation from the example above to different data, and not just to one instance but to a whole set of them. That’s what we call a batch projection. Creating one takes at least two mandatory arguments: the previously created PCA and the dataset we want to project.

;; creates the projection of a new set of data
(define my-batchprojection
  (create-batchprojection
    {"pca" "pca/5bcbd2b5421aa9560d000001"
    "dataset" "dataset/5bcbd2b5421aa9560d000003"}))

The equivalent code for the Python Bindings is:

from bigml.api import BigML

api = BigML()
my_pca = "pca/5bcbd2b5421aa9560d000001"
my_new_dataset = "dataset/5bcbd2b5421aa9560d000003"
my_batch_projection = api.create_batch_projection(my_pca,
                                                  my_new_dataset)

The result of our example will contain all the fields from the original dataset plus the projected principal components as features. Users can also configure the output of the API call further; please check the API documentation to see the available options.

What about dimensionality reduction?

PCA allows you to reduce the number of dimensions in a dataset by creating new features that best describe the variance in your data. The algorithm yields as many Principal Components as there are fields in the original dataset, and the new features are conveniently sorted according to the Percent Variance Explained of the original data. With this information, we can choose to eliminate a fraction of the Principal Component fields in order to reduce the number of features while preserving the maximum amount of useful information.

To help you do that, two key parameters allow you to select how many components of the PCA you want to use in the new mapping space. Those parameters are:

  • max_components represents the integer number of components you want to employ in your new dataset and
  • variance_threshold determines what percentage of the total variance in the original dataset you’d like to capture in the new space.

For example, let’s suppose that you want to explain at least 90% of the variance of your data with the components. The WhizzML code will be as follows:

;; creates a batch projection that explains 90% of the variance
(define my-batchprojection
  (create-batchprojection
    {"pca" "pca/5bcbd2b5421aa9560d000001"
    "dataset" "dataset/5bcbd2b5421aa9560d000003"
    "variance_threshold" 0.9}))

On the other hand, if we used the Python Bindings to code this creation, the equivalent code would be:

from bigml.api import BigML

api = BigML()
my_pca = "pca/5bcbd2b5421aa9560d000001"
my_new_dataset = "dataset/5bcbd2b5421aa9560d000003"
args = {"variance_threshold": 0.9}
my_batch_projection = api.create_batch_projection(my_pca,
                                                  my_new_dataset,
                                                  args)

These two parameters are the most significant ones, but there are many other parameters that can be set for the batch projection creation. Check the complete list here.

Finally, feel free to check out the set of bindings that BigML offers for the most popular programming languages, such as Java or Node.js.

Want to know more about PCA?

If you have any questions or you would like to learn more about how PCA works, please visit the release page and reserve your spot for the upcoming webinar about Principal Component Analysis on Thursday, December 20, 2018. Attendance is FREE of charge, but space is limited so register soon!

A REST API for Principal Component Analysis

As part of our PCA release, we have published a series of blog posts, including a use case and a demonstration of the BigML Dashboard. In this installment, we shift our focus to implementing Principal Component Analysis with the BigML REST API. PCA is a powerful data transformation technique and unsupervised Machine Learning method that can be used for data visualization and dimensionality reduction.

pca-workflow

Authentication

The first step in any BigML workflow using the API is setting up authentication. In order to proceed, you must first set the BIGML_USERNAME and BIGML_API_KEY environment variables, both available on your account page. Once authentication is successfully set up, you can begin executing the rest of this workflow.

export BIGML_USERNAME=my_name
export BIGML_API_KEY=13245
export BIGML_AUTH="username=$BIGML_USERNAME;api_key=$BIGML_API_KEY;"

Create a Source

Data sources can be uploaded to BigML in many different ways, so this step should be appropriately adapted to your data with the help of the API documentation. Here we will create our data source using a local file downloaded from Kaggle.

curl "https://bigml.io/source?$BIGML_AUTH" -F file=@mobile.csv

This particular dataset has a target variable called “price_range”. Using the API we can update the field type easily.

curl "https://bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" \
  -X PUT \
  -d '{"fields": {"price_range": {"optype": "categorical"}}}' \
  -H 'content-type: application/json'

Create Datasets

In BigML, sources need to be processed into datasets.

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"source": "source/4f603fe203ce89bb2d000000"}'

Because we will want to be able to evaluate our model trained using PCA-derived features, we need to split the dataset into a training and test set. Here we will allocate 80% for training and 20% for testing, indicated by the “sample_rate” parameter.

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_dataset": "dataset/59c153eab95b3905a3000054",
  "sample_rate": 0.8,
  "seed": "myseed"}'

By setting the parameter “out_of_bag” to True, we select all the rows that were not selected when creating the training set in order to have an independent test set.

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_dataset": "dataset/59c153eab95b3905a3000054",
  "sample_rate": 0.8,
  "out_of_bag": true,
  "seed": "myseed"}'

Create a PCA

Our datasets are now prepared for PCA. The Principal Components obtained from PCA are linear combinations of the original variables. If the data is going to be used for supervised learning at a later point, it is critical not to include the target variable in the PCA, as it will result in the target variable being present in the covariate fields. As such, we create a PCA using all fields except “price_range”, using the “excluded_fields” parameter.

curl "https://bigml.io/pca?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/59c153eab95b3905a3000054",
  "excluded_fields": ["price_range"]}'

Create Batch Projections

Next up, utilize the newly created PCA resource to perform a PCA batch projection on both the train and test sets. Ensure that all PCs are added as fields to both newly created datasets.

curl "https://bigml.io/batchprojection?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"pca": "pca/5423625af0a5ea3eea000028",
  "dataset": "dataset/59c153eab95b3905a3000054",
  "all_fields": true,
  "output_dataset": true}'

Train a Classifier

After that, using the training set to train a logistic regression model that predicts the “price_range” class is very straightforward.

curl "https://bigml.io/logisticregression?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/5af59f9cc7736e6b33005697",
  "objective_field":"price_range"}'

Model Evaluation

Once ready, evaluate the model using the test set. BigML will provide multiple classification metrics, some of which may be more relevant than others for your use case.

curl "https://bigml.io/evaluation?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/5af5a69cb95b39787700036f",
  "logisticregression": "logisticregression/5af5af5db95b3978820001e0"}'

Want to know more about PCA?

Our final blog posts for this release will include additional tutorials on how to automate PCAs with WhizzML and the BigML Python Bindings. For further questions and reading, please remember to visit the release page and join the release webinar on Thursday, December 20, 2018. Attendance is FREE of charge, but space is limited so register soon!

Principal Component Analysis with the BigML Dashboard: Easy as 1-2-3!

The BigML Team is bringing Principal Component Analysis (PCA) to the BigML platform on December 20, 2018. As explained in our introductory post, PCA is an unsupervised learning technique that can be used in different scenarios such as feature transformation, dimensionality reduction, and exploratory data analysis. In a nutshell, PCA transforms a dataset defined by possibly correlated variables into a set of uncorrelated variables, called principal components.

In this post, we will take you through the five necessary steps to upload your data, create a dataset, create a PCA, analyze the results and finally make projections using the BigML Dashboard.

pca-workflow

We will use the training data of the Fashion MNIST dataset from Kaggle, which contains 60,000 Zalando fashion article images representing 10 different classes of products. Our main goal will be to use PCA to reduce the dimensionality of the dataset, which has 784 fields containing pixel data, in order to build a supervised model that predicts the right product category for each image.

fashion-mnist-dataset.png

1. Upload your Data

Start by uploading your data to your BigML account. BigML offers several ways to do so: you can drag and drop a local file, connect BigML to your cloud repository (e.g., S3 buckets) or copy and paste a URL. In this case, we download the dataset from Kaggle and just drag and drop the file.

BigML automatically identifies the field types. We have 784 input fields containing pixel data, correctly set as numeric by BigML. The objective field “label” is also identified as numeric because it contains digit values from 0 to 9; however, we need to convert this field to categorical since each digit represents a product category rather than a continuous numeric value. We can easily do this by clicking on the “Configure source” option shown in the image below.

configure-source.png

2. Create a Dataset

From your source view, use the 1-click dataset menu option to create a dataset, a structured version of your data ready to be used by a Machine Learning algorithm.

In the dataset view, you will be able to see a summary of your field values, some basic statistics, and the field histograms to analyze your data distributions. You can see that our dataset has a total of 60,000 instances, where each of the 10 classes in the objective field has 6,000 instances.

fashion-mnist-dataset2.png

3. Create a PCA

Before creating the PCA we need to split our dataset into two subsets: 80% for training and 20% for testing. This is because our main goal in building the PCA is to reduce the data dimensionality in order to build a supervised model that predicts the product categories afterwards. If we used the full dataset to build the PCA and then split the resulting dataset into train and test subsets to build the supervised model, we would be introducing data leakage, i.e., the training set would contain information from the test set. However, this split wouldn’t be necessary if we wanted to use PCA for other purposes such as data exploration.

split-dataset.png

Next, we take the 80% training set to create the PCA. You can use the 1-click PCA menu option, which will create the model using the default parameter values, or you can adjust the parameters using the PCA configuration option. Another important thing to consider at this point is that we need to exclude our objective field from the PCA creation to avoid another possible data leakage scenario. Otherwise, we will be mixing information about the objective field into the principal components that we will use as predictors for our supervised model.

create-pca.png

BigML provides the following parameters to configure your PCA:

  • Standardize: allows you to automatically scale numeric fields to a 0-1 range. Standardizing implies assigning equal importance to all the fields regardless of whether they are on the same scale. If fields do not have the same scale and you create a PCA with non-standardized fields, it is often the case that each principal component is dominated by a single field. Thus, BigML enables this parameter by default.
  • Default numeric value: PCA can include missing numeric values as valid values. However, there can be situations for which you don’t want to include them in your model. For those cases, you can easily replace missing numeric values with the field’s mean, median, maximum, minimum or with zero.
  • Sampling: sometimes you don’t need all the data contained in your dataset to generate your PCA. If you have a very large dataset, sampling may very well be a good way to get faster results.

configure-pca.png

4. Analyze your PCA Results

When your PCA is created, you will be able to visualize the results in the PCA view, which is composed of two main parts: the principal component list and the scree plot.

  • The principal component list allows you to see the components created by the PCA (up to 200). Each of the principal components is a linear combination of the original variables, is orthogonal to all other components, and is ordered according to its variance. The variance of each component indicates the total variability of the data explained by that component. In this list view, you can also see the original field weights associated with each component, which indicate each field’s influence on that component.

principal-components-list.png

  • The scree plot helps you graphically see the amount of variance explained by a given subset of components. It can be used to select the subset of components to create a new dataset, either by setting a threshold for the cumulative variance or by limiting the total number of components using the slider shown in the image below. Unfortunately, there is no objective way to decide the optimal number of components for a given cumulative variance; it depends on the data and the problem you are looking to solve, so be sure to apply your best judgment given your knowledge of the context.

scree-plot.png

5. Create Projections

PCA models can be used to project the same or new data points to a new set of axes defined by the principal components. In this case, we want to make projections on our two subsets (the 80% for training and the 20% for testing) so we can replace the original fields by the components calculated by our PCA to create and evaluate a supervised model.

Create a Dataset from the PCA view

If you want to get the components for the same dataset that you used to create the PCA, you can use the “Create dataset” button that BigML provides in the PCA view. This option is a shortcut that creates a batch projection behind the scenes. For our 80% subset, we are using this faster option. We can see in the scree plot that selecting around 300 components (out of the 784 total) using the slider shown in the image below gives us more than 80% of the cumulative variance, which seems a large enough amount to create a new dataset without losing much information from the original data.

create-dataset

After the dataset is created, we can find it listed on our Dashboard. The new dataset will include the original fields used to create the PCA and the new principal components, taking into account the threshold we set.

training-dataset-with-components.png

Create a Batch Projection

If you want to use a different dataset than the one used to create the PCA, then you need to take the long path and click on the “Batch projection” option. We are using this option for our 20% subset. The step-by-step process is explained below.

1. Click on the option “Batch projection” from the PCA view.

batch-projection.png

2. Select the dataset you want to use and, optionally, configure the batch projection. In this case, we are selecting the 20% test subset and limiting the number of components returned to 300 by using the slider shown in the image below (as we did with the training set before).

batch-projection-limits.png

We can also choose whether to remove the original fields. In this case, we are keeping them, since we want to use the same 80% and 20% subsets to build and evaluate two different supervised models: one with the original fields and another with the components.

3. When the batch projection is created, you can find the new dataset containing the components in your dataset list view.

test-dataset-with-components.png

Final Results

Using our reduced dataset with the 300 components, we create a logistic regression to predict the product categories. We also create another logistic regression that uses the original 784 fields that contained the pixel data so we can compare both models’ performances.

logistic-regression-with-components.png

When we evaluate them, we can observe that the performance of the 300-component model (f-measure=0.8449) is almost exactly the same as that of the model using all of the original fields (f-measure=0.8484), despite the fact that we only used ~40% of the original fields. This allows us to reduce model complexity considerably, in turn decreasing training and prediction times.

results-comparison.png

Want to know more about PCA?

If you would like to learn more about Principal Component Analysis and see it in action on the BigML Dashboard, please reserve your spot for our upcoming release webinar on Thursday, December 20, 2018. Attendance is FREE of charge, but space is limited so register soon!

Applying Dimensionality Reduction with PCA to Cancer Data

Principal Component Analysis (PCA) is a powerful and well-established data transformation method that can be used for data visualization, dimensionality reduction, and possibly improved performance with supervised learning tasks. In this use case blog post, we examine a dataset consisting of measurements of benign and malignant tumors, computed from digital images of a fine needle aspirate of breast mass tissue. Specifically, 30 variables describe characteristics of the cell nuclei present in the images, such as texture, symmetry, and radius.

Data Exploration

The first step in applying PCA to this problem is to see whether we can more easily visualize the separation between the malignant and benign classes in two dimensions. To do this, we first divide our dataset into train and test sets and perform the PCA using only the training data. Although this step can be considered feature engineering, it is important to conduct the train-test split prior to performing PCA: the transformation takes into account variability in the whole dataset, so performing it first would let information from the test set leak into the training data.
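This split-then-transform order is easy to reproduce programmatically. Below is a hedged sketch with the BigML Python bindings, using a hypothetical dataset ID and assuming the objective field is named "diagnosis":

from bigml.api import BigML

api = BigML()

full_dataset = "dataset/5c2f1a604e1727247d000004"   # hypothetical ID

# 80/20 split: the same seed with out_of_bag=True yields the complement rows
train = api.create_dataset(full_dataset,
                           {"sample_rate": 0.8, "seed": "tumor-split"})
test = api.create_dataset(full_dataset,
                          {"sample_rate": 0.8, "out_of_bag": True,
                           "seed": "tumor-split"})
api.ok(train)
api.ok(test)

# fit the PCA on the training split only, leaving out the target field
# ("diagnosis" is an assumed field name for this dataset)
pca = api.create_pca(train, {"excluded_fields": ["diagnosis"]})
api.ok(pca)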

pca_scree.png

The resulting PCA resource lists the top Principal Components (PC), sorted by their Percent Variation Explained. In this example, we can see that PC1 accounts for 45.12% of the total variance in the dataset, and the top 7 PCs alone account for approximately 90% of the Percent Variation Explained. We can further explore which of the original fields are contributing the most to the various PCs by inspecting the bar chart provided on the left. The three greatest contributors to PC1 turn out to be “concave points mean”, “concavity mean”, and “concave points worst”.  Based on this information, we can begin to conclude that features related to concavity are highly variable and possibly discriminative.

components.png

A PCA transformation yields new variables which are linear combinations of the original fields. The major advantages of this transformation are that the new fields are not correlated with one another and that each successive Principal Component seeks to maximize the remaining variance in the dataset under the constraint of being orthogonal to the other components. By maximizing variance and decorrelating the features, PCA-created variables can often perform better in supervised learning – especially with model types that have higher levels of bias. However, this is performed at the expense of overall interpretability. Although we can always inspect the contributions of our original variables on the PCA fields, a linear combination of 30 variables will always be less straightforward than simply inspecting the original variable.

Data Visualization

scatterplot.png

After plotting the dataset according to only the top 2 Principal Components (PC1 and PC2) and coloring each data point by diagnosis (benign or malignant), we already can see a considerable separation between the classes. What makes this result impressive is that we used no knowledge of our target variable when creating or selecting for these Principal Component fields. We simply created fields that explained the most variance in the dataset, and they also turned out to have enormous discriminatory power.

Predictive Modeling

Finally, we can evaluate how well our Principal Components fields work as the inputs to a logistic regression classifier. For our evaluation, we trained a logistic regression model with identical hyperparameters (L2 regularization, c=1.0, bias term included) using 4 different sets of variables:

  • All 30 original variables
  • All Principal Components
  • Top 7 PCs (90% PVE)
  • Top 2 PCs only

The results are visualized in the Receiver Operating Characteristic (ROC) curve below, with a malignant diagnosis serving as the positive class and sorted by Area Under the Curve (AUC). In general, all of the models in this example performed very well, with rather subtle differences in performance, even though the number of input variables varied widely. Most notably, a model using only two variables (PC1 and PC2) performed with an AUC of >0.97, very close to the top-performing model with an AUC of >0.99.

roc_auc.png

In keeping with the Occam’s Razor principle of Machine Learning, it is often advantageous to use a simpler model whenever possible. PCA-based dimensionality reduction is one method that enables models to be built with far fewer features while maintaining most of the relevant informational content. As such, we invite you to explore the new PCA feature with your own datasets, both for exploratory visualization tasks and as a preprocessing step.

Want to know more about PCA?

If you would like to learn more about Principal Component Analysis and see it in action on the BigML platform, please reserve your spot for our upcoming release webinar on Thursday, December 20, 2018. Attendance is FREE of charge, but space is limited so register soon!
