
Automated Best-First Feature Selection

In this third post about feature selection scripts in WhizzML, we will introduce the third and final algorithm, Best-First Feature Selection (Best-First). In the first post, we discussed Recursive Feature Selection, and in the second post, we covered Boruta.

Best First Feature Selection with WhizzML

Introduction to Best-First Feature Selection

You can find this script in the BigML Script Gallery. If you want to know more about it, visit its info page.

Best-First selects the n best features for modeling a given dataset, using a greedy algorithm. It starts by creating N models, each of them using only one of the N features of our dataset as input. The feature that yields the model with the best performance is selected. In the next iteration, it creates another set of N-1 models with two input features: the one selected in the previous iteration and another of the N-1 remaining features. Again, the combination of features that gives the best performance is selected. The script stops when it reaches the desired number of features, which is specified in advance by the user.

One improvement we made to this script is the addition of k-fold cross-validation for the model evaluation at each iteration. This ensures that the good or bad performance of a model is not due to chance from a single favorable train/test split.
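To make the loop concrete, here is a minimal, BigML-independent sketch of the greedy selection with k-fold cross-validation, assuming a scikit-learn classifier stands in for the models built by the WhizzML script; the function name and defaults are illustrative only.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def best_first_sketch(X, y, max_n, k=5):
    """Greedily add the feature that most improves cross-validated performance."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_n:
        scores = {}
        for f in remaining:
            candidate = selected + [f]
            model = RandomForestClassifier(n_estimators=50, random_state=0)
            # k-fold cross-validation guards against a single lucky train/test split
            scores[f] = cross_val_score(model, X[:, candidate], y, cv=k).mean()
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

The early-stopping inputs described below would simply wrap this loop with a counter of consecutive iterations whose best score fails to improve by the configured percentage.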

Since this is the most time-consuming of the dimensionality reduction scripts described in this series of posts, another useful feature has been added: early stopping. We can configure the script to stop the execution if a certain number of consecutive iterations pass in which the additional features do not improve the model performance. We created two new inputs for that:

  • early-stop-performance: The minimum performance improvement (in %) used as a threshold to decide whether a new feature performs better than previous iterations.
  • max-low-perf-iterations: The maximum number of consecutive iterations allowed whose improvement falls below the early-stop-performance threshold. It is set as a percentage of the initial number of features in the dataset.

Finally, there are two more inputs that can be very useful:

  • options: It allows you to configure the kind of model that will be created at each iteration and its parameters.
  • pre-selected-fields: List of field IDs to be pre-selected as best features. The script won’t consider them but they will be included in the output.

Feature selection with Best-First Feature Selection

As this is a time-consuming script, we won’t apply it to the full Trucks APS dataset used in the first post, so that you can quickly replicate the results. Instead, we will use a subset of the original dataset with the 29 fields selected by the Boruta script in our second post. Then we will apply these parameters:

We have used a max-n of 20 because that’s the number of features that we want to select. As we want the script to return exactly 20 features, we are using an early-stop-performance value of -100 to bypass the early stop feature. After 1 hour, Best-First selects these 20 fields as important:

"bj_000", "ag_002", "ba_005", "cc_000", "ay_005", "am_0", "ag_001", "cn_000", 
"cn_001", "cn_004","cs_002","ag_003", "az_000", "bt_000", "bu_000", "ee_005", 
"al_000", "bb_000","cj_000", "ee_007"  

In the fourth and final post, we will compare RFE, Boruta, and Best-First to see which one is better suited for different use cases. We will also explore the results of the evaluations performed on the reduced datasets and compare them with the original ones. Stay tuned!

Powering the Next Wave of Intelligent IoT Devices with Machine Learning – Part 1

At BigML, we strive to bring the power of Machine Learning to as many diverse environments as possible. Now you can easily power your Internet of Things (IoT) devices with Classifiers, Regressors, Anomaly Detectors, Deep Neural Networks, and more with the BigML bindings for Node-RED.

The BigML Node-RED bindings aim to make it easier to create and deploy ML-powered IoT devices using one of the most used development environments for IoT: Node-RED. Node-RED is a flow-based, visual programming development tool that allows you to wire together hardware devices, APIs and online services, as part of the Internet of Things. Node-RED provides a web browser-based flow editor which can be used to visually create a JavaScript web service.

Thanks to the BigML Node-RED bindings, you will be able to carry out ML tasks using the BigML platform. For example, tasks such as creating a model from a remote data source, making a prediction using a pre-existing model when a new event occurs, and so on, will be as easy as dragging and dropping the relevant BigML nodes onto the Node-RED canvas and wiring them together. As a bonus, the BigML Node-RED bindings are based on WhizzML, our domain-specific language for automating Machine Learning workflows. This will allow you to easily integrate your Node-RED flows with any advanced ML workflows your use case requires.

Setting up Node-RED with the BigML Node-RED bindings

Let’s see first how you can set up your Node-RED environment. Installing Node-RED is super easy if you already have Node and npm installed on your machine. Just run the following shell command:

$ sudo npm install -g --unsafe-perm node-red

Once Node-RED is installed, you launch it by executing the following command:

$ node-red

Now, you can point your browser to http://localhost:1880/ and access the Node-RED visual flow editor, shown in the image below.

The Node-RED flow editor on launch

Note that there are alternative ways to install and run Node-RED on your machine or IoT device. Check the documentation linked above for more options.

Your first Node-RED flow with BigML: Creating an ensemble

Now that you have Node-RED installed on your machine, we can define a flow to create an ML resource on BigML.

To get a rough idea about the way Node-RED works, let’s create a very basic flow that outputs some JSON to the Node-RED debug console. Once we have that in place, we will add a BigML node to carry out our ML task.

As a first step, just grab the default inject and debug nodes from the node palette on the left side of the Node-RED editor and drop them onto the middle canvas. Then connect the inject node output port to the debug node input port. You should get the flow displayed in the next image:

Your first Node-RED flow

Notice the two blue dots on each of the nodes. That is the Node-RED way of telling you those nodes have changes that have not been deployed yet. When you are ready with your changes, you can deploy them by clicking the red Deploy button in the top-right corner. If everything looks right, Node-RED will update the status of the nodes by removing the blue dot.

You can customize the two nodes by double-clicking on each of them and configuring their options. For now, just click the Deploy button and then the small square box left of the inject node. This will cause a timestamp message to be injected into the flow and reach the debug node, which simply outputs the message payload to the debug console, as shown in the following image.

Your first Node-RED flow in action
Now, let’s build a Machine Learning workflow to create a new model from a remote data source. As you likely know, this requires three steps:

  • Creating a BigML source from your remote data source.
  • Creating a dataset from the source.
  • Finally, creating the model using that dataset.

So, our Node-RED flow will include three nodes, one to create the source, another to create the dataset, and another to create the model.
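For comparison, here is roughly how the same three steps look when scripted with the BigML Python bindings; this is a sketch rather than part of the Node-RED flow, and the CSV URL is just the iris example commonly used in BigML documentation.

from bigml.api import BigML

# Credentials can be passed explicitly, e.g. BigML("username", "api_key"),
# or read from the BIGML_USERNAME and BIGML_API_KEY environment variables.
api = BigML()

source = api.create_source("https://static.bigml.com/csv/iris.csv")
api.ok(source)                # wait until the source is ready

dataset = api.create_dataset(source)
api.ok(dataset)

model = api.create_model(dataset)
api.ok(model)
print(model["resource"])      # resource ID, e.g. "model/..."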

Before doing this, we will need to install the BigML Node-RED bindings, which will require going back to the command line. If you have not modified Node-RED’s default configuration, it will store all of its files in its user data directory, which is ~/.node-red by default. In that directory, you can install any additional Node-RED nodes you would like to use as npm packages. In our present case, just execute the following commands to have the BigML Node-RED bindings installed:

$ cd $HOME/.node-red
$ npm install bigml-nodered

Then restart your node-red process to have it load the new nodes. This should populate your Node-RED node palette with a wealth of new BigML nodes, as the following image shows.

BigML Nodes
Now, you can drag and drop the BigML nodes we mentioned above and connect them as in the following image. Thereafter, we are going to configure the nodes appropriately.

BigML Nodes
To configure each node, double-click it and then set its properties as described below:

Source:

BigML Source configuration

Dataset:

BigML Dataset configuration

Ensemble:

BigML Ensemble configuration

Reify:

BigML Reify configuration

As you can see, each node contains a real wealth of configuration parameters. You can find a thorough description of each of them on the BigML API page. For the sake of this example, we have just modified the description associated with each node.

Before attempting to execute this workflow, one important thing we should consider is authentication. The BigML API, which the BigML Node-RED bindings use, requires users to authenticate themselves by specifying a username and an API key. We should provide this information if we want BigML to execute our flow. The BigML Node-RED bindings support several ways to specify authentication information. For this example, we will provide the username and API key in the message payload. To do this, we will customize our inject node so it initializes the message payload with a specific JSON. First, we will change the inject node’s payload type to JSON, as displayed in the following image.

tutorial-1-10

Next, we will define the JSON as in the following image, so it contains our API credentials.

tutorial-1-11

With the credentials set, we can finally inject the message into the flow, which will effectively start the execution. This flow will create a source, a dataset, and an ensemble in your BigML account using the specified arguments. If you go to your BigML Dashboard, you can check this out for yourself, see how the created resources look, and use them like any other resources in your Dashboard. Since we are using a Node-RED debug node at the end of our flow, we can additionally inspect our flow results in the Node-RED debug sidebar, as shown in the following image.

Flow execution results

There, you can see how each node’s output was stored in the message payload under the corresponding output port name. This enables downstream nodes to use any node’s output, provided it is not overwritten by an intermediate node.

Conclusion

In this first instalment, we have just skimmed the very basics of using BigML with Node-RED. In a second instalment, we will create a more complex flow, which will give us the opportunity to cover important topics such as input-output connections, debugging, and so on. If you are developing IoT devices and would like to leverage our best-in-class Machine Learning algorithms to make them more intelligent, do not hesitate to get in touch with us at support@bigml.com.

Strategic Partnership for Joint Machine Learning Solutions

A1 Digital, a subsidiary of the A1 Telekom Austria Group, and BigML, the leading Machine Learning platform company based in the USA and Spain, have agreed on a strategic partnership for joint Machine Learning solutions. Companies can now use leading Machine Learning solutions based on the European cloud platform Exoscale.

Through the partnership with BigML, A1 Digital aims to accelerate its growth in the Machine Learning market in Europe and beyond via fast-paced Machine Learning innovation and synergies with its European cloud solution Exoscale, allowing A1 Digital’s customers to benefit from a purely European alternative when it comes to Machine Learning platforms.

Machine Learning driven applications allow companies of all sizes to extract value out of their data: e.g., to increase revenues, reduce costs and risks, or improve customer satisfaction and security. BigML’s Machine Learning platform already helps hundreds of organizations worldwide to prepare their data for Machine Learning and to build, evaluate, and deploy predictive models, the essential part of every Machine Learning application, in a fully automated fashion.

“With A1 Digital, we have found the ideal partner to gain an even stronger foothold in Europe”, says Francisco Martin, CEO of BigML. “As businesses are shifting workloads to the cloud due to its attractive economics, it becomes absolutely critical to find trustworthy service providers to run tomorrow’s highly optimized Machine Learning applications driving incremental operational efficiencies and novel digital transformation use cases. BigML’s comprehensive platform on top of the Exoscale cloud can be customized to meet the most demanding functional and operational requirements of any European business.”

“With BigML, we are pleased to have found a partner with whom we can realize one of the key technologies for IoT applications, i.e., for fully networked and intelligent products”, says Elisabetta Castiglioni, CEO of A1 Digital. “BigML’s Machine Learning platform enables us to structure and optimize Machine Learning processes like any other business process. With Exoscale, we offer highly available and high-performance cloud servers and guarantee the highest data security at the same time.”

The impact of this partnership will be most noticeable in Europe, where thousands of companies have access to Exoscale, and therefore to the BigML platform. With BigML, they can easily solve and automate a wide variety of use cases such as fraud detection, customer segmentation, churn analysis, predictive maintenance, propensity to buy, or healthcare diagnosis, among many others, by utilizing Classification, Regression, Time Series Forecasting, Cluster Analysis, Anomaly Detection, Association Discovery, Principal Component Analysis, and Topic Modeling tasks.

Simple Boruta Feature Selection Scripting

In the previous post of this series about feature selection WhizzML scripts, we introduced the problem of having too many features in our dataset, and we saw how Recursive Feature Elimination helps us to detect and remove useless fields. In this second post, we will learn another useful script, Boruta.

Boruta feature selection with WhizzML

Introduction to Boruta

We talked previously about this feature selection script. If you want to know more about it, visit its info page, which also contains the WhizzML code.

The Boruta script uses field importances obtained from an ensemble to mark fields as important or unimportant. It works iteratively, labeling at each iteration the fields that are clearly important or unimportant and leaving the rest of the fields to be labeled in later iterations. The previous version of this script didn’t have any configuration options, so we made the two main parameters of the algorithm configurable by the user:

  • min-gain: Defines the minimum increment in the importance of a field compared to the importance of a field with random values. If the gain is higher than this value, the field is marked as important.
  • max-runs: Maximum number of iterations.

As you can see, there is no n parameter that specifies the number of features to obtain. This is its main difference from the other algorithms: Boruta assumes that the user doesn’t need to know what the optimal number of features is.
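To make the idea concrete, here is a simplified, BigML-independent sketch of the comparison against fields with random values, with a scikit-learn ensemble standing in for BigML’s; the real script accumulates evidence over several runs instead of deciding everything in a single pass, so treat this only as an illustration of min-gain and max-runs.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def boruta_sketch(X, y, min_gain=0.0, max_runs=10, random_state=0):
    """Label each column of X as important, unimportant, or undecided."""
    rng = np.random.default_rng(random_state)
    undecided = list(range(X.shape[1]))
    important, unimportant = [], []

    for _ in range(max_runs):
        if not undecided:
            break
        X_real = X[:, undecided]
        # "Fields with random values": independently shuffled copies of each column.
        X_shadow = np.apply_along_axis(rng.permutation, 0, X_real)
        model = RandomForestClassifier(n_estimators=100, random_state=random_state)
        model.fit(np.hstack([X_real, X_shadow]), y)
        real_imp = model.feature_importances_[:len(undecided)]
        shadow_max = model.feature_importances_[len(undecided):].max()

        still_undecided = []
        for idx, imp in zip(undecided, real_imp):
            if imp > shadow_max * (1 + min_gain):
                important.append(idx)        # clearly beats the random fields
            elif imp < shadow_max:
                unimportant.append(idx)      # no better than random
            else:
                still_undecided.append(idx)  # leave for a later iteration
        undecided = still_undecided
    return important, unimportant, undecided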

Feature Selection with Boruta

Let’s apply Boruta. We will use the dataset described in our previous post which contains information for multiple sensors inside trucks. These will be the inputs that we will use:

Input parameters for Boruta execution

After 50 minutes, Boruta selects the following fields as important:

"cn_000", "bj_000", "az_000", "al_000", "am_0", "bt_000", "ci_000", 
"ag_001", "ag_003", "aq_000", "ag_002", "ck_000", "bu_000", "cn_004",
 "ay_009", "cj_000", "cs_002", "dn_000", "ba_005", "ee_005", "ap_000", 
"az_001", "ay_003", "cc_000", "bb_000", "ee_007", "ay_005", "cn_001", 
"ee_000"

29 fields were marked as important. Fields in bold and italics were also returned by Recursive Feature Elimination, as seen in the previous post. 18 of the 29 fields were returned by RFE.  The ensemble associated with the new filtered dataset has a phi coefficient of 0.84. The phi coefficient of the ensemble that uses the original dataset was 0.824. Boruta achieved a more accurate model!

As we have seen, Boruta can be very useful when we don’t have any idea of the optimal number of features, or we suspect that some features are not contributing at all. Boruta discards fields that are not useful to the model, so we can remove features without hurting model performance. In the third post of this series, we will cover the third script: Best-First Feature Selection. Don’t miss it!

Practical Recursive Feature Selection

With the Summer 2018 Release, Data Transformations were added to BigML. SQL-style queries, feature engineering with the Flatline editor, and options to merge and join datasets are great, but sometimes these are not enough. If we have hundreds of different columns in our dataset, it can be hard to handle them.

We can apply transformations, but to which columns? Which features are useful for our target prediction, and which ones are only increasing resource usage and model complexity?

Later, with the Fall 2018 Release, Principal Component Analysis (PCA) was added to the platform in order to help with dimensionality reduction. PCA helps with the challenge of extracting the discriminative information in the data while removing those fields that only add noise and make it difficult for the algorithm to achieve the expected performance. However, PCA transformation yields new variables which are linear combinations of the original fields, and this can be a problem if we want to obtain interpretable models.

Recursive Feature Elimination with WhizzML

Feature selection algorithms will help you deal with wide datasets. There are four main reasons to obtain the most useful fields in a dataset and discard the others:

  • Memory and CPU: Useless features consume unnecessary memory and CPU.
  • Model performance: Although a good model will be able to detect which features are important in a dataset, the noise generated by useless fields can sometimes confuse our model, and we obtain better performance when we remove them.
  • Cost: Obtaining data is not free. If some columns are not useful, don’t waste your time and money trying to collect them.
  • Interpretability: Reducing the number of features will make our model simpler and easier to understand.

In this series of four blog posts, we will describe three different techniques that can help us in this task: Recursive Feature Elimination (RFE), Boruta algorithm, and Best-First Feature Selection. These three scripts have been created using WhizzML, BigML’s domain-specific language. In the fourth and final post, we will summarize the techniques and provide guidelines for which are better suited depending on your use case.

Some of you may already know about the Best-First and Boruta scripts since we have offered them in the WhizzML Scripts Gallery. We will provide some details about the improvements we made to those and the new script, RFE.

Introduction to Recursive Feature Elimination (RFE)

In this post, we are focusing on Recursive Feature Elimination. You can find it in the BigML Script Gallery. If you want to know more about this script, visit its info page.

This is a completely new script in WhizzML. RFE starts with all the fields and iteratively creates ensembles, removing the least important field at each iteration. The process is repeated until the number of fields (set by the user in advance) is reached. One interesting feature of this script is that it can return the evaluation for each possible number of features, which is very helpful for finding the ideal number of features we should use.
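Here is a rough, BigML-independent sketch of that loop, with a scikit-learn ensemble and the phi coefficient (Matthews correlation) standing in for BigML’s ensembles and evaluations; the names and defaults are illustrative only.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef  # phi coefficient for binary targets

def rfe_sketch(X, y, n, X_test=None, y_test=None):
    """Drop the least important feature at each iteration until n remain."""
    remaining = list(range(X.shape[1]))
    evaluations = []  # (number of features, score) pairs when a test set is given

    while len(remaining) >= n:
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        model.fit(X[:, remaining], y)
        if X_test is not None:
            score = matthews_corrcoef(y_test, model.predict(X_test[:, remaining]))
            evaluations.append((len(remaining), score))
        if len(remaining) == n:
            break
        # Remove the least important remaining feature.
        remaining.pop(int(np.argmin(model.feature_importances_)))
    return remaining, evaluations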

The script input parameters are:

  • dataset-id: input dataset
  • n: final number of features expected
  • objective-id: objective field (target)
  • test-ds-id: test dataset to be used in the evaluations (no evaluations take place if empty)
  • evaluation-metric: metric to be maximized in evaluations (default if empty). Possible classification metrics: accuracy, average_f_measure, average_phi (default), average_precision, and average_recall. Possible regression metrics: mean_absolute_error, mean_squared_error, r_squared (default).

Our dataset: System failures in trucks

This dataset, originally from the UCI Machine Learning Repository, contains information for multiple sensors inside trucks. The dataset consists of trucks with failures, and the objective field determines whether or not the failure comes from the Air Pressure System (APS). This dataset will be useful for us for two reasons:

  • It contains 171 different fields, which is a sufficiently large number for feature selection.
  • Field names have been anonymized for proprietary reasons so we can’t apply domain knowledge to remove useless features.

As it is a very big dataset, we will use a sample of it with 15,000 rows.

Feature Selection with Recursive Feature Elimination

We will start applying Recursive Feature Elimination with the following inputs. We are using n=1 because we want to obtain the performance of every possible subset of features, from 171 down to 1. If we set a higher n, e.g., 50, the script would stop when it reached that number, so we wouldn’t know how smaller subsets of features perform.

Input parameters of RFE execution

After 30 minutes, we obtain an output-features object that contains all the possible subsets of features and their performance. We can use it to create the graph below. From this, we can deduce that the optimal number of features is around 25; from 25 features on, the performance is stable.

Evaluation score as a function of the number of features

Try it yourself with this Python script

Now that we know that we should obtain around 25 features, let’s run the script again to find out which are the optimal 25. This time, as we don’t need to perform evaluations, we won’t pass the test dataset to the script execution.

The script needs 20 minutes to finish the execution. The 25 most important fields that RFE returns are:

"bs_000", "cn_004", "cs_002","cn_000", "dn_000", "ay_008", "ba_005",    
"ee_005", "bj_000", "az_000", "al_000", "am_0", "ay_003", "ci_000", 
"ba_007",  "aq_000", "ag_002", "ee_007", "ck_000", "bc_000", "ay_005", 
"ba_002", "ee_000", "cm_000", "ai_000"

From the script execution, we can obtain a filtered dataset with these 25 fields. The ensemble associated with this filtered dataset has a phi coefficient of 0.815. The phi coefficient of the ensemble that uses the original dataset was only a bit higher, 0.824. That sounds good!

As we have seen, Recursive Feature Elimination is a simple but powerful feature selection script with only a few parameters, serving as a very useful way to get an idea of which features are actually contributing to our model. In the next post, we will see how we can achieve similar results using Boruta. Stay tuned!

Machine Learning meets Social Good to tackle Data Quality Challenges for Enterprises

BigML partners with WorkAround to provide datasets tagged and cleaned by skilled refugees. WorkAround, a crowdsourcing platform for refugees and displaced people, partners with BigML, the leading Machine Learning Platform accessible for everyone, to give more economic opportunities to end users.

In a world of increasing automation, it is easy to forget the human work that goes into making Machine Learning happen. Quality data is the linchpin to accurate outcomes from Machine Learning algorithms, but finding providers that can deliver clean and accurate data is challenging. However, WorkAround makes this possible while working with skilled refugees and displaced people who are otherwise unable to work due to government restrictions, lack of access to banking, and other barriers. With this partnership, BigML customers will enjoy the benefits of having their data cleaned and tagged without the burden of having to perform these tasks by themselves, thus being able to dedicate more time to other strategic tasks.

“I started WorkAround because aid is not a sustainable solution for anyone to move forward,” says Wafaa Arbash, WorkAround’s co-founder and CEO, who watched frustrated as many of her fellow Syrians fled conflict only to be left with few options for employment in host communities, despite having higher education and previous work experience. “People don’t need handouts, they need economic opportunities.”

Although the 1951 UN Refugee Convention signed by 144 countries grants refugees the right to work, the reality is that most host countries block or severely limit local access to jobs. “WorkAround basically saved my life,” said Oro Mahjoob, a WorkArounder since July of 2017, “it gave me a chance to work and earn enough to pay my rent with only having an internet connection and a device.”

Francisco Martin, BigML’s CEO emphasized: “BigML is excited to offer more ways to ensure high-quality data is made available for a variety of Machine Learning tasks executed on our platform. Our mission of democratizing Machine Learning is further extended to cover data preparation thanks to our partnership with WorkAround all the while contributing to a worthy social cause.”

Principal Component Analysis Webinar Video: Dimensionality Reduction Made Easy!

BigML has brought Principal Component Analysis (PCA) to the platform. PCA is a key unsupervised Machine Learning technique used to transform a given dataset in order to yield uncorrelated features and reduce dimensionality. PCA fundamentally transforms a dataset defined by possibly correlated variables into a set of uncorrelated variables, called principal components. When used for dimensionality reduction, these principal components often allow improvements in the results of supervised modeling tasks by reducing overfitting as there remain fewer relationships to consider between variables after the process.

Additionally, BigML PCA’s unique implementation lets you transform many different data types automatically without requiring you to configure it manually. That is, BigML PCA can handle numeric and non-numeric data types, including text, categorical, items fields, as well as combinations of different data types. PCA is ideal for domains with high dimensional data including bioinformatics, quantitative finance, and signal processing, among others.

Now you can easily create BigML PCAs through the BigML Dashboard benefiting from intuitive visualizations, via the REST API if you prefer to work programmatically, or via WhizzML and a wide range of bindings for automation. To see how, please watch the launch webinar video on the BigML YouTube channel.

For further learning about Principal Component Analysis, please visit the release page, where you can find:

  • The slides used during the webinar.
  • The detailed documentation to learn how to use PCA with the BigML Dashboard and the BigML API.
  • The series of blog posts that gradually explain PCA. We start with an introductory post that explains the basic concepts, followed by a use case that presents how to apply dimensionality reduction with PCA to cancer data, and three more posts on how to use and interpret Principal Component Analysis through the BigML Dashboard, API, as well as WhizzML and Python Bindings.

Thanks for watching the webinar and for your positive feedback! As usual, your comments are always welcome, feel free to contact the BigML Team at support@bigml.com.

Recapping BigML’s 2018 in Numbers

Another year has gone by in a hurry in the Machine Learning world of BigML. 2018 saw the interest in Machine Learning from all industries continually get stronger. Not only are we seeing an increase in the level of awareness and sophistication towards productive business applications running existing processes more efficiently, but we’re also witnessing novel use cases turning Machine Learning into an altogether strategic asset. When things happen so fast, one can sometimes find it a challenge to stop and reflect on milestones and achievements. So below are the highlights of what made 2018 another special year for us.

bigml_summary_2018

In 2018, the BigML platform crossed the 80,000 registrations mark worldwide adding to our milestones since inception in 2011. Our users keep making a difference in their workplaces, government agencies, as well as educational institutions all over the map putting the BigML platform to use in the most creative ways.

5 Major Releases + 23 Enhancements

2018 saw a wealth of new features on BigML opening up many more compositional workflows involving the collective resources and the corresponding API endpoints.

  • The year started out with the Operating Thresholds and Organizations capabilities being launched. Operating Thresholds let you better adjust the tradeoff between false positives and false negatives in your classification models, while Organizations help assign different roles and privileges to different users in workgroups.
  • The OptiML release followed, complementing an already impressive array of supervised learning resources for tackling classification and regression problems by applying Bayesian Parameter Optimization.
  • Our Fusions release gave the platform a whole new dimension, allowing users to easily mix and match multiple models into an ensemble regardless of whether the underlying algorithms are different.
  • Keeping the momentum alive we launched Data Transformations, a much-requested collection of new features that let users manipulate and pre-process their data into a Machine Learning-ready format even without any SQL expertise.
  • Finally, Principal Component Analysis (PCA) marks the year-end addition that extends BigML with a practical dimensionality reduction functionality adaptable to all types of input data, e.g., categorical, numeric, and text.

releases_2018

Aside from major releases, we made 23 smaller but noteworthy improvements to the BigML platform, including but not limited to: Feature Engineering with the Flatline Editor, Sliding Windows, SQL in the BigML API, New Text Analysis Options, Prediction Explanation, and many more. You can find a full list of enhancements on our What’s New page in case you’d like to try out the ones you may have missed.

As usual, we’ve also kept BigML Tools updated to make sure insights from BigML resources find their way to different platforms. One such example is the newest version of our Predict App for Zapier, which allows you to easily automate your Machine Learning workflows without any coding by importing your data in real-time from the most popular web apps.

Making Enterprise Machine Learning accessible with BigML Lite

We also need to drop a special mention for BigML Lite, which is the latest addition to BigML’s product lineup.

BigML Lite offers a fast-track route for implementing your first use cases. It is ideal for startups, small to mid-size enterprises, or a single department of a large enterprise ready to get immediate value from Machine Learning with their first predictive use cases. Now, for as low as $10,000/year, any of these scenarios can be realized as a stepping stone to company-wide Machine Learning initiatives with a larger scope.

BigML Dashboard in Chinese

Part of being a global company, with a body of users representing 182 countries on 6 continents, is recognizing the need to customize the user experience to match local expectations.

In 2018, as a result of growing demand in Asia, BigML has taken a giant step by translating the BigML Dashboard to Chinese. You can expect further local customization options down the road as we strive to delight our global users.

498,849 Code Changes through 225 Deploys!

All our releases and new features were made possible by some serious heavy lifting from our product development group, which makes up a large percentage of our 33-FTE-strong BigML Team. Just to give a glimpse of the level of non-stop activity, our team updated our backend codebase 282,786 times, our API codebase 41,413 times, and our Web codebase 174,650 times. These improvements and additions were carried out through 225 production deployments dotting the entire year.

code_lines_2018

58 Events in 5 Continents

It’s hard to match the excitement of connecting with BigMLers during real-life events to hear their stories and receive their feedback. In 2018, we continued the tradition of organizing and delivering Machine Learning schools with VSSML18 and added to the rotation the Machine Learning School in Doha 2018, which enjoyed record attendance. Next year, Seville will also be part of the ML education circuit with MLSEV.

On the industry-specific events side, the 2ML event held in Madrid, Spain, has been fruitful, so we intend to continue this collaboration with Barrabes in the 2019 edition. To kick off the new year, we’ll make a stop in Edinburgh, Scotland in January, sharing our experiences in automating decision making for businesses.

schools_2018

171 Ambassador Applications, 7 Certification Batches

Our Education Program saw continued growth in 2018, with the addition of 171 new applicants to help promote Machine Learning on their campuses, for a total of 265 since we launched the Education Program. BigML Ambassadors span the globe and include students as well as educators. To boot, as part of BigML’s Internship Program, we’ve hired 3 interns who made valuable contributions. We are thrilled to see a continually increasing interest in gaining hands-on Machine Learning experience, having received hundreds of applications for internships and full-time positions throughout the year.

learnml_2018

Moreover, in its second year of existence, our team of expert Machine Learning instructors completed 7 rounds of BigML’s Certification Program passing on their deep expertise to newly minted BigML Engineers.

BigML Preferred Partner Program

Last but not least is the new BigML Preferred Partner Program that we announced in the last quarter of 2018. With 3 levels and different rewards, it enables new synergies covering multiple sales and project delivery collaboration scenarios.

Our first announcement on that front involved Jidoka, a leader in RPA (Robotic Process Automation), and earlier this week we have revealed our partnership with SlicingDice, which offers a highly performant all-in-one data warehouse service that will in the near future come with built-in Machine Learning capabilities. Stay tuned for more partnership announcements in the new year as we’re in discussion with an exciting lineup of new BigML partners that will each make a difference in their respective industry verticals.

84 Blog Posts (so far)

We’ve kept our blog running on all cylinders throughout the year. 84 new posts were added to our blog, which has long been recognized as one of the Top 20 Machine Learning blogs. Below is a selection of posts that were popular on our social media channels in case you’re interested in catching up with some Machine Learning reading during the holidays.

blogposts_2018

Looking Forward to 2019

Hope this post gave a good tour of what’s been happening around our neck of the woods. We fully intend to continue carrying the Machine Learning for everyone flag in the new year. As part of our commitment to democratize Machine Learning by making it simple and beautiful for everyone, we will be sharing more of our insights, customer success stories, and all the new features we will bring you with each new release in 2019. As always, thanks for being part of BigML’s journey!

Principal Component Analysis: Technical Overview

This past week we’ve been blogging about BigML’s new Principal Component Analysis (PCA) feature. In this post, we will continue on that topic and discuss some of the mathematical fundamentals of PCA, and reveal some of the technical details of BigML’s implementation.

A Geometric Interpretation

Let’s revisit our old friend the iris dataset. To simplify visualizations, we’ll just look at two of its variables: petal length and sepal length. Imagine now that we take some line in this 2-dimensional space and we sum the perpendicular distances between this line and each point in the dataset. As we rotate this line around the center of the dataset, the value of this sum changes. You can see this process in the following animation.

What we see here is that the value of this sum reaches a minimum when the line is aligned with the direction in which the dataset is most “stretched out”. We call a vector that points in this direction the first “principal component” of the dataset. It is plotted on the left side of the figure below so you don’t need to make yourself dizzy trying to see it in the animated plot.
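You can reproduce this numerically in a few lines of NumPy. The sketch below sweeps the angle of a line through the centered data and sums squared perpendicular distances (the standard least-squares formulation is assumed here); the minimizing direction comes out very close to the first principal component computed later in this post.

import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data[:, [2, 0]]   # petal length, sepal length
X = X - X.mean(axis=0)            # center the data

angles = np.linspace(0, np.pi, 180, endpoint=False)
sums = []
for theta in angles:
    direction = np.array([np.cos(theta), np.sin(theta)])
    # Component of each point orthogonal to the line through the origin
    # with this direction, i.e. the perpendicular residual.
    residuals = X - np.outer(X @ direction, direction)
    sums.append((residuals ** 2).sum())

best = angles[int(np.argmin(sums))]
print("direction of the minimum:", np.cos(best), np.sin(best))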

The number of principal components is equal to the dimensionality of the dataset, so how do we find the second one for our 2D iris data? We need to first cancel out the influence of the first one. From each data point, we will subtract its projection along the first principal component. As this is the final principal component, we will be left with all the points in a neat line, so we don’t need to go spinning another vector around. This is the result shown on the right side of the figure below.

pca

The importance of a principal component is usually measured by its “explained variance”. Our example dataset has a total variance of 3.80197. After we subtract the first principal component, the total variance is reduced to 0.14007. The explained variance for PC1 is therefore:

\frac{3.80197 - 0.14007}{3.80197} \approx 0.96315

This process can be generalized for data with higher dimensions. For instance, starting with a 3D point cloud, subtracting the first principal component gives you all the points in a plane, and then the process for finding the second and third components is the same as what we just discussed. However, for even a moderate number of dimensions, the process becomes extremely difficult to visualize, so we need to turn to some linear algebra tools.

Covariance Decomposition

Given a collection of data points, we can compute its covariance matrix. This is a quantification of how each variable in the dataset varies both individually and together with other variables. Our 2D iris dataset has the following covariance matrix:

\Sigma = \begin{bmatrix} 3.11628 & 1.27432 \\ 1.27432 &  0.68569 \end{bmatrix}

Here, petal length and sepal length are the first and second variables, respectively. The numbers on the main diagonal are their variances. The variance of petal length is several times that of sepal length, which we can also see when we plot the data points. The value on the off diagonal is the covariance between the two variables. A positive value here indicates that they are positively correlated.

Once we have the covariance matrix, we can perform eigendecomposition on it to obtain a collection of eigenvectors and eigenvalues.

\mathbf{e}_1 = [0.91928, 0.39361] \qquad \lambda_1 = 3.66189

\mathbf{e}_2 = [0.39361, -0.91928] \qquad \lambda_2 = 0.14007

We can now make a couple of observations. First, the eigenvectors are essentially identical to the principal component directions we found using our repeated line fits. Second, the eigenvalues are equal to the amount of explained variance for the corresponding components. Since covariance matrices can be constructed for any number of dimensions, we now have a method that can be applied to whatever size problem we wish to analyze.
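If you want to check these numbers yourself, the textbook computation takes a few lines of NumPy; this is just the standard calculation, not BigML’s implementation, and the eigenvector signs may come out flipped.

import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data[:, [2, 0]]        # petal length, sepal length
cov = np.cov(X, rowvar=False)          # sample covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

order = np.argsort(eigenvalues)[::-1]  # eigh returns ascending order
print("explained variances:", eigenvalues[order])
print("principal components (as columns):")
print(eigenvectors[:, order])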

Missing Values

When we move beyond the realm of toy datasets like iris, one of the challenges we encounter is how to deal with missing values, as the calculation of the covariance matrix does not admit those. One strategy could be to simply drop all the rows in your dataset that contain missing data. With a high proportion of missing values however, this may lead to an unacceptable loss of data. Instead, BigML’s implementation employs the unbiased estimator described here. First, we form a naive estimate by replacing all missing values with 0 and computing the covariance matrix as normal. Call this estimate \tilde{\Sigma}, and let \mathrm{diag}(\tilde{\Sigma}) be the same matrix with the off-diagonal elements set to 0. We also need to compute a parameter \delta, which is the proportion of values which are not missing. This is easily derived from the dataset’s field summaries. The unbiased estimate is then calculated as:

\Sigma^{(\delta)} = (\delta^{-1} - \delta^{-2})\mathrm{diag}(\tilde{\Sigma}) + \delta^{-2}\tilde{\Sigma}

With our example data, we randomly erase 100 values. Since we have 150 2-dimensional datapoints, that gives us \delta = 2/3. The naive and corrected missing value covariance matrix estimates are:

\tilde{\Sigma}=\begin{bmatrix} 2.25964 & 0.51222  \\ 0.51222 & 0.44881 \end{bmatrix} \qquad \Sigma^{(\delta)} = \begin{bmatrix}3.38946 & 1.15251 \\ 1.15251 & 0.67322\end{bmatrix}

The corrected covariance matrix is clearly much nearer to the “true” values.
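Below is a minimal NumPy sketch of this correction, assuming missing entries arrive as NaN and are zero-filled to form the naive estimate, and glossing over centering details; it follows the formula above rather than BigML’s actual code.

import numpy as np

def corrected_covariance(X):
    """Unbiased covariance estimate for data with zero-filled missing values."""
    missing = np.isnan(X)
    delta = 1.0 - missing.mean()          # proportion of values that are not missing
    X_filled = np.where(missing, 0.0, X)  # naive zero-fill
    naive = np.cov(X_filled, rowvar=False)
    diag = np.diag(np.diag(naive))        # off-diagonal elements set to 0
    return (1 / delta - 1 / delta**2) * diag + (1 / delta**2) * naive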

Non-Numeric Variables

PCA is typically applied to numeric-only data, but BigML’s implementation incorporates some additional techniques which allow for the analysis of datasets containing other data types. With text and items fields, we explode them into multiple numeric fields, one per item in their tag cloud, where the value is the term count.

Categorical fields are slightly more involved. BigML uses a pair of methods called Multiple Correspondence Analysis (MCA) and Factorial Analysis of Mixed Data (FAMD). Each field is expanded into multiple 0/1 valued fields y_1 \ldots y_k, one per categorical value. If p is the mean of each such field, then we compute the shifted and scaled value, using one of the two expressions below:

x_{MCA} = \frac{y - p}{J\sqrt{Jp}} \qquad x_{FAMD} = \frac{y - p}{\sqrt{p}}

If the dataset consists entirely of categorical fields, the first equation is used, and J is the total number of fields. Otherwise, we have a mixed datatype dataset and we use the second equation. After these transformations are applied, we then feed the data into our usual PCA solver.
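The following sketch illustrates the expansion and scaling described above on a pandas DataFrame; it is a simplified illustration of the MCA and FAMD expressions, not BigML’s internal code.

import numpy as np
import pandas as pd

def expand_categorical(df):
    """One-hot expand categorical columns and apply the MCA or FAMD scaling."""
    categorical = df.select_dtypes(include=["object", "category"]).columns
    numeric = df.columns.difference(categorical)
    all_categorical = len(numeric) == 0
    J = len(df.columns)                            # total number of fields

    pieces = [] if all_categorical else [df[numeric].astype(float)]
    for col in categorical:
        y = pd.get_dummies(df[col]).astype(float)  # 0/1 fields, one per value
        p = y.mean(axis=0)                         # mean of each 0/1 field
        if all_categorical:
            x = (y - p) / (J * np.sqrt(J * p))     # MCA scaling
        else:
            x = (y - p) / np.sqrt(p)               # FAMD scaling
        pieces.append(x)
    return pd.concat(pieces, axis=1)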

We hope this post has shed some light on the technical details of Principal Component Analysis, as well as highlighted the unique technical features BigML brings to the table. To put your new-found understanding to work, head over to the BigML Dashboard and start running some analyses!

Want to know more about PCA?

For further questions and reading, please remember to visit the release page and join the release webinar on Thursday, December 20, 2018. Attendance is FREE of charge, but space is limited so register soon!

Bringing Automated Machine Learning to the All-in-One Data Warehouse

SlicingDice and BigML partner to bring the very first Data Warehouse embedding Automated Machine Learning. This All-in-One solution will provide guardrails for thousands of organizations struggling to keep up with insight discovery and decision automation goals.

Due to the accelerating data growth in our decade, the focus for all businesses has naturally turned to data collection, storing, and computation. Nevertheless, what companies really need is getting useful insights from their data to ultimately automate decision-making processes. This is where Machine Learning can add tremendous value. By applying the right Machine Learning techniques to solve a specific business problem, any company can increase their revenue, optimize resources, improve processes, and automate manual, error-prone tasks. With this vision in mind, BigML and SlicingDice, the leading Machine Learning platform and the unique All-in-One data solution company respectively, are joining forces to provide a more complete and uniform solution that helps businesses get to the desired insights hidden in their data much faster.

BigML and SlicingDice’s partnership embodies the strong commitment from both companies to bring powerful solutions to data problems in a simple fashion. SlicingDice offers an All-in-One cloud-based Data Solution that is easy to use, fast and cost-effective for any data challenge. Thus, the end customers do not need to spend excessive time configuring, pre-processing, and managing their data. SlicingDice will provide Machine Learning-ready datasets for companies to start working on their Machine Learning projects through the BigML platform, which will be seamlessly integrated into the SlicingDice site. As a result, thousands of organizations can make the best of their data by having it all organized and accessible from one platform, to solve and automate Classification, Regression, Time Series Forecasting, Cluster Analysis, Anomaly Detection, Association Discovery, and Topic Modeling tasks thanks to the underlying BigML platform.

This integration is ideal for tomorrow’s data-driven companies that have large volumes of data and the need to carefully manage their data in a cost-effective manner. Take, for example, a large IoT project that needs to deploy over 150 thousand sensors distributed in several regions with billions of insertions per day. Normally, this type of data could be very costly and difficult to manage and maintain, but with SlicingDice and BigML’s integrated approach, data and process complexities are abstracted while risks are mitigated. The client can then not only visualize all this data in real-time business dashboards for hundreds of users located in different geographical areas, but also apply Machine Learning to truly start automating decision making with a very accessible solution, clearly optimizing their resources in a traceable and proven way.

Francisco Martín, BigML’s CEO shared, “It is critically important to acknowledge the costly challenges enterprises face when having to prepare their data for Machine Learning and automate Machine Learning workflows by themselves. By joining forces with SlicingDice, we aim to drastically simplify such initiatives. Our joint customers will be able to focus on becoming truly data-driven businesses with agile and adaptable decision-making capabilities able to meet the ever-shifting competitive and demand-driven dynamics in their respective industries.”

SlicingDice’s CEO, Gabriel Menegatti, wishes “to enable any and every company to be able to tackle their data challenges using simpler tools, which delivers value to them faster. We wanted to offer companies a data solution that is comprehensive, fast and cheap. We built that. Now, by leveraging BigML’s technology, those companies can take their analytics to the next level, using Machine Learning to take the next step in their data journeys. We’re sure companies can seize this opportunity and ensure data is treated as an asset and not just an operational bottleneck.”
