
In this third post about feature selection scripts in WhizzML, we introduce the third and final algorithm: Best-First Feature Selection (Best-First). In the first post, we discussed Recursive Feature Elimination, and in the second post, we covered Boruta.

# Introduction to Best-First Feature Selection

You can find this script in the BigML Script Gallery. If you want to know more about it, visit its info page.

Best-First selects the n best features for modeling a given dataset, using a greedy algorithm. It starts by creating N models, each of them using only one of the N features of our dataset as input. The feature that yields the best-performing model is selected. In the next iteration, the script creates another set of N-1 models with two input features: the one selected in the previous iteration plus one of the N-1 remaining features. Again, the combination of features that gives the best performance is selected. The script stops when it reaches the desired number of features, which is specified in advance by the user.

One improvement we made to this script is k-fold cross-validation for the model evaluation at each iteration. This ensures that a model's good or bad performance is not produced by chance through a single favorable train/test split.
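As an illustration of this greedy loop with cross-validated scoring, here is a sketch in Python. It is illustrative only: scikit-learn models stand in for the BigML models that the actual WhizzML script builds and evaluates.

```python
# Sketch of Best-First feature selection with k-fold cross-validation;
# scikit-learn stands in for BigML model building here.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def best_first(X, y, max_n, cv=5):
    remaining = list(range(X.shape[1]))
    selected = []
    while remaining and len(selected) < max_n:
        scores = {}
        for feature in remaining:
            columns = selected + [feature]
            model = RandomForestClassifier(n_estimators=50, random_state=0)
            # k-fold CV so one lucky train/test split can't skew the choice
            scores[feature] = cross_val_score(model, X[:, columns], y, cv=cv).mean()
        best = max(scores, key=scores.get)   # greedy: keep the best addition
        selected.append(best)
        remaining.remove(best)
    return selected
```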

Since this is the most time-consuming of the dimensionality reduction scripts described in this series of posts, another useful feature has been added: early stopping. We can configure the script to stop execution if a certain number of iterations passes in which the additional features do not improve model performance. We created two new inputs for that:

• early-stop-performance: A performance improvement threshold (in %) used to decide whether a new feature yields better performance than previous iterations.
• max-low-perf-iterations: The maximum number of consecutive iterations allowed whose improvement stays below early-stop-performance. It is set as a percentage of the initial number of features in the dataset (see the sketch below).

Finally, there are two more inputs that can be very useful:

• options: Allows you to configure the kind of model created at each iteration and its parameters.
• pre-selected-fields: A list of field IDs to be pre-selected as best features. The script won't evaluate them, but they will be included in the output.
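To make the interplay of the two early-stop inputs above concrete, here is a hedged sketch of the bookkeeping. Function and parameter names are ours, not the script's actual WhizzML internals.

```python
# Illustrative early-stop check (hypothetical names; not the script's
# actual WhizzML code).
def should_stop(scores, early_stop_performance, max_low_perf_iterations):
    """scores: cross-validated performance per iteration, oldest first.
    Returns True once `max_low_perf_iterations` consecutive iterations
    improve on the best score so far by less than
    `early_stop_performance` percent."""
    best, streak = scores[0], 0
    for score in scores[1:]:
        denom = abs(best) or 1.0   # guard against a zero baseline
        improvement = 100.0 * (score - best) / denom
        if improvement < early_stop_performance:
            streak += 1
            if streak >= max_low_perf_iterations:
                return True
        else:
            streak = 0
            best = max(best, score)
    return False
```

Note that with early-stop-performance set to a large negative value such as -100, the improvement of a non-negative score can never fall below the threshold, which effectively disables early stopping; this is exactly the trick used in the example below.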

# Feature selection with Best-First Feature Selection

As this is a time-consuming script, we won't apply it to the full Trucks APS dataset used in the first post, so that you can quickly replicate the results. Instead, we will use a subset of the original dataset containing the 29 fields selected by the Boruta script in our second post. Then we will apply these parameters:
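In outline, the key execution inputs look like this (a hedged sketch in Python dict form; only max-n and early-stop-performance are stated explicitly in the text, and the dataset ID is a placeholder):

```python
# Hedged reconstruction of the script inputs used in this example.
inputs = {
    "dataset": "dataset/...",          # the 29-field subset from Boruta
    "max-n": 20,                       # number of features to select
    "early-stop-performance": -100,    # disables early stopping
}
```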

We have used a max-n of 20 because that's the number of features we want to select. As we want the script to return exactly 20 features, we use an early-stop-performance value of -100 to bypass the early-stop feature. After 1 hour, Best-First selects these 20 fields as important:

"bj_000", "ag_002", "ba_005", "cc_000", "ay_005", "am_0", "ag_001", "cn_000",
"cn_001", "cn_004","cs_002","ag_003", "az_000", "bt_000", "bu_000", "ee_005",
"al_000", "bb_000","cj_000", "ee_007"


In the fourth and final post, we will compare RFE, Boruta, and Best-First to see which one is better suited for different use cases. We will also explore the results of the evaluations performed on the reduced datasets and compare them with those of the original ones. Stay tuned!

At BigML, we strive to bring the power of Machine Learning to as many diverse environments as possible. Now you can easily power your Internet of Things (IoT) devices with Classifiers, Regressors, Anomaly Detectors, Deep Neural Networks, and more with the BigML bindings for Node-RED.

The BigML Node-RED bindings aim to make it easier to create and deploy ML-powered IoT devices using one of the most used development environments for IoT: Node-RED. Node-RED is a flow-based, visual programming development tool that allows you to wire together hardware devices, APIs and online services, as part of the Internet of Things. Node-RED provides a web browser-based flow editor which can be used to visually create a JavaScript web service.

Thanks to the BigML Node-RED bindings, you will be able to carry out ML tasks using the BigML platform. For example, tasks such as creating a model from a remote data source, or making a prediction with a pre-existing model when a new event occurs, will be as easy as dragging and dropping the relevant BigML nodes onto the Node-RED canvas and wiring them together. As a bonus, the BigML Node-RED bindings are based on WhizzML, our domain-specific language for automating Machine Learning workflows. This will allow you to easily integrate your Node-RED flows with any advanced ML workflows your use case requires.

## Setting up Node-RED with the BigML Node-RED bindings

Let’s first see how you can set up your Node-RED environment. Installing Node-RED is super easy if you already have Node and npm installed on your machine. Just run the following shell command:

```
$ sudo npm install -g --unsafe-perm node-red
```

Once Node-RED is installed, you can launch it by executing the following command:

```
$ node-red
```


Now, you can point your browser to http://localhost:1880/ and access the Node-RED visual flow editor, shown in the image below.

Note that there are alternative ways to install and run Node-RED on your machine or IoT device. Check the documentation linked above for more options.

## Your first Node-RED flow with BigML: Creating an ensemble

Now that you have Node-RED installed on your machine, we can define a flow to create an ML resource on BigML.

To get a rough idea of the way Node-RED works, let’s create a very basic flow that outputs some JSON to the Node-RED debug console. Once we have that in place, we will add a BigML node to carry out our ML task.

As a first step, just drag the default inject and debug nodes from the node palette on the left side of the Node-RED editor onto the middle canvas. Then connect the inject node’s output port to the debug node’s input port. You should get the flow displayed in the next image:

Notice the two blue dots on each of the nodes. That is the Node-RED way of telling you those nodes have changes that have not been deployed yet. When you are ready with your changes, you can deploy them by clicking the red Deploy button in the top-right corner. If everything looks right, Node-RED will update the status of the nodes by removing the blue dot.

You can customize the two nodes by double-clicking on each of them and configuring their options. For now, just click the Deploy button and then the small square box to the left of the inject node. This will cause a timestamp message to be injected into the flow and reach the debug node, which simply outputs the message payload to the debug console, as shown in the following image.

Now, let’s build a Machine Learning workflow to create a new model from a remote data source. As you likely know, this requires three steps:

• Creating a BigML source from your remote data source.
• Creating a dataset from the source.
• Finally, creating the model using that dataset.

So, our Node-RED flow will include three nodes: one to create the source, another to create the dataset, and a third to create the model.
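For comparison, here is a minimal sketch of these same three steps using the BigML Python bindings (credentials are read from the BIGML_USERNAME and BIGML_API_KEY environment variables; the CSV URL is just an example):

```python
# Source -> dataset -> model, scripted with the BigML Python bindings.
from bigml.api import BigML

api = BigML()  # credentials from BIGML_USERNAME / BIGML_API_KEY

# 1. Create a BigML source from a remote data source
source = api.create_source("https://static.bigml.com/csv/iris.csv")
api.ok(source)   # wait until the source is ready

# 2. Create a dataset from the source
dataset = api.create_dataset(source)
api.ok(dataset)

# 3. Create the model from the dataset
model = api.create_model(dataset)
api.ok(model)
```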

Before doing this, we need to install the BigML Node-RED bindings, which requires going back to the command line. If you have not modified Node-RED’s default configuration, it stores all of its files in its user data directory, which is ~/.node-red by default. In that directory, you can install any additional Node-RED nodes you would like to use as npm packages. In our case, just execute the following command to install the BigML Node-RED bindings:
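The package name below is our assumption; the bindings’ documentation has the definitive one:

```
$ npm install bigml-nodered    # assumed package name
```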

### BigML Dashboard in Chinese

Part of being a global company, with a body of users representing 182 countries on 6 continents, is recognizing the need to customize the user experience to match local expectations.

In 2018, as a result of growing demand in Asia, BigML has taken a giant step by translating the BigML Dashboard to Chinese. You can expect further local customization options down the road as we strive to delight our global users.

## 498,849 Code Changes through 225 Deploys!

All our releases and new features were made possible by some serious heavy lifting from our product development group, which makes up a large percentage of our 33-FTE-strong BigML team. Just to give a glimpse of the level of non-stop activity, our team updated our backend codebase 282,786 times, our API codebase 41,413 times, and our Web codebase 174,650 times. These improvements and additions were carried out through 225 production deployments dotting the entire year.

## 58 Events in 5 Continents

It’s hard to match the excitement of connecting with BigMLers during real-life events to hear their stories and receive their feedback. In 2018, we continued the tradition of organizing and delivering Machine Learning schools with VSSML18, and we added the Machine Learning School in Doha 2018 to the rotation, with record attendance. Next year, Seville will also be part of the ML education circuit with MLSEV.

On the industry-specific events side, the 2ML event held in Madrid, Spain, was fruitful, so we intend to continue this collaboration with Barrabes in the 2019 edition. To kick off the new year, we’ll make a stop in Edinburgh, Scotland in January to share our experiences in automating decision making for businesses.

## 171 Ambassador Applications, 7 Certification Batches

Our Education Program saw continued growth in 2018, with 171 new applicants to help promote Machine Learning on their campuses, bringing the total to 265 since we launched the program. BigML Ambassadors span the globe and include students as well as educators. To boot, as part of BigML’s Internship Program, we hired 3 interns who made valuable contributions. We are thrilled to see continually increasing interest in gaining hands-on Machine Learning experience, having received hundreds of applications for internships and full-time positions throughout the year.

Moreover, in its second year of existence, our team of expert Machine Learning instructors completed 7 rounds of BigML’s Certification Program passing on their deep expertise to newly minted BigML Engineers.

### BigML Preferred Partner Program

Last but not least is the new BigML Preferred Partner Program that we announced in the last quarter of 2018. With 3 levels and different rewards, it enables new synergies covering multiple sales and project delivery collaboration scenarios.

Our first announcement on that front involved Jidoka, a leader in RPA (Robotic Process Automation), and earlier this week we have revealed our partnership with SlicingDice, which offers a highly performant all-in-one data warehouse service that will in the near future come with built-in Machine Learning capabilities. Stay tuned for more partnership announcements in the new year as we’re in discussion with an exciting lineup of new BigML partners that will each make a difference in their respective industry verticals.

## 84 Blog Posts (so far)

We’ve kept our blog firing on all cylinders throughout the year. 84 new posts were added to our blog, which has long been recognized as one of the Top 20 Machine Learning blogs. Below is a selection of posts that were popular on our social media channels, in case you’re interested in catching up on some Machine Learning reading during the holidays.

## Looking Forward to 2019

We hope this post gave you a good tour of what’s been happening around our neck of the woods. We fully intend to continue carrying the Machine Learning for everyone flag in the new year. As part of our commitment to democratizing Machine Learning by making it simple and beautiful for everyone, we will share more of our insights, customer success stories, and all the new features we will bring you with each new release in 2019. As always, thanks for being part of BigML’s journey!

This past week we’ve been blogging about BigML’s new Principal Component Analysis (PCA) feature. In this post, we will continue on that topic and discuss some of the mathematical fundamentals of PCA, and reveal some of the technical details of BigML’s implementation.

# A Geometric Interpretation

Let’s revisit our old friend the iris dataset. To simplify visualizations, we’ll just look at two of its variables: petal length and sepal length. Imagine now that we take some line in this 2-dimensional space and sum the squared perpendicular distances between this line and each point in the dataset. As we rotate this line around the center of the dataset, the value of this sum changes. You can see this process in the following animation.

What we see here is that the value of this sum reaches a minimum when the line is aligned with the direction in which the dataset is most “stretched out”. We call a vector that points in this direction the first “principal component” of the dataset. It is plotted on the left side of the figure below so you don’t need to make yourself dizzy trying to see it in the animated plot.

The number of principal components is equal to the dimensionality of the dataset, so how do we find the second one for our 2D iris data? We need to first cancel out the influence of the first one. From each data point, we will subtract its projection along the first principal component. As this is the final principal component, we will be left with all the points in a neat line, so we don’t need to go spinning another vector around. This is the result shown on the right side of the figure below.

The importance of a principal component is usually measured by its “explained variance”. Our example dataset has a total variance of 3.80197. After we subtract the first principal component, the total variance is reduced to 0.14007. The explained variance for PC1 is therefore:

$\frac{3.80197 - 0.14007}{3.80197} \approx 0.96315$

This process can be generalized for data with higher dimensions. For instance, starting with a 3D point cloud, subtracting the first principal component gives you all the points in a plane, and then the process for finding the second and third components is the same as what we just discussed. However, for even a moderate number of dimensions, the process becomes extremely difficult to visualize, so we need to turn to some linear algebra tools.

# Covariance Decomposition

Given a collection of data points, we can compute its covariance matrix. This is a quantification of how each variable in the dataset varies both individually and together with other variables. Our 2D iris dataset has the following covariance matrix:

$\Sigma = \begin{bmatrix} 3.11628 & 1.27432 \\ 1.27432 & 0.68569 \end{bmatrix}$

Here, petal length and sepal length are the first and second variables, respectively. The numbers on the main diagonal are their variances. The variance of petal length is several times that of sepal length, which we can also see when we plot the data points. The value on the off-diagonal is the covariance between the two variables. A positive value here indicates that they are positively correlated.

Once we have the covariance matrix, we can perform eigendecomposition on it to obtain a collection of eigenvectors and eigenvalues.

$\mathbf{e}_1 = [ 0.91928 , 0.39361] \qquad \lambda_1 = 3.66189$

$\mathbf{e}_2 = [0.39361, -0.91928] \qquad \lambda_2 = 0.14007$

We can now make a couple of observations. First, the eigenvectors are essentially identical to the principal component directions we found using our repeated line fits. Second, the eigenvalues are equal to the amount of explained variance for the corresponding component. Since covariance matrices can be constructed for any number of dimensions, we now have a method that can be applied to a problem of any size we wish to analyze.
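You can reproduce these numbers with a few lines of NumPy (a sketch; scikit-learn is used only to load the iris data):

```python
# Eigendecomposition of the 2D iris covariance matrix shown above.
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data[:, [2, 0]]            # petal length, sepal length
cov = np.cov(X, rowvar=False)              # the covariance matrix above
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]      # largest eigenvalue first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
print(eigenvalues)                         # ~[3.66189, 0.14007]
print(eigenvalues / eigenvalues.sum())     # explained variance ~[0.963, 0.037]
```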

# Missing Values

When we move beyond the realm of toy datasets like iris, one of the challenges we encounter is how to deal with missing values, as the calculation of the covariance matrix does not admit them. One strategy could be to simply drop all the rows in your dataset that contain missing data. With a high proportion of missing values, however, this may lead to an unacceptable loss of data. Instead, BigML’s implementation employs the unbiased estimator described here. First, we form a naive estimate by replacing all missing values with 0 and computing the covariance matrix as normal. Call this estimate $\tilde{\Sigma}$, and let $\mathrm{diag}(\tilde{\Sigma})$ be the same matrix with the off-diagonal elements set to 0. We also need to compute a parameter $\delta$, which is the proportion of values that are not missing. This is easily derived from the dataset’s field summaries. The unbiased estimate is then calculated as:

$\Sigma^{(\delta)} = (\delta^{-1} - \delta^{-2})\,\mathrm{diag}(\tilde{\Sigma}) + \delta^{-2}\tilde{\Sigma}$

With our example data, we randomly erase 100 values. Since we have 150 2-dimensional datapoints, that gives us $\delta = 2/3$. The naive and corrected missing value covariance matrix estimates are:

$\tilde{\Sigma}=\begin{bmatrix} 2.25964 & 0.51222 \\ 0.51222 & 0.44881 \end{bmatrix} \qquad \Sigma^{(\delta)} = \begin{bmatrix}3.38946 & 1.15251 \\ 1.15251 & 0.67322\end{bmatrix}$

The corrected covariance matrix is clearly much nearer to the “true” values.
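A sketch of the estimator (assuming the data is already mean-centered, with np.nan marking missing values and a single global $\delta$ as above):

```python
# Corrected covariance estimate for mean-centered data with missing values.
import numpy as np

def corrected_covariance(X):
    delta = np.mean(~np.isnan(X))                    # fraction observed
    naive = np.cov(np.nan_to_num(X), rowvar=False)   # missing -> 0
    diag = np.diag(np.diag(naive))                   # off-diagonal zeroed
    return (1 / delta - 1 / delta**2) * diag + naive / delta**2
```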

# Non-Numeric Variables

PCA is typically applied to numeric-only data, but BigML’s implementation incorporates some additional techniques that allow for the analysis of datasets containing other data types. Text and items fields are exploded into multiple numeric fields, one per item in the field’s tag cloud, with the term count as the value.

Categorical fields are slightly more involved. BigML uses a pair of methods called Multiple Correspondence Analysis (MCA) and Factorial Analysis of Mixed Data (FAMD). Each field is expanded into multiple 0/1 valued fields $y_1 \ldots y_k$, one per categorical value. If $p$ is the mean of each such field $y$, then we compute the shifted and scaled value $x$ using one of the two expressions below:

$x_{MCA} = \frac{y - p}{J\sqrt{Jp}} \qquad x_{FAMD} = \frac{y - p}{\sqrt{p}}$

If the dataset consists entirely of categorical fields, the first equation is used, and J is the total number of fields. Otherwise, we have a mixed datatype dataset and we use the second equation. After these transformations are applied, we then feed the data into our usual PCA solver.
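As a hedged illustration of this expansion for a single categorical field (function and parameter names are ours, not BigML’s):

```python
# One-hot expand a categorical column, then apply MCA or FAMD scaling.
import numpy as np

def expand_categorical(values, mixed, n_fields=1):
    """values: category labels; mixed: True for FAMD (mixed datatypes),
    False for MCA, with n_fields = J, the total number of fields."""
    categories = sorted(set(values))
    Y = np.array([[1.0 if v == c else 0.0 for c in categories]
                  for v in values])
    p = Y.mean(axis=0)                     # mean of each 0/1 indicator
    if mixed:
        return (Y - p) / np.sqrt(p)        # FAMD
    J = n_fields
    return (Y - p) / (J * np.sqrt(J * p))  # MCA
```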

We hope this post has shed some light on the technical details of Principal Component Analysis, as well as highlighting the unique technical features BigML brings to the table. To put your new-found understanding to work, head over to the BigML Dashboard and start running some analyses!

# Want to know more about PCA?

For further questions and reading, please remember to visit the release page and join the release webinar on Thursday, December 20, 2018. Attendance is FREE of charge, but space is limited so register soon!

SlicingDice and BigML partner to bring the very first Data Warehouse embedding Automated Machine Learning. This All-in-One solution will provide guardrails for thousands of organizations struggling to keep up with insight discovery and decision automation goals.

Due to the accelerating growth of data in this decade, the focus for all businesses has naturally turned to data collection, storage, and computation. Nevertheless, what companies really need is to get useful insights from their data and ultimately automate decision-making processes. This is where Machine Learning can add tremendous value. By applying the right Machine Learning techniques to a specific business problem, any company can increase its revenue, optimize resources, improve processes, and automate manual, error-prone tasks. With this vision in mind, BigML and SlicingDice, the leading Machine Learning platform and the unique All-in-One data solution company respectively, are joining forces to provide a more complete and uniform solution that helps businesses get to the desired insights hidden in their data much faster.

BigML and SlicingDice’s partnership embodies the strong commitment from both companies to bring powerful solutions to data problems in a simple fashion. SlicingDice offers an All-in-One cloud-based Data Solution that is easy to use, fast, and cost-effective for any data challenge. Thus, end customers do not need to spend excessive time configuring, pre-processing, and managing their data. SlicingDice will provide Machine Learning-ready datasets for companies to start working on their Machine Learning projects through the BigML platform, which will be seamlessly integrated into the SlicingDice site. As a result, thousands of organizations can make the most of their data by having it all organized and accessible from one platform, to solve and automate Classification, Regression, Time Series Forecasting, Cluster Analysis, Anomaly Detection, Association Discovery, and Topic Modeling tasks thanks to the underlying BigML platform.

This integration is ideal for tomorrow’s data-driven companies that have large volumes of data and the need to carefully manage their data in a cost-effective manner. Take, for example, a large IoT project that needs to deploy over 150 thousand sensors distributed across several regions with billions of insertions per day. Normally, this type of data would be very costly and difficult to manage and maintain, but with SlicingDice and BigML’s integrated approach, data and process complexities are abstracted away while risks are mitigated. The client can then not only visualize all this data in real-time business dashboards for hundreds of users located in different geographical areas, but also apply Machine Learning to truly start automating decision making, with a very accessible solution that clearly optimizes their resources in a traceable and proven way.

Francisco Martín, BigML’s CEO shared, “It is critically important to acknowledge the costly challenges enterprises face when having to prepare their data for Machine Learning and automate Machine Learning workflows by themselves. By joining forces with SlicingDice, we aim to drastically simplify such initiatives. Our joint customers will be able to focus on becoming truly data-driven businesses with agile and adaptable decision-making capabilities able to meet the ever-shifting competitive and demand-driven dynamics in their respective industries.”

SlicingDice’s CEO, Gabriel Menegatti, wishes “to enable any and every company to be able to tackle their data challenges using simpler tools, which delivers value to them faster. We wanted to offer companies a data solution that is comprehensive, fast and cheap. We built that. Now, by leveraging BigML’s technology, those companies can take their analytics to the next level, using Machine Learning to take the next step in their data journeys. We’re sure companies can seize this opportunity and ensure data is treated as an asset and not just an operational bottleneck.”