
Build your own models to predict the 2019 Oscars

It’s that time of the year when movie fans around the world get glued to their TVs, soaking in everything the Oscars have come to represent over the years: the nominees and the snubs, celebrities, designer outfits, rumors of impending breakups or newcomers making waves, and oh yes, also some of the best movies of the year before. Following in the footsteps of our success last year, we’re getting ready to once again predict which special performance or production deserves to win this year’s gold-plated statues that signify perhaps the highest achievement in the 131-year-old business of fast-moving pictures.

2019 Oscars

This year, in an attempt to involve all our readers in this fun exercise (and a nice intro use case to Machine Learning), we’re publishing the corresponding dataset in the BigML gallery. Rest assured we’ve already done most of the hard work to gather and verify the completeness of the data. It sports 20 categorical, 56 numeric, 42 items, and 1 datetime field totaling 119 fields giving you plenty of details about various aspects of the past nominees and winners. The dataset is organized such that each record represents a unique movie identified by the field movie_id. The first 17 fields have to do with the metadata associated with each movie e.g., release_date, genre, synopsis, duration, metascore.  The following fields are dedicated to recording the outcomes of past Academy Awards and 19 other relevant awards such as Golden Globes, Screen Actors Guild, BAFTA and more.  Finally, we have some automatically generated datetime fields based on the Release Date of the movie entry. Please note that this rather abbreviated dataset comes with the limitation of making predictions based on movie titles only, which means in those instances where multiple persons are nominated from a single movie, you’ll have to make a judgment call between those nominees.

Oscar Nominees 2000-2018

Click on the above image and clone this public dataset to your BigML Dashboard.

To make your own predictions, you’ll need to perform a time split and create a training dataset spanning the period 2000-2017 as well as a test dataset for the movies released in 2018, which are essentially the nominees for the 2019 Oscars. The dataset is prepared to handle multiple awards to save time. So instead of dealing with a different dataset for each award, you can simply select as your target field the award you’re trying to predict and drop the other target fields. For instance, if you’re looking to predict Best Picture, select Oscar_Best_Picture_Won as the target and exclude the rest of the fields sharing the naming convention Oscar_XXXXX_Won.
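If you prefer to prepare the split outside the Dashboard, the idea can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the three toy records, their values, and the second award field name (Oscar_Best_Director_Won) are made up, and only movie_id, release_date, and the Oscar_XXXXX_Won naming convention come from the dataset description above.

```python
from datetime import date

# Toy records standing in for the gallery dataset (values are invented).
movies = [
    {"movie_id": "m1", "release_date": date(2016, 11, 18), "metascore": 99,
     "Oscar_Best_Picture_Won": "yes", "Oscar_Best_Director_Won": "no"},
    {"movie_id": "m2", "release_date": date(2017, 12, 1), "metascore": 87,
     "Oscar_Best_Picture_Won": "no", "Oscar_Best_Director_Won": "yes"},
    {"movie_id": "m3", "release_date": date(2018, 8, 10), "metascore": 83,
     "Oscar_Best_Picture_Won": "", "Oscar_Best_Director_Won": ""},
]

TARGET = "Oscar_Best_Picture_Won"


def drop_other_targets(record, target):
    """Keep the chosen target and drop every other Oscar_XXXXX_Won field."""
    return {
        key: value
        for key, value in record.items()
        if key == target
        or not (key.startswith("Oscar_") and key.endswith("_Won"))
    }


def time_split(records, test_year):
    """Train on movies released before test_year, test on movies from test_year."""
    train = [r for r in records if r["release_date"].year < test_year]
    test = [r for r in records if r["release_date"].year == test_year]
    return train, test


cleaned = [drop_other_targets(m, TARGET) for m in movies]
train, test = time_split(cleaned, 2018)
print(len(train), "training rows,", len(test), "test rows")
```

The same two steps (filter by release year, then exclude the unneeded Oscar_XXXXX_Won fields) are what you would reproduce with the Dashboard's dataset filtering and field selection.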

Here are some additional clues for newbies:

  • Get familiar with the dataset by building some scatterplot visualizations
  • Start with simpler methods like Models or Logistic Regressions to see what fields seem to correlate well with the outcome you’re looking to predict (i.e. use Model Summary Report)
  • Add more sophisticated techniques like Deepnets or Ensembles later on
  • Execute some side by side Evaluation Comparisons to compare your best performing classification models
  • Try an OptiML and see how automatic Machine Learning performs vs. your previous attempts
  • For additional peace of mind, validate models with last year’s predictions as a tie-breaker exercise
  • See if you can build some Fusions from your top classifiers to improve the robustness of your predictions further
  • Compare your predictions to those of human experts, and better yet, see how they deviate by using the handy prediction explanations feature of BigML.
  • BONUS: Go beyond what we supply here and add your own features and Data Transformations to the original movie dataset for an additional edge.

What are you waiting for? Join in the fun, impress some friends, and let us know how your predictions turn out with a shoutout to @bigmlcom on Twitter!

 

Powering the Next Wave of Intelligent Devices with Machine Learning – Part 3

In the second part of this series, we explored how the BigML Node-RED bindings work in more detail and introduced the key concepts of input-output matching and node reification, which will allow you to create more complex flows. In this third and final part of this introductory series, we are going to review what we know about inputs and outputs in a more systematic way, introduce debugging facilities, and present an advanced type of node that allows you to inject WhizzML code directly into your flows.

Details about node inputs and outputs

Each BigML node has a varying number of inputs and outputs, which are embedded in the message payload that Node-RED propagates across nodes. For example, the ensemble node has one input called dataset and one output called ensemble. That means the following two things:

  • An ensemble node expects by default to receive a dataset input. This can be provided by any of the upstream nodes through their outputs, which are added to the message payload, or as a property of the ensemble node configuration.
  • The ensemble output is sent over to downstream nodes with the ensemble key. This is a consequence of the fact that when a node sends an output value, this is appended to the message payload using that node output port label as a key. This way, downstream nodes can use that key to access the output value of that node.

You can change the input and output port labels when you need to connect two nodes whose inputs and outputs do not match. Say for example that a node has an output port label generically named resource and that you want to use that output value in a downstream node that requires a dataset input. You can easily access the upstream node configuration and change the node settings as shown in the following image.

tutorial-1-23.jpg

One thing you should be aware of is that all downstream nodes will be able to see and use any output values generated by upstream nodes, unless another node uses the same key to send its output out. For example, consider the following partial flow, where all inputs and outputs are shown at the same time:

Input/output ports

If you inspect the connection between Lookup Dataset and Dataset Split, you will see that both labels have the value dataset. To reiterate the flow explained above, this will make Lookup Dataset store its output in the message payload under the dataset key. Correspondingly, Dataset Split expects its input under the key dataset, so all will work out just fine.

If you inspect the connection between Dataset Split and Make Model, you will see that Dataset Split produces two outputs, training-dataset and test-dataset, in accordance with its expected behavior, which is splitting a dataset into two parts, one for training a model and the other to later evaluate it. On the other hand, Make Model expects a dataset input.

Now, if you were to run the flow as it is defined, you would not get any error. The flow would be executed through, but it would produce an incorrect result because Make Model would use the dataset value produced by Lookup Dataset instead of the training dataset value produced by Dataset Split.

You have two options to fix this issue: either you change Dataset Split’s output so it uses a dataset label instead of training-dataset; or you modify the Make Model input so it uses training-dataset instead of dataset. In the former case, the dataset value produced by Lookup Dataset will be overridden by the value with the same name produced by Dataset Split.
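To make the key-matching behavior concrete, here is a small Python simulation of how output values accumulate in the message payload. It is a toy model and not the bindings' actual implementation: run_node, the three node functions, and the "dataset/full" style identifiers are all hypothetical. It reproduces the silent error described above and both fixes.

```python
def run_node(payload, inputs, outputs, produce):
    """Read the `inputs` keys from the payload, then store the node's
    results under the `outputs` keys (overriding any existing value)."""
    values = [payload[key] for key in inputs]
    results = produce(*values)
    return {**payload, **dict(zip(outputs, results))}


# Hypothetical stand-ins for the three BigML nodes in the flow.
def lookup_dataset():
    return ("dataset/full",)


def dataset_split(dataset):
    return (dataset + "#train", dataset + "#test")


def make_model(dataset):
    return ("model-from:" + dataset,)


def flow(split_outputs, model_inputs):
    payload = {}
    payload = run_node(payload, [], ["dataset"], lookup_dataset)
    payload = run_node(payload, ["dataset"], split_outputs, dataset_split)
    payload = run_node(payload, model_inputs, ["model"], make_model)
    return payload


# As drawn: no error, but the model is built from the full dataset.
as_drawn = flow(["training-dataset", "test-dataset"], ["dataset"])

# Fix 1: Dataset Split labels its first output "dataset", overriding the lookup value.
fix_output = flow(["dataset", "test-dataset"], ["dataset"])

# Fix 2: Make Model reads its input from "training-dataset".
fix_input = flow(["training-dataset", "test-dataset"], ["training-dataset"])

print(as_drawn["model"])
print(fix_output["model"])
print(fix_input["model"])
```

Both fixes end up training on the split's training portion; the difference is that the first one no longer keeps the original full dataset under the dataset key for nodes further downstream.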

How to debug problems

When you build a flow that causes an error when you run it, a good approach to follow is to force each node to be reified and connected to a debug node that will allow you to inspect the output generated from that node so you can detect any anomalies or unexpected results. This will allow you to make sure that each node sends out a message whose payload actually contains the information downstream nodes expect to receive.

For example, consider the following flow. An error could occur at any node but we will not get any useful information until the whole WhizzML code has been generated and sent to the BigML platform to be executed.

A complex workflow

A rather trivial approach to get more information for each node would be connecting each node to a debug node. This would provide, for each debugged node, the WhizzML code generated at that node. Unfortunately, since this information is available before the WhizzML code is executed, we get no information about the outputs actually produced, which are sent along within the message payload.

Debugging a complex flow

If you enable the reify option for each node, you are actually forcing the execution of each BigML node and thus you will also get to know which outputs each node generates by inspecting its message payload. This can be of great help when, for example, a downstream node complains about some missing information, improperly formatted information, or you simply get the wrong result, e.g., by using a wrong resource.

Additionally, when you reify each node, you will divide the whole WhizzML code that the flow generates into smaller, independent chunks that you will be able to run in the BigML Dashboard, which provides a more user-friendly environment for you to assess why a flow is failing.

To streamline debugging even more, the BigML Node-RED bindings provide two special flags you can specify in the message payload you inject in your flow or inside the flow context. The first one, BIGML_DEBUG_TRACE, will make each node output the WhizzML code it generates on the Node-RED console. So, you do not have to connect each BigML node to a debug node to get that information, although it is perfectly fine if you do.

WhizzML for  evaluation :
(define lookup-dataset-11  (lambda (r) (let (result (head (resource-ids (resources "dataset" (make-map ["name__contains" "limit" "order"] ["iris" 2 "Ascending"])))) ) (merge r (make-map ["dataset"] [result])))))
(define dataset-split-12  (lambda (r) (let (dataset (if (contains? r "dataset") (get r "dataset") "" ) result (create-random-dataset-split dataset 0.75 { "name" "Dataset - Training"} { "name" "Dataset - Test"}) ) (merge r (make-map ["training-dataset" "test-dataset"] result)))))
(define model-13  (lambda (r) (let (training-dataset (if (contains? r "training-dataset") (get r "training-dataset") "" ) result (create-and-wait "model" (make-map [(resource-type training-dataset)] [training-dataset])) ) (merge r (make-map ["model"] [result])))))
(define evaluation-14  (lambda (r) (let (test-dataset (if (contains? r "test-dataset") (get r "test-dataset") "" ) model (if (contains? r "model") (get r "model") "" ) result (create-and-wait "evaluation" (make-map [(resource-type model) "dataset"] [model test-dataset])) ) (merge r (make-map ["evaluation"] [result])))))
(define init {"inputData" {"petal length" 1.35}, "limit" 2, "BIGML_DEBUG_REIFY" false, "BIGML_DEBUG_TRACE" true})
(define lookup-dataset-11-out (lookup-dataset-11 init))
(define dataset-split-12-out (dataset-split-12 lookup-dataset-11-out))
(define model-13-out (model-13 dataset-split-12-out))
(define evaluation-14-out (evaluation-14 model-13-out))

WhizzML for  Filter result :
(define lookup-dataset-11  (lambda (r) (let (result (head (resource-ids (resources "dataset" (make-map ["name__contains" "limit" "order"] ["iris" 2 "Ascending"])))) ) (merge r (make-map ["dataset"] [result])))))
(define dataset-split-12  (lambda (r) (let (dataset (if (contains? r "dataset") (get r "dataset") "" ) result (create-random-dataset-split dataset 0.75 { "name" "Dataset - Training"} { "name" "Dataset - Test"}) ) (merge r (make-map ["training-dataset" "test-dataset"] result)))))
(define model-13  (lambda (r) (let (training-dataset (if (contains? r "training-dataset") (get r "training-dataset") "" ) result (create-and-wait "model" (make-map [(resource-type training-dataset)] [training-dataset])) ) (merge r (make-map ["model"] [result])))))
(define evaluation-14  (lambda (r) (let (test-dataset (if (contains? r "test-dataset") (get r "test-dataset") "" ) model (if (contains? r "model") (get r "model") "" ) result (create-and-wait "evaluation" (make-map [(resource-type model) "dataset"] [model test-dataset])) ) (merge r (make-map ["evaluation"] [result])))))
(define filter-result-15  (lambda (r) (let (evaluation (if (contains? r "evaluation") (get r "evaluation") "" ) result (get (fetch evaluation (make-map ["output_keypath"] ["result"])) "result") ) (merge r (make-map ["evaluation"] [result])))))
(define init {"inputData" {"petal length" 1.35}, "limit" 2, "BIGML_DEBUG_REIFY" false, "BIGML_DEBUG_TRACE" true})
(define lookup-dataset-11-out (lookup-dataset-11 init))
(define dataset-split-12-out (dataset-split-12 lookup-dataset-11-out))
(define model-13-out (model-13 dataset-split-12-out))
(define evaluation-14-out (evaluation-14 model-13-out))
(define filter-result-15-out (filter-result-15 evaluation-14-out))

As you can see, for each node you get the whole WhizzML program that is being generated for the whole flow.

Similarly, BIGML_DEBUG_REIFY will reify each node without requiring you to manually change its configuration. In this case as well, each node will print on the Node-RED console the WhizzML code it attempted to execute:

WhizzML for  evaluation :
(define evaluation-9  (lambda (r) (let (test-dataset (if (contains? r "test-dataset") (get r "test-dataset") "" ) model (if (contains? r "model") (get r "model") "" ) result (create-and-wait "evaluation" (make-map ["dataset" (resource-type model)] [test-dataset model])) ) (merge r (make-map ["evaluation"] [result])))))
(define init {"BIGML_DEBUG_REIFY" true, "BIGML_DEBUG_TRACE" true, "dataset" "dataset/5c3dc6948a318f053900002f", "inputData" {"petal length" 1.35}, "limit" 2, "model" "model/5c489dc33980b5340f007d3a", "test-dataset" "dataset/5c489dbd3514cd374702713c", "training-dataset" "dataset/5c489dbc3514cd3747027139"})
(define evaluation-9-out (evaluation-9 init))

WhizzML for  Filter result :
(define filter-result-10  (lambda (r) (let (evaluation (if (contains? r "evaluation") (get r "evaluation") "" ) result (get (fetch evaluation (make-map ["output_keypath"] ["result"])) "result") ) (merge r (make-map ["evaluation"] [result])))))
(define init {"training-dataset" "dataset/5c489dbc3514cd3747027139", "BIGML_DEBUG_TRACE" true, "model" "model/5c489dc33980b5340f007d3a", "dataset" "dataset/5c3dc6948a318f053900002f", "inputData" {"petal length" 1.35}, "limit" 2, "evaluation" "evaluation/5c489dce3514cd37470271b0", "BIGML_DEBUG_REIFY" true, "test-dataset" "dataset/5c489dbd3514cd374702713c"})
(define filter-result-10-out (filter-result-10 init))

In this case, each code snippet is complete with the inputs provided by the previous node, stored in the init global, so you can more easily check its correctness and/or try to execute it in BigML.
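As the init maps in the listings above suggest, the two flags travel as ordinary boolean keys alongside the rest of the injected payload. The short Python sketch below builds such a payload as JSON that could be pasted into an inject node. The helper name with_debug_flags is ours, and the inputData and limit values are simply copied from the listings.

```python
import json


def with_debug_flags(payload, trace=True, reify=False):
    """Return a copy of the payload with the two debug flags set."""
    flagged = dict(payload)
    flagged["BIGML_DEBUG_TRACE"] = trace
    flagged["BIGML_DEBUG_REIFY"] = reify
    return flagged


# Same payload fields that appear in the init maps of the listings.
base = {"inputData": {"petal length": 1.35}, "limit": 2}

message = with_debug_flags(base, trace=True, reify=True)
print(json.dumps(message, indent=2))
```

Keeping the flags in a helper like this makes it easy to switch them off again once the flow works, without touching any node configuration.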

Injecting WhizzML Code

As we mentioned, WhizzML, BigML’s domain-specific language for defining custom ML workflows, provides the magic behind the BigML Node-RED bindings. This opens up a wealth of possibilities by embedding a node inside of your Node-RED flows to execute generic WhizzML code. In other words, if our bindings for Node-RED do not already provide a specific kind of node for a given task, you can create one with the right WhizzML code that does what you need.

For example, we could consider the following case:

  • We want to predict using an existing ensemble.
  • We calculate the prediction using two different methods, then choose
    the result that has the highest confidence.

To carry through this task in Node-RED, we define the following flow.

Selecting the best prediction

The portion of the flow delimited by the dashed rectangle is the same prediction workflow we described in part 2 of this series. You can then add a new prediction node making sure the two prediction nodes use different settings for Operating kind. You can use Confidence for one, and Votes for the other.

Setting the operating kind

Another detail to note is renaming the two prediction nodes’ output labels so they do not clash. Indeed, if you leave the two nodes with their default output port labels, which will read prediction for both of them, the second prediction node will override the first’s output. So, just use prediction1 and prediction2 as port labels for the two nodes.

Changing the prediction nodes output labels

Finally, add a WhizzML node, available through the left-hand node palette, and configure it as shown in the following image.

WhizzML node to select the best prediction

Since the WhizzML node is going to use the two predictions outputted by the previous nodes, we should also make that explicit in the WhizzML input port label configuration, as shown in the following image:

Specifying the inputs to the WhizzML node

This is the exact code you should paste into the WhizzML field:

(let (p1 ((fetch prediction1) "prediction")
      p2 ((fetch prediction2) "prediction")
      c1 ((fetch prediction1) "confidence")
      c2 ((fetch prediction2) "confidence"))
      (if (> c1 c2) [p1 c1] [p2 c2]))

As you see, the WhizzML node uses prediction1 and prediction2. Those variables must match the labels you defined for the prediction nodes output ports and the WhizzML node input port.
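For readers less familiar with WhizzML syntax, the same selection logic can be written in Python. This is only a reading aid: the two dictionaries stand in for the fetched prediction resources, and their values are invented for illustration.

```python
def best_prediction(prediction1, prediction2):
    """Mirror of the WhizzML snippet: return [prediction, confidence] for
    whichever prediction is more confident (the second one wins ties)."""
    p1, c1 = prediction1["prediction"], prediction1["confidence"]
    p2, c2 = prediction2["prediction"], prediction2["confidence"]
    return [p1, c1] if c1 > c2 else [p2, c2]


# Invented stand-ins for the two fetched prediction resources.
by_confidence = {"prediction": "Iris-setosa", "confidence": 0.92}
by_votes = {"prediction": "Iris-versicolor", "confidence": 0.71}

print(best_prediction(by_confidence, by_votes))
```

Note that, just like the WhizzML version, the comparison is strict, so when both confidences are equal the second prediction is returned.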

Now, if you inject a new message, with the same format as the one used for the prediction use case introduced earlier, you should get the following output:

The selected prediction

Conclusion

We can’t wait to see what developers will be able to create using the BigML Node-RED bindings to make IoT devices that are able to learn from their environment. Let us know how you are using the BigML Node-RED bindings and provide any feedback to support@bigml.com.

Comparing Feature Selection Scripts

In this series about feature selection, the first three posts covered three different WhizzML scripts that can help you with this task: Recursive Feature Elimination, Boruta and Best-First Feature Selection. We explained how they work and the needed parameters for each one of them, applying the scripts to the system failures in trucks dataset described in the first post.

Feature Selection Scripts

As we previously explained, this kind of script can help us deal with wide datasets by selecting the most useful features. They are an interesting alternative to dimensionality reduction algorithms such as Principal Component Analysis (PCA). Furthermore, they provide the advantage that you don’t lose any model interpretability because you are not transforming your features.

Feature Selection algorithms can work in two different ways:

  • They can start using all the fields from the dataset and, iteratively, remove the least important fields. This is how Recursive Feature Elimination and Boruta work.
  • They can start with 0 fields, and, iteratively, add the most important features. This is how Best-First Feature Selection works.

Let’s compare the results from these three scripts. To that end, we have used them with a reduced version of the dataset mentioned previously. This reduced version, the same that we used in the Best-First post, has 29 fields and 15,000 rows.

In the table below, we can see a comparison between the scripts. We have annotated the execution times, the number of output fields, and the number of output fields in common between each pair of scripts. For each script output dataset, we have created and evaluated an ensemble.

  1. *  Using max-runs of 10 and min-gain of 0.01 (default parameters) 
  2.  Using the same input parameters as in the previous post.
  3.  phi-score with the 29 fields dataset is 0.84. 

From these tests, we extract some interesting conclusions:

  • Recursive Feature Elimination is a simple script that runs extremely fast with only a few parameters, all without sacrificing accuracy. Its results are consistent with those from the other scripts.
  • Boruta is a useful script with an interesting property: it is free from user bias because it does not require the n parameter, which represents the number of features to select.
  • Best-First Feature Selection is the most time-consuming of the scripts so we should use it with smaller datasets or on a previously reduced one. However, it is the only one that starts with 0 fields, and the information from the very first iterations is useful to see which are the most important features of our dataset.

The system failures in trucks dataset proved difficult to work with. The large number of fields and their uninformative names made it hard to apply domain knowledge. These scripts helped us automatically obtain the most important features without losing modeling performance.

Now it’s your turn! Try out these new scripts and let us know if you have any feedback at support@bigml.com. What’s more, give WhizzML a try and create your own scripts that help automate your frequent tasks.

Powering the Next Wave of Intelligent Devices with Machine Learning – Part 2

In the first part of this series, we introduced the BigML Node-RED bindings and showed how to install and use them to create a simple BigML-powered flow in Node-RED. In this second installment, we are going to create a second flow which will give us the opportunity to consider in greater detail important concepts such as input-output matching and node reification.

Using an existing ensemble for prediction

As a second example of how you can use BigML with Node-RED, let’s build another flow that will use the ensemble created in our first installment to make a prediction each time a new event comes in.

One great way to identify a BigML resource is through a tag you assign it at creation time. The tag could represent what the ensemble is used for, or any other kind of information that can help you distinguish it from other resources of the same type in your BigML account. For example, you may want at some point to create a new version of that ensemble by including more recent training data. If you keep the same tag for each successive version of the ensemble, you will be able to find all ensembles sharing the same tag and identify the most recent version by looking at the creation date. Another approach is to create a project that houses all the successive versions of the ensemble. In this case, you would not filter based on tags, but rather on the project.

To give more substance to this, we are now going to show how you can create a flow to:

  1. Select the most recent ensemble tagged with a given tag.
  2. Use it to create a remote prediction whenever a new event comes in.
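Step 1 boils down to "filter by tag, then take the newest". The Python sketch below shows that logic on a plain list of resource descriptions. It is an assumption-laden illustration rather than the Find node's implementation: the resource IDs and dates are invented, and we assume each resource exposes resource, tags, and created fields, with created as an ISO timestamp.

```python
from datetime import datetime


def latest_with_tag(resources, tag):
    """Return the ID of the most recently created resource carrying `tag`,
    or None when no resource has that tag."""
    tagged = [r for r in resources if tag in r.get("tags", [])]
    if not tagged:
        return None
    newest = max(tagged, key=lambda r: datetime.fromisoformat(r["created"]))
    return newest["resource"]


# Invented ensemble listings for illustration.
ensembles = [
    {"resource": "ensemble/aaa", "tags": ["ProductionEnsemble"],
     "created": "2019-01-10T09:00:00"},
    {"resource": "ensemble/bbb", "tags": ["ProductionEnsemble", "FraudDetection"],
     "created": "2019-01-23T17:30:00"},
    {"resource": "ensemble/ccc", "tags": ["Experiment"],
     "created": "2019-01-25T08:00:00"},
]

print(latest_with_tag(ensembles, "ProductionEnsemble"))
```

The untagged but newer ensemble/ccc is ignored, which is exactly why keeping the same tag across successive versions works as a lightweight versioning scheme.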

First, we need to have an ensemble with a tag of our liking, so we know which tag to use at step 1. To this end, let’s slightly modify the flow we defined in the previous section to make it assign a tag to the ensemble it creates. For this, just double-click the ensemble node, look up the Tags field, and make its content read like that in the following image.

Assigning a tag to a resource

The Tags field value is ["ProductionEnsemble"] because you can specify any number of tags for your BigML resources. For example, to also assign a FraudDetection tag, you would use
["ProductionEnsemble", "FraudDetection"].

Once you have done that, click the Done button, then the Deploy button, and finally inject a new message with the inject node to have the flow create a new set of resources, including our tagged ensemble.

Now, we can create a new sub-flow in our diagram using a Find node. Find it in the left-hand node palette and drag it onto the canvas area, then double-click it to access its configuration. Here, we want to specify the kind of resource the node should look up and a tag it should contain, as the following image shows:

Finding a tagged resource

When you are done with this, click Done. Then, add a Prediction node to the canvas and connect the Find node output to the Prediction node input. Next, add a Reify node to control the execution of our flow and connect it with the Prediction node.

Finally, we need an inject node sending over your BigML API credentials. You can copy/paste the original inject node you already have. Besides your credentials, the inject node should also inject an event used as an input for the prediction. We do this by adding an input_data field to the JSON that is injected, as shown in the image below.

tutorial-1-15

Now, if we attempt to run this flow by injecting an event (just click the inject node input pad), we will get an error.

An error in our flow

If you look at the error message closely, you will see it displays the node that triggered the error, i.e., the prediction node, and the error cause:

Error detail

This basically means that the prediction node did not find a required value, i.e. model, in the incoming message. If you hover the output port of the Find Latest Production Ensemble node and the input port of the prediction node, you will see the former’s output port is named resource while the latter’s input port is named model.

Error detail

What this means is that:

  • The find node will add a resource property to the message payload.
  • The prediction node will require a model property and an input_data property to work properly.

We made sure the input_data was provided by the inject node, so what must be missing is the model property. The model property represents the model we want to use for the prediction. But, hey! This is exactly what the find node should produce in its output. Hence, the issue here is a mismatch between the find node output and the prediction node input.
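This kind of mismatch is easy to reason about if you list which keys are available to a node and which ones it requires. The helper below is a hypothetical Python sketch of that check (the bindings do not expose such a function); the key names are the ones from the flow discussed here.

```python
def missing_inputs(upstream_output_labels, injected_keys, required_inputs):
    """List the required input keys that neither an upstream node
    nor the injected message provides."""
    available = set(upstream_output_labels) | set(injected_keys)
    return [key for key in required_inputs if key not in available]


prediction_requires = ["model", "input_data"]

# Find node with its default output label: "model" is never provided.
print(missing_inputs(["resource"], ["input_data"], prediction_requires))

# Find node output relabeled to "model": nothing is missing.
print(missing_inputs(["model"], ["input_data"], prediction_requires))
```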

We can fix this by renaming the find node output port to model. To do this, double-click the find node and then display the node settings pane, just below the node properties pane we have been using all the time. Here, add model as an output port name for the only output port that is defined, as shown in the image below.

tutorial-1-16c.jpg

With this in place, click Done, then Deploy and inject a new message. This time the flow will execute correctly and give the following output, where you can see a prediction was created and its outcome stored under the key result.

Error detail

If you wanted to get the prediction outcome under a different key, you’d only have to change the reify node settings and specify that key as the output port name, as the following image displays.

tutorial-1-18

The importance of reifying nodes

In both of our examples above, we have used a special node, called a Reify node, at the end of our flow. This had basically two objectives:

  • Triggering the execution of the flow on BigML. When you create a flow diagram using the BigML Node-RED bindings, what happens behind the scenes is that a WhizzML script is created to carry out that flow. This requires you to tell Node-RED when your flow is complete and you want to execute it.
  • Extracting a value from a resource. Since many BigML operations create new resources, which are identified through a resource ID, the Reify node also serves a different purpose, that of getting the actual resource definition and extracting a specific value from it. We have seen that in action in our last example, where we created a prediction and extracted the output key, which was then sent forth with the payload under the result key or predictionOutcome depending on the node configuration.

On a more abstract level, you could say that you need to reify when you want to go from the BigML/WhizzML realm down to concrete values which you can pass on to other kinds of Node-RED nodes. This means whenever you want to consume the result of a BigML flow in a non-BigML node, you should reify it. We did exactly this in the examples presented here before injecting the BigML node output into the Node-RED debug node. Another situation where you will want to reify your BigML node output is when you connect it to multiple nodes, including to multiple BigML nodes.

Since reifying is such a common step, the BigML Node-RED bindings provide an additional way to reify a node’s output. In fact, you can reify any node output by selecting the Reify property in that node edit panel, as shown in the following image.

tutorial-1-21

You can use this option whenever you want to reify a node and do not need to get the corresponding resource to extract a specific value from it (as the Reify node will allow you to do by providing an output key path as discussed above).

A better way to pass credentials to nodes

We have already looked at how you can provide your BigML credentials so the nodes you create can access your BigML account. Though very easy to do, this option will have your credentials moved along your flow embedded in the message payload. This might not be a good solution for you, so the BigML Node-RED bindings provide an additional way to let your BigML nodes know what BigML account they should access and be able to send the required credentials out.

In addition to sending your BigML credentials with the message payload, you can store them inside the flow context, which is a special data structure Node-RED manages so it is accessible from within a flow. To set flow context properties, you can use a Node-RED standard change node. Drag it from the node palette and then set its properties as shown in the following image.

tutorial-1-21

The change node will only do its work when it gets triggered by an event. So, you should make sure to trigger it before you actually attempt to reify any BigML node. The following image shows how you can do that in a reliable way.

Injecting credentials into a flow context

Conclusion

In this second part of our series about the BigML Node-RED bindings, we discussed how to properly connect inputs and outputs, how to reify nodes, and how to pass your credentials so they are not transmitted across the whole flow. In the next installment of this series, we will present more advanced material, including an in-depth discussion of inputs and outputs, strategies for debugging errors, and how to add a WhizzML processor able to run your own WhizzML code. Let us know how you are using the BigML Node-RED bindings and provide any feedback to support@bigml.com. Stay tuned for part 3!

Automated Best-First Feature Selection

In this third post about feature selection scripts in WhizzML, we will introduce the third and final algorithm, Best-First Feature Selection (Best-First). In the first post, we discussed Recursive Feature Elimination, and in the second post, we covered Boruta.

Best First Feature Selection with WhizzML

Introduction to Best-First Feature Selection

You can find this script in the BigML Script Gallery. If you want to know more about it, visit its info page.

Best-First selects the n best features for modeling a given dataset, using a greedy algorithm. It starts by creating N models, each of them using only one of the N features of our dataset as input. The feature that yields the model with the best performance is selected. In the next iteration, it creates another set of N-1 models with two input features: the one selected in the previous iteration and another of the N-1 remaining features. Again, the combination of features that gives the best performance is selected. The script stops when it reaches the number of desired features which is specified in advance by the user.
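The greedy loop described above can be sketched in a few lines of Python. This is not the WhizzML script from the gallery, just an illustration of the same idea under assumptions: the score argument is a hypothetical callable that returns the performance of a model trained on a given list of features (in the real script this would be a cross-validated evaluation), and toy_score with its weights is invented so the example runs on its own.

```python
def best_first(features, score, n):
    """Greedy forward selection: at each iteration, add the remaining
    feature whose inclusion yields the best score."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < n:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Invented scoring function: each feature contributes a fixed amount,
# but "a" and "b" carry overlapping information.
WEIGHTS = {"a": 50, "b": 30, "c": 10, "d": 5}


def toy_score(subset):
    total = sum(WEIGHTS[f] for f in subset)
    if "a" in subset and "b" in subset:
        total -= 25  # redundancy penalty
    return total


print(best_first(["a", "b", "c", "d"], toy_score, 2))
```

In the toy data, "b" is the second-best feature on its own, yet it is not picked in the second iteration because it adds little once "a" is already selected. That is the main reason to evaluate combinations rather than rank features individually, and also why the script needs so many models: roughly N in the first iteration, N-1 in the second, and so on.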

One improvement we made to this script includes k-fold cross-validation for the model evaluation process at each iteration. This ensures that the good or bad performance of one model is not produced by chance because of a single favorable train/test split.

Since this is the most time-consuming script of the dimensionality reduction scripts described in this series of posts, another useful feature has been added to this script: early-stop. We can configure the script to stop the execution if there are a certain number of iterations where the additional features do not improve the model performance. We created two new inputs for that:

  • early-stop-performance: The performance improvement (in %) used as a threshold to decide whether a new feature performs better than previous iterations.
  • max-low-perf-iterations: The maximum number of consecutive iterations allowed that may have a lower performance than the early-stop performance set. It needs to be set as a percentage of the initial number of features in the dataset.
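One possible reading of these two inputs is sketched below. Several details are our assumptions rather than the script's documented behavior: the improvement is measured relative to the best score so far, the percentage in max-low-perf-iterations is rounded up to a whole number of iterations, and the run stops once the streak of low-improvement iterations exceeds that number.

```python
import math
from typing import List, Optional


def stopping_iteration(
    scores: List[float],
    early_stop_performance: float,
    max_low_perf_iterations: float,
    n_features: int,
) -> Optional[int]:
    """Return the 1-based iteration where the run would stop, or None."""
    # Assumption: the percentage is converted to a whole iteration count.
    allowed = max(1, math.ceil(n_features * max_low_perf_iterations / 100))
    best = scores[0]
    low_streak = 0
    for iteration, current in enumerate(scores[1:], start=2):
        if best == 0:
            improvement = math.inf if current > 0 else 0.0
        else:
            improvement = (current - best) / abs(best) * 100
        if improvement < early_stop_performance:
            low_streak += 1
            if low_streak > allowed:
                return iteration
        else:
            low_streak = 0
        best = max(best, current)
    return None


# Hypothetical scores, one per iteration.
toy_scores = [0.50, 0.60, 0.61, 0.61, 0.60, 0.61]
stop_at = stopping_iteration(toy_scores, 5, 20, n_features=10)
never = stopping_iteration(toy_scores, -100, 20, n_features=10)
```

Under this reading, a threshold of -100 can never be undercut by a non-negative score, so the early stop never triggers.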

Finally, there are two more inputs that can be very useful:

  • options: It allows you to configure the kind of model that will be created at each iteration and its parameters.
  • pre-selected-fields: List of field IDs to be pre-selected as best features. The script won’t consider them but they will be included in the output.

Feature selection with Best-First Feature Selection

As this is a time-consuming script, we won’t apply it to the full Trucks APS dataset used in the first post, so that you can quickly replicate the results. Instead, we will use a subset of the original dataset containing the 29 fields selected by the Boruta script in our second post. Then we will apply these parameters:

We have used a max-n of 20 because that’s the number of features that we want to select. As we want the script to return exactly 20 features, we are using an early-stop-performance value of -100 to bypass the early stop feature. After 1 hour, Best-First selects these 20 fields as important:

"bj_000", "ag_002", "ba_005", "cc_000", "ay_005", "am_0", "ag_001", "cn_000",
"cn_001", "cn_004", "cs_002", "ag_003", "az_000", "bt_000", "bu_000", "ee_005",
"al_000", "bb_000", "cj_000", "ee_007"

In the fourth and final post, we will compare RFE, Boruta, and Best-First to see which one is better suited for different use cases. We will also explore the results of the evaluations performed on the reduced datasets and compare them with the original ones. Stay tuned!

Powering the Next Wave of Intelligent IoT Devices with Machine Learning – Part 1

At BigML, we strive to bring the power of Machine Learning to as many diverse environments as possible. Now you can easily power your Internet of Things (IoT) devices with Classifiers, Regressors, Anomaly Detectors, Deep Neural Networks, and more with the BigML bindings for Node-RED.

The BigML Node-RED bindings aim to make it easier to create and deploy ML-powered IoT devices using one of the most used development environments for IoT: Node-RED. Node-RED is a flow-based, visual programming development tool that allows you to wire together hardware devices, APIs and online services, as part of the Internet of Things. Node-RED provides a web browser-based flow editor which can be used to visually create a JavaScript web service.

Thanks to the BigML Node-RED bindings, you will be able to carry through ML tasks using the BigML platform. For example, tasks such as creating a model from a remote data source, making a prediction using a pre-existing model when a new event occurs, and so on, will be as easy as dragging and dropping the relevant BigML nodes on to the Node-RED canvas and wiring them together. As a bonus, the BigML Node-RED bindings are based on WhizzML, our domain-specific language for automating Machine Learning workflows. This will allow you to easily integrate your Node-RED flows with any advanced ML workflows your use case requires.

Setting up Node-RED with the BigML Node-RED bindings

Let’s first see how you can set up your Node-RED environment. Installing Node-RED is super-easy if you already have Node and npm installed on your machine. Just run the following shell command:

$ sudo npm install -g --unsafe-perm node-red

Once Node-RED is installed, you launch it by executing the following command:

$ node-red

Now, you can point your browser to http://localhost:1880/ and access the Node-RED visual flow editor, shown in the image below.

The Node-RED flow editor on launch

Note that there are alternative ways to install and run Node-RED on your machine or IoT device. Check the documentation linked above for more options.

Your first Node-RED flow with BigML: Creating an ensemble

Now that you have Node-RED installed on your machine, we can define a flow to create an ML resource on BigML.

To get a rough idea about the way Node-RED works, let’s create a very basic flow that outputs some JSON to the Node-RED debug console. Once we have that in place, we will add a BigML node to carry through our ML task.

As a first step, drag the default inject and debug nodes from the node palette on the left side of the Node-RED editor onto the middle canvas. Then connect the inject node’s output port to the debug node’s input port. You should get the flow displayed in the next image:

Your first Node-RED flow

Notice the two blue dots on each of the nodes. That is the Node-RED way of telling you those nodes have changes that have not been deployed yet. When you are ready with your changes, you can deploy them by clicking the red Deploy button in the top-right corner. If everything looks right, Node-RED will update the status of the nodes by removing the blue dot.

You can customize the two nodes by double-clicking on each of them and configuring their options. For now, just click the Deploy button and then the small square box to the left of the inject node. This injects a timestamp message into the flow. When the message reaches the debug node, the node simply outputs the message payload to the debug console, as shown in the following image.

Your first Node-RED flow in action

Now, let’s build a Machine Learning workflow to create a new model from a remote data source. As you likely know, this requires three steps:

  • Creating a BigML source from your remote data source.
  • Creating a dataset from the source.
  • Finally, creating the model using that dataset.

So, our Node-RED flow will include three nodes, one to create the source, another to create the dataset, and another to create the model.

Before doing this, we need to install the BigML Node-RED bindings, which requires going back to the command line. If you have not modified Node-RED’s default configuration, it stores all of its data in its user directory, which is ~/.node-red by default. In that directory, you can install any additional Node-RED nodes you would like to use as npm packages. In our case, just execute the following commands to install the BigML Node-RED bindings:

cd $HOME/.node-red
npm install bigml-nodered

Then restart your node-red process to have it load the new nodes. This should populate your Node-RED node palette with a wealth of new BigML nodes, as the following image shows.

BigML Nodes

Now, you can drag and drop the BigML nodes we mentioned above and connect them as in the following image. Thereafter, we are going to configure the nodes appropriately.

BigML Nodes

To configure each node, double-click it and then set its properties as described below:

Source:

BigML Source configuration

Dataset:

BigML Dataset configuration

Ensemble:

BigML Ensemble configuration

Reify:

BigML Reify configuration

As you can see, each node contains a real wealth of configuration parameters. You can find a thorough description of each of them on the BigML API page. For the sake of this example, we have just modified the description associated with each node.

Before attempting to execute this workflow, one important thing we should consider is authentication. The BigML API, which the BigML Node-RED bindings use, requires a user to authenticate themselves by specifying a username and an API key. We should provide this information if we want BigML to execute our flow. The BigML Node-RED bindings support several ways to specify authentication information. For this example, we will resort to providing username and API key in the payload message. To do this, we will customize our inject node so it initializes the message payload with a specific JSON. First, we will change the inject node type to make it a JSON node, as displayed in the following image.

tutorial-1-10

Next, we will define the JSON as in the following image, so it contains our API credentials.

tutorial-1-11

With the credentials set, we can finally inject the message into the flow, which will effectively start the execution. This flow will create a source, a dataset, and an ensemble in your BigML account using the specified arguments. If you go to your BigML Dashboard, you can check this out for yourself and see how the created resources look and use them as any other resources that exist in your Dashboard. Since we are using a Node-RED debug node at the end of our flow, we can additionally inspect our flow results in Node-RED debug sidebar, as shown in the following image.

Flow execution results

There, you can see how each node’s output was stored in the message payload under the corresponding output port name. This property enables the use of any node’s output in downstream nodes – provided they are not overridden by any intermediate node.
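The way outputs accumulate in the payload can be pictured with a plain Python dictionary. This is a simulation of the behavior described above, not the bindings' code: the run_node helper, the port names, and the resource IDs are hypothetical placeholders.

```python
from typing import Dict


def run_node(payload: Dict[str, str], output_port: str, resource_id: str) -> Dict[str, str]:
    """Return a copy of the payload with this node's output stored under its port name."""
    updated = dict(payload)
    updated[output_port] = resource_id
    return updated


payload: Dict[str, str] = {}
payload = run_node(payload, "source", "source/placeholder-1")
payload = run_node(payload, "dataset", "dataset/placeholder-1")
payload = run_node(payload, "ensemble", "ensemble/placeholder-1")

# A later node that reuses an existing port name overrides the earlier value.
overridden = run_node(payload, "source", "source/placeholder-2")
```

The last line shows the caveat from the text: once a node writes to a port name that is already in use, downstream nodes only see the newer value.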

Conclusion

In this first installment, we have just skimmed the very basics of using BigML with Node-RED. In the second installment, we will create a more complex flow, which will give us the opportunity to cover important topics such as input-output connections, debugging, and so on. If you are developing IoT devices and would like to leverage our best-in-class Machine Learning algorithms to make them more intelligent, do not hesitate to get in touch with us at support@bigml.com.

Strategic Partnership for Joint Machine Learning Solutions

A1 Digital, a subsidiary of the A1 Telekom Austria Group, and BigML, the leading Machine Learning platform company based in the USA and Spain, have agreed on a strategic partnership for joint Machine Learning solutions. Companies can now use leading Machine Learning solutions based on the European cloud platform Exoscale.

With the partnership with BigML, A1 Digital aims to accelerate its growth in the Machine Learning market in Europe and beyond through fast-paced Machine Learning innovation and synergies with its European cloud solution Exoscale that will allow A1 Digital’s customers to benefit from a purely European alternative when it comes to Machine Learning platforms.

Machine Learning driven applications allow companies of all sizes to extract value out of their data: e.g., to increase revenues, to reduce costs and risks, or to improve customer satisfaction or security. BigML’s Machine Learning platform already helps hundreds of organizations worldwide to prepare their data for Machine Learning and to build, evaluate, and deploy predictive models, the essential part of every Machine Learning application, in a fully automated fashion.

“With A1 Digital, we have found the ideal partner to gain an even stronger foothold in Europe”, says Francisco Martin, CEO of BigML. “As businesses are shifting workloads to the cloud due to its attractive economics, it becomes absolutely critical to find trustworthy service providers to run tomorrow’s highly optimized Machine Learning applications driving incremental operational efficiencies and novel digital transformation use cases. BigML’s comprehensive platform on top of the Exoscale cloud can be customized to meet the most demanding functional and operational requirements of any European business.”

“With BigML, we are pleased to have found a partner with whom we can realize one of the key technologies for IoT applications, i.e., for fully networked and intelligent products”, says Elisabetta Castiglioni, CEO of A1 Digital. “BigML’s Machine Learning platform enables us to structure and optimize Machine Learning processes like any other business process. With Exoscale, we offer highly available and high-performance cloud servers and guarantee the highest data security at the same time.”

The impact of this partnership is quite noticeable mostly in Europe, where thousands of companies have access to Exoscale, and therefore to the BigML platform. With BigML they can easily solve and automate a wide variety of use cases such as fraud detection, customer segmentation, churn analysis, predictive maintenance, propensity to buy, or healthcare diagnosis, among many others, by utilizing Classification, Regression, Time Series Forecasting, Cluster Analysis, Anomaly Detection, Association Discovery, Principal Component Analysis, and Topic Modeling tasks.

Simple Boruta Feature Selection Scripting

In the previous post of this series about feature selection WhizzML scripts, we introduced the problem of having too many features in our dataset, and we saw how Recursive Feature Elimination helps us to detect and remove useless fields. In this second post, we will learn another useful script, Boruta.

Boruta feature selection with WhizzML

Introduction to Boruta

We talked previously about this feature selection script. If you want to know more about it, visit its info page, which also contains the WhizzML code.

The Boruta script uses field importances obtained from an ensemble to mark fields as important or unimportant. It does this iteratively, labeling on each iteration the fields that are clearly important or unimportant and leaving the rest of the fields to be labeled in later iterations. The previous version of this script didn’t have any configuration options, so we made the two main parameters of the algorithm configurable by the user:

  • min-gain: Defines the minimum increment of the importance of one field compared to the importance of a field with random values. If the gain is higher than the value set, then it will be marked as important.
  • max-runs: Maximum number of iterations.

As you can see, there is no n parameter that specifies the number of features to obtain. This is its main difference vs. other algorithms. Boruta assumes that the user doesn’t need to know what the optimal number of features is.
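The labeling loop can be sketched as follows. This is a simplified illustration, not the WhizzML script: importance_fn is a hypothetical callback standing in for "train an ensemble and read its field importances", the gain is taken as a plain difference, and dropping any field that scores below the random field is our assumption about how "unimportant" is decided.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical callback: given the fields still in play, return each field's
# importance plus the importance of a field filled with random values.
ImportanceFn = Callable[[List[str]], Tuple[Dict[str, float], float]]


def boruta_sketch(
    fields: List[str],
    importance_fn: ImportanceFn,
    min_gain: float,
    max_runs: int,
) -> Tuple[List[str], List[str]]:
    """Return (important fields, fields still undecided after max_runs)."""
    important: List[str] = []
    undecided = list(fields)
    for _ in range(max_runs):
        if not undecided:
            break
        importances, random_importance = importance_fn(important + undecided)
        next_undecided = []
        for field in undecided:
            gain = importances[field] - random_importance
            if gain > min_gain:
                important.append(field)
            elif gain >= 0:
                next_undecided.append(field)
            # Assumption: a field scoring below the random one is dropped.
        undecided = next_undecided
    return important, undecided


# Toy importances (made up for illustration only).
toy = {"x": 0.50, "y": 0.12, "z": 0.01}
important, undecided = boruta_sketch(
    list(toy), lambda fs: ({f: toy[f] for f in fs}, 0.10), min_gain=0.10, max_runs=3
)
```

Note that nothing in the loop asks for a target number of fields: the size of the result falls out of min-gain and max-runs.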

Feature Selection with Boruta

Let’s apply Boruta. We will use the dataset described in our previous post which contains information for multiple sensors inside trucks. These will be the inputs that we will use:

Input parameters for Boruta execution

After 50 minutes, Boruta selects the following fields as important:

"cn_000", "bj_000", "az_000", "al_000", "am_0", "bt_000", "ci_000",
"ag_001", "ag_003", "aq_000", "ag_002", "ck_000", "bu_000", "cn_004",
"ay_009", "cj_000", "cs_002", "dn_000", "ba_005", "ee_005", "ap_000",
"az_001", "ay_003", "cc_000", "bb_000", "ee_007", "ay_005", "cn_001",
"ee_000"

29 fields were marked as important. Fields in bold and italics were also returned by Recursive Feature Elimination, as seen in the previous post. 18 of the 29 fields were returned by RFE.  The ensemble associated with the new filtered dataset has a phi coefficient of 0.84. The phi coefficient of the ensemble that uses the original dataset was 0.824. Boruta achieved a more accurate model!
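If you want to check the overlap yourself, a set intersection is enough. The two lists below are copied from the Boruta output above and from the RFE output in the previous post; only the variable names are ours.

```python
boruta_fields = {
    "cn_000", "bj_000", "az_000", "al_000", "am_0", "bt_000", "ci_000",
    "ag_001", "ag_003", "aq_000", "ag_002", "ck_000", "bu_000", "cn_004",
    "ay_009", "cj_000", "cs_002", "dn_000", "ba_005", "ee_005", "ap_000",
    "az_001", "ay_003", "cc_000", "bb_000", "ee_007", "ay_005", "cn_001",
    "ee_000",
}

rfe_fields = {
    "bs_000", "cn_004", "cs_002", "cn_000", "dn_000", "ay_008", "ba_005",
    "ee_005", "bj_000", "az_000", "al_000", "am_0", "ay_003", "ci_000",
    "ba_007", "aq_000", "ag_002", "ee_007", "ck_000", "bc_000", "ay_005",
    "ba_002", "ee_000", "cm_000", "ai_000",
}

# Fields selected by both scripts.
shared = sorted(boruta_fields & rfe_fields)
print(len(shared), shared)
```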

As we have seen, Boruta can be very useful when we don’t have any idea of the optimal number of features or we suppose that there are some features which are not contributing at all. Boruta discards fields which are not useful at all for the model. Therefore, we are removing features without subtracting from the model performance. In the third post of this series, we will cover the third script: Best First Feature Selection. Don’t miss it!

Practical Recursive Feature Selection

With the Summer 2018 Release, Data Transformations were added to BigML. SQL-style queries, feature engineering with the Flatline editor, and options to merge and join datasets are great, but sometimes these are not enough. If we have hundreds of different columns in our dataset, it can be hard to handle them.

We are able to apply transformations but to which columns? Which features are useful for our target prediction and which ones are only increasing resource usage and model complexity?

Later, with the Fall 2018 Release, Principal Component Analysis (PCA) was added to the platform in order to help with dimensionality reduction. PCA helps with the challenge of extracting the discriminative information in the data while removing those fields that only add noise and make it difficult for the algorithm to achieve the expected performance. However, PCA transformation yields new variables which are linear combinations of the original fields, and this can be a problem if we want to obtain interpretable models.

Recursive Feature Elimination with WhizzML

Feature Selection Algorithms will help you to deal with wide datasets. There are 4 main reasons to obtain the most useful fields in a dataset and discard the others:

  • Memory and CPU: Useless features consume unnecessary memory and CPU.
  • Model performance: Although a good model will be able to detect which are the important features in a dataset, sometimes, this noise generated by useless fields confuses our model, and we obtain better performance when we remove them.
  • Cost: Obtaining data is not free. If some columns are not useful, don’t waste your time and money trying to collect them.
  • Interpretability: Reducing the number of features will make our model simpler and easier to understand.

In this series of four blog posts, we will describe three different techniques that can help us in this task: Recursive Feature Elimination (RFE), Boruta algorithm, and Best-First Feature Selection. These three scripts have been created using WhizzML, BigML’s domain-specific language. In the fourth and final post, we will summarize the techniques and provide guidelines for which are better suited depending on your use case.

Some of you may already know about the Best-First and Boruta scripts since we have offered them in the WhizzML Scripts Gallery. We will provide some details about the improvements we made to those and the new script, RFE.

Introduction to Recursive Feature Elimination (RFE)

In this post, we are focusing on Recursive Feature Elimination. You can find it in the BigML Script Gallery. If you want to know more about this script, visit its info page.

This is a completely new script in WhizzML. RFE starts with all the fields and iteratively creates ensembles, removing the least important field at each iteration. The process is repeated until the number of fields (set by the user in advance) is reached. One interesting feature of this script is that it can return the evaluation for each possible number of features. This is very helpful to find the ideal number of features we should use.

The script input parameters are:

  • dataset-id: input dataset
  • n: final number of features expected
  • objective-id: objective field (target)
  • test-ds-id: test dataset to be used in the evaluations (no evaluations take place if empty)
  • evaluation-metric: metric to be maximized in evaluations (default if empty). Possible classification metrics: accuracy, average_f_measure, average_phi (default), average_precision, and average_recall. Possible regression metrics: mean_absolute_error, mean_squared_error, r_squared (default).
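The elimination loop itself is short. The sketch below is a plain Python illustration, not the WhizzML source: importance_fn is a hypothetical callback standing in for "create an ensemble with these fields and return its field importances", and the toy importances are made up.

```python
from typing import Callable, Dict, List, Tuple


def recursive_feature_elimination(
    fields: List[str],
    importance_fn: Callable[[List[str]], Dict[str, float]],
    n: int,
) -> Tuple[List[str], List[List[str]]]:
    """Drop the least important field per iteration until n fields remain."""
    current = list(fields)
    history = [list(current)]  # one entry per subset size visited
    while len(current) > n:
        importances = importance_fn(current)
        least = min(current, key=lambda f: importances[f])
        current.remove(least)
        history.append(list(current))
    return current, history


# Toy importances (hypothetical): in practice they change on every refit.
toy = {"a": 0.4, "b": 0.1, "c": 0.3, "d": 0.2}
kept, history = recursive_feature_elimination(
    list(toy), lambda fs: {f: toy[f] for f in fs}, n=2
)
```

The history list is what makes per-size evaluations possible: when a test dataset is supplied, each visited subset can be evaluated against it.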

Our dataset: System failures in trucks dataset

This dataset, originally from the UCI Machine Learning Repository, contains information for multiple sensors inside trucks. The dataset consists of trucks with failures and the objective field determines whether or not the failure comes from the Air Pressure System (APS). This dataset will be useful for us for two reasons:

  • It contains 171 different fields, which is a sufficiently large number for feature selection.
  • Field names have been anonymized for proprietary reasons so we can’t apply domain knowledge to remove useless features.

As it is a very big dataset, we will use a sample of it with 15,000 rows.

Feature Selection with Recursive Feature Elimination

We will start by applying Recursive Feature Elimination with the following inputs. We are using n=1 because we want to obtain the performance for every possible number of features, from 171 down to 1. If we set a higher n, e.g. 50, the script would stop when it reached that number, so we wouldn’t know how smaller subsets of features perform.

Input parameters of RFE execution

After 30 minutes, we obtain an output-features object that contains all the possible subsets of features and their performance. We can use it to create the graph below. From this, we can deduce the optimal number of features is around 25. From 25 features on, the performance is stable.
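Reading the "stable from here on" point off the graph can also be done programmatically. The sketch below assumes the evaluations have been collected into a dictionary that maps the number of features to its score; the function name, the tolerance, and the toy numbers are hypothetical and are not taken from the actual execution.

```python
from typing import Dict


def smallest_stable_subset(scores_by_n: Dict[int, float], tolerance: float) -> int:
    """Smallest number of features whose score is within tolerance of the best."""
    best = max(scores_by_n.values())
    return min(n for n, score in scores_by_n.items() if best - score <= tolerance)


# Made-up scores for illustration only.
toy_curve = {1: 0.30, 10: 0.60, 25: 0.80, 50: 0.81, 100: 0.80}
suggested = smallest_stable_subset(toy_curve, tolerance=0.02)
```

The tolerance is a judgment call: it expresses how much performance we are willing to give up in exchange for a smaller, cheaper model.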

Evaluation score as a function of the number of features

Try it yourself with this Python script

Now that we know that we should obtain around 25 features, let’s run the script again to find out which are the optimal 25. This time, as we don’t need to perform evaluations, we won’t pass the test dataset to the script execution.

The script needs 20 minutes to finish the execution. The 25 most important fields that RFE returns are:

"bs_000", "cn_004", "cs_002", "cn_000", "dn_000", "ay_008", "ba_005",
"ee_005", "bj_000", "az_000", "al_000", "am_0", "ay_003", "ci_000",
"ba_007", "aq_000", "ag_002", "ee_007", "ck_000", "bc_000", "ay_005",
"ba_002", "ee_000", "cm_000", "ai_000"

From the script execution, we can obtain a filtered dataset with these 25 fields. The ensemble associated with this filtered dataset has a phi coefficient of 0.815. The phi coefficient of the ensemble that uses the original dataset was only a bit higher, 0.824. That sounds good!
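For readers unfamiliar with the metric: for a binary problem, the phi coefficient is the Matthews correlation coefficient computed from the confusion matrix. It ranges from -1 to 1, where 0 means no better than chance. The sketch below shows the binary formula only. The function name and the example counts are ours, and the average_phi value BigML reports is, as far as we understand, an average over classes rather than this single number.

```python
import math


def phi_coefficient(tp: int, fp: int, fn: int, tn: int) -> float:
    """Binary phi (Matthews correlation) from confusion matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention used when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


perfect = phi_coefficient(tp=50, fp=0, fn=0, tn=50)
chance = phi_coefficient(tp=25, fp=25, fn=25, tn=25)
inverted = phi_coefficient(tp=0, fp=50, fn=50, tn=0)
```

Because phi takes all four cells into account, it is a more informative summary than accuracy for unbalanced problems.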

As we have seen, Recursive Feature Elimination is a simple but powerful feature selection script with only a few parameters, serving as a very useful way to get an idea of which features are actually contributing to our model. In the next post, we will see how we can achieve similar results using Boruta. Stay tuned!

Machine Learning meets Social Good to tackle Data Quality Challenges for Enterprises

BigML partners with WorkAround to provide datasets tagged and cleaned by skilled refugees. WorkAround, a crowdsourcing platform for refugees and displaced people, partners with BigML, the leading Machine Learning Platform accessible for everyone, to give more economic opportunities to end users.

In a world of increasing automation, it is easy to forget the human work that goes into making Machine Learning happen. Quality data is the linchpin to accurate outcomes from Machine Learning algorithms, but finding providers that can deliver clean and accurate data is challenging. However, WorkAround makes this possible while working with skilled refugees and displaced people who are otherwise unable to work due to government restrictions, lack of access to banking, and other barriers. With this partnership, BigML customers will enjoy the benefits of having their data cleaned and tagged without the burden of having to perform these tasks by themselves, thus being able to dedicate more time to other strategic tasks.

“I started WorkAround because aid is not a sustainable solution for anyone to move forward,” says Wafaa Arbash, WorkAround’s co-founder and CEO, who watched frustrated as many of her fellow Syrians fled conflict only to be left with few options for employment in host communities, despite having higher education and previous work experience. “People don’t need handouts, they need economic opportunities.”

Although the 1951 UN Refugee Convention signed by 144 countries grants refugees the right to work, the reality is that most host countries block or severely limit local access to jobs. “WorkAround basically saved my life,” said Oro Mahjoob, a WorkArounder since July of 2017, “it gave me a chance to work and earn enough to pay my rent with only having an internet connection and a device.”

Francisco Martin, BigML’s CEO, emphasized: “BigML is excited to offer more ways to ensure high-quality data is made available for a variety of Machine Learning tasks executed on our platform. Our mission of democratizing Machine Learning is further extended to cover data preparation thanks to our partnership with WorkAround, all the while contributing to a worthy social cause.”
