
Bigger Results from Smaller Data with Linear Regression

Least squares linear regression is one of the canonical algorithms in the statistical literature. Part of the reason for this is that it makes a very good pedagogical tool. It’s easy to visualize, especially in two dimensions: a line going through a set of points, with the distances from the line to each point representing the errors of the model. Kind people have even created nice animations to help you:

GeoGebra Linear Regression

And herein we have machine learning itself in a nutshell. We have the training data (the points), the model (the line), the objective (the sum of the squared distances from the points to the line), and the process of optimizing the model against the data (changing the parameters of the line so that those distances are small).

It’s all very tidy and relatively easy to understand… and then comes the day of the Laodiceans, when you realize that not every function is a linear combination of the input variables, and you must learn about gradient-boosted trees and deep neural networks, and your innocence is shattered forever.

Part of the reason to prefer these more complex classifiers to simpler ones like linear regression (and its sister technique for classification, logistic regression) is that they are often generalizations of the simpler techniques. You can, in fact, view linear regression as a certain type of neural network: specifically, one with no hidden layers, no non-linear activation functions, and no fancy things like convolution or recurrence.
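To make that correspondence concrete, here’s a minimal sketch in Python with NumPy, on made-up data (so the numbers are illustrative only): training a “network” that is just one linear layer by gradient descent recovers the same parameters as the closed-form least squares solution.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                        # 100 points, 3 input features (made up)
true_w, true_b = np.array([2.0, -1.0, 0.5]), 3.0
y = X @ true_w + true_b + rng.normal(scale=0.1, size=100)

# The "neural network" view: a single linear layer, trained by gradient descent on squared error
w, b = np.zeros(3), 0.0
for _ in range(2000):
    err = X @ w + b - y                              # residuals of the current fit
    w -= 0.1 * (X.T @ err) / len(y)                  # gradient step on the weights
    b -= 0.1 * err.mean()                            # gradient step on the intercept

# The classic view: closed-form least squares over the same design matrix
theta = np.linalg.lstsq(np.c_[X, np.ones(len(y))], y, rcond=None)[0]
print(w, b)                  # ~[2, -1, 0.5] and ~3
print(theta[:3], theta[3])   # the same parameters, up to numerical noise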

So, then, what’s the use in turning back to linear regression, if we already have other techniques that do the same and more besides?  Answers to this question often come in two flavors:

  1. You need speed.  Fitting a neural network can take a long time whereas fitting a linear regression is near-instantaneous even for medium-sized datasets. Similarly for prediction:  A simple linear model will, in general, be orders of magnitude faster to predict than a deep neural network of even moderate complexity. Faster fits let you iterate more and focus on feature engineering.
  2. You have small training data.  When you don’t have much training data, overfitting becomes more of a concern: complex classifiers may fit the data a bit better, but your test sets are also very small, so your estimates of goodness of fit become unreliable. Using models like linear regression reduces your risk of overfitting simply by giving you fewer parameters to fit.

We’re going to focus on the second case for the rest of this blog post. One way of looking at this is the classic view in machine learning theory that the more parameters your model has, the more data you need to fit them properly. This is a good and useful view. However, I find it just as useful to think about it from the opposite direction: we can use restrictive modeling assumptions as a sort of “stand-in” for the training data we don’t have.

Consider the problem of using a decision tree to fit a line. We’ll usually end up with a sort of “staircase approximation” to the line. The more data we have, the tighter the staircase will fit the line, but we can’t escape the fact that each “step” in the staircase requires us to have at least one data point sitting on it, and we’ll never get a perfect fit.

This is unfortunate, but the upside is loads of flexibility. Decision trees don’t care a lick whether the underlying objective is linear or not; you can do the same sort of staircase approximation to fit any function at all.

Using linear regression allows us to sacrifice flexibility to get a better fit from less data.  Consider again the same line. How many points does it take from that line for linear regression to get a perfect fit?  Two. The minimum error line is the one and only line that travels through both points, which is precisely the line you’re looking for. No, we can’t fit all or even most functions with linear regressions, but if we restrict ourselves to lines, we can find the best fit with very little data.
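Here’s that contrast as a quick sketch in Python with scikit-learn, using made-up points from the line y = 2x + 1: the tree can only produce a staircase of constant segments, while linear regression recovers the line exactly from just two points.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X_train = np.array([[0.0], [1.0]])           # just two training points on y = 2x + 1
y_train = 2 * X_train.ravel() + 1
X_test = np.linspace(-1, 2, 5).reshape(-1, 1)

lr = LinearRegression().fit(X_train, y_train)
tree = DecisionTreeRegressor().fit(X_train, y_train)

print(lr.coef_, lr.intercept_)               # [2.] 1.0 -- the exact line
print(lr.predict(X_test))                    # perfect predictions everywhere
print(tree.predict(X_test))                  # a two-step staircase: only values seen in training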

Linear Regression: More Power

Some of you may find the reasoning implied above to be a bit circular: “You can learn a very good model using linear regression, provided that you know in advance that a line is a good fit to the data.” It’s a fair point, but it can be surprising how often this logic applies. It’s not odd to have a set of features where changes in those features induce directly proportional changes in the objective, simply because those are the sorts of features amenable to machine learning in general. And in fact, these sorts of relationships abound, especially in the sciences, where linear and quadratic equations go much of the way towards predicting what happens in the natural world.

As an example, here’s a dataset of buildings with measurements of each building’s roof surface area, wall surface area, and heat loss in BTUs. Physics tells us that heat loss is proportional to these areas, and we break them out into roof and wall surface areas because the two are insulated differently. The dataset has only 12 buildings, so we’ll use nine for training and three for test. Is it possible to get a reasonable model using so little data?

houses.png

If we try a vanilla ensemble of 10 trees, we get an r-squared on the holdout set of 0.85. This isn’t bad, all things considered! Again, we’ve only got nine training points, so a model this well-correlated with the objective is pretty impressive.

trees.png

Now, let’s see if we can do better by making linear assumptions. After all, we said at the top that heat loss is, in fact, proportional to the given surface areas. Lo and behold, linear models serve us well: We are able to recover the “true” model for heat loss through a surface near-exactly.

compare.png

One caveat here is that we’re only evaluating on three points, so it’s hard to know if the performance difference we see is significant. We might want to try cross-validation to see if the results continue to hold. However, Occam’s razor implores us to choose the simpler model even if the performances are equal, and prediction will be faster to boot.
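With only a dozen buildings, leave-one-out cross-validation costs next to nothing and gives a steadier read than a single three-point holdout. Here’s a minimal sketch in Python with scikit-learn; the data is a synthetic stand-in for the buildings dataset, not the real thing.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(1)
X = rng.uniform(50, 500, size=(12, 2))       # stand-ins for roof and wall surface areas
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=20, size=12)  # heat loss, linear in the areas

for model in (LinearRegression(),
              RandomForestRegressor(n_estimators=10, random_state=0)):
    scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_absolute_error")
    # Average error over 12 one-point holdouts: steadier than one 3-point test set
    print(type(model).__name__, -scores.mean())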

Old Wine in New Bottles

Yes, linear regression is somewhat old-fashioned, and in this day and age where datasets are getting larger all the time, the use cases aren’t as plentiful as they used to be. It would be a mistake, though, to equate “fewer” with “none”. When you’ve got small data and linear phenomena, linear regression is still queen of the castle.

Introduction to Linear Regression

BigML’s upcoming release on Thursday, March 21, 2019, brings our latest resource to the platform: Linear Regressions. In this post, we’ll give a quick introduction to General Linear Models before we move on with the remainder of our series of 6 blog posts (including this one) to give you a detailed perspective of what’s behind the new capabilities. Today’s post explains the basic concepts and will be followed by an example use case. Then, there will be three more blog posts focused on how to use Linear Regression through the BigML Dashboard, API, and WhizzML for automation. Finally, we will complete this series of posts with a technical view of how Linear Regressions work behind the scenes.


Understanding Linear Regressions

Linear Regression is a supervised Machine Learning technique that can be used to solve, you guessed it, regression problems. Learning a linear regression model involves estimating the coefficient values for the independent input fields, which, together with the intercept (or bias), determine the value of the target or objective field. A positive coefficient (b_i > 0) indicates a positive correlation between the input field and the objective field, while a negative coefficient (b_i < 0) indicates a negative correlation. Fields with higher absolute coefficient values can be interpreted as having a greater influence on final predictions.
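In symbols, the prediction is simply the intercept plus a weighted sum of the inputs:

y = b_0 + b_1 x_1 + b_2 x_2 + … + b_n x_n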

By definition, the input fields (x_1, x_2, …, x_n) in the linear regression formula need to be numeric values. However, BigML linear regressions can support any type of fields by applying a set of transformations to categorical, text, and items fields. Moreover, BigML can also handle missing values for any type of field.

It’s perhaps fair to say that linear regression is the granddaddy of statistical techniques: required reading for any Machine Learning 101 student, as it’s considered a fundamental supervised learning technique. Its strength is in its simplicity, which also makes it far easier to interpret than most other algorithms. As such, it makes for a nice quick-and-dirty baseline regression model, much as Logistic Regression does for classification problems. However, it is also important to grasp the situations where linear regression may not be the best fit, despite its simplicity and explainability:

  • It works best when the features involved are independent or, to put it another way, less correlated with one another.
  • The method is also known to be fairly sensitive to outliers.  A single data point far from the mean values can end up significantly affecting the slope of your regression line, in turn hurting the model’s chances to generalize come prediction time, as the short sketch below illustrates.
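Here is a tiny illustration of that sensitivity in Python with scikit-learn, on made-up data: ten clean points on a line of slope 2, plus one far-off point, are enough to drag the fitted slope well away from the truth.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1                        # a clean line with slope 2

clean = LinearRegression().fit(X, y)
print(clean.coef_)                           # [2.]

X_out = np.vstack([X, [[10.0]]])             # add a single outlier at x = 10...
y_out = np.append(y, 100.0)                  # ...where the true line would say 21
dirty = LinearRegression().fit(X_out, y_out)
print(dirty.coef_)                           # slope pulled far above 2 by one point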

Of course, using the standardized Machine Learning resources on the BigML platform, you can mitigate these issues and get more mileage from the subsequent iterations of your Linear Regressions. For instance,

  • If you have many columns in your dataset (aka a wide dataset), you can use Principal Component Analysis (PCA) to transform it and obtain uncorrelated features.
  • Or, by using BigML Anomalies, you can easily identify and remove the few outliers skewing your linear regression to arrive at a more acceptable regression line.

Here’s where you can find the Linear Regression capability on the BigML Dashboard:

Linear Regression on Dashboard

Want to know more about Linear Regressions?

If you would like to learn more about Linear Regressions and find out how to apply them via the BigML Dashboard, API, or WhizzML, please stay tuned for the rest of this series of blog posts to be published in the next week.

Linear Regression Joins the Suite of Supervised Methods on BigML

The latest BigML release brings a tried and true Machine Learning algorithm to the platform: Linear Regression. We intend to make it generally available on Thursday, March 21, 2019. This simple technique is well understood and widely used across industries. As such, it has been a frequently requested algorithm by our customers and we are happy to add it to our collection of supervised learning methods.

As the name implies, this algorithm assumes a linear relationship between the input fields and the output (objective) field, which enables you to discover relationships between quantitative, continuous variables. Since BigML has advanced data transformation capabilities, our implementation of linear regression can support any type of field including categorical, text, and items fields, and can even handle missing values. To give a sense of how Linear Regression is applied out in the real world, it’s often used to analyze product performance, conduct market research, perform sales forecasting, and make stock market predictions, among many other use cases.

One of the main benefits of Linear Regression is its simplicity, which affords a high level of interpretability. This makes it a good technique for doing quick tests and model iterations to establish a baseline to solve regression problems. Like any other technique, there are tradeoffs, so there will be circumstances where Linear Regression is not a suitable model for your use case. We will explain some of those considerations in more detail in our subsequent posts.

As usual, this release comes with a series of blog posts that progressively explain Linear Regression through a real use case and brief tutorials on how to apply it via the BigML Dashboard, API, WhizzML and bindings. While we will not be having a live webinar for this release, feel free to contact us at support@bigml.com with any questions or feedback as always.

Want to know more about Linear Regression?

If you are curious to learn more about how to apply Linear Regression using the BigML platform, please stay tuned for the rest of this series of blog posts to be published over the next week.

Seville becomes the capital of innovation with the first Machine Learning School in Andalusia

184 decision makers, analysts, domain experts, and entrepreneurs coming from all around the world gathered on March 7 and 8 at EOI Andalucía to join the first edition of our Machine Learning School held in Seville, Spain (#MLSEV). The event was co-organized by EOI and BigML, in collaboration with the Andalusian Government and Seville City Council, and sponsored by La Caseta, qosIT Consulting, and ITlligent.

Attendees came from 13 countries (Andorra, Brazil, China, Denmark, India, Ireland, Italy, Lebanon, the Netherlands, Portugal, Spain, United Kingdom, and United States) to enjoy the two-day event that offered several master classes along with workshops to put into practice the concepts learned in them. Also presented were eight real-world use cases showing how big and small organizations such as Rabobank, TDK, T2Client, Talento Corporativo, SlicingDice, Jidoka, Good Rebels, and AlterWork are already applying Machine Learning in areas like banking, industry, marketing, and the legal sector, among others.

165 attendees represented 92 private companies and big corporations, which highlights that companies are ready to adopt Machine Learning to work more efficiently. The remaining 19 attendees represented 10 universities and other educational institutions. As usual, international networking was an important advantage for attendees, who also had the chance to discuss their Machine Learning projects with the BigML Team at the Genius Bar.

During the opening and closing remarks, we were honored to have the collaboration of the Government of Andalusia. Manuel Ortigosa, Secretary-General of companies, innovation, and entrepreneurship; and Manuel Alejandro Hidalgo, Secretary-General of Economy, presented MLSEV as a great opportunity for the region to bring new ideas and innovative companies to Andalusia in order to help tech businesses grow in the south of Spain. The Government of Spain was also represented by Raúl Blanco, Secretary-General of Industry and small and medium-sized companies in the Ministry of Industry, Commerce, and Tourism. Additionally, Francisco Velasco, Director of EOI Business School Andalusia, opened and closed the event emphasizing the importance of hosting such a Machine Learning crash course at EOI Andalusia.

Juan Ignacio de Arcos, MLSEV Chairman, Business Analytics Executive Programme’s Director at EOI, and BigML Strategic Advisor, closed the event with a special mention of the attending companies, who recognize that applying Machine Learning is a strategic decision that puts them ahead of their competitors.

For more information about the program, speakers, and other details, please visit the event page here. Or check the event photos here. Stay tuned for more Machine Learning event announcements, as there are more editions to come in Seville and other cities worldwide!

2019 Oscars Predictions: Results Are In

The 91st Academy Awards this past Sunday, the first without a host in 30 years, proceeded without a hitch and seemed to sit well with the worldwide audience. For the third year in a row, we applied the BigML Machine Learning platform to predict the winners. This year, we got 4 out of 8 right for the major award categories. While this may seem mediocre, it’s notable that the confidence scores for the most likely nominee in 3 of the 8 categories were well below 50%, meaning those were virtual coin-toss categories with multiple weak favorites going up against each other. Lo and behold, we whiffed on all three weak favorites: Best Picture, Best Supporting Actress, and Best Original Screenplay.

2019 Oscars Predictions Results BigML

At this stage, we can merely speculate about the reasons behind the Academy members’ votes, but we can peek behind the curtain to understand how our Machine Learning models made their predictions. So, let’s dive in! Our results are shown in the table below. For two of the missed categories, the actual winners were our second choice, and Green Book, the winner of Best Picture, was in a close tie as our number 3 pick.

BigML Oscars 2019 Predictions results

This year we relied on two new tools in our toolbox that can be game-changers when it comes to improving accuracy and saving time in your Machine Learning (ML) workflows. The first was OptiML (an optimization process for model selection and parameterization), which is both robust and incredibly easy to use on BigML. Once we had collected and prepared the datasets, which is often the most challenging part of any ML project, all we had to do was hover over the “cloud action” menu and click OptiML. Really, that’s it!

BigML OptiML 1-Click

After running for about an hour, the OptiML returns a list of top models for us to inspect further and apply our domain knowledge. In that relatively short amount of time, the OptiML processed over 700 MB of data, created nearly 800 resources and evaluated almost 400 models. How about that?!

BigML OptiML Best Director results

Next, we took the list of selected models (the top performing 50% of the total model candidates from OptiML) and built a Fusion, which combines multiple supervised learning models and aggregates their predictions. The idea behind this technique is to balance out the individual weaknesses of single models, which can lead to better performance than any one method in particular (but not always; see this post for more details). The screenshot below shows the Fusion model for the Best Director category, which was comprised of 13 decision trees, 45 ensembles, 41 logistic regressions, and 2 deepnets. The combined predictions of all those models contributed to our pick of Alfonso Cuarón, director of Roma, to take home the prize.

BigML Fusion for Best Director Oscars 2019
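Under the hood, a Fusion averages the per-class predictions of its member models. For readers who want to experiment with the idea outside BigML, here’s a rough scikit-learn analogue using soft voting; the models and the synthetic data are stand-ins, not our Oscar models.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)   # stand-in data

# Average the predicted class probabilities of heterogeneous models, as a Fusion does
fusion = VotingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=4)),
                ("ensemble", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("logistic", LogisticRegression(max_iter=1000)),
                ("deepnet", MLPClassifier(max_iter=1000, random_state=0))],
    voting="soft",
).fit(X, y)
print(fusion.predict_proba(X[:1]))           # averaged probabilities for one example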

Have we really done the best Machine Learning can do? Is there a reason to believe that OptiML may not have found the best solution to this problem? My colleague Charles Parker, BigML’s VP of Machine Learning Algorithms, chimes in with an explanation of how things get a little hazy here: Remember, OptiML is essentially doing model selection by estimating performance on multiple held-out samples of the data. Since our Oscar data only goes back about 20 years, the number of positive examples in each held-out test set is just a fraction of those 20 or so examples. Our estimate of the performance of each model in the OptiML will then be driven primarily by just a tiny number of movies. Indeed, if we mouse over the standard deviation icon next to the model’s performance estimate in the OptiML (see screenshot below), we’ll see that the standard deviation of the estimate is so large that the performance numbers of nearly all of the models returned are within one standard deviation of the top model’s performance.

BigML Fusion for Best Picture Oscars prediction

What does this mean?  For one thing, it means that you don’t have enough data to test these models thoroughly enough to tell them apart. Thankfully, OptiML does enough training-testing splits to show us this, so we don’t make the mistake of thinking that the very best model is meaningfully better than most other models in the list.  

Unfortunately, this is a mistake that is made all too often by people using Machine Learning in the wild. There are many, many cases in which, if you try enough models, you’ll get very good results on a single training and testing split, or even a single run of cross-validation. This is a version of the multiple comparisons problem; if you try enough things, you’re bound to find something that “works” on your single split just by random chance, but won’t do well on real world data. As you try more and more things, the tests you should use to determine whether one thing is “really better” than another need to be stricter and stricter, or you risk falling into one of these random chances, a form of overfitting.
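You can watch this effect in a ten-line simulation (Python with NumPy; the numbers are illustrative): score hundreds of models that are literally guessing at random on a tiny test set, and the best of them still looks impressive.

import numpy as np

rng = np.random.default_rng(42)
n_test, n_models = 20, 400                   # a tiny test set, many candidate "models"
y_true = rng.integers(0, 2, size=n_test)

# Every model guesses at random, so the true accuracy of each one is 50%
scores = [(rng.integers(0, 2, size=n_test) == y_true).mean()
          for _ in range(n_models)]

print(max(scores))       # typically 0.75-0.85: the "winner" looks like a strong model...
print(np.mean(scores))   # ...but average skill is ~0.5; the winner is pure luck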

In OptiML’s case, the easiest and most robust way to get a stricter test is to seek out more testing data. But we can’t time travel (yet!), and so we’re stuck with the data we have. The upshot of all of this is that, yes, there may very well be a better model out there, but with the data that we have, it will be difficult to say for sure that we’ve arrived at something clearly better than everything that OptiML has tried.

As it turned out, BigML was not alone in missing the mark for the top category predictions. DataRobot was counting on Roma to win Best Picture, and Green Book was not in their top three. Microsoft Bing and TIME also put their bets on Roma, so it goes to show you the reality of algorithmic predictions being tested in real world scenarios where patterns and “rules” don’t always apply.

Alright, alright, enough of the serious talk. As pioneers of MLaaS here at BigML, we care deeply about these matters concerning the quality and application of ML-powered findings, so we couldn’t pass up this chance to discuss them. But back to the red carpet results…we enjoyed the challenge of once again putting our ML platform to the test of predicting the most prestigious award show in the entertainment industry. To all users who experimented with making their own models to predict the Oscars, let us know how your results came out on Twitter @bigmlcom or shoot us a note at feedback@bigml.com.

Predicting the 2019 Oscars Winners with Machine Learning

Following the success of predicting 6 out of 6 for the Oscars last year, we have the bar set high for using Machine Learning to predict the 2019 Oscars winners. This year, however, the results are not as obvious. For some of the top categories, our projected results show ties for who gets to take home the coveted gold statuette. Nevertheless, we are excited to share our predictions and see how the Academy Awards pan out this Sunday!

Oscars 2019 Predictions

Davidlohr Bueso/Flickr.com

Once again, we apply the standard Machine Learning workflow of collecting and preparing a dataset, building and evaluating models, and ultimately making predictions. Using the 1-click OptiML on BigML to find the best models, we easily process more than 100 variables, determine patterns based on the movies that won in the past, and make well-informed estimates for this year’s nominees.

The Data

Earlier this week, we published our Movies dataset and encouraged users to build their own models to predict the 2019 Oscars. Machine Learning models typically improve with more data instances, so we kept all the previous data and features we had brought together for past years’ predictions and added data from 2018, all of which amounts to a total of 1,235 movies from 2000 to 2018, where each film has 100+ features including:

2019 Oscars predictions dataset on BigML

The Models

In addition to using deepnets as we did for the 2018 predictions, this year we also utilize OptiML, the optimization process on BigML that automatically finds the best supervised model, along with Fusions, which combines multiple supervised models for improved performance. So for each award category, we trained two separate model types to see how the predictions would compare and which method would give the best results.

For the new workflow we tried this year, we first built the OptiML, which returns a list of top performing models including deepnets, ensembles, logistic regressions, and decision trees. This powerful method saves you the difficult and time-consuming work of hand-tuning multiple supervised algorithms. With truly the click of a button, we can automatically build and evaluate hundreds of models. As you can see in the screenshot below, after a mere 16 minutes, our OptiML has already evaluated 126 models.

2019 Oscars OptiML on BigML

After our OptiML was finished, we created a Fusion of the top models and then made a Batch Prediction. As an example of the insights that can be gleaned from this process, our models determine which fields are the most important to predict the “Best Picture” award, as shown below in the field importance report for our Fusion model. The “Critics Choice won categories” field appears to be the strongest indicator, contributing 27% to our model’s predictions.

Field Importances 2019 Oscars Predictions

The Predictions

Now comes the fun part. Let’s predict the 2019 winners! For each category, we predict the winner and the scores for the rest of the nominees.

In the battle for the Best Picture, our models went back and forth between The Favourite and Roma. The deepnet predicted The Favourite will be the favorite (no pun intended!) with a score of 37. The OptiML + Fusion models predicted Roma would be the big winner, but with a lower probability score of 24, so we stuck with the deepnet predictions for this category.

Best Picture Prediction

For Best Director, our models are much more confident. Alfonso Cuarón, director of Roma, is the likely winner with a score of 70 and the other nominees trail far behind.

For Best Actress, Glenn Close in The Wife is the leading lady with a score of 93.

The strongest prediction from all our models was a score of 96 for the Best Actor award going to Rami Malek for his stellar performance in Bohemian Rhapsody. Rock on, Rami!

Our models weren’t as convinced for the Best Supporting Actress category. Emma Stone is our pick with a humble score of 23. Even the machines can’t figure it out all the time.

Bouncing back, our models are feeling pretty good about Mahershala Ali in Green Book winning Best Supporting Actor with a score of 64.

First Reformed seems to be the best bet for Best Original Screenplay with a score of 46, followed by close ties between Roma (17) and The Favourite (16) once again.

And last but not least, our models give the gold to BlacKkKlansman for Best Adapted Screenplay.

This concludes our 2019 Oscars predictions. After curling up on the couch with popcorn, we’ll be on the edge of our seats watching the awards show live this Sunday, February 24th, to find out how our Machine Learning models performed. Check back on Monday when we’ll share our results for this year’s predictions…fingers crossed!

Machine Learning and RPA in Action: Email Management

We recently announced the strategic alliance between Jidoka and BigML, where we explained the integration of RPA with other technologies such as Machine Learning. With this integration, Jidoka can provide Machine Learning capabilities in their RPA process automation platform.

To explain the advantages and possibilities offered by this integration, today we present a practical example of the application of both technologies, Jidoka’s RPA and BigML’s Machine Learning: the automation of an e-mail classification process. This use case will be presented by Jidoka’s CEO, Víctor Ayllón, at #MLSEV, our first Machine Learning School in Seville (Spain), held on March 7-8.

Imagine for a moment that you are responsible for the customer service department of a large company. You and your team receive a very large number of customer emails every day that are addressed to different departments of the company. You end up spending a lot of time processing these e-mails and redirecting them to the most suitable department to deal with the customer’s request, perhaps using an incident management tool for this task. Because the process is performed manually, opening and processing emails one by one, you are conscious that many requests are not dealt with as quickly as would be desirable, and you ask yourself the critical question: how can I make this whole process more agile and responsive?

The combination of RPA + ML can be the answer. The short video below describes step by step how these two technologies complement each other to automate this process from start to finish, each one focusing on what it can “do best”. It shows the automated process that Víctor Ayllón will present in detail at #MLSEV. You cannot miss it!

Jidoka’s software robot takes care of the repetitive and mechanical tasks: it opens the mailbox, checks for unread emails, extracts their contents, and passes them to the Machine Learning tool for analysis. BigML, in turn, is in charge of processing and interpreting the information contained in the emails in order to identify, through a predictive model, which department they relate to. But the automation does not stop there. Once the target department has been determined, the Jidoka robot resumes the process and uses the BigML prediction to assign a task to the relevant department via the company’s ticket or incident management platform (in this case Atlassian JIRA).

In this way, spanning different systems and corporate applications (email manager, task management tool, etc.), RPA and ML complement each other in the execution of the process, and together they make it automated and faster.

Build your own models to predict the 2019 Oscars

It’s that time of the year when movie fans around the world get glued to their TVs sucking in everything the Oscars have come to represent over the years: the nominees and the snubs, celebrities, designer outfits, rumors of impending breakups, newcomers making waves, and oh yes, also some of the best movies of the year before. Following in the footsteps of our success last year, we’re getting ready to once again predict which special performance or production deserves to win this year’s gold-plated statues, which signify perhaps the highest achievement in the 131-year-old business of fast-moving pictures.

2019 Oscars

This year, in an attempt to involve all our readers in this fun exercise (and a nice intro use case to Machine Learning), we’re publishing the corresponding dataset in the BigML gallery. Rest assured, we’ve already done most of the hard work to gather and verify the completeness of the data. It sports 20 categorical, 56 numeric, 42 items, and 1 datetime field, totaling 119 fields and giving you plenty of details about various aspects of past nominees and winners. The dataset is organized such that each record represents a unique movie identified by the field movie_id. The first 17 fields have to do with the metadata associated with each movie, e.g., release_date, genre, synopsis, duration, metascore. The following fields are dedicated to recording the outcomes of past Academy Awards and 19 other relevant awards such as the Golden Globes, Screen Actors Guild, BAFTA, and more. Finally, we have some automatically generated datetime fields based on the Release Date of the movie entry. Please note that this rather abbreviated dataset comes with the limitation of making predictions based on movie titles only, which means that in those instances where multiple persons are nominated from a single movie, you’ll have to make a judgment call between those nominees.

Oscar Nominees 2000-2018

Click on the above image and clone this public dataset to your BigML Dashboard.

To make your own predictions, you’ll need to perform a time split and create a training dataset spanning the period 2000-2017 as well as a test dataset for the movies released in 2018 — essentially, the nominees for the 2019 Oscars. The dataset is prepared to handle multiple awards, to save time. So instead of dealing with a different dataset for each award, you can simply drop the unneeded target fields and select as your target field the award you’re trying to predict. For instance, if you’re looking to predict Best Picture, then you select Oscar_Best_Picture_Won as the target, and the rest of the fields sharing the naming convention Oscar_XXXXX_Won are to be excluded.
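If you’d rather prepare the split in Python, here’s a minimal sketch with pandas, assuming you export the dataset to CSV first; the file name movies.csv and the year column are hypothetical, so adapt them to the actual fields in your clone.

import pandas as pd

movies = pd.read_csv("movies.csv")           # hypothetical CSV export of the dataset

# Time split: train on the 2000-2017 releases, test on 2018 (the 2019 nominees)
train = movies[movies["year"] <= 2017]       # "year" stands in for the release-year field
test = movies[movies["year"] == 2018]

# Keep one target and drop the other Oscar_XXXXX_Won fields
target = "Oscar_Best_Picture_Won"
other_targets = [c for c in movies.columns
                 if c.startswith("Oscar_") and c.endswith("_Won") and c != target]
train = train.drop(columns=other_targets)
test = test.drop(columns=other_targets)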

Here are some additional clues for newbies:

  • Get familiar with the dataset by building some scatterplot visualizations
  • Start with simpler methods like Models or Logistic Regressions to see what fields seem to correlate well with the outcome you’re looking to predict (i.e. use Model Summary Report)
  • Add more sophisticated techniques like Deepnets or Ensembles later on
  • Execute some side by side Evaluation Comparisons to compare your best performing classification models
  • Try an OptiML and see how automatic Machine Learning performs vs. your previous attempts
  • For additional peace of mind, validate models with last year’s predictions as a tie-breaker exercise
  • See if you can build some Fusions from your top classifiers to improve the robustness of your predictions further
  • Compare your predictions to those of human experts, and better yet, see how they deviate by using the handy predictions explanations feature of BigML.
  • BONUS: Go beyond what we supply here and add your own features and Data Transformations to the original movie dataset for an additional edge.

What are you waiting for? Join in the fun, impress some friends, and let us know how your predictions turn out with a shoutout to @bigmlcom on Twitter!

 

Powering the Next Wave of Intelligent Devices with Machine Learning – Part 3

In the second part of this series, we explored how the BigML Node-RED bindings work in more detail and introduced the key concepts of input-output matching and node reification, which allow you to create more complex flows. In this third and final part of this introductory series, we are going to review what we know about inputs and outputs in a more systematic way, introduce debugging facilities, and present an advanced type of node that allows you to inject WhizzML code directly into your flows.

Details about node inputs and outputs

Each BigML node has a varying number of inputs and outputs, which are embedded in the message payload that Node-RED propagates across nodes. For example, the ensemble node has one input called dataset and one output called ensemble. That means the following two things:

  • An ensemble node expects by default to receive a dataset input. This can be provided by any of the upstream nodes through their outputs, which are added to the message payload, or as a property of the ensemble node configuration.
  • The ensemble output is sent over to downstream nodes with the ensemble key. This is a consequence of the fact that when a node sends an output value, this is appended to the message payload using that node output port label as a key. This way, downstream nodes can use that key to access the output value of that node.

You can change the input and output port labels when you need to connect two nodes whose inputs and outputs do not match. Say for example that a node has an output port label generically named resource and that you want to use that output value in a downstream node that requires a dataset input. You can easily access the upstream node configuration and change the node settings as shown in the following image.

tutorial-1-23.jpg

One thing you should be aware of is that all downstream nodes will be able to see and use any output values generated by upstream nodes, unless another node uses the same key to send its output out. For example, consider the following partial flow, where all inputs and outputs are shown at the same time:

Input/output ports

If you inspect the connection between Lookup Dataset and Dataset Split, you will see that both labels have the value dataset. To reiterate the rule explained above, this will make Lookup Dataset store its output in the message payload under the dataset key. Correspondingly, Dataset Split expects its input under the key dataset, so all will work out just fine.

If you inspect the connection between Dataset Split and Make Model, you will see that Dataset Split produces two outputs, training-dataset and test-dataset, in accordance with its expected behavior, which is splitting a dataset into two parts: one for training a model and the other to later evaluate it. On the other hand, Make Model expects a dataset input.

Now, if you were to run the flow as it is defined, you would not get any error. The flow would be executed through, but it would produce an incorrect result because Make Model would use the dataset value produced by Lookup Dataset instead of the training dataset value produced by Dataset Split.

You have two options to fix this issue: either you change Dataset Split‘s output so it uses a dataset label instead of training-dataset; or you modify the Make Model input so it uses training-dataset instead of dataset. In the former case, the dataset value produced by Lookup Dataset will be overridden by the value with the same name produced by Dataset Split.
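If it helps to see the bookkeeping at a glance, here is the same situation reduced to a toy Python sketch of the message payload (the resource IDs are made up): each node merges its outputs into the payload under its output port labels, and a downstream node simply reads the key named by its input port label.

# A toy model of how node outputs accumulate in the message payload
payload = {}
payload.update({"dataset": "dataset/aaa"})               # Lookup Dataset writes "dataset"
payload.update({"training-dataset": "dataset/bbb",       # Dataset Split writes its two outputs
                "test-dataset": "dataset/ccc"})

# Make Model reads the key named by its input port label -- "dataset" by default
print(payload["dataset"])            # dataset/aaa: the whole dataset, not the training split!

# Fix 1: relabel Dataset Split's output to "dataset", overriding the earlier value
payload.update({"dataset": "dataset/bbb"})
# Fix 2: or relabel Make Model's input port to "training-dataset" and read that key
print(payload["training-dataset"])   # dataset/bbb: the intended training split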

How to debug problems

When you build a flow that causes an error when you run it, a good approach is to force each node to be reified and connected to a debug node, which lets you inspect the output generated by that node and detect any anomalies or unexpected results. This way, you can make sure that each node sends out a message whose payload actually contains the information downstream nodes expect to receive.

For example, consider the following flow. An error could occur at any node but we will not get any useful information until the whole WhizzML code has been generated and sent to the BigML platform to be executed.

A complex workflow

A rather trivial approach to getting more information for each node would be to connect each node to a debug node. This would show, for each debugged node, the WhizzML code generated at that node. Unfortunately, since this information is available prior to the WhizzML code’s execution, we get no information about the actual outputs produced, which are sent along within the message payload.

Debugging a complex flow

If you enable the reify option for each node, you are actually forcing the execution of each BigML node and thus you will also get to know which outputs each node generates by inspecting its message payload. This can be of great help when, for example, a downstream node complains about some missing information, improperly formatted information, or you simply get the wrong result, e.g., by using a wrong resource.

Additionally, when you reify each node, you will divide the whole WhizzML code that the flow generates into smaller, independent chunks that you will be able to run in the BigML Dashboard, which provides a more user-friendly environment for you to assess why a flow is failing.

To streamline debugging even more, the BigML Node-RED bindings provide two special flags you can specify in the message payload you inject into your flow or inside the flow context. The first one, BIGML_DEBUG_TRACE, will make each node output the WhizzML code it generates on the Node-RED console. So, you do not have to connect each BigML node to a debug node to get that information, although it is perfectly fine if you do.

WhizzML for  evaluation :
(define lookup-dataset-11  (lambda (r) (let (result (head (resource-ids (resources "dataset" (make-map ["name__contains" "limit" "order"] ["iris" 2 "Ascending"])))) ) (merge r (make-map ["dataset"] [result])))))
(define dataset-split-12  (lambda (r) (let (dataset (if (contains? r "dataset") (get r "dataset") "" ) result (create-random-dataset-split dataset 0.75 { "name" "Dataset - Training"} { "name" "Dataset - Test"}) ) (merge r (make-map ["training-dataset" "test-dataset"] result)))))
(define model-13  (lambda (r) (let (training-dataset (if (contains? r "training-dataset") (get r "training-dataset") "" ) result (create-and-wait "model" (make-map [(resource-type training-dataset)] [training-dataset])) ) (merge r (make-map ["model"] [result])))))
(define evaluation-14  (lambda (r) (let (test-dataset (if (contains? r "test-dataset") (get r "test-dataset") "" ) model (if (contains? r "model") (get r "model") "" ) result (create-and-wait "evaluation" (make-map [(resource-type model) "dataset"] [model test-dataset])) ) (merge r (make-map ["evaluation"] [result])))))
(define init {"inputData" {"petal length" 1.35}, "limit" 2, "BIGML_DEBUG_REIFY" false, "BIGML_DEBUG_TRACE" true})
(define lookup-dataset-11-out (lookup-dataset-11 init))
(define dataset-split-12-out (dataset-split-12 lookup-dataset-11-out))
(define model-13-out (model-13 dataset-split-12-out))
(define evaluation-14-out (evaluation-14 model-13-out))

WhizzML for  Filter result :
(define lookup-dataset-11  (lambda (r) (let (result (head (resource-ids (resources "dataset" (make-map ["name__contains" "limit" "order"] ["iris" 2 "Ascending"])))) ) (merge r (make-map ["dataset"] [result])))))
(define dataset-split-12  (lambda (r) (let (dataset (if (contains? r "dataset") (get r "dataset") "" ) result (create-random-dataset-split dataset 0.75 { "name" "Dataset - Training"} { "name" "Dataset - Test"}) ) (merge r (make-map ["training-dataset" "test-dataset"] result)))))
(define model-13  (lambda (r) (let (training-dataset (if (contains? r "training-dataset") (get r "training-dataset") "" ) result (create-and-wait "model" (make-map [(resource-type training-dataset)] [training-dataset])) ) (merge r (make-map ["model"] [result])))))
(define evaluation-14  (lambda (r) (let (test-dataset (if (contains? r "test-dataset") (get r "test-dataset") "" ) model (if (contains? r "model") (get r "model") "" ) result (create-and-wait "evaluation" (make-map [(resource-type model) "dataset"] [model test-dataset])) ) (merge r (make-map ["evaluation"] [result])))))
(define filter-result-15  (lambda (r) (let (evaluation (if (contains? r "evaluation") (get r "evaluation") "" ) result (get (fetch evaluation (make-map ["output_keypath"] ["result"])) "result") ) (merge r (make-map ["evaluation"] [result])))))
(define init {"inputData" {"petal length" 1.35}, "limit" 2, "BIGML_DEBUG_REIFY" false, "BIGML_DEBUG_TRACE" true})
(define lookup-dataset-11-out (lookup-dataset-11 init))
(define dataset-split-12-out (dataset-split-12 lookup-dataset-11-out))
(define model-13-out (model-13 dataset-split-12-out))
(define evaluation-14-out (evaluation-14 model-13-out))
(define filter-result-15-out (filter-result-15 evaluation-14-out))

As you can see, for each node you get the whole WhizzML program that is being generated for the whole flow.

Similarly, BIGML_DEBUG_REIFY will reify each node without requiring you to manually change its configuration. In this case as well, each node will print on the Node-RED console the WhizzML code it attempted to execute:

WhizzML for  evaluation :
(define evaluation-9  (lambda (r) (let (test-dataset (if (contains? r "test-dataset") (get r "test-dataset") "" ) model (if (contains? r "model") (get r "model") "" ) result (create-and-wait "evaluation" (make-map ["dataset" (resource-type model)] [test-dataset model])) ) (merge r (make-map ["evaluation"] [result])))))
(define init {"BIGML_DEBUG_REIFY" true, "BIGML_DEBUG_TRACE" true, "dataset" "dataset/5c3dc6948a318f053900002f", "inputData" {"petal length" 1.35}, "limit" 2, "model" "model/5c489dc33980b5340f007d3a", "test-dataset" "dataset/5c489dbd3514cd374702713c", "training-dataset" "dataset/5c489dbc3514cd3747027139"})
(define evaluation-9-out (evaluation-9 init))

WhizzML for  Filter result :
(define filter-result-10  (lambda (r) (let (evaluation (if (contains? r "evaluation") (get r "evaluation") "" ) result (get (fetch evaluation (make-map ["output_keypath"] ["result"])) "result") ) (merge r (make-map ["evaluation"] [result])))))
(define init {"training-dataset" "dataset/5c489dbc3514cd3747027139", "BIGML_DEBUG_TRACE" true, "model" "model/5c489dc33980b5340f007d3a", "dataset" "dataset/5c3dc6948a318f053900002f", "inputData" {"petal length" 1.35}, "limit" 2, "evaluation" "evaluation/5c489dce3514cd37470271b0", "BIGML_DEBUG_REIFY" true, "test-dataset" "dataset/5c489dbd3514cd374702713c"})
(define filter-result-10-out (filter-result-10 init))

In this case, each code snippet is complete with the inputs provided by the previous node, stored in the init global, so you can more easily check its correctness and/or try to execute it in BigML.

Injecting WhizzML Code

As we mentioned, WhizzML, BigML’s domain-specific language for defining custom ML workflows, provides the magic behind the BigML Node-RED bindings. This opens up a wealth of possibilities, since you can embed a node inside your Node-RED flows to execute generic WhizzML code. In other words, if our bindings for Node-RED do not already provide a specific kind of node for a given task, you can create one with the right WhizzML code that does what you need.

For example, we could consider the following case:

  • We want to predict using an existing ensemble.
  • We calculate the prediction using two different methods, then choose the result that has the highest confidence.

To carry out this task in Node-RED, we define the following flow.

Selecting the best prediction

The portion of the flow delimited by the dashed rectangle is the same prediction workflow we described in part 2 of this series. You can then add a new prediction node making sure the two prediction nodes use different settings for Operating kind. You can use Confidence for one, and Votes for the other.

Setting the operating kind

Another detail to note is renaming the two prediction nodes’ output labels so they do not clash. Indeed, if you leave the two nodes with their default output port labels, which will read prediction for both of them, the second prediction node will override the first’s output. So, just use prediction1 and prediction2 as the port labels for the two nodes.

Changing the prediction nodes output labels

Finally, add a WhizzML node, available through the left-hand node palette, and configure it as shown in the following image.

WhizzML node to select the best prediction

Since the WhizzML node is going to use the two predictions outputted by the previous nodes, we should also make that explicit in the WhizzML input port label configuration, as shown in the following image:

Specifying the inputs to the WhizzML node

This is the exact code you should paste into the WhizzML field:

(let (p1 ((fetch prediction1) "prediction")
      p2 ((fetch prediction2) "prediction")
      c1 ((fetch prediction1) "confidence")
      c2 ((fetch prediction2) "confidence"))
      (if (> c1 c2) [p1 c1] [p2 c2]))

As you see, the WhizzML node uses prediction1 and prediction2. Those variables must match the labels you defined for the prediction nodes output ports and the WhizzML node input port.

Now, if you inject a new message, with the same format as the one used for the prediction use case introduced earlier, you should get the following output:

The selected prediction

Conclusion

We can’t wait to see what developers will be able to create using the BigML Node-RED bindings to make IoT devices that are able to learn from their environment. Let us know how you are using the BigML Node-RED bindings and provide any feedback to support@bigml.com.

Comparing Feature Selection Scripts

In this series about feature selection, the first three posts covered three different WhizzML scripts that can help you with this task: Recursive Feature Elimination, Boruta and Best-First Feature Selection. We explained how they work and the needed parameters for each one of them, applying the scripts to the system failures in trucks dataset described in the first post.

Feature Selection Scripts

As we previously explained, these scripts can help us deal with wide datasets by selecting the most useful features. They are an interesting alternative to dimensionality reduction algorithms such as Principal Component Analysis (PCA), with the added advantage that you don’t lose any model interpretability because you are not transforming your features.

Feature Selection algorithms can work in two different ways:

  • They can start using all the fields in the dataset and iteratively remove the least important ones. This is how Recursive Feature Elimination and Boruta work (see the sketch after this list).
  • They can start with 0 fields and iteratively add the most important features. This is how Best-First Feature Selection works.
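To give a flavor of the first approach outside BigML, here’s a minimal scikit-learn analogue of recursive feature elimination; it is a sketch on synthetic stand-in data, not the WhizzML script itself.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Stand-in for a 29-field dataset where only a handful of fields matter
X, y = make_regression(n_samples=500, n_features=29, n_informative=8, random_state=0)

# Start from all 29 fields and iteratively drop the least important one
selector = RFE(RandomForestRegressor(n_estimators=50, random_state=0),
               n_features_to_select=8, step=1).fit(X, y)
print(selector.support_)    # boolean mask of the surviving features
print(selector.ranking_)    # 1 = kept; higher numbers were eliminated earlier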

Let’s compare the results from these three scripts. To that end, we have used them with a reduced version of the dataset mentioned previously. This reduced version, the same that we used in the Best-First post, has 29 fields and 15,000 rows.

In the table below, we can see a comparison between the scripts. We have annotated the execution times, the number of output fields, and the number of output fields in common between each pair of scripts. For each script output dataset, we have created and evaluated an ensemble.

  1. Using max-runs of 10 and min-gain of 0.01 (default parameters)
  2. Using the same input parameters as in the previous post.
  3. phi-score with the 29-field dataset is 0.84.

From these tests, we extract some interesting conclusions:

  • Recursive Feature Elimination is a simple script that runs extremely fast, needs only a few parameters, and does so without sacrificing accuracy. Its results are clearly consistent with the ones from the other scripts.
  • Boruta is a useful script with an interesting property: it is free from user bias because the n parameter, which represents the number of features to select, is not required.
  • Best-First Feature Selection is the most time-consuming of the scripts, so we should use it with smaller datasets or on a previously reduced one. However, it is the only one that starts from 0 fields, and the information from the very first iterations is useful for seeing which are the most important features in our dataset.

The system failures in trucks dataset seemed to be a difficult dataset to work with: the large number of fields and their uninformative names made it hard to apply domain knowledge. These scripts helped us automatically obtain the most important features without losing modeling performance.

Now it’s your turn! Try out these new scripts and let us know if you have any feedback at support@bigml.com. What’s more, give WhizzML a try and create your own scripts that help automate your frequent tasks.
