
BigML Certifications are Here!

At BigML, we believe that the best way to add business value is by showing and not just telling what is possible via Machine Learning techniques. This has been the main reason why we prefer to give our user community free and easy access to our full-featured platform without having to fill out endless online forms to even get a glimpse. BigML has also been practicing what it preaches on the “getting hands-on” front when it comes to actively helping our customers launch their first Machine Learning use cases built on our platform.

BigML Certifications

As happy as we are to see customers expand the application areas of Machine Learning in their organizations, we can’t help but notice that many more customers are asking BigML to get involved in their projects. With this heightened awareness, BigML has set its sights on systematically addressing the need to certify our partners, which has led to today’s announcement of BigML Certifications. This is a great opportunity for BigML partners to demonstrate their mastery of BigML’s rapidly growing Machine Learning-as-a-Service platform while further differentiating themselves from competing analytics services organizations that offer more arcane methods relying on traditional statistical analysis tools.

Not yet a BigML partner? Well, what are you waiting for? Contact us today to find out how new-wave Machine Learning-as-a-Service platforms can help you deliver actionable insights and real-world smart applications to your clients in days or weeks, not months or years!

BigML Certifications come in two flavors: BigML Certified Engineer and BigML Certified Architect. To be eligible to enroll in the BigML Certified Engineer courses, you must show a certain level of proficiency with Machine Learning, the BigML Dashboard, the BigML API, and WhizzML. The following getting-started assets will help you get up and running in no time: Tutorials, API documentation, and WhizzML.

BigML Certified Engineer

This certification track prepares analysts, scientists, and software developers to become BigML Certified Engineers. Topics covered include:

  • Advanced Data Transformations for Machine Learning (3 hours)
  • Advanced Modeling (3 hours)
  • Advanced API (3 hours)
  • Advanced WhizzML (3 hours)
  • EXAM (3 hours)

BigML Certified Architect

This certification track prepares BigML Certified Engineers to become BigML Certified Architects. Once you’ve successfully passed the BigML Certified Engineer exam, you are eligible to enroll in the BigML Certified Architect courses. Topics covered include:

  • Designing Large-Scale Machine Learning Solutions (3 hours)
  • Measuring the Impact of Machine Learning Solutions (3 hours)
  • Using Machine Learning to Solve Machine Learning Problems (3 hours)
  • Lessons Learned Implementing Machine Learning Solutions (3 hours)
  • EXAM (3 hours)

Be sure to check out the certifications page for more on pricing and to pre-order yours.  As always, let us know if you have any special needs or feedback.

AI Startup Battle in Boston – The AI has spoken!

It is now well established that advances in Artificial Intelligence (AI) technology have opened up new markets and new opportunities. So much so that hearing early-stage investors preach how AI will automate everything is no longer a surprise nor a far-fetched idea. Yet even though investors are keen to admit the disruptive power of AI, they have a harder time admitting the same when it comes to the venture capital industry itself. The idea of automating early-stage investments is slowly gaining ground in the VC community, but a lot of convincing still needs to be done. That said, and knowing the industry’s appetite for competition, what better way to make the case than a startup contest judged by an AI? That is why we created the AI Startup Battle, our best attempt to show the world that even VCs can be disrupted, one advancement at a time.

The latest edition of the AI Startup Battle took place last Wednesday (Oct. 12, 2016) during PAPIs ‘16, the 3rd International Conference on Predictive Applications and APIs. Four startups competed on stage at the Microsoft New England Research & Development Center (MIT Campus), and an impartial AI, powered by PreSeries and Telefónica Open Future_, chose the winner without humans influencing the outcome.


Joonko, winner of the AI Startup Battle at PAPIs ’16 – Represented by Ilit Raz, CEO & CoFounder (left) and Guy Grinwald, CTO & CoFounder (right)

After being questioned live on stage by the algorithm, the four startups were each given a score from 0 to 100 representing their long-term likelihood of success. This edition’s winner, with a total score of 89.24, is Joonko, which provides data-driven solutions to help companies improve the diversity and inclusion of their workforce. They will be offered an investment of up to $50,000, an incredible place to work, access to mentors, business partners, and a global network of talent, as well as the opportunity to reach millions of customers.

Second place went to Cognii with a close score of 83.84; they are dedicated to improving the quality, affordability, and scalability of education with the help of Artificial Intelligence. Third place went to Heartbeat Ai Technologies with a score of 70.73; they aim to design emotionally intelligent technologies and tools to help machines understand people’s feelings, needs, and motivations. Finally, fourth place went to Palatine Analytics with a score of 70.71; they help companies evaluate the current and future performance of their employees by using Artificial Intelligence and Predictive Analytics.


From left to right: Poul Petersen (CIO at BigML), Lana Novikova (CEO at Heartbeat Ai Technologies), Miguel Suarez (Strategic Advisor at BigML), Guy Grinwald (CTO & CoFounder at Joonko), Ilit Raz (CEO & CoFounder at Joonko), Archil Cheishvili (Founder at Palatine Analytics), and Dharmendra Kanejiya (Founder & CEO at Cognii)

Following the event, Francisco J. Martin, President of PreSeries and Cofounder and CEO of BigML, said: “Today was further testament to the increasing level of interest in a quantifiable, data-driven approach to evaluating early stage startups. We have been continuously improving the models that make PreSeries possible, as evidenced by the variety of questions ranging from team experience and depth to prior investor interest, as well as intellectual property and current traction. Many traditional investors were skeptical when we started this journey, but we are now witnessing that a growing number of institutional investors are starting to see the merit in PreSeries’ approach. It’s safe to say our ‘crazy idea’ will move onward with an emboldened spirit.”

Stay tuned for updates on the next AI Startup Battle!

AI Startup Battle in Boston – Meet the contenders!

If you are at all familiar with the world of startups today, you have probably noticed how startup competitions keep popping up everywhere. From the biggest competitions to the more modest ones, every early-stage venture can now find its way under the spotlight. But despite their growing number, startup contests mostly still rely on the same approach: carefully selected companies pitching in front of carefully selected juries.

By design, a competition’s result will reflect its jury’s subjectivity, even though decades of research in psychology and behavioral economics show that putting human bias at the center of a selection process might not be the best solution. For lack of better alternatives, humans are believed to be the best and only option. Yet when it comes to predicting the success of a startup, a jury will often give you as many opinions as there are members of the jury. In the end, the result is a consensus of opinions based on five-minute presentations and a handful of slides.


Luckily, there is still hope for a more scientific approach that does not take the fun out of the competition! Our solution? The AI Startup Battle.

The AI Startup Battle is a startup contest powered by PreSeries, a joint venture between BigML and Telefonica Open Future_ with the objective of creating the world’s first platform to automate early-stage investments. The second edition of the Battle will take place on Oct. 12 as part of the PAPIs ’16 conference on predictive applications and APIs. Join us at the Microsoft N.E.R.D. center on the MIT campus, where you’ll see a real-world, high-stakes AI judging startups live.

The first edition was held in Valencia in March 2016, where PreSeries’ impartial algorithm crowned Novelti, a company that uses online machine learning algorithms to turn IoT data streams into actionable intelligence. Novelti will be presenting on stage this week to kickstart the contest for this year’s participants.

Let’s have a quick look at the contenders:

Cognii: they’re developing leading-edge assessment technology to evaluate essay-type answers for online learning platforms. Their exclusive natural language processing technology can also give customized feedback, not just a score, to engage students in an active learning process and improve their knowledge retention. Cognii’s solution is offered through an API for all online learning platforms, including LMSs (Learning Management Systems), MOOCs (Massive Open Online Courses), and more.



Joonko: The first data-driven solution for workforce diversity. It integrates into companies’ SaaS platforms and analyzes real actions in real time. The data collected is unbiased – this way, organizations can ensure that all employees get an equal opportunity to succeed in a safe, non-judgmental way. Diversity is a business problem, not just an HR one.


Palatine Analytics: Palatine helps companies evaluate the current and future performance of their employees by using AI and Predictive Analytics. With Palatine, you can collect reliable data points by incentivizing employees through Palatine’s real-time, AI-driven feedback system, which captures the accuracy of evaluations, recognizes employees’ strengths and weaknesses, and uses predictive analytics to accurately forecast their future performance.


Heartbeat Ai Technologies: The mission of Heartbeat Ai is to design emotionally intelligent technologies and tools to help machines understand people’s feelings, needs and motivations, and ultimately improve our emotional wellbeing. How? Language uniquely enables the differentiation of fine-grained emotions. The approach first teaches machines to understand fine-grained emotions from language and context. Then, it builds a broader understanding of human needs, desires and motivations.


Good luck to all participants!

BigML Summer 2016 Release Webinar Video is Here!

Many thanks for the enthusiastic feedback on BigML’s Summer 2016 Release webinar that formally introduced Logistic Regression to the BigML Dashboard. We had a number of inquiries from those who missed the broadcast, so we’re happy to share that you can now watch the entire webinar on the BigML Youtube channel:

As for more study resources, we recommend that you visit the Summer Release page, which contains all related resource links including:

  • The Logistic Regression documentation that goes into detail on both the BigML Dashboard and the BigML API implementations of this supervised learning technique.
  • The series of 6 blog posts covering everything from the basics to how you can fully automate your Logistic Regression workflows with WhizzML.

As a parting reminder, BigML offers a special education program for those students or lecturers who want to actively spread the word about Logistic Regression and other Machine Learning capabilities in their institutions. We are proud that we currently have more than 80 ambassadors and over 600 universities around the world enjoying our PRO subscription plans for FREE for a full year. Thanks for your hand in making the BigML community great!

Hype or Reality? Stealing Machine Learning Models via Prediction APIs

Wired magazine just published an article with the interesting title How to Steal an AI, where the author explores the topic of reverse engineering Machine Learning algorithms based on a recently published academic paper: Stealing Machine Learning Models via Prediction APIs.

How to Steal an AI

BigML was contacted by the author via email prior to publication, and within 24 hours we responded with a lengthy email that sums up our stance on the topic. Unfortunately, the article incorrectly stated that BigML did not respond. We are in the process of helping the author correct that omission. Update: the Wired article has now been updated and includes a short paragraph that summarizes BigML’s response. In the meantime, to set the record straight, we are publishing the highlights of our response below for the benefit of the BigML community, as we take any security and privacy related issue very seriously:

WIRED Author:

“I’d really appreciate if anyone at BigML can comment on the security or privacy threat this [ways of “stealing” machine learning models from black-box platforms] might represent to BigML’s machine learning platform, given that it seems certain models can be reverse engineered via a series of inputs and outputs by anyone who accesses them on BigML’s public platform.”


  • Models built using BigML’s platform are only accessible to their owners who already have complete and white-box access to them, so this research does not expose or represent any security or privacy threat to BigML’s platform at all.

  • BigML’s users can access the underlying structure of their own models. This means that they can not only introspect their models using BigML’s visualizations but also fully download their models and use them in their own applications as they wish. BigML does not charge users for making predictions with their own models, so there is no need to reverse-engineer them, as might be the case when you use Azure ML or Amazon ML. These services charge the owners of the models for making predictions with their own models.

  • BigML allows users to share models with other BigML users either in a white-box mode or in a black-box mode. In the latter case, if a user wanted to monetize her model by charging another user for predictions, the user being charged might try to reproduce the model to avoid continuing to pay for predictions. There is currently no BigML user charging for predictions. Again, this research does not expose or represent any security or privacy threat to BigML’s platform at all.

On Obviousness

  • Anyone versed in Machine Learning can see that many of the results of the publication are obvious. Any machine-learned model that is made available becomes a “data labeling API”, so it can, unsurprisingly, be used to label enough data to reproduce the model to some degree.  These researchers are focused on elaborate attacks that learn decision trees exactly (which does seem interesting academically), but far simpler algorithms will and always have been able to generate a lossy reproduction of a machine-learned model.  In fact, this is the exact trick that Machine Learning itself pulls on human experts: The human provides labeled data and the machine learns a model that replicates (to the degree made possible by the data) the modeling process of the human labeler.  It is therefore utterly unremarkable that this also works if it is a machine providing the labeled data.

  • As an instructive example, imagine you want to reverse-engineer the pricing strategy of an airline. It is unimportant how the model used by the airline was created: using a Machine Learning API, an open source ML package, or a collection of rules provided by experts. If one looks up the price for enough flights, days, and lead times, one will soon have enough data to replicate the pricing strategy.

On Charging for Predictions:

  • BigML does not charge customers for predictions with their own models.  We think that this research might be relevant for services like Amazon ML or Azure ML, since they are charging users for predictions. Users of those services could try to reproduce the model or simply cache model responses to avoid being charged. Selling predictions is not a long-term money-making proposition unless you keep improving the classifier so that your predictions keep improving too. In other words, this shows how charging for predictions is a poor business strategy, and how BigML’s business model (charging for overall computational capacity to build many models for many different predictive use cases in an organization) is therefore more reasonable.

  • In BigML, this research would only be significant in the scenario where a BigML user publicly offers their trained model for paid predictions but wants to keep it secret. We do not currently have any customers exposing black-box models (except the ones created by these researchers). But if that were the case, a user could guarantee that reconstructing the model would have a prohibitive cost by setting a higher price for each prediction.

On Applicability:

  • Some models are easier to reproduce while others are considerably harder. This research shows that their most elaborate method is only useful for single trees.  When the confidence level of a prediction is provided, the difficulty of the learning problem decreases.  However, when the models are more complex (such as Random Decision Forests) the process to replicate a model is not amenable to many of the techniques described in the paper, so models can only be approximated via the method we describe above.

  • If we wanted to offer a monetized black-box prediction platform in a serious way (and we are sure that we do not), we would encourage users to use complex models rather than individual trees. We can easily detect and throttle the kind of systematic search across the input space that would be required to efficiently reconstruct a complex classifier.

On Machine Learning APIs:

  • One thing is very clear to us though: Machine Learning APIs help researchers in many areas to start experimenting with machine-learned models in a way that other tools have never allowed. Mind you, this is coming from a team with backgrounds in ML research. In fact, the research these folks carried out would be far more difficult to pursue using old-fashioned Machine Learning tools such as R or SAS, which are tedious and complicated.

Finally, some comments in defense of other Machine Learning services that are potentially subject to this issue.

On Legality: 

  • We assume that to a researcher in security trying to find things on which to publish a paper, everything looks like a “security issue”. Putting things in the same category as data privacy or identity theft issues makes them sound dangerous and urgent. However, the vast majority of the paper describes security issues closer in nature to defeating copy protection in commercial software, or developing software that functions exactly as an existing commercial product. While this sort of security breach is certainly unfortunate and something to be minimized, it is important to distinguish things that are often dangerous to the public at large from those that, in the vast majority of cases, do not pose as big a threat.

  • Software theft and reverse engineering isn’t new or unique to Machine Learning as a Service, and society typically relies on the legal system to provide incentives against such behavior.  Said another way, even if stealing software were easy, there is still an important disincentive to do so in that it violates intellectual property law.  To our knowledge, there has been no major IP litigation to date involving compromise of machine-learned models, but as machine learning grows in popularity the applicable laws will almost certainly mature and offer some recourse against the exploits that the authors describe.

Logistic Regression versus Decision Trees

The question of which model type to apply to a Machine Learning task can be a daunting one given the immense number of algorithms available in the literature. It can be difficult to compare the relative merits of two methods, as one can outperform the other in a certain class of problems while consistently coming in behind for another class. In this post, the last one in our series about Logistic Regression, we’ll explore the differences between Decision Trees and Logistic Regression for classification problems, and try to highlight scenarios where one might be recommended over the other.

Decision Boundaries

Logistic Regression and trees differ in the way that they generate decision boundaries, i.e., the lines that are drawn to separate different classes. To illustrate this difference, let’s look at the results of the two model types on the following 2-class problem:

Decision Trees bisect the space into smaller and smaller regions, whereas Logistic Regression fits a single line to divide the space exactly into two. Of course for higher-dimensional data, these lines would generalize to planes and hyperplanes. A single linear boundary can sometimes be limiting for Logistic Regression. In this example where the two classes are separated by a decidedly non-linear boundary, we see that trees can better capture the division, leading to superior classification performance. However, when classes are not well-separated, trees are susceptible to overfitting the training data, so that Logistic Regression’s simple linear boundary generalizes better.
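
For reference, the model behind that single linear boundary can be sketched in standard textbook notation (this is the general formulation, not anything BigML-specific):

    p(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \boldsymbol{\beta} \cdot \mathbf{x})}}

The decision boundary is the set of points where \beta_0 + \boldsymbol{\beta} \cdot \mathbf{x} = 0, i.e., a single line (or hyperplane in higher dimensions), which is why Logistic Regression cannot carve the space into many regions the way a tree does.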

Lastly, the background color of these plots represents the prediction confidence. Each node of a Decision Tree assigns a constant confidence value to the entire region that it spans, leading to a rather patchwork appearance of confidence values across the entire space. On the other hand, prediction confidence for Logistic Regression can be computed in closed-form for any arbitrary input coordinates, so that we have an infinitely more fine-grained result and can be more confident in our prediction confidence values.


Although the last example was designed to give Logistic Regression a performance advantage, its resulting f-measure did not exactly beat the Decision Tree’s by a huge margin. So what else is there to recommend Logistic Regression? Let’s look at the tree model view in the BigML web interface:


When a tree consists of a large number of nodes, it can require a significant amount of mental effort to comprehend all the splits that lead up to a particular prediction. In contrast, a Logistic Regression model is simply a list of coefficients:


At a glance, we can see that an instance’s y-coordinate is just over three times as important as its x-coordinate for determining its class, which is corroborated by the slope of the decision boundary from the previous section. An important caveat here regards scale: if, for example, x and y were given in units of meters and kilometers respectively, we should expect their coefficients to differ by a factor of 1000 in order to represent equal importance in a real-world, physical sense. Because Logistic Regression models are fully described by their coefficients, they are attractive to users who have some familiarity with their data and are interested in knowing the influence of particular input fields on the objective.

Source Code

The code for this blog post consists of a WhizzML script to train and evaluate both Decision Tree and Logistic Regression models, plus a Python script that executes the WhizzML and draws the plots. You can view it on GitHub.

Learn more about Logistic Regression on our release page. You will find documentation on how to use Logistic Regression with the BigML Dashboard and the BigML API. You can also watch the webinar, see the slideshow, and read the other blog posts of this series about Logistic Regression.

Automating Logistic Regression Workflows


Continuing our series of posts about Logistic Regression, this fifth post focuses on the point of view of a WhizzML user. WhizzML is BigML’s popular domain-specific language for Machine Learning, which provides programmatic support for all the resources you work with in BigML. You can use WhizzML scripts to create a Logistic Regression, or to create a prediction or batch prediction based on one.

Let’s begin with the easiest one: If you want to create a Logistic Regression with all the default values you just need to create a script with the following source code:
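
(A minimal WhizzML sketch, assuming the script declares a dataset input named “ds1”:)

    ;; create a Logistic Regression with all the default values
    (create-logisticregression {"dataset" ds1})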


As BigML’s API is asynchronous, the create call will probably return a response before the Logistic Regression is totally built. Thus, if you want to use the Logistic Regression to make predictions, you should wait until the creation process has been completed. If you want to stop the code from processing until the Logistic Regression is finished you can use the “create-and-wait-logisticregression” directive.
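
(A sketch using that directive, under the same assumptions:)

    ;; block until the Logistic Regression is finished
    (create-and-wait-logisticregression {"dataset" ds1})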


To modify the default value of a Logistic Regression property you can simply add it to the properties map as a pair: “<property_name>” <property_value>. For instance, when calculating a Logistic Regression with a dataset that contains missing values, BigML’s normal default behavior is to replace them with the mean. However, if you want to replace them with zero you should add default_numeric_value and set it to “zero”. The source code will be as follows:
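
(Again a sketch, with the same assumed “ds1” input:)

    ;; replace missing numeric values with zero instead of the mean
    (create-logisticregression {"dataset" ds1
                                "default_numeric_value" "zero"})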


You can modify any configuration option in similar fashion. The BigML API documentation contains detailed information about those properties.

What if you have an existing Logistic Regression and you want to get the code needed to recreate it with WhizzML? No problem; programmer or not, BigML has a solution for you. Say you already tuned a Logistic Regression in BigML and you want to repeat the process on a new source that you just uploaded to the service. You can easily use the scriptify utility, which will generate a script that runs the exact steps needed to reproduce the Logistic Regression. Just navigate to the Logistic Regression you want to replicate and click on the “SCRIPTIFY LOGISTIC REGRESSION” link.


If you want to create a prediction from your Logistic Regression with WhizzML, the code is also short and easy. You just need the ID of the Logistic Regression you want to use and the values of the new instance you want to predict for, i.e., your input data. In the input_data map, the field ID is used as the key. Here’s an example:
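
(A sketch; the Logistic Regression ID and the field IDs used as input_data keys below are hypothetical placeholders:)

    (create-prediction {"logisticregression" "logisticregression/55f007d21f386f5199000001"
                        "input_data" {"000000" 58 "000001" 36}})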


In case you need to predict not for a single instance but for a set of new instances, you will need to create a batch prediction from your Logistic Regression by using WhizzML.
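
(A sketch, with hypothetical IDs for the Logistic Regression and for the dataset holding the new instances:)

    (create-batchprediction {"logisticregression" "logisticregression/55f007d21f386f5199000001"
                             "dataset" "dataset/55f007d21f386f5199000002"})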


Once your source code is in place, how do you execute your script?

  • Using the BigML Dashboard, look for the script you just created. Opening the script view will reveal the available inputs, and you will be able to select their new values, after which you can start the execution. For instance, the first script in this post looks as follows, and it expects you to select the dataset you want to create the Logistic Regression from.
  • If you want to execute the script through the API, you need the ID of the script you previously created. Following the same example, the dataset you want to create the Logistic Regression from (input “ds1”) should be included in the list of inputs. The corresponding request to the BigML API would be as below:

    curl "$BIGML_AUTH"
           -X POST
           -H 'content-type: application/json'
           -d '{"script": "script/55f007d21f386f5199000003",
                "inputs": [["ds1", "dataset/55f007d21f386f5199000000"]]}'

These Logistic Regressions should execute swiftly, while you reach for your coffee.

If you have any doubts or want to learn more about Logistic Regression, please check out our release page for documentation on how to use Logistic Regression with the BigML Dashboard and the BigML API. You can also watch the webinar, see the slideshow, and read the other blog posts of this series about Logistic Regression.

Programming Logistic Regressions

In this post, the fourth in our Logistic Regression series, we provide a brief summary of all the necessary steps to create a Logistic Regression using the BigML API. As we mentioned in our previous posts, Logistic Regression is a supervised learning method to solve classification problems, i.e., the objective field must be categorical, and it can consist of two or more different classes.

The API workflow to create a Logistic Regression and use it to make predictions is very similar to the one we explained for the Dashboard in our previous post. It’s worth mentioning that any resource created with the API will automatically be created in your Dashboard too, so you can take advantage of BigML’s intuitive visualizations at any time.


In case you have never used the BigML API before, all requests to manage your resources must use HTTPS and be authenticated with your username and API key to verify your identity. Find below a base URL example to manage Logistic Regressions:

    https://bigml.io/logisticregression?username=$BIGML_USERNAME;api_key=$BIGML_API_KEY

You can find your authentication details in your Dashboard account by clicking on the API Key icon in the top menu.


Ok, time to create a Logistic Regression from scratch!

1. Upload your Data

You can upload your data, in your preferred format, from a local file, a remote file (using a URL), or your cloud repository, e.g., AWS, Azure, etc. This will automatically create a source in your BigML account.

First, you need to open up a terminal with curl or any other command-line tool that implements standard HTTPS methods. In the example below we are creating a source from a remote CSV file containing some patient data, each row representing one patient’s information.

curl "$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"remote": ""}'

2. Create a Dataset

After the source is created, you need to build a dataset, which serializes your data and transforms it into a suitable input for the Machine Learning algorithm.

curl "$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"source":"source/68b5627b3c1920186f000325"}'

Then, split your recently created dataset into two subsets: one for training the model and another for testing it. It is essential to evaluate your model with data that the model hasn’t seen before. You need to do this in two separate API calls that create two different datasets.

  • To create the training dataset, you need the original dataset ID and the sample_rate (the proportion of instances to include in the sample) as arguments. In the example below we are including 80% of the instances in our training dataset. We also set a particular seed argument to make the sampling deterministic, which ensures that the instances selected for the training dataset will never be part of a test dataset created with the same sampling holdout.

curl "$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"dataset":"dataset/68b5627b3c1920186f000325", 
            "sample_rate":0.8, "seed":"foo"}'
  • For the testing dataset, you also need the original dataset ID and the sample_rate, but this time combined with the out_of_bag argument. The out-of-bag sample takes the remaining (1 - sample_rate) proportion of instances, in this case 1 - 0.8 = 0.2. Using those two arguments along with the same seed used to create the training dataset, we ensure that the training and testing datasets are mutually exclusive.

curl "$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"dataset":"dataset/68b5627b3c1920186f000325", 
            "sample_rate":0.8, "out_of_bag":true, "seed":"foo"}'

3. Create a Logistic Regression

Next, use your training dataset to create a Logistic Regression. Remember that the field you want to predict must be categorical. BigML takes the last valid field in your dataset as the objective field by default; if it is not categorical and you didn’t specify another objective field, the Logistic Regression creation will throw an error. In the example below, we are creating a Logistic Regression including an argument to indicate the objective field. To specify the objective field you can use either the field name or the field ID:

curl "$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"dataset":"dataset/98b5527c3c1920386a000467", 

You can also configure a wide range of the Logistic Regression parameters at creation time. Read about all of them in the API documentation.

Strictly speaking, Logistic Regression can only handle numeric fields as inputs, but BigML automatically performs a set of transformations so that it can also support categorical, text, and items fields. BigML uses one-hot encoding by default, but you can configure other types of transformations using the different encoding options provided.
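
For instance, here is a rough sketch of the field_codings creation argument, written in WhizzML notation for brevity (the exact shape is described in the API documentation; the field name and class below are hypothetical):

    ;; a sketch: dummy-code a hypothetical "gender" field against the "male" class
    (create-logisticregression {"dataset" ds1
                                "field_codings" [{"field" "gender"
                                                  "coding" "dummy"
                                                  "dummy_class" "male"}]})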

4. Evaluate the Logistic Regression

Evaluating your Logistic Regression is key to measuring its predictive performance against unseen data. Logistic Regression evaluations yield the same confusion matrix and metrics as any other classification model: precision, recall, accuracy, phi-measure and f-measure. You can read more about these metrics here.

You need the logistic regression ID and the testing dataset ID as arguments to create an evaluation using the API:

curl "$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"logisticregression":"logisticregression/50650bea3c19201b64000024",

Check the evaluation results and rerun the process by trying other parameter configurations and new features that may improve the performance. There is no general rule of thumb as to when a model is “good enough”. It depends on your context, e.g., domain, data limitations, current solution. For example, if you are predicting churn and you can currently predict only 30% of the churn, elevating that to 80% with your Logistic Regression would be a huge enhancement. However, if you are trying to diagnose cancer, 80% recall may not be enough to get the necessary approvals.

5. Make Predictions

Finally, once you are satisfied with your model’s performance, use your Logistic Regression to make predictions by feeding it new data. Logistic Regression in BigML can gracefully handle missing values for your categorical, text or items fields. This also holds true for numeric fields as long as you have trained the model with missing_numerics=true (which is the default); otherwise, instances with missing values for numeric fields will be dropped.

In BigML you can make predictions for a single instance or multiple instances (in batch). See below an example for each case.

To predict one new data point, just input the values for the fields used by the Logistic Regression to make your prediction. In turn, you get a probability for each of your objective field classes; the class with the highest probability is the one predicted. All class probabilities sum up to 100%.

curl "$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"logisticregression":"logisticregression/50650bea3c19201b64000024",
            "input_data":{"age":58, "bmi":36, "plasma glucose":180}}'

To make predictions for multiple instances simultaneously, use the logistic regression ID and the ID of the dataset containing the observations you want to predict. You can configure your batch prediction so the final output file also contains the probabilities for all your classes besides the predicted class.

curl "$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"logisticregression":"logisticregression/50650bea3c19201b64000024",
            "probabilities": true}'

In the next post we will explain how to use Logistic Regression with WhizzML, which will complete our series.  One more to go…

If you want to learn more about Logistic Regression please visit our release page for documentation on how to use Logistic Regression with the BigML Dashboard and the BigML API. You can also watch the webinar, see the slideshow, and read the other blog posts of this series about Logistic Regression.

Predicting Airbnb Prices with Logistic Regression

This is the third post in the series that covers BigML’s Logistic Regression implementation, which gives you another method to solve classification problems, i.e., predicting a categorical value such as “churn / not churn”, “fraud / not fraud”, “high / medium / low” risk, etc. As usual, BigML pairs this new algorithm with powerful visualizations to effectively analyze the key insights from your model results. This post demonstrates this popular classification technique via a use case that predicts housing rental prices based on a simplified version of this Airbnb public dataset.

The Data

The dataset contains information about more than 13,000 different accommodations in Amsterdam and includes variables like room type, description, neighborhood, latitude, longitude, minimum stays, number of reviews, availability, and price.


By definition, Logistic Regression only accepts numeric fields as inputs, but BigML applies a set of automatic transformations to support all field types so you don’t have to waste precious time encoding your categorical and text data yourself.

Since the price is a numeric field, and Logistic Regression only works for classification problems, we discretize the target variable into two main categories: cheap prices (< €100 per night) and expensive prices (>= €100 per night).
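
(One way to do this discretization programmatically, sketched here rather than taken from the exact script used for this post, is a Flatline expression inside a WhizzML dataset transformation; “ds” is a hypothetical variable holding the original dataset ID:)

    ;; add a categorical "price_class" field derived from the numeric price
    (create-dataset {"origin_dataset" ds
                     "new_fields" [{"field" "(if (< (f \"price\") 100) \"cheap\" \"expensive\")"
                                    "name" "price_class"}]})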

Finally, we perform some feature engineering, like calculating the distance from downtown using the latitude and longitude data, in 1-click thanks to a WhizzML script that will soon be published in the BigML Gallery. Incredibly easy!

Let’s dive in!

The Logistic Regression

Creating a Logistic Regression is very easy, especially when using the 1-click Logistic Regression option. (Alternatively, you may prefer the configuration option to tune various model parameters.) After a short wait… voilà! The model has been created, and now you can visually inspect the results with both a two-fold chart (1D and 2D) and the coefficients table.

The Chart

The Logistic Regression chart allows you to visually interpret the influence of one or more fields on your predictions. Let’s see some examples.

In the image below, we selected the distance (in meters) from downtown for the x-axis. As you might expect, the probability of an accommodation being cheap (blue line) increases as the distance increases, while the probability of it being expensive (orange line) decreases. At some point (around 8 kilometers) the slope softens and the probabilities tend to become constant.

Following the same example, you can also see the combined influence of other field values by using the input fields form to the right. See in the images below the impact of the room type on the relationship between distance and price. When “Shared room” is selected and the accommodation is 3 kilometers from downtown, there is a 75% probability for the cheap class. However, if we select “Entire home/apt”, given the same distance, there is an 83% probability of finding an expensive rental.

The combined impact of two fields on predictions can be better visualized in the 2D chart, which you can reach by clicking on the green switch at the top. A heat map containing the class probabilities appears, and you can select the input fields for both axes. The image below shows the great difference in cheap and expensive price probabilities depending on the neighborhood, while the minimum nights feature on the x-axis seems to have less influence on the price.


You can also enter text and item values into the corresponding form fields on the right. Keeping the same input fields on the axes, see below the increase in the expensive class probability across all neighborhoods due to the presence of the word “houseboat” in the accommodation description.


The Coefficients Table

For more advanced users, BigML also displays a table where you can inspect all the coefficients for each of the input fields (rows) and each of the objective field classes (columns).

The coefficients can be interpreted in two ways:

  • Correlation direction: given an objective field class, a positive coefficient for a field indicates that higher values for that field will increase the probability of the class. By contrast, negative coefficients indicate a negative correlation between the field and the class probability.
  • Impact on predictions: the higher a field’s coefficient (in magnitude), the greater that field’s impact on predictions (see the log-odds sketch below).
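
In the two-class case, both points can be read off the log-odds form of the model (a standard identity, sketched here in textbook notation):

    \log \frac{p(c \mid \mathbf{x})}{1 - p(c \mid \mathbf{x})} = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n

A positive \beta_i raises the probability of class c as x_i grows, and, for auto-scaled fields, a larger magnitude of \beta_i means a larger impact on the prediction.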

In the example below, you can see the coefficient for the room type “Entire home/apt” is positive for the expensive class and negative for the cheap class, indicating the same behavior that we saw at the beginning of this post in the 1D chart.



After evaluating your model, when you are finally satisfied with it, you can go ahead and start making predictions. BigML offers predictions for a single instance or multiple instances (in batch).

In the example below, we are making a prediction for a new single instance: a private room located in Westerpark, with the word “studio” in the description and a minimum stay of 2 nights. The predicted class is cheap with a probability of 95.22%, while the probability of being an expensive rental is just 4.78%.


We encourage you to check out the other posts in this series: the first post was about the basic concepts of Logistic Regression, the second post covered the six necessary steps to get started with Logistic Regression, this third post explains how to predict with BigML’s Logistic Regressions, the fourth and fifth posts will cover how to create a Logistic Regression with the BigML API and with WhizzML respectively, and finally, the sixth post will dive into the differences between Logistic Regression and Decision Trees. You can find all these posts in our release page, as well as more documentation on how to use Logistic Regression with the BigML Dashboard, the BigML API, and the complete webinar and its slideshow about Logistic Regression. 

Logistic Regressions: the 6 Steps to Predictions

BigML is bringing Logistic Regression to the Dashboard so you can solve complex classification problems with the help of powerful visualizations to inspect and analyze your results. Logistic Regression is one of the best-known supervised learning algorithms to predict binary or multi-class categorical values such as “True/False”, “Spam/ Not Spam”, “Offer A / Offer B / Offer C”, etc.

In this post we aim to take you through the 6 necessary steps to get started with Logistic Regression:


1. Uploading your Data

As usual, start by uploading your data to your BigML account. BigML offers several ways to do it: you can drag and drop a local file, connect BigML to your cloud repository (e.g., S3 buckets), or copy and paste a URL. BigML automatically identifies the field types. Field types and other source parameters can be configured by clicking on the source configuration option.

2. Create a Dataset

From your source view, use the 1-click dataset option to create a dataset, a structured version of your data ready to be used by a Machine Learning algorithm.


In the dataset view you will be able to see a summary of your field values, some basic statistics, and the field histograms to analyze your data distributions. This view is really useful for spotting any errors or irregularities in your data. You can filter the dataset by several criteria and create new fields using different pre-defined operations.


Once your data is clean and free of errors, you can split your dataset into two different subsets: one for training your model, and the other for testing. It is crucial to train and evaluate your model with different data to ensure it generalizes well against unseen data. You can easily split your dataset using the BigML 1-click option, which randomly sets aside 80% of the instances for training and 20% for testing.


3. Create a Logistic Regression

Now you are ready to create the Logistic Regression using your training dataset. You can use the 1-click Logistic Regression option, which will create the model using the default parameter values. If you are a more advanced user and you feel comfortable tuning the Logistic Regression parameters, you can do so by using the configure Logistic Regression option.


Find below a brief summary of each of the configuration parameters; a short scripting sketch showing several of them set at creation time follows the list. If you want to learn more about them, please check the Logistic Regression documentation.

  • Objective field: select the field you want to predict. By default BigML will take the last valid field in your dataset. Remember it must be categorical!

  • Default numeric value: if your numeric fields contain missing values, you can easily replace them with the field mean, median, maximum, minimum, or zero using this option. It is inactive by default.

  • Missing numerics: if your numeric fields contain missing values but you think they have a meaning to predict the objective field, you can use this option to include them in the model. Otherwise, instances with missing numerics will be ignored. It is active by default.

  • Eps: set the value of the stopping criterion for the solver. Higher values can make the model faster, but they may result in poorer predictive performance. You can set a float value between 0 and 1. It is set to 0.0001 by default.

  • Bias: include or exclude the intercept in the Logistic Regression formula. Including it yields better results in most cases. It is active by default.

  • Auto-scaled fields: automatically scale your fields so they all have the same magnitudes. This will also allow you to compare the field coefficients learned by the model afterwards. It is active by default.

  • Regularization: prevent the model from overfitting by using a regularization factor. You can choose between L1 and L2 regularization. The former usually gives better results. You can also tweak the inverse of the regularization strength.

  • Field codings: select the encoding option that works best for your categorical fields. BigML will automatically transform your categorical values into 0-1 variables to support non-numeric fields as inputs, a method known as one-hot encoding. Alternatively, you can choose among three other types of codings: dummy coding, contrast coding, and other coding. You can find a detailed explanation of each one in the documentation.

  • Sampling options: if you have a very large dataset, you may not need all the instances to create the model. BigML allows you to easily sample your dataset at model creation time.
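
To make these concrete, here is a rough sketch in WhizzML (BigML’s scripting language, covered later in this series) setting several of the options above at creation time; the argument names follow the BigML API documentation, and “ds1” and the objective field name are hypothetical:

    ;; a sketch: "churn" is a hypothetical objective field name
    (create-logisticregression {"dataset" ds1
                                "objective_field" "churn"
                                "default_numeric_value" "median"
                                "missing_numerics" true
                                "eps" 0.0001
                                "bias" true
                                "regularization" "l1"})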

At this point you may be wondering… ok, so which parameter values should I use?

Unfortunately, there is no universal answer. It depends on the data, the domain, and the use case you are trying to solve. Our recommendation is that you try to understand the strengths and weaknesses of your model and iterate, trying different features and configurations. To do this, the model visualizations explained in the next step play an essential role.

4. Analyze your Results

Once your Logistic Regression has been created, you can use BigML’s insightful visualizations to dive into the model results and see the impact of your features on its predictions. Take into account that most of the time the greatest gains in performance come from feature selection and feature engineering, which can be the most time-consuming part of the Machine Learning process. Analyzing the results carefully and inspecting your model to understand the reasons behind its predictions is key to further validating the findings against expert opinion.

BigML provides 1D and 2D charts and the coefficients table to analyze your results.

1D and 2D Chart

The Logistic Regression chart provides a visual way to analyze the impact of one or more fields on predictions.

For the 1D chart you can select one input field for the x-axis. In the prediction legend to the right, you will see the objective class predictions as you mouse over the chart area.


For the 2D chart you can select two input fields, one per axis and the objective class predictions will be plotted in the color heat map chart.


By setting the values for the rest of the input fields using the form below the prediction legend, you will be able to inspect the combined interaction of multiple fields on predictions.

Coefficients table

BigML also provides a table displaying the coefficients learned by the Logistic Regression. Each coefficient is associated with a field (e.g., checking_status) and an objective field class (e.g., bad, good, etc.). A positive coefficient indicates a positive correlation between the input field and the objective field class, while a negative coefficient indicates a negative relationship.


To find out more about interpreting the Logistic Regression chart and coefficients table, see the next blog post of this series: Predicting Airbnb Prices with BigML Logistic Regression.

5. Evaluate the Logistic Regression

Like any supervised learning method, Logistic Regression needs to be evaluated. Just click on the evaluate option in the 1-click menu and BigML will automatically select the remaining 20% of the dataset that you set aside for testing.


The resulting performance metrics to be analyzed are the same ones as for any other classifier predicting a categorical value.

You will get the confusion matrix containing the true positives, false positives, true negatives and false negatives along with the classification metrics: precision, recall, accuracy, f-measure and phi-measure. For a full description of the confusion matrix and classification measures see the corresponding documentation.


Rinse and repeat! As we mentioned at the end of step 3, repeat steps 3 to 5, trying out different configurations, different features, etc., until you have a good enough model.

6. Make Predictions

When you finally reach a satisfying model performance, you can start making predictions with it. In BigML, you can make predictions for a new single instance or multiple instances in batch. Let’s take a quick look at both of them!

Single predictions

Click on the Predict option and set the values for your input fields.


A form containing all your input fields will be displayed and you will be able to set the values for a new instance. At the top of the view you will see the objective class probabilities changing as you change your input field values.


Batch predictions

Use the Batch Prediction option in the 1-click menu and select the dataset containing the instances for which you want to know the objective field value.


You can configure several parameters of your batch prediction, such as including all class probabilities in the batch prediction output dataset and file. When your batch prediction finishes, you will be able to download the CSV file and see the output dataset.


In the next post we will cover a real use case, using Logistic Regression to predict Airbnb prices, and delve into the interpretation of Logistic Regression results.

If you want to learn more about Logistic Regression please visit our release page for documentation on how to use Logistic Regression with the BigML Dashboard and the BigML API. You can also watch the webinar, see the slideshow, and read the other blog posts of this series about Logistic Regression.
