
BigML Spring Release: A Cluster of Powerful New Features!

The BigML team has been hard at work the past few months. While we haven’t figured out how to work more than 24 hours in a day, we think you’ll be pleased with the key features of BigML’s Spring Release.

First and foremost, we’re excited to announce Cluster Analysis, our first foray into unsupervised learning. Other key features in the Spring Release include more filter and new field options, online dataset creation, major updates to BigMLer, and more.



Now you can automatically group the most similar instances in your dataset into clusters.

BigML’s clustering algorithm is inspired by k-means, so you can select the number of groups to create (i.e., k) and also how much each field in your dataset influences the group to which each data point is assigned (i.e., scales). Similar to our trees and ensembles, you can create a cluster from your dataset in just one click. By default, BigML will create 8 clusters and apply automatic scaling to all the numeric fields, although the number of clusters as well as field scaling and weighting can easily be modified when you configure your cluster.

Scaling is important because datasets often contain fields with very different magnitudes. For example, a demographics dataset might contain age and salary. If clustering is performed on those fields, salary will dominate the clusters while age is mostly ignored. That’s generally not what you want when clustering, hence the auto-scale fields option (balance_fields in the API). When auto-scale is enabled, all the numeric fields are scaled so that their standard deviations are 1, giving each field roughly equivalent influence. You can also pick your own scale for each field.
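As a sketch of what auto-scaling amounts to (illustrative Python only, not BigML code), each numeric column can be divided by its own standard deviation:

```python
import statistics

def auto_scale(rows):
    """Scale each numeric column so its standard deviation is 1,
    roughly equalizing each field's influence on distance computations.
    BigML's server-side option also handles missing values and other
    edge cases this sketch ignores."""
    columns = list(zip(*rows))
    scaled_cols = []
    for col in columns:
        sd = statistics.pstdev(col)
        # Leave constant columns untouched to avoid dividing by zero.
        scaled_cols.append([v / sd if sd else v for v in col])
    return [list(r) for r in zip(*scaled_cols)]

# Age and salary differ by orders of magnitude before scaling.
rows = [[25, 30000.0], [40, 90000.0], [60, 60000.0]]
scaled = auto_scale(rows)
```

After scaling, a one-unit difference means the same thing in every column, so salary no longer drowns out age.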

Once you build a cluster you can use it to predict the centroid (i.e., find the closest centroid for a new data point) and also to create batch centroids in the same way batch predictions work.
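Conceptually, a centroid prediction just finds the closest centroid under a distance measure. A minimal sketch with made-up centroid names and coordinates (BigML’s actual computation also covers categorical fields and missing values):

```python
import math

def nearest_centroid(point, centroids):
    """Return the name of the centroid closest to a new data point,
    using plain Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(centroids, key=lambda name: dist(point, centroids[name]))

# Hypothetical centroids from an already-built cluster.
centroids = {"Cluster 0": [1.0, 1.0], "Cluster 1": [5.0, 5.0]}
result = nearest_centroid([1.5, 0.5], centroids)
```

A batch centroid simply applies the same lookup to every row of a dataset.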

Cluster Analysis has been released in Beta, so please let us know if something does not work as expected or if you think we’ve overlooked any helpful features.

By the way, you can also share clusters via private links in the same way you can share datasets or models:


More Filter and New Field Options

In our Winter release we introduced Flatline—a Lisp-like language that can be used to filter rows of a dataset, and also to generate new fields using a mix of columns and rows. This language easily allows one to extend the number of filtering and new field creation options that BigML offers, and is now implemented in the BigML interface as part of our Spring Release. For example, you can now easily filter a dataset using different comparison, equality, missing value, and statistics functions. You can also create new fields discretizing, replacing missing fields, normalizing, and performing all kind of math transformations on previous values of your dataset. Have a look at the Filter Dataset and Add Field to Dataset options, and remember that you can also use Flatline to input any complex function that you might need.
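For instance, a Flatline filter such as `(> (f "age") 30)` keeps the rows whose age field exceeds 30 (the field name here is hypothetical). In plain Python, the same filter would read:

```python
# Rows as dictionaries, with made-up field names and values.
rows = [{"age": 25, "salary": 30000},
        {"age": 41, "salary": 60000},
        {"age": 58, "salary": 90000}]

# Equivalent of the Flatline expression (> (f "age") 30):
filtered = [r for r in rows if r["age"] > 30]
```

The difference is that Flatline expressions run inside BigML, so the filtered dataset is produced server-side without downloading your data.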


Segment-based dataset creation in models

Have you ever wanted to create a new dataset for further analysis from a specific node in a tree? Now you can! When you’re in a model or sunburst view, simply mouse over a node and then press your keyboard’s shift button. This will freeze the view and allow you to export the rules for that segment and/or create a new dataset with the instances at that node.


Dataset exports

Now you can also export datasets from a dataset view into a comma-separated values (CSV) file. This works very well in combination with the dataset creation above as it can help you identify the instances that follow certain criteria.


Ensemble Summary Report

A nifty (and perhaps underappreciated) feature of BigML is the ability to get a summary report of your model which shows your predicted data distribution, field importance and associated outcomes. You can now get a similar report for your Ensembles! This is a great way to get a quick summary on what fields have the greatest impact on your predictive outcome–something that can be very illustrative when working with new and/or wide datasets.


New BigMLer

BigMLer, our popular command line tool for machine learning, now features powerful new evaluation-guided techniques to support advanced predictive modeling. Specifically, through a new subcommand bigmler analyze you can quickly perform smart feature selection and node threshold selection. Feature selection detects the subset of features that will produce better models according to their evaluation measures (be it accuracy, phi or the one you like best). Node threshold selection finds out the number of nodes that your model should grow to in order to optimize evaluations. We’ll give you more details in an upcoming blog post.

And More…

There’s also a bunch of small things like: logging in directly with GitHub, processing Firebase URLs, and of course tons of improvements to our API, backend and infrastructure.

The Spring Release features are available immediately—simply log into your account and get started today! And be sure to let us know your feedback—we love hearing from our users and want to make sure that we continue to deliver the best machine learning platform possible.

Predicting Stock Swings with PsychSignal, Quandl and BigML

People like to tweet about stocks, so much so that ticker symbols get their own special dollar sign like $AAPL or $FB.  What if you could mine this data for insight into public sentiment about these stocks?  Even better, what if you could use this data to predict activity in the stock market? That’s the premise behind PsychSignal, a provider of “real time financial sentiment”. They harvest large streams of data from Twitter and other sources, then compute real time sentiment scores (one “bullish” and one “bearish”) on a scale from 0 to 4.  In a blog post titled Can the Bloomberg Terminal be “Toppled”?, a former Managing Director of Bloomberg Ventures asks an intriguing question: Could this kind of crowdsourced data be used to replace some of the functionality of a Bloomberg terminal?

So just for fun, we combined Quandl price and volume data with PsychSignal sentiment scores for 20 technology stocks. We then trained a simple model to predict whether the percentage “swing” (intraday high minus low) is higher or lower than the median, using only data available before the commencement of the trading day. Looking at the SunBurst view, we see a lot of bright green, which means the model is picking up some interesting correlations. For example, if the previous day’s close is down more than 3% from the previous day’s open, the opening price is less than the previous day’s close, and the previous day’s bearish signal is more than 0.84, then the model strongly predicts a price swing higher than the median (shown as a category named “2nd” in the screenshot below).

BigML Stock Swing Model

Evaluating the model on a single holdout set shows that it does much better than random guessing (see below). To be extra thorough, I used the BigML API to run 5-fold cross validation (not shown), confirming that average accuracy really is more than 64%.

Evaluation of BigML Stock Swing Model

Interestingly, if I try to predict whether a stock simply went up or down, the model is barely better than flipping a coin; for whatever reason, it’s easier to predict a tech stock’s intraday volatility than its daily gain or loss. Still, the accuracy of the “swing” model is impressive—just look at all that green in the SunBurst view. And you can bet your greek symbols that options traders are interested in predicting volatility.

Company founder James Crane-Baker puts this all in perspective: “Social media is such a rich vein of data about investor sentiment, it would be surprising if it didn’t contain useful information.” And this data is available in real time, so you could try building a model to make predictions using same-day data. Then maybe you’ll really start seeing some green!

Using BigML and Tableau to Visualize … Graffiti?

We’ve all heard about case interview questions where the victim (sorry, applicant) is asked to explain why manhole covers are round, or estimate how many piano tuning experts there are in the world. Well, here’s a new one for you: How many incidents of graffiti have been logged in San Francisco since July 2008? The answer: more than 162,000. That’s almost 80 incidents per day for more than 5 years, and those are just the ones that got called in to 311; presumably a great deal of street art went unappreciated.

Thankfully, the City has painstakingly recorded a wealth of detail for each graffito, including whether it is “offensive” or “not offensive”. Interestingly, about 61% of incidents are deemed “not offensive”, although this is the home of the Folsom Street Fair (NSFW) so we assume folks grade on a curve.

As a bonus, the data is already geocoded, so we can easily train a model to find parts of the city likely to have (un)offensive graffiti. The result is here, with 56% of offensive graffiti contained in only 27% of the dataset, a rectangle that stretches from the south end of Market Street to Ocean Beach (to the west) and the Marina (to the north). Within this rectangle, public property is defaced by offensive orange dots while private property is, er, enhanced with relatively unoffensive blue dots. (Click the map to see the full Tableau goodness.)

Graffiti in San Francisco

I’m still pondering why this particular rectangle, and specifically the public property within this rectangle, is such a magnet for offensive graffiti. Is public property just more vulnerable to the orange stuff? But then why didn’t the model find a heavy mix of orange around Van Ness and Civic Center (never mind the Tenderloin)? Is this rectangle more sparsely populated than downtown, making people more likely to indulge in the extra-bad behavior of writing not just graffiti, but offensive graffiti? I don’t like either of these explanations; perhaps San Francisco 311 has some insight.

What I do know is that the SunBurst is full of green, which always makes me happy, and evaluation on a single holdout set shows more than 80% accuracy, which is much better than guessing. The confusion matrix tells us that the model’s main weakness is false negatives, i.e. offensive graffiti mislabeled as not offensive.


This is another great example of combining BigML and Tableau to create a compelling visualization: BigML finds a geographic pattern simply by analyzing latitude and longitude as numbers, and Tableau displays this pattern using its (really cool) built-in maps. And in typical Tableau fashion, the finding just leaps off the page.

Three Workshops on Predictive Applications

This is a guest post by Louis Dorard, author of Bootstrapping Machine Learning

When it comes to learning about Machine Learning APIs from experts, webinars are a great place to start. However, there’s nothing like in-person workshops. Today, BigML is announcing three exceptional workshops on Predictive Applications taking place in Spain in May. I will have the honour of speaking at these workshops alongside Francisco and jao, the CEO and CTO of BigML. Here are the three dates:

  • Bellaterra (Barcelona): May 7 at 12pm, IIIA (Artificial Intelligence Research Institute) at the Campus of the Universitat Autònoma de Barcelona.
  • Madrid: May 8 at 5pm, Wayra/Telefonica.
  • Valencia: May 13 at 4pm, Universitat Politècnica de València.

Workshop on Predictive Applications

If you’re anywhere near these cities at that time, you should definitely come! Click the links to check out the posters above and get your invite!

Some more details

The aim of these workshops is to give you a quick and practical view of why and how to embrace predictive applications.

I will start with an introduction to Machine Learning, its possibilities, its limitations, and I will provide some advice on how to apply Machine Learning to your domain. I have the privilege of being the one to introduce BigML, but I will also review some other Prediction APIs. Then, Francisco will provide an introduction to BigML’s REST API. Finally, jao’s part will be the most hands-on as he will show step-by-step how to build a predictive application in 30 minutes.

We’re also thrilled to be joined in Madrid by Enrique Dans, who authors one of the most read technology blogs in Spanish and who also blogs in English, and by Richard Benjamins, who is Group Director BI & Big Data Internal Exploitation at Telefonica.  In Valencia, we’ll be hosted by Professor Vicent Botti who leads the Artificial Intelligence Group at the Universitat Politècnica de València.

I am happy to announce to the readers of the BigML blog that these workshops will coincide with the launch of my new book on Prediction APIs, on May 7. I will cover some of its content in my talks, but not all of it! You can already get a flavour of what’s coming by downloading the first chapters and the table of contents for free.

Don’t forget to send an email to request an invite, and don’t hesitate to say hi if you’re coming! I can be found on Twitter: @louisdorard. See you soon in Spain!

BigML API + MATLAB = Beautiful, Interactive Trees

Whatever your preferred working environment, we want to make BigML available to you. That’s why we’re happy to announce a new set of BigML API bindings for MATLAB. These bindings, in the form of a MATLAB class, expose all the functions of BigML’s RESTful API.



Some of you may be aware that MATLAB includes a classification and regression tree implementation.  So what makes BigML different? Here are some points that stand out in our mind:

  • MATLAB presents its trees as static line drawings, as seen in the screenshot below of a classification tree trained on the iris dataset. In contrast, BigML’s models are fully interactive. You can view a model built with BigML here. We think that our model interface makes it easy to follow decision paths and tease out data patterns at a glance.
  • Users of MATLAB will admit that it’s not the most efficient computational platform, and that heavyweight scripts will quickly consume all your workstation’s resources. BigML, on the other hand, is a cloud-based service. You can train your models in the background, leaving your machine free to perform other number-crunching tasks.
  • One last important difference is that in order to use MATLAB’s trees, your MATLAB installation needs to include the statistics toolbox. Our API bindings will work with just a base MATLAB install.



Keep your eyes on our blog to see an application to showcase MATLAB and BigML working together. In the meantime, grab the API bindings here, and start playing. We can’t wait to see what neat things you can accomplish!

Automatic Weighting of Imbalanced Datasets

Very often datasets are imbalanced. That is, the number of instances of each class of the target variable that you want to predict is not proportional to the real importance of each class in your problem. Usually, the class of interest is not the majority class. Imagine a dataset containing clickstream data that you want to use to create a predictive advertising application. The number of instances of users who did not click on an ad would probably be much higher than the number of click-through instances. So when you build a statistical machine-learning model from an imbalanced dataset, the majority (i.e., most prevalent) class will outweigh the minority classes. These datasets usually lead you to build predictive models with suboptimal classification performance. This problem is known as the class-imbalance problem and occurs in a multitude of domains (fraud prevention, intrusion detection, churn prediction, etc.). In this post, we’ll see how you can deal with imbalanced datasets by configuring your models or ensembles to use weights via BigML’s web interface. You can read how to create weighted models using BigML’s API here and via BigML’s command line here.




A simple way to cope with imbalanced datasets is re-sampling. That is, undersampling the majority class or oversampling the minority classes. In BigML, you can easily implement re-sampling by using multi-datasets and sampling each class differently. However, basic undersampling often throws away instances that might turn out to be informative, and basic oversampling does not add any extra information to your model.
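A minimal sketch of basic undersampling (illustrative Python, not BigML code; in BigML you would achieve this by sampling each class at a different rate via multi-datasets):

```python
import random

def undersample(rows, label_field, majority_class, keep_fraction, seed=42):
    """Keep every minority-class row, but only a random fraction of the
    majority-class rows. The field name and classes are made up."""
    rng = random.Random(seed)
    kept = []
    for row in rows:
        if row[label_field] != majority_class or rng.random() < keep_fraction:
            kept.append(row)
    return kept

# 90 non-clicks vs 10 clicks: a typical clickstream imbalance.
rows = [{"click": "no"}] * 90 + [{"click": "yes"}] * 10
balanced = undersample(rows, "click", "no", keep_fraction=0.1)
```

The drawback the paragraph mentions is visible here: the discarded "no" rows are gone for good, informative or not.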

Another way to keep all the information and work closer to the root of the problem is to use weights. That is, weighting instances according to the importance they have in your problem. This enables things like telecom customer churn models where each customer is weighted according to their Lifetime Value. Next, let’s see what the impact of weighting a model might be, and examine the options for using weights in BigML.

The Impact of Weighting

Let me illustrate the impact of weighting on model creation by means of two sunburst visualizations for models of the Forest Covertype dataset. This dataset has 581,012 instances that belong to 7 different classes distributed as follows:

  • Lodgepole Pine:  283,301 instances
  • Spruce-Fir: 211,840 instances
  • Ponderosa Pine: 35,754 instances
  • Krummholz: 20,510 instances
  • Douglas-fir: 17,367 instances
  • Aspen: 9,493 instances
  • Cottonwood-Willow: 2,747 instances

The first sunburst below corresponds to a single (512-node) model colored by prediction that I created without using any weighting. The second one corresponds to a single (512-node) weighted model created using BigML’s new balance objective option (more on it below). In both sunbursts, red corresponds to the Cottonwood-Willow class and light green to the Aspen class. In the first sunburst, you can see that the model hardly ever predicts those classes. However, in the sunburst of the weighted model you can see that there are many more red and light green nodes that will predict those classes.


As you can see, weighting surfaces predictions for classes that are under-represented in the input data and would otherwise be overshadowed by the over-represented ones.

Weighting Models in BigML

BigML gives you three ways to apply weights to your dataset:

  1. Using one of the fields in your dataset as a weight field;
  2. Specifying a weight for each class in the objective field; or
  3. Automatically balancing all the classes in the objective field.



Weight Field

Using this option, BigML will use one of the fields in your dataset as a weight for each instance. If your dataset does not have an explicit weight, you can add one using BigML’s new dataset transformations. Any numeric field with no negative or missing values is valid as a weight field. Each instance will be weighted individually according to the value of its weight field. This method is valid for both classification and regression models.

Weighing by LTV

Objective Weight

This method for adding weights only applies to classification models. A set of objective weights may be defined, one weight per objective class. Each instance will be weighted according to its class weight. If a class is not listed in the objective weights, it is assumed to have a weight of 1.  Weights of value zero are valid as long as there are some other positive valued weights. If every weight does end up being zero (this can happen, for instance, if sampling the dataset produces only instances of classes with zero weight) then the resulting model will have a single node with a nil output.

Weight using Objective Weights

Balance Objective

The third method is a convenience shortcut for specifying weights for a classification objective which are inversely proportional to their category counts. This gives you an easy way to make sure that all the classes in your dataset are evenly represented.

Balance Objective
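A sketch of the arithmetic behind balancing, assuming the weights are simply made inversely proportional to the class counts listed above (the exact normalization BigML applies may differ):

```python
def balance_weights(class_counts):
    """Per-class weights inversely proportional to the category counts,
    normalized so the rarest class gets weight 1.0. With these weights,
    every class contributes the same total weight to the model."""
    min_count = min(class_counts.values())
    return {c: min_count / n for c, n in class_counts.items()}

# Forest Covertype class counts from the post above.
counts = {"Lodgepole Pine": 283301, "Spruce-Fir": 211840,
          "Ponderosa Pine": 35754, "Krummholz": 20510,
          "Douglas-fir": 17367, "Aspen": 9493,
          "Cottonwood-Willow": 2747}
weights = balance_weights(counts)
```

Each instance of the rare Cottonwood-Willow class ends up counting as much as roughly a hundred Lodgepole Pine instances, which is what lets the weighted sunburst show red nodes at all.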


Finally, bear in mind that when you use weights to create a model, its performance can be significantly impacted. In classification models, you’re actually trading off precision and recall of the classes involved. So it’s very important to pay attention not only to the plain performance measures returned by evaluations but also to the corresponding misclassification costs. That is, if you know the cost of a false positive and the cost of a false negative in your problem, you will want to weight each class to minimize the overall misclassification cost when you build your model.
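For example, under hypothetical costs where a false negative is far more expensive than a false positive, the quantity you would want your chosen weights to minimize looks like this:

```python
def total_misclassification_cost(errors, costs):
    """Total cost of a classifier's errors given error counts from an
    evaluation's confusion matrix and per-error costs. All numbers
    below are made up for illustration."""
    return (errors["false_positives"] * costs["false_positive"]
            + errors["false_negatives"] * costs["false_negative"])

# A missed churner (false negative) costs far more than a wasted
# retention offer (false positive).
cost = total_misclassification_cost(
    {"false_positives": 40, "false_negatives": 10},
    {"false_positive": 5.0, "false_negative": 100.0})
```

Two models with identical accuracy can differ wildly on this measure, which is why plain accuracy alone is not enough when classes are weighted.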

BigML + Tableau = Powerful Predictive Visualizations for Everyone

BigML is very excited to be collaborating with Tableau Software as a newly minted Tableau Technology Partner. This is a natural fit, as both companies share similar philosophies on making analytics accessible (and enjoyable) to more people through intuitive and easy-to-use tools. Not surprisingly, there’s a lot of overlap between our users: our latest survey showed that over 50% of BigML’s users also use Tableau. And of course, we’re eager to introduce Tableau’s customers to the predictive power of BigML.

So what does this mean for BigML and Tableau users?  First and foremost, it means that BigML will be working on ways to enable easier usage across both tools. For starters, we have just launched a new feature that lets you export a BigML model directly to Tableau as a calculated field, as is demonstrated in the video below:

This approach unleashes the power of Tableau for prediction, letting users interact with a BigML model just like any other Tableau field.  In the video, for example, we color a bar chart by predicted profit, which yields valuable insights into Tableau’s “superstore” retail dataset.

Moving forward, we’ll be looking at other ways to combine the powerful analytic and visual capabilities of Tableau and BigML to enable joint customers to do amazing things.

We’d love to hear from any BigML users who are also Tableau customers so we can get added direction on this collaboration and provide you with early access to future implementations. If you’re willing and interested, please email us!

One Million Predictive Models Created and Counting!

Just over a year ago BigML hit a milestone when the 10,000th predictive model was created using our platform. At the time, we were very happy to see that our idea of building a service that helped people easily build predictive applications had started to get some traction. One year and several thousand users later, we are super excited that BigML has now supported the creation of over 1,000,000 predictive models—400,000 of which were created in just the last two months! We are extremely happy to see a growing community of fellow data practitioners and developers using BigML across a number of domains and use cases: predictive advertising, predictive lead scoring, customer churn prevention, security, fraud detection, etc.

And we’re just getting started!  We have a roadmap of exciting features that will be coming out just around the corner… Stay tuned!


Smart Feature Selection with scikit-learn and BigML’s API

When trying to make data-driven decisions, we’re often faced with datasets that contain many more features than we actually need for decision-making. As a consequence, building models becomes more computationally demanding, and even worse, model performance can suffer, as heavily parameterized models can lead to overfitting. How, then, can we pare down our data to the most relevant set of features? In this post, we will discuss a solution proposed by Stanford scientists Ron Kohavi (now partner-level architect at Microsoft) and George H. John (now CEO at Rocket Fuel) in this article, in fact one of the top 300 most-cited articles according to CiteSeerX.

The Basic Idea

Let’s consider a simple case where the data are composed of three features. We can denote each possible feature subset as a three-digit binary string, where the value of each digit denotes whether the corresponding feature is included in the subset. For example, the string ‘101’ represents the subset which includes the first and third features from the original set of three. Our goal, therefore, is to find which subset gives the best model performance. For any set of n features, there exist 2^n different subsets, so clearly an exhaustive search is out of the question for any data containing more than a handful of features.
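The binary-string encoding can be made concrete in a few lines (the feature names are placeholders):

```python
from itertools import product

features = ["f1", "f2", "f3"]

def subset_from_string(bits, features):
    """Decode a binary string such as '101' into the feature subset it
    represents (here, the first and third features)."""
    return [f for f, b in zip(features, bits) if b == "1"]

# All 2^n subsets of a 3-feature set.
all_subsets = ["".join(bits) for bits in product("01", repeat=len(features))]
```

For 15 features, as in the dataset below, the same enumeration would already produce 32,768 strings, which is why a guided search is needed.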

Feature Subset Selection

The Advanced Idea

Now, let’s say each of these feature subsets is a node in a state space, and we form connections between those nodes that differ by only a single digit, i.e. the addition or deletion of a single feature. Our challenge therefore is to explore this state space in a manner which will quickly arrive at the most desirable feature subset. To accomplish this, we will employ the Best-First Search algorithm. To direct the search algorithm, we need a method of scoring each node in the state space. We will define the score as the mean accuracy after n-fold cross validation of models trained with the feature subset, minus a penalty of 0.1% per feature. This penalty is introduced to mitigate overfitting. As an added bonus, it also favors models which are quicker to build and evaluate. The above figure illustrates how a single node is evaluated using 3-fold cross validation. Lastly, we need to decide where in the state space to begin the search. Most of the branching will occur in the nodes encountered earlier in the search, so we will begin at the subset containing zero features, so that less time is spent building and evaluating models to explore these early nodes.
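A compact sketch of this search, with a toy stand-in for the cross-validation scorer (in a real run each call would train and evaluate actual models):

```python
import heapq

def neighbors(state):
    """Subsets reachable by adding or removing a single feature."""
    return [state[:i] + ("1" if b == "0" else "0") + state[i + 1:]
            for i, b in enumerate(state)]

def score(state, cv_accuracy):
    """Mean cross-validation accuracy minus a 0.1% penalty per feature."""
    return cv_accuracy(state) - 0.001 * state.count("1")

def best_first_search(n_features, cv_accuracy, max_evals=50):
    """Best-first search over feature subsets, starting from the empty
    subset as the post recommends."""
    start = "0" * n_features
    # heapq is a min-heap, so push negated scores.
    frontier = [(-score(start, cv_accuracy), start)]
    seen = {start}
    best_state, best_score = start, score(start, cv_accuracy)
    evals = 1
    while frontier and evals < max_evals:
        _, state = heapq.heappop(frontier)
        for nxt in neighbors(state):
            if nxt in seen:
                continue
            seen.add(nxt)
            s = score(nxt, cv_accuracy)
            evals += 1
            if s > best_score:
                best_state, best_score = nxt, s
            heapq.heappush(frontier, (-s, nxt))
    return best_state

# Toy stand-in: pretend only the first and third features help accuracy.
def fake_cv_accuracy(state):
    return 0.70 + 0.08 * (state[0] == "1") + 0.05 * (state[2] == "1")

best = best_first_search(4, fake_cv_accuracy)
```

With the toy scorer, the search settles on the subset containing exactly the two useful features; the per-feature penalty is what stops it from adding the useless ones.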

Testing the Idea

We’ll try this approach using the credit-screening dataset from the UC Irvine Machine Learning Repository, which contains 15 categorical and numerical features. Building a model using all the features, we get an average accuracy of 81.74% from five-fold cross validation. Not bad, but maybe we can do better… BigML’s Python bindings make it simple and straightforward to implement the search algorithm described above. You can find such an implementation using scikit-learn to create k-fold cross-validation data sources here. After running the feature subset search, we arrive at a feature subset containing just 3 features, which achieves an average cross-validation accuracy of 86.37%. Moreover, we arrived at this result after evaluating only 117 different feature subsets, a far cry from the 32,768 that would be required for an exhaustive search.

Surfing the sea of tags with multi-label Machine Learning and BigMLer

Navigating the web nowadays, we constantly paddle through clouds of tags and other classification or taxonomy labels. We’ve grown so used to it that we hardly notice their presence anymore, but they are everywhere. They flow around the main stream of contents, pervade posts (such as this one), articles, personal pages, catalogs and all major content repositories. To store this kind of content, one naturally drifts to non-normalized fields or document databases, where they end up stacked in a multi-occurrence field. From a Machine Learning point of view, they become a multi-labeled feature.

Machine Learning has well-known methods to cope with this kind of multi-labeled feature. In a previous post on this blog we talked about predicting multi-labeled categories using BigMLer, the command line utility for BigML. Now we present the next step: using the labels of multi-labeled fields as predictors.

Multi-labels in professional profiles: Recruiting toy example

In the spirit of the aforementioned post, let’s have a look at the typical profile page on your favourite recruiting web site. At first sight, we detect some contents that might well be stored in multi-labeled fields: the companies we’ve worked for, the associations we belong to, the languages we speak, the positions we’ve held and our skills. Should you face the task of predicting which profile would be most suitable for a certain position, you would probably want to use this information. For example, in technical positions people who understand English will probably perform better in their job. But maybe this is not so true for the chemical or pharmaceutical sector, where German has traditionally been a dominant communication language. Also the number of spoken languages could be a determinant factor. Who knows what other relations are hidden in your multi-labeled features!

OK, so there seems to be a bunch of valuable information stored in multi-labeled fields, but it looks like it must be reshaped to be useful in your machine learning system. Each label per se is a new feature you can use as input for your prediction rules. Even aggregation functions over the labels, like count, first or last, can be useful as new input. Now the good news: BigMLer can do this for you!

Multiple predictor multi-label fields.

Let’s build an example based on our recruiting site scenario. Suppose we make up a sample with some of the features available in users’ profile pages, such as the name, age, gender, marital status, number of certifications, number of recommendations, number of courses, titles, languages and skills. This is an excerpt of the training data:

The last three fields above are filled with colon-separated multiple values. For each person, they contain:

  • Titles: the titles of the posts she has occupied
  • Languages: the languages she speaks
  • Skills: the skills she has

As our goal is to produce a quick and simple proof of concept, the values for the Titles field are restricted to just four categories: Student, Engineer, Manager and CEO. The skills have also been chosen from a list of popular skills. Looking at the data we have, the first question that arises is: could we build a model that predicts whether a new profile would fit a certain position? Well, probably some skills are a must in CEO‘s profiles, while others are still missing when you are a Student. Our data implicitly contains all of this information, but we would need some work to rearrange it in order to build a predictive model from it. For example, we would need to build a new field for each of the available skills and populate it with a True value if the profile has that skill and False otherwise. What if you could let BigMLer take care of these uninteresting details for you? Guess what: you can!

See the next BigMLer command?

The --multi-label and --multi-label-fields options tell BigMLer that the contents of the fields Titles, Languages and Skills are a series of labels. Using --label-separator you set colon as the labels’ delimiter and… ta-da! BigMLer generates an extended file where each of the fields declared as multi-labeled is transformed into a series of new binary fields, one for each different label. The extended file is used to build the corresponding source, dataset and model objects that you need to make a prediction. The value to predict is the Titles field, which has been targeted as our objective using the --objective option. So, in just one line and with no coding at all, you have been able to add each and every label of your Skills, Languages and Titles as an independent new feature available to be used as a predictor or objective in your models.
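The expansion itself is easy to picture. Here is an illustrative Python sketch of turning a colon-separated field into binary label fields (the generated field names are our own guess at a naming scheme, not BigMLer’s exact output):

```python
def expand_multi_label(rows, field, separator=":"):
    """Expand a separator-delimited multi-label field into one binary
    field per distinct label, in the spirit of the extended file that a
    --multi-label run produces."""
    labels = sorted({lab for row in rows for lab in row[field].split(separator)})
    for row in rows:
        present = set(row[field].split(separator))
        for lab in labels:
            row["%s - %s" % (field, lab)] = lab in present
    return rows

rows = [{"Languages": "English:German"}, {"Languages": "English"}]
expanded = expand_multi_label(rows, "Languages")
```

Each new binary field can then be used as an ordinary predictor in a model.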

Still, you might be missing additional features, such as the number of languages or the last occupied position. BigMLer has also added a new --label-aggregate option that can help you with that. You can use count to create a new field holding the number of occurrences of labels in a multi-label field and first or last to generate a new field with the first or last label found in the multi-label fields. In our example, we could use

and some Titles - count, Languages - count, Skills - count, Titles - last, Languages - last, Skills - last new fields will be added to our original source data and used in model building.
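The three aggregates are straightforward to sketch (illustrative Python, not BigMLer internals):

```python
def label_aggregates(value, separator=":"):
    """Compute the count, first, and last aggregates over a multi-label
    value, mirroring what the --label-aggregate option adds as new
    fields."""
    labels = value.split(separator) if value else []
    return {"count": len(labels),
            "first": labels[0] if labels else None,
            "last": labels[-1] if labels else None}

# A hypothetical career path stored in the Titles field.
agg = label_aggregates("Student:Engineer:Manager")
```

So a profile whose Titles field reads Student:Engineer:Manager would get 3, Student and Manager as its three aggregate fields.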

Selecting significant input data

We have just built an extended file and generated the BigML resources needed to create a trained model. Nevertheless, we focused mainly on showing the advantages that BigMLer offers for building new features from multi-labeled fields, disregarding the question of which of them to include or exclude as inputs for our model. In our example, the ID and Name fields should be excluded from the model-building input fields, as they are external references that have nothing to do with the Titles values (we don’t expect all CEOs to be named Francisco). Also, once the new label fields are included, we may prefer to exclude the original multi-labeled fields from the model input fields to ensure that the prediction rules are based only on the separate labels. For the Titles objective field, we would also like to ignore the generated aggregation fields to avoid useless prediction rules, like saying that if the last post occupied in a profile is Manager, then the profile is suitable to be a Manager. In addition, we exclude the Age, Gender and Marital Status fields because we don’t want these features to influence our analysis. This is how we chose to build our final model:

where the --model-fields option has been set to a comma-separated list of all the fields that you would like to be excluded as input for your model prefixed by a - sign. Then, having a look at the generated models, one for each position, some rules appear. You can see the entire process in the next video:

According to our data, Students have a limited number of skills such as web programming but lack others such as Business Intelligence (found in Managers’ profiles), Algorithm Design (frequently found in Engineers’ profiles) or Software Engineering Management (appearing in CEOs’ profiles). Similar patterns can be found in the Engineers’ model, the Managers’ model and the CEOs’ model, so that when we come across new profiles, you could use these models (tagged with the multi-label-recruiting text) to predict the positions they would be suitable for by calling

where new_profiles.csv would contain the information of the new profiles in CSV format. The predictions.csv file generated by BigMLer would store the predicted values for each new profile and their confidences separated by comma, all ready to use.

This is a simplified example of how BigMLer can empower you to easily use multi-labeled fields in your machine learning system. It can split the fields’ contents, generate new binary fields for each existing label and even aggregate the labels information with count, first or last functions. BigMLer‘s functionality keeps growing steadily, and new options like weighted models and threshold limited predictions are ready for you to try–but this will be the subject for another post, so stay tuned!

