Surfing the sea of tags with multi-label Machine Learning and BigMLer


Navigating the web nowadays, we constantly paddle through clouds of tags and other classification or taxonomy labels. We’ve grown so used to them that we hardly notice their presence, but they are everywhere. They flow around the main stream of content, pervading posts (such as this one), articles, personal pages, catalogs and all major content repositories. To store this kind of content, one naturally drifts to non-normalized fields or document databases, where the labels end up stacked in a multi-occurrence field. From a Machine Learning point of view, they become a multi-labeled feature.

Machine Learning has well-known methods to cope with this kind of multi-labeled feature. On this blog, we devoted a previous post to predicting multi-labeled categories using BigMLer, the command-line utility for BigML. Now we present the next step: using the labels of multi-labeled fields as predictors.

Multi-labels in professional profiles: Recruiting toy example

In the spirit of the aforementioned post, let’s have a look at the typical profile page on your favourite recruiting web site. At first sight, we detect some contents that might well be stored in multi-labeled fields: the companies we’ve worked for, the associations we belong to, the languages we speak, the positions we’ve held and our skills. Should you face the task of predicting which profile would be more suitable for a certain position, you would probably want to use this information. For example, in technical positions, people who understand English will probably perform better in their job. But maybe this is not so true for the chemical or pharmaceutical sector, where German has traditionally been a dominant communication language. The number of spoken languages could also be a determining factor. Who knows what other relations are hidden in your multi-labeled features!

OK, so there seems to be a bunch of valuable information stored in multi-labeled fields, but it looks like it must be reshaped to be useful in your machine learning system. Each label per se is a new feature you can use as input for your prediction rules. Even aggregation functions over the labels, like count, first or last, can be useful as new input. Now the good news: BigMLer can do this for you!

Multi-label fields as multiple predictors

Let’s build an example based on our recruiting site scenario. Suppose we make up a sample with some of the features available in users’ profile pages, such as the name, age, gender, marital status, number of certifications, number of recommendations, number of courses, titles, languages and skills. This is an excerpt of the training data:

ID,Name,Age,Gender,Marital Status,Certifications,Recommendations,Courses,Titles,Languages,Skills
1,Fannie Dais,51,Female,Widowed,5,10,3,Student:Manager,French:English,Software Engineering Management:Recruiting:JSON:Perl/Python/Ruby:Oracle:Database Management and Software:Business Development/Relationship Management
2,Isaias Stoodley,47,Male,Divorced,5,10,6,Manager:CEO,English:German:Italian,MongoDB:Recruiting:Software Engineering Management:Business Intelligence:Linux:Oracle
3,Mari Gramling,19,Male,Married,0,0,0,Student,French,MongoDB:JSON:Web programming
4,Barrie Murakami,45,Male,Divorced,1,5,3,Engineer,German:English,Windows:MongoDB:Algorithm Design:MySQL:Linux

The last three fields above are filled with colon-separated multiple values. For each person, they contain:

  • Titles: the titles of the posts she has occupied
  • Languages: the languages she speaks
  • Skills: the skills she has

As our goal is to produce a quick and simple proof of concept, the values for the Titles field are restricted to just four categories: Student, Engineer, Manager and CEO. The skills have also been chosen from a list of popular skills. Looking at the data we have, the first question that arises is: could we build a model that predicts if a new profile would fit a certain position? Well, probably some skills are a must in CEO‘s profiles, while other are still missing when you are a Student. Our data has implicitly all of this information, but we would need some work to rearrange it in order to build a predictive model from it. For example, we would need to build a new field for each of the available skills and populate it with a True value if the profile has that skill and False otherwise. What if you could let BigMLer take care of these uninteresting details for you? Guess what: you can!
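To make the transformation concrete, here is a minimal Python sketch of the kind of expansion just described, turning one colon-separated multi-label field into a set of boolean fields. The helper function is hypothetical and only illustrates the idea; BigMLer’s actual implementation may differ:

```python
def expand_multi_label(rows, field, sep=":"):
    """Add one boolean field per distinct label found in a multi-label field."""
    # Collect every distinct label across all rows
    labels = sorted({label for row in rows for label in row[field].split(sep)})
    extended = []
    for row in rows:
        present = set(row[field].split(sep))
        new_row = dict(row)
        for label in labels:
            # True if this profile lists the label, False otherwise
            new_row["%s - %s" % (field, label)] = label in present
        extended.append(new_row)
    return extended

# Toy excerpt of the training data above
rows = [
    {"ID": "3", "Skills": "MongoDB:JSON:Web programming"},
    {"ID": "4", "Skills": "Windows:MongoDB:Algorithm Design:MySQL:Linux"},
]
extended = expand_multi_label(rows, "Skills")
print(extended[0]["Skills - JSON"])  # True
print(extended[1]["Skills - JSON"])  # False
```

Each profile keeps its original fields and gains one `Skills - <label>` column per distinct skill seen anywhere in the data, which is exactly the shape a decision tree can consume.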

Take a look at the next BigMLer command:

bigmler --multi-label --name Recruiting_Multilabel \
--train multi_label_recruiting.csv \
--multi-label-fields Titles,Languages,Skills \
--objective Titles --label-separator :

The --multi-label and --multi-label-fields options tell BigMLer that the contents of the Titles, Languages and Skills fields are a series of labels. Using --label-separator you set the colon as the labels’ delimiter and… ta-da! BigMLer generates an extended file where each of the fields declared as multi-labeled is transformed into a series of new binary fields, one for each distinct label. The extended file is used to build the corresponding source, dataset and model objects that you need to make a prediction. The value to predict is the Titles field, which has been set as our objective using the --objective option. So, in just one line and with no coding at all, you have added each and every label of your Skills, Languages and Titles fields as an independent new feature, available to be used as a predictor or objective in your models.

Still, you might be missing additional features, such as the number of languages or the last occupied position. BigMLer has also added a new --label-aggregate option that can help you with that. You can use count to create a new field holding the number of occurrences of labels in a multi-label field and first or last to generate a new field with the first or last label found in the multi-label fields. In our example, we could use
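To see what the aggregates produce, here is a small Python sketch of what count, first and last compute for one colon-separated value. The helper function is ours, not part of BigMLer, and is only meant to illustrate the semantics:

```python
def aggregate_labels(value, sep=":"):
    """Compute the count, first and last aggregates of a multi-label value."""
    labels = value.split(sep)
    return {"count": len(labels), "first": labels[0], "last": labels[-1]}

# For the Languages value of the first profile in our sample:
print(aggregate_labels("French:English"))
# {'count': 2, 'first': 'French', 'last': 'English'}
```

So a profile whose Languages field reads French:English would get Languages - count = 2 and Languages - last = English in the extended file.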

bigmler --multi-label --name Recruiting_Multilabel \
--train multi_label_recruiting.csv \
--multi-label-fields Titles,Languages,Skills \
--objective Titles --label-separator : \
--label-aggregate count,last

and new fields named Titles - count, Languages - count, Skills - count, Titles - last, Languages - last and Skills - last will be added to our original source data and used in model building.

Selecting significant input data

We have just built an extended file and generated the BigML resources needed to create a trained model. Nevertheless, we focused mainly on showing the advantages BigMLer offers for building new features from multi-labeled fields, disregarding the convenience of including or excluding some of them as inputs for our model. In our example, the ID and Name fields should be excluded from the model-building input fields, as they are external references that have nothing to do with the Titles values (we don’t expect all CEOs to be named Francisco). Also, once the new label fields are included, we may prefer to exclude the original multi-labeled fields from the model input fields to ensure that the prediction rules are based only on individual labels. For the Titles objective field, we would also like to ignore the generated aggregation fields to avoid useless prediction rules, such as saying that if the last post occupied in a profile is Manager, then the profile is suitable for a Manager position. In addition, we exclude the Age, Gender and Marital Status fields because we don’t want these features to influence our analysis. This is how we chose to build our final model:

bigmler --multi-label --name Recruiting_Multilabel \
--train multi_label_recruiting.csv \
--multi-label-fields Titles,Languages,Skills \
--objective Titles \
--label-separator : \
--model-fields='-Name,-ID,-Age,-Gender,-Marital Status,-Languages,-Skills,-Titles - last,-Titles - count' \
--label-aggregate count,last \
--tag multi-label-recruiting

where the --model-fields option has been set to a comma-separated list of all the fields that you would like to exclude as input for your model, each prefixed by a - sign. Then, having a look at the generated models, one for each position, some rules appear. You can see the entire process in the next video:

According to our data, Students have a limited number of skills such as web programming but lack others such as Business Intelligence (found in Managers‘ profiles), Algorithm Design (frequently found in Engineers‘ profiles) or Software Engineering Management (appearing in CEOs‘ profiles). Similar patterns can be found in the Engineers‘, Managers‘ and CEOs‘ models, so that when you come across new profiles, you can use these models (tagged with the multi-label-recruiting text) to predict the positions they would be suitable for by calling

bigmler --model-tag multi-label-recruiting \
--test new_profiles.csv --method combined

where new_profiles.csv would contain the information of the new profiles in CSV format. The predictions.csv file generated by BigMLer stores the predicted value for each new profile and its confidence, comma-separated and ready to use.
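If you wanted to post-process those predictions, a small Python reader might look like the sketch below. Note that the exact column layout of predictions.csv is an assumption here (predicted label first, confidence second); check the file BigMLer actually produces before relying on it:

```python
import csv

def read_predictions(path):
    """Read an assumed (label, confidence) layout from a predictions CSV file."""
    with open(path) as handle:
        # Each row: predicted label, then its confidence as a float
        return [(row[0], float(row[1])) for row in csv.reader(handle)]

# predictions = read_predictions("predictions.csv")
# for label, confidence in predictions:
#     print("%s (confidence %.2f)" % (label, confidence))
```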

This is a simplified example of how BigMLer can empower you to easily use multi-labeled fields in your machine learning system. It can split the fields’ contents, generate new binary fields for each existing label and even aggregate the labels’ information with count, first or last functions. BigMLer‘s functionality keeps growing steadily, and new options like weighted models and threshold-limited predictions are ready for you to try, but that will be the subject of another post, so stay tuned!
