Describe yourself in a word? Never again with Multi-label Classification
Who hasn’t suffered the describe-yourself-in-a-word question in job interviews? That’s a really tough one, because it forces you to choose one amongst your obviously many remarkable qualities. That choice will always leave out too much information that could be very relevant for the job requirements.
In Machine Learning, single-label classifications can sometimes impose this kind of blindness. Expecting the property you want to predict to fall into only one category may be too simplistic. Reality is many-faceted, and what scientists try to model as collections of sharp-edged boxes is usually closer to the combination of colours and shapes you would see in a kaleidoscope. Take for instance the problem of predicting the emotions generated by music, or describing the topic of a bit of text. These rarely have a single categorical answer.
That’s what multi-label classification is all about, and now BigMLer can help you handle it nicely. Maybe in your next interview you can ask BigMLer to learn from the job’s requirements which words describe you best!
Multi-label classification as a set of binary classification models
Let’s review how we can solve a multi-label classification problem. Typically, the data available to train your model has an objective field, the property you’d like to predict, with one or more values associated with each training instance. Similarly, when testing your machine learning system, you expect to obtain a set of categories as the prediction for each test input. The simplest mechanism that can fulfill these needs is a set of binary classification models, one per category.
Suppose you have data like the sample in this gist, https://gist.github.com/mmerce/7136559,
and want to predict the class results for a bunch of test inputs. The steps are simple:
- Preparing the input:
  - Analyze the multi-labeled class field and make a set with all the labels found in the training data. That is to say: Adult, Student, Teenager, Worker.
  - Create a new extended source by adding one new field per label in the set to the original fields. The list of fields would then be: color, year, sex, class, class - Adult, class - Student, class - Teenager, class - Worker.
  - Fill the newly created fields with a binary value (let’s say ‘True’ or ‘False’) indicating the presence or absence of the corresponding label in the objective field of the training instance. The first row of the extended file would then be: red,2000,male,"Student,Teenager",False,True,True,False.
- Building the classification system:
  - Build a single-label classification model for each label. The models are built using the features of the original source as inputs and the label field as the new objective field. This will produce a set of models, one per label. Following our example, the first model would use the fields color, year, sex, class - Adult, the second one color, year, sex, class - Student, and so on.
- Issuing the results:
  - Predict from new input data with each one of the models and build the prediction output by combining the labels associated with the models that predicted ‘True’. For instance, if only the first and second models predict ‘True’, our prediction will be Adult,Student.
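To make the first and last steps concrete, here is a minimal Python sketch of the idea. It is only an illustration and not BigMLer’s own code: the function names expand_row and combine_predictions are hypothetical, and the per-label predictions are hard-coded instead of coming from real models.

```python
LABEL_SEPARATOR = ","  # assumed default separator between labels


def expand_row(row, objective, labels, separator=LABEL_SEPARATOR):
    """Return a copy of `row` with one True/False field per label."""
    present = {part.strip() for part in row[objective].split(separator) if part.strip()}
    expanded = dict(row)
    for label in labels:
        # New field named "<objective> - <label>", as in the example above
        expanded[f"{objective} - {label}"] = label in present
    return expanded


def combine_predictions(per_label_predictions, separator=LABEL_SEPARATOR):
    """Join the labels whose binary model predicted True."""
    return separator.join(
        label for label, predicted in per_label_predictions.items() if predicted
    )


labels = ["Adult", "Student", "Teenager", "Worker"]
row = {"color": "red", "year": "2000", "sex": "male", "class": "Student,Teenager"}
expanded = expand_row(row, "class", labels)

# Hard-coded stand-ins for the outputs of the four binary models
prediction = combine_predictions(
    {"Adult": True, "Student": True, "Teenager": False, "Worker": False}
)
print(expanded)
print(prediction)
```

The middle step, training one binary model per new field, is left out on purpose: any single-label classifier could fill that role.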
Now, as you can see, this is not an especially difficult job, but it certainly can be cumbersome to prepare and execute. Not anymore: BigMLer has a new option waiting for you!
The BigMLer way to MLC
BigMLer keeps expanding its abilities to ease the machine learning user’s task. This time, a new --multi-label option has been added to BigMLer’s already long list of features. When this command option is provided, BigMLer will use multi-label classification to generate the requested models and predictions. Just as in the single-label scenario, you only need to provide a CSV-formatted source and BigMLer will do the rest.
Suppose you want to predict the class results for a bunch of test inputs based on the data we used previously as example. Then the magic words would be:
bigmler --multi-label --train multi_label.csv \
        --test multi_label_test.csv
Remember the three groups of tasks mentioned in the last section? Well, issuing this single command will execute them all, and you’ll be left with predictions and their associated confidences stored sequentially in a nice-looking predictions.csv file. Could it be easier?
But let’s dig deeper into the additional options that BigMLer has to offer for multi-label classifications, minding each phase of the process.
Training data for MLC
As already mentioned, the starting point for BigMLer is a CSV file where instance features are stored row-wise, with a multi-labeled objective field. The multiple labels there are stored sequentially, and some special character is used as separator. BigMLer lets you choose the delimiter character to fit your needs. You just have to add the --label-separator option to your command
bigmler --multi-label --train multi_label_tab.csv \
        --label-separator '\t'
and the contents of the objective field will be parsed using the tab as delimiter character (comma is used by default). By parsing all the instances in the file, BigMLer finds the set of labels they contain, so you don’t need to list them. However, you might want to, for example to restrict the labels considered in your predictions. BigMLer helps there too: you can use the --labels option to set the labels that are used in model building and prediction
bigmler --multi-label --train multi_label.csv \
        --labels Student,Adult,Child
As you can see, the list of labels is expected to be a comma-separated list of literals. In this example, only three models will be constructed, regardless of the number of different labels the original CSV file has in its objective field.
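As a rough illustration of what this parsing step involves, the sketch below collects the labels found in an objective field using a configurable separator, and falls back to an explicit list when one is given. The function name labels_to_model and the sample values are hypothetical, and the behaviour when a requested label never appears in the data is an assumption: here the requested list is simply used as given.

```python
def labels_to_model(objective_values, separator=",", requested=None):
    """Return the labels that would each get a binary model.

    If `requested` is given (a comma-separated string of literals), it is
    used as is; otherwise the labels are discovered from the data.
    """
    if requested is not None:
        return [label.strip() for label in requested.split(",") if label.strip()]
    found = set()
    for value in objective_values:
        found.update(part.strip() for part in value.split(separator) if part.strip())
    return sorted(found)


# Objective field contents using a tab as the label separator
tab_values = ["Student\tTeenager", "Adult\tWorker", "Adult"]

discovered = labels_to_model(tab_values, separator="\t")
restricted = labels_to_model(tab_values, separator="\t", requested="Student,Adult,Child")
print(discovered)
print(restricted)
```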
To close the first phase, a new training file will be generated locally on your computer. This time, each row of the file will be expanded by adding a new column per label. BigMLer will upload this file to BigML and generate the corresponding dataset and set of models. These newly generated resources will then be available to you through their ids. We’ll see an example of that in the next section.
Building MLC models and ensembles
The previous BigMLer commands generate a set of models, one per label. The models can be retrieved to make predictions in the same way the single-label models were. We know from previous posts that each invocation of BigMLer generates a set of files in a separate output directory where the ids of the created resources are stored. Model ids are stored in a models file, and you can ask BigMLer to retrieve them by using the --models option pointing to that file. Let’s say we generated a multi-label set of models to test some data.
bigmler --multi-label --train multi_label.csv \
        --test multi_label_test.csv \
        --output my_output_dir/predictions.csv
Then the model ids will be stored in a file placed in my_output_dir/models, and if we want to make new predictions we just have to say
bigmler --multi-label --models my_output_dir/models \
        --test new_tests.csv
We can also use the --tag option in the first command to assign a particular tag to all the generated models. Then we could use
bigmler --multi-label --model-tag my_ml_tag \
        --test new_tests.csv
to retrieve and use them in new predictions.
You might want to improve the quality of your predictions by using one ensemble of models to predict each label. This is also possible with BigMLer. For instance,
bigmler --multi-label \
        --dataset dataset/52659d36035d0737bd00143f \
        --number-of-models 10
will retrieve the existing multi-label dataset that was built previously and use it to build one ensemble of ten models per label. You can customize the model or ensemble parameters as well (please refer to the docs to see the many available options to do so). As in the models example, you can use the --ensemble-tag option to retrieve the set of ensembles and make more predictions with them.
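To picture how one ensemble per label might decide, here is a small Python sketch in which each label is kept only when a strict majority of its ten models votes ‘True’. This is an assumption made for illustration: the way BigML actually combines the predictions of an ensemble may differ, and the function name and vote lists below are hypothetical.

```python
def ensemble_predicts_label(votes):
    """Return True when a strict majority of the ensemble's models vote True."""
    yes = sum(1 for vote in votes if vote)
    return yes * 2 > len(votes)


# Ten hypothetical votes per label, one from each model in that label's ensemble
votes_per_label = {
    "Adult": [True] * 7 + [False] * 3,
    "Student": [True] * 4 + [False] * 6,
    "Child": [True] * 5 + [False] * 5,  # a tie is not a majority
}

predicted_labels = [
    label for label, votes in votes_per_label.items() if ensemble_predicts_label(votes)
]
print(predicted_labels)
```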
MLC predictions’ file formats
As in the single-label case, BigMLer will run every row of the test file given in the --test option through the models (or ensembles) generated for each of the labels. If the prediction of a label’s model is ‘True’, the label is included in the list of predictions for that input and its confidence is added to a list of confidences. Thus, each row of the predictions CSV file stores a subset of labels separated by the --label-separator character, and the corresponding ordered list of confidences. For example, if the predictions for a test input were ‘Adult’ with confidence 0.95 and ‘Child’ with confidence 0.32, the row in the predictions file would hold the labels Adult and Child joined by the separator, followed by the confidences 0.95 and 0.32 in the same order.
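A sketch of how such a row could be assembled in Python is shown here. The exact quoting and column layout of BigMLer’s predictions file are assumptions; the sketch simply joins the accepted labels and their confidences with the separator and lets the standard csv module quote the two resulting fields. The function name prediction_row and the per-label results are hypothetical.

```python
import csv
import io


def prediction_row(per_label, separator=","):
    """Build [labels, confidences] from a mapping label -> (predicted, confidence)."""
    kept = [
        (label, confidence)
        for label, (predicted, confidence) in per_label.items()
        if predicted
    ]
    labels = separator.join(label for label, _ in kept)
    confidences = separator.join(str(confidence) for _, confidence in kept)
    return [labels, confidences]


# Hypothetical outputs of three per-label models for one test input
per_label = {
    "Adult": (True, 0.95),
    "Student": (False, 0.88),
    "Child": (True, 0.32),
}

row = prediction_row(per_label)

# Serialize the row as one CSV line; fields containing commas get quoted
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
line = buffer.getvalue().strip()
print(line)
```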
BigMLer provides additional options to customize this format. First of all, adding --prediction-header to your command will cause the first row of the predictions file to be a headers row. In addition to that, you can change the contents of the rows by using the --prediction-info option. When set to brief, only predictions will be stored; normal is the default option, which produces both predictions and confidences; and full prepends the input test data to the predictions. You can also filter the fields of input data that you want to appear before the prediction by setting --prediction-fields to a comma-separated subset of the fields in the test input file.
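The three levels of detail can be pictured with a short Python sketch. It is only a model of the behaviour described above, not BigMLer’s implementation: the function format_prediction is hypothetical, and it assumes that the field filter only matters when input data is shown, that is, in full mode.

```python
def format_prediction(test_row, labels, confidences, info="normal", fields=None):
    """Return the list of values for one row of the predictions file."""
    if info == "brief":
        return [labels]
    if info == "normal":
        return [labels, confidences]
    if info == "full":
        # Assumption: `fields` selects which input fields precede the prediction
        names = list(test_row) if fields is None else fields
        return [test_row[name] for name in names] + [labels, confidences]
    raise ValueError(f"unknown info level: {info}")


test_row = {"color": "red", "year": "2000", "sex": "male"}

brief = format_prediction(test_row, "Adult,Child", "0.95,0.32", info="brief")
normal = format_prediction(test_row, "Adult,Child", "0.95,0.32")
full = format_prediction(test_row, "Adult,Child", "0.95,0.32", info="full")
filtered = format_prediction(
    test_row, "Adult,Child", "0.95,0.32", info="full", fields=["color", "sex"]
)
print(brief, normal, full, filtered, sep="\n")
```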
We hope that BigMLer and its new set of options will help you easily embrace the multi-label experience, handling the tedious mechanical part for you and letting you enjoy its benefits. So why don’t you give it a try? We’d be glad to hear about your use cases and suggestions for the next release of BigMLer features. Meanwhile, happy multi-labeling!