Who hasn’t suffered the describe-yourself-in-a-word question in job interviews? That’s a really tough one, because it forces you to choose one amongst your obviously many remarkable qualities. That choice will always leave out too much information that could be very relevant for the job requirements.
In Machine Learning, single-label classifications can sometimes impose this kind of blindness. Expecting the property you want to predict to fall into only one category may be too simplistic. Reality is many-faceted, and what scientists try to model as collections of sharp-edged boxes is usually closer to the combination of colours and shapes you would see in a kaleidoscope. Take for instance the problem of predicting the emotions generated by music, or describing the topic of a bit of text. These rarely have a single categorical answer.
That’s what multi-label classification is all about, and now BigMLer can help you handle it nicely. Maybe in your next interview you can ask BigMLer to learn from the job’s requirements which are your best describing words!
Multi-label classification as a set of binary classification models.
Let’s review how we can solve a problem of multi-label classification. Typically, the available data to train your model with has an objective field, the predicate you’d like to predict, with one or more values associated to each training instance. Similarly, when testing your machine learning system, you expect to obtain a set of categories as prediction for each testing input. The simplest mechanism that can fulfill these needs is using a set of binary classification models to do the job.
Fancy you have data that reads like this:
and want to predict the class results for a bunch of test inputs. The steps are simple:
- Preparing the input:
- Analyze the multi-labeled class field and make a set with all the labels found in the training data. That is to say: Adult, Student, Teenager, Worker.
- Create a new extended source adding to the original fields a new one per label in the set. The list of fields would then be: color, year, sex, class, class – Adult, class – Student, class – Teenager, class – Worker.
- Fill the contents of the newly created fields with a binary value (let’s say ‘True’ or ‘False’) indicating the presence or absence of the corresponding label in the objective field of the training instance. The first row of the extended file would then be: red,2000,male,”Student,Teenager”,False,True,True,False.
- Building the classification system:
- Build a single-label classification model per each label. The models are built using the features of the original source as inputs and the label field as new objective field. This will produce a set of models, one per label. Following our example, the first model would use the fields color, year, sex, class – Adult, the second one color, year, sex, class – Student and so on.
- Issuing the results
- Predict from new input data with each one of the models and build the prediction output by combining the labels associated to the models that predicted ‘True’. For instance, if only the first and second model predict ‘True’, our prediction will be Adult,Student.
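The preparation and recombination steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not BigMLer's actual code: the function names, the "class - label" column naming and the second training row are assumptions made up for the example.

```python
# Minimal sketch of multi-label expansion and recombination.
# All function names here are hypothetical, not part of BigMLer.

def collect_labels(rows, objective, separator=","):
    """Return the sorted set of labels found in the objective field."""
    labels = set()
    for row in rows:
        for label in row[objective].split(separator):
            if label.strip():
                labels.add(label.strip())
    return sorted(labels)

def expand_row(row, objective, labels, separator=","):
    """Copy the row and add one True/False field per label."""
    present = {label.strip() for label in row[objective].split(separator)}
    expanded = dict(row)
    for label in labels:
        expanded[f"{objective} - {label}"] = label in present
    return expanded

def combine_predictions(labels, binary_predictions, separator=","):
    """Join the labels whose binary model predicted True."""
    return separator.join(
        label for label, predicted in zip(labels, binary_predictions) if predicted
    )

rows = [
    # First row taken from the example above; the second one is invented.
    {"color": "red", "year": "2000", "sex": "male", "class": "Student,Teenager"},
    {"color": "green", "year": "1990", "sex": "female", "class": "Adult,Worker"},
]
labels = collect_labels(rows, "class")
first = expand_row(rows[0], "class", labels)
print(labels)
print([first[f"class - {label}"] for label in labels])
print(combine_predictions(labels, [True, True, False, False]))
```

Each binary model would then be trained on the original features plus exactly one of the new True/False fields as its objective.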
Now, as you see, this is not an especially difficult job, but it certainly can be cumbersome to prepare and execute. Not anymore: BigMLer has a new option waiting for you!
The BigMLer way to MLC
BigMLer keeps expanding its abilities to ease the machine learning users’ task. This time, a new --multi-label option has been added to BigMLer’s quite long list of features. When this new command option is provided, BigMLer will use multi-label classification to generate the requested models and predictions. Just as you did in a single-label scenario, you just need to provide a CSV-formatted source and BigMLer will do the rest.
Suppose you want to predict the class results for a bunch of test inputs based on the data we used previously as example. Then the magic words would be:
bigmler --multi-label --train multi_label.csv \
        --test multi_label_test.csv
Remember the three groups of tasks mentioned in the last section? Well, issuing this command alone will execute them all, and you’ll be left with predictions and their associated confidence stored sequentially in a nice-looking predictions.csv file. Could it be easier?
But let’s dig deeper into the additional options that BigMLer has to offer for multi-label classifications, minding each phase of the process.
Training data for MLC
As already mentioned, the starting point for BigMLer is a CSV file where instance features are stored row-wise, with a multi-labeled objective field. The multiple labels there are stored sequentially, and some special character is used as separator. BigMLer lets you choose the delimiter character to fit your needs. You just have to add the --label-separator option to your command
bigmler --multi-label --train multi_label_tab.csv \
        --label-separator '\t'
and the contents of the objective field will be parsed using the tab as delimiter character (comma will be used by default). By parsing all the different instances in the file, BigMLer finds the set of labels they contain, so you don’t need to list them. However, you might want to, for example if you want to restrict the number of labels to be considered in your predictions. Again, BigMLer becomes helpful there and you can use a --labels option to set the labels that are used in model building and prediction
bigmler --multi-label --train multi_label.csv \
        --labels Student,Adult,Child
As you see, the list of labels is expected to be a comma-separated list of literals. In this example, only three models will be constructed, regardless of the number of different labels the original CSV file has in its objective field.
To close the first phase, a new training file will be generated locally in your computer. This time, each row of the file will be expanded adding a new column per label. BigMLer will upload this file to BigML, and generate the corresponding dataset and set of models. These newly generated resources will then be available for you by using their id. We’ll see an example of that in the next section.
Building MLC models and ensembles
The previous BigMLer commands generate a set of models, one per label. The models can be retrieved to make predictions in the same way the single-label models were. We know from previous posts that each invocation of BigMLer generates a set of files in a separate output directory where the ids of the created resources are stored. Model ids are stored in a models file, and you can ask BigMLer to retrieve them by using the --models option pointing to that file. Let’s say we generated a multi-label set of models to test some data.
bigmler --multi-label --train multi_label.csv \
        --test multi_label_test.csv \
        --output my_output_dir/predictions.csv
Then the model ids will be stored in a file placed in my_output_dir/models, and if we want to make new predictions we just have to say
bigmler --multi-label --models my_output_dir/models \
        --test new_tests.csv
We can also use the --tag option in the first command to assign a particular tag to all the generated models. Then we could use
bigmler --multi-label --model-tag my_ml_tag \
        --test new_tests.csv
to retrieve and use them in new predictions.
You might want to improve the quality of your predictions by using one ensemble of models to predict each label. This is also possible with BigMLer. For instance,
bigmler --multi-label \
        --dataset dataset/52659d36035d0737bd00143f \
        --number-of-models 10
will retrieve the existing multi-label dataset that was built previously and use it to build one ensemble of ten models per label. You can customize the model or ensemble parameters as well (please refer to the docs to see the many available options to do so). As in the models’ example, you can use the --ensemble-tag option to retrieve the set of ensembles and make more predictions with them.
MLC predictions’ file formats
As in the single-label case, BigMLer will run every row in the test file given in the --test option through the models (or ensembles) generated for each of the labels. If the prediction for a label model is ‘True’, then the label is included in the list of predictions for that input data and its confidence is also added to a list of confidences. Thus, the predictions CSV file will store in one row a subset of labels separated by the --label-separator character, and the corresponding ordered list of confidences. For example, if predictions for a test input were ‘Adult’ with confidence 0.95 and ‘Child’ with confidence 0.32 the predictions’ file row would read
BigMLer provides additional options to customize this format. First of all, adding --prediction-header to your command will cause the first row of the predictions file to be a headers row. In addition to that, you can change the contents of the rows by using the --prediction-info option. When set to brief, only predictions will be stored; normal is the default option that produces both predictions and confidences; and full prepends the input test data to the predictions. You can also filter the fields of input data that you want to appear before the prediction by setting --prediction-fields to a comma-separated subset of the fields in the test input file.
We hope that BigMLer and its new set of options will help you embrace easily the multi-label experience, handling for you the tedious mechanical part and letting you enjoy its benefits. So why don’t you give it a try? We’d be glad to know about your use case and suggestions to build a next release of BigMLer features. Meanwhile, happy multi-labeling!
Most Machine Learning algorithms require data to be in a single text file in tabular format, with each row representing a full instance of the input dataset and each column one of its features. For example, imagine data in normal form separated in a table for users, another for movies, and another for ratings. You can get it in machine-learning-ready format in this way (i.e., joining by userid and movieid and removing ids and names):

"userid","name","age","gender"
1,John Smith,54,male
2,Carol Brew,29,female

"movieid","title","year","genre","director"
1,The shinning,1980,psychological horror,Stanley Kubrick
2,Terminator,1984,science fiction,James Cameron

"userid","movieid","visited","rented","purchased","rating"
1,1,Yes,No,Yes,4
1,2,Yes,Yes,No,5
2,1,Yes,No,No,0
2,2,Yes,Yes,Yes,5

Machine Learning Ready:
"age","gender","year","genre","director","visited","rented","purchased","rating"
54,male,1980,psychological horror,Stanley Kubrick,Yes,No,Yes,4
54,male,1984,science fiction,James Cameron,Yes,Yes,No,5
29,female,1980,psychological horror,Stanley Kubrick,Yes,No,No,0
29,female,1984,science fiction,James Cameron,Yes,Yes,Yes,5
Denormalizing (or “normalizing” data for Machine Learning) is a more or less complex task depending on where the data is stored and where it is obtained from. Often the data you own or have access to is not available in a single file: it may be distributed across different sources like multiple CSV files, spreadsheets or plain text files, or normalized in database tables. So you need a tool to collect, intersect, filter, transform when necessary, and finally export to a single flat CSV text file.
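For data as small as the users, movies and ratings example, the join can be sketched with nothing but Python's standard library. This is a minimal illustration rather than a recommended pipeline; the helper names read_table and denormalize are ours.

```python
import csv
import io

# The three normalized tables from the example, inlined as CSV text.
USERS = """userid,name,age,gender
1,John Smith,54,male
2,Carol Brew,29,female
"""
MOVIES = """movieid,title,year,genre,director
1,The shinning,1980,psychological horror,Stanley Kubrick
2,Terminator,1984,science fiction,James Cameron
"""
RATINGS = """userid,movieid,visited,rented,purchased,rating
1,1,Yes,No,Yes,4
1,2,Yes,Yes,No,5
2,1,Yes,No,No,0
2,2,Yes,Yes,Yes,5
"""

def read_table(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

def denormalize(users, movies, ratings):
    """Join ratings with users and movies, dropping ids and names."""
    users_by_id = {user["userid"]: user for user in users}
    movies_by_id = {movie["movieid"]: movie for movie in movies}
    rows = []
    for rating in ratings:
        user = users_by_id[rating["userid"]]
        movie = movies_by_id[rating["movieid"]]
        rows.append({
            "age": user["age"],
            "gender": user["gender"],
            "year": movie["year"],
            "genre": movie["genre"],
            "director": movie["director"],
            "visited": rating["visited"],
            "rented": rating["rented"],
            "purchased": rating["purchased"],
            "rating": rating["rating"],
        })
    return rows

flat = denormalize(read_table(USERS), read_table(MOVIES), read_table(RATINGS))
for row in flat:
    print(",".join(row.values()))
```

Keeping the lookups in dictionaries works while everything fits in memory; beyond that, a database does the same job more comfortably.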
If your data is small and the changes are simple, such as adding a derived field or making a few substitutions, you can use a spreadsheet, make the necessary changes, and then export it to a CSV file. But when the changes are more complex (e.g., joining several sources, filtering a subset of the data, or managing a large number of rows), you might need a more powerful tool like an RDBMS. MySQL is a great one, and it’s free. If the data size that you are managing is in the terabytes, then (and only then) you should consider Hadoop.
Business inspections in San Francisco
Let’s take a look at an actual example. San Francisco’s Department of Public Health recently published a dataset about restaurants in San Francisco, inspections conducted, violations observed, and a score calculated by a health inspector based on the violations observed.
You can download the data directly from the San Francisco open data website. Recently some statistics using this data were reported in this post (they may be difficult to stomach). Imagine, however, that you want to use the data to predict what violations certain kinds of restaurants commit or, if you’re a restaurant owner, to predict whether you are going to be inspected. The data comes “normalized” in four separate files:
- businesses.csv: a list of restaurants or businesses in the city.
- inspections.csv: inspections of some of the businesses above.
- violations.csv: law violations detected in some of the inspections above.
- ScoreLegend.csv: a legend to describe score ranges.
You will first need to prepare it to be used as input to a Machine Learning service such as BigML.
Analyzing the data
Let’s first have a quick look at the main entities in the data of each file and its relationships. The four files are in CSV format with the following fields:
$ head -3 businesses.csv
"business_id","name","address","city","state","postal_code","latitude","longitude","phone_number"
10,"TIRAMISU KITCHEN","033 BELDEN PL","San Francisco","CA","94104","37.791116","-122.403816",""
12,"KIKKA","250 EMBARCADERO 7/F","San Francisco","CA","94105","37.788613","-122.393894",""

$ head -3 inspections.csv
"business_id","Score","date","type"
10,"98","20121114","routine"
10,"98","20120403","routine"

$ head -3 violations.csv
"business_id","date","description"
10,"20121114","Unclean or degraded floors walls or ceilings [ date violation corrected: ]"
10,"20120403","Unclean or degraded floors walls or ceilings [ date violation corrected: 9/20/2012 ]"

$ head -3 ScoreLegend.csv
"Minimum_Score","Maximum_Score","Description"
0,70,"Poor"
71,85,"Needs Improvement"
There are three main entities: businesses, inspections and violations. The relationships between entities are: a 0..N relationship between businesses and inspections, and a 0..N relationship between inspections and violations. There’s also a file with a description for each range in the score.
To build a machine-learning-ready CSV file containing instances about businesses, their inspections and their respective violations, we’ll follow three basic steps: 1) importing data into MySQL, 2) transforming data using MySQL, and 3) joining and exporting data to a CSV file.
Importing data into MySQL
First, we’ll need to create a new SQL table with the corresponding fields to import the data for each entity above. Instead of defining a type for each field that we import (dates, strings, integers, floats, etc), we simplify the process by using varchar fields for everything except the ids. In this way, we just need to be concerned with the number of fields and their length for each entity. We also create a new table to import the legends for each score range.
create table business_imp (
    business_id int,
    name varchar(1000),
    address varchar(1000),
    city varchar(1000),
    state varchar(100),
    postal_code varchar(100),
    latitude varchar(100),
    longitude varchar(100),
    phone_number varchar(100));

create table inspections_imp (
    business_id int,
    score varchar(10),
    idate varchar(8),
    itype varchar(100));

create table violations_imp (
    business_id int,
    vdate varchar(8),
    description varchar(1000));

create table scorelegends_imp (
    Minimum_Score int,
    Maximum_Score int,
    Description varchar(100));
So now we are ready to import the raw data into each of the new tables. We use the load data infile command to define the format of the source file, the separator, whether a header is present, and the table in which it will be loaded.
load data local infile '~/SF_Restaurants/businesses.csv'
    into table business_imp
    fields terminated by ',' enclosed by '"'
    lines terminated by '\n'
    ignore 1 lines
    (business_id,name,address,city,state,postal_code,latitude,longitude,phone_number);

load data local infile '~/SF_Restaurants/inspections.csv'
    into table inspections_imp
    fields terminated by ',' enclosed by '"'
    lines terminated by '\n'
    ignore 1 lines
    (business_id,score,idate,itype);

load data local infile '~/SF_Restaurants/violations.csv'
    into table violations_imp
    fields terminated by ',' enclosed by '"'
    lines terminated by '\n'
    ignore 1 lines
    (business_id,vdate,description);

load data local infile '~/SF_Restaurants/ScoreLegend.csv'
    into table scorelegends_imp
    fields terminated by ',' enclosed by '"'
    lines terminated by '\n'
    ignore 1 lines
    (Minimum_Score,Maximum_Score,Description);
If the dataset is big (i.e., several thousand rows or more), then it’s important to create indexes as follows:
create index inx_inspections_businessid on inspections_imp (business_id);
create index inx_violations_businessidvdate on violations_imp (business_id, vdate);
Transforming data using MySQL
More often than not, raw data needs to be transformed. For example, numeric codes need to be converted into descriptive labels, different fields need to be joined, and some fields might need a different format. Very often missing values, badly formatted data, or wrongly entered data also need to be fixed, and at other times you might need to create and fill new derived fields.
In our example, we’re going to remove the “[ date violation corrected: ...]” substring from the violation’s description field:
update violations_imp set description = substr(description,1,instr(description,' [ date violation corrected:')-1) where instr(description,' [ date violation corrected:') > 0;
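If you prefer to clean the file before loading it, the same trimming can be sketched in Python. The function name is ours; the marker string is the one used in the update statement above.

```python
# Text that introduces the correction note at the end of a description.
MARKER = " [ date violation corrected:"

def strip_correction_note(description):
    """Drop the trailing '[ date violation corrected: ... ]' note, if any."""
    position = description.find(MARKER)
    # find() returns -1 when the marker is absent; keep the text untouched then.
    return description[:position] if position >= 0 else description

cleaned = strip_correction_note(
    "Unclean or degraded floors walls or ceilings [ date violation corrected: 9/20/2012 ]"
)
print(cleaned)
```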
We are also going to fix a misspelled city name and blank out missing dates:
update business_imp set city = 'San Francisco' where city = 'San Francicso';
update violations_imp set vdate = '' where vdate = 'N/A';
Finally, we are going to add a new derived field “inspection” and fill it with Yes/No values:
alter table business_imp add column inspection varchar(3) after phone_number;
update business_imp set inspection = 'Yes' where business_id in (select business_id from inspections_imp);
update business_imp set inspection = 'No' where business_id not in (select business_id from inspections_imp);
Joining and Exporting MySQL tables to a CSV file
Once the data has been sanitized and reshaped according to our needs, we are ready to generate a CSV file. Before creating it with the denormalized data, we need to make sure that we join the different tables in the right way. Looking at the restaurant data, we can see that some businesses have inspections and others do not. For those with inspections, not all of them have violations. Therefore, the query that we should use is a “left join” to collect all businesses, inspections and violations. Additional transformations, like reformatting or concatenating fields, can be done in this step too. We also need to make sure that we export the data with a descriptive header. The next query will do the export trick:
select "Business name", "Address", "City", "State", "Postal code", "Latitude", "Longitude",
       "Phone number", "Inspection", "Inspection score", "Score type", "Inspection date",
       "Inspection type", "Violation description"
union all
select a.name, a.address, a.city, a.state, a.postal_code, a.latitude, a.longitude,
       a.phone_number, a.inspection, b.score, d.description as scoredesc,
       concat(substr(b.idate,1,4),'-',substr(b.idate,5,2),'-',substr(b.idate,7,2)),
       b.itype, c.description
into outfile 'sf_restaurants.csv'
    fields terminated by ',' optionally enclosed by '"' escaped by '\\'
    lines terminated by '\n'
from business_imp a
left join inspections_imp b on (a.business_id = b.business_id)
left join violations_imp c on (b.business_id=c.business_id and b.idate = c.vdate)
left join scorelegends_imp d on (cast(b.score as unsigned) between d.Minimum_Score and d.Maximum_Score);
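To see why the left join matters, here is a small sketch that runs a cut-down version of the join on toy data. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the three cafes and their rows are invented for the illustration.

```python
import sqlite3

# In-memory database with simplified versions of the import tables.
# The rows are made up: CAFE A has an inspection with a violation,
# CAFE B has a clean inspection, CAFE C was never inspected.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table business_imp (business_id int, name text);
create table inspections_imp (business_id int, score text, idate text);
create table violations_imp (business_id int, vdate text, description text);
insert into business_imp values (1, 'CAFE A'), (2, 'CAFE B'), (3, 'CAFE C');
insert into inspections_imp values (1, '84', '20110516'), (2, '100', '20120403');
insert into violations_imp values (1, '20110516', 'Improper cooling methods');
""")

QUERY = """
select a.name, b.score, c.description
from business_imp a
left join inspections_imp b on (a.business_id = b.business_id)
left join violations_imp c on (b.business_id = c.business_id and b.idate = c.vdate)
order by a.business_id
"""

left_rows = conn.execute(QUERY).fetchall()
# Same query with plain (inner) joins, for comparison.
inner_rows = conn.execute(QUERY.replace("left join", "join")).fetchall()

print(left_rows)
print(inner_rows)
```

With left joins all three businesses survive, with empty values where there is no inspection or violation; with inner joins only the business that has both is kept, which would silently drop exactly the instances we need to predict the “No” class.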
A file named sf_restaurants.csv will be generated with a row per instance in this format:
"Business name","Address","City","State","Postal code","Latitude","Longitude","Phone number","Inspection","Inspection score","Score type","Inspection date","Inspection type","Violation description"
"DINO'S PIZZA DELI","2101 FILLMORE ST ","San Francisco","CA","94115","37.788932","-122.433895","","Yes","84","Needs Improvement","2011-05-16","routine","Improper storage of equipment utensils or linens"
"DINO'S PIZZA DELI","2101 FILLMORE ST ","San Francisco","CA","94115","37.788932","-122.433895","","Yes","84","Needs Improvement","2011-05-16","routine","Inadequately cleaned or sanitized food contact surfaces"
"CHEZ MAMAN","1453 18TH ST ","San Francisco","CA","94107","37.762513","-122.397169","+14155378680","Yes","81","Needs Improvement","2012-05-24","routine","Improper thawing methods"
"CHEZ MAMAN","1453 18TH ST ","San Francisco","CA","94107","37.762513","-122.397169","+14155378680","Yes","81","Needs Improvement","2012-05-24","routine","Inadequate food safety knowledge or lack of certified food safety manager"
"CHEZ MAMAN","1453 18TH ST ","San Francisco","CA","94107","37.762513","-122.397169","+14155378680","Yes","81","Needs Improvement","2012-05-24","routine","Inadequately cleaned or sanitized food contact surfaces"
"CHEZ MAMAN","1453 18TH ST ","San Francisco","CA","94107","37.762513","-122.397169","+14155378680","Yes","81","Needs Improvement","2012-05-24","routine","High risk food holding temperature"
"EL MAJAHUAL RESTAURANT","1142 VALENCIA ST","San Francisco","CA","94110","37.754687","-122.420945","+14155827514","Yes","82","Needs Improvement","2012-12-11","routine","Moderate risk vermin infestation"
"EL MAJAHUAL RESTAURANT","1142 VALENCIA ST","San Francisco","CA","94110","37.754687","-122.420945","+14155827514","Yes","82","Needs Improvement","2012-12-11","routine","Improper cooling methods"
"EL MAJAHUAL RESTAURANT","1142 VALENCIA ST","San Francisco","CA","94110","37.754687","-122.420945","+14155827514","Yes","82","Needs Improvement","2012-12-11","routine","High risk food holding temperature"
"J.B.'S PLACE","1435 17TH ST ","San Francisco","CA","94107","37.765003","-122.398084","","Yes","83","Needs Improvement","2011-09-27","routine","Unclean unmaintained or improperly constructed toilet facilities"
"J.B.'S PLACE","1435 17TH ST ","San Francisco","CA","94107","37.765003","-122.398084","","Yes","83","Needs Improvement","2011-09-27","routine","High risk vermin infestation"
"J.B.'S PLACE","1435 17TH ST ","San Francisco","CA","94107","37.765003","-122.398084","","Yes","83","Needs Improvement","2011-09-27","routine","Moderate risk food holding temperature"
...
Once the data has been exported, you might want to move the file from the MySQL default export folder (usually in the database folder), replace the \N markers that MySQL writes for NULL values with empty strings, and compress the file if it’s too large.
sudo mv /var/mysql/data/sf_restaurants/sf_restaurants.csv /tmp/
sed "s/\\\N//g" /tmp/sf_restaurants.csv > /tmp/sf_restaurants_bigml.csv
bzip2 /tmp/sf_restaurants_bigml.csv
Finally, if you want to get a first quick predictive model, you can upload the file to BigML with BigMLer as follows:
bigmler --train /tmp/sf_restaurants_bigml.csv.bz2 --objective Inspection
Et voilà! A model like this can be yours.
If you haven’t been in a cave for the past several weeks, you may have heard that Twitter has announced plans for an IPO, which reportedly will take place November 15. But Twitter is far from the only IPO that has taken place—and there have been some very notable IPOs recently, including Veeva and Empire State Real Estate Trust this month alone. But leading up to every IPO is speculation on the stock’s Day 1 performance. While Day 1 performance isn’t necessarily an indicator of long-term stock (or company) performance, it is a critical metric for the firms that issue a company’s stock, as well as for institutional and individual investors.
IPOScoop has a service and rating system that aims to predict stocks’ opening day performances. At the core of this is their SCOOP Rating (Wall Street Consensus Of Opening-day Premiums), which ranks stocks from 1 to 5 stars (more details here). Happily for us, IPOScoop provides a comprehensive, historical list of past IPO performances since 2000, and tracks whether they met or exceeded the rating (“Performed”), or if they fell short (“Missed”).
What is the objective of this model?
While many BigML users leverage our platform to analyze stock performance or portfolio blend, we thought it would be interesting to see if there are any underlying factors in IPO data that can be used to assess IPOScoop predictions. You can view and clone the full model here, and the dataset here.
What is the data source?
We used data from IPOScoop’s historical records, which they make available as an Excel download.
What was the modeling strategy?
As the data provided by IPOScoop was already well-structured, very little transformation was required, but we scrubbed the spreadsheet a bit to make sure that the names of the managers / joint managers were consistent (e.g., we unified all iterations of “Credit Suisse”, “Credit Suisse First Boston” and “CSFB”). We then trimmed some of the ancillary information, converted the file to a .csv and uploaded it into BigML. Once we had the dataset uploaded, we made sure to modify our text analysis settings to take “Full Terms Only” for the Company name, as well as for the Lead / Joint-Lead Managers.
To help us remember what the star ratings meant, we used BigML’s new in-line editing function to input the meanings for each rating, which later appear when we mouse over that field on the right-hand side of the tree.
What fields were selected for this model?
As we wanted to gauge accuracy of the SCOOP estimate, we eliminated all fields that could sway the actual outcome. For example, a large 1st Day Change percentage would naturally lead to a Performed rating so we didn’t use that field. In addition, company name shouldn’t be factored into the findings as we’re trying to predict outcomes independent of the actual entity. In the end, we used the following fields: Trade Month/Date/Day of Week (but not Trade Year), Managers, Offer Price, and SCOOP rating.
What did we learn?
Working in the BigML interface, we used the Frequent Interesting Patterns quick filter to narrow the tree down to highly confident predictions with good levels of support (as an aside, this is the first step I personally take after building any model to get a quick sense for whether it will give me interesting results). And to limit our findings further, we decided to only seek predictions with 90%+ confidence of being on or above target—we did this by dialing the confidence slider up further.
A very important clarification: our confidence levels are not predictions on the actual stock performances; they are predictions on whether the SCOOP ratings will be met (“Performed”) or not (“Missed”).
Looking at the tree, we see a quick split at the top based on SCOOP ratings of 1 star (to the right hand side of the tree), and those above one star. This makes sense as to “hit” a prediction the results simply need to meet or exceed the anticipated opening-day result—so a 1-star prediction would be much easier to achieve than a 2+ star prediction. For the purpose of this study, we thought it would be interesting to focus on bolder predictions, so we focus on the left hand side of the tree where we’ll find predictions for stocks with anticipated premiums of at least 2 stars (or $.50 per share).
A few confident nodes jumped out at us:
1) Here we see that we have a 90.59% confidence rating in a Performed prediction for a stock with a star rating of 3 or better and that is issued in December with an offer price above $12.76:
2) Further down the tree, we see that a 3+ rated stock issued after the 3rd day of any month between May and November, on any day other than Monday, with an offer price exceeding $17.75 and that was not managed by Goldman Sachs has a 94.25% likelihood of being an accurate prediction.
3) As we move even further down the left hand side of the tree we can see another interesting prediction, but this one is easier to summarize using the Sunburst view. As you see below, this prediction tells us that a Performed rating for a 3+ star stock offered between $12.77 and $17.225 that is traded on a Tuesday-Friday before the 18th of any month between April and October and was not managed by one of several firms listed, has a 94.65% confidence of being accurate.
Evaluating our model
To assess the strength of our model, we built a 10-model ensemble, and evaluated it against a 20% test set. The results were actually quite strong, as you can see below:
So what about that Twitter IPO?
Final details of Twitter’s IPO have yet to be released, but a recent report states the IPO will take place on November 15, with Morgan Stanley & JP Morgan Chase as managers, and with a fair value of $20.62. If we use $20.62 as the offer price and assume a 3-star SCOOP rating, we can build a prediction resulting in a 78.47% confidence that the SCOOP estimate will be accurate:
Finally, a word of caution
Needless to say, this blog post isn’t meant to serve as investment advice—we’re simply assessing the likely accuracy of SCOOP ratings, based on a pre-existing dataset. In fact, it’s always good to reflect on a model’s findings and give it your own sanity check. For example, what may the explanation be for the three nodes that we highlight above? In the first instance (the 90.59% confidence prediction), it’s quite possible that this is influenced by the time of year—with major investment funds trying to shore up end of year numbers, which would result in more buyers and an elevated premium. The other two findings? They could just be luck, or perhaps there are other underlying factors that lead to accurate high-performance predictions for mid-year IPOs—added data and added study are always helpful for learning more.
BigML uses decision trees to find patterns in data that are useful for prediction, and an “ensemble” of multiple trees is a great way to improve these predictions. The process for creating an ensemble is simple: instead of training a single decision tree, you train a bunch of them, each on a randomly sampled subset of your data. Each tree in the bunch then makes its own prediction, and majority vote wins. For example, an ensemble of 10 trees might vote 8 to 2 that a customer is likely to churn, so “churn” wins as the prediction for that customer. (Think of it like Congress, but without the filibuster or Hastert Rule.)
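The voting step is simple enough to sketch in a few lines. This is only an illustration of plain majority voting, not BigML's implementation, and the function name is ours.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common prediction and its share of the votes."""
    counts = Counter(predictions)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(predictions)

# Ten trees: eight say "churn", two say "stay".
tree_predictions = ["churn"] * 8 + ["stay"] * 2
winner, share = majority_vote(tree_predictions)
print(winner, share)
```

On a tie, most_common keeps whichever class was seen first, so a real system would want an explicit tie-breaking rule.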
The data for each tree is sampled with replacement, so some data points will appear more than once—and some not at all. For example, if we start with 100 customers in our original data set, then pull from this group 100 times with replacement, the resulting new data set will contain (on average) only 63 of the original customers, and (on average) 37 will not be selected at all. Each time we repeat this process, we get a freshly random batch of 63 people, along with a freshly random group of 37 people who are not selected. (Again, those numbers are averages that will vary from sample to sample.)
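Where do 63 and 37 come from? Each customer is missed by a single draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e, about 0.37, as n grows. The sketch below checks this against a quick simulation; the function names, seed and number of trials are arbitrary choices of ours.

```python
import random

def bootstrap_sample(n, rng):
    """Draw n indices from range(n) with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def expected_unique_fraction(n):
    """Expected share of distinct points in a bootstrap sample of size n."""
    return 1 - (1 - 1 / n) ** n

rng = random.Random(42)  # fixed seed so the run is repeatable
n = 100
trials = 2000

# Average number of distinct customers over many bootstrap samples.
avg_unique = sum(len(set(bootstrap_sample(n, rng))) for _ in range(trials)) / trials

print(round(expected_unique_fraction(n), 3))
print(round(avg_unique / n, 3))
```

Both numbers should land close to 0.63 for n = 100.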
This has the benefit of reducing the impact of outliers. Suppose my group of 100 customers contains a single Joker, an outlier that is utterly useless for learning. If I train only a single tree on the entire 100-customer data set, then the Joker can mess up my predictions. But if I use the trick above to create 10 different decision tree models, the Joker now has to clear two hurdles: first, he has to be one of the 63 customers selected for each sample, and second, he has to mess up enough trees to meaningfully alter the majority vote.
This technique—where we sample n times with replacement from an original collection of n data points—is called a bootstrap sample. (If we sample 50 times with replacement from an original data set of 100, that’s very nice, but it ain’t a bootstrap sample.) When we use multiple bootstrap samples to train an ensemble of trees, it’s called bagging, short for “bootstrap aggregating”. BigML uses bagging (and a related approach called random decision forests) to train its ensembles.
Creating all of these bootstrap samples might sound hard, but there’s actually a ridiculously easy shortcut. Instead of keeping a list of all users in memory so we can pick them out of the metaphorical hat, we just go through the list one by one in a single pass. Starting with user 1, we roll dice to decide how many times (if any) he appears in the bootstrap sample. We do the same thing for user 2, etc., until we have the sample size we want. In 99.99% of cases, a user will show up 6 or fewer times, so we only need to consider seven probabilities: the odds of showing up 0, 1, 2, 3, 4, 5, and 6 times. Even better, these numbers are easy to compute: the probability of showing up exactly x times is simply 1/(e * x!). This formula assumes a data set that is infinitely large, but fortunately these percentages are useful for data sets as small as 100 instances.
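So without further ado, here are the seven magic numbers for creating a bootstrap sample in just one pass (computed from 1/(e * x!) and rounded to four decimal places):
Times picked    Probability
0               0.3679
1               0.3679
2               0.1839
3               0.0613
4               0.0153
5               0.0031
6               0.0005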
Some interesting things about this table:
- As mentioned above, the probability of a data point showing up at least once is (1 – 0.368) = 0.632.
- The odds of not getting picked are the same as the odds of getting picked exactly once. In both cases, the probability is 0.368, or 1/e.
- These probabilities sum up to more than 99.99%, so we really do get away with ignoring cases where a data point is picked more than 6 times.
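The one-pass trick described above can be sketched in a few lines of Python. Treat this as an illustration under our own assumptions rather than BigML's implementation: the names `PROBS`, `copies` and `one_pass_bootstrap` are made up, the rare "more than 6 times" case is simply lumped in with 6, and the sketch walks the whole list instead of stopping at an exact sample size, so the result is only approximately as large as the input.

```python
import math
import random

# Probability of a data point appearing exactly x times: 1 / (e * x!), for x = 0..6.
PROBS = [1 / (math.e * math.factorial(x)) for x in range(7)]

def copies(rng):
    """Roll the dice once: how many times does this data point appear in the sample?"""
    r = rng.random()
    cumulative = 0.0
    for count, p in enumerate(PROBS):
        cumulative += p
        if r < cumulative:
            return count
    return 6  # the leftover sliver (more than 6 times) is lumped in with 6

def one_pass_bootstrap(items, rng):
    """Build an approximate bootstrap sample in a single pass over items."""
    sample = []
    for item in items:
        sample.extend([item] * copies(rng))
    return sample

rng = random.Random(42)
users = list(range(10_000))
sample = one_pass_bootstrap(users, rng)
print(len(sample), len(set(sample)) / len(users))
```

Because each user is handled independently, nothing but the current user needs to be held in memory, which is what makes the approach attractive for streaming data.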
(Huge thanks to BigML’s Adam Ashenfelter for this awesome blog post about one-pass sampling with replacement.)
The idea of real time machine learning seems simple enough: build a system that immediately learns from the freshest data, and it will always give you the best, most up-to-date predictions. In practice, however, learning and predicting are two distinct steps that happen in sequence. First we use an algorithm to find patterns in our data; these patterns are a “model”, and we’ve “trained” the model on our data. Next we take a new data point and look it up in the model to get a prediction; this lookup step is called “scoring”.
For example, I used BigML to train a model on the famous Iris data set, with the goal of using a flower’s measurements to predict its species. The resulting decision tree model organizes flowers into groups, which are easy to see in this SunBurst visualization:
There’s one group of flowers with petal width more than 1.65 cm and petal length more than 5.05 cm, and in that group the concentration of Iris virginica is very high (actually 100% in this dataset). So if I’m out in a field near the Gulf Coast and find a purple flower with these measurements, I’m pretty confident that I know its species. If I can’t remember the species name—perhaps I’m just starting out as a botanist—then I can just look it up in the decision tree model, shown here as a collection of nested if/then statements:
IF petal_length <= 2.45 THEN species = Iris-setosa
IF petal_length > 2.45 AND
    IF petal_width > 1.65 AND
        IF petal_length <= 5.05 AND
            IF sepal_width > 2.9 AND
                IF sepal_length <= 6.4 AND
                    IF sepal_length > 5.95 THEN species = Iris-virginica
                    IF sepal_length <= 5.95 THEN species = Iris-versicolor
                IF sepal_length > 6.4 THEN species = Iris-versicolor
            IF sepal_width <= 2.9 THEN species = Iris-virginica
        IF petal_length > 5.05 THEN species = Iris-virginica
    IF petal_width <= 1.65 AND
        IF petal_length <= 4.95 THEN species = Iris-versicolor
        IF petal_length > 4.95 AND
            IF sepal_length > 6.05 THEN species = Iris-virginica
            IF sepal_length <= 6.05 AND
                IF petal_width <= 1.55 THEN species = Iris-virginica
                IF petal_width > 1.55 THEN species = Iris-versicolor
When I do this lookup, I’m “scoring” the new flower I just found, guessing its species using a model that was “trained” on previous flowers whose species was known. This lookup is very fast, because I only care about the handful of rules that match the one new flower I’m examining. If I used BigML to export this model as code, a computer would do the same fast lookup for the new flower, using the same small number of rules. This is one of the nice features of decision tree models: because they’re just big nested if/then statements, they look up predictions for new data points really fast. In other words, they “score” very quickly.
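To make the "export as code" idea concrete, here is a hand-written Python sketch of the rules listed above. It is not BigML's actual export: the function name `predict_species` is ours, and the nesting is our reading of the if/then rules, so treat it as an illustration of how little work scoring involves.

```python
def predict_species(petal_length, petal_width, sepal_length, sepal_width):
    """Score one flower by walking the nested if/then rules (measurements in cm)."""
    if petal_length <= 2.45:
        return "Iris-setosa"
    if petal_width > 1.65:
        if petal_length > 5.05:
            return "Iris-virginica"
        if sepal_width <= 2.9:
            return "Iris-virginica"
        if sepal_length > 6.4:
            return "Iris-versicolor"
        if sepal_length > 5.95:
            return "Iris-virginica"
        return "Iris-versicolor"
    # From here on, petal_width <= 1.65.
    if petal_length <= 4.95:
        return "Iris-versicolor"
    if sepal_length > 6.05:
        return "Iris-virginica"
    if petal_width <= 1.55:
        return "Iris-virginica"
    return "Iris-versicolor"

# A flower with wide, long petals falls into the high-confidence virginica group.
print(predict_species(petal_length=6.0, petal_width=2.5, sepal_length=6.3, sepal_width=3.3))
```

Scoring a flower takes at most six comparisons, no matter how many flowers the model was trained on.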
Which brings us back to the meaning of “real time”. The Iris example shows that you can score quickly without having to train in real time. And in practice, it’s often overkill to train a model in real time: there’s no reason to think an entire field of purple flowers will suddenly have shorter petals from one minute to the next, and likewise we don’t expect sudden big changes in a database of movie ratings or credit card transactions. In some cases, like text recognition, the patterns we’re trying to learn hardly change at all over time.
BigML provides four ways to do scoring: by using our web UI, by using our API, by exporting the model as code with a single click, or by using our new high performance PredictServer for large scale applications. In cases where a model actually needs frequent retraining, BigML does that too, using parallel computation and streaming data to train models in seconds or minutes, even on large datasets.
So the next time someone waxes poetic about real time machine learning, make yourself look really smart by asking if they mean real time training or real time scoring. But be careful: they might offer you a job, or maybe even jump off a bridge.
The issuing of visas for highly skilled workers has been a topic of debate recently in the United States. The H-1B is a non-immigrant visa under the Immigration and Nationality Act that allows U.S. employers to temporarily employ foreign workers in specialty occupations. Many employers (especially technology employers, most notably Mark Zuckerberg, who is backing a lobbying group called FWD.us) want to make it easier for skilled workers to come to the US because they are eager, available and, some would say, generally willing to work for less money.
Recently, our friends over at chart.io did some cool data visualizations working off of an H-1B dataset that they pulled from enigma.io. Frankly, chart.io beat us to the punch a bit, as we’ve been holding off on publishing this study until our new text processing feature went into production, which it did last week. In case you’re not familiar with enigma.io, they’re a great company based in New York with a subscription service that lets users search across a vast array of data sources through their web app and provides access to data streams via an API.
What is the objective of the model?
With such a rich dataset we anticipated that taking a multivariate view of the data may uncover some interesting correlations. More specifically, we wanted to see if these correlations could be used to predict the wages of visa recipients. In addition, we thought it would be interesting to see if there are some correlations between job attributes and job location.
What is the data source?
As mentioned, we used enigma.io to get this data. enigma.io sourced the data from the Office of Foreign Labor Certification. We pulled a dataset that featured information on every foreign worker application from 2012: 168,717 rows in total, with 37 fields of data. These fields covered information ranging from the visa application itself (when was it submitted? what is the status?) to the visa application’s employer (company name, location) and final parameters about the job itself (occupation classification, title, wage).
What was the modeling strategy?
To narrow the focus of our project, we trimmed the dataset a bit so that it only included approved workers with annual wages (no part-time or contract workers). We also eliminated fields that wouldn’t be relevant to models (e.g., case numbers) or were redundant (there were multiple fields for wage ranges that were largely the same). After our data pre-processing, we still had 151,670 data instances, with 21 fields per row–this was about a 30MB .csv.
What fields were selected for this model?
To focus on wage data, we selected the following fields: status of application, visa class (H-1B, E-3 Australian or a few others), employer name/city/state, occupational classification (a high-level description, e.g., “Computer Programmer” or “Aerospace Engineer”), job title, worker city/state, wage, and employment start month.
Later, to focus on identifying the worker’s state, we selected the following fields: status, visa class, occupational classification, job title, worker state, wage, and employment start month.
It’s important to note that city, employer name, occupational classification and job title were all fields that leveraged our new text analysis functionality–which is configurable so that you can optionally:
1. tokenize full terms (e.g., “New York” instead of “New” and “York”)
2. stem words (i.e., reduce words to a common root so that variants such as “engineer” and “engineers” count as the same term)
3. remove stop words (e.g., “of,” “the,” etc.)
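For readers who haven't met these three options before, here is a toy Python sketch of what each one does to a text field. Everything in it is an assumption made for illustration: the stop word list is tiny, the suffix-stripping "stemmer" is deliberately naive, and none of the names correspond to BigML's actual text analysis code.

```python
STOP_WORDS = {"of", "the", "and", "a", "in"}  # tiny illustrative list
SUFFIXES = ("ing", "s")                        # toy stemmer, not a real stemming algorithm

def stem(word):
    """Strip one known suffix, as long as a root of at least 3 letters remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def terms(text, full_terms=False, stem_words=True, remove_stop_words=True):
    """Turn one text field value into the list of terms a model would see."""
    text = text.lower().strip()
    if full_terms:
        return [text]  # keep "new york" as a single term
    words = text.split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    if stem_words:
        words = [stem(w) for w in words]
    return words

print(terms("New York", full_terms=True))
print(terms("Director of Engineering"))
print(terms("Software Engineers"))
```

With stemming and stop word removal switched on, "Director of Engineering" and "Software Engineers" end up sharing a term, which is exactly what lets a tree split on job titles that are worded differently.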
What did we find?
Just looking through the dataset histograms was actually pretty informative, even before we built a model. A few things jumped out:
In mousing over the “Wage” histogram, we quickly see that most people make less than $90,100.
In using the new “tag cloud” summary associated with every text field, we can see that Infosys was the most frequent employer of foreign worker visa recipients, followed by Wipro, Microsoft and several others:
And we can also see from the “Job Title” word cloud that most of these workers are filling technically oriented positions:
After we built our “wage” model from this dataset, we found that Occupational Classification and Job Title were the two most important fields—it’s a good thing that we now support text analysis!:
Interacting with the tree was informative in and of itself. To narrow our focus further, we decided to use the filters in the interface to find workers who made between $100K-$200K per year, and we also capped the expected error at $50K:
One finding here was that a computer software worker for Facebook whose title does not contain “senior”, “architect” or “principal” had an annual salary of around $103K in 2012 (no word on their stock options, though!):
We also wanted to look at lower-waged foreign workers. When we moved the filter to focus on people making less than $50K in annual wages, we found this interesting node of a Deloitte & Touche employee with an accounting-related job (with ‘senior’ in his/her title) who had a salary of around $47K:
These were just two of many interesting findings. We also ran some predictions against this model and found that terms such as “engineering” for the occupation field led to higher predicted wages across states.
But on to our second model, which was to see if we could predict a worker’s state based on other data. As mentioned, we only used fields that wouldn’t give away the state (so we left out city name, employer name, etc.). This was a bit trickier when using our standard trees, as there weren’t many confident predictions that jumped out at us. But this is where the Sunburst View came to our rescue: we found interesting, confident predictions that software engineers making around $72K would reside in Washington State:
And then another, predicting with near 99% confidence that a computer analyst making around $52,000 would be working in Texas:
Working through this foreign worker data gave us a much better feel for the underlying trends and information–and text analysis was the key for getting greater levels of insight.
The 2012 US presidential elections saw Democrat Barack Obama win re-election, soundly defeating his Republican opponent Mitt Romney. “Soundly” of course, refers to the Electoral College results, where the President outpaced Romney 332-206. But the popular vote was proportionally much closer, with Obama taking 51%, to Romney’s 47%—demonstrating that while the President was favored by the majority of voters, America is still a nation very much divided by party lines.
What is the objective of this model?
We thought it would be interesting to look at the results at a more granular level based on different county-specific economic and demographic data like population density, age, unemployment rate, per capita income, home value, education level and more.
What is the data source?
We pulled elections data from The Guardian, and the demographics data from Esri, hosted in the Azure Datamarket (as you may recall, BigML has provided users with a widget that makes it easy to browse and import data from the Azure Marketplace). The elections dataset had to be polished to define the most voted candidate in every county, and the information was joined with the demographic dataset using a relational database. The data was combined into a single .csv for BigML ingestion.
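The "pick the winner per county, then join with demographics" step can be sketched with Python's built-in sqlite3 module. This is not the database or schema we actually used: the table names, column names and the four vote rows below are made-up toy values, there only to show the shape of the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE votes (county_id TEXT, candidate TEXT, votes INTEGER);
    CREATE TABLE demographics (county_id TEXT PRIMARY KEY, median_age REAL);
""")
# Made-up toy rows, not real election results.
conn.executemany(
    "INSERT INTO votes VALUES (?, ?, ?)",
    [("A", "Obama", 1200), ("A", "Romney", 3400),
     ("B", "Obama", 9100), ("B", "Romney", 4700)],
)
conn.executemany(
    "INSERT INTO demographics VALUES (?, ?)",
    [("A", 40.1), ("B", 33.5)],
)

# One row per county: its demographics plus the candidate with the most votes.
query = """
    SELECT d.county_id, d.median_age, v.candidate AS winner
    FROM demographics AS d
    JOIN votes AS v ON v.county_id = d.county_id
    WHERE v.votes = (SELECT MAX(v2.votes) FROM votes AS v2
                     WHERE v2.county_id = d.county_id)
    ORDER BY d.county_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Note that a county with an exact tie would come back twice with this query, so a real pipeline would need a tie-breaking rule before exporting to .csv.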
What was the modeling strategy?
While many (if not most) models built with BigML are done for predictive purposes, BigML is also very effective in analysis of historical data to uncover relationships in the data, and to see how these relationships have influenced a net result. In this model, for example, we wanted to see how demographic and economic information influenced a candidate’s popularity in American counties. Plus, we think that guys like Nate Silver are already doing a pretty fine job of predicting election outcomes.
What fields were selected for this model?
While the .csv file that we generated had many fields, some fields have inter-dependencies and thus should not be selected together in the same model (e.g., party winner and candidate winner); however, these fields can always be used to build alternative models. Fields selected for this model were: total population, median age, % bachelors degree or higher, unemployment rate, per capita income, total households, average household size, % owner occupied housing, % renter occupied housing, % vacant housing, median home value, population growth 2010 to 2015 annual, household growth 2010 to 2015 annual and per capita income growth 2010 to 2015. The objective field in the model is the winner or most voted candidate in the county (Barack Obama / Mitt Romney). In total there were 3114 instances (which is the number of counties in the United States).
What did we find?
The field with the greatest importance was Median home value (30.71%).
The top node of the tree was % Renter occupied housing, and you can observe below that Mitt Romney was the most voted in 2,428 counties, a whopping 76.48% of the total number of counties. Mr. Obama won the election, of course, so the disparity between the county-by-county vote and the actual popular vote is likely explained by the fact that Romney was the most voted in counties with small populations, while Obama mainly won highly populated counties with larger cities.
Looking closer at the tree, we see that in 83.23% of the counties where % Renter occupied housing was below the 27.05% threshold (meaning home ownership is more common), the most voted candidate was Mitt Romney. At the second level of the tree, the left branch shows that counties with a lower rate of citizens holding at least a bachelor’s degree also voted for Romney (85.69% confidence). Both the housing trend and the education trend support the idea that rural Americans were more likely to vote for Mr. Romney.
Furthermore, in the right branch, counties with high Median home value are for Obama (85.95%), clearly another trait of urban voters. In fact, virtually every split further down the tree supported this rural vs. urban trend.
To test and validate these relationships against actual data, we built a map based on the BigML dataset in CartoDB. And looking at this representation, we see that Obama won votes in the most populous counties, which were enough to take the state electoral votes, and ultimately to reclaim the Presidency.
“Well, duh,” you may say, pointing out that urban voters leaned left while rural voters leaned right: this has been a well-known trend in American politics since the early 1980s. But what we found most interesting about this model was the ability to get a finer-grained understanding of what voter traits had the *greatest* impact on likely voting outcome (beyond just where someone lives), and also the associated confidence levels. Give it a shot, and let us know what you think!
Peer to peer lending has become more popular recently, with services like Kickstarter and Indiegogo serving as de facto investment vehicles, and services like LendingClub enabling individuals to serve the role of private lenders to people with credit problems or who may otherwise not be eligible for traditional loans. Kiva has a similar peer-to-peer focus, but has a charitable approach: all loans are interest-free and are primarily granted to individuals from emerging countries for the purpose of bettering themselves through business or education investments.
What is the objective of this model?
With a 99% repayment rate to date, the vast majority of Kiva loans are paid back in full. Nonetheless, we wanted to see how we can predict the outliers: the small fraction of loan recipients who were unable to repay the loans. You can view and clone the model here.
What is the data source?
We pulled the Kiva data from build.kiva.org, which is a Kiva-sponsored site targeted at enabling developers to get a granular understanding of Kiva data. To pull the data into BigML, we used a data snapshot, a .zip file that included 1,122 JSON/XML files ranging from 1-3MB in size each. We downloaded the snapshot and developed a Python script to process all of the JSON files and build a single CSV file with the fields we thought would be most relevant for our modeling objectives.
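We haven't reproduced our script here, but the general shape of a JSON-to-CSV flattening step looks something like the sketch below. The field names in `FIELDS`, the top-level "loans" key and the function name `loans_to_csv` are all hypothetical; check them against the actual snapshot before relying on any of it.

```python
import csv
import json

# Hypothetical field names; the real snapshot's keys may differ.
FIELDS = ["country", "loan_amount", "funded_amount", "sector", "status"]

def loans_to_csv(json_paths, csv_path, fields=FIELDS):
    """Flatten the loans found in several JSON files into a single CSV file."""
    rows_written = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fields)
        writer.writeheader()
        for path in json_paths:
            with open(path, encoding="utf-8") as f:
                data = json.load(f)
            # Assumption: each file keeps its records under a top-level "loans" list.
            for loan in data.get("loans", []):
                # Keep only the chosen fields; missing values become empty cells.
                writer.writerow({name: loan.get(name, "") for name in fields})
                rows_written += 1
    return rows_written
```

Calling `loans_to_csv(sorted(glob.glob("snapshot/*.json")), "kiva.csv")` (with `import glob` and a hypothetical `snapshot/` folder) would then produce one CSV ready for upload.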
What was the modeling strategy?
Because of the way we accessed the data, it was already somewhat processed and optimized for BigML ingestion. However, we still had 29 fields to pick and choose from. Since our objective was to identify trends for the minority of Kiva borrowers that default, it was important to deselect certain fields that would likely skew the trees and results, such as “paid date” (as that wouldn’t be relevant for unpaid loans), and also the field for “delinquent” (as that would have a disproportionate match with unpaid loans). We also deselected redundant fields (e.g., we chose “country” but not “country code”). Last but not least, the BigML heuristics automatically deselected two text fields as we do not *yet* have text processing support (but stay tuned!).
What fields were selected for this model?
Fields selected for this model were: country, loan amount (total size of loan request), funded amount (the actual amount loaned), sector (one of 15 industries), and funding date (year, month, date and day of week). In total there were 500K+ instances.
What did we find?
The top node of the tree and the field with the greatest importance was funding year (76%). This is probably due to the fact that loans made more recently are still in repayment mode–and perhaps also due to improvement in Kiva’s processes as the organization has become more mature.
Based on all of these selected fields, we found that the loan with the highest confidence of default would have been a loan made in Afghanistan in the first quarter of 2011:
We looked at an iteration of the model without funding year (which would give you a better feel for which loans are most likely to be defaulted on), and the result was a pretty flat tree–but which again pointed to Afghanistan as the riskiest country in which to issue a Kiva loan when we filtered by “defaulted”.
Again, it is important to emphasize that with a 99% repayment rate, Kiva loans are still *very* likely to be repaid in full. To validate this point, we ran a prediction against an Ensemble of ten models with the original data points other than country, for a $1,500 loan. The prediction was over 95% for “paid”, and bear in mind that the second most common option is “in progress.”
We encourage you to clone the dataset to your own account, and start running your own models and predictions. In addition to evaluating repayment status, you can change the objective field to predict sector or amount repaid or time to repayment. And visit www.kiva.org to see how you can get involved to directly help aspiring individuals in emerging communities around the world.
Stay tuned for an update on this study after we’ve released our advanced text processing and other new features into production–the text descriptions of the loan purposes add a very interesting variable!
As those of you who have emailed us in the past know, we at BigML are passionate about supporting our users in their machine learning efforts, and we’re also eager to learn more about what features and capabilities you’d like to see us roll into BigML in the future. In this vein, we’re happy to share two activities with you: 1) our semi-annual user survey, and 2) our first webinar.
The user survey will be open through the end of September, and provides you with a chance to give feedback on your experience with BigML, and also make suggestions on what else we can do to help you moving forward. It’s very short (only 12 questions) and anyone who completes the survey will receive a free one-month Standard subscription.
On September 25 at 9AM PDT we will be holding a webinar which will use a customer churn use case to walk attendees through the full range of BigML’s capabilities – including several exciting new features that will be announced and released that day. This webinar is great for new and experienced BigML users alike, and will be the first of several webinars that we’ll be holding in the months to come. We also plan to have informal Google hangouts where users can bring questions to our development team and also get in-depth showcases of new features and capabilities. Space is limited for the webinar, so be sure to register soon!