People like to tweet about stocks, so much so that ticker symbols get their own special dollar sign like $AAPL or $FB. What if you could mine this data for insight into public sentiment about these stocks? Even better, what if you could use this data to predict activity in the stock market? That’s the premise behind PsychSignal, a provider of “real time financial sentiment”. They harvest large streams of data from Twitter and other sources, then compute real time sentiment scores (one “bullish” and one “bearish”) on a scale from 0 to 4. In a blog post titled Can the Bloomberg Terminal be “Toppled”?, a former Managing Director of Bloomberg Ventures asks an intriguing question: Could this kind of crowdsourced data be used to replace some of the functionality of a Bloomberg terminal?
So just for fun, we combined Quandl price and volume data with PsychSignal sentiment scores for 20 technology stocks. We then trained a simple model to predict whether the percentage “swing” (intraday high minus low) is higher or lower than the median, using only data available before the commencement of the trading day. Looking at the SunBurst view, we see a lot of bright green, which means the model is picking up some interesting correlations. For example, if the previous day’s close is down more than 3% from the previous day’s open, the opening price is less than the previous day’s close, and the previous day’s bearish signal is more than 0.84, then the model strongly predicts a price swing higher than the median (shown as a category named “2nd” in the screenshot below).
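The feature engineering described above is easy to reproduce locally. Here is a minimal sketch in pandas, using hypothetical column names and made-up prices, of how the "swing" target and the prior-day features might be computed:

```python
import pandas as pd

# Hypothetical daily price frame; column names and values are illustrative.
df = pd.DataFrame({
    "open":  [100.0, 97.0, 98.5, 101.0],
    "high":  [102.0, 99.5, 103.0, 104.0],
    "low":   [ 99.0, 95.0, 97.5, 100.5],
    "close": [ 97.5, 98.0, 102.0, 103.5],
})

# Intraday "swing" as a percentage of the open.
df["swing_pct"] = (df["high"] - df["low"]) / df["open"] * 100

# Binary target: is today's swing above the median swing?
df["swing_above_median"] = df["swing_pct"] > df["swing_pct"].median()

# Features available before the trading day starts: yesterday's values.
df["prev_close_vs_prev_open_pct"] = (
    (df["close"].shift(1) - df["open"].shift(1)) / df["open"].shift(1) * 100
)
df["open_below_prev_close"] = df["open"] < df["close"].shift(1)
```

Joining columns like these with the PsychSignal bullish/bearish scores would give a training table in the shape the model described above expects.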
Evaluating the model on a single holdout set shows that it does much better than random guessing (see below). To be extra thorough, I used the BigML API to run 5-fold cross validation (not shown), confirming that average accuracy really is more than 64%.
Interestingly, if I try to predict whether a stock simply went up or down, the model is barely better than flipping a coin; for whatever reason, it’s easier to predict a tech stock’s intraday volatility than its daily gain or loss. Still, the accuracy of the “swing” model is impressive—just look at all that green in the SunBurst view. And you can bet your greek symbols that options traders are interested in predicting volatility.
Company founder James Crane-Baker puts this all in perspective: “Social media is such a rich vein of data about investor sentiment, it would be surprising if it didn’t contain useful information.” And this data is available in real time, so you could try building a model to make predictions using same-day data. Then maybe you’ll really start seeing some green!
We’ve all heard about case interview questions where the victim (sorry, applicant) is asked to explain why manhole covers are round, or estimate how many piano tuning experts there are in the world. Well, here’s a new one for you: How many incidents of graffiti have been logged in San Francisco since July 2008? The answer: more than 162,000. That’s almost 80 incidents per day for more than 5 years, and those are just the ones that got called in to 311; presumably a great deal of street art went unappreciated.
Thankfully, the City has painstakingly recorded a wealth of detail for each graffito, including whether it is “offensive” or “not offensive”. Interestingly, about 61% of incidents are deemed “not offensive”, although this is the home of the Folsom Street Fair (NSFW) so we assume folks grade on a curve.
As a bonus, the data is already geocoded, so we can easily train a model to find parts of the city likely to have (un)offensive graffiti. The result is here, with 56% of offensive graffiti contained in only 27% of the dataset, a rectangle that stretches from the south end of Market Street to Ocean Beach (to the west) and the Marina (to the north). Within this rectangle, public property is defaced by offensive orange dots while private property is, er, enhanced with relatively unoffensive blue dots. (Click the map to see the full Tableau goodness.)
I’m still pondering why this particular rectangle, and specifically the public property within this rectangle, is such a magnet for offensive graffiti. Is public property just more vulnerable to the orange stuff? But then why didn’t the model find a heavy mix of orange around Van Ness and Civic Center (never mind the Tenderloin)? Is this rectangle more sparsely populated than downtown, making people more likely to indulge in the extra-bad behavior of writing not just graffiti, but offensive graffiti? I don’t like either of these explanations; perhaps San Francisco 311 has some insight.
What I do know is that the SunBurst is full of green, which always makes me happy, and evaluation on a single holdout set shows more than 80% accuracy, which is much better than guessing. The confusion matrix tells us that the model’s main weakness is false negatives, i.e. offensive graffiti mislabeled as not offensive.
This is another great example of combining BigML and Tableau to create a compelling visualization: BigML finds a geographic pattern simply by analyzing latitude and longitude as numbers, and Tableau displays this pattern using its (really cool) built-in maps. And in typical Tableau fashion, the finding just leaps off the page.
When it comes to learning about Machine Learning APIs from experts, webinars are a great place to start. However, there’s nothing like an in-person workshop. Today, BigML is announcing three exceptional workshops on Predictive Applications taking place in Spain in May. I will have the honour of speaking at these workshops alongside Francisco and jao, the CEO and CTO of BigML. Here are the three dates:
- Bellaterra (Barcelona): May 7 at 12pm, IIIA (Artificial Intelligence Research Institute) at the Campus of the Universitat Autònoma de Barcelona.
- Madrid: May 8 at 5pm, Wayra/Telefonica.
- Valencia: May 13 at 4pm, Universitat Politècnica de València.
If you’re anywhere near these cities at that time, you should definitely come! Click the links to check out the posters above and get your invite!
Some more details
The aim of these workshops is to give you a quick and practical view of why and how to embrace predictive applications.
I will start with an introduction to Machine Learning, its possibilities, its limitations, and I will provide some advice on how to apply Machine Learning to your domain. I have the privilege of being the one to introduce BigML, but I will also review some other Prediction APIs. Then, Francisco will provide an introduction to BigML’s REST API. Finally, jao’s part will be the most hands-on as he will show step-by-step how to build a predictive application in 30 minutes.
We’re also thrilled to be joined in Madrid by Enrique Dans, who authors one of the most read technology blogs in Spanish and who also blogs in English, and by Richard Benjamins, who is Group Director BI & Big Data Internal Exploitation at Telefonica. In Valencia, we’ll be hosted by Professor Vicent Botti who leads the Artificial Intelligence Group at the Universitat Politècnica de València.
I am happy to announce to the readers of the BigML blog that these workshops will coincide with the launch of my new book on Prediction APIs, on May 7. I will cover some of its content in my talks, but not only! You can already get a flavour of what’s coming by downloading the first chapters and the table of contents for free.
Don’t forget to send an email to firstname.lastname@example.org to request an invite and don’t hesitate to say hi if you’re coming! I can be found on Twitter: @louisdorard. See you soon in Spain!
Whatever your preferred working environment, we want to make BigML available to you. That’s why we’re happy to announce a new set of BigML API bindings for MATLAB. These bindings, in the form of a MATLAB class, expose all the functions of BigML’s RESTful API.
Some of you may be aware that MATLAB includes a classification and regression tree implementation. So what makes BigML different? Here are some points that stand out in our mind:
- MATLAB presents its trees as static line drawings, as seen in the screenshot below of a classification tree trained on the iris dataset. In contrast, BigML’s models are fully interactive. You can view a model built with BigML here. We think that our model interface makes it easy to follow decision paths and tease out data patterns at a glance.
- Users of MATLAB will admit that it’s not the most efficient computational platform, and that heavyweight scripts will quickly consume all your workstation’s resources. BigML on the other hand, is a cloud-based service. You can train your models in the background, leaving your machine free to perform other number crunching tasks.
- One last important difference is that in order to use MATLAB’s trees, your MATLAB installation needs to include the statistics toolbox. Our API bindings will work with just a base MATLAB install.
Keep your eyes on our blog to see an application to showcase MATLAB and BigML working together. In the meantime, grab the API bindings here, and start playing. We can’t wait to see what neat things you can accomplish!
Very often, datasets are imbalanced: the number of instances for each of the classes in the target variable that you want to predict is not proportional to the real importance of each class in your problem. Usually, the class of interest is not the majority class. Imagine a dataset containing clickstream data that you want to use to create a predictive advertising application. The number of instances of users who did not click on an ad would probably be much higher than the number of click-through instances. So when you build a statistical machine-learning model of an imbalanced dataset, the majority (i.e., most prevalent) class will outweigh the minority classes. These datasets usually lead you to build predictive models with suboptimal classification performance. This problem is known as the class-imbalance problem and occurs in a multitude of domains (fraud prevention, intrusion detection, churn prediction, etc.). In this post, we’ll see how you can deal with imbalanced datasets by configuring your models or ensembles to use weights via BigML’s web interface. You can read how to create weighted models using BigML’s API here and via BigML’s command line here.
A simple solution for coping with imbalanced datasets is re-sampling: undersampling the majority class or oversampling the minority classes. In BigML, you can easily implement re-sampling by using multi-datasets and sampling each class differently. However, basic undersampling usually throws away instances that might turn out to be informative, and basic oversampling does not add any extra information to your model.
Another way, which doesn’t dismiss any information and works closer to the root of the problem, is to use weights: that is, weighting instances according to the importance that they have in your problem. This enables things like telecom customer churn models where each customer is weighted according to their Lifetime Value. Let’s next see what the impact of weighting a model might be, and also examine the options for using weights in BigML.
The Impact of Weighting
Let me illustrate the impact of weighting on model creation by means of two sunburst visualizations for models of the Forest Covertype dataset. This dataset has 581,012 instances that belong to 7 different classes distributed as follows:
- Lodgepole Pine: 283,301 instances
- Spruce-Fir: 211,840 instances
- Ponderosa Pine: 35,754 instances
- Krummholz: 20,510 instances
- Douglas-fir: 17,367 instances
- Aspen: 9,493 instances
- Cottonwood-Willow: 2,747 instances
The first sunburst below corresponds to a single (512-node) model colored by prediction that I created without using any weighting. The second one corresponds to a single (512-node) weighted model created using BigML’s new balance objective option (more on it below). In both sunbursts, red corresponds to the Cottonwood-Willow class and light green to the Aspen class. In the first sunburst, you can see that the model hardly ever predicts those classes. However, in the sunburst of the weighted model you can see that there are many more red and light green nodes that will predict those classes.
So as you can see, weighting makes the model aware of under-represented classes whose predictions would otherwise be overshadowed by the over-represented values in the input data.
Weighting Models in BigML
BigML gives you three ways to apply weights to your dataset:
- Using one of the fields in your dataset as a weight field;
- Specifying a weight for each class in the objective field; or
- Automatically balancing all the classes in the objective field.
Using the first option, BigML will use one of the fields in your dataset as a weight for each instance. If your dataset does not have an explicit weight field, you can add one using BigML’s new dataset transformations. Any numeric field with no negative or missing values is valid as a weight field. Each instance will be weighted individually according to the value of its weight field. This method is valid for both classification and regression models.
The second method for adding weights only applies to classification models. A set of objective weights may be defined, one weight per objective class. Each instance will be weighted according to its class weight. If a class is not listed in the objective weights, it is assumed to have a weight of 1. Weights of value zero are valid as long as there are some other positive-valued weights. If every weight ends up being zero (this can happen, for instance, if sampling the dataset produces only instances of classes with zero weight), then the resulting model will have a single node with a nil output.
The third method is a convenience shortcut for specifying classification objective weights that are inversely proportional to the category counts. This gives you an easy way to make sure that all the classes in your dataset are evenly represented.
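As a rough illustration, the three options map to model-creation arguments in BigML’s REST API. The parameter names below (weight_field, objective_weights, balance_objective) are assumptions based on BigML’s API documentation, so verify them against the current reference before relying on them:

```python
# Sketch of the three weighting options as model-creation arguments for
# BigML's API. Parameter names are assumed from BigML's REST docs.

# Option 1: use an existing numeric field as a per-instance weight.
weight_field_args = {"weight_field": "lifetime_value"}

# Option 2: one weight per objective class; unlisted classes default to 1.
objective_weights_args = {
    "objective_weights": [["Cottonwood-Willow", 10], ["Aspen", 5]]
}

# Option 3: weights inversely proportional to each class's count.
balance_args = {"balance_objective": True}

# With the bigml Python bindings, these would be passed as, for example:
# from bigml.api import BigML
# api = BigML()
# model = api.create_model(dataset, balance_args)
```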
Finally, bear in mind that when you use weights to create a model its performance can be significantly impacted. In classification models, you’re actually trading off precision and recall of the classes involved. So it’s very important to pay attention not only to the plain performance measures returned by evaluations but also to the corresponding misclassification costs. That is, if you know the cost of a false positive and the cost of a false negative in your problem, then you will want to weigh each class to minimize the overall misclassification costs when you build your model.
BigML is very excited to be collaborating with Tableau Software as a newly minted Tableau Technology Partner. This is a natural fit as both companies share similar philosophies on making analytics accessible (and enjoyable) to more people through incorporation of intuitive and easy-to-use tools. Not surprisingly, there’s a lot of overlap between our users: our latest survey showed that over 50% of BigML’s users also use Tableau. And of course, we’re eager to introduce Tableau’s customers to the predictive power of BigML.
So what does this mean for BigML and Tableau users? First and foremost, it means that BigML will be working on ways to enable easier usage across both tools. For starters, we have just launched a new feature that lets you export a BigML model directly to Tableau as a calculated field, as is demonstrated in the video below:
This approach unleashes the power of Tableau for prediction, letting users interact with a BigML model just like any other Tableau field. In the video, for example, we color a bar chart by predicted profit, which yields valuable insights into Tableau’s “superstore” retail dataset.
Moving forward, we’ll be looking at other ways to combine the powerful analytic and visual capabilities of Tableau and BigML to enable joint customers to do amazing things.
We’d love to hear from any BigML users who are also Tableau customers so we can get added direction on this collaboration and provide you with early access to future implementations. If you’re willing and interested, please email us at email@example.com!
Just over a year ago BigML hit a milestone when the 10,000th predictive model was created using our platform. At the time, we were very happy to see that our idea of building a service that helped people easily build predictive applications had started to get some traction. One year and several thousand users later we are super excited that BigML has now supported the creation of over 1,000,000 predictive models—400,000 of which were created in just the last two months! We are extremely happy to see a growing community of fellow data practitioners and developers using BigML across a number of domains and use cases: predictive advertising, predictive lead scoring, customer churn prevention, security, fraud detection, etc.
And we’re just getting started! We have a roadmap of exciting features that will be coming out just around the corner… Stay tuned!
When trying to make data-driven decisions, we’re often faced with datasets that contain many more features than we actually need for decision-making. As a consequence, building models becomes more computationally demanding, and even worse, model performance can suffer, as heavily parameterized models can lead to overfitting. How, then, can we pare down our data to the most relevant set of features? In this post, we will discuss a solution proposed by Stanford scientists Ron Kohavi (now partner-level architect at Microsoft) and George H. John (now CEO at Rocket Fuel) in this article, in fact one of the top 300 most-cited articles according to CiteSeerX.
The Basic Idea
Let’s consider a simple case where the data are composed of three features. We can denote each possible feature subset as a three digit binary string, where the value of each digit denotes whether the corresponding feature is included in the subset. For example, the string ‘101’ represents the subset which includes the first and third features from the original set of three. Our goal therefore, is to find which subset gives the best model performance. For any set of n features, there exist 2^n different subsets, so clearly an exhaustive search is out of the question for any data containing more than a handful of features.
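The bitstring encoding can be sketched in a few lines of Python:

```python
from itertools import product

features = ["f1", "f2", "f3"]

# Every subset of n features corresponds to an n-digit binary string:
# '101' means "include f1 and f3, exclude f2".
subsets = {}
for bits in product("01", repeat=len(features)):
    key = "".join(bits)
    subsets[key] = [f for f, b in zip(features, bits) if b == "1"]

print(len(subsets))    # 2^3 = 8 subsets
print(subsets["101"])  # ['f1', 'f3']
```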
The Advanced Idea
Now, let’s say each of these feature subsets is a node in a state space, and we form connections between those nodes that differ by only a single digit, i.e. the addition or deletion of a single feature. Our challenge therefore is to explore this state space in a manner which will quickly arrive at the most desirable feature subset. To accomplish this, we will employ the Best-First Search algorithm. To direct the search algorithm, we need a method of scoring each node in the state space. We will define the score as the mean accuracy after n-fold cross validation of models trained with the feature subset, minus a penalty of 0.1% per feature. This penalty is introduced to mitigate overfitting. As an added bonus, it also favors models which are quicker to build and evaluate. The above figure illustrates how a single node is evaluated using 3-fold cross validation. Lastly, we need to decide where in the state space to begin the search. Most of the branching will occur in the nodes encountered earlier in the search, so we will begin at the subset containing zero features, so that less time is spent building and evaluating models to explore these early nodes.
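The search procedure above can be sketched as follows. This is a minimal, self-contained version: toy_score stands in for the real scorer (cross-validated model accuracy minus the 0.1%-per-feature penalty), and the patience-based stopping rule is an assumption, since the paper’s exact stopping criterion is not reproduced here.

```python
import heapq
from itertools import count

def best_first_search(n_features, score, patience=5):
    """Best-first search over feature subsets (frozensets of indices)."""
    tie = count()                       # tiebreaker for equal scores
    start = frozenset()                 # begin at the empty feature subset
    best_subset, best_score = start, score(start)
    open_heap = [(-best_score, next(tie), start)]   # max-heap via negation
    visited = {start}
    stale = 0                           # expansions without improvement

    while open_heap and stale < patience:
        neg, _, subset = heapq.heappop(open_heap)
        if -neg > best_score:
            best_subset, best_score, stale = subset, -neg, 0
        else:
            stale += 1
        # Neighbors differ by the addition or deletion of a single feature.
        for i in range(n_features):
            child = subset ^ {i}        # toggle feature i
            if child not in visited:
                visited.add(child)
                heapq.heappush(open_heap, (-score(child), next(tie), child))
    return best_subset, best_score

# Toy score: features 0 and 2 are informative; 0.1% penalty per feature.
def toy_score(subset):
    return len(subset & {0, 2}) / 2 - 0.001 * len(subset)

best, best_acc = best_first_search(4, toy_score)
print(best)  # frozenset({0, 2})
```

In the real setting, score would build and cross-validate a BigML model for each subset, which is why starting from the empty subset (cheap models first) pays off.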
Testing the Idea
We’ll try this approach using the credit-screening dataset from the UC Irvine Machine Learning Repository, which contains 15 categorical and numerical features. Building a model using all the features, we get an average accuracy of 81.74% from five-fold cross validation. Not bad, but maybe we can do better… BigML’s Python bindings make it simple and straightforward to implement the search algorithm described above. You can find such an implementation, which uses scikit-learn to create k-fold cross validation data sources, here. After running the feature subset search, we arrive at a feature subset containing just 3 features, which achieves an average cross-validation accuracy of 86.37%. Moreover, we arrived at this result after evaluating only 117 different feature subsets, a far cry from the 32,768 that would be required for an exhaustive search.
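The k-fold preparation step can be sketched with scikit-learn; the data and file names below are illustrative stand-ins, and each written CSV would then be uploaded to BigML as a separate source:

```python
import csv
from sklearn.model_selection import KFold

# Toy stand-in for the credit-screening rows: last column is the class.
rows = [[i, i % 3, "+" if i % 2 else "-"] for i in range(10)]

kf = KFold(n_splits=5, shuffle=True, random_state=42)
folds = []
for k, (train_idx, test_idx) in enumerate(kf.split(rows)):
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    folds.append((train, test))
    # Each split could be written out and uploaded as a BigML source.
    with open(f"fold_{k}_train.csv", "w", newline="") as f:
        csv.writer(f).writerows(train)

print([len(tr) for tr, te in folds])  # [8, 8, 8, 8, 8]
```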
Navigating the web nowadays, we constantly paddle through clouds of tags and other classification or taxonomy labels. We’ve grown so used to it that now we hardly realize their presence, but they are everywhere. They flow around the main stream of contents, pervade posts (such as this one), articles, personal pages, catalogs and all major content repositories. To store this kind of content, one naturally drifts to non-normalized fields or documental databases, where they end up stacked in a multi-occurrence field. From a Machine Learning point of view, they become a multi-labeled feature.
Machine Learning has well-known methods to cope with these multi-labeled features. On this blog, we devoted a previous post to predicting multi-labeled categories using BigMLer, the command-line utility for BigML. Now we present the next step: using the labels of multi-labeled fields as predictors.
Multi-labels in professional profiles: Recruiting toy example
In the spirit of the aforementioned post, let’s have a look at the typical profile page on your favourite recruiting web site. At first sight, we detect some contents that might well be stored in multi-labeled fields: the companies we’ve worked for, the associations we belong to, the languages we speak, the positions we’ve held and our skills. Should you face the task of predicting which profile would be more suitable for a certain position, you would probably want to use this information. For example, in technical positions, people who understand English will probably perform better in their jobs. But maybe this is not so true for the chemical or pharmaceutical sector, where German has traditionally been a dominant language of communication. The number of spoken languages could also be a determining factor. Who knows what other relations are hidden in your multi-labeled features!
OK, so there seems to be a bunch of valuable information stored in multi-labeled fields, but it looks like it must be reshaped to be useful in your machine learning system. Each label per se is a new feature you can use as input for your prediction rules. Even aggregation functions over the labels, like last, can be useful as new input. Now the good news: BigMLer can do this for you!
Multi-label fields as multiple predictors
Let’s build an example based on our recruiting site scenario. Suppose we make up a sample with some of the features available in users’ profile pages, such as the name, age, gender, marital status, number of certifications, number of recommendations, number of courses, titles, languages and skills. This is an excerpt of the training data:
The last three fields above are filled with colon-separated multiple values. For each person, they contain:
- Titles: the titles of the posts she has occupied
- Languages: the languages she speaks
- Skills: the skills she has
As our goal is to produce a quick and simple proof of concept, the values for the Titles field are restricted to just four categories: Student, Engineer, Manager and CEO. The skills have also been chosen from a list of popular skills. Looking at the data we have, the first question that arises is: could we build a model that predicts whether a new profile would fit a certain position? Well, probably some skills are a must in CEOs’ profiles, while others are still missing when you are a Student. Our data implicitly contains all of this information, but we would need some work to rearrange it in order to build a predictive model from it. For example, we would need to build a new field for each of the available skills and populate it with a True value if the profile has that skill and False otherwise. What if you could let BigMLer take care of these uninteresting details for you? Guess what: you can!
See the next BigMLer command? The --multi-label-fields option tells BigMLer that the contents of the listed fields (here Titles, Languages and Skills) are series of labels. Using --label-separator you set the colon as the labels’ delimiter and… ta da! BigMLer generates an extended file where each field declared as multi-labeled is transformed into a series of new binary fields, one per distinct label. The extended file is used to build the corresponding source, dataset and model objects that you need to make a prediction. The value to predict is the Titles field, which has been set as our objective using the --objective option. So, in just one line and with no coding at all, you have been able to add each and every label of your multi-labeled fields as an independent new feature, available to be used as a predictor or objective in your models.
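A hedged sketch of what such a BigMLer invocation might look like; the exact flag syntax and file name are illustrative and should be checked against bigmler --help:

```shell
bigmler --train profiles.csv \
        --multi-label \
        --multi-label-fields Titles,Languages,Skills \
        --label-separator ":" \
        --objective Titles
```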
Still, you might be missing additional features, such as the number of languages or the last occupied position. BigMLer has also added a new --label-aggregate option that can help you with that. You can use count to create a new field holding the number of occurrences of labels in a multi-label field, and first or last to generate a new field with the first or last label found in the multi-label fields. In our example, new Titles - count, Languages - count, Skills - count, Titles - last, Languages - last and Skills - last fields will be added to our original source data and used in model building.
Selecting significant input data
We have just built an extended file and generated the BigML resources needed to create a trained model. Nevertheless, we focused mainly on showing the advantages that BigMLer offers for building new features from multi-labeled fields, disregarding the convenience of including or excluding some of them as inputs for our model. In our example, the Name field should be excluded from the model-building input fields, as it is an external reference that has nothing to do with the Titles values (we don’t expect all CEOs to be named Francisco). Also, once the new label fields are included, we may prefer to exclude the original multi-labeled fields from the model input fields to ensure that the prediction rules are based only on the separate labels. For the Titles objective field, we would also like to ignore the generated aggregation fields to avoid useless prediction rules, like saying that if the last occupied post in a profile is Manager, then the profile is suitable to be a Manager. In addition to that, we exclude the Marital Status field because we don’t want it to influence our analysis. This is how we chose to build our final model: the --model-fields option is set to a comma-separated list of all the fields that you would like to exclude as input for your model, each prefixed by a - sign. Then, having a look at the generated models, one for each position, some rules appear. You can see the entire process in the next video:
According to our data, Students have a limited number of skills such as web programming but lack others such as Business Intelligence (found in Managers’ profiles), Algorithm Design (frequently found in Engineers’ profiles) or Software Engineering Management (appearing in CEOs’ profiles). Similar patterns can be found in the Engineers’ model, the Managers’ model and the CEOs’ model, so when you come across new profiles, you can use these models (tagged with the multi-label-recruiting text) to predict the positions they would be suitable for by calling
where new_profiles.csv would contain the information of the new profiles in CSV format. The predictions.csv file generated by BigMLer would store the predicted values for each new profile and their confidences separated by comma, all ready to use.
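The BigMLer call referenced here might look roughly like the following; the flag names are assumptions based on BigMLer’s documented options, with the models retrieved via the tag mentioned above:

```shell
bigmler --test new_profiles.csv \
        --model-tag multi-label-recruiting \
        --multi-label \
        --label-separator ":"
```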
This is a simplified example of how BigMLer can empower you to easily use multi-labeled fields in your machine learning system. It can split the fields’ contents, generate new binary fields for each existing label and even aggregate the labels’ information with count, first or last functions. BigMLer’s functionality keeps growing steadily, and new options like weighted models and threshold-limited predictions are ready for you to try, but that will be the subject of another post, so stay tuned!
One of the great things about sports these days is the number of bright minds taking fresh looks at sports statistics and analytics. This has been particularly true in baseball, where there’s a healthy debate between baseball Traditionalists (who favor subjective scouting reports and place value on traditional statistics like Home Runs, Wins, etc.) and “Sabermetricians” (who favor advanced statistics—many of which were created by Nate Silver), with more and more credence being lent to analytics-driven decisions for roster composition, player salaries and the like. The Sabermetrics approach to baseball is well-documented in “Moneyball” and throughout the baseball blogosphere.
Football (the American variety) also has a burgeoning statistical community, and Football Outsiders is one of the leaders in this arena. At the core of their approach is their proprietary Defense-adjusted Value Over Average (DVOA) system that breaks down every single NFL play and compares a team’s performance to a league baseline based on situation in order to determine value over average—which they explain in detail here. Another example of cool football analytics in action is this report by Lock Analytics on how to use Random Forests to estimate a game’s win probability on a play by play basis.
What was the objective of the model?
With the Super Bowl being right around the corner, we thought it would be interesting to work with some of Football Outsiders’ historical data plus key football gambling metrics to see if we could come up with predictions for the big game.
What is the data source?
We used a few sources to create a dataset for this project. First and foremost, we pulled statistical data from Football Outsiders. Then, to get some added context on the Super Bowl for some of the betting metrics, we pulled historical scores, point spreads and over/under totals from Vegas Insider.
What was the modeling strategy?
As stated, we combined Football Outsiders’ DVOA-driven analysis with historical betting statistics, with the objectives being to predict the NFC (Seattle Seahawks) and AFC (Denver Broncos) total points. However, since there were only relevant statistics from the past 25 seasons we had a pretty small dataset. As such, we decided to build a 100-model ensemble in order to gain a stronger predictive value.
What fields were selected for the ensembles?
We used AFC & NFC Efficiency—both weighted (factoring in late-season performance) and non-weighted; AFC & NFC Offensive & Defensive ranks (weighted & non-weighted), AFC & NFC Special Teams ranks, the point spread, and the over/under total. We also included historical points scored for both the AFC & NFC team from past Super Bowls. You can view and clone the full dataset that we built here, although you’ll have to deselect some of the fields if you want to replicate this specific ensemble.
What did we find?
Using the Prediction function within the BigML interface, we entered the key values for this season to predict both AFC and NFC points (pictured below).
And the winner is…
As you can see below, the result of our prediction is NFC 26.38 (with error ±4.95), AFC 22.16 (with error ±2.20). This reflects a prediction using confidence weighting for the ensemble. If we use a plurality-based prediction the result is NFC 29.93 (±13.74), AFC 23.97 (±6.0).
Incidentally, we also ran a categorical ensemble to simply predict the winner (NFC/AFC), and that ensemble also pointed to a Seahawks’ victory, with ~76% confidence. Another ensemble predicting the winner against the spread showed Seattle with a ~72% confidence.
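For reference, here is a sketch of how the two vote combiners might be requested through BigML’s Python bindings. The numeric combiner codes (0 for plurality, 1 for confidence weighted) are assumptions from BigML’s REST documentation and should be verified, and the input fields shown are illustrative:

```python
# Hypothetical input values for the ensemble; field names are illustrative.
input_data = {"NFC Efficiency": 18.5, "Point Spread": -2.5}

# Assumed combiner codes: 0 = plurality, 1 = confidence weighted.
plurality_args = {"combiner": 0}
confidence_args = {"combiner": 1}

# With the bigml Python bindings this would look roughly like:
# from bigml.api import BigML
# api = BigML()
# prediction = api.create_prediction(ensemble, input_data, confidence_args)
```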
But before you Seattle denizens start planning a victory parade through the Emerald City, please bear in mind a few things:
- These ensembles were built from small datasets so treat the results with a grain of salt. In other words, don’t wager your mortgage on 24 rows of data.
- Factoring in the expected error (which reflects our 95% confidence threshold), Denver could still win.
- This blog was authored by a lifelong Seahawks fan.
Last but not least, do not forget the greatest football variable of all, which a wise data analyst recently described to me as the “Any Given Sunday” coefficient…