Machine Learning Throwdown, Part 3 – Models

[Image: Several regular kids in basketball uniforms and one really tall kid. Source: http://www.break.com/pictures/too-tall-to-play-ball-2083905]

Pop quiz: among the kids in the picture, which ones are able to jump high enough to reach the hoop to dunk a basketball?

You can’t know for sure, but you can make an educated guess that the tall kid wearing the #21 jersey can dunk while the others cannot. Your brain has a model of how the world works that allows you to look at a person and make a prediction about their ability to dunk. You may not always be correct, but your mental model allows you to make much more accurate predictions than you could by flipping a coin.

Machine learning algorithms work by detecting patterns in data and creating a model that summarizes important properties. Given a list of people, their physical characteristics, and whether or not they can dunk, these algorithms can create a model that predicts whether a new person can dunk based on characteristics such as height and age.
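
As a concrete (if toy) illustration, here is roughly what that looks like with scikit-learn’s decision tree; the data below is invented for the example:

    # Toy sketch: learn "can dunk" from height and age (invented data).
    from sklearn.tree import DecisionTreeClassifier

    # Each row is [height_cm, age]; each label says whether that person can dunk.
    X = [[150, 12], [165, 14], [185, 16], [200, 17], [195, 75], [198, 20]]
    y = [False, False, True, True, False, True]

    model = DecisionTreeClassifier().fit(X, y)

    # An educated guess about a new person the model has never seen, e.g. [ True].
    print(model.predict([[190, 18]]))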

There is a wide variety of machine learning models, and each one has its own strengths and weaknesses. In the second post of the series, I talked about getting started with a few different services and importing your data. Now it’s time to look at the models supported by each of these services. But first, a quick note about the datasets I mentioned in the last post: a couple of them originally had a unique string field (e.g., the name of a community). I had to remove these fields because some of the services were unable to handle them.

The Services

BigML

BigML provides one type of model, a decision tree. Perhaps the most interesting property of decision trees is that they are completely white box. This means that you can look inside the model to understand what it learned from your data and how it uses that information to make predictions about unseen data. Black box models, on the other hand, provide little insight into the structure of your data and which properties are important for making predictions.

Continuing with the basketball example, both white box and black box models can help if all you want to do is predict whether a particular person can dunk. However, a white box model such as BigML’s decision trees can provide even more information. For example, it may be able to tell you that very tall people can usually dunk, but only if they are less than 80 years old.
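
To make the white box idea concrete, here is a sketch of that kind of rule inspection using scikit-learn (not BigML’s actual implementation, and the data is invented):

    # Sketch: print the human-readable rules a decision tree has learned.
    from sklearn.tree import DecisionTreeClassifier, export_text

    X = [[150, 12], [165, 14], [185, 16], [200, 17], [195, 75], [198, 20]]
    y = [False, False, True, True, False, True]

    model = DecisionTreeClassifier().fit(X, y)
    print(export_text(model, feature_names=['height', 'age']))
    # Prints nested rules along the lines of:
    #   height <= 175            -> cannot dunk
    #   height >  175, age <= 47 -> can dunk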

Pros:

  • White box models; can be explored on the website and downloaded for offline use
  • Can be created on the website without writing any code
  • Optimized automatically; no need to tweak parameters
  • Fast model creation
  • Does not crash or time out on unique string fields that some services/algorithms are unable to handle

Cons:

  • Only supports traditional supervised machine learning (one output field predicted by the other fields)
  • Only supports one type of model
  • Does not report cross-validation scores indicating how well it makes predictions
  • Does not make use of its text data type (arbitrary strings such as an email subject) when learning a model
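
If you would rather script that workflow than use the website, BigML also has Python bindings. Here is a rough sketch (assuming the bigml package is installed, BIGML_USERNAME and BIGML_API_KEY are set in your environment, and using field names from the dunking example):

    # Sketch: build a BigML decision tree and make one prediction.
    from bigml.api import BigML

    api = BigML()                                 # credentials read from the environment

    source = api.create_source('dunk_data.csv')  # upload the raw CSV
    dataset = api.create_dataset(source)         # turn the source into a dataset
    model = api.create_model(dataset)            # build the decision tree

    prediction = api.create_prediction(model, {'height': 190, 'age': 18})
    api.pprint(prediction)                       # e.g. can_dunk for height=190: True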

Google Prediction API

Unique among the cloud-based offerings, Google Prediction API can learn models from arbitrary text data. This is an important feature in many applications such as spam detection and sentiment analysis. For example, a dataset of spam and non-spam email could be used to create a model that predicts whether a new email is spam based on its subject. Doing this with the other services is possible but requires some non-trivial preprocessing before importing your data.
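
To give a feel for that preprocessing, here is a rough sketch that turns a text column into bag-of-words counts the other services can digest (the file and column names are made up):

    # Sketch: convert an email-subject column into numeric word-count columns.
    import csv
    from collections import Counter

    with open('emails.csv') as f:
        rows = list(csv.DictReader(f))        # columns: subject, is_spam

    # Build a vocabulary from the 100 most common words across all subjects.
    counts = Counter(w for r in rows for w in r['subject'].lower().split())
    vocab = [w for w, _ in counts.most_common(100)]

    with open('emails_bow.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(vocab + ['is_spam'])  # one numeric column per word
        for r in rows:
            words = r['subject'].lower().split()
            writer.writerow([words.count(w) for w in vocab] + [r['is_spam']])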

Pros:

  • Optimized automatically; no need to tweak parameters
  • Makes use of arbitrary string data (e.g., an email subject)
  • Reports basic cross-validation scores indicating how well it makes predictions

Cons:

  • Only supports traditional supervised machine learning (one output field predicted by the other fields)
  • Black box models provide no insight into your data and cannot be downloaded for offline use
  • Very slow model creation
  • Poor support for missing data
  • Must write code to create a model (Edit: you can create a model using their APIs Explorer, intended for developers, but you will need to read the API documentation to understand how to use it)

Prior Knowledge

The other cloud-based offerings only support traditional supervised machine learning, where all but one field are used to predict the value of the remaining field. The nonparametric Bayesian models created by Prior Knowledge’s Veritable API are significantly more flexible: given values for a subset of fields, these models can predict the values of all of the remaining fields. In addition to making predictions, Prior Knowledge’s models provide additional operations that allow you to explore similarities between different rows and columns in your dataset. Bayesian nonparametrics has been a hot topic in the machine learning literature in recent years, and the team at Prior Knowledge is working hard to make this framework available to everyone.
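
The Veritable client is no longer available (see the note at the end of this post), so treat the following as an illustrative reconstruction of the workflow rather than exact API calls; all names and signatures below are approximations:

    # Illustrative sketch of Veritable-style multi-field prediction;
    # names below are approximate, not a working client.
    import veritable

    api = veritable.connect()                    # API key read from the environment

    table = api.create_table()
    table.batch_upload_rows([
        {'_id': '1', 'height': 165, 'age': 14, 'can_dunk': False},
        {'_id': '2', 'height': 198, 'age': 20, 'can_dunk': True},
        # ... more rows ...
    ])

    analysis = table.create_analysis({
        'height':   {'type': 'continuous'},
        'age':      {'type': 'continuous'},
        'can_dunk': {'type': 'boolean'},
    })
    analysis.wait()                              # model fitting happens server-side

    # Fix any subset of fields; the model fills in all of the rest.
    print(analysis.predict({'height': 190, 'age': None, 'can_dunk': None}))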

Pros:

  • Optimized automatically; no need to tweak parameters
  • Supports reasoning about multiple fields
  • Provides advanced operations for revealing structure in your data
  • Works well with missing data

Cons:

  • Models cannot be downloaded for offline use
  • Does not report cross-validation scores indicating how well it makes predictions
  • Only supports one type of model
  • Somewhat slow model creation
  • No string/text data type; must remove strings from your dataset or treat them as categorical types (may or may not work)
  • Must write code to create and explore a model

Weka

Weka supports a wide range of machine learning algorithms, including some that are very similar to those used by the cloud-based services. Unlike the cloud-based services, which take care of the details for you, Weka exposes a variety of parameters for each algorithm. These parameters have reasonable defaults, but you may need to tweak them to get optimal predictive performance. In general, creating a good model in Weka requires significantly more machine learning expertise to preprocess the data, select the right algorithm, and tune the parameters.

Some of the algorithms in Weka really struggled on datasets with unique string fields. In one case, training a single model was taking a couple of hours, but the training time dropped to a few seconds once I removed the troublesome field. In another case, Weka’s memory usage skyrocketed, and it was unable to train a model from a moderately sized dataset until I allocated over 2GB of RAM to the JVM. Again, removing the unique string field solved the problem.

Training time and memory usage were not the only things affected by the unique string fields. The cross-validation scores (a measure of how well a model is able to make predictions) for many of Weka’s algorithms increased dramatically after removing these fields. Even though these unique strings should have no predictive value (it’s like trying to predict a person’s height based on the name of their first pet), their presence in your data can really sabotage performance. With Weka, it’s often important to understand how the algorithms work and what kind of data they can handle.
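
If you want to screen your data for these fields before training, a few lines of pandas will do it (the file name here is made up):

    # Sketch: drop string fields whose values are unique for every row.
    import pandas as pd

    df = pd.read_csv('communities.csv')
    unique_cols = [c for c in df.columns
                   if df[c].dtype == object and df[c].nunique() == len(df)]

    print('Dropping:', unique_cols)              # e.g. a community-name field
    df.drop(columns=unique_cols).to_csv('communities_clean.csv', index=False)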

For this comparison (more details in the next post), we narrowed Weka’s large list of algorithms down to ten, including linear regression, decision trees, Naive Bayes, support vector machines, and k-nearest neighbors. These were chosen to evaluate a wide variety of algorithms rather than to optimize performance. See the Machine Learning Throwdown details for a full list of the algorithms and their parameters. Note that these parameters were chosen in advance and are fixed for each dataset. As with the other services, we are not tuning any of the parameters to optimize performance.
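
As a rough sketch, here is how one of those ten (J48, Weka’s C4.5-style decision tree) can be run from the command line with ten-fold cross-validation and a larger JVM heap; the jar path and dataset name are assumptions:

    # Sketch: run Weka's J48 decision tree from Python via its command line.
    import subprocess

    result = subprocess.run(
        ['java', '-Xmx2g',                 # raise the JVM heap (see the crashes above)
         '-cp', 'weka.jar',                # path to your Weka installation's jar
         'weka.classifiers.trees.J48',
         '-t', 'dunk_data.arff',           # -t: training file (class is the last attribute)
         '-x', '10'],                      # -x: number of cross-validation folds
        capture_output=True, text=True)

    print(result.stdout)                   # includes the cross-validation scores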

Pros:

  • Supports a wide range of models
  • Some support for visualizing white box models (for certain algorithms including decision trees, Naive Bayes, etc.)
  • Reports cross-validation scores indicating how well it makes predictions
  • Can be created from the GUI without writing code
  • Can share models with anyone who has Weka installed

Cons:

  • Need to be aware of your chosen algorithm’s limitations, data preprocessing requirements, etc.
  • Training time and memory usage vary dramatically from algorithm to algorithm; must manually increase JVM heap size to avoid crashes while training models
  • Decision tree visualization is fine for small models but becomes a jumbled mess for moderately-sized models
  • Not optimized automatically; may need to tweak model parameters

Conclusion

As you can see, the models provided by each one of these services have their own strengths and weaknesses. Which one should you choose? It really depends on your data and your goals. BigML provides an amazing view into your data with a white box model, Prior Knowledge can help you discover complex relationships in your data, and Google Prediction API is convenient for working with text. If you are more of an expert in machine learning and the other services don’t meet your needs, Weka may be the right choice for you.

The next post in this series will look at how well the models from these services are able to generalize to make accurate predictions from unseen data.

(Note: As of Dec 5, 2012, Prior Knowledge no longer supports its public API.)

12 comments

  1. Nick, really enjoying your ML eval!
    I have some experience with the Google Prediction API and would note that you can create a model through developer GUIs by uploading your training data CSV with Google Cloud Storage Manager and then training the model through Google’s APIs Explorer. But Google’s process is not nearly as user friendly as BigML’s approach…
    -Kevin

  2. From your post, Google seems like the clear winner. Your cons for Google kind of show its strong points compared with what BigML lacks.
    Words like “very slow” and “poor” – are they really relevant without comparison data?

    1. Jonny, there can be many aspects of relevance. It totally depends on your situation. For full disclosure of all performance details, check the subsequent post on Predictions and the links given there. One company’s strength can be another company’s weakness, sure. I think Nick covered the full range of pros and cons across all six posts very well. It is up to the user to try them out and see which one is his/her personal winner, given the needs that you have for a machine learning service.

  3. Thanks for the explanation.
    I built a model in BigML based on the Forbes2000 dataset to predict marketvalue using the other remaining fields. Subsequently, I used the same dataset to build a classification model in R (the rpart package). However, BigML performed significantly better than rpart, and I have no idea why, because I do not know what algorithms BigML uses behind the curtain. Is it possible for you to shed some light on this?
