Machine Learning Throwdown, Part 5 – Miscellaneous

This is your application stack.  The fourth level from the bottom represents cloud-based ML APIs.  Oh, snap!

In the fourth post of the series, I compared prediction functionality and performance across the services. We saw that while some services may make more accurate predictions than others, the runners-up often follow closely behind. That’s good news for you because it means you are free to choose among the services without being too concerned that you picked a dud when it comes to making accurate predictions. This post will cover some other miscellaneous topics that may help you choose which service will best meet your needs.

Stability

I previously hinted that I ran into stability issues with more than one service. Which ones gave me grief? Well, actually, all of the cloud-based services had problems. I was unable to rely on any of them to take my data, create a model, and then make predictions without occasionally failing. To collect the results, I often had to run experiments multiple times (without any changes to code or data) to work around random failures. BigML and Prior Knowledge are in early beta, so perhaps the occasional hiccup is forgivable, but Google Prediction API has been out of beta for nearly a year. Weka isn’t perfect either, but at least its issues are well documented, if sometimes difficult for non-experts to understand. As I mentioned in my third post on models, Weka often runs out of memory and crashes on moderately sized datasets, which requires you to restart the program with more memory allocated to it.

How concerned should you be about these stability issues? Random failures are a bit annoying if you are using these services interactively (e.g., through BigML’s web interface), but are more problematic when you use their APIs to integrate machine learning into your application. The good news is that most (but certainly not all) of the problems I had occurred while creating models, so there’s not too much to worry about if you create the model in advance and only integrate predictions into your application. If your application needs to train models unattended, you’ll just need to be extra sure you have good error handling (but shouldn’t you be doing this anyway?). For now, let’s just hope this post encourages each of these services to invest more time ensuring their APIs are rock solid.
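To make "good error handling" a little more concrete, here is a minimal sketch of retrying a flaky call with exponential backoff. It is not taken from any of these services' client libraries: with_retries, its parameters, and the operation you pass in are all hypothetical names, and real code should catch the specific error types your client raises rather than every exception.

```python
import time


def with_retries(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    operation is any zero-argument callable (hypothetical here), such as a
    function that asks a service to create a model. The wait between
    attempts doubles each time: base_delay, 2 * base_delay, and so on.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as error:  # assumption: narrow this in real code
            last_error = error
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

You would wrap the model-creation call in a small function and hand it to with_retries. The sleep parameter only exists so the waiting can be swapped out in tests.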

Cost

The pricing structures are very different between the services, so you really need to look at the details to determine which one will be cheapest based on your expected usage. BigML uses credits, which are cheaper if you buy them in bulk, while the others use regular money. Prior Knowledge charges for model creation based on the number of cells analyzed (#rows * #columns), while the others charge based on the size of the CSV files. Google Prediction API charges a $10/month project fee, while the others have no recurring costs. Some charge more for creating models; others make their money from predictions. They all have free trials, but some are more generous than others. Weka, of course, is completely free.
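If you want to estimate a cell count for your own data, a rough sketch along these lines should do. The count_cells function is a hypothetical helper, and whether the header row counts toward the billed total is my assumption (it is excluded by default here), so check the service's pricing page before relying on the number.

```python
import csv
import io


def count_cells(csv_text, has_header=True):
    """Return #rows * #columns for CSV text, skipping blank lines.

    Assumption: the header row (if any) is not counted as a data row.
    The column count is taken from the first row.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        return 0
    n_columns = len(rows[0])
    n_rows = len(rows) - 1 if has_header else len(rows)
    return n_rows * n_columns
```

For a file on disk, read it into a string first (or adapt the function to accept an open file) and pass the contents in.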

Because of the different pricing structures for the cloud-based services, I can’t really say that one is better than the other. What I can do is tell you how much it would cost to run the experiments I described in my previous post. Computing 10-fold cross-validation scores on all 11 datasets requires creating models from approximately 24MB of data (totaling over 8.5 million cells) and making about 23,000 predictions. Ignoring free trials, here are the approximate costs for each service based on published prices:

  • BigML: $6 assuming predictions are made using an offline model; additional $11.50 if using online predictions (through the website or API)
  • Google Prediction API: $6.50 + $10/month + Google Cloud Storage fees (negligible)
  • Prior Knowledge: $10.80
  • Weka: Free!
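Since one service has a monthly fee and another charges extra for online predictions, the totals shift depending on how you use them. The sketch below just adds up the approximate figures from the list above; the function and constant names are mine, the numbers cover one full run of my experiments, and Google Cloud Storage fees are left out as negligible.

```python
# Approximate costs quoted in the list above, in US dollars, for one
# full run of the experiments (free trials ignored).
ONE_TIME = {
    "BigML": 6.00,
    "Google Prediction API": 6.50,
    "Prior Knowledge": 10.80,
    "Weka": 0.00,
}
BIGML_ONLINE_PREDICTIONS = 11.50  # extra if predictions go through the website or API
GOOGLE_MONTHLY_FEE = 10.00        # project fee, charged per month


def experiment_cost(service, months=1, online_predictions=False):
    """Total cost of one experiment run, keeping the project alive for `months`."""
    total = ONE_TIME[service]
    if service == "BigML" and online_predictions:
        total += BIGML_ONLINE_PREDICTIONS
    if service == "Google Prediction API":
        total += GOOGLE_MONTHLY_FEE * months
    return total
```

For example, experiment_cost("BigML", online_predictions=True) adds the $11.50 for online predictions to the $6 base, and experiment_cost("Google Prediction API", months=3) shows how the project fee adds up if you keep the project around.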

Support

I can’t say much about the quality of support available for each service. However, I do know that BigML and Prior Knowledge are both startups that are eager to talk with their customers, learn how they are using the service, and help them work through issues. I’ve seen quite a few questions go unanswered on Google’s mailing list, but perhaps people have better luck using the feedback system, where Google says it will do its best to respond within 48 hours. Weka has a large user community where you may be able to get free help, and paid support is available from Pentaho.

The Services

BigML

Pros:

  • Developer mode where everything is free but each model can be created from at most 1MB of data (nearly all datasets for my experiments would qualify)
  • Very generous free trial
  • Eager to talk with customers and help them work through issues
  • Good developer documentation; everything is easy to find

Cons:

  • Random API errors (failed dataset creation; models reporting errors shortly after they are created)
  • No full sample applications using their API, only code snippets

Google Prediction API

I honestly expected Google’s service to be the most reliable, but I was shocked when it turned out to be the most problematic for me. I most likely just started using their API at the wrong time. I saw reports of similar problems from other people on the mailing list during the days I was having issues. I doubt it’s always this unstable, but, based on their mailing list, problems like this do seem to come up quite frequently.

I ran into problems including erroneous ‘No Model Found’ messages while trying to make predictions and random failures during model creation with error messages such as ‘Error’, ‘Backend Error’, ‘Internal Error’, and ‘Pwnd, n00b’ (okay, not that last one). I had to restart my script more than five times just to collect results on the tiny iris dataset. I thought there was no hope of collecting results for some of the larger datasets, but I walked away for a few days and things worked significantly better when I came back. I was still getting occasional errors, but at least I was able to collect results for the remaining datasets.

Pros:

  • Provides full sample applications using their API

Cons:

  • Random API errors (failed model creation; model errors when making predictions)
  • $10 per month base fee to keep your project active even if you don’t use it
  • Developer documentation is spread over multiple sites, so it can be hard to find (e.g., cloud storage, authentication, and the prediction API are each documented separately)

Prior Knowledge

Pros:

  • Unlimited predictions with the free trial (using models created from up to a total of one million data cells, roughly equivalent to 5MB of CSV data)
  • Eager to talk with customers and help them work through issues
  • Provides full sample applications using their API

Cons:

  • Random API errors (failed model creation; timeouts when making predictions)
  • I ran into some frustrating undocumented restrictions while using their API (one example: their “count” data type is limited to 100,000, which I discovered only by trial and error)

Weka

Pros:

  • Free! Paid support is also available from Pentaho
  • Large number of users makes it easier to get support
  • Plenty of sample code if you search around
  • Tons of documentation including one third of a data mining book

Cons:

  • Often runs out of memory and crashes
  • There’s not one clear place to go for support
  • If you want it in the cloud, you’re running your own servers

Conclusion

My next post will wrap up this entire throwdown and summarize my experiences with these services. Join me, won’t you?

(Note: As of Dec 5, 2012, Prior Knowledge no longer supports its public API.)
