Machine Learning Throwdown, Part 2 – Data Preparation

This is the second in a series of blog posts comparing BigML with other machine learning services. As you may recall from the first post in the series, I am primarily evaluating cloud-based services aimed at making machine learning accessible to non-experts like myself. Having previously introduced the competition and the criteria for comparison, let’s now see what it takes to get started and load your data into each service.

Think of a machine learning dataset as a simple table of data. Each row is an example that you want to learn from (e.g., the sales data for a particular wine), and each column describes a property of the data (e.g., the sales price of the wine). Your data may be stored in an Excel spreadsheet, a database, or perhaps even in a plain CSV (Comma-Separated Values) file. To begin learning from your data, you must properly format it and load it into your service of choice.
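For example, a tiny wine-sales dataset in CSV form might look like the following (the column names and values here are made up purely for illustration):

    country,grape,price,cases_sold
    France,Pinot Noir,29.99,1200
    Italy,Sangiovese,15.50,3400
    USA,Zinfandel,18.00,870

Each row is one wine (an example to learn from), and each column is a property of that wine that a model could use or try to predict.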

As part of this machine learning throwdown, I will be sharing my results from benchmarking the predictive performance of each service on some popular datasets from the UCI Machine Learning Repository. Note that these datasets are relatively clean by machine learning standards. See the Machine Learning Throwdown details for a description of each dataset chosen for this competition. This post will cover what I learned from importing the data into each service.

The Services

BigML

When I interviewed for my internship with BigML, one of the things they asked me to do was spend about 30 minutes making something interesting using their Python bindings. Seriously? I had previously created an account and poked around their website, but how could they expect me to write a program using their service in just 30 minutes? Luckily for me, they had already done all the hard work to make it as easy as possible to get up and running.

One thing that makes it really easy to get started with BigML is their web interface. Unlike the other cloud-based services, you can do everything on the website: upload your data, create and explore a model, and make predictions. That’s right, you can do all this without writing a single line of code! For developers, BigML has a RESTful API and a growing list of libraries for popular platforms and programming languages including Java, Python, Ruby, R, and even iOS.
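To give a flavor of the developer side, here is a minimal sketch of uploading a CSV file with the Python bindings. The file name is a placeholder, and these are just the basic calls as I understand them, so double-check the current bindings before copying this:

    # Minimal sketch: upload a CSV and build a dataset with the BigML Python bindings.
    # Assumes BIGML_USERNAME and BIGML_API_KEY are set in your environment.
    from bigml.api import BigML

    api = BigML()

    # Upload a local CSV file as a new source
    source = api.create_source("wine_sales.csv")
    api.ok(source)  # wait for the source to finish processing

    # Build a dataset from it; BigML auto-detects the field types
    dataset = api.create_dataset(source)
    api.ok(dataset)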

Pros:

  • Very easy to get started
  • Can upload data on the website or using the API
  • Auto-detects data types if you don’t specify them yourself
  • Good at parsing poorly-formatted data
  • Supports remote sources via URL (Amazon S3, etc.)
  • Can handle up to 64GB of data
  • Libraries for quite a few popular languages (though you may need to browse through the blog and/or their GitHub profile to find them)

Cons:

  • No support for adding data to an existing dataset/model

Google Prediction API

Fortunately, no one asked me to produce anything with Google Prediction API while the clock was ticking. There are multiple pieces to set up and the tools for uploading your data are separate from the tools for creating models and making predictions. On the bright side, many of these tools and libraries are common among other Google products and APIs. Much of the work required to get started should look familiar if you have used Google’s APIs in the past.

You can use a web interface, command line tool, or the API to upload your data in CSV format to Google Cloud Storage. Once your data is there, however, you will need to write code to work with the prediction API as there is no user-friendly web interface.
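To illustrate, kicking off training once your CSV is sitting in Cloud Storage looks roughly like the sketch below. The project, bucket, and model names are placeholders, the calls assume the google-api-python-client library, and the exact method names depend on which version of the Prediction API you target:

    # Rough sketch of starting a training job against the Google Prediction API.
    # "my-project", "my-bucket/wine_sales.csv", and the model id are placeholders.
    from googleapiclient.discovery import build
    from oauth2client.client import GoogleCredentials

    credentials = GoogleCredentials.get_application_default()
    service = build("prediction", "v1.6", credentials=credentials)

    # Train from a CSV file already uploaded to Google Cloud Storage
    service.trainedmodels().insert(
        project="my-project",
        body={
            "id": "wine-sales-model",
            "storageDataLocation": "my-bucket/wine_sales.csv",
        },
    ).execute()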

Pros:

  • Multiple methods for uploading data including via the website, command line tool, and API
  • Good integration with other Google services
  • Libraries for a variety of programming languages
  • Auto-detects data types
  • Supports updating an existing model with new data

Cons:

  • Significant setup time: separate tools/libraries for authenticating, uploading data, and working with the prediction API
  • Somewhat strict about data format (no header row allowed, objective column must be first, etc.)
  • Poor support for data with missing values
  • Libraries are very thin wrappers on top of the API (very few convenience functions)
  • Training data is limited to 250MB

Prior Knowledge

Prior Knowledge makes it relatively easy for developers to get started with their Veritable API, but they have no web interface for working with your data. Everything will need to be done through their RESTful API or one of their client libraries (currently available for Python and Ruby).

One notable thing about their API is that you can upload rows of data individually or in bulk as JSON data. With the Veritable API, you can bypass a CSV file completely if it’s more convenient for your application. However, my advice is to avoid this if you can. I spent half a day fighting with their picky API to get my data into a format it would accept. Save yourself the headache, save your data to a CSV file, and use the utilities in their client libraries for reading in CSV files. Either way, you will still need to explicitly specify the data types for each column, as they make no attempt to detect them automatically.
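For reference, the bulk-upload pattern with their Python client looked roughly like the sketch below. The API key, table name, and row values are all placeholders, and the method names are from memory, so treat this as an outline rather than working code:

    # Rough sketch of a bulk upload with the veritable-python client.
    # The API key, table id, and row contents are placeholders.
    import veritable

    api = veritable.connect("YOUR_API_KEY")
    table = api.create_table("wine-sales")

    # Rows go up as JSON-style dictionaries; each needs a unique _id
    table.batch_upload_rows([
        {"_id": "row1", "country": "France", "price": 29.99, "cases_sold": 1200},
        {"_id": "row2", "country": "Italy", "price": 15.50, "cases_sold": 3400},
    ])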

Pros:

  • Fairly easy to get started
  • Can upload data incrementally

Cons:

  • Picky about the format of the data
  • Must write code to properly format your data and upload it using the API (though their libraries provide utilities to make this process easier with CSV data)
  • Training data is limited to 10,000 rows and 300 columns by default (roughly equivalent to a 15MB CSV file)

Weka

Recall that Weka is the odd one out in this comparison, as it is a standalone application that you run on your own computer. While very powerful, it is targeted at those with a significant amount of machine learning expertise. There are multiple ways to interact with Weka including a GUI application, Java API, and command line interface. Weka accepts data in ARFF format, which is essentially a CSV file with type information about your data. It also includes the ability to open CSV files and automatically detect data types.
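As a concrete example, the toy wine data from earlier would look something like this in ARFF (attribute names and values invented for illustration); the header declares the type of each column before the CSV-like data section begins:

    @relation wine_sales

    @attribute country {France,Italy,USA}
    @attribute grape string
    @attribute price numeric
    @attribute cases_sold numeric

    @data
    France,'Pinot Noir',29.99,1200
    Italy,Sangiovese,15.50,3400
    USA,Zinfandel,18.00,870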

Pros:

  • Very powerful; lots of features
  • Built-in data preprocessing tools
  • Has a GUI, a Java API, and a command line interface
  • Some support for streaming data with the related MOA project

Cons:

  • Not cloud-based; must download and install the application
  • GUI is overwhelming for new users
  • Options are limited for controlling Weka from languages other than Java
  • Maximum dataset size greatly depends on your choice of algorithm, your computer’s memory, the JVM heap size used for Weka, and the properties of your data

Conclusion

That just about wraps up the discussion on getting started and importing your data. Based on my experience with these four services, BigML comes out way ahead of the “competition” in this category.

Of course, there’s no point in importing your raw data unless these services can work their magic to turn it into something more useful. The next post in this series will talk about the creation of predictive models to help make sense of your data. I’ll see you there!

(Note: As of Dec 5, 2012, Prior Knowledge no longer supports its public API.)
