Machine Learning Throwdown, Part 2 – Data Preparation

This is the second in a series of blog posts comparing BigML with other machine learning services. As you may recall from the first post in the series, I am primarily evaluating cloud-based services aimed at making machine learning accessible to non-experts like myself. Having previously introduced the competition and the criteria for comparison, let’s now see what it takes to get started and load your data into each service.

Think of a machine learning dataset as a simple table of data. Each row is an example that you want to learn from (e.g., the sales data for a particular wine), and each column describes a property of the data (e.g., the sales price of the wine). Your data may be stored in an Excel spreadsheet, a database, or perhaps even in a plain CSV (Comma-Separated Values) file. To begin learning from your data, you must properly format it and load it into your service of choice.
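For example, a tiny wine-sales dataset in CSV form might look like the following (the column names and values here are made up purely for illustration):

    country,grape,price,cases_sold
    France,Pinot Noir,29.99,1200
    Italy,Sangiovese,15.50,3400
    USA,Zinfandel,18.00,870

Each row is one wine (an example to learn from), and each column is a property of that wine that a model could use or try to predict.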

As part of this machine learning throwdown, I will be sharing my results from benchmarking the predictive performance of each service on some popular datasets from the UCI Machine Learning Repository. Note that these datasets are relatively clean by machine learning standards. See the Machine Learning Throwdown details for a description of each dataset chosen for this competition. This post will cover what I learned from importing the data into each service.

The Services

BigML

When I interviewed for my internship with BigML, one of the things they asked me to do was spend about 30 minutes making something interesting using their Python bindings. Seriously? I had previously created an account and poked around their website, but how could they expect me to write a program using their service in just 30 minutes? Luckily for me, they had already done all the hard work to make it as easy as possible to get up and running.

One thing that makes it really easy to get started with BigML is their web interface. Unlike the other cloud-based services, you can do everything on the website: upload your data, create and explore a model, and make predictions. That’s right, you can do all this without writing a single line of code! For developers, BigML has a RESTful API and a growing list of libraries for popular platforms and programming languages including Java, Python, Ruby, R, and even iOS.
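To give a flavor of the developer side, here is a minimal sketch of uploading a CSV file with the Python bindings. The file name is a placeholder, and these are just the basic calls as I understand them, so double-check the current bindings before copying this:

    # Minimal sketch: upload a CSV and build a dataset with the BigML Python bindings.
    # Assumes BIGML_USERNAME and BIGML_API_KEY are set in your environment.
    from bigml.api import BigML

    api = BigML()

    # Upload a local CSV file as a new source
    source = api.create_source("wine_sales.csv")
    api.ok(source)  # wait for the source to finish processing

    # Build a dataset from it; BigML auto-detects the field types
    dataset = api.create_dataset(source)
    api.ok(dataset)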

Pros:

  • Very easy to get started
  • Can upload data on the website or using the API
  • Auto-detects data types if you don’t specify them yourself
  • Good at parsing poorly-formatted data
  • Supports remote sources via URL (Amazon S3, etc.)
  • Can handle up to 64GB of data
  • Libraries for quite a few popular languages (though you may need to browse through the blog and/or their GitHub profile to find them)

Cons:

  • No support for adding data to an existing dataset/model

Google Prediction API

Fortunately, no one asked me to produce anything with Google Prediction API while the clock was ticking. There are multiple pieces to set up and the tools for uploading your data are separate from the tools for creating models and making predictions. On the bright side, many of these tools and libraries are common among other Google products and APIs. Much of the work required to get started should look familiar if you have used Google’s APIs in the past.

You can use a web interface, command line tool, or the API to upload your data in CSV format to Google Cloud Storage. Once your data is there, however, you will need to write code to work with the prediction API as there is no user-friendly web interface.
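To illustrate, kicking off training once your CSV is sitting in Cloud Storage looks roughly like the sketch below. The project, bucket, and model names are placeholders, the calls assume the google-api-python-client library, and the exact method names depend on which version of the Prediction API you target:

    # Rough sketch of starting a training job against the Google Prediction API.
    # "my-project", "my-bucket/wine_sales.csv", and the model id are placeholders.
    from googleapiclient.discovery import build
    from oauth2client.client import GoogleCredentials

    credentials = GoogleCredentials.get_application_default()
    service = build("prediction", "v1.6", credentials=credentials)

    # Train from a CSV file already uploaded to Google Cloud Storage
    service.trainedmodels().insert(
        project="my-project",
        body={
            "id": "wine-sales-model",
            "storageDataLocation": "my-bucket/wine_sales.csv",
        },
    ).execute()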

Pros:

  • Multiple methods for uploading data including via the website, command line tool, and API
  • Good integration with other Google services
  • Libraries for a variety of programming languages
  • Auto-detects data types
  • Supports updating an existing model with new data

Cons:

  • Significant setup time: separate tools/libraries for authenticating, uploading data, and working with the prediction API
  • Somewhat strict about data format (no header row allowed, objective column must be first, etc.)
  • Poor support for data with missing values
  • Libraries are very thin wrappers on top of the API (very few convenience functions)
  • Training data is limited to 250MB

Prior Knowledge

Prior Knowledge makes it relatively easy for developers to get started with their Veritable API, but they have no web interface for working with your data. Everything will need to be done through their RESTful API or one of their client libraries (currently available for Python and Ruby).

One notable thing about their API is that you can upload rows of data individually or in bulk as JSON data. With the Veritable API, you can bypass a CSV file completely if it’s more convenient for your application. However, my advice is to avoid this if you can. I spent half a day fighting with their picky API to get my data into a format it would accept. Save yourself the headache, save your data to a CSV file, and use the utilities in their client libraries for reading in CSV files. Either way, you will still need to explicitly specify the data types for each column, as they make no attempt to detect them automatically.
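For reference, the bulk-upload pattern with their Python client looked roughly like the sketch below. The API key, table name, and row values are all placeholders, and the method names are from memory, so treat this as an outline rather than working code:

    # Rough sketch of a bulk upload with the veritable-python client.
    # The API key, table id, and row contents are placeholders.
    import veritable

    api = veritable.connect("YOUR_API_KEY")
    table = api.create_table("wine-sales")

    # Rows go up as JSON-style dictionaries; each needs a unique _id
    table.batch_upload_rows([
        {"_id": "row1", "country": "France", "price": 29.99, "cases_sold": 1200},
        {"_id": "row2", "country": "Italy", "price": 15.50, "cases_sold": 3400},
    ])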

Pros:

  • Fairly easy to get started
  • Can upload data incrementally

Cons:

  • Picky about the format of the data
  • Must write code to properly format your data and upload it using the API (though their libraries provide utilities to make this process easier with CSV data)
  • Training data is limited to 10,000 rows and 300 columns by default (roughly equivalent to a 15MB CSV file)

Weka

Recall that Weka is the odd one out in this comparison, as it is a standalone application that you run on your own computer. While very powerful, it is targeted at those with a significant amount of machine learning expertise. There are multiple ways to interact with Weka including a GUI application, Java API, and command line interface. Weka accepts data in ARFF format, which is essentially a CSV file with type information about your data. It also includes the ability to open CSV files and automatically detect data types.
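As a concrete example, the toy wine data from earlier would look something like this in ARFF (attribute names and values invented for illustration); the header declares the type of each column before the CSV-like data section begins:

    @relation wine_sales

    @attribute country {France,Italy,USA}
    @attribute grape string
    @attribute price numeric
    @attribute cases_sold numeric

    @data
    France,'Pinot Noir',29.99,1200
    Italy,Sangiovese,15.50,3400
    USA,Zinfandel,18.00,870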

Pros:

  • Very powerful; lots of features
  • Built-in data preprocessing tools
  • Has a GUI, a Java API, and a command line interface
  • Some support for streaming data with the related MOA project

Cons:

  • Not cloud-based; must download and install the application
  • GUI is overwhelming for new users
  • Options are limited for controlling Weka from languages other than Java
  • Maximum dataset size greatly depends on your choice of algorithm, your computer’s memory, the JVM heap size used for Weka, and the properties of your data

Conclusion

That just about wraps up the discussion on getting started and importing your data. Based on my experience with these four services, BigML comes out way ahead of the “competition” in this category.

Of course, there’s no point in importing your raw data unless these services can work their magic to turn it into something more useful. The next post in this series will talk about the creation of predictive models to help make sense of your data. I’ll see you there!

(Note: As of Dec 5, 2012, Prior Knowledge no longer supports its public API.)
