Machine learning and data mining play very nicely with data in a row-column format, where each row represents a data point and each column represents information about that point. It’s a natural format, and is of course the basis for things like spreadsheets, databases, and CSV files.
But what if your data isn’t so conveniently formatted? Let’s say you have an arbitrary pile of documents, like product reviews, and you’d like to classify each one. A simple thing to do would be to use word counts as features, but then you’re forced to make arbitrary decisions about which words are important. If you just use all words, you end up with thousands or maybe tens of thousands of features, which generally hurts the efficiency of machine learning algorithms. Moreover, simply counting words gives no information about the context in which each word was used.
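To make the word-count idea concrete, here’s a minimal bag-of-words sketch in plain Python, using a couple of made-up one-line reviews. Note how even two tiny documents already produce a six-column feature row; real review collections blow this up into the thousands of columns mentioned above.

```python
from collections import Counter

# Two hypothetical, very short product reviews (illustrative data only).
reviews = [
    "great phone great battery",
    "battery died fast terrible phone",
]

# Build the vocabulary across all documents.
vocab = sorted({word for doc in reviews for word in doc.split()})

# Each review becomes a row of word counts -- one column per vocabulary word.
rows = []
for doc in reviews:
    counts = Counter(doc.split())
    rows.append([counts[word] for word in vocab])

print(vocab)
print(rows)
```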
Does machine learning know a better way?
Thankfully, there are technologies called topic models, which take some steps towards solving this problem. The general idea is to look for “topics” in the data, which are essentially groups of words that often occur together (this is a gross oversimplification, but gives the correct flavor). For example, in a collection of news articles, you may discover a topic that has the words “Obama”, “Congress”, and “President”, which would correspond to the real-world topic of politics. We can then assign each document a score for each topic, indicating how well that document is “explained” by that topic. When we do this, we transform possibly tens of thousands of words into a small number (~10-100) of features, each one packed with information.
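Here’s a toy illustration of that document-to-topic-scores transformation. A real topic model (e.g. Latent Dirichlet Allocation) *discovers* the topics from the data; this sketch hard-codes two tiny word groups purely to show the shape of the output: a handful of topic scores instead of thousands of word counts.

```python
# Hard-coded "topics" for illustration -- a real topic model would learn these.
topics = {
    "politics": {"obama", "congress", "president", "senate"},
    "sports":   {"game", "score", "team", "season"},
}

def topic_scores(document):
    """Score a document against each topic: the fraction of its words
    that fall in that topic's word group."""
    words = document.lower().split()
    return {
        name: sum(w in group for w in words) / len(words)
        for name, group in topics.items()
    }

doc = "Obama asked Congress to act before the season opener"
print(topic_scores(doc))
```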
This is a fairly general way of thinking about this problem. For example, you could use the same technology on shopping baskets (arbitrary lists of product serial numbers, say), and the “topics” would be groups of serial numbers that are often purchased together. The main limitation on the usefulness of this is the average length of each document. Because we’re relying on word co-occurrence, we’d like our documents to be as long as possible so that we have lots of co-occurrences to work with. Twitter-length documents are about the point at which this stops being very useful.
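The co-occurrence signal itself is easy to picture. This sketch counts, over a few hypothetical baskets of made-up serial numbers, how often each pair of items lands in the same basket; pairs with high counts are the raw material a topic model groups into “topics”, and it’s clear why very short “documents” (baskets or tweets) yield few pairs to count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets: each is a list of product serial numbers.
baskets = [
    ["SN-101", "SN-202", "SN-303"],
    ["SN-101", "SN-202"],
    ["SN-202", "SN-303", "SN-404"],
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    pair_counts.update(combinations(sorted(basket), 2))

print(pair_counts.most_common(3))
```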
All in all, topic modeling is basically just a fancy, automated form of feature engineering that often works nicely on arbitrarily structured documents.
As a proof of concept, I’ve developed a small service called AdFormare. You upload a collection of documents, and we process it to discover the topics in the dataset and compute the topic scores for each document. As a bonus, we produce a nice visualization that shows you things like which topics often occur together, along with examples of documents that score highly for each topic.
Without going too deeply into it, here’s a sample visualization produced from a large collection of movie reviews:
And here’s a little tutorial that tells you what you’re looking at:
Coming Soon To BigML
We’re going to integrate the guts of this technology into BigML, so you can do topic modeling on the text fields in your BigML datasets, but I’m soliciting people to try this out on their own document collections so we can work out the bugs before we deploy. If you’ve got a collection of documents you’d like to see processed like this, by all means e-mail me (firstname.lastname@example.org).