Using Machine Learning to Gain Divine Insights into Kindle Ratings

I recently came across a 1995 Newsweek article titled “Why the Web Won’t be Nirvana” in which author and astronomer Cliff Stoll posited the following:

“How about electronic publishing? Try reading a book on disc. At best, it’s an unpleasant chore: the myopic glow of a clunky computer replaces the friendly pages of a book. And you can’t tote that laptop to the beach. Yet Nicholas Negroponte, director of the MIT Media Lab, predicts that we’ll soon buy books and newspapers straight over the Internet. Uh, sure.”

Well, it turns out that Mr. Stoll was slightly off the mark (not to mention his bearish predictions on e-commerce and virtual communities). Electronic books have been a revelation, with the Kindle format far and away the most popular. In fact, over 30% of books are now purchased and read electronically. And Kindle readers are diligent about providing reviews and ratings for the books they consume (and of course the helpful prod at the end of each book doesn’t hurt).

So that got us thinking: are there hidden factors in a Kindle book’s data that impact its rating? Luckily, import.io makes it easy to grab data for analysis, and we did exactly that, pulling down over 58,000 Kindle reviews which we could quickly import into BigML for more detailed analysis.

My premise going into the analysis was that the author and the words in the book’s description, along with the length of the book, would have the greatest impact on the number of stars a book receives. Let’s see what I found out after putting this premise (and the data) to the machine learning test via BigML…

Data

We uploaded over 58,000 Kindle reviews, capturing URL, title, author, price, save (whether or not the book was saved), pages, text description, size, publisher, language, text-to-speech enabled (y/n), X-Ray enabled (y/n), lending enabled (y/n), number of reviews, and stars (the rating).

This data source includes text, numeric, and categorical fields. To optimize the text processing for authors, I selected “Full Terms only,” as I don’t think that first names have any bearing on the results.

[Image: the author field’s text analysis settings, set to “Full terms only”]
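
If you’d rather script this step, here is a minimal sketch using BigML’s Python bindings. The file name and the “000002” field ID are hypothetical placeholders; check your own source for the author field’s actual ID.

```python
from bigml.api import BigML

# Authenticates via the BIGML_USERNAME / BIGML_API_KEY environment variables
api = BigML()

# Upload the scraped Kindle data (the file name is hypothetical)
source = api.create_source("kindle_reviews.csv")
api.ok(source)  # block until the source has finished processing

# Treat author names as whole terms so that first names are not
# tokenized separately ("000002" is a placeholder for the author
# field's ID -- check your own source for the real one)
api.update_source(source, {
    "fields": {"000002": {"term_analysis": {"token_mode": "full_terms_only"}}}})
```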

I then created a dataset with all of the fields, and from this view, I can see that several of the key fields have some missing values:

[Image: dataset view showing missing values in several key fields]

Since I am most interested in seeing how the book descriptions impact the model, I decided to filter my dataset so that only the instances that contain descriptions are included. BigML makes it easy to do this by simply selecting “filter dataset”:

[Image: the “filter dataset” option]

and from there I can choose which fields to filter, and how I’d like them to be filtered.  In this case I selected “if value isn’t missing” so that the filtered dataset will only include instances where those fields have complete values:

[Image: filter configuration with “if value isn’t missing” selected]
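
The same filter can be applied through the API as well. A sketch, assuming Flatline’s missing? predicate and the field name as it appears in my dataset (the dataset ID is a placeholder):

```python
from bigml.api import BigML

api = BigML()

# Keep only rows where "description" is present; lisp_filter takes a
# Flatline expression that is evaluated against each instance.
# "dataset/<id>" stands in for the full dataset's ID.
filtered = api.create_dataset("dataset/<id>", {
    "name": "Amazon Kindle - all descriptions",
    "lisp_filter": '(not (missing? "description"))'})
api.ok(filtered)
```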

And just like that, I now have a new dataset with roughly 50,000 rows of data called “Amazon Kindle – all descriptions” (you can clone this dataset here). I then take a quick look at the tag cloud for description, which is always interesting:

[Image: tag cloud of the most frequent terms in the description field]

In the above image we see generic book-oriented terms like “book,” “author,” “story” and the like coming up most frequently – but we also see terms like “American,” “secret,” and “relationship” which may end up influencing ratings.

Building my model

My typical approach is to build a model and see if there are any interesting patterns or findings.  If there are, I’ll then go back and do a training/test split on my dataset so I can evaluate the strength of said model.  For my model, I tried various iterations of the data (this is where BigML’s subscriptions are really handy!).  I’ll spare you the gory details of my iterations, but for the final model I used the following fields:  price, pages, description, lending, and number of reviews. You can clone the model into your own dashboard here.
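
For the scripted equivalent, here’s a rough sketch of the model creation call. All of the field IDs below are hypothetical placeholders (BigML’s input_fields expects field IDs, so map them from your own dataset view):

```python
from bigml.api import BigML

api = BigML()

# "dataset/<filtered-id>" stands in for the filtered dataset's ID;
# every field ID below is a hypothetical placeholder
model = api.create_model("dataset/<filtered-id>", {
    "name": "Kindle stars model",
    "objective_field": "000014",  # stars
    "input_fields": [
        "000003",  # price
        "000005",  # pages
        "000006",  # description
        "000012",  # lending
        "000013",  # number of reviews
    ]})
api.ok(model)
```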

What we immediately see in looking at the tree is a big split at the top, based on description, with the key term being “god”. By hovering over the nodes immediately following the root node, we see that any book whose description contains “god” has a predicted rating of 4.46 stars:

[Image: node for descriptions containing “god”, predicting 4.46 stars]

while those without “god” in the description have a predicted rating of 4.27:

[Image: node for descriptions without “god”, predicting 4.27 stars]
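
You can sanity-check a split like this by requesting a prediction directly. A sketch with invented input values (the model ID is a placeholder, and any fields you omit are handled by BigML’s missing-value strategy):

```python
from bigml.api import BigML

api = BigML()

# "model/<id>" is a placeholder; the input values are invented
prediction = api.create_prediction("model/<id>", {
    "description": "an inspiring story of god and faith",
    "price": 4.99,
    "pages": 250,
    "lending": "Enabled"})
api.ok(prediction)
api.pprint(prediction)  # prints the predicted star rating
```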

Going back to the whole tree, I selected the “Frequent Interesting Patterns” option in order to quickly see which patterns in the model are most relevant, and in the picture below we see six branches with confident, frequent predictions:

[Image: decision tree with the frequent interesting patterns highlighted]

The highest predicted value is on the far right (zoomed below), where we predict 4.57 stars for a book that contains “god” in the description (but not “novel” or “mystery”), costs more than $3.45 and has lending enabled.

[Image: the branch with the highest predicted rating, 4.57 stars]

Conversely, the prediction with the lowest rating does not contain “god,” “practical,” “novel” or several other terms, is over 377 pages, cannot be loaned, and costs between $8.42 and $11.53:

[Image: the branch with the lowest predicted rating]

Looking through the rest of the tree, you can find other interesting splits on terms like “inspired” and “practical” as well as the number of pages that a book contains.

Okay, so let’s evaluate

Evaluating your model is an important step as data most certainly can lie (or mislead), so it is critical to test your model to see how strong it truly is.  BigML makes this easy: with a single step I can create a training/test split (80/20), which will enable me to build a model with the same parameters from the training set, and then evaluate that against the 20% hold-out set. (You can read more about BigML’s approach to evaluations here).
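
In the Python bindings, that workflow is a few calls: sampling twice with the same seed, once with out_of_bag set, yields the complementary 80% and 20% datasets. A sketch, with a placeholder dataset ID:

```python
from bigml.api import BigML

api = BigML()
full = "dataset/<id>"  # placeholder for the filtered dataset's ID

# Deterministic 80/20 split: same seed, out_of_bag flips the sample
train = api.create_dataset(full, {"sample_rate": 0.8, "seed": "kindle"})
test = api.create_dataset(full, {"sample_rate": 0.8, "seed": "kindle",
                                 "out_of_bag": True})
api.ok(train)
api.ok(test)

# Train on the 80% and evaluate against the 20% hold-out
model = api.create_model(train)
evaluation = api.create_evaluation(model, test)
api.ok(evaluation)
print(evaluation["object"]["result"]["model"])  # MAE, R-squared, etc.
```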

The results are as follows:

[Image: evaluation results for the single decision tree]

You can see that we have some lift over mean-based or random-based predictions, albeit somewhat moderate. Just for kicks, I decided to see how a 100-model ensemble would perform, and as you’ll see below, we have improvement across the board:

[Image: evaluation results for the 100-model ensemble]
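
The ensemble run is one extra call. A sketch, with the train/test dataset IDs from the split above as placeholders:

```python
from bigml.api import BigML

api = BigML()

# Substitute the IDs of the 80/20 datasets from the previous snippet
train = "dataset/<train-id>"
test = "dataset/<test-id>"

# 100-model ensemble on the training set, scored on the same hold-out
ensemble = api.create_ensemble(train, {"number_of_models": 100})
api.ok(ensemble)

evaluation = api.create_evaluation(ensemble, test)
api.ok(evaluation)
print(evaluation["object"]["result"]["model"])
```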

Conclusion

This was largely a fun exercise, but it demonstrates that machine learning and decision trees can be informative beyond the predictive models they create. By simply mousing through the decision tree, I was able to uncover a variety of insights into which data points lend themselves to positive or less positive Kindle book ratings. From a practical standpoint, a publisher could build a similar model and factor it into its decision-making before green-lighting a book.

Of course, my other takeaway is that if I want to write a highly rated Kindle title on Amazon, it’s going to have to have something to do with God and inspiration.

PS – in case you didn’t catch the links in the post above, you can view and clone the original dataset, filtered dataset and model through BigML’s public gallery.
