Exploring 250,000+ Movies with Association Discovery
Hot on the heels of our Fall 2015 Release webinar including our Association Discovery (aka Association Rule Learning) implementation, we wanted to give this new capability a spin on our blog in order to get our readers warmed up. It is worth noting that there are many potential use cases of this technique including promotional pricing or bundling of items that are closely related, market basket analysis, Web usage mining, intrusion detection, and bioinformatics to analyze public genomic and proteomic databases among others.
Generally speaking, anytime you are challenged with uncovering statistically significant relations between variables with potentially thousands or even millions of different values, you will find Association Discovery handy as it does a great job in weeding out spurious associations to let you concentrate on the interesting ones. In a prior post, we covered some of the ways our proprietary Association Discovery algorithm differs from typical statistical approaches to categorical association analysis (e.g. log-linear analysis) so we won’t go into the same details here.
To quickly demonstrate how it works, we used the Home Theatre Info dataset containing movie metadata information on over 250,000 DVDs offered in North America. As usual, we completed some basic table joins and data wrangling in order to be able to feed this raw DVD data to BigML in a Machine Learning ready format. We’ll save you the gory details here and instead concentrate on the results. However, we would encourage you to download our User Guide and follow the end-to-end process later on.
The first exercise we ran uses a simpler subset of our dataset consisting of only the Director-Actor pairs. Running Association Discovery on these pairs quickly surfaced the top 100 association rules, which together involve 69 different variable-value pairs (e.g., Actor ID = 72342).
Looking at the visual above (which we manually augmented with some movie images), we can see the top associations between Actors and/or Directors. Even though the dataset had separate fields for Actor and Director IDs, the Association Discovery algorithm was able to identify rules of every combination: Actor vs. Actor, Actor vs. Director, and Director vs. Director. Actors are automatically marked in orange and Directors in blue. There are some interesting World Cinema associations that were discovered without any supervision:
- The most prolific network of collaboration is in the bottom left, which points to Japanese Director Daisuke Nishio’s Dragonball Z creative team, including the many voice-over artists who bring his anime characters to life.
- A lesser-known relationship links the middleweight boxer turned low-budget film actor Dick Miller and the prolific engineer turned independent film producer/director Roger Corman, known for his horror flicks. IMDb cites: “Miller settled in Los Angeles in the mid-1950s, where he was noticed by producer/director Roger Corman, who cast him in most of his low-budget films, usually playing unlikeable sorts, such as a vacuum-cleaner salesman in Not of This Earth (1957).”
- Also of note is the unique self-referential relationship that exists between actor/director Clint Eastwood and himself. Talk about a Do-It-Yourself guy!
Just like that, the chaotic history of moviemaking dating back to the silent era comes into focus, pointing to some of the strongest collaborations from its past that have soundly “beaten the odds” of randomness. Those of you who are more experienced Machine Learning practitioners may be thinking “So what, we had a bunch of paired data that…I could just run a ‘group by’ SQL query against that same table to arrive at similar results.” That is indeed true, but even in this simple case you get the added benefit of association strength measures and a network visualization, which bring out patterns that may not be apparent at a glance in a long list of database query results.
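For readers who want to see what the ‘group by’ baseline looks like, here is a minimal Python sketch. The rows are made-up placeholders rather than records from the Home Theatre Info dataset, and the one-pair-per-row layout and the function names are our own assumptions for illustration.

```python
from collections import Counter

# Hypothetical (director, actor) rows, one per film credit.
# These are placeholders, not data from the real dataset.
credits = [
    ("director_A", "actor_X"),
    ("director_A", "actor_X"),
    ("director_A", "actor_Y"),
    ("director_B", "actor_X"),
    ("director_B", "actor_Z"),
    ("director_A", "actor_X"),
]

def pair_counts(rows):
    """Count identical pairs, like COUNT(*) with GROUP BY director, actor."""
    return Counter(rows)

def top_pairs(rows, n=3):
    """Return the n most frequent pairs, most frequent first."""
    return pair_counts(rows).most_common(n)

if __name__ == "__main__":
    for (director, actor), count in top_pairs(credits):
        print(director, actor, count)
```

Raw counts like these tend to favor prolific individuals. They do not tell you whether a pair shows up together more often than chance would predict, which is the gap that association strength measures fill.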
Once we turned our attention to our main dataset, which adds the data fields that our warm-up exercise left out, the highest-leverage association rules that floated to the top consistently involved the Genre, Studio and Rating fields. These rules show significant “Lift”, which justifies taking them seriously. For example, the first rule, stating that whenever the Genre is “Anime” the Rating is “MA13”, has a lift score of approximately 26. This means the two values occur together about 26 times more often in our dataset than would be expected if they were unrelated. It is also worth noticing that the second rule in the result set is the reverse of the first rule, with the same Leverage and Lift values, as expected. Yet the Coverage, Support and Confidence scores are different because those are calculated with respect to the “Antecedent”, which is different in each case. For a quick explanation of the Association Discovery terminology, you can refer to the related documentation.
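To make these measures a bit more concrete, here is a small Python sketch that computes them for a rule and for its reverse. It uses the common textbook definitions, which may differ in detail from what our implementation reports, and the ten toy rows are invented for illustration (they do not reproduce the lift of 26 above). The function name `rule_metrics` is hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Textbook association-rule measures for antecedent -> consequent.

    transactions: list of sets of items
    antecedent, consequent: sets of items
    """
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    if n == 0 or n_a == 0 or n_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")

    coverage = n_a / n                # share of rows matching the antecedent
    consequent_support = n_c / n      # share of rows matching the consequent
    joint = n_both / n                # share of rows matching both sides
    confidence = n_both / n_a         # P(consequent | antecedent)
    return {
        "coverage": coverage,
        "confidence": confidence,
        "lift": confidence / consequent_support,
        "leverage": joint - coverage * consequent_support,
    }

# Invented toy data: 2 Anime rows (both MA13), 2 other MA13 rows, 6 unrelated rows.
rows = (
    [{"genre=Anime", "rating=MA13"}] * 2
    + [{"genre=Horror", "rating=MA13"}] * 2
    + [{"genre=Drama", "rating=PG"}] * 6
)

forward = rule_metrics(rows, {"genre=Anime"}, {"rating=MA13"})
reverse = rule_metrics(rows, {"rating=MA13"}, {"genre=Anime"})

if __name__ == "__main__":
    print("Anime -> MA13:", forward)
    print("MA13 -> Anime:", reverse)
```

In this toy example both directions share a lift of 2.5 and a leverage of 0.12, while coverage (0.2 vs. 0.4) and confidence (1.0 vs. 0.5) change with the antecedent.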
Looking at the results of our second exercise, we found strong associations that we were able to confirm from our prior knowledge, and these serve as proof points:
- Director ID 732 associated with the movie studio Dreamworks is none other than Mr. Steven Spielberg.
- Director ID 3974 associated with Nickelodeon is Chris Gifford – the creator of the hugely popular children’s cartoon series ‘Dora the Explorer’.
- PBS (Public Broadcasting Service) is strongly associated with the directorial godfather to many in the documentary genre, Ken Burns as well as Stephen Ives (known for The American Experience series among others).
- It was also very plausible that the Pokemon creator Masamitsu Hidaka and Dave Filoni of Star Wars anime fame are tied with the ‘General Audiences’ movie rating.
This analysis also raises the challenge of identifying what makes an “interesting” association. Many academic papers have been written on this seemingly simple question, so it is more complex than it appears. For the sake of our exercise, we leave this as a judgment the subject-matter expert has to make. If an association is useful in his/her setting, then it is good to go. If not, just skip to the next one.
The key idea here is that Association Discovery is your best friend when you need to find non-spurious relationships in heaps of data that include many variables, each with many values of its own. If you consider that the alternative is exhaustively searching BILLIONS of variable-value pairs and their relationships with one another, you come to appreciate how much effort this methodology can save.
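To get a feel for the scale, here is a quick back-of-the-envelope sketch in Python. The item count of 100,000 is purely illustrative and not a figure from our dataset. It only counts unordered pairs of distinct variable-value items, so rules with more items on either side would push the number higher still.

```python
from math import comb

def candidate_pair_count(num_items):
    """Number of unordered pairs among num_items distinct variable-value items."""
    return comb(num_items, 2)

if __name__ == "__main__":
    # 100,000 is an illustrative item count, not a measured one.
    print(f"{candidate_pair_count(100_000):,}")
```

With 100,000 distinct items that is already close to five billion candidate pairs.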
As all our users have simultaneously gotten access to Association Discovery with this week’s launch, we hope that you give it a spin and let us know what you think by sending us a note at email@example.com with your ideas and feedback.