How BigML Finds Important Variables in Wide Datasets
This blog post is based on a talk I gave at the Dare2Data conference in Madrid.
I recently found a fascinating sociology survey with more than 39,000 responses to almost 400 questions. The survey, which has been given in the United States since 1972, covers a wide range of topics. Besides demographic info like age, gender, race and income, the survey also covers personal beliefs (“Should racists be allowed to teach college?”), living situation (“Have you been too tired to do housework recently?”) and life experience (“Have you ever injected illicit drugs?”).
While it’s great to have a dataset that’s so, um, rich, most of the variables are simply not relevant to whatever it is I want to predict. If I’m predicting whether your income is higher or lower than the United States median of $50,000, it doesn’t really matter if you’ve received a traffic ticket for a moving violation, or if you think marriage counseling is scientific. (Yes, those are actual questions.)
This is where BigML comes in. Because our algorithm does a “greedy” search through the data, examining every input individually to see how well it predicts the output, it excels at finding the needle of insight in a haystack of irrelevance. BigML actually does check whether moving violations predict income, but quickly learns that marital status, education, employment and age are much more useful.
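To make "examining every input individually" concrete, here is a minimal sketch of that kind of greedy scoring. It is not BigML's actual implementation: I'm assuming information gain as the scoring measure, and the function names, field names and toy rows are all made up for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, field, target):
    """Entropy reduction from splitting the rows on one categorical field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[field]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


def rank_fields(rows, target):
    """Score every input field on its own and sort from most to least useful."""
    fields = [f for f in rows[0] if f != target]
    scores = {f: information_gain(rows, f, target) for f in fields}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up toy rows (hypothetical fields, not taken from the survey).
toy_rows = [
    {"degree": "yes", "traffic_ticket": "yes", "income": "high"},
    {"degree": "yes", "traffic_ticket": "no", "income": "high"},
    {"degree": "no", "traffic_ticket": "yes", "income": "low"},
    {"degree": "no", "traffic_ticket": "no", "income": "low"},
]

for field, gain in rank_fields(toy_rows, "income"):
    print(f"{field}: {gain:.2f} bits")
```

In this toy data the hypothetical `degree` field separates the two income groups perfectly, while `traffic_ticket` tells you nothing, so `degree` ranks first. A tree learner would typically repeat this scoring inside each resulting subset, which is what makes the search "greedy": it takes the best split available at each step without looking ahead.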
Of course, if you change what you’re trying to predict, the list of important variables changes too. At Dare2Data, I tried predicting political beliefs instead of income, with interesting results. (Since I excluded moderates from the training set, it’s more accurate to say that I’m predicting strongly held political beliefs.)
For example, if you meet these five criteria, then you identify as conservative more than 85% of the time:
- You disapprove of homosexuality (or don’t respond to the question);
- You disapprove of sex before marriage (or don’t respond to the question);
- You are white;
- You go to church almost every week;
- You live in a single-family detached house (a proxy for living in the suburbs).
Of the 2,550 people who meet these five criteria, 2,224 (about 87%) identify as conservative. This group, whose members might call themselves “social conservatives”, makes up an impressive 19% of all conservatives in the dataset.
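If you want to check the arithmetic, the two numbers to look at are the rule's confidence (how often people matching the rule are conservative) and its coverage (what share of all conservatives the rule captures). The sketch below uses only the counts quoted above. The function names are my own, and the total number of conservatives is not given in this post, so the last figure is backed out from the rounded 19% and is only an approximation.

```python
def rule_confidence(matches, matches_in_class):
    """Share of the people matching a rule who belong to the target class."""
    return matches_in_class / matches


def rule_coverage(matches_in_class, class_total):
    """Share of the whole target class that the rule captures."""
    return matches_in_class / class_total


# Counts quoted in the post: 2,550 people match, 2,224 of them are conservative.
confidence = rule_confidence(2550, 2224)
print(f"confidence: {confidence:.1%}")

# Assumption: the post gives coverage (19%) but not the class total, so this
# inverts the rounded percentage to get a rough estimate of that total.
implied_conservatives = 2224 / 0.19
print(f"implied number of conservatives: roughly {implied_conservatives:,.0f}")
print(f"coverage check: {rule_coverage(2224, implied_conservatives):.0%}")
```

A rule is only interesting when both numbers are high: a very specific rule can be nearly 100% accurate while describing a handful of people.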
The model even finds a sixth factor: if you are also Protestant, but not United Methodist, then you are even more likely to be conservative. At first I thought this was just noise, but there is actually a large liberal wing within the United Methodist Church that supports same-sex marriage. Amazingly, BigML is able to find this nuance in the data—talk about a needle in a haystack!
On the liberal side, there’s a group that doesn’t disapprove of homosexuality, does disapprove of the death penalty, and is strongly pro-choice. This group is about 85% liberal, accounting for 12% of all liberals in the dataset. Again, it’s remarkable that BigML can find groups of people that behave in such recognizable ways, even though it knows nothing about politics, religion, or other touchy subjects.
Once again, only a small subset of the 400 variables actually matters for prediction.
Hopefully I’ve conveyed how great BigML is at sifting through a dataset with lots of variables. This type of “wide” dataset pops up all the time in business, especially when examining customer behavior, and traditional tools like Excel or Tableau simply aren’t designed to handle the analysis. By examining the full richness of your data, BigML helps you focus on what’s really important—even if it’s traffic tickets.