Hacking on the U.S. Census
Machine learning models are excellent for helping people make decisions based on data. One area where such models might be helpful in particular is public policy. The U.S. Census paints a comprehensive picture of the population of the United States, and with some digging, we should almost certainly be able to find some useful insights.
For example, how might education and income levels in a given area of the U.S. (say, a county) be correlated with the poverty levels in the area? Perhaps higher median incomes for the highly educated drive the local economy and bring people out of poverty. Or perhaps a low number of high school graduates drives the poverty rate up. What about gender? Does low educational attainment for females affect the poverty rate differently than for males?
Thankfully, the U.S. Census has made an API available to help us answer these questions. With a quick call to the API, we can download the educational, income, and poverty levels for each county in the U.S.
There are a few minor problems with the data as is, though:
- You can only retrieve eight columns of data at a time. This means we’ll have to make several API calls and put together our data file afterwards.
- The data are expressed as total numbers of people rather than percentages. Totals aren’t very useful to us here; a model trained on totals will tell us only that there are more people in poverty where there are more people in general. For example, there are hundreds of times more people with a Ph.D. in New York City than there are in most other cities in the U.S. There are also hundreds of times more people living in poverty. Does this mean that Ph.D.s are correlated with poverty? Of course not. It just means that there are lots of people in New York. Thankfully, the census provides us with the overall totals for each measurement, and so dividing by that number gives us proper percentages.
- For educational levels, we’d rather have a running total than individual categories. Take the percentage of people claiming they have a “high school” education level. If that percentage were lower, would that mean the citizens were more or less educated? Maybe and maybe not. It could mean that fewer people graduate high school or it could mean that more people graduate college. One easy solution is to sum up columns of data so we get “high school or more”, “associate’s degree or more”, etc. This way, an increase in any of these numbers means a more educated populace.
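The second and third fixes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names and the numbers for the imaginary county are made up, and the actual script mentioned in this post uses shell tools rather than Python.

```python
# Education levels ordered from lowest to highest.
# These column names are hypothetical, not the census API's own.
LEVELS = ["less_than_hs", "high_school", "associates", "bachelors", "graduate"]


def to_percentages(counts, total):
    """Convert raw head counts into percentages of the county total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return {name: 100.0 * n / total for name, n in counts.items()}


def or_more(percentages, levels=LEVELS):
    """Build 'X or more' columns by summing from the highest level down.

    The lowest level is skipped, since 'lowest level or more' is always 100%.
    """
    result = {}
    running = 0.0
    for level in reversed(levels[1:]):
        running += percentages[level]
        result[level + "_or_more"] = running
    return result


# Invented numbers for one imaginary county, for illustration only.
counts = {
    "less_than_hs": 200,
    "high_school": 400,
    "associates": 100,
    "bachelors": 200,
    "graduate": 100,
}
pct = to_percentages(counts, total=1000)
cumulative = or_more(pct)
print(cumulative)
```

With columns built this way, a larger value in any “or more” column always means a more educated county, which is the property the running totals are meant to provide.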
All of these problems are very easy to solve. In fact, with Unix utilities such as awk, combined with BigML’s REST API, we can do everything in a pretty short shell script, which you can see, download, and use however you like by checking out this GitHub gist. We end up with the demographic information for a little over 3,000 U.S. counties in our data file.
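For readers who prefer Python to shell, the eight-column limit could be handled roughly as follows. This is a sketch, not the script from the gist: `fake_fetch` is a hypothetical stand-in for the real census API call (whose URL and parameters are not shown here), and the column names and county ids are placeholders.

```python
def batches(columns, size=8):
    """Split a list of column names into chunks of at most `size`."""
    return [columns[i:i + size] for i in range(0, len(columns), size)]


def merge_by_county(responses):
    """Combine per-batch rows, keyed by county id, into one row per county."""
    merged = {}
    for rows in responses:
        for county_id, values in rows.items():
            merged.setdefault(county_id, {}).update(values)
    return merged


def fake_fetch(cols):
    """Hypothetical stand-in for one API call; returns zeros for two counties."""
    return {
        "county_a": {c: 0 for c in cols},
        "county_b": {c: 0 for c in cols},
    }


# Twenty placeholder column names need three calls: 8 + 8 + 4.
columns = ["col%02d" % i for i in range(20)]
groups = batches(columns)
table = merge_by_county(fake_fetch(g) for g in groups)
print(len(groups), len(table["county_a"]))
```

Keying every batch by county id is what lets the separate responses be stitched back into a single row per county.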
After that, it’s just a matter of logging on to BigML and doing a one-click model to analyze the data.
Browsing through the model (which you can do, too, by clicking here), we can see some statistics are more correlated than others with the poverty level. The first of these is, perhaps unsurprisingly, the median income for people with a high-school level of education. One can imagine that this is the population most likely to “go either way”, so to speak. If this population is doing well, then it means less poverty in general. Note that this statistic appears to be more important than the median income for people with below a high school level of education. One reason for this might be that even if the median income for this group is higher than average, it is still unlikely that it will be above the poverty level.
Another attribute that is very important in the model is level of education of women. Specifically, if the high school graduation rate for women is low, it tells us a good deal about the poverty rate: Counties with graduation rates below 83% for women have more than 1.5 times the average poverty rate for the counties that have graduation rates above that threshold.
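A comparison like this one can be checked directly against the data file. The sketch below assumes each county is a record with two hypothetical fields, `female_hs_grad` and `poverty`, both in percent, and it compares unweighted county averages, which is only one possible reading of “average poverty rate”. The four counties are invented and are not census figures.

```python
def poverty_ratio(counties, threshold=83.0):
    """Mean poverty rate below the threshold divided by the mean at or above it."""
    below = [c["poverty"] for c in counties if c["female_hs_grad"] < threshold]
    above = [c["poverty"] for c in counties if c["female_hs_grad"] >= threshold]
    if not below or not above:
        raise ValueError("need at least one county on each side of the threshold")
    return (sum(below) / len(below)) / (sum(above) / len(above))


# Invented toy counties, for illustration only.
toy_counties = [
    {"female_hs_grad": 75.0, "poverty": 24.0},
    {"female_hs_grad": 80.0, "poverty": 20.0},
    {"female_hs_grad": 88.0, "poverty": 12.0},
    {"female_hs_grad": 93.0, "poverty": 10.0},
]
ratio = poverty_ratio(toy_counties)
print(ratio)
```

A ratio above 1.5 on the real data would match the split described above; weighting each county by its population would be a reasonable alternative and could give a different number.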
In fact, the effect of these two variables is so powerful that knowing the values for only these two, along with the median income for associate’s degree holders, is enough to pick out the bottom 5% of all counties for poverty.
Does this mean we should immediately begin a campaign to improve high school graduation rates among women, in the name of reducing poverty? Let’s not be too hasty. In particular, this model says nothing about correlation versus causation. Suppose a girl from a poor family leaves high school in order to get a job and help support them, and remains below the poverty line as a result. Did her lack of education cause her poverty, or vice versa? Sometimes it is difficult to say, especially when effects span generations.
Nonetheless, the model provides an interesting view of the data and a basis for further investigation. What story might your data tell? Request a BigML invitation and find out.