If you ever find yourself suddenly thrown back in time on board the Titanic (perhaps Quantum Leap style), BigML has a few pointers.
We’ve recently been automating some basic natural language processing tricks to help our users tackle data containing unstructured text. First we do language detection, stemming, and stop-word removal. Then we select commonly occurring terms that might be useful for building predictive models. We’ll write more about those details later, when we’re ready to bring the feature to production. But we’ve already been pleasantly surprised by the results.
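To make those steps a little more concrete, here is a minimal Python sketch of stop-word removal, stemming, and term selection. It is not BigML's actual pipeline: the tiny stop-word list, the `crude_stem` suffix stripper (a stand-in for a real stemmer such as Porter's), the `min_docs` threshold, and the sample messages are all assumptions made for illustration, and language detection is skipped by assuming English.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real pipeline
# would use a fuller list chosen per detected language.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


def crude_stem(token):
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase, split into alphabetic tokens, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def common_terms(documents, min_docs=2):
    """Return terms found in at least min_docs documents, most common first."""
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document (document frequency).
        doc_freq.update(set(tokenize(doc)))
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, n in ranked if n >= min_docs]


# Made-up sample messages, loosely in the spirit of SMS spam.
messages = [
    "Free tickets waiting for the winners",
    "Claim a free ticket, winner",
    "Meeting moved to noon",
]
print(common_terms(messages))
```

Each selected term can then become a yes/no feature ("does this text contain the term?") that a decision tree is free to split on.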
Internally, we use some Titanic data as part of our regression test suite (and we’ve blogged about a different version of the dataset in the past). It captures some information about each passenger, like name, age, class, fare price, profession, and whether the passenger survived. It’s nice for a regression test since it combines a variety of data types. It has both numeric features (like age and fare price) and categorical data (like class).
When we turned on the automatic text processing for the first time, the Titanic decision tree changed significantly. A text field, passenger name, became the most influential field. That seemed fishy to us. Normally unique identifiers (like name, customer ID, etc.) aren’t useful at all. We expected the new text processing to work well for some of our other datasets (like detecting spam in SMS messages), but we didn’t expect a change in the Titanic model.
Exploring the model (see a sunburst-style preview of the model here), however, made it all immediately clear. The tree had cleverly found a pattern that we’d entirely overlooked.
The captain of the Titanic famously prioritized getting women and children onto lifeboats. Our Titanic data contained an age field, but none for gender. What our new model discovered was that the passenger names contained honorifics that made a fantastic stand-in for age and gender. That led the first split in the tree to check whether a name includes “Mr.” (as opposed to “Mrs.”, “Master”, or “Miss”).
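The kind of test that first split performs can be sketched in a few lines of Python. This is only an illustration under assumptions: the `honorific` and `is_mister` helpers are hypothetical names, the sample names are invented, and the real model works on extracted terms rather than a hand-written regular expression.

```python
import re

# Match one of the four honorifics as a whole word. Word boundaries keep
# "Mr" from matching inside "Mrs", and "Miss" from matching inside "Missy".
HONORIFIC_PATTERN = re.compile(r"\b(Mrs|Mr|Master|Miss)\b")


def honorific(name):
    """Return the honorific found in a passenger name, or None."""
    match = HONORIFIC_PATTERN.search(name)
    return match.group(1) if match else None


def is_mister(name):
    """True when the name contains "Mr." (and not "Mrs.", "Master", or "Miss")."""
    return honorific(name) == "Mr"


# Invented names in an assumed "Surname, Title. Given names" layout.
for passenger in ["Doe, Mr. John", "Doe, Mrs. Jane", "Roe, Master Tom", "Roe, Miss Ann"]:
    print(passenger, "->", honorific(passenger), "| mister:", is_mister(passenger))
```

Because "Master" was typically used for boys and "Miss" or "Mrs." for women, a single check like this separates adult men from everyone else, which is why it works as a proxy for both age and gender.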
What are some of the takeaways from the model? If, during your jump back in time, you have any control over whom you inhabit, these are BigML’s official recommendations.
First, don’t be a mister. Only 20% of the Titanic’s misters lived, while 64% of the non-misters survived.
If you manage to be a non-mister, try to also be first class. You’d be in great shape with a 93% chance to survive! Second class is good too, but your chances drop to a more pedestrian 83%.
Finally, if you ignore our advice and become a mister anyway, then at least go for the deck crew. This gives you a much better chance than the average mister. A solid 68% of the deck crew survived. So yes, it helps to be one of the misters deciding who gets on the lifeboat.
Explore the model for yourself! http://bl.ocks.org/ashenfad/5979156