Saturday, December 13, 2014

A Cataclysm of Data

I just finished reading Dataclysm by Christian Rudder.  He is one of the founders of the OKCupid dating site and the author of OKTrends, a blog about interesting facts he finds in the data OKCupid gathers from its users.  Being an enthusiastic user of the site and having really enjoyed the blog posts, I was very interested to see what he would say when he had a whole book to say it in.  I was not disappointed, but I wasn't really surprised either.

The book is a good examination of the ways in which people talk about attraction and the ways in which they act - in particular it is interesting when these are not the same.  For example, most people agree that they would not date someone who was overtly biased against a particular race.  However, when we see who people actually send messages to and who they respond to, the evidence is clear that people have very strong racial biases.  In particular, Asian men and all black people have a really tough time of it.  It isn't just a few brutally racist people either, but rather a consistent bias across the vast majority of people.  It is noteworthy that racial bias is several times larger in the US than in Canada and many other countries.  I wasn't quite sure what to conclude from that, but it didn't exactly surprise me.

Rudder also talks a lot about the ways in which women and men behave very differently.  For example, women at age 20 tend to find 23-year-old men most attractive, and the ages trend upward together.  At the high end of the data, 50-year-old women find 45-year-old men most attractive.  Men, on the other hand, find 20-year-old women most attractive no matter what age the men are.  This drives message volumes in a big way - younger women get gazillions of messages whereas older women need to send messages if they want a good chance to connect with people.

The thing I liked most about the book is that Rudder acknowledges his biases and the ways in which his data is limited.  He makes it very clear that he understands that his source consists almost exclusively of single people and that he doesn't have enough data to make good observations for people above age 50.  Even though OKCupid is the destination of choice for polyamorous folks we are still a pretty small chunk of the population there - probably even smaller in numbers than the ostensibly monogamous cheaters who maintain their profiles without pictures.

Rudder also doesn't place himself above the users whose profiles he discusses.  When talking about racial biases he makes it clear that he thinks an individual's preferences in attraction aren't really something we can criticize, but that there are clear problems when certain groups consistently run into bias against them.  He makes it clear that he thinks racism is unacceptable but that he almost certainly has unconscious biases he is neither aware of nor proud of, just like most other people do.

I tend to place far more stock in someone who presents conclusions when those conclusions come with a huge helping of "We certainly cannot generalize to everyone" and "The data is very limited in this respect" because it shows that the writer knows their limitations.  Rudder does a good job this way and he treats his data as useful, which it is, but very carefully outlines the limits of what one can conclude from it.

The thing I wish Rudder had included in the book is the statistics about gay relationships.  He does talk about that lack, though, and says that it would have bulked up the book tremendously but wouldn't have added much since the trends across genders were actually pretty much the same.  Fair enough, at least he considered it and rejected it for a decent reason.

Speaking of gender, OKCupid is soon going to roll out more gender options than M and F, and that is a good thing.  Most users won't really notice a difference, but it will be a big positive change for people who want to identify as nonbinary or trans.  It will also let people like me become cis men instead of just men, and I like that.  All of Rudder's data is strictly divided into men and women in the book because until now those were the only choices.  I will be happy to read the next version, which hopefully has data for other gender identities so we can take a peek into the Big Data there.  Of course it is possible that such a data set is too small to draw strong conclusions from; honestly I don't know.

At any rate I think this is a book worth reading.  It has lots of interesting data to look through, is well written, and doesn't try to overreach with conclusions it can't really justify.  I approve.
