The first challenge was to write a script that could pull all the text from every personal ad on the front page of the four different sections: w4m, m4w, casual encounters w4m, and casual encounters m4w. Surprisingly, I was able to accomplish all of this with 22 lines of Python code.
relevant xkcd
http://singapore.craigslist.com.sg/stp/2463324417.html
As you can see, the general pattern is:
city.[language.]craigslist.com[.country]/ad-section/post-id.html
From this I was able to write a regular expression to read through the front page of each ad section and gather all the post URLs.
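For the curious, here is a minimal sketch of that step. It is not the original 22-line script: the section path (`/w4m/`) and the link regex are my own assumptions about how the front page was laid out at the time, based on the example URL above.

```python
import re
from urllib.request import urlopen

# Hypothetical section front page; the real script looped over all four sections.
SECTION_URL = "http://singapore.craigslist.com.sg/w4m/"

html = urlopen(SECTION_URL).read().decode("utf-8", errors="ignore")

# Assumed link format, matching the example post URL: /ad-section/post-id.html
paths = re.findall(r'href="(/[a-z]+/\d+\.html)"', html)

# Rebuild absolute post URLs from the relative paths.
post_urls = ["http://singapore.craigslist.com.sg" + p for p in paths]
print(len(post_urls), "post URLs found")
```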
Each post page is also pretty standard, and very nicely wraps the ad text in a fixed set of markers:
<div id="userbody">Here is the ad text. <!-- START CLTAGS -->
This meant I could write another regular expression to extract only the text of the ad and write it to a text file. I then used TextWrangler to strip out any remaining HTML tags.
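As a rough illustration (again, not the original script), extracting the ad body could look something like this. The markers come from the snippet above; the output filename and the exact regex are assumptions, and the DOTALL flag just lets the match span multiple lines.

```python
import re
from urllib.request import urlopen

def extract_ad_text(post_url):
    """Return the text between the userbody div and the START CLTAGS comment."""
    html = urlopen(post_url).read().decode("utf-8", errors="ignore")
    match = re.search(r'<div id="userbody">(.*?)<!-- START CLTAGS -->', html, re.DOTALL)
    return match.group(1).strip() if match else ""

# Append each ad to a plain text file; leftover HTML tags were cleaned up
# afterwards in TextWrangler rather than in the script.
with open("ads_w4m.txt", "a", encoding="utf-8") as out:
    out.write(extract_ad_text("http://singapore.craigslist.com.sg/stp/2463324417.html") + "\n\n")
```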
[Word-frequency images for each section: w4m | m4w | casual encounters - w4m | casual encounters - m4w]
So there you have it. Geek victory and possible conversation starter. Now, as I said at the beginning of this post, I started this more for the tech challenge than for analysis of the results, so I'm not going to have a Jerry's Thought on this. I will, however, point out that one of the reasons the word "nine" is so prominent in the women's casual encounters section is that mobile numbers in Singapore all start with 9, I think.