The first challenge was to write a script that could pull all the text from every personal ad on the front page of the four different sections: w4m, m4w, casual encounters w4m, and casual encounters m4w. Surprisingly, I was able to accomplish all of this with 22 lines of Python code.
relevant xkcd
http://singapore.craigslist.com.sg/stp/2463324417.html
As you can see, the general pattern is:
city.[language.]craigslist.com[.country]/ad-section/post-id.html
From this I was able to write a regular expression to read through the front page of each ad section and gather all the post URLs.
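For the curious, here is a minimal sketch of that step. It is not the original 22-line script: the section path (`/w4m/`) and the link regex are my own assumptions about how the front page was laid out at the time, based on the example URL above.

```python
import re
from urllib.request import urlopen

# Hypothetical section front page; the real script looped over all four sections.
SECTION_URL = "http://singapore.craigslist.com.sg/w4m/"

html = urlopen(SECTION_URL).read().decode("utf-8", errors="ignore")

# Assumed link format, matching the example post URL: /ad-section/post-id.html
paths = re.findall(r'href="(/[a-z]+/\d+\.html)"', html)

# Rebuild absolute post URLs from the relative paths.
post_urls = ["http://singapore.craigslist.com.sg" + p for p in paths]
print(len(post_urls), "post URLs found")
```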
Each post page is also pretty standard, and very nicely wraps the ad text in a fixed set of markers:
<div id="userbody">Here is the ad text. <!-- START CLTAGS -->
This meant I could write another regular expression to extract only the text of the ad and write it to a text file. I then used TextWrangler to strip out any remaining HTML tags.
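As a rough illustration (again, not the original script), extracting the ad body could look something like this. The markers come from the snippet above; the output filename and the exact regex are assumptions, and the DOTALL flag just lets the match span multiple lines.

```python
import re
from urllib.request import urlopen

def extract_ad_text(post_url):
    """Return the text between the userbody div and the START CLTAGS comment."""
    html = urlopen(post_url).read().decode("utf-8", errors="ignore")
    match = re.search(r'<div id="userbody">(.*?)<!-- START CLTAGS -->', html, re.DOTALL)
    return match.group(1).strip() if match else ""

# Append each ad to a plain text file; leftover HTML tags were cleaned up
# afterwards in TextWrangler rather than in the script.
with open("ads_w4m.txt", "a", encoding="utf-8") as out:
    out.write(extract_ad_text("http://singapore.craigslist.com.sg/stp/2463324417.html") + "\n\n")
```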
[Word-frequency images for each section: w4m | m4w | casual encounters - w4m | casual encounters - m4w]
So there you have it. Geek victory and possible conversation starter. Now, as I said at the beginning of this post, I started this more for the tech challenge than for analysis of the results, so I'm not going to have a Jerry's Thought on this. I will, however, point out that one of the reasons the word "nine" is so prominent in the women's casual encounters section is that mobile numbers in Singapore all start with 9, I think.