Tue, 22 Dec 2009
PyCon 2010: "Scrape the Web," and a poster session
"Scrape the Web," my PyCon tutorial on web scraping, is back this year! Plus, I'll be leading a conversation at the poster session on how to get involved with Free Software.
This year's Python conference takes place February 19-21 in Atlanta, Georgia, USA.
Poster session
This is the first year PyCon is holding a poster session. My poster is on open source and Free Software for the Python community, focusing on how you can get involved.
It's a plenary session. This means that for 90 minutes, a dozen of us presenters will be standing in front of our posters hoping PyCon attendees will talk to us. Everyone at PyCon will be milling about, since there will be no talks during the poster session. So stop by!
Web scraping tutorial
I had lots of fun last year talking to a packed room about programming the web. The World Wide Web is the world's most widely used distributed computing system; if you're only using it from a web browser, you're missing out. The session is a tutorial: a paid three-hour course (with refreshments) in a classroom setting. Based on what last year's attendees said afterward at lunch, they enjoyed themselves too!
From Python, there's a host of choices for pulling information from the web, and a few choices for pushing data back (usually through forms). Here are some topics we'll cover:
- The sorry (if humorous) state of standards on the web
- Evaluating web page parsing engines
- Why you don't want to use regular expressions (and how you can anyway)
- Submitting to forms
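To give a flavor of the parsing-versus-regular-expressions topic, here is a minimal sketch (not taken from the tutorial materials). It assumes Python 3 and uses only the standard library's html.parser; the sample markup and the names LinkCollector, links_with_parser, and links_with_regex are made up for illustration. The markup is deliberately sloppy, the way real pages tend to be.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup: an uppercase tag with an unquoted attribute,
# a single-quoted attribute, a link hidden in a comment, and an unclosed tag.
SAMPLE_HTML = """
<html><body>
<A HREF=first.html>First</A>
<a class="x" href='/second'>Second</a>
<!-- <a href="/commented-out">Old</a> -->
<a href="/third">Third
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_with_parser(markup):
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


def links_with_regex(markup):
    # A naive pattern: lowercase tag, href first, double quotes only.
    return re.findall(r'<a href="([^"]*)"', markup)


if __name__ == "__main__":
    print("parser:", links_with_parser(SAMPLE_HTML))
    print("regex: ", links_with_regex(SAMPLE_HTML))
```

On this sample, the naive pattern misses the uppercase and single-quoted links and picks up the one inside the comment, while the parser skips the comment and finds the three live links. A more careful regular expression can close some of those gaps, which is the "how you can anyway" part.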
I think the most exciting part is the discussion of getting around anti-scraping countermeasures. This is where the rubber hits the road. We'll:
- Write Python code that automatically submits comments to a WordPress blog protected by WP Hashcash
- Choose a user-agent so Google doesn't block us immediately
- Look at deployed software that has built-in CAPTCHA solving
- Compare the effectiveness of different countermeasures
- Automate a completely AJAX website by mechanizing Firefox itself
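As a small illustration of two of the pieces above, submitting a form and choosing a user-agent, here is a hedged sketch using Python 3's standard urllib. It is not the tutorial's code: the URL, the field names, the user-agent string, and the helper build_form_request are all assumptions made up for the example, and it has nothing to do with how WP Hashcash or any particular site works. It only builds the request and never sends it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_request(url, fields, user_agent):
    """Build (but do not send) a form POST with an explicit User-Agent."""
    # Form fields are sent as a URL-encoded byte string in the request body.
    body = urlencode(fields).encode("ascii")
    return Request(
        url,
        data=body,  # supplying data makes urllib treat this as a POST
        headers={
            "User-Agent": user_agent,
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


# Hypothetical endpoint, fields, and user-agent string.
request = build_form_request(
    "http://example.com/comment",
    {"author": "Ada", "comment": "Nice post!"},
    "example-scraper/0.1 (+http://example.com/contact)",
)

if __name__ == "__main__":
    print(request.get_method())
    print(request.get_header("User-agent"))
    print(request.data)
```

Passing the result to urllib.request.urlopen would actually submit it. urllib's default user-agent announces itself as Python-urllib, which is the value this sketch overrides with an explicit one.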
Last year's version is online as a video. If you missed it last year, register for PyCon and sign up for my tutorial, "Scrape the Web." You're likely to learn a lot, and I'm always happy to answer questions during and afterward.
Brian Gershon, one of last year's attendees, explained it best:
Why use an API, when you can just grab it off the page? :)