Mon, 12 Jan 2009
Baypiggies, January 8: Scrape the web
I gave a presentation at the Thursday, January 8, Baypiggies meeting that was something of a preview of the web-scraping tutorial I'm giving at PyCon 2009. (Baypiggies is the Bay Area Python Interest Group.)
If you want to take a look at what I presented, here is what I have. Note that you can grab all of these in bulk by doing:
$ svn checkout http://svn.asheesh.org/svn/public/20082009/scraping-preso/
The presentation itself:
- preso.odp: The presentation (OpenDocument format)
The curry examples:
- examples/curry/regex: The regular expression I used for grabbing the menu part of the document.
- examples/curry/grab_page.py: A simple demo of urllib2.urlopen() to read in web pages as file objects.
- examples/curry/curry.py: A totally broken, not-even-fully-written regex-based script that grabs the curry page, checks that it is today's menu, extracts the menu from the page, removes the HTML tags, and prints it.
- examples/curry/return_to_curry.py: The same broken script, now showing how to strip HTML tags with BeautifulSoup instead.
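The gist of those curry scripts fits in a few lines. Here's a sketch using only the stdlib (modern Python 3 spelling; the originals use urllib2); the HTML snippet and the "menu" div id are made up for illustration, not the real site's markup:

```python
import re

# Hypothetical page source standing in for the curry house's menu page;
# the real pattern in examples/curry/regex targets that site's markup.
html = """<html><body>
<h1>Today's Menu</h1>
<div id="menu"><p>Chana masala</p><p>Saag paneer</p></div>
</body></html>"""

# Grab just the menu block...
match = re.search(r'<div id="menu">(.*?)</div>', html, re.DOTALL)
menu_html = match.group(1)

# ...then crudely strip the remaining tags. (BeautifulSoup, as in
# return_to_curry.py, is far more robust than this regex.)
menu_text = re.sub(r'<[^>]+>', ' ', menu_html)
print(' '.join(menu_text.split()))  # → Chana masala Saag paneer
```

The regex tag-stripping is exactly the kind of thing that breaks on real-world HTML, which is why the talk moves on to BeautifulSoup.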
The actually-working example, Cepstral's weather reading-aloud tool:
- examples/cepstral/weather.py: A tool that POSTs to Cepstral's demo page and grabs the resulting WAV file.
- examples/cepstral/mech.py: A way shorter version that shows how nice "mechanize" is.
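The core move in weather.py is just building a POST request and reading the response like a file. A minimal sketch in modern stdlib terms (the 2009 scripts spell this urllib2.Request / urllib2.urlopen); the URL and form-field names here are placeholders, since the real ones come from inspecting Cepstral's form:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields -- the real names come from viewing the
# source of Cepstral's demo form, as weather.py does.
fields = {'text': 'Sunny, high of 72', 'voice': 'David'}
data = urlencode(fields).encode('ascii')

# Attaching a body is what makes it a POST; urlopen(req) would then
# return the WAV bytes as a file-like object.
req = Request('https://example.com/tts-demo', data=data)
print(req.get_method())  # a Request with a body defaults to POST
```

mech.py gets the same effect in fewer lines because mechanize fills in and submits the form for you.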
Code snippets that might be useful:
- examples/string_matching.py: A quick reminder that sometimes you can get away with testing for the existence of a string as your entire information retrieval strategy.
- examples/user_agent.py: Code (borrowed from Twill) for changing your user agent.
- redemo.py: A regular expression GUI, available in the Python source tarball and various places on the web.
- examples/wget_demo.sh: A trivial demo of the fact that wget can download pages.
- examples/trivial_yahoo_search_with_mechanize.py: Post a search query to Yahoo with mechanize, including how to disable its robots.txt handling.
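Two of those snippets are small enough to sketch inline. First, the string-matching trick: sometimes checking for one substring is your entire scraper. Second, setting a user agent, which user_agent.py does through Twill but which with plain urllib is just a header on the Request (shown here in Python 3 spelling; urllib2.Request works the same way). The page text and agent string are made up:

```python
from urllib.request import Request

# (1) String matching: no parsing at all, just "is the phrase there?"
page = '<html><body>Status: all systems go</body></html>'
is_up = 'all systems go' in page

# (2) User agent: pass it as a header when building the request.
req = Request('https://example.com/',
              headers={'User-Agent': 'Mozilla/5.0 (scraping-demo)'})
```

Note that urllib normalizes header names with str.capitalize(), so you read it back as req.get_header('User-agent').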
Patches welcome! These were quickly half-baked on a truck ride provided by Jim Stockford. (-: I'll be revisiting them as I prepare further for my PyCon tutorial.