Sat, 03 Apr 2010
After "Scrape the Web", 2010
As you might remember, I gave my Scrape the Web tutorial again at PyCon this year.
I finally got around to publishing the 2010 materials on the web. So without further ado:
- A video of me on blip.tv
- A tar.gz file of code samples
- A brief four-page cheat sheet
- My full slides, photos, jokes and all
I also referred to some useful tools in the talk. You might want to check these out:
- Selenium IDE, for WYSIWYG scraping code generation
- Selenium RC, for reaching right into a web browser and having that do your page loading
- the everyblock templatemaker (see the cheat sheet)
- Firequark, for finding CSS selectors of elements on the page
- FireBug, for the magical "Inspect Element"
And the old standbys:
- mechanize
- lxml.html
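The two standbys above pair naturally: mechanize drives the fetching and form-filling, and lxml.html does the parsing. As a rough standard-library-only analogue of that parsing step (not the tutorial's actual code; the page HTML here is made up), here is a parser that pulls every link off a page:

```python
from html.parser import HTMLParser

# mechanize + lxml.html were the tools named above; as a rough
# stdlib-only analogue, this parser collects the href and text of
# every <a> element -- the kind of one-off extraction the tutorial
# examples do with lxml.html.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []            # (href, text) pairs
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None

page = '<p>See the <a href="/menu">menu</a> and <a href="/hours">hours</a>.</p>'
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # [('/menu', 'menu'), ('/hours', 'hours')]
```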
My mini feedback for myself:
My WP Hashcash demo didn't go as planned this year, but it's still possible in theory. The attack in the code still works against last year's version of WP Hashcash. Kids, don't upgrade your demo site the night before your presentation!
Speaking of "the night before," again I didn't sleep very much before the talk. I think that worked better for me last year. In the future, if I basically stay up all night, I should give a talk before noon.
So I actually think last year's video was probably better, though I haven't watched either one in full.
Take freely from the code samples and "cheat sheet"!
Tue, 22 Dec 2009
PyCon 2010: "Scrape the Web," and a poster session
"Scrape the Web," my PyCon tutorial on web scraping, is back this year! Plus I'll be leading a conversation on how to get involved with Free Software from my poster at the poster session.
This year's Python conference takes place February 19-21 in Atlanta, Georgia, USA.
Poster session
This year is the first year PyCon is holding a poster session. My poster is on open source and Free Software for the Python community, focusing on how you can get involved.
It's a plenary session. This means that for 90 minutes, a dozen of us presenters will be standing in front of our posters hoping PyCon attendees will talk to us. Everyone at PyCon will be milling about, since there will be no talks during the poster session. So stop by!
Web scraping tutorial
I had lots of fun last year talking to a packed room about programming the web. The World-Wide Web is the world's most widely-used distributed computing system; if you're only using it from a web browser, you're missing out. It's a tutorial, which is a paid three-hour course (with refreshments) in a classroom setting. Based on what last year's attendees said afterward at lunch, it seemed the attendees enjoyed themselves too!
From Python, there's a host of choices for pulling information from the web, and a few choices for pushing data back (usually through forms). Here are some topics we'll cover:
- The sorry (if humorous) state of standards on the web
- Evaluating web page parsing engines
- Why you don't want to use regular expressions (and how you can anyway)
- Submitting to forms
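To give a flavor of the regular-expressions topic above: a hand-written regex tends to bake in assumptions (attribute order, quoting style) that real pages quietly violate, while an HTML parser does not. A minimal illustration, with made-up HTML rather than anything from the tutorial:

```python
import re
from html.parser import HTMLParser

# A naive regex for <img src="...">: it assumes src is the first
# attribute and is double-quoted.
IMG_SRC = re.compile(r'<img src="([^"]+)"')

html_a = '<img src="cat.png">'
html_b = "<img alt='a cat' src='cat.png'>"  # different attribute order and quoting

print(IMG_SRC.findall(html_a))  # ['cat.png']
print(IMG_SRC.findall(html_b))  # [] -- the regex silently misses it

# An HTML parser reads the attributes no matter how they are written.
class ImgSrcs(HTMLParser):
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.srcs.append(dict(attrs).get("src"))

for page in (html_a, html_b):
    p = ImgSrcs()
    p.feed(page)
    print(p.srcs)  # ['cat.png'] both times
```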
I think the most exciting part is the discussion of getting around anti-scraping countermeasures. This is where the rubber hits the road. We'll:
- Write Python code that automatically submits comments to a WordPress blog protected by WP Hashcash
- Choose a user-agent so Google doesn't block us immediately
- Look at deployed software that has built-in CAPTCHA solving
- Compare the effectiveness of different countermeasures
- Automate a completely AJAX website by mechanizing Firefox itself
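On the user-agent point above: many sites serve different content, or an outright block, to a default `Python-urllib` user agent. The tutorial code uses mechanize/Twill for this; here is a standard-library sketch of the same idea, with an example header value that is an illustration, not a recommendation:

```python
import urllib.request

# Build a request that identifies itself as a browser-like client
# rather than as Python-urllib/3.x.
req = urllib.request.Request(
    "http://example.com/",
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) DemoScraper/0.1"},
)

print(req.get_header("User-agent"))
# urllib.request.urlopen(req) would then fetch the page with that header.
```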
Last year's version is online as a video. If you missed it last year, register for PyCon and sign up for my tutorial, "Scrape the Web." You're likely to learn a lot, and I'm always happy to answer questions during and afterward.
Brian Gershon, one of last year's attendees, explained best:
Why use an API, when you can just grab it off the page? :)
Fri, 27 Mar 2009
PyCon09: "Scrape the Web" is over
For my attendees, and anyone else following along at home:
Thanks for coming! I had a great time at the talk, and I already wrote a little bit about how much fun I had. I wanted to be sure to conclude the tutorial with some next steps for you all.
The presentation
Some crucial links:
- My presentation: the actual PDF I used
- Sample code: in the examples directory (note you can "svn checkout" that too)
The demos that didn't work
There were two demos that were less smooth than they ought to have been.
- Selenium RC, and
- WordPress Hash Cash.
For WordPress Hash Cash, there's the lovely simple code I wrote that uses python-spidermonkey to post comments to a blog with HashCash. One such blog is online at http://pycon09.asheesh.org/hashcash/ ; please try it! (The reason it didn't work was some heavy load on my server during the talk.)
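For the curious: WP Hashcash has client-side JavaScript compute a token that the comment form submits, on the theory that bots don't run JavaScript, and the demo code evaluates that JavaScript from Python via python-spidermonkey. As a toy illustration of the idea only (not the real WP Hashcash code, and with a made-up page and field name), suppose the page embedded a simple arithmetic challenge:

```python
import ast
import operator
import re

# Toy stand-in for a hashcash-style challenge: the page embeds
# JavaScript that assigns an arithmetic result to a hidden field,
# and the scraper computes the same value without a browser.
page = '<script>document.forms[0].token.value = 1432 + 87 * 3;</script>'

expr = re.search(r'token\.value = ([^;]+);', page).group(1)

# Evaluate only +, -, * on integer literals -- a safe subset, enough
# for this toy challenge (real pages need a JavaScript engine).
OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

def safe_eval(node):
    if isinstance(node, ast.Expression):
        return safe_eval(node.body)
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](safe_eval(node.left), safe_eval(node.right))
    if isinstance(node, ast.Constant) and isinstance(node.value, int):
        return node.value
    raise ValueError("unsupported expression")

token = safe_eval(ast.parse(expr, mode="eval"))
print(token)  # 1693
```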
For Selenium RC, you can see the sample code in examples/seleniumrc/. There is a README in that directory that explains clearly how to run that code. (It didn't work for the same reason.)
The future
I'm available for questions, both hands-on at PyCon and by email after the weekend. Just email me!
"Scrape the web" at PyCon: lots of fun!
Thursday morning at 9 am, I gave my scheduled tutorial at PyCon: Scrape the Web: Strategies for programming websites that don't expect it.
For those of you who attended, thank you! You made it loads of fun. The tutorial was supposedly full at 30 people, but in fact we had at least five more; at the halfway break, staff added another table to the room so that those of you standing in the back could sit down!
Because I was so behind on so many things from travel, I stayed up all night before the talk. This is actually fun for me, as we saw at Debconf last year. So I arrived at the talk energized and with my examples fleshed-out (for the most part).
There were a few ways I knew things were going well.
Early on in the talk, Nathan Yergler arrived and saw that we were scraping information from the CC lunch mainstay Mehfil Indian, nicknamed "Curry in a hurry." This caused a ricochet of smiles between me at the front and Nathan at the back; I hope that helped the mood for others, too!
Throughout the talk, the audience looked happy, and they felt comfortable enough to stop me and ask questions. Knowing that the audience feels comfortable participating is crucial for me. Participation and questioning are part of learning; they are also the best way for me to know how to tailor what topics I cover to the people in the room.
After the talk, some attendees handed in evaluation forms. One man asked me what he should do with his. "I've been putting them in this box face-down," I explained.
He suggested, "This one you ought to see face-up!"
About five people came up after the talk and asked me specific questions. One was a young lady who attended my preview talk at Baypiggies in January, which was great to see.
The same number came up to me at the end and thanked me for a good talk, which was very rewarding. One asked how often I give talks at conferences. I mentioned my OSCON talk with Nathan, and wondered to myself what other conference sessions I had led. He urged me, "You really ought to make speaking part of your career. You're a great speaker."
I followed a couple of attendees to lunch; one pointed out the room had been Twittering madly during the talk. A search of Twitter shows a lot of positive comments. (He also pointed out that someone else is "paulproteus" on Twitter.)
- pydanny: is happy with what he got out of Asheesh Laroia's Scrape the Web tutorial
- AceGopher: So much goodness in Scrape the Web!
- http://identi.ca/phrakture (on identi.ca): Referer is misspelled in the HTTP spec - hah
- sgbirch: Screen saving talk off to a great start. Asheesh Laroia is a good speaker, this is informative and fun.
- brianfive: enjoyed the "Scraping the Web" tutorial at #pycon. Why use an API, when you can just grab it off the page? :)
Basically, everybody loves me. Yay!
By the end of lunch, I was fading from the lack of sleep. I took a six-hour nap, and I woke up after all the official PyCon proceedings were over. I read an email from Greg Lindstrom, organizer of the Tutorials series of talks at PyCon. His email began:
It's Thursday night and I wanted to tell you how happy I am with the tutorials over the past two days. I haven't looked at the survey results yet -- give me a couple weeks on that; I'll share the results with you -- but the comments I heard were overwhelmingly positive. My favorite was overhearing someone ask "how did that kid in 'scrape the web' learn all of that?".
I giggled about this as I walked to dinner with Nathan.
What a great start to PyCon!
Mon, 12 Jan 2009
Baypiggies, January 8: Scrape the web
I gave a presentation at the Thursday, January 8, Baypiggies meeting that was something of a preview of my scraping Tutorial that's coming up at PyCon 2009. (Baypiggies is the Bay Area Python Interest Group.)
If you want to take a look at what I presented, here is what I have. Note that you can grab all of these in bulk by doing:
$ svn checkout http://svn.asheesh.org/svn/public/20082009/scraping-preso/
The presentation itself:
- preso.odp: The presentation (OpenDocument format)
The curry examples:
- examples/curry/regex: The regular expression I used for grabbing the menu part of the document.
- examples/curry/grab_page.py: A simple demo of urllib2.urlopen() to read in web pages as file objects.
- examples/curry/curry.py: A totally broken, half-written regex-based script that grabs the curry page, checks that it is today's menu, extracts the menu from the page, removes the HTML tags, and prints it.
- examples/curry/return_to_curry.py: The same broken script that now shows how to strip HTML tags using BeautifulSoup.
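The tag-stripping step that return_to_curry.py does with BeautifulSoup can be sketched with the standard library alone. This is a rough analogue, not the script's actual code, and the menu HTML is invented:

```python
from html.parser import HTMLParser

# Collect only the text nodes, dropping every tag -- roughly what the
# example script asks BeautifulSoup to do to the menu HTML.
class TagStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)

menu_html = "<ul><li>Chicken <b>tikka</b> masala</li><li>Saag paneer</li></ul>"
stripper = TagStripper()
stripper.feed(menu_html)
print(stripper.text())  # Chicken tikka masalaSaag paneer
```

Note that dropping tags also drops their layout, so adjacent list items run together; real scripts usually join the collected chunks with a separator instead.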
The actually-working example, Cepstral's weather reading-aloud tool:
- examples/cepstral/weather.py: A tool that POSTs to the page and grabs the resulting WAV file.
- examples/cepstral/mech.py: A way shorter version that shows how nice "mechanize" is.
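The POST that weather.py performs can be sketched with urllib alone; the URL and form field names below are placeholders, not the real Cepstral ones. (mech.py gets the same effect in a few lines of mechanize.)

```python
import urllib.parse
import urllib.request

# Encode the form fields the way a browser would and attach them as
# the request body -- attaching data is what makes this a POST.
fields = {"text": "Partly cloudy, high of 60.", "voice": "demo"}
body = urllib.parse.urlencode(fields).encode("ascii")

req = urllib.request.Request("http://example.com/speak", data=body)

print(req.get_method())  # POST
print(req.data)
# urllib.request.urlopen(req).read() would then return the WAV bytes.
```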
Code snippets that might be useful:
- examples/string_matching.py: A quick reminder that sometimes you can get away with testing for the existence of a string as your entire information retrieval strategy.
- examples/user_agent.py: Code (taken from Twill) for changing one's user agent
- redemo.py: A regular expression GUI, available in the Python source tarball or random places on the web
- examples/wget_demo.sh: A trivial demo of the fact that wget can download pages.
- examples/trivial_yahoo_search_with_mechanize.py: Post a query to Yahoo, including disabling robots.txt.
Patches welcome! These were quickly half-baked on a truck ride provided by Jim Stockford. (-: I'll be revisiting them later as I prepare more for my PyCon talk.