2012-08-29

Web scraping tutorial

I've seen this video about scraping websites, and after trying it out and finding it fun, I want to write a short and quick tutorial on how to do it. If you know of a handy tool that I haven't written about and that might fit this tutorial, don't hesitate to comment!

Suppose you're looking for a job. But you're not satisfied clicking on every link you see on a job site such as Craigslist. You need an automated solution to do that for you, because it's too repetitive and boring. But that's what computers are for! Let's turn boring repetitiveness into exciting fascination!

Scraping means getting structured data out of a website with an automated tool. Website owners don't tend to look kindly on it: a bot doesn't look at any advertisements, so they can't make money off it, yet it still relentlessly uses up their bandwidth. For that matter, it may well sit in a moral gray area.

For this, I use Python and one of its modules, lxml. There are also mechanize, BeautifulSoup, and Hpricot, but those are beyond the scope of this tutorial. Watch the video for more.
Also, a better interactive interface for Python is IPython, which feels more like an environment than a simple interpreter: it has auto-completion, syntax coloring, and detailed tracebacks, and I think it looks better than IDLE even though it's a console app.
On Ubuntu, installing these was as easy as typing:
sudo apt-get install python-lxml ipython

First, we need a way to tell our Python script which elements of a page we want. We are going to use these things called "CSS Selectors". They're some sort of rules specifying what should and what shouldn't be matched - sort of like regular expressions, but specifically designed for web pages. Here's an example of one matching all bold elements on a page:
b
There you go. Looks simple, right? It's just the letter b. Now here's one that matches all bold, italic, and level-1 heading text:
b, i, h1
You get the idea. But real expressions used for finding more specific content aren't as simple. Some might include element IDs, elements following other elements, parents and children, attributes, and so on. Here's the one that matches the job links on the Craigslist page we'll use below:
.row a:nth-child(2)
This one matches every link that is the second child of its parent and sits inside an element with the class "row". If you want the details, look here.
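To make that concrete, here's some hypothetical markup (not Craigslist's actual HTML, just an illustration). Each listing sits in an element with class "row"; the first link inside it is something else, and the second link - the one the selector picks - is the job itself:
<p class="row">
    <a href="#">save this posting</a>
    <a href="/jjj/web-developer-wanted.html">Web developer wanted</a>
</p>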

But you don't need to learn this syntax in order to do useful stuff. Thank goodness, since it would be such a tedious task. It turns out there's a tool that allows you to visually mark elements and generate the simplest expression for you. It's called SelectorGadget. Such a creative name.

Install it. It's a bookmarklet: a bit of JavaScript that runs on the page you're viewing, right in your browser.

For this tutorial, we're going to collect all the links to job pages on Craigslist Romania. These could, for example, be saved into a text file, one per line; then you could call:
wget -i file_with_links.txt
to download each of them into the current directory.

Ok. With the SelectorGadget installed, visit the page and click on the bookmarklet. Be warned, all hell may break loose if you do that. Then, click on a link to a job.
When you click on it, lots of things may turn yellow. That means those things have also been selected by the generated selector. To mark one as unwanted, click it again; it will turn red and serve as a counterexample of what you want. Keep doing this until only the links to the jobs are selected. It may take a few clicks to get exactly what you want, and it's not guaranteed to find a solution, but when it does, it saves you a whole lot of effort.

Now that we have this... alien CSS Selectonator thingy... we save it for later. We're going to use it in our program, even though we don't fully understand it. Computers have a mind of their own.

Now, let's learn how to use lxml.html! I picked this one because it's quite robust (it hasn't let me down yet) and very fast: it's a C extension binding the libxml2 library, so it's about as fast as you can get. Fire up IPython and enjoy the view:

Python 2.7.3 (default, Aug  1 2012, 05:14:39) 
Type "copyright", "credits" or "license" for more information.

IPython 0.12.1 -- An enhanced Interactive Python.

?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]:


Ah, bliss!
Now, assuming you did install lxml, write these commands:
from lxml.html import parse
import urllib2
url = urllib2.urlopen('http://bucharest.craigslist.org/jjj/')
Now, the raw HTML would be available as url.read(), but lxml can read from the file-like object directly. We want to parse it and get the root node of the resulting tree:
doc = parse(url).getroot()
And to apply the selector to get our job links:
links = doc.cssselect('.row a:nth-child(2)')
To print them, we need their "href" attribute. We can simply iterate over them like this:
for link in links:
    print link.attrib['href']
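If the hrefs turn out to be relative paths rather than full URLs, lxml can turn them into absolute ones for you - call this on the document before printing (the base URL here is just what I'd use for this page):
doc.make_links_absolute('http://bucharest.craigslist.org/')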

Great. Hopefully you'll now have your screen full of Craigslist links. This is good.
But we want the links to be in a file. The easiest way to do this is by redirecting the output. Save the commands in a script file, and run it like this:
python CraigsLinks.py > file_with_links.txt
and you've got the file you can use in the command at the beginning of the article. And a lot of jobs! Hopefully you'll find a way to make the computer read them too. Congratulations!
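For reference, here's roughly what CraigsLinks.py would look like - it's just the commands from above collected into one file:
#!/usr/bin/env python
# CraigsLinks.py - print the href of every job link on the Craigslist listings page
from lxml.html import parse
import urllib2

# fetch and parse the page, then get the root node of the tree
url = urllib2.urlopen('http://bucharest.craigslist.org/jjj/')
doc = parse(url).getroot()

# the selector we generated with SelectorGadget
for link in doc.cssselect('.row a:nth-child(2)'):
    print link.attrib['href']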

That was surprisingly easy to write, and it could be easily adapted to sites which don't take steps against scrapers. Become filled with astonishment at the simplicity of Python and its modules!

There are many more things to learn: running the JavaScript on sites that rely on it, tricking the site into thinking we're a regular browser (faking the User-Agent header), navigating from page to page ("crawling"), or scraping and crawling in parallel. For more, watch the video and explore the tools' tutorials and documentation.
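As a small taste of one of those, here's how faking the User-Agent header might look with urllib2 (the header value is just an example of something browser-like, nothing magical):
from lxml.html import parse
import urllib2

# send a browser-like User-Agent so the site doesn't immediately know we're a script
request = urllib2.Request('http://bucharest.craigslist.org/jjj/',
                          headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'})
doc = parse(urllib2.urlopen(request)).getroot()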