Scrapy

Yesterday during another boring phone call I googled for “fun python packages” and bumped into this nice article: “20 Python libraries you can’t live without“. While I already knew many of the packages mentioned there one caught my interest: Scrapy. Scrapy seems to be an elegant way not only for parsing web pages but also for travelling web pages, mainly those which have some sort of ‘Next’ or ‘Older posts’ button you wanna click through to e.g. retrieve all pages from a blog.

I installed Scrapy and ran into one import error, thus as mentioned in the FAQ and elsewhere I had to manually install pypiwin32:

pip install pypiwin32

Based on the example on the home page I wrote a little script to retrieve titles and URLs from my German blog “Axel Unterwegs” and enhanced it to write those into a Table-Of-Contents type HTML file, after figuring out how to overwrite the Init and Close method of my spider class.

import scrapy
header = """
<html><head>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/>
</head><body>
"""
footer = """
</body></html> 
"""

class BlogSpider(scrapy.Spider):
 name = 'blogspider'
 start_urls = ['http://axelunterwegs.blogspot.co.uk/']
 
 def __init__(self, *a, **kw):
   super(BlogSpider, self).__init__(*a, **kw)
   self.file = open('blogspider.html','w')
   self.file.write(header)

 def parse(self, response):
   for title in response.css('h3.post-title'):
     t = title.css('a ::text').extract_first()
     url = title.css('a ::attr(href)').extract_first()
     self.file.write("<a target=\"_NEW_\" href=\"%s\">%s</a>\n<br/>" % (url.encode('utf8'),t.encode('utf8')))
     yield {'title': t, 'url': url}

   for next_page in response.css('a.blog-pager-older-link'):
     yield response.follow(next_page, self.parse)
 
 def spider_closed(self, spider):
   self.file.write(footer)
   self.file.close()

Thus, here is the TOC of my German blog.

I tried to get the same done with my English blog here on WordPress but have been struggling so far. One challenge is that the modern UI of WordPress does not have any ‘Older posts’ type of button anymore; new postings are retrieved as soon as you scroll down. Also the parsing doesn’t seem to work for now, but may be I figure it out some time later.

 

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: