Scrapy

Yesterday during another boring phone call I googled for “fun python packages” and bumped into this nice article: “20 Python libraries you can’t live without”. While I already knew many of the packages mentioned there, one caught my interest: Scrapy. Scrapy seems to be an elegant way not only to parse web pages but also to traverse them, mainly those which have some sort of ‘Next’ or ‘Older posts’ button you want to click through, e.g. to retrieve all pages from a blog.

I installed Scrapy and ran into one import error; as mentioned in the FAQ and elsewhere, I had to install pypiwin32 manually:

pip install pypiwin32

Based on the example on the home page I wrote a little script to retrieve titles and URLs from my German blog “Axel Unterwegs”, and enhanced it to write those into a table-of-contents type HTML file after figuring out how to override the constructor and the close handler of my spider class.

import scrapy
header = """
<html><head>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/>
</head><body>
"""
footer = """
</body></html> 
"""

class BlogSpider(scrapy.Spider):
 name = 'blogspider'
 start_urls = ['http://axelunterwegs.blogspot.co.uk/']
 
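 def __init__(self, *a, **kw):
   super(BlogSpider, self).__init__(*a, **kw)
   # Open the output file as UTF-8 (Python 3) so umlauts in the titles are written correctly
   self.file = open('blogspider.html', 'w', encoding='utf-8')
   self.file.write(header)

 def parse(self, response):
   # Every post title on the page is an <h3 class="post-title"> wrapping a link
   for title in response.css('h3.post-title'):
     t = title.css('a ::text').extract_first()
     url = title.css('a ::attr(href)').extract_first()
     self.file.write('<a target="_NEW_" href="%s">%s</a>\n<br/>' % (url, t))
     yield {'title': t, 'url': url}

   # Follow the 'Older posts' link, if there is one, and parse that page as well
   for next_page in response.css('a.blog-pager-older-link'):
     yield response.follow(next_page, self.parse)
 
 # Scrapy calls closed() by itself when the spider finishes;
 # a method named spider_closed would need to be connected to the signal first
 def closed(self, reason):
   self.file.write(footer)
   self.file.close()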

Thus, here is the TOC of my German blog.
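
One thing the script above does not do is escape the titles and URLs before writing them into the HTML file. Here is a minimal sketch of how the TOC page could be built from the items the spider yields. The function name render_toc and the sample item are my own assumptions for illustration, not part of the original script.

```python
import html

HEADER = """<html><head>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/>
</head><body>
"""
FOOTER = "</body></html>\n"


def render_toc(items):
    """Build the table-of-contents page from dicts with 'title' and 'url' keys."""
    parts = [HEADER]
    for item in items:
        # Escape both values so characters like & or " cannot break the markup
        url = html.escape(item["url"], quote=True)
        title = html.escape(item["title"])
        parts.append('<a target="_NEW_" href="%s">%s</a>\n<br/>\n' % (url, title))
    parts.append(FOOTER)
    return "".join(parts)


# Hypothetical item, shaped like the dicts yielded by the spider
page = render_toc([{"title": "Berge & Täler", "url": "http://example.com/?a=1&b=2"}])
print(page)
```

With this approach the spider itself would only yield items, and the file would be written once at the end.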

I tried to get the same done with my English blog here on WordPress but have been struggling so far. One challenge is that the modern UI of WordPress no longer has any ‘Older posts’ type of button; new postings are retrieved as soon as you scroll down. Also the parsing doesn’t seem to work for now, but maybe I will figure it out some time later.

 

 

Project Jupyter

Project Jupyter is an open source project that lets you run Python code in a web browser. It focuses on interactive data science and scientific computing, not only for Python but across all programming languages. It is a spin-off from IPython, which I blogged about here.
Typically you have to install Jupyter and a full stack of Python packages on your computer and then start the Jupyter server before you can begin.
But there is also an alternative on the web where you can run IPython notebooks for free: https://try.jupyter.org/
This site does not allow you to save your projects permanently, but you can export projects, download notebooks to your local computer and upload them from there.
IPython notebooks are a great way to get started with Python and learn the language. They make it easy to run your script in small increments, known as cells, and they preserve the state between those cells. They also nicely integrate output into your workflow, including graphical plots created with packages like matplotlib.pyplot, and they come with a simple markup language for adding documentation to your scripts.
The possibilities are endless with IPython or Jupyter, whether you want to learn Python as a language or data analysis techniques.
I was inspired to get started with this again by this video on IBM developerWorks: “Use data science to up your game performance”. And the book “Learning IPython for Interactive Computing and Data Visualization – Second Edition” by Cyrille Rossant is where I got the tip about free Jupyter on the web.

Of course you can also sign up for a trial on IBM’s Bluemix and start an IBM Data Science Experience project.

How to tag mp3 files

I have a collection of mp3 files which I have named in the form "ARTIST – TITLE.mp3" and wanted to get them tagged properly.
My first plan was to write a Python script to do so, and I tried two Python libraries: pytaglib and eyeD3. pytaglib didn’t install: on Windows you need a Visual Studio C++ compiler to make it work, which I currently don’t have. pytaglib was also the reason why I tried Ubuntu, which confronted me with lots of other problems and in the end didn’t buy me anything, since pytaglib didn’t install properly on Ubuntu either and ran into some other compile issues.
eyeD3 installed but apparently cannot handle modern mp3 tag formats.
I also tried MusicBrainz, recommended in the article "How to tag all your audio files in the fastest possible way", but its user interface is weird and it didn’t get my files tagged. I also tried the Linux id3tag command mentioned in the same article, again without success; it looks like it does not support the latest tag formats either.
Then I bumped into Mp3tag for Windows. Brilliant. It made it a piece of cake to tag my mp3 files through its ‘filename to tag’ function, where you specify a pattern for the filenames you have been using, %Artist% – %Title%.mp3 in my case, and a few clicks later all my files were tagged properly.
I right away donated 5 bucks to the author of this freeware tool.
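
For the record, the part of my original plan that needs no tagging library at all is splitting the file name into artist and title. A minimal sketch of that step follows. The function name split_artist_title and the sample file names are made up for illustration, and accepting a plain hyphen as well as the dash is my assumption. Writing the actual tags would still need a library such as the ones mentioned above.

```python
import os
import re

# Separator between artist and title: a dash or a plain hyphen with spaces around it
SEPARATOR = re.compile(r"\s+[–-]\s+")


def split_artist_title(filename):
    """Return {'artist': ..., 'title': ...} for 'ARTIST – TITLE.mp3', else None."""
    stem, ext = os.path.splitext(os.path.basename(filename))
    if ext.lower() != ".mp3":
        return None
    # Split only at the first separator so titles may contain further dashes
    parts = SEPARATOR.split(stem, maxsplit=1)
    if len(parts) != 2:
        return None
    artist, title = parts[0].strip(), parts[1].strip()
    if not artist or not title:
        return None
    return {"artist": artist, "title": title}


# Hypothetical file names
print(split_artist_title("Some Artist – Some Title.mp3"))
print(split_artist_title("NoSeparator.mp3"))
```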

IPython and lxml

I have been playing a bit with ipython and lxml these days.

IPython is a powerful and interactive shell for Python. It offers browser based notebooks that combine code, text ( actually html markup ), mathematical expressions, inline plots and other rich media. Nice intro here:

Another nice demo of what you can do with ipython is the pandas demo video here.

Several additional packages need to be installed first to really be able to use all these features, like pandas, matplotlib or numpy. It is a good idea to install the entire scipy stack, as described here.

I did the installation first on my Windows ThinkPad and later on a Linux Mint box.

This takes some work to get through, like bumping into missing dependencies and installing those first, or trying several installation methods in case of problems. Sometimes it is better to take a compiled binary, sometimes to use pip install, and sometimes to fetch a source code package and go from there.

I finally succeeded on both my machines. The next step was to figure out how to run an ipython notebook server, because using ipython notebooks in a browser is the most efficient and fun way to work with ipython. For Windows there are useful instructions here; on my Linux Mint machine it worked very differently, and I finally found working instructions here.

After that I developed my first notebook using lxml, called GetTableFromWikipedia. It basically goes out to a wikipedia page ( in my case the one about Chemical Elements ), fetches a table from there using lxml and xpath code ( in my case table # 10 with a list of chemical elements ) and converts it to csv.
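
The notebook itself is not reproduced here, but the core idea can be sketched in a few lines. This is only an approximation under some assumptions: the function name table_to_csv is made up, and instead of downloading the wikipedia article the sketch parses a tiny stand-in page so that it runs without a network connection.

```python
import csv
import io

from lxml import html

# A tiny stand-in page; the real notebook worked on a downloaded wikipedia article
SAMPLE_PAGE = """
<html><body>
<table><tr><th>Some other table</th></tr><tr><td>ignored</td></tr></table>
<table>
  <tr><th>Z</th><th>Symbol</th><th>Element</th></tr>
  <tr><td>1</td><td>H</td><td>Hydrogen</td></tr>
  <tr><td>2</td><td>He</td><td>Helium</td></tr>
</table>
</body></html>
"""


def table_to_csv(page_source, table_number):
    """Return table `table_number` (counting from 1) of an HTML page as CSV text."""
    tree = html.fromstring(page_source)
    tables = tree.xpath('//table')          # all tables in document order
    table = tables[table_number - 1]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    for row in table.xpath('.//tr'):
        cells = row.xpath('./th | ./td')    # header and data cells alike
        writer.writerow([cell.text_content().strip() for cell in cells])
    return out.getvalue()


csv_text = table_to_csv(SAMPLE_PAGE, 2)
print(csv_text)
```

For a real page you would pass the downloaded page source and the number of the table you are after, like 10 in my case.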

The nice thing about ipython is that you can write code into cells and then just re-run those cells to see results immediately in the browser. This makes it very efficient and convenient to develop code by simply trying, or to do a lot of “prototyping”, which sounds more professional.

Having an ipython notebook server running locally on your machine is certainly a must for developing a notebook. But how to share notebooks with others ? I found http://nbviewer.ipython.org, which allows you to share notebooks with the public. You have to store your notebook somewhere in the cloud and pass the URL to the nbviewer. I uploaded my notebook to one of my dropbox folders and here we go: have a look ! Unfortunately it is not possible to actually run the notebook with nbviewer ( nbviewer basically converts a notebook to html ).

My notebook of course works with other tables too, like the List of rivers longer than 1000 km, published in this wikipedia article as table # 5.

How Quizroom works …

Now, as promised, a few insights into how Quizroom works.
As I already explained: Quizroom auto-generates questions based on facts I have stored in its database, so there is no need to set up pre-defined questions and answers.
It is designed in a way that it allows me to keep an arbitrary number of fact tables in my database with an arbitrary number of facts. For example I have one table called facts_countries containing a list of countries with their population and rank by population ( guess who is number 1 by the way ).
The key table in the Quizroom database is the table called questions which contains the question templates, assigned to categories. For the category "Geography" for instance there is one question template which looks like this:

Question = "Which of these countries has the highest population ?"
Answer = "Country"
Criteria = "max(Population)"
Table = "facts_countries"
Category = "Geography"
Ref = "http://en.wikipedia.org/wiki/List_of_countries_by_population"

There are multiple questions in category "Geography", so the first thing Quizroom does is pick one at random. Let’s assume it has picked the one shown above. This tells Quizroom to go to the facts_countries table and pick four records randomly from there. From those 4 records one is picked as the "right" answer depending on the criteria, here the one with the highest population. The question is displayed plus the four possible answers. That’s basically it.
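
The steps above can be sketched roughly like this. It is not the actual Quizroom code: the function name make_question, the in-memory stand-in for the database, the made-up country names and numbers, and the support for a min(...) criteria next to max(...) are all assumptions for illustration.

```python
import random
import re

# Made-up stand-in for the facts_countries table (fictional names and numbers)
FACTS_COUNTRIES = [
    {"Country": "Avalon", "Population": 1200000},
    {"Country": "Lilliput", "Population": 35000},
    {"Country": "Ruritania", "Population": 8400000},
    {"Country": "Utopia", "Population": 560000},
    {"Country": "Erewhon", "Population": 2300000},
    {"Country": "Freedonia", "Population": 910000},
]
TABLES = {"facts_countries": FACTS_COUNTRIES}

TEMPLATE = {
    "Question": "Which of these countries has the highest population ?",
    "Answer": "Country",
    "Criteria": "max(Population)",
    "Table": "facts_countries",
}


def make_question(template, tables, rng=random):
    """Pick four random records and mark the one that satisfies the criteria."""
    match = re.fullmatch(r"(max|min)\((\w+)\)", template["Criteria"])
    if match is None:
        raise ValueError("unsupported criteria: %r" % template["Criteria"])
    pick = max if match.group(1) == "max" else min
    column = match.group(2)
    candidates = rng.sample(tables[template["Table"]], 4)
    right = pick(candidates, key=lambda record: record[column])
    answer_column = template["Answer"]
    return {
        "question": template["Question"],
        "answers": [record[answer_column] for record in candidates],
        "right": right[answer_column],
    }


question = make_question(TEMPLATE, TABLES, random.Random(1))
print(question)
```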
There is a column Ref with the URL from where I have taken the data. You might have noticed that after you have answered a question a "Reference" link is shown at the bottom of the screen, so you actually can go there and verify the source of the facts used for that particular question.
The challenge now is to feed the Quizroom database with interesting facts, stored in a structured way. Wikipedia of course is a good source, and some articles have a lot of tables, which makes it easy to some extent to derive such structured data. I actually wrote a little Greasemonkey script to transform HTML tables into CSV files that are easy to import into a database.
Nevertheless, even HTML tables are hard to digest for a structured database in many cases. If for example you look at the table of countries in this Wikipedia article, you notice that footnotes have been added for several countries. This gets in the way of transforming such a table into a structured format and requires extra data cleaning effort.
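
To give an idea of that cleaning effort, here is a small sketch that strips footnote markers in square brackets, like [a] or [12], from a table cell before it goes into the database. The function names and the sample values are made up, and real tables will certainly need more rules than this.

```python
import re

# Footnote markers as they typically show up in the cell text: [a], [12], [note 3]
FOOTNOTE = re.compile(r"\[[^\]]*\]")


def clean_cell(text):
    """Remove footnote markers and surrounding whitespace from a cell."""
    return FOOTNOTE.sub("", text).strip()


def parse_number(text):
    """Turn a cell like '1,234,567[12]' into an integer."""
    return int(clean_cell(text).replace(",", ""))


# Made-up sample cells
print(clean_cell("Exampleland[a]"))
print(parse_number("1,234,567[12]"))
```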
My little Greasemonkey script is just a start; maybe a more powerful browser extension is needed to assist in fetching unstructured data and transforming it into useful structured data. Many facts come in the form of lists with special rules, for instance, which that script does not support yet.
So much for now. If you, dear reader, know of any source on the internet with interesting facts organized in a structured way, please let me know; maybe these facts could become the fuel for more interesting questions in Quizroom.

Welcome to Quizroom !

My first little Python based web project "Quizroom" went live on Frihost. My original plan was to implement it using MySQL, but technical limitations on Frihost forced me to rewrite my server code and, for now, use SQLite as a backend.

Quizroom is a little quiz game asking you multiple-choice questions in several categories. So far I have set up:
* Geography
* Movies
* Science
Quizroom auto-generates questions based on facts I have stored in its database, so there is no need to set up pre-defined questions and answers. In a later blog posting I will explain a bit more about how it works.

For now, here is the quick user’s guide:

When you start the game you first select one of the categories or "all" if you want to play them all. Then you click on the upper center field to get started and the first question is displayed together with 4 possible answers. A timer starts running, shown as a progress bar advancing from left to right at the bottom of the browser window.
The sooner you answer right, the better !
The time you get is 10 seconds plus 1 second for every 30 words you have to read ( question + all answers ).
If you answer right you gain as many points as there were seconds left before the timeout.
If you answer wrong you lose one energy point per 100 points you have scored so far, which basically means: the higher your score, the more energy you lose when answering wrong.
A right answer gets you 1 energy point, up to a maximum of 20. You start with 10.
If you run out of energy the game is over for you. At this time, if you made it into the high score list, you have the opportunity to leave your name ( or Frihost user name or whatever ) in the high score list. You can also click the closing "x" at the top right corner of that dialog box if you do not want to show up on the high score list.
A push button in the right column of the screen lets you restart the game. While the game is running you can also end it at any time through another push button in the right column of the browser window.
That’s basically it. Give it a try and have fun playing. Click HERE to get started.
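
For those who like to see the rules as code, here is a rough sketch. It is not taken from the Quizroom source: the function names are made up, and I assume that both "1 second for every 30 words" and "one energy point per 100 points" round down, so a wrong answer costs nothing while your score is below 100.

```python
BASE_SECONDS = 10
WORDS_PER_EXTRA_SECOND = 30
START_ENERGY = 10
MAX_ENERGY = 20


def time_limit(question, answers):
    """Seconds allowed: 10 plus 1 for every 30 words of question and answers."""
    words = len(question.split()) + sum(len(answer.split()) for answer in answers)
    return BASE_SECONDS + words // WORDS_PER_EXTRA_SECOND


def apply_answer(score, energy, correct, seconds_left):
    """Return the new (score, energy) pair after one answer."""
    if correct:
        # Points equal the seconds left; energy grows by 1 up to the maximum
        return score + seconds_left, min(energy + 1, MAX_ENERGY)
    # A wrong answer costs one energy point per 100 points scored so far
    return score, energy - score // 100


def game_over(energy):
    """The game ends as soon as the energy is used up."""
    return energy <= 0


score, energy = apply_answer(0, START_ENERGY, True, 4)
print(score, energy)
```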

Changing the color of selected rows in a PyGridTableBase grid

First of all I want to say what a great wxPython extension wx.grid is. I am developing a little browser for large log files, and when I first tried to build it with native list controls I realized how slow those are when it comes to listing large amounts of data. Thanks to wx.grid, which basically uses a very sophisticated form of a virtual list control, my little browser has now become a really useful tool.

Documentation for everything around wxPython is available, but not in a form I would call comprehensive. As with many other libraries software developers are using these days, google.com is your friend: you need to google for solutions and search through code repositories ( like Nullege for Python ) or user groups ( like the wxpython-users group on Google ). Stack Overflow of course is another great source of answers.

One simple thing I could not get done for a couple of hours: changing the color used for selected rows. Like wxPython list controls, wx.Grid uses a dark blue for this, which looks odd on my Windows 7 machine.

After gazing again at the MegaTable sample code I finally found the solution: you can implement your own row selection ( or call it highlighting ) color by changing the Draw method of a font renderer that you plug into a grid panel.

Here is my version of the Draw method of my MegaFontRenderer, using a light grey ( HIGHLIGHT_COLOR ) as the row selection color rather than the odd wx.BLUE:

    def Draw(self, grid, attr, dc, rect, row, col, isSelected):
        # Here we draw text in a grid cell using various fonts
        # and colors.  We have to set the clipping region on
        # the grid's DC, otherwise the text will spill over
        # to the next cell
        dc.SetClippingRect(rect)

        # clear the background
        dc.SetBackgroundMode(wx.SOLID)
        
        HIGHLIGHT_COLOR = (240,240,240)
        
        if isSelected:
            dc.SetBrush(wx.Brush(HIGHLIGHT_COLOR, wx.SOLID))
            dc.SetPen(wx.Pen(HIGHLIGHT_COLOR, 1, wx.SOLID))
        else:
            dc.SetBrush(wx.Brush(wx.WHITE, wx.SOLID))
            dc.SetPen(wx.Pen(wx.WHITE, 1, wx.SOLID))
        dc.DrawRectangleRect(rect)

        text = self.table.GetValue(row, col)
        dc.SetBackgroundMode(wx.SOLID)

        # change the text background based on whether the grid is selected
        # or not
        if isSelected:
            dc.SetBrush(self.selectedBrush)                
            dc.SetTextBackground(HIGHLIGHT_COLOR)
        else:
            dc.SetBrush(self.normalBrush)
            if self.background_color_index:
                idx = self.table.GetValue(row, int(self.background_color_index)-1)
                dc.SetTextBackground(COLORS[int(idx)-1])
            else:            
                dc.SetTextBackground("white")

        dc.SetTextForeground(self.color)
        dc.SetFont(self.font)
        dc.DrawText(text, rect.x+1, rect.y+1)

        # Okay, now for the advanced class 🙂
        # The commented-out lines below would add three dots "..."
        # to indicate that there is more text to be read
        # when the text is larger than the grid cell;
        # this version widens the column to fit the text instead

        width, height = dc.GetTextExtent(text)
        
        if width > grid.GetColSize(col) and not self.colSize:
            # width, height = dc.GetTextExtent("...")
            # x = rect.x+1 + rect.width-2 - width
            # dc.DrawRectangle(x, rect.y+1, width+1, height)
            # dc.DrawText("...", x, rect.y+1)
            grid.SetColSize(col, width)

        dc.DestroyClippingRegion()