Scrapy

Yesterday, during another boring phone call, I googled for “fun python packages” and bumped into this nice article: “20 Python libraries you can’t live without”. While I already knew many of the packages mentioned there, one caught my interest: Scrapy. Scrapy seems to be an elegant way not only to parse web pages but also to crawl them, mainly sites that have some sort of ‘Next’ or ‘Older posts’ button you want to click through, e.g. to retrieve all pages from a blog.

I installed Scrapy and ran into an import error, so, as mentioned in the FAQ and elsewhere, I had to install pypiwin32 manually:

pip install pypiwin32

Based on the example on the home page I wrote a little script to retrieve titles and URLs from my German blog “Axel Unterwegs”. I then enhanced it to write those into a table-of-contents style HTML file, after figuring out how to override the initialization and close methods of my spider class.

import scrapy

# Static HTML wrapper for the generated table of contents.
header = """
<html><head>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/>
</head><body>
"""
footer = """
</body></html>
"""


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://axelunterwegs.blogspot.co.uk/']

    def __init__(self, *a, **kw):
        super(BlogSpider, self).__init__(*a, **kw)
        # Open the TOC file once (Python 3, UTF-8) and write the HTML header.
        self.file = open('blogspider.html', 'w', encoding='utf-8')
        self.file.write(header)

    def parse(self, response):
        for title in response.css('h3.post-title'):
            t = title.css('a ::text').extract_first()
            url = title.css('a ::attr(href)').extract_first()
            if t and url:  # skip headings that carry no link
                self.file.write("<a target=\"_NEW_\" href=\"%s\">%s</a>\n<br/>" % (url, t))
            yield {'title': t, 'url': url}

        # Follow the 'Older posts' link to the next page, if there is one.
        for next_page in response.css('a.blog-pager-older-link'):
            yield response.follow(next_page, self.parse)

    def closed(self, reason):
        # Scrapy calls closed() on the spider when the crawl finishes.
        self.file.write(footer)
        self.file.close()

Thus, here is the TOC of my German blog.
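One thing the spider does not do is escape the titles and URLs before writing them into the HTML file. As a minimal sketch of how that could look, the helper below builds one TOC entry with the standard library's html.escape. The function name toc_line is made up for this illustration and is not part of Scrapy; the spider would call it from parse instead of formatting the string inline.

```python
from html import escape


def toc_line(title, url):
    """Build one TOC entry; return '' when title or url is missing.

    Hypothetical helper for illustration, not part of Scrapy.
    """
    if not title or not url:
        return ""
    # escape() turns &, <, > and quotes into HTML entities.
    return '<a target="_NEW_" href="%s">%s</a>\n<br/>' % (
        escape(url, quote=True),
        escape(title),
    )


print(toc_line("Berge & Seen", "http://example.com/?a=1&b=2"))
```

Returning an empty string for missing values keeps the file writable even when a selector finds nothing on a page.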

I tried to get the same done with my English blog here on WordPress but have been struggling so far. One challenge is that the modern WordPress UI no longer has any ‘Older posts’ type of button; new postings are loaded as soon as you scroll down. The parsing doesn’t seem to work for now either, but maybe I will figure it out some time later.

 

 

Project Jupyter

Project Jupyter is an open source project that lets you run Python code in a web browser. It focuses on interactive data science and scientific computing, not only for Python but across all programming languages. It is a spin-off from IPython, which I blogged about here.
Typically you would have to install Jupyter and a full stack of Python packages on your computer and start the Jupyter server to get started.
But there is also an alternative available on the web where you can run IPython notebooks for free: https://try.jupyter.org/
This site does not allow you to save your projects permanently, but you can export projects, download notebooks and upload notebooks from your local computer.
IPython notebooks are a great way to get started with Python and learn the language. They make it easy to run your script in small increments and preserve the state of those increments, a.k.a. cells. They also nicely integrate output into your workflow, including graphical plots created with packages like matplotlib.pyplot, and they come with a simple markup language for adding documentation to your scripts.
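To give an idea of what working in cells means, here is a minimal sketch of two cells written as one plain script. The data and variable names are invented for the example. In a notebook each commented part would sit in its own cell, and the second cell could be edited and re-run without running the first one again.

```python
# Cell 1: define some data once; `scores` stays available to later cells.
scores = [12, 7, 19, 3, 15]

# Cell 2: reuse the state left behind by cell 1.
import statistics

mean_score = statistics.mean(scores)
best_score = max(scores)
print(mean_score, best_score)
```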
The possibilities are endless with IPython or Jupyter – to learn Python as a language or data analysis techniques.
I was inspired to get started with this again by this video on IBM developerWorks: “Use data science to up your game performance”. The book “Learning IPython for Interactive Computing and Data Visualization – Second Edition” by Cyrille Rossant is where I got the tip about free Jupyter on the web.

Of course you can also sign up for a trial on IBM’s Bluemix and start an IBM Data Science Experience project.

Just discovered: jsconsole.com

Just discovered jsconsole.com, an awesome way to quickly test out some JavaScript code.

So far I have liked jsFiddle for testing JavaScript code in the context of an HTML page and CSS styles, or cscript for running some JavaScript locally in a command prompt window.

jsconsole runs in your browser but works like a console: you just type or paste in JavaScript code and see the output in the console. For example:

a = 1;
1
b = 3;
3
a + b
4
A nice feature is that you can paste in functions as well and execute those:
function addit(v1,v2) { 
  return v1+v2; 
}
addit(a,b);
4
An even nicer feature is that you can load any web page to make it available as a document in your JavaScript context:
:load www.google.com

Loading url into DOM…

DOM load complete

You can also use that “:load” command to load any external script, or a JavaScript framework like jQuery:

:load http://ajax.googleapis.com/ajax/libs/jquery/2.1.0/jquery.min.js

Loading script…

Loaded http://ajax.googleapis.com/ajax/libs/jquery/2.1.0/jquery.min.js

Now that we have the Google page available as a document and jQuery as a library, we can easily find out (programmatically) the text of the two buttons on that page:

$("input").each( function() {if ($(this).attr("type") == "submit") { console.log($(this).attr("value")); }});

"Google Search"

"I’m Feeling Lucky"

Isn’t that just … wow!?

Flowcharter 0.1 alpha

Last week in my drawer I discovered a very early version of Visio or ABC Flowcharter:

I probably got this in 1980 when I joined IBM. Boy, did we really draw flow charts on paper for software we wrote ? Actually we did during the programming classes I went through at the beginning.

For my diploma thesis, which I wrote in 1983 using Script on a mainframe computer, I had to come up with some ASCII art to enrich my paper with flowcharts:

IPython and lxml

I have been playing a bit with IPython and lxml these days.

IPython is a powerful and interactive shell for Python. It supports browser-based notebooks with support for code, text (actually HTML markup), mathematical expressions, inline plots and other rich media. Nice intro here:

Another nice demo of what you can do with IPython is the pandas demo video here.

Several additional packages, like pandas, matplotlib or numpy, need to be installed first to really be able to use all these features. It is a good idea to install the entire SciPy stack, as described here.

I did the installation first on my Windows ThinkPad and later on a Linux Mint box.

This takes some work, like bumping into missing dependencies and installing those first, or trying several installation methods in case of problems. Sometimes it is better to take a compiled binary, sometimes to use pip install, and sometimes to fetch a source code package and go from there.

I finally succeeded on both my machines. The next step was to figure out how to run an IPython notebook server, because using IPython notebooks in a browser is the most efficient and fun way to work with IPython. For Windows there are useful instructions here; on my Linux Mint machine it worked very differently, and I finally found working instructions here.

After that I developed my first notebook using lxml, called GetTableFromWikipedia. It goes out to a Wikipedia page (in my case the one about chemical elements), retrieves a table from there (in my case table # 10, with a list of chemical elements) using lxml and XPath code, and converts it to CSV.
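The notebook itself is not reproduced here, but the idea of picking the n-th table and writing its rows as CSV can be sketched roughly as follows. This is not the notebook's code: to keep the example self-contained it uses the standard library's xml.etree.ElementTree (which supports a limited XPath subset) in place of lxml, and a small made-up, well-formed HTML snippet in place of a downloaded Wikipedia page. The function name table_to_csv is also just an assumption for this illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for a downloaded page.
PAGE = """<html><body>
<table><tr><th>Unrelated</th></tr><tr><td>x</td></tr></table>
<table>
  <tr><th>Symbol</th><th>Name</th></tr>
  <tr><td>H</td><td>Hydrogen</td></tr>
  <tr><td>He</td><td>Helium</td></tr>
</table>
</body></html>"""


def table_to_csv(page, table_number):
    """Return the table_number-th table (1-based) of page as CSV text."""
    root = ET.fromstring(page)
    # './/table' finds all tables in document order.
    table = root.findall(".//table")[table_number - 1]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in table.findall(".//tr"):
        # Collect the text of every header or data cell in this row.
        cells = ["".join(cell.itertext()).strip()
                 for cell in row if cell.tag in ("th", "td")]
        writer.writerow(cells)
    return out.getvalue()


csv_text = table_to_csv(PAGE, 2)
print(csv_text)
```

Real Wikipedia pages are usually not well-formed XML, which is one reason a forgiving HTML parser like lxml is the better tool for the actual job.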

The nice thing about IPython is that you can write code into cells and then just re-run those cells to see results immediately in the browser. This makes it very efficient and convenient to develop code by simply trying, or by doing a lot of “prototyping”, which sounds more professional.

Having an IPython notebook server running locally on your machine is certainly a must for developing a notebook. But how do you share notebooks with others? I found http://nbviewer.ipython.org, which allows you to share notebooks with the public. You have to store your notebook somewhere in the cloud and pass the URL to nbviewer. I uploaded my notebook to one of my Dropbox folders and here we go: have a look! Unfortunately it is not possible to actually run the notebook with nbviewer (nbviewer basically converts a notebook to HTML).

My notebook of course works with other tables too, like the List of rivers longer than 1000 km, published in this Wikipedia article as table # 5.

Since Firefox 30 using unsafeWindow is really not recommended anymore

I had written a little Greasemonkey script that generates some HTML code from a Flickr photo page for use in a blog posting.

"Gut gelaunter Baum" by Axel Magard.

See picture on the right as an example.

That script went to userscripts.org, but unfortunately userscripts.org is not available anymore, so you can now get the script from here (OpenUserJS).

The script used code like that described here to load jQuery dynamically, so that jQuery can be used in the Greasemonkey script. Because of the change in Firefox 30 this code stopped working and I always ran into a JavaScript error saying: “Permission denied to access property …”

Luckily this problem has been discussed here on stackoverflow.

The solution: a different way to use jQuery in a Greasemonkey script, basically through the @require directive, nicely explained in Taw’s blog here. (You can check out the source code of my script right away here on OpenUserJS.)

No way to access selected content in a XUL editor element ?

For a Firefox extension I am currently developing I am writing code that allows local editing of documents. I am using the XUL Editor element for this, which provides kind of an easy way to implement an HTML (or text) editor in a browser extension. Kind of … Of course there are hurdles, and the major one so far has been how to retrieve the selection made in the editor.

Why is that needed? Let me explain with one simple example. XUL Editor makes it easy to manipulate content through Midas and the Document.execCommand interface. To insert a link you basically invoke that command with the options shown below:

editor.contentDocument.execCommand("createlink", false, url);

Document.execCommand takes care of inserting the appropriate HTML tags correctly around the portion of text you have selected. In order to implement this in a sensible way you of course have to come up with some dialog first to ask the user for the URL to use for the link:

So far, so good. Now imagine a user selects some text for which a link has already been inserted. Of course you want your dialog entry field to contain that URL so that the user has the chance to change (or delete) it. This is where the ability to detect the user selection in the editor element is needed, and I can think of more use cases (like manipulating the attributes of an inserted image).

I was googling for some time without any luck. Either this is one of the best hidden features in XUL or there is basically no way to do it. The best hint I found was here in the Microsoft Developer Network, about using document.selection.createRange(). They actually provide a demo of this code … which is not working and ends with the following error shown in the console: “TypeError: document.selection is undefined”.

So far, so bad. It looked like I had to come up with my own hack and use what I have. As I said, the easy way to manipulate selected content in a XUL editor is through contentDocument.execCommand. I decided to come up with a ‘marker’ tag to mark the selected text first before I continue processing it. Thus for my method to handle links I first did this:

editor.contentDocument.execCommand("hiliteColor", false, "000000");

This basically applies a black background to the selected text. HTML-wise it generates the following tags around the selection:

<span style="background-color: rgb(0, 0, 0);">...</span>

Now with the help of some jQuery magic I can easily find the selected text within my document. The following code demonstrates how I fetch an existing URL from the HTML code if an “a” tag is found, by accessing the “href” attribute of that tag:

    spans = $(html).find("span");
    $(spans).each( function () {
        if ($(this).attr("style") == "background-color: rgb(0, 0, 0);" ) {
            if ($(this).find("a").length) {
                url = $(this).find("a").attr("href");
            }
        }
    });

Next I display my dialog for the user to enter / change the given URL and then use this simple code to create the link or remove it if the user has deleted it so that my dialog returns an empty url:

    if (url == "") {
        editor.contentDocument.execCommand("unlink", false, url);        
    } else {
        editor.contentDocument.execCommand("createlink", false, url);        
    }

All that remains to do later on is to remove that “marker” tag after I am done with my processing, which I accomplish through a regular expression:

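// Non-greedy match, so that each marker span is unwrapped on its own
// instead of swallowing everything up to the last </span> in the document.
var patt1 = /<span style="background-color: rgb\(0, 0, 0\);">([\s\S]*?)<\/span>/g;
editor.contentDocument.documentElement.innerHTML =
        editor.contentDocument.documentElement.innerHTML.replace(patt1, "$1");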

Works! As I said: it’s a hack. It will of course break if a user decides to use a black background somewhere in their document. Maybe I should use another, less likely color for my magic, like rgb(0,0,1).