Jupyter: Plotting pivots & changing legend entries

A while ago I blogged about project Jupyter, and over the last few days I have been working with it a lot. I am still fascinated by its power.

Today I faced and solved two challenges I'd like to share here:
1. plotting a pivot table
2. changing legend entries

Assume we have a dataframe with the number of male and female employees per department and org.
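A made-up sample with that shape could look like this ( the column names are my assumption, chosen to match the pivot call below ):

import pandas as pd

# hypothetical sample: one row per department and org
df = pd.DataFrame({
    "Org": ["Org A", "Org A", "Org B", "Org B", "Org B"],
    "Department": ["Dev", "Sales", "Dev", "Sales", "HR"],
    "Male employees": [10, 4, 7, 3, 2],
    "Female employees": [8, 6, 9, 2, 4],
})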

Creating a pivot table is a piece of cake using the pandas pivot_table function on that dataframe:

pivot = pd.pivot_table(df, index=["Org"], values=["Male employees", "Female employees"],
                       aggfunc=[len, "sum", "mean", "min", "max"])


This gets us:
- the number of departments per org ( = len of Female employees or Male employees )
- the sum of male and female employees per org ( = sum of Female employees and of Male employees )
- as well as mean, min and max

How to plot?
We can simply save the pivot table as a new dataframe ‘pivot’ and call its plot method. Let’s say we want to plot the sum of male and female employees per org. First we need to drop the other statistics from the pivot table that we don’t need for the plot. Then we plot:
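Something like this does the trick ( a sketch, assuming the pivot from above, where the top level of the columns is the aggregation function ):

import matplotlib.pyplot as plt

# drop the statistics we don't need, keeping only the 'sum' columns
pivot_sum = pivot.drop(columns=["len", "mean", "min", "max"], level=0)

pivot_sum.plot(kind="bar")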

The only problem here is that the legend entries of this plot look a bit cryptic. Here is some code to fix that:


ax = plt.gca()
handles, labels = ax.get_legend_handles_labels()
new_labels = []
for l in labels:
    # strip the aggregation prefix, e.g. turn '(sum, Female employees)' into 'Female employees'
    new_labels.append(l.replace("(sum, ", "").replace(")", ""))
ax.legend(handles, new_labels)


I have shared the entire notebook here.


How to print ipython notebooks without the source code

This is something I really need in order to create standard reports based on ipython notebooks that should not contain the source code and input prompts of ipython cells: the capability to print ipython notebooks without the source code.

There are ways to do that, as discussed here on stackoverflow, but all these methods involve adding some ugly code to your ipython cells or tweaking the way the ipython server is started ( or running nbconvert ), which might be out of your control if you use a cloud offering like Data Science Experience on IBM Cloud instead of your own ipython installation.

Here is how I achieve this:

I simply download my notebook as html.

Then I run this python script to convert that html file so that prompts and code cells are gone:

FILE = "/somewhere/myHTMLFile.html"

with open(FILE, 'r') as html_file:
    content = html_file.read()

# Get rid of prompts and source code by hiding them via CSS
content = content.replace("div.input_area {", "div.input_area {\n\tdisplay: none;")
content = content.replace(".prompt {", ".prompt {\n\tdisplay: none;")

# Write the tweaked html back to the same file
with open(FILE, 'w') as f:
    f.write(content)

That script basically adds the CSS ‘display: none’ attribute to all divs of class ‘prompt’ or ‘input_area’.

That tweaked html page can now easily be printed into a pdf file to get my standard report without any code or input prompt cells.

If you know what you are doing you can add more CSS tweaking, like e.g. this one, to that Python code:

# For dataframe tables use Courier font family with smaller font size
content = content.replace(".dataframe thead","table.dataframe { font-size: 7px; font-family: Courier; }\n.dataframe thead")

To figure out things like that I used the Firefox Inspector to determine the class names of DOM elements ( e.g. ‘table.dataframe’ is used to display dataframe tables in ipython ) and some CSS knowledge to achieve the manipulations I find useful, like reducing the font size of tables in order to make them fit on pages printed in portrait orientation.


Yesterday during another boring phone call I googled for “fun python packages” and bumped into this nice article: “20 Python libraries you can’t live without“. While I already knew many of the packages mentioned there, one caught my interest: Scrapy. Scrapy seems to be an elegant way not only to parse web pages but also to traverse them, mainly those which have some sort of ‘Next’ or ‘Older posts’ button you want to click through, e.g. to retrieve all pages from a blog.

I installed Scrapy and ran into an import error; as mentioned in the FAQ and elsewhere, I had to manually install pypiwin32:

pip install pypiwin32

Based on the example on the Scrapy home page I wrote a little script to retrieve titles and URLs from my German blog “Axel Unterwegs” and enhanced it to write those into a table-of-contents type HTML file, after figuring out how to override the init and close methods of my spider class.

import scrapy

# minimal HTML header/footer for the TOC page
header = """<html><head>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/>
</head><body>
"""
footer = """</body></html>
"""

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://axelunterwegs.blogspot.co.uk/']

    def __init__(self, *a, **kw):
        super(BlogSpider, self).__init__(*a, **kw)
        self.file = open('blogspider.html', 'w', encoding='utf8')
        self.file.write(header)

    def parse(self, response):
        for title in response.css('h3.post-title'):
            t = title.css('a ::text').extract_first()
            url = title.css('a ::attr(href)').extract_first()
            self.file.write('<a target="_NEW_" href="%s">%s</a>\n<br/>' % (url, t))
            yield {'title': t, 'url': url}

        # follow the 'Older posts' pager link to crawl the whole blog
        for next_page in response.css('a.blog-pager-older-link'):
            yield response.follow(next_page, self.parse)

    def closed(self, reason):
        # called automatically by Scrapy when the spider finishes
        self.file.write(footer)
        self.file.close()

Thus, here is the TOC of my German blog.

I tried to get the same done with my English blog here on WordPress but have been struggling so far. One challenge is that the modern UI of WordPress does not have any ‘Older posts’ type of button anymore; new postings are retrieved as soon as you scroll down. Also the parsing doesn’t seem to work for now, but maybe I’ll figure it out some time later.



Project Jupyter

Project Jupyter is an open source project that lets you run Python code in a web browser, focusing on interactive data science and scientific computing not only for Python but across all programming languages. It is a spin-off from IPython, which I blogged about here.
Typically you would have to install Jupyter and a full stack of Python packages on your computer and start the Jupyter server to get started.
But there is also an alternative available on the web where you can run IPython notebooks for free: https://try.jupyter.org/
This site does not allow you to save your projects permanently, but you can export projects and download notebooks as well as upload notebooks from your local computer.
IPython notebooks are a great way to get started with Python and learn the language. They make it easy to run your script in small increments and preserve the state of those increments, aka cells. They also nicely integrate output into your workflow, including graphical plots created with packages like matplotlib.pyplot, and they come with a simple markup language to add documentation to your scripts.
The possibilities are endless with IPython or Jupyter – to learn Python as a language or data analysis techniques.
I was inspired to get going with this again by this video on IBM developerWorks: “Use data science to up your game performance“. And the book “Learning IPython for Interactive Computing and Data Visualization – Second Edition” by Cyrille Rossant is where I got the tip about the free Jupyter site on the web.

Of course you can also sign up for a trial on IBM’s Bluemix and start an IBM Data Science Experience project.

How to tag mp3 files

I have a collection of mp3 files which I have named in the form "ARTIST – TITLE.mp3" and wanted to get them tagged properly.
My first plan was to write a Python script to do so, and I tried two Python libraries: pytaglib and eyeD3. pytaglib didn’t install; on Windows you need a Visual Studio C++ compiler to make it work, which I currently don’t have. pytaglib was also the reason why I tried to deal with ubuntu, which confronted me with lots of other problems and finally didn’t buy me anything, since pytaglib didn’t install properly on ubuntu either and ran into other compile issues.
eyeD3 installed but apparently cannot handle modern mp3 tag formats.
I also tried MusicBrainz, recommended in this article "How to tag all your audio files in the fastest possible way", but its user interface is weird and it didn’t get my files tagged. And I tried the linux id3tag command mentioned in the same article, again without success; it looks like it does not support the latest tag formats either.
Then I bumped into Mp3tag for Windows. Brilliant. It made it a piece of cake to tag my mp3 files through a function ‘filename to tag’, where you specify the pattern you used for your filenames, %Artist% – %Title%.mp3 in my case, and a few clicks later all my files were tagged properly.
I right away donated 5 bucks to the author of this freeware tool.

IPython and lxml

I have been playing a bit with ipython and lxml these days.

IPython is a powerful and interactive shell for Python. It supports browser based notebooks with support for code, text ( actually html markup ), mathematical expressions, inline plots and other rich media. Nice intro here:

Another nice demo of what you can do with ipython is the pandas demo video here.

Several additional packages need to be installed first to really be able to use all these features, like pandas, matplotlib or numpy. It is a good idea to install the entire scipy stack, as described here.

I did the installation first on my windows thinkpad and later on a Mint Linux box.

This is some work to get through: bumping into missing dependencies and installing those first, or trying several installation methods in case of problems. Sometimes it is better to take a compiled binary, sometimes to use pip install, sometimes to fetch a source code package and go from there.

I finally succeeded on both my machines. Next step was to figure out how to run an ipython notebook server, because using ipython notebooks in a browser is the most efficient and fun way to work with ipython. For Windows there are useful instructions here; on my Linux Mint machine it worked very differently, and I finally found working instructions here.

After that I developed my first notebook using lxml, called GetTableFromWikipedia, which basically goes out to a wikipedia page ( in my case the one about Chemical Elements ), fetches a table from there ( in my case table # 10 with a list of chemical elements ) using lxml and xpath code, and converts it to csv.
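The heart of that notebook boils down to a few lines like these ( a rough sketch only: the article URL and the table index are assumptions for the chemical elements case, and I use requests here to fetch the page ):

import csv
import requests
from lxml import html

URL = "https://en.wikipedia.org/wiki/Chemical_element"  # assumed article

page = html.fromstring(requests.get(URL).content)
table = page.xpath('//table')[9]  # table # 10 on the page

# walk the table rows and dump header and data cells as csv
with open('table.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in table.xpath('.//tr'):
        cells = [c.text_content().strip() for c in row.xpath('./th|./td')]
        writer.writerow(cells)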

The nice thing about ipython is that you can write code into cells and then just re-run those cells to see results immediately in the browser. This makes it very efficient and convenient to develop code by simply trying things out, or to do a lot of “prototyping”, which sounds more professional.

Having an ipython notebook server running locally on your machine is certainly a must for developing a notebook. But how to share notebooks with others? I found http://nbviewer.ipython.org, which allows sharing notebooks with the public. You have to store your notebook somewhere in the cloud and pass the URL to the nbviewer. I uploaded my notebook to one of my dropbox folders and here we go: have a look! Unfortunately it is not possible to actually run the notebook with nbviewer ( nbviewer basically converts a notebook to html ).

My notebook of course works with other tables too, like the List of rivers longer than 1000 km, published in this wikipedia article as table # 5.

How Quizroom works …

Now, as promised, a few insights into how Quizroom works.
As I already explained: Quizroom auto-generates questions based on facts I have stored in its database, so there is no need to set up pre-defined questions and answers.
It is designed in a way that allows me to keep an arbitrary number of fact tables in my database with an arbitrary number of facts. For example I have one table called facts_countries containing a list of countries with their population and rank by population ( guess who is number 1 by the way ).
The key table in the Quizroom database is the table called questions which contains the question templates, assigned to categories. For the category "Geography" for instance there is one question template which looks like this:

Question = "Which of these countries has the highest population ?"
Answer = "Country"
Criteria = "max(Population)"
Table = "facts_countries"
Category = "Geography"
Ref =

There are multiple questions in the category "Geography", so the first thing Quizroom does is pick one randomly. Let’s assume it has picked the one shown above. This tells Quizroom to go to the facts_countries table and pick four records randomly from there. From those 4 records one is picked as the "right" answer depending on the criteria, here the one with the highest population. The question is displayed along with the four possible answers. That’s basically it.
There is a column Ref with the URL from where I have taken the data. You might have noticed that after you have answered a question a "Reference" link is shown at the bottom of the screen, so you actually can go there and verify the source of the facts used for that particular question.
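To make this concrete, here is a little Python sketch of that mechanism ( purely hypothetical: Quizroom’s actual implementation differs, and I use TableName as the column name here since Table is a reserved word in SQL ):

import random

def generate_question(db, category):
    # db: an open database connection, e.g. a sqlite3.Connection

    # 1. pick a random question template for the chosen category
    question, answer_col, criteria, table, ref = db.execute(
        "SELECT Question, Answer, Criteria, TableName, Ref FROM questions"
        " WHERE Category = ? ORDER BY RANDOM() LIMIT 1", (category,)).fetchone()

    # 2. pick four random records from the fact table, e.g. facts_countries
    cur = db.execute("SELECT * FROM %s ORDER BY RANDOM() LIMIT 4" % table)
    cols = [d[0] for d in cur.description]
    records = [dict(zip(cols, row)) for row in cur.fetchall()]

    # 3. evaluate the criteria, e.g. 'max(Population)', to pick the right answer
    func_name, column = criteria.rstrip(")").split("(")
    right = {"max": max, "min": min}[func_name](records, key=lambda r: r[column])

    # 4. display the four possible answers in random order
    answers = [r[answer_col] for r in records]
    random.shuffle(answers)
    return question, answers, right[answer_col], ref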
The challenge now is to feed the Quizroom database with interesting facts, stored in a structured way. Wikipedia of course is a good source and some articles have a lot of tables, which makes it easy to some extent to derive such structured data. I actually wrote a little Greasemonkey script to transform HTML tables into CSV files that are easy to import into a database.
Nevertheless, even HTML tables are hard to digest for a structured database in many cases. If for example you look at the table of countries in this Wikipedia article, you notice that footnotes have been added for several countries. This kind of disturbs the attempt to transform such a table into a structured format and requires extra data cleaning effort.
My little Greasemonkey script is just a start; maybe a more powerful browser extension is needed to assist in fetching unstructured data and transforming it into useful structured data. Many facts come in the form of lists with special rules, something that script does not support yet.
So much for now. If you, dear reader, know of any source on the internet with interesting facts organized in a structured way please let me know; maybe these facts could become the fuel for more interesting questions in Quizroom.