Scrapy

Yesterday, during another boring phone call, I googled for “fun python packages” and bumped into this nice article: “20 Python libraries you can’t live without”. While I already knew many of the packages mentioned there, one caught my interest: Scrapy. Scrapy seems to be an elegant way not only to parse web pages but also to crawl them, mainly those which have some sort of ‘Next’ or ‘Older posts’ button you want to click through, e.g. to retrieve all pages from a blog.

I installed Scrapy and ran into one import error; as mentioned in the FAQ and elsewhere, I had to install pypiwin32 manually:

pip install pypiwin32

Based on the example on the home page I wrote a little script to retrieve titles and URLs from my German blog “Axel Unterwegs”. I then enhanced it to write those into a table-of-contents type HTML file, after figuring out how to override the initialization and close methods of my spider class.

import scrapy

header = """
<html><head>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/>
</head><body>
"""
footer = """
</body></html>
"""


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://axelunterwegs.blogspot.co.uk/']

    def __init__(self, *a, **kw):
        super().__init__(*a, **kw)
        # Open the output file once (Python 3, UTF-8) and write the HTML header.
        self.file = open('blogspider.html', 'w', encoding='utf-8')
        self.file.write(header)

    def parse(self, response):
        # Each post title is an <h3 class="post-title"> wrapping a link.
        for title in response.css('h3.post-title'):
            t = title.css('a ::text').extract_first()
            url = title.css('a ::attr(href)').extract_first()
            self.file.write('<a target="_NEW_" href="%s">%s</a>\n<br/>' % (url, t))
            yield {'title': t, 'url': url}

        # Follow the 'Older posts' link until there is none left.
        for next_page in response.css('a.blog-pager-older-link'):
            yield response.follow(next_page, self.parse)

    def closed(self, reason):
        # Scrapy calls closed() automatically when the spider finishes.
        self.file.write(footer)
        self.file.close()

Thus, here is the TOC of my German blog.
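
The HTML-writing part of the spider can also be tried out without crawling anything. Below is a minimal sketch, assuming the scraped posts are already available as (title, url) pairs. The function name render_toc and the sample data are my own inventions for illustration; unlike the spider above, it escapes titles and URLs so that characters like & cannot break the markup.

```python
import html

HEADER = ("<html><head>\n"
          "<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/>\n"
          "</head><body>\n")
FOOTER = "</body></html>\n"


def render_toc(items):
    """Return a complete HTML page linking to every (title, url) pair."""
    parts = [HEADER]
    for title, url in items:
        # Escape both values so that & or quotes cannot break the markup.
        parts.append('<a target="_NEW_" href="%s">%s</a>\n<br/>'
                     % (html.escape(url, quote=True), html.escape(title)))
    parts.append(FOOTER)
    return "".join(parts)


# Hypothetical sample data, not taken from the real blog.
sample = [("Berge & Seen", "http://example.com/post?x=1&y=2")]
print(render_toc(sample))
```

Keeping the rendering in a plain function like this would make it easy to test the output format separately from the crawl.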

I tried to get the same done with my English blog here on WordPress but have been struggling so far. One challenge is that the modern UI of WordPress no longer has any ‘Older posts’ type of button; new postings are loaded as soon as you scroll down. The parsing doesn’t seem to work yet either, but maybe I will figure it out some time later.
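
One possible way around the missing ‘Older posts’ button might be to read the blog's RSS feed instead of the rendered pages. The sketch below is untested against my blog and rests on two assumptions: that the blog serves a standard RSS 2.0 feed under /feed/, and that the feed accepts a paged=N query parameter. The function names and the sample feed are made up for illustration.

```python
import xml.etree.ElementTree as ET


def feed_page_url(blog_url, page):
    """Build the URL of one feed page (assumed layout: /feed/?paged=N)."""
    return "%s/feed/?paged=%d" % (blog_url.rstrip("/"), page)


def parse_feed(xml_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title", default=""),
             item.findtext("link", default=""))
            for item in root.iter("item")]


# Hypothetical feed content, only used to show the parsing step.
SAMPLE = """<?xml version="1.0"?>
<rss version="2.0"><channel>
<title>Demo</title>
<item><title>First post</title><link>https://example.wordpress.com/first/</link></item>
<item><title>Second post</title><link>https://example.wordpress.com/second/</link></item>
</channel></rss>"""

print(feed_page_url("https://example.wordpress.com/", 2))
for title, link in parse_feed(SAMPLE):
    print(title, link)
```

A spider could request feed_page_url(...) for page 1, 2, 3 and so on, and stop at the first page that returns no items or an error.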

 

 


Project Jupyter

Project Jupyter is an open source project that lets you run Python code in a web browser. It focuses on interactive data science and scientific computing, not only for Python but across all programming languages. It is a spin-off from IPython, which I blogged about here.
Typically you would have to install Jupyter and a full stack of Python packages on your computer and start the Jupyter server to get started.
But there is also an alternative available on the web where you can run IPython notebooks for free: https://try.jupyter.org/
This site does not allow you to save your projects permanently, but you can export and download projects and also upload notebooks from your local computer.
IPython notebooks are a great way to get started with Python and learn the language. They make it easy to run your script in small increments, called cells, and they preserve the state between those cells. They also nicely integrate output into your workflow, including graphical plots created with packages like matplotlib.pyplot, and they come with a simple markup language to add documentation to your scripts.
The possibilities are endless with IPython or Jupyter, whether you want to learn Python as a language or data analysis techniques.
I was inspired to get started with this again by this video on IBM developerWorks: “Use data science to up your game performance”. And the book “Learning IPython for Interactive Computing and Data Visualization – Second Edition” by Cyrille Rossant is where I got the tip about free Jupyter on the web.
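
To illustrate what it means that a notebook preserves state between cells, here is a toy model in plain Python. It is not how Jupyter is implemented; it only assumes that each cell is a piece of code executed in one shared namespace, so that later cells can use names defined by earlier ones. The helper run_cells is hypothetical.

```python
def run_cells(cells):
    """Execute code strings one after another in a single shared namespace."""
    namespace = {}
    for source in cells:
        exec(source, namespace)
    return namespace


cells = [
    "values = [3, 1, 4, 1, 5]",    # cell 1: define some data
    "total = sum(values)",         # cell 2: uses 'values' from cell 1
    "mean = total / len(values)",  # cell 3: uses the results of cells 1 and 2
]

state = run_cells(cells)
print(state["total"], state["mean"])
```

In a real notebook you could re-run only the third cell after changing it, and it would still see the values computed by the first two.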

Of course you can also sign up for a trial on IBM’s Bluemix and start an IBM Data Science Experience project.

Victim of some Facebook Phishing

Today I became a victim of some Facebook credentials phishing. I received an instant message from one of my Facebook contacts containing a video. When trying to play the video I got prompted to enter my Facebook credentials. After having done this … my credentials went into the wrong hands. And it became obvious that this video was not from my contact.
This happened on my smartphone. I believe this would never have happened to me on a PC, because there are many means to cross-check URLs, links and other things to detect phishing. On a mobile device it is much harder. The login screen really looked authentic.
The result was: many dubious videos sent to all my contacts. In the meantime Facebook locked my account right away because they detected suspicious behavior. I also (too late) read the warning on Facebook from the contact from whom I had received the malicious message, saying that her account had been compromised.
I unlocked my Facebook account by setting a new password and acknowledging a confirmation code; Facebook did quite a good job of detecting the problem and taking me through the steps to resolve it. I then posted a warning on my Facebook page and also sent warning messages to most of my contacts; luckily I have fewer than 100 :-)
Interestingly, my Chrome browser on one of my laptops later insisted on downloading a Malicious Software Removal tool from Facebook, which was blocked right away by my virus scanner. This happened while Facebook was working fine in my Firefox browser. I found this very helpful hint here (see comment #3 in this lengthy article) on how to get around this strange behavior and enable Facebook again in my Chrome browser.

Enigma

I just wiped Windows XP from my little Asus Eee netbook and replaced it with Ubuntu 16.04. Of course the Asus Eee is a weak little laptop, but it turns out Ubuntu runs quite nicely on it. A modern Windows was not a good choice IMHO since it is too resource hungry, especially when I look at all the Windows services attempting to scan my mechanical hard disk. Sometimes I think Microsoft has been sponsored by flash drive manufacturers to increase market demand for their products ;-)
While exploring the available software in the Ubuntu Software store I discovered Enigma, a nice game I started playing right away. I used to play it some years ago and knew it under the name Oxyd. It is a puzzle game in which you control a ball with the mouse and need to find matching pairs of oxyd stones. In some levels you have to control two little white balls and get them into a hole. Other levels are Sokoban-like: you have to move stones around.

Enigma comes with tons of levels, and many are really challenging!

Longest internet outage you ever experienced?

In March my internet connection provided by Vodafone was broken for 6 days. That means: no phone and no internet at home for a long period of time. Luckily I have a Samsung Galaxy S5 Mini with LTE and hotspot capability, and Vodafone gave me 50 GB of free data volume when I called them to address the problem.

Nevertheless, what duration of an internet outage can be tolerated? I have read articles in the meantime saying that an outage of 3 working days needs to be tolerated, but anything beyond that qualifies you to ask for compensation or might give you the right to cancel your contract early; typical contracts here in Germany last 2 years and can be cancelled three months before they end.

So, what is the longest internet outage you ever experienced at home, and what did you do about it?

And: how many of those outages can be tolerated per year? I am asking this because my line has been down again since yesterday. I called Vodafone, and after keying in my phone number a voice told me something about a global problem.

“1900 001337” by StephenMitchell.

In former times, when we had an analog phone at home, I don’t recall that it ever failed; it was available for years and years without a single failure. It just worked.
Modern phones use Voice over IP and thus need the internet to work. And internet connections turn out to be quite unreliable these days, as I am experiencing myself; I also hear that the line in my mother-in-law’s house hasn’t worked since last week.

In former times we had an analog radio in our kitchen and it always worked. Nowadays we have an IP radio, and of course it doesn’t work without an internet connection. Thus: no radio this morning.

Modern technology is fascinating, but much more fragile than older technology. If I think about the Internet of Things on one side and the increasing activity of criminals trying to sabotage that technology on the other, a technology which is already kind of fragile due to its complexity, I get a bit nervous about the future.

Saturn V Launch Vehicle Digital Computer, IBM Part Number 6109030

"Saturn V Computer Ring"

Did you know that IBM designed the Saturn V Launch Vehicle Digital Computer in the 1960s? I didn’t, until my wife and I stopped at Huntsville, Alabama, on our four-week trip through the Southern States of the USA and visited the U.S. Space and Rocket Center located there.

"IBM Team responsible for the Saturn V Instrument Unit"

IBM had actually been assigned the overall responsibility for designing the Saturn V Instrument Unit, and I have posted here a picture of the IBM team working on that: impressive how many people we assigned to a single project in those days!

When NASA designed the Saturn V they discussed whether the launch and flight of this huge rocket should be controlled by the astronauts or automatically. They came to the conclusion that the stress caused by vibrations and noise during takeoff would be too much for human beings, so they decided to design an instrument unit that controls the launch phase of Apollo missions.

This turned out to be a wise decision when the rocket was hit by electrical discharges during the takeoff of Apollo 12. The Command Module, where the astronauts sit, went offline, but the Saturn V continued its flight without any major impact, under the control of the Instrument Unit. Later on the astronauts were able to bring the Command Module back online.

This Wikipedia article about the Launch Vehicle Digital Computer (LVDC) has a link to a PDF copy of the IBM maintenance instructions, where I found the IBM part number mentioned in the title of this posting.

PowerPoint VBA Code

Finding documentation about how to write VBA code for MS PowerPoint is a challenge. When it comes to writing VBA macros, most people think about number crunching with MS Excel first.

I got a presentation from our offering management containing RTC work item numbers, and I wanted to write a VBA macro to extract those numbers (to then run an RTC query to cross-check those work items in RTC itself). Should be a piece of cake, shouldn’t it? Well …

I bumped into some material here and here about the PowerPoint Object Model, but in the end this was not that helpful. At least it got me started, together with this article on Lifehacker about how to loop through slides and shapes in a PowerPoint presentation.

I started to use the VBA Development Environment in PowerPoint, and especially the Object Browser, to discover what type of objects to use. I used a lot of intuition to go fishing in the sea of classes and members. In the end I figured it out, so here is the code to get the job done: it loops through all text in all table cells in all tables and looks for numbers of 5 or 6 digits:

Sub ExtractTextFromTableCells()

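  Dim slide As Object
  Dim shape As Object
  Dim Row As Object
  Dim Cell As Object
  Dim regEx As Object
  ' Matches text that starts with 5 or 6 digits
  Dim strPattern As String: strPattern = "^\d{5,6}"
  Dim txt As String
  Dim word As String
  Dim i As Long, j As Long
  Dim listOfIds As String
  listOfIds = ""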
 
  Set regEx = CreateObject("vbscript.regexp")
  With regEx
        .Global = True
        .MultiLine = False
        .IgnoreCase = False
        .Pattern = strPattern
    End With
   
  Debug.Print "-------------------------------------------"

  For Each slide In ActivePresentation.Slides
      For Each shape In slide.Shapes
          If shape.HasTable Then
              For Each Row In shape.Table.Rows
                For Each Cell In Row.Cells
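                    ' PowerPoint separates paragraphs with vbCr and line breaks with
                    ' vbVerticalTab; turn both into spaces so Split sees every word.
                    txt = Cell.shape.TextFrame.TextRange.Text
                    txt = Replace(Replace(txt, vbCr, " "), vbVerticalTab, " ")
                    If regEx.test(txt) Then
                        Dim WrdArray() As String
                        WrdArray() = Split(txt)
                        For i = LBound(WrdArray) To UBound(WrdArray)
                            Dim WrdArray2() As String
                            WrdArray2() = Split(WrdArray(i), ",")
                            For j = LBound(WrdArray2) To UBound(WrdArray2)
                                word = Replace(WrdArray2(j), " ", "")
                                ' VBA strings have no "\n" escape, so use the vbLf constant
                                word = Replace(word, vbLf, "")
                                word = Replace(word, "|", "")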
                                If regEx.test(word) And word <> "" Then
                                    listOfIds = listOfIds & word & ","
                                End If
                            Next j
                        Next i
                    End If
                Next
              Next
          End If
      Next
  Next
  Debug.Print listOfIds
End Sub
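
The filtering logic of the macro can also be tried outside PowerPoint. The following Python sketch mirrors the idea, not the macro line by line: it assumes the cell texts have already been collected into a list of strings, and the function name extract_ids and the sample texts are made up for illustration.

```python
import re

# Same pattern as strPattern in the macro: text starting with 5 or 6 digits.
ID_PATTERN = re.compile(r"^\d{5,6}")


def extract_ids(cell_texts):
    """Collect tokens that start with 5 or 6 digits from a list of cell texts."""
    ids = []
    for text in cell_texts:
        # Like the macro, skip cells that do not start with an id.
        if not ID_PATTERN.match(text):
            continue
        # Split on whitespace (including line breaks) and commas.
        for token in re.split(r"[\s,]+", text):
            token = token.replace("|", "")
            if ID_PATTERN.match(token):
                ids.append(token)
    return ids


# Hypothetical cell contents, not taken from the real presentation.
cells = ["12345, 678901 | note", "see 54321", "123456\r22222"]
print(",".join(extract_ids(cells)))
```

Note that the pattern only anchors the start of a token, so a longer number such as 1234567 would pass the check as well.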

The VBA macro returns a comma-separated list of RTC work item ids which can easily be used in an RTC query like so: