<wbr> tags confuse HTML::TreeBuilder

This cost me a few hours of my life. I have been using the Perl module HTML::TreeBuilder and ran into some very weird behavior: in some cases the look_down function returned incomplete HTML (I use it to retrieve articles from a blog). It seemed to happen whenever the retrieved HTML contained nested tables, something like this:

    <table>
    <tbody>
    <tr>
    <td>
    ....
        <table>
        <tbody>
        <td>
        ....
        </td>
        </tbody>
        </table>
    ...
    </td>
    </tr>
    </tbody>
    </table>

After a few hours of investigating and trying this and that, I noticed a suspicious <wbr> tag in my HTML and learned that it tells the browser where it may insert a line break if it needs to.

I decided to get rid of it and changed my Perl code as follows:

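    use LWP::Simple;
    use HTML::TreeBuilder;

    # get() returns undef on failure and does not reliably set $!, so report the URL instead.
    my $page = get($url) or die "Could not fetch $url";
    # Get rid of <wbr> tags; they confuse HTML::TreeBuilder and cause incomplete HTML to be retrieved,
    # especially in the case of nested tables. Also catches <WBR> and <wbr/>.
    $page =~ s/<wbr\s*\/?>//gi;
    my $tree = HTML::TreeBuilder->new_from_content($page);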

And bingo: my problem is gone!
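
If the pages you scrape write the tag in other ways (upper case, self-closing, or with attributes), a slightly more tolerant pattern may help. The following is only a sketch: strip_wbr is a hypothetical helper name of my own, not part of HTML::TreeBuilder, and I am assuming the tag never appears inside a comment or script block where it should be left alone.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): remove every <wbr> variant
# from a string of HTML before handing it to a parser.
sub strip_wbr {
    my ($html) = @_;
    # Matches <wbr>, <WBR>, <wbr/>, <wbr /> and <wbr class="x">.
    # The \b keeps the pattern from matching longer tag names that
    # merely start with "wbr".
    $html =~ s{<wbr\b[^>]*>}{}gi;
    return $html;
}

my $sample = '<td>very<wbr>long<WBR/>word<wbr />here</td>';
print strip_wbr($sample), "\n";   # prints <td>verylongwordhere</td>
```

The cleaned string can then be passed to HTML::TreeBuilder->new_from_content exactly as before.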
