
HTML DOM in Ruby with Nokogiri

Posted by jason on Feb 24, 2010 in Ruby

I recently needed to screen-scrape a website. The page I was trying to parse declared itself as “XHTML 1.0 Transitional”. XHTML? Should be easy. Parse the doc, use XPath, and I’ll be done.

If you search for “xml parser ruby”, the first result you will get is REXML. I’ve read comparisons that point out that libxml is several orders of magnitude faster, but my first attempt used REXML. It failed miserably because the web page I was parsing was not actually valid XHTML. After I learned it was broken, I ran it through W3C’s validation service and discovered the site had over 100 errors. XML parsing was out.
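To see why a strict XML parser gives up on pages like this, here is a minimal sketch. The markup fragment and the well_formed? helper are made up for illustration (they are not from the site I was scraping), and it assumes REXML is available, which it is in a standard Ruby install.

```ruby
require 'rexml/document'

# Hypothetical fragment: <br> is never closed. Browsers accept this,
# but it is not well-formed XML.
broken = '<div><p>First line<br>Second line</p></div>'
# The same fragment with a self-closing <br/>, which is well-formed.
fixed  = '<div><p>First line<br/>Second line</p></div>'

# Hypothetical helper: true if REXML can parse the markup, false otherwise.
def well_formed?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

puts well_formed?(broken)
puts well_formed?(fixed)
```

One unclosed tag is enough to stop the parse, so a page with over 100 validation errors never had a chance.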

That led me to a search for “HTML DOM ruby”, which led me to hpricot. I didn’t actually try that parser, because Andrew Kavanaugh pointed me to Nokogiri. Nokogiri is interesting because it gives you two different ways to find the elements you are interested in: XPath or CSS selectors. Lately I’ve been writing a lot of CSS selectors, so I went that route. The document I was searching through had something like the following HTML:

...
<div class="section">
<h4>Section 1</h4>
<p>
  <sup class="requirement">1</sup>Requirement 1 descriptive sentence.
  <sup class="requirement">2</sup>Requirement 2 descriptive sentence.
  <sup class="footnote"><a href="#footnote1">footnote 1</a></sup>
    &nbsp;&nbsp;&nbsp;
    More descriptive text.
  <sup class="requirement">3</sup>Requirement 3 descriptive sentence.
</p>
</div>

I needed to translate this into a format I could insert into my database. I didn’t care about footnotes, spacing, or anything other than the raw text. I needed it to look something like:

Section 1.1, Requirement 1 descriptive sentence.
Section 1.2, Requirement 2 descriptive sentence. More descriptive text.
Section 1.3, Requirement 3 descriptive sentence.

I used the following code to get the section number I was after:

@section = doc.at('div.section h4').inner_html

I then used the following code to get the subsection number and the text associated with it:

doc.css('div.section sup.requirement').each do |element|
  # Get the requirement subsection number (the text inside the <sup>)
  @requirement = element.inner_html.strip

  # We are interested in all the text between one sup and the next,
  # so we collect every text node until we run into the start of the next
  # sup class='requirement' node (or run out of siblings)
  @node = element.next
  @text = ""
  while @node != nil && (@node['class'] != 'requirement') do
    if (@node.text?) then
      @text = @text + " " + @node.to_s.strip
    end
    @text = @text.strip
    @node = @node.next
  end
  puts @section + "." + @requirement + ", " + @text
end

Man, I love these “whatever.each do |element|” style blocks. Very powerful. When I ran this for the first time, I encountered an oddity I didn’t quite understand. Even though I was calling strip to eliminate the whitespace, I was getting a row that looked like:

Section 1.2, Requirement 2 descriptive sentence.    More descriptive text.

It turns out that, in my case, calling to_s on a node turned each &nbsp; into a non-breaking space (U+00A0, which is the two bytes \302\240 in UTF-8). It looks like whitespace, but the normal strip function does not remove it. I modified the strip function of the String class and all worked well.

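class String
  alias_method :strip_old, :strip

  # Like the built-in strip, but also removes non-breaking spaces (U+00A0).
  # \A and \z anchor to the whole string; ^ and $ would match at every line.
  # No | inside the brackets: in a character class it is a literal pipe.
  def strip
    gsub(/\A[\s\u00A0]+|[\s\u00A0]+\z/, '')
  end

  # In-place version. gsub! returns nil when nothing was removed,
  # which matches the behavior of the built-in strip!.
  def strip!
    gsub!(/\A[\s\u00A0]+|[\s\u00A0]+\z/, '')
  end
end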
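Reopening String changes strip for every library in the process, so a standalone helper may be a gentler option. A minimal sketch, where the method name strip_nbsp and the sample string are my own inventions for illustration:

```ruby
# Hypothetical helper: trims ordinary whitespace and non-breaking spaces
# (U+00A0) from both ends of a string, without touching String#strip.
def strip_nbsp(text)
  text.gsub(/\A[\s\u00A0]+|[\s\u00A0]+\z/, '')
end

# Made-up sample: non-breaking spaces on both ends, as &nbsp; would produce.
sample = "\u00A0\u00A0 More descriptive text. \u00A0"

# The built-in strip stops at the non-breaking spaces, so nothing is removed.
puts sample.strip == sample

# The helper removes them along with the regular spaces.
puts strip_nbsp(sample)
```

Calling strip_nbsp on each text node instead of strip would give the same result as the patched String class, without the global side effect.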

In the past, this would have been something I’d have just thrown together in Java. If I ever need to do something like this again in Java, I’m going to try out an HTML parser called Cobra. It even handles JavaScript calls in the page (like document.write).
