HTML DOM in Ruby with Nokogiri
I recently needed to screen scrape a website. The page I was trying to parse declared itself as “XHTML 1.0 Transitional”. XHTML? Should be easy. Parse the doc, use XPath and I’ll be done.
If you search for “xml parser ruby”, the first result you will get is REXML. I’ve read comparisons that point out that libxml is several orders of magnitude faster, but my first attempt used REXML anyway. It failed miserably because the web page I was parsing was not actually valid XHTML. After I learned it was broken, I ran it through W3C’s validation service and discovered the site had over 100 errors. XML parsing was out.
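To show why a strict XML parser gives up on a page like that, here is a minimal sketch using REXML. The markup string is made up for illustration (a `<b>` that is never closed); it is not taken from the site I was scraping.

```ruby
require 'rexml/document'

# Hypothetical fragment: the <b> tag is never closed, so this is not
# well-formed XML even though a browser would render it without complaint.
broken = '<p><b>Requirement 1</p>'

begin
  REXML::Document.new(broken)
  result = :parsed
rescue REXML::ParseException
  # REXML refuses to build a tree from markup that is not well-formed
  result = :parse_error
end

puts result
```

One mismatched tag is enough to stop the parse, so a page with over 100 validation errors needs a more forgiving HTML parser.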
That led me to search for “HTML DOM ruby”, which led me to Hpricot. I never actually tried that parser because Andrew Kavanaugh pointed me to Nokogiri. Nokogiri is interesting because it gives you two different ways to find the elements you are interested in: XPath or CSS selectors. Lately I’ve been writing a lot of CSS selectors, so I went that route. The document I was searching through had something like the following HTML:
...
<div class="section">
  <h4>Section 1</h4>
  <p>
    <sup class="requirement">1</sup>Requirement 1 descriptive sentence.
    <sup class="requirement">2</sup>Requirement 2 descriptive sentence.
    <sup class="footnote"><a href="#footnote1">footnote 1</a></sup>
    More descriptive text.
    <sup class="requirement">3</sup>Requirement 3 descriptive sentence.
  </p>
</div>
I needed to translate this into a format I could insert into my database. I didn’t care about footnotes, spacing or anything other than the raw text. I needed it to look something like:
Section 1.1, Requirement 1 descriptive sentence. Section 1.2, Requirement 2 descriptive sentence. More descriptive text. Section 1.3, Requirement 3 descriptive sentence.
I used the following code to get the section number I was after:
@section = doc.at('div.section h4').inner_html
I then used the following code to get the subsection number and the text associated with it:
doc.css('div.section sup.requirement').each do |element|
  # Get the requirement subsection number (the text inside the sup)
  requirement = element.text.strip
  # We want all the text between this sup and the next requirement sup,
  # so walk the following siblings and collect every text node until we
  # run into the next sup class='requirement' node. Element nodes such as
  # the footnote sup are skipped.
  node = element.next
  text = ""
  while !node.nil? && node['class'] != 'requirement'
    text = (text + " " + node.to_s.strip).strip if node.text?
    node = node.next
  end
  puts @section + "." + requirement + ", " + text
end
Man, I love these “whatever.each do |element|” style blocks. Very powerful. When I ran this for the first time, I encountered an oddity I didn’t quite understand. Even though I was calling strip to eliminate the white space, I was getting a row with stray whitespace still in it, like:
Section 1.2, Requirement 2 descriptive sentence. More descriptive text.
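It turns out that the string I got back from calling to_s on a node contained non-breaking spaces (U+00A0, the bytes \302\240 in UTF-8), most likely from &nbsp; entities in the page. They look like whitespace, but the normal strip function does not remove them. I overrode the strip function of the String class to handle them, and all worked well.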
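class String
  # Keep the original strip around in case it is still needed
  alias_method :strip_old, :strip
  # Strip regular whitespace plus the non-breaking space (U+00A0, which is
  # \302\240 in UTF-8) from both ends of the string
  def strip
    gsub(/\A[\u00A0\s]+|[\u00A0\s]+\z/, '')
  end
  # Same as strip, but in place; returns nil if nothing changed
  def strip!
    before = dup
    gsub!(/\A[\u00A0\s]+|[\u00A0\s]+\z/, '')
    before == self ? nil : self
  end
end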
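If reopening String feels too heavy-handed, the same cleanup can be sketched as a standalone helper. This is an alternative I did not use at the time; the names `NBSP_EDGES` and `strip_with_nbsp` are made up for this example.

```ruby
# Hypothetical helper: strips ASCII whitespace and U+00A0 from both ends
# of a string without touching String#strip itself.
NBSP_EDGES = /\A[\s\u00A0]+|[\s\u00A0]+\z/

def strip_with_nbsp(str)
  str.gsub(NBSP_EDGES, '')
end

# Sample text padded with non-breaking spaces, as the scraped nodes were
padded = "\u00A0 More descriptive text. \u00A0"

builtin = padded.strip            # unpatched strip leaves the U+00A0 padding alone
cleaned = strip_with_nbsp(padded) # helper removes it

puts builtin.length
puts cleaned
```

The trade-off is that every call site has to remember to use the helper, whereas the override fixes strip everywhere, for better or worse.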
In the past, this would have been something I’d have just thrown together in Java. If I ever need to do something like this again in Java, I’m going to try out an HTML parser called Cobra. It even handles JavaScript calls in the page (like document.write).