Screen Scraping Tools for Rails Developers

Screen scraping is defined in Wikipedia as "a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding a fully-fledged web browser, such as Internet Explorer or Mozilla Firefox".  There are a number of tools available to the Rubyist to accomplish almost any scenario imagined. 

I recently had a project where almost the entirety of the application relied on screen scraping, because the site exposed no public API for the data I needed.  So, I dug into RubyGems to see what I could find.  There are two main types of gems available: browser emulators and web crawlers (for lack of a better term).

Emulators generally follow the Capybara model: pick your driver (Selenium, Watir, etc.).  Most, if not all, of these were developed with the end goal of automating unit/functional/integration tests.  They can also be used in conjunction with the headless gem if you don't want to see the 'browser' on your screen during testing.  One variation on this theme is the poltergeist gem, a Capybara driver that drives navigation via PhantomJS.  The advantage of this combination is that PhantomJS bundles its own headless WebKit, so you get headless browsing without the xvfb library and headless gem.
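
Here is a minimal sketch of that setup, assuming the capybara and poltergeist gems are installed and PhantomJS is on your PATH (the URL and field names are made up for illustration):

require 'capybara'
require 'capybara/poltergeist'

Capybara.register_driver :poltergeist do |app|
  Capybara::Poltergeist::Driver.new(app, js_errors: false)
end

session = Capybara::Session.new(:poltergeist)
session.visit 'http://example.com/login'
session.fill_in 'username', with: 'me'
session.click_button 'Sign in'
puts session.html   # the rendered DOM, after any JavaScript has run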

Web crawlers, on the other hand, implement 'browsing' via an HTTP stack and an XML/HTML parser.  The good ones utilize the excellent nokogiri gem to handle parsing.  The one obvious piece missing from this stack is a JavaScript engine, so the DOM you get is the raw server response rather than whatever the page builds at runtime.  This makes these tools very difficult to use on an AJAX-laden site.  The best of them is Mechanize.
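
A minimal sketch of the crawler style, assuming the mechanize gem (the URL and selectors here are invented for illustration):

require 'mechanize'

agent = Mechanize.new
page  = agent.get 'http://example.com/reports'
page.search('table.results td.title').each { |cell| puts cell.text.strip }   # Nokogiri under the hood
next_link = page.link_with(text: 'Next')
page = next_link.click if next_link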

I have to say, unequivocally, that you should try everything possible to make Mechanize your chosen tool.  Every facet of scraping data with it seemed to be about an order of magnitude faster than using the Capybara/driver combination.  Part of the reason for this is by design: the emulated browsers, built as testing tools, were not necessarily optimized for speed.

The biggest challenge I encountered in building my application was the requirement to download some documents stored on the site I was scraping and archive them to an Amazon S3 store.  I thought this would be a fairly straightforward task, as Mechanize provides what it calls 'pluggable parsers'.  This is just a mapping of which Mechanize parser should handle a given MIME type encountered on a page.  The code block below tells Mechanize that whenever the application/pdf MIME type is encountered, it should use the Mechanize::Download class to download the PDF.

agent = Mechanize.new
# register the parser before fetching: any application/pdf response is wrapped in Mechanize::Download
agent.pluggable_parser.pdf = Mechanize::Download
agent.get 'http://samplepdf.com'

The pdf document can now be saved using:

doc = agent.get 'http://samplepdf.com/sample.pdf'   # returns a Mechanize::Download
doc.save(mylocalfilename)

Great, I thought.  I'll still be able to use the fast tool.  As I delved further into the site I was scraping, though, I discovered that I was not given a direct link to the PDFs to be downloaded.  Slogging through the JavaScript revealed that clicking the download button ran a function that built a form from the required parameters (document number, case number, preferences, etc.).  Mechanize::Form to the rescue...


# pull the arguments out of the existing form's onsubmit handler and strip the quotes
params = @agent.page.at('form').attributes["onsubmit"].value.match(/\((.*?)?\)/)[1].split(",").each { |e| e.gsub!(/'/, "") }

builder = Nokogiri::HTML::Builder.new do |doc|
  doc.form_(:enctype => "multipart/form-data", :method => "POST", :action => params[0], :id => id) do
    doc.input :type => "hidden", :name => "caseid",      :value => params[1]
    doc.input :type => "hidden", :name => "de_seq_num",  :value => params[2]
    doc.input :type => "hidden", :name => "got_receipt", :value => params[3]
  end
end
node = Nokogiri::HTML(builder.to_html)
f2 = Mechanize::Form.new(node.at('form'), @agent, @agent.page)
doc = f2.submit   # submit the newly built form

# sometimes the document is loaded into an iframe
if doc.is_a?(Mechanize::Page)
  src = doc.at('iframe')['src']
  doc = @agent.get src
end

doc.save(document_name)

where:

  • params - the arguments the site's own form submits, parsed out for reuse by the new form
  • builder - a Nokogiri class used to build XML/HTML documents
  • node - the form element contained in the builder's output
  • f2 - the new form; note how it is attached to the Mechanize agent's current page during initialization
  • doc - the PDF to be downloaded
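
The one step the snippets above leave out is the archiving mentioned at the start.  A minimal sketch of how that might look, assuming the aws-sdk-s3 gem with credentials and region taken from the environment (the bucket name, key layout, and archive_to_s3 helper are mine, not from the original application):

require 'tmpdir'
require 'aws-sdk-s3'

# hypothetical helper: save the Mechanize::Download locally, then push it to S3
def archive_to_s3(download, key)
  local_path = File.join(Dir.tmpdir, File.basename(key))
  download.save(local_path)   # write the PDF to disk
  Aws::S3::Resource.new.bucket('my-archive-bucket').object(key).upload_file(local_path)
ensure
  File.delete(local_path) if File.exist?(local_path)
end

archive_to_s3(doc, "cases/#{params[1]}/#{document_name}")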

Using these techniques, I was able to switch this application from the watir-webdriver Capybara driver to Mechanize and achieve a 12x increase in performance.  Now, if I can just figure out what to do with the AJAX interactions...
