Hautaulogy
Broken Arrow
10/22/13

I recently started a Rails project that involves web page scraping to acquire data. The scraping process uses the Nokogiri gem to parse HTML into Nokogiri::HTML::Document objects, which can be manipulated and stored as data in my database. So far, so good!

But it wasn't good. Certain numerical data depended on the presence of multibyte characters in the document text, specifically "↑" and "↓", which indicated whether an integer was positive or negative.

  
    string = "↓1"

    # Turn tiny arrows into operators so I can parse strings as integers!

    # plus
    string.gsub!("↑","+")

    # minus
    string.gsub!("↓","-")

    # "↓1" should become "-1"
    integer = string.to_i

  

However, initializing the application produced this error:

  
    21:59:05 resque.1 | rake aborted!
    21:59:05 resque.1 | .../ruby-1.9.3-p448@global/gems/rake-10.1.0/lib/rake/traceoutput.rb:16:in `block in traceon': invalid byte sequence in US-ASCII (ArgumentError)
  

In the words of Don LaFontaine:

  
  ["↑","↓"].include?(@enemy)
  

There were dependencies in the codebase that couldn't handle non-ASCII characters. Thus began my short and unexpected adventure into character encoding in Ruby, guided by a brief chapter of the ol' Pickaxe (Chapter 17, "Character Encoding", which is unfortunately not available in the free web extracts).
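To see why a US-ASCII-only code path chokes on these characters, it can help to look at the bytes. The following is a minimal sketch, not taken from the project: the arrow is written as a `\u` escape so the snippet itself stays ASCII, and `force_encoding` is used only to simulate a library that assumes US-ASCII.

```ruby
# Minimal sketch: inspect the bytes behind the down arrow.
arrow = "\u2193" # U+2193 DOWNWARDS ARROW

utf8_bytes = arrow.bytes # three bytes in UTF-8

# Relabel the same bytes as US-ASCII without converting them,
# which is roughly what a US-ASCII-only code path ends up seeing.
as_ascii = arrow.dup.force_encoding("US-ASCII")

puts arrow.encoding           # UTF-8
puts utf8_bytes.inspect       # [226, 134, 147]
puts arrow.valid_encoding?    # true
puts as_ascii.valid_encoding? # false
```

Every one of those three bytes is above 127, so none of them is a valid US-ASCII character, which is consistent with the "invalid byte sequence in US-ASCII" message.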

Ruby allows non-ASCII characters in a source file if you declare the file's encoding with a "magic comment" at the top of the script, like so:

  
    # encoding: ISO-8859-1
    puts "Olé!"
  
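One caveat when picking an encoding: ISO-8859-1 (Latin-1) covers "é" but has no code points for the arrows, whereas UTF-8 can represent both. Here is a hedged sketch that checks this with `String#encode`. The helper name `representable_in?` is made up for illustration.

```ruby
# Hypothetical helper: can `text` be converted to the named encoding?
def representable_in?(text, encoding_name)
  text.encode(encoding_name)
  true
rescue Encoding::UndefinedConversionError
  # Raised when a character has no equivalent in the target encoding.
  false
end

ole   = "Ol\u00E9!" # "Ole!" with an acute accent on the e
arrow = "\u2193"    # down arrow

puts representable_in?(ole, "ISO-8859-1")   # true
puts representable_in?(arrow, "ISO-8859-1") # false
puts representable_in?(arrow, "UTF-8")      # true
```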

I eventually grew impatient sifting for an alternative encoding that would work for "↑" and "↓". Instead of using a "magic comment", I decided to see if I could encode the arrows in the string as HTML entities.

One of the joys of programming in Ruby is the plethora of open source libraries available to tackle confounding problems such as these. The HTMLEntities gem is a convenient Swiss Army knife for encoding and decoding HTML entities.

My predicament with "↑" and "↓" was over. Phew!

  
    require 'htmlentities'

    # Stand-in for text scraped from the page.
    string = "↓1"

    # Encode tiny arrows as HTML entities if they're present.
    coder = HTMLEntities.new
    string = coder.encode(string, :hexadecimal)

    # Turn encoded tiny arrows into operators so I can parse strings as integers!
    # plus (the up arrow is U+2191)
    string.gsub!("&#x2191;","+")

    # minus (the down arrow is U+2193)
    string.gsub!("&#x2193;","-")

    # "↓1" should become "-1" and not break anything!
    integer = string.to_i
  
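For comparison, here is an alternative sketch that sidesteps both the magic comment and the gem by writing the arrows as `\u` escapes, so the source file contains only ASCII. This is not the approach used above. `arrow_to_i` and the two constants are hypothetical names, and the sketch assumes the scraped text is already UTF-8.

```ruby
# Hypothetical helper: convert arrow-prefixed strings to signed integers.
UP_ARROW   = "\u2191" # upwards arrow
DOWN_ARROW = "\u2193" # downwards arrow

def arrow_to_i(text)
  # Map the up arrow to "+" and the down arrow to "-", then parse.
  text.tr(UP_ARROW + DOWN_ARROW, "+-").to_i
end

puts arrow_to_i("#{DOWN_ARROW}1") # -1
puts arrow_to_i("#{UP_ARROW}3")   # 3
```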
