Using RubyInline for Optimization

Contents:

Here's how I optimized png.rb to make PNG saving over 100 times faster using RubyInline. This is a good model to follow to make any ruby program faster.

The basic steps are to optimizing code are:

  1. Profile
  2. Find the method taking the most time
  3. Try one change to make it faster
  4. Profile
  5. Check for improvement

Tools

ruby-prof

ruby-prof is a profiler for ruby that uses event_hook instead of set_trace_func to give a fast profile. It also has a really nifty cross-referenced HTML call graphs that lets you easily spot the slow parts of your code.

ruby-prof comes in gem form, gem install ruby-prof.

Below I've snipped out sections of the ruby-prof's output to show what parts of the profile I found most relevant.

Benchmark code

A profiling benchmark is important, your profile results need to be consistent and repeatable. Attempting to profile code that you keep changing will only lead you astray.

I used this code as my profiling benchmark:

require 'png'

class PNGProfile
  
  def draw
    canvas = PNG::Canvas.new 400, 400
    png = PNG.new canvas
    png.to_blob
  end
  
end

pp = PNGProfile.new

10.times do pp.draw end

I used two commands to benchmark this code:

time ruby -Ilib example/profile.rb

For asking the simple question, "is it faster?" and:

rm profile.html; \
ruby -Ilib -S ruby-prof -p graph_html example/profile.rb > profile.html; \
open profile.html

For generating a full profile to see exactly what changed.

RubyInline

RubyInline allows you to easily embed other languages like C into ruby code. It handles argument conversions, shared library initialization, compiling, linking, and loading for you leaving you with nothing to do but write C code.

I'll be using RubyInline to optimize png.rb.

Play along at home!

If you want to play along at home, download and unpack png-1.0.0.tgz and save the profile benchmark code into example/.

I also added PNG#to_blob to eliminate the need to write the file to disk, it looks almost exactly like PNG#save:

def to_blob
  blob = []
  blob << [137, 80, 78, 71, 13, 10, 26, 10].pack("C*") # PNG signature
  blob << PNG.chunk('IHDR',
                    [ @height, @width, @bits, 6, 0, 0, 0 ].pack("N2C5"))
  # 0 == filter type code "none"
  data = @data.map { |row| [0] + row.map { |p| p.values } }.flatten
  blob << PNG.chunk('IDAT', Zlib::Deflate.deflate(data.pack("C*"), 9))
  blob << PNG.chunk('IEND', '')
  blob.join
end

I added my RubyInline extensions in png.rb and modified png.rb as appropriate, so while you play along you should do that too.

Optimize Your Ruby

Before you jump down to C you need to optimize the ruby code as much as possible. Use the profiler to determine the slow spots and optimize until you can't optimize anymore.

When done, you should end up with most of your time in one or two methods:

%Total%SelfTotalSelfChildrenCallsName
  61.550.0461.5110/10PNGProfile#draw
98.34%0.06%61.550.0461.5110PNG#to_blob
  1.850.071.7810/4013Array#map
  0.270.270.0010/10<Class::Zlib::Deflate>#deflate
  42.3225.5216.8030/60Array#pack
  16.4112.264.1510/10Array#flatten
  0.000.000.0010/90Array#join
  0.660.010.6530/30<Class::PNG>#chunk
  0.000.000.0040/46Array#<<

Ryan Davis had already optimized png.rb to the point that 98% of the time was spent in one method, PNG#to_blob. This was as fast as it could get before sacrificing code clarity. If your profile doesn't look like this you've got some work to do before resorting to C. So get crackin!

Optimizing Array#pack

In the profile you can see that most of the time taken to draw a PNG is used up by Array#pack and Array#flatten. Array#pack is consuming two-thirds of our time, so that is where we'll focus our attention first.

Here is Array#pack's chunk of the profile:

%Total%SelfTotalSelfChildrenCallsName
  42.3225.5216.8030/60PNG#to_blob
  0.000.000.0030/60<Class::PNG>#chunk
67.61%40.77%42.3225.5216.8060Array#pack
  16.8016.800.006404210/6433184Integer#to_int

Wow! 6.4 million calls to Integer#to_int? Where is that coming from? Looking in pack.c we find:

static unsigned long
num2i32(x)
    VALUE x;
{
    x = rb_to_int(x); /* is nil OK? (should not) */
    
    if (FIXNUM_P(x)) return FIX2LONG(x);
    if (TYPE(x) == T_BIGNUM) {
        return rb_big2ulong_pack(x);
    }
    rb_raise(rb_eTypeError, "can't convert %s to `integer'", rb_obj_classname(x));
    return 0;			/* not reached */
}

Pack is making sure code like [Rational(2, 1)].pack 'C' will work, but we know we have only Fixnum objects in our Arrays, so we don't need all those checks. I'll just take them out:

class Array # :nodoc:
  
  require 'inline'
  
  inline do |builder|
    builder.c <<-EOC
      static void
      fast_pack() {
        VALUE res;
        long i;
        char c;
        
        res = rb_str_buf_new(0);
        
        for (i = 0; i < RARRAY(self)->len; i++) {
          c = FIX2LONG(RARRAY(self)->ptr[i]);
          rb_str_buf_cat(res, &c, sizeof(char));
        }
        
        return res;
      }
    EOC
  end
  
rescue
  
  def fast_pack
    pack 'C*'
  end
  
end

Now we run the profile again and see what happened to PNG#to_blob:

%Total%SelfTotalSelfChildrenCallsName
  20.480.0420.4410/10PNGProfile#draw
94.81%0.19%20.480.0420.4410PNG#to_blob
  1.810.041.7710/4013Array#map
  0.240.240.0010/10#deflate
  0.000.000.0020/50Array#pack
  16.5012.354.1510/10Array#flatten
  1.231.230.0010/10Array#fast_pack
  0.000.000.0010/90Array#join
  0.660.000.6630/30<Class::PNG>#chunk
  0.000.000.0040/46Array#<<

Much better! PNG#to_blob is now three times faster!

Optimizing Array#flatten

There's still room for improvement though, what is Array#flatten doing that's taking so long?

%Total%SelfTotalSelfChildrenCallsName
  20.480.0420.4410/10PNGProfile#draw
  16.5012.354.1510/10PNG#to_blob
76.39%57.18%16.5012.354.1510Array#flatten
  4.154.150.001600000/1602057Fixnum#==

1.6 million calls to Fixnum#==, why is it doing that? Looking in array.c we find:

  static long
  VALUE
  rb_ary_includes(ary, item)
      VALUE ary;
      VALUE item;
  {
      long i;
      
      for (i=0; ilen; i++) {
          if (rb_equal(RARRAY(ary)->ptr[i], item)) {
    
  /* ... */
    
  flatten(ary, idx, ary2, memo)
      VALUE ary;
      long idx;
      VALUE ary2, memo;
  {
      VALUE id;
      long i = idx;
      long n, lim = idx + RARRAY(ary2)->len;
      
      id = rb_obj_id(ary2);
      if (rb_ary_includes(memo, id)) {
      /* ... */

So Array#flatten checks for recursive arrays to prevent you from running out of memory if you happen to give it one. (a = []; a << a; a.flatten) I know the PNG isn't a recursive Array, so my first thought is to pull that code out:

class Array # :nodoc:

  require 'inline'
  
  inline do |builder|
    builder.include '"intern.h"'
    
    builder.prefix <<-EOC
      static inline VALUE
      elt(VALUE ary, long offset) { /* from array.c */
        /* paste rb_ary_elt here */
      }
      
      static void
      splice(VALUE ary, long beg, long len, VALUE rpl) { /* from array.c */
        /* paste rb_ary_splice here */
      }
      
      static long
      flatten(VALUE ary, long idx, VALUE ary2) { /* from array.c */
        VALUE id;
        long i = idx;
        long n, lim = idx + RARRAY(ary2)->len;
        
        splice(ary, idx, 1, ary2);
        while (i < lim) {
          VALUE tmp;
          
          tmp = rb_check_array_type(elt(ary, i));
          if (!NIL_P(tmp)) {
              n = flatten(ary, i, tmp);
              i += n; lim += n;
          }
          i++;
        }
        
        return lim - idx - 1; /* returns number of increased items */
      }
    EOC
    
    builder.c <<-EOC
      static void
      fast_flatten() { /* from array.c */
        VALUE flat = rb_ary_dup(self);
        long i = 0;
        
        while (i < RARRAY(flat)->len) {
          VALUE sub_ary = RARRAY(flat)->ptr[i];
          VALUE tmp;
          
          tmp = rb_check_array_type(sub_ary);
          if (!NIL_P(tmp)) {
            i += flatten(flat, i, tmp);
          }
          i++;
        }
        return flat;
      }
    EOC
    
    # ...
  end
  
rescue

  def fast_flatten
    flatten
  end
  
end

Yuck! lots of cut and paste to pull out that small check, let's see if it did any good.

%Total%SelfTotalSelfChildrenCallsName
  10.820.0410.7810/10PNGProfile#draw
91.00%0.34%10.820.0410.7810PNG#to_blob
  6.816.810.0010/10Array#fast_flatten
  1.850.011.8410/4013Array#map
  0.280.280.0010/10#deflate
  0.000.000.0020/50Array#pack
  1.191.190.0010/10Array#fast_pack
  0.000.000.0010/90Array#join
  0.650.000.6530/30#chunk
  0.000.000.0040/46Array#<<

An improvement, but not as much improvement as I was hoping for. We've dropped from 60 seconds to 11 seconds, about a 6x speedup. I bet there's more to squeeze out of this Array#flatten.

Note that calling #== took about four seconds of the time spent in Array#flatten, but removing it eliminated about ten seconds from the total in Array#fast_flatten. Part of those six seconds of time is ruby's method call overhead.

Optimizing Array#flatten Take Two

Flatten is built to handle any structure of Array, but the PNG canvas' Array has a uniform structure:

[
  # [filter_id, [R, G, B, A], ...],
  [0, [0xff, 0xff, 0xff, 0xff], [0xff, 0xff, 0xff, 0xff]],
  [0, [0xff, 0xff, 0xff, 0xff], [0xff, 0xff, 0xff, 0xff]],
]

From this regular structure we can copy the chunks of the component Array into the correct spot in the flattened Array, skipping the recursion, resizing and bookkeeping that Array#flatten needs.

class Array # :nodoc:
  
  require 'inline'
  
  inline do |builder|
    # ...
    
    builder.c <<-EOC
      static void
      faster_flatten() {
        VALUE flat, row, pixel;
        long total_length, height, width, cur = 0, x = 0, y = 0;
        
        height = RARRAY(self)->len;
        width = RARRAY(RARRAY(self)->ptr[0])->len;
        total_length = height * (width - 1) * 4 + height; /* data + filter */
        
        flat = rb_ary_new2(total_length);
        
        for (x = 0; x < height; x++) {
          row = RARRAY(self)->ptr[x];
          
          pixel = RARRAY(row)->ptr[0]; /* row filter */
          MEMCPY(RARRAY(flat)->ptr + cur, &pixel, VALUE, 1);
          cur++;
          
          for (y = 1; y < width; y++) { /* row data */
            pixel = RARRAY(row)->ptr[y];
            MEMCPY(RARRAY(flat)->ptr + cur, RARRAY(pixel)->ptr, VALUE, 4);
            cur += 4;
          }
        }
        RARRAY(flat)->len = total_length;
        return flat;
      }
    EOC
  end

rescue
  
  # ...
  
  def faster_flatten
    flatten
  end
  
end

Much cleaner than hacking recursive Array checking out of Array#flatten. Let's see how it performs:

%Total%SelfTotalSelfChildrenCallsName
  4.110.004.1110/10PNGProfile#draw
79.50%0.00%4.110.004.1110PNG#to_blob
  1.830.041.7910/4013Array#map
  0.720.720.0010/10<Class::Zlib::Deflate>#deflate
  0.120.120.0010/10Array#faster_flatten
  0.000.000.0020/50Array#pack
  0.780.780.0010/10Array#fast_pack
  0.000.000.0010/90Array#join
  0.660.000.6630/30<Class::PNG>#chunk
  0.000.000.0040/46Array#<<

That's more speedup than I was expecting! Array#faster_flatten is over fifty times faster than Array#fast_flatten, and two orders of magnitude faster than Array#flatten. Overall the time to write a PNG has been cut by an order of magnitude.

The result

In the end, PNG#to_blob looks like this:

def to_blob
  blob = []
  blob << [137, 80, 78, 71, 13, 10, 26, 10].pack("C*") # PNG signature
  blob << PNG.chunk('IHDR',
                    [ @height, @width, @bits, 6, 0, 0, 0 ].pack("N2C5"))
  # 0 == filter type code "none"
  data = @data.map { |row| [0] + row.map { |p| p.values } }.faster_flatten
  blob << PNG.chunk('IDAT', Zlib::Deflate.deflate(data.fast_pack, 9))
  blob << PNG.chunk('IEND', '')
  blob.join
end

Remember how we got here, profile, tweak, profile, always making sure your results match your expectations. When my expectations didn't match my results I backed off and tried again. I also only used C after removing the pure-ruby options for optimization.

Further Optimization

If we wanted to do further optimization, Array#map would be the next place to look. Optimization of Array#map would involve replacing part of PNG#to_blob with C code that would add the filter ids and extracted the colors from the canvas rows into a flat Array. I'll leave this optimization as an exercise for you.