Here's how I optimized png.rb to make PNG saving over 100 times faster using RubyInline. This is a good model to follow to make any ruby program faster.
The basic steps are to optimizing code are:
ruby-prof is a profiler for ruby that uses event_hook instead of set_trace_func to give a fast profile. It also has a really nifty cross-referenced HTML call graphs that lets you easily spot the slow parts of your code.
ruby-prof comes in gem form, gem install ruby-prof.
Below I've snipped out sections of the ruby-prof's output to show what parts of the profile I found most relevant.
A profiling benchmark is important, your profile results need to be consistent and repeatable. Attempting to profile code that you keep changing will only lead you astray.
I used this code as my profiling benchmark:
require 'png'
class PNGProfile
def draw
canvas = PNG::Canvas.new 400, 400
png = PNG.new canvas
png.to_blob
end
end
pp = PNGProfile.new
10.times do pp.draw end
I used two commands to benchmark this code:
time ruby -Ilib example/profile.rb
For asking the simple question, "is it faster?" and:
rm profile.html; \ ruby -Ilib -S ruby-prof -p graph_html example/profile.rb > profile.html; \ open profile.html
For generating a full profile to see exactly what changed.
RubyInline allows you to easily embed other languages like C into ruby code. It handles argument conversions, shared library initialization, compiling, linking, and loading for you leaving you with nothing to do but write C code.
I'll be using RubyInline to optimize png.rb.
If you want to play along at home, download and unpack png-1.0.0.tgz and save the profile benchmark code into example/.
I also added PNG#to_blob to eliminate the need to write the file to disk, it looks almost exactly like PNG#save:
def to_blob
blob = []
blob << [137, 80, 78, 71, 13, 10, 26, 10].pack("C*") # PNG signature
blob << PNG.chunk('IHDR',
[ @height, @width, @bits, 6, 0, 0, 0 ].pack("N2C5"))
# 0 == filter type code "none"
data = @data.map { |row| [0] + row.map { |p| p.values } }.flatten
blob << PNG.chunk('IDAT', Zlib::Deflate.deflate(data.pack("C*"), 9))
blob << PNG.chunk('IEND', '')
blob.join
end
I added my RubyInline extensions in png.rb and modified png.rb as appropriate, so while you play along you should do that too.
Before you jump down to C you need to optimize the ruby code as much as possible. Use the profiler to determine the slow spots and optimize until you can't optimize anymore.
When done, you should end up with most of your time in one or two methods:
| %Total | %Self | Total | Self | Children | Calls | Name |
|---|---|---|---|---|---|---|
| 61.55 | 0.04 | 61.51 | 10/10 | PNGProfile#draw | ||
| 98.34% | 0.06% | 61.55 | 0.04 | 61.51 | 10 | PNG#to_blob |
| 1.85 | 0.07 | 1.78 | 10/4013 | Array#map | ||
| 0.27 | 0.27 | 0.00 | 10/10 | <Class::Zlib::Deflate>#deflate | ||
| 42.32 | 25.52 | 16.80 | 30/60 | Array#pack | ||
| 16.41 | 12.26 | 4.15 | 10/10 | Array#flatten | ||
| 0.00 | 0.00 | 0.00 | 10/90 | Array#join | ||
| 0.66 | 0.01 | 0.65 | 30/30 | <Class::PNG>#chunk | ||
| 0.00 | 0.00 | 0.00 | 40/46 | Array#<< |
Ryan Davis had already optimized png.rb to the point that 98% of the time was spent in one method, PNG#to_blob. This was as fast as it could get before sacrificing code clarity. If your profile doesn't look like this you've got some work to do before resorting to C. So get crackin!
In the profile you can see that most of the time taken to draw a PNG is used up by Array#pack and Array#flatten. Array#pack is consuming two-thirds of our time, so that is where we'll focus our attention first.
Here is Array#pack's chunk of the profile:
| %Total | %Self | Total | Self | Children | Calls | Name |
|---|---|---|---|---|---|---|
| 42.32 | 25.52 | 16.80 | 30/60 | PNG#to_blob | ||
| 0.00 | 0.00 | 0.00 | 30/60 | <Class::PNG>#chunk | ||
| 67.61% | 40.77% | 42.32 | 25.52 | 16.80 | 60 | Array#pack |
| 16.80 | 16.80 | 0.00 | 6404210/6433184 | Integer#to_int |
Wow! 6.4 million calls to Integer#to_int? Where is that coming from? Looking in pack.c we find:
static unsigned long
num2i32(x)
VALUE x;
{
x = rb_to_int(x); /* is nil OK? (should not) */
if (FIXNUM_P(x)) return FIX2LONG(x);
if (TYPE(x) == T_BIGNUM) {
return rb_big2ulong_pack(x);
}
rb_raise(rb_eTypeError, "can't convert %s to `integer'", rb_obj_classname(x));
return 0; /* not reached */
}
Pack is making sure code like [Rational(2, 1)].pack 'C' will
work, but we know we have only Fixnum objects in our Arrays, so we don't need
all those checks. I'll just take them out:
class Array # :nodoc:
require 'inline'
inline do |builder|
builder.c <<-EOC
static void
fast_pack() {
VALUE res;
long i;
char c;
res = rb_str_buf_new(0);
for (i = 0; i < RARRAY(self)->len; i++) {
c = FIX2LONG(RARRAY(self)->ptr[i]);
rb_str_buf_cat(res, &c, sizeof(char));
}
return res;
}
EOC
end
rescue
def fast_pack
pack 'C*'
end
end
Now we run the profile again and see what happened to PNG#to_blob:
| %Total | %Self | Total | Self | Children | Calls | Name |
|---|---|---|---|---|---|---|
| 20.48 | 0.04 | 20.44 | 10/10 | PNGProfile#draw | ||
| 94.81% | 0.19% | 20.48 | 0.04 | 20.44 | 10 | PNG#to_blob |
| 1.81 | 0.04 | 1.77 | 10/4013 | Array#map | ||
| 0.24 | 0.24 | 0.00 | 10/10 | |||
| 0.00 | 0.00 | 0.00 | 20/50 | Array#pack | ||
| 16.50 | 12.35 | 4.15 | 10/10 | Array#flatten | ||
| 1.23 | 1.23 | 0.00 | 10/10 | Array#fast_pack | ||
| 0.00 | 0.00 | 0.00 | 10/90 | Array#join | ||
| 0.66 | 0.00 | 0.66 | 30/30 | <Class::PNG>#chunk | ||
| 0.00 | 0.00 | 0.00 | 40/46 | Array#<< |
Much better! PNG#to_blob is now three times faster!
There's still room for improvement though, what is Array#flatten doing that's taking so long?
| %Total | %Self | Total | Self | Children | Calls | Name |
|---|---|---|---|---|---|---|
| 20.48 | 0.04 | 20.44 | 10/10 | PNGProfile#draw | ||
| 16.50 | 12.35 | 4.15 | 10/10 | PNG#to_blob | ||
| 76.39% | 57.18% | 16.50 | 12.35 | 4.15 | 10 | Array#flatten |
| 4.15 | 4.15 | 0.00 | 1600000/1602057 | Fixnum#== |
1.6 million calls to Fixnum#==, why is it doing that? Looking in array.c we find:
static long
VALUE
rb_ary_includes(ary, item)
VALUE ary;
VALUE item;
{
long i;
for (i=0; ilen; i++) {
if (rb_equal(RARRAY(ary)->ptr[i], item)) {
/* ... */
flatten(ary, idx, ary2, memo)
VALUE ary;
long idx;
VALUE ary2, memo;
{
VALUE id;
long i = idx;
long n, lim = idx + RARRAY(ary2)->len;
id = rb_obj_id(ary2);
if (rb_ary_includes(memo, id)) {
/* ... */
So Array#flatten checks for recursive arrays to prevent you from running out of
memory if you happen to give it one. (a = []; a << a; a.flatten)
I know the PNG isn't a recursive Array, so my first thought is to pull that
code out:
class Array # :nodoc:
require 'inline'
inline do |builder|
builder.include '"intern.h"'
builder.prefix <<-EOC
static inline VALUE
elt(VALUE ary, long offset) { /* from array.c */
/* paste rb_ary_elt here */
}
static void
splice(VALUE ary, long beg, long len, VALUE rpl) { /* from array.c */
/* paste rb_ary_splice here */
}
static long
flatten(VALUE ary, long idx, VALUE ary2) { /* from array.c */
VALUE id;
long i = idx;
long n, lim = idx + RARRAY(ary2)->len;
splice(ary, idx, 1, ary2);
while (i < lim) {
VALUE tmp;
tmp = rb_check_array_type(elt(ary, i));
if (!NIL_P(tmp)) {
n = flatten(ary, i, tmp);
i += n; lim += n;
}
i++;
}
return lim - idx - 1; /* returns number of increased items */
}
EOC
builder.c <<-EOC
static void
fast_flatten() { /* from array.c */
VALUE flat = rb_ary_dup(self);
long i = 0;
while (i < RARRAY(flat)->len) {
VALUE sub_ary = RARRAY(flat)->ptr[i];
VALUE tmp;
tmp = rb_check_array_type(sub_ary);
if (!NIL_P(tmp)) {
i += flatten(flat, i, tmp);
}
i++;
}
return flat;
}
EOC
# ...
end
rescue
def fast_flatten
flatten
end
end
Yuck! lots of cut and paste to pull out that small check, let's see if it did any good.
| %Total | %Self | Total | Self | Children | Calls | Name |
|---|---|---|---|---|---|---|
| 10.82 | 0.04 | 10.78 | 10/10 | PNGProfile#draw | ||
| 91.00% | 0.34% | 10.82 | 0.04 | 10.78 | 10 | PNG#to_blob |
| 6.81 | 6.81 | 0.00 | 10/10 | Array#fast_flatten | ||
| 1.85 | 0.01 | 1.84 | 10/4013 | Array#map | ||
| 0.28 | 0.28 | 0.00 | 10/10 | |||
| 0.00 | 0.00 | 0.00 | 20/50 | Array#pack | ||
| 1.19 | 1.19 | 0.00 | 10/10 | Array#fast_pack | ||
| 0.00 | 0.00 | 0.00 | 10/90 | Array#join | ||
| 0.65 | 0.00 | 0.65 | 30/30 | |||
| 0.00 | 0.00 | 0.00 | 40/46 | Array#<< |
An improvement, but not as much improvement as I was hoping for. We've dropped from 60 seconds to 11 seconds, about a 6x speedup. I bet there's more to squeeze out of this Array#flatten.
Note that calling #== took about four seconds of the time spent in Array#flatten, but removing it eliminated about ten seconds from the total in Array#fast_flatten. Part of those six seconds of time is ruby's method call overhead.
Flatten is built to handle any structure of Array, but the PNG canvas' Array has a uniform structure:
[ # [filter_id, [R, G, B, A], ...], [0, [0xff, 0xff, 0xff, 0xff], [0xff, 0xff, 0xff, 0xff]], [0, [0xff, 0xff, 0xff, 0xff], [0xff, 0xff, 0xff, 0xff]], ]
From this regular structure we can copy the chunks of the component Array into the correct spot in the flattened Array, skipping the recursion, resizing and bookkeeping that Array#flatten needs.
class Array # :nodoc:
require 'inline'
inline do |builder|
# ...
builder.c <<-EOC
static void
faster_flatten() {
VALUE flat, row, pixel;
long total_length, height, width, cur = 0, x = 0, y = 0;
height = RARRAY(self)->len;
width = RARRAY(RARRAY(self)->ptr[0])->len;
total_length = height * (width - 1) * 4 + height; /* data + filter */
flat = rb_ary_new2(total_length);
for (x = 0; x < height; x++) {
row = RARRAY(self)->ptr[x];
pixel = RARRAY(row)->ptr[0]; /* row filter */
MEMCPY(RARRAY(flat)->ptr + cur, &pixel, VALUE, 1);
cur++;
for (y = 1; y < width; y++) { /* row data */
pixel = RARRAY(row)->ptr[y];
MEMCPY(RARRAY(flat)->ptr + cur, RARRAY(pixel)->ptr, VALUE, 4);
cur += 4;
}
}
RARRAY(flat)->len = total_length;
return flat;
}
EOC
end
rescue
# ...
def faster_flatten
flatten
end
end
Much cleaner than hacking recursive Array checking out of Array#flatten. Let's see how it performs:
| %Total | %Self | Total | Self | Children | Calls | Name |
|---|---|---|---|---|---|---|
| 4.11 | 0.00 | 4.11 | 10/10 | PNGProfile#draw | ||
| 79.50% | 0.00% | 4.11 | 0.00 | 4.11 | 10 | PNG#to_blob |
| 1.83 | 0.04 | 1.79 | 10/4013 | Array#map | ||
| 0.72 | 0.72 | 0.00 | 10/10 | <Class::Zlib::Deflate>#deflate | ||
| 0.12 | 0.12 | 0.00 | 10/10 | Array#faster_flatten | ||
| 0.00 | 0.00 | 0.00 | 20/50 | Array#pack | ||
| 0.78 | 0.78 | 0.00 | 10/10 | Array#fast_pack | ||
| 0.00 | 0.00 | 0.00 | 10/90 | Array#join | ||
| 0.66 | 0.00 | 0.66 | 30/30 | <Class::PNG>#chunk | ||
| 0.00 | 0.00 | 0.00 | 40/46 | Array#<< |
That's more speedup than I was expecting! Array#faster_flatten is over fifty times faster than Array#fast_flatten, and two orders of magnitude faster than Array#flatten. Overall the time to write a PNG has been cut by an order of magnitude.
In the end, PNG#to_blob looks like this:
def to_blob
blob = []
blob << [137, 80, 78, 71, 13, 10, 26, 10].pack("C*") # PNG signature
blob << PNG.chunk('IHDR',
[ @height, @width, @bits, 6, 0, 0, 0 ].pack("N2C5"))
# 0 == filter type code "none"
data = @data.map { |row| [0] + row.map { |p| p.values } }.faster_flatten
blob << PNG.chunk('IDAT', Zlib::Deflate.deflate(data.fast_pack, 9))
blob << PNG.chunk('IEND', '')
blob.join
end
Remember how we got here, profile, tweak, profile, always making sure your results match your expectations. When my expectations didn't match my results I backed off and tried again. I also only used C after removing the pure-ruby options for optimization.
If we wanted to do further optimization, Array#map would be the next place to look. Optimization of Array#map would involve replacing part of PNG#to_blob with C code that would add the filter ids and extracted the colors from the canvas rows into a flat Array. I'll leave this optimization as an exercise for you.