Pixman gets NEON support

Friday, June 12th, 2009

I’ve been working on NEON fastpaths for Pixman lately, and as I write, these are being pushed upstream, hopefully in time for Pixman’s next stable release.  They complement some work already done in this area by engineers at ARM.  Some ARM hardware does use 32-bit framebuffers, but hardware constraints seem tight enough that 16-bit framebuffers remain common.  So while the ARM guys focused mostly on 32-bit framebuffers and some internal operations, we focused firmly on 16-bit framebuffers.

For those who don’t know, Pixman is a backend library shared by Cairo and X.org, which takes care of various basic 2D graphics operations when there isn’t any specific GPU support for them.  It gets pretty heavy use if you use the XRender protocol on a bare framebuffer, for example.  So optimising Pixman for the latest ARM developments will make Gecko faster, as well as any of those “fancy” compositing window managers which are all the rage these days.

Now the following operations are accelerated, all on RGB565 framebuffers (which may or may not be cached):

  • Flat rectangular fills.  (These also work on other framebuffer formats.)
  • Copying 16-bit images around.
  • Converting 24-bit xRGB images (e.g. a decoded JPEG) into the framebuffer format.
  • Flat translucent rectangles.
  • Compositing 32-bit ARGB images (e.g. a decoded PNG).
  • Glyphs and strings thereof (8-bit alpha masks, with an overall colour that might be translucent).
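
For readers who want to see what two of these operations actually compute, here is a rough scalar sketch in plain C.  It is not the NEON code, which handles many pixels per instruction; it only shows the per-pixel arithmetic for the xRGB-to-RGB565 conversion and for compositing an ARGB pixel onto an RGB565 one.  The function names are made up for illustration, and the compositing function assumes the source pixel is premultiplied by its alpha.

```c
#include <stdint.h>

/* Keep the top 5/6/5 bits of each channel of an xRGB8888 pixel. */
static uint16_t xrgb8888_to_rgb565(uint32_t s)
{
    return (uint16_t)(((s >> 8) & 0xF800u) |
                      ((s >> 5) & 0x07E0u) |
                      ((s >> 3) & 0x001Fu));
}

/* Expand RGB565 to xRGB8888, replicating the high bits into the low bits. */
static uint32_t rgb565_to_xrgb8888(uint16_t d)
{
    uint32_t r = (d >> 11) & 0x1Fu;
    uint32_t g = (d >> 5) & 0x3Fu;
    uint32_t b = d & 0x1Fu;

    r = (r << 3) | (r >> 2);
    g = (g << 2) | (g >> 4);
    b = (b << 3) | (b >> 2);
    return (r << 16) | (g << 8) | b;
}

/* Rounded (a * b) / 255 for a and b in the range 0..255. */
static uint32_t mul_div_255(uint32_t a, uint32_t b)
{
    uint32_t t = a * b + 128u;
    return (t + (t >> 8)) >> 8;
}

/* "Over" for one pixel: premultiplied ARGB8888 source onto an RGB565
 * destination.  Hypothetical helper name, for illustration only. */
static uint16_t over_8888_on_0565(uint32_t src, uint16_t dst)
{
    uint32_t inv_alpha = 255u - (src >> 24);
    uint32_t d = rgb565_to_xrgb8888(dst);
    uint32_t out = 0;

    for (int shift = 16; shift >= 0; shift -= 8) {
        uint32_t c = ((src >> shift) & 0xFFu)
                   + mul_div_255((d >> shift) & 0xFFu, inv_alpha);
        if (c > 255u)
            c = 255u;  /* only reachable if src is not premultiplied */
        out |= c << shift;
    }
    return xrgb8888_to_rgb565(out);
}
```

A fully opaque source simply replaces the destination, and a fully transparent one leaves it untouched.  Everything in between needs the destination to be read back first, which is why the framebuffer being cached or not matters so much.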

Most of the listed operations are now at least twice as fast as they were without NEON, and many come within spitting distance of available memory bandwidth on typical ARMv7 hardware.  Using a benchmark of common operations (as issued by a common Web browser visiting a popular news portal), we measured an overall doubling in performance, despite the most common drawing operations being extremely tiny and therefore difficult to optimise.

In some cases on a more synthetic benchmark, the throughput is vastly greater than that, at least when running on an uncached framebuffer (which tends to hurt generic code very badly).  The main performance techniques were to read from the framebuffer in big chunks (where required), preload source data into the cache, and then process data in decent-sized chunks per loop iteration.  This essentially removes the performance advantage of a “shadowed framebuffer”, so you can now sensibly save memory by turning it off.
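
To give a feel for the shape of such a loop, here is a minimal sketch in portable C of two of those techniques: prefetching source data ahead of use and handling a fixed-size chunk of pixels per iteration, with a scalar tail for the leftovers.  It is not the NEON implementation.  The chunk size, the prefetch distance and the function names are assumptions picked for illustration, and `__builtin_prefetch` is a GCC/Clang builtin, so it is guarded accordingly.  Reading the destination in big chunks is not shown, since a pure format conversion never reads the framebuffer.

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK_PIXELS   8   /* illustrative; not a value taken from Pixman */
#define PREFETCH_AHEAD 64  /* pixels to look ahead; a tuning guess */

/* Keep the top 5/6/5 bits of each channel of an xRGB8888 pixel. */
static uint16_t pack_565(uint32_t s)
{
    return (uint16_t)(((s >> 8) & 0xF800u) |
                      ((s >> 5) & 0x07E0u) |
                      ((s >> 3) & 0x001Fu));
}

/* Convert one row of xRGB8888 pixels to RGB565, a chunk at a time. */
static void convert_row_chunked(uint16_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;

    for (; i + CHUNK_PIXELS <= n; i += CHUNK_PIXELS) {
#if defined(__GNUC__)
        /* Ask for source data we will need a few iterations from now,
         * staying inside the row so the pointer remains valid. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(src + i + PREFETCH_AHEAD, 0, 0);
#endif
        /* Fixed trip count: easy for a compiler (or a human writing
         * NEON) to turn into a handful of wide operations. */
        for (size_t j = 0; j < CHUNK_PIXELS; j++)
            dst[i + j] = pack_565(src[i + j]);
    }

    /* Scalar tail for the pixels that do not fill a whole chunk. */
    for (; i < n; i++)
        dst[i] = pack_565(src[i]);
}
```

The tail loop matters more than it looks: as noted above, the most common drawing operations are tiny, so a lot of real requests never reach the chunked path at all.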

We also found some opportunities for reducing per-request overhead in both Pixman and X.org.  Hopefully these improvements will also be integrated upstream in the near future.