How can I make my ARM device fast?

In a previous article, I described common performance pitfalls that ARM devices typically succumb to.  Here, I will lay out how to avoid most of those problems.

Tip 1: Give your designers representative hardware to test on.

The latest ARM hardware has roughly the same performance as a good netbook.  So give your UI designers netbooks or nettops (based on Atom, ION or AMD Fusion), if not as their main workstation, then as a performance-appropriate test platform.  They aren’t very expensive, won’t take up much desk space, and can usually share the designer’s existing keyboard, mouse and monitor through a KVM switch.

This will encourage them to write efficient software in the first place, so do this as soon as possible when setting up the project.

Even better is to give them the real hardware to try out, but a netbook has the advantage of being usable in a desktop-like way, so it can be used for debugging.

Tip 2: Understand, before specifying hardware, what you need it to do.

Do you want simple 2D graphics?  Video?  Alpha blending (and is it premultiplied or not)?  Alpha blending on top of video (which probably requires “textured video”)?  3D?  Two-way video?  High definition?  Dual displays?  Touch gestures – with or without stylus?  ClearType?  More than one of these simultaneously?

What about sound?  Security?  Updates?  Recovery?  Power consumption?  Battery recharging?  Ethernet?  Wi-Fi?  Bluetooth?  USB (as host and/or device)?  GPS (does it need Assisted GPS)?  Mobile data?

What happens if you need to use a lot of textures, a lot of pixmaps, or one very large pixmap or texture?  A very prominent and popular website uses several extremely tall images, over 20,000 pixels on one side, to store its customisable skins, so this is a relevant question even if you “only” want a Web browser.
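To see why a very tall image is awkward, here is a rough sketch of how it might have to be split into tiles.  The 2048-pixel limit and the 64-pixel width are assumptions made up for illustration; the real limit is whatever your GPU reports for GL_MAX_TEXTURE_SIZE, and the helper name tile_plan is hypothetical.

```python
def tile_plan(width, height, max_texture_size, bytes_per_pixel=4):
    """Estimate how an image must be split to fit a GPU texture size limit."""
    # Integer ceiling division: number of tiles needed along each axis.
    cols = (width + max_texture_size - 1) // max_texture_size
    rows = (height + max_texture_size - 1) // max_texture_size
    return {
        "fits_in_one_texture": cols == 1 and rows == 1,
        "cols": cols,
        "rows": rows,
        "tiles": cols * rows,
        # Uncompressed size of the pixel data, ignoring padding and mipmaps.
        "raw_bytes": width * height * bytes_per_pixel,
    }


# Hypothetical skin image: 64 pixels wide, 20,000 pixels tall, 32-bit RGBA,
# on a GPU assumed to limit textures to 2048 pixels per side.
plan = tile_plan(64, 20000, 2048)
print(plan)
```

Under these assumptions the image cannot be uploaded as a single texture, so whatever sits between the application and the GPU has to tile it, scale it, or fall back to software.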

A lot of hardware fails to effectively support at least one of the above, but will seem attractive because it is cheap.  But trying to add even one of these features after committing to the hardware spec will be *expensive*.  Beware of false economies.

Tip 3: Pick hardware that works out of the box and can reliably support everything you will ask of it.

Get assurances from the vendor, in writing and tied into penalties for non-compliance, that the features you require will actually work at the performance you need.  Remember also that it doesn’t matter if the vendor’s own tech demos show something working, if the drivers are so unreliable or non-standards-compliant that you can’t integrate it into your product.

Get the vendor to set the hardware up with your favoured operating system, with your engineers present (and your subcontractor, if they’re doing the work) so that they can later replicate it easily.  At this stage, the features on your checklist should all be demonstrated working – individually is okay.  If more than one hardware vendor is involved, get them all in the same room for this purpose.

Do this *before* starting billable engineering effort on software integration.  Meanwhile, get your software working on those netbooks.

It should not take two weeks to figure out how to flash an OS image in and boot the device, followed by six man-months to make the 3D engine work.  That’s what you’re paying the hardware vendor for.

Tip 4: Pick middleware carefully.

Most software toolkits, such as Adobe Flash, Qt, GStreamer, and X11, offer a very rich array of capabilities to applications.  They practically guarantee that if you ask them to do something, they will do it.  You might think this is a good thing, and on the desktop it is a good thing.

But what they do not offer is any indication of whether your command will be done quickly or smoothly.  What’s worse, most toolkits don’t provide any introspection, that is, any way for you to determine what can and can’t be done efficiently.  What a toolkit does efficiently doesn’t even always match up with the hardware’s capabilities.

There is one prominent graphics API which does not share this problem: OpenGL ES.  The base API is designed specifically around the capabilities of common GPUs, and new GPUs are expected to accelerate these features as a minimum.  Extra capabilities are explicitly advertised at runtime via queryable constants and extensions – you can write simple test programs to see them yourself.
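As a small illustration of such a test, the sketch below checks an extension list of the kind GLES returns from glGetString(GL_EXTENSIONS), which is a single space-separated string.  The sample string is made up for the example and the helper name has_extension is hypothetical; obtaining the real string needs a live GL context, which is left out here.

```python
def has_extension(extension_string, name):
    """Return True if `name` appears as a whole token in the extension list."""
    # Split on whitespace and compare whole tokens.  A plain substring test
    # would wrongly match a name that is only a prefix of a longer one.
    return name in extension_string.split()


# Made-up sample; a real string comes from glGetString(GL_EXTENSIONS).
sample = "GL_OES_EGL_image_external GL_OES_rgb8_rgba8"

print(has_extension(sample, "GL_OES_EGL_image_external"))
print(has_extension(sample, "GL_OES_EGL_image"))
print("GL_OES_EGL_image" in sample)  # the naive substring test
```

The last two lines differ because the naive test is fooled by the longer extension name, which is why the token-based comparison is the safer habit.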

GLES hardware vendors generally don’t advertise features which they haven’t managed to get running acceptably fast, for one simple reason: it risks games running slowly on their hardware.  There is no such built-in restraint for most other APIs.

You can still make GLES run slowly by simply giving it too much to do, or by using a feature which is not expected to be fast (like reading back the contents of a texture or the framebuffer).  But the hardware-centric design does make it far less likely that you will be surprised by it.  At the very least, if you use GLES directly, you get to choose whether you need to read back the framebuffer.

GLES can be used for 2D UIs as well.  The iPhone uses it to provide its famously slick UI, despite (in the older versions) having a slightly feeble ARM11 CPU.  There’s absolutely no reason, in principle, why you can’t do the same.

Of course, you don’t have to use GLES if you don’t want to – after all, it can’t do absolutely everything.  But if you are choosing another API because it supports more features, you should ask yourself exactly how it implements them, and whether you’ll get the performance you require.

Some APIs are designed to run well on top of GLES, explicitly using its strengths and avoiding its weaknesses.  Others run into trouble when they stumble across something that they promise to do but the hardware can’t accelerate, and don’t think ahead to avoid a major penalty.  A select few are actually performance-tested regularly, using real application traces – Cairo is among these.

Tip 5: Insist on usable video acceleration support (if you need video).

Many vendors provide some kind of video decode accelerator, which can often cope with typical H.264 video at 720p30, and some are now appearing with claims of 1080p30 support.  ARM CPUs, even the latest multicore NEON-enabled versions, should not be expected to decode high-definition video unassisted.

You will need one of the following features to use your decoder:

1) Video-to-GLES-texture support.  This is usually done via OpenMAX and various EGL extensions, and is essentially required for accelerated Adobe Flash support.  Often called “textured video”.

2) Direct scaled output from the video decoder to the framebuffer or a hardware overlay.  This is not sufficient for Adobe Flash support, but it is useful for many relatively simple applications, including two-way video calls.  Note that you may need to scale small videos up and large videos down, so check that both work properly and look good.

3) Cached (or otherwise fast) CPU access to the video decoder’s output buffers.  This is the only truly acceptable alternative to explicit video-texture support, as you can copy (or convert) the data into any point in the graphics pipeline.

We have not yet seen an implementation in this last category – the video decoder (along with the rest of the GPU) always seems to hang directly off the main bus rather than the CPU cache, and flushing the cache is not made fast enough to make that a viable method of maintaining coherency.  Note that standard uncached access is too slow to be useful.
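For a sense of scale, this back-of-envelope sketch estimates how much decoded data the CPU would have to read every second.  It assumes the decoder emits 4:2:0 frames at 12 bits per pixel, which is an assumption for illustration; your decoder’s output format may differ.

```python
def decoded_bytes_per_second(width, height, fps, bits_per_pixel=12):
    """Raw size of the decoder's output stream, before any copy or conversion."""
    bytes_per_frame = width * height * bits_per_pixel // 8
    return bytes_per_frame * fps


rate_720p30 = decoded_bytes_per_second(1280, 720, 30)
rate_1080p30 = decoded_bytes_per_second(1920, 1080, 30)

print(f"720p30:  {rate_720p30 / 1e6:.1f} MB/s")
print(f"1080p30: {rate_1080p30 / 1e6:.1f} MB/s")
```

This counts only a single read of each frame.  Any copy or colour conversion also has to write its result somewhere, so the traffic on the memory bus is higher than these figures.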

CPU and SoC vendors take note: including the GPU in the cache hierarchy makes sense, and that’s how Sandy Bridge does it – i.e. cache the DRAM, not the CPU.  Or at least include a fast address-range cache flush and expose it via a kernel API or an unprivileged cp15 instruction.  The problem we need to avoid is a full column-address (or even row-address) latency on the memory bus for every single load instruction in a performance-critical graphics routine.

Tip 6: When in doubt, ask an expert with a track record.

That’s us.  :-)

In particular, if you have any doubt as to whether a particular toolkit or API uses the hardware (or the underlying APIs) efficiently and effectively – which is often not clear from the marketing claims or the desktop performance – we can probably investigate it for you.

Tip 7: Resist the temptation to add features once the project is underway.

Seriously, this has been the most basic rule of project management since The Mythical Man-Month was published decades ago.  Yet we still see it happening, and these projects always end up adding months to their schedule.

Once you’ve specified your platform for a specific job, expanding that job runs a very high risk that the platform won’t live up to it.  You can’t “just bolt on” a video player or a Flash plugin.  You might not even be able to run the video decoder and the 3D engine at the same time.  You might run into VRAM limitations if you add something as “simple” as permitting custom theming of the UI, or a marginally acceptable fillrate might be destroyed if you add a tiny translucent corner or shadow to a window.  So think very carefully when considering any change to the spec.
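The fillrate point can be made concrete with some simple arithmetic.  In this sketch the screen size, frame rate, overdraw factors and the 50 Mpixel/s budget are all hypothetical numbers chosen for illustration, not measurements of any real device.

```python
def required_fillrate(width, height, fps, overdraw):
    """Pixels written per second, where `overdraw` is the average number of
    times each screen pixel is drawn in one frame."""
    return width * height * fps * overdraw


def fits_budget(width, height, fps, overdraw, budget_pixels_per_second):
    """True if the required fillrate stays within the assumed GPU budget."""
    return required_fillrate(width, height, fps, overdraw) <= budget_pixels_per_second


BUDGET = 50_000_000  # hypothetical sustained fillrate, in pixels per second

# Hypothetical 800x480 panel at 60 frames per second.
before = required_fillrate(800, 480, 60, 2)    # background plus one full layer
after = required_fillrate(800, 480, 60, 2.5)   # same, plus translucent decoration

print(before, fits_budget(800, 480, 60, 2, BUDGET))
print(after, fits_budget(800, 480, 60, 2.5, BUDGET))
```

With these made-up numbers the original design just fits, and a modest amount of extra blending is enough to push it over the limit.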

One Response to “How can I make my ARM device fast?”

  1. Giri Says:

    I can’t see a single video you post; instead of a video it just says “your browser can not support the video tag” or something.  Can’t you switch or something?  It’s annoying not to be able to watch a single film.  When Lisa posts, it works, though.
