Why is my ARM device slow?

At Movial, we are often asked to help customers achieve their dream user interface and multimedia experience on a particular hardware and/or software platform. Usually we are happy to provide our expertise, but all too often those of us at the coalface issue a collective groan when we are told what is expected of us.

The task often proves difficult due to mismatches between the hardware’s capabilities or design, the software toolkit, and the expectations of what can be achieved with them. This article will explain some of the difficulties we encounter and how customers and managers – that’s you – can help to maximise the likelihood of success.

The first and most important thing to remember is that embedded or mobile devices do not have the same capabilities as the PCs that you may be used to on your desk or at home.

Typical PCs have enormously powerful CPUs and graphics processors, linked by high-bandwidth buses, and can therefore get away with a lot of brute-force-and-ignorance in the software department. This ability comes at the cost of extremely high power consumption, which can easily reach into the hundreds of watts.

But this is unacceptable in the mobile domain, where a pocket-sized battery is often required to last for days. Hardware manufacturers therefore work hard to produce highly efficient processors inside a one-watt power envelope. The latest OMAP4 and Tegra2 chips are roughly as fast as a good netbook. Obviously, older chips – which may be cheaper – will have even less performance.

This all means that for a good user experience, the available hardware must be used to maximum efficiency, and special features of the hardware must match up to what is required. When this is not the case, your device will be slow.

The most obvious problem is the use of high-level interpreted scripting languages, such as JavaScript, for core functionality. Interpreted execution simply cannot use the ARM CPU efficiently. Even Java is usually preferable, since many ARM CPUs have special support for Java bytecode (Jazelle), and modern JVMs are often well optimised for the architecture. While often harder to use, a compiled language will usually be a lot faster than an interpreted one, all other things being equal.

Most software toolkits, such as Adobe Flash, Qt, GStreamer, and X11, offer a very rich array of capabilities to applications. They practically guarantee that if you ask them to do something, they will do it. But what they do not offer is any indication of whether your command will be executed quickly or smoothly. What's worse, most toolkits provide no way for you to discover in advance what can and cannot be done efficiently – a facility known as introspection.

If the toolkit doesn't know how to make the hardware do something efficiently, it will do it in an inefficient way – and without telling the application that it is doing so. Usually this means retrieving all the necessary image data from the GPU (an extremely slow operation in itself), doing the job using generic routines on the CPU, and then pushing the completed image back to the GPU. Sometimes the software fallback can be run directly on the graphics memory, but since this is an uncached area, it will still be much slower than expected – the CPU cannot use its extensive latency-hiding techniques to optimise loads from uncached memory.
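
To make that round trip concrete, here is a minimal sketch of the kind of fallback path a toolkit might take internally, assuming an OpenGL ES 2.0 context. The GLES entry points are real; blend_in_software() is a hypothetical stand-in for whatever generic CPU routine the toolkit happens to use.

```c
/* A sketch of a toolkit-internal software fallback under OpenGL ES 2.0. */
#include <GLES2/gl2.h>
#include <stdint.h>
#include <stdlib.h>

extern void blend_in_software(uint8_t *pixels, int w, int h);

void fallback_composite(GLuint fbo, GLuint tex, int w, int h)
{
    uint8_t *pixels = malloc((size_t)w * h * 4);

    /* 1. Pull the image back from the GPU: this stalls the pipeline
     *    and crawls across a slow bus on typical ARM SoCs.          */
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, pixels);

    /* 2. Perform the unsupported operation with generic CPU code. */
    blend_in_software(pixels, w, h);

    /* 3. Push the finished image back to the GPU. */
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                    GL_RGBA, GL_UNSIGNED_BYTE, pixels);

    free(pixels);
}
```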

This problem is mostly hidden on desktop hardware – not only are the drivers for desktop GPUs very well-featured, but the connections between modern CPUs and GPUs are very fast, which allows software fallbacks to run relatively efficiently. These advantages are not available on typical ARM hardware.

As a concrete example, we have several times been asked to investigate why some simple blitting hardware was not accelerating alpha-blending properly, when it was being used successfully for fills and copies. We usually found that the hardware could only accelerate non-premultiplied alpha-blending, whereas the graphics framework (e.g. XRender or Qt) required premultiplied alpha-blending. The workarounds varied from coaxing some more capable part of the hardware into life, to completely replacing the hardware platform with a more capable one.
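
For reference, here is the difference between the two operations, sketched in C over single 8-bit channels. This is the standard source-over compositing arithmetic, not any particular vendor's blitter interface:

```c
#include <stdint.h>

/* Source-over blending of a single 8-bit colour channel;
 * "a" is the source alpha (0..255).                      */

/* Non-premultiplied: the source colour is scaled by alpha
 * at blend time.                                          */
static uint8_t over_nonpremul(uint8_t src, uint8_t dst, uint8_t a)
{
    return (uint8_t)((src * a + dst * (255 - a)) / 255);
}

/* Premultiplied: the source colour was already scaled by alpha
 * upstream (as XRender and Qt require), so it is added in directly. */
static uint8_t over_premul(uint8_t src, uint8_t dst, uint8_t a)
{
    return (uint8_t)(src + dst * (255 - a) / 255);
}
```

A blitter wired for one equation cannot simply be fed data meant for the other: converting a buffer between the two conventions costs a multiplication or division per channel per pixel on the CPU, which largely defeats the point of having the accelerator.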

We have also occasionally discovered that features mentioned in the hardware documentation simply did not work. This not only reduces the capability of the hardware, it also completely throws off our effort estimates as we must scrape around for alternatives and workarounds which do work correctly.

Another typical problem area is integrating video decode acceleration into a rich graphics framework. Video is one of the most demanding tasks asked of a typical mobile device, with many videos now being at 720p or 1080p resolution – roughly 28 to 62 megapixels per second at 30 frames per second – and often requiring rescaling (without blockiness!) to fit the device's screen. In a one-watt power envelope, this level of capability requires dedicated hardware acceleration.

Unfortunately, the video decoder's output buffers usually come in a variety of formats which are often not directly interpretable by the main graphics APIs, and the buffers (being uncached) cannot be read efficiently by the CPU either. After a while, writing yet another convert-YUV-to-RGB routine to run on uncached source memory, and watching it eat up nearly all of the CPU because of the memory-access inefficiency, gets a bit tiring. Even so, copying RGB data from those same uncached buffers would be more taxing still, because RGB data is larger than 4:2:0 YUV data – 16 to 32 bits per pixel versus 12.
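
For the curious, such a conversion routine looks roughly like the sketch below, here using fixed-point BT.601 coefficients and assuming planar I420 layout (real decoders may instead emit NV12, tiled, or vendor-specific formats). The inner-loop loads are exactly the accesses that hurt when the source planes are uncached.

```c
#include <stdint.h>

static uint8_t clamp8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

/* Convert a planar 4:2:0 frame (I420, even dimensions assumed) to
 * packed RGBX using fixed-point BT.601 coefficients. When the source
 * planes live in uncached decoder memory, every one of these loads
 * pays the full memory latency, and the CPU is soon saturated.     */
void i420_to_rgbx(const uint8_t *yp, const uint8_t *up, const uint8_t *vp,
                  uint8_t *rgb, int width, int height)
{
    for (int j = 0; j < height; j++) {
        for (int i = 0; i < width; i++) {
            int y = yp[j * width + i] - 16;
            int u = up[(j / 2) * (width / 2) + i / 2] - 128;
            int v = vp[(j / 2) * (width / 2) + i / 2] - 128;

            uint8_t *px = rgb + 4 * (j * width + i);
            px[0] = clamp8((298 * y + 409 * v + 128) >> 8);           /* R */
            px[1] = clamp8((298 * y - 100 * u - 208 * v + 128) >> 8); /* G */
            px[2] = clamp8((298 * y + 516 * u + 128) >> 8);           /* B */
            px[3] = 0xff;
        }
    }
}
```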

Another potentially showstopping pitfall is when the display hardware can read the video buffers, but only for display directly on the framebuffer or on some overlay structure. This is acceptable for many simple applications, but Adobe Flash requires that the video frames are sent through the whole graphics pipeline, so retrieving them from the framebuffer or an overlay is unacceptable. If your idea of "the full desktop/Web experience" includes Flash – or even just YouTube – you will need hardware designed to accommodate Flash, and that requires extremely tight integration of the video and graphics accelerators.

The above paragraph certainly goes a long way towards explaining Steve Jobs’ attitude towards Flash on the iPhone and iPad.

In a separate article to follow, I will outline how to set up your project for success, by avoiding the above traps.
