Turn on the NEON-light. Stronger than you would expect.
ARM-NEON is a SIMD (single instruction, multiple data) instruction set for the latest ARM cores. The NEON instruction set focuses on integer (8, 16 and 32 bit) and single-precision floating-point arithmetic. It also has some unique features, such as sharing its register file with the Vector Floating Point (VFP) unit, which allows SIMD and ordinary floating-point instructions to be mixed.
To demonstrate the power of the NEON instruction set, we have reimplemented the
Reimplementing something at the assembly level is not just a simple optimization (compilers are able to do that effectively). It requires us to "rethink" the whole algorithm. Only we know the purpose of our code; the compiler's knowledge is limited to the source code. This extra knowledge allows us to make better use of the available features, rather than just translating the C++ representation into some unmaintainable assembly rubbish.
OK, so what do I do differently? First of all, the normal vector calculation. To do this, we need the 3x3 alpha channel matrix centered around the current pixel. Since the pixels are processed left-to-right and top-to-bottom, it is enough to update only the pixels shown in the following image:
This matrix is stored in the NEON registers, and we only need 3 memory loads to update it. However, this is difficult to express at the C++ level, so the C++ code always reloads the nine (3x3) pixels. To be more precise, it only loads 8 pixels, since the center pixel is unnecessary for the normal vector calculation. But still, this is more than twice as many memory loads as necessary! Once we have the matrix, we need to multiply each value by a coefficient. Since the alpha values are in the 0-255 range and the coefficients are small integers, the results always fit into a signed 16-bit short integer. We can do these (eight!) multiplications with a single SIMD instruction! Furthermore, we also need to sum the resulting values, which would normally take 7 addition instructions, but three parallel additions are enough on NEON (note: 2^3 == 8 values).
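To make the steps above a bit more concrete, here is a minimal scalar C++ sketch of the idea. It is not the actual NEON assembly: the names (AlphaWindow, slideRight, neighbours, weightedNeighbourSum) are made up for illustration, and the coefficients are an assumed Sobel-style kernel, since the real values are not listed here. The loops only model what a single 8-lane multiply and three pairwise-add steps would do.

```cpp
#include <array>
#include <cstdint>

// 3x3 window of alpha values; a[row][column], centred on the current pixel.
struct AlphaWindow {
    std::array<std::array<uint8_t, 3>, 3> a{};
};

// Moving one pixel to the right: shift the columns and load only the three
// new values on the right edge (3 loads instead of 8).
void slideRight(AlphaWindow& w, uint8_t top, uint8_t middle, uint8_t bottom)
{
    const std::array<uint8_t, 3> incoming = { top, middle, bottom };
    for (int row = 0; row < 3; ++row) {
        w.a[row][0] = w.a[row][1];
        w.a[row][1] = w.a[row][2];
        w.a[row][2] = incoming[row];
    }
}

// The eight neighbours in row-major order; the centre pixel is skipped.
std::array<uint8_t, 8> neighbours(const AlphaWindow& w)
{
    return { w.a[0][0], w.a[0][1], w.a[0][2],
             w.a[1][0],            w.a[1][2],
             w.a[2][0], w.a[2][1], w.a[2][2] };
}

// ASSUMPTION: illustrative Sobel-style x-gradient coefficients, not taken
// from the real filter. Worst case is 255 * (1 + 2 + 1) = 1020, so every
// intermediate value fits into a signed 16-bit integer.
constexpr std::array<int16_t, 8> kCoefficients = { -1, 0, 1, -2, 2, -1, 0, 1 };

// Scalar model of one 8-lane multiply followed by three pairwise-add steps
// (8 -> 4 -> 2 -> 1 values).
int16_t weightedNeighbourSum(const std::array<uint8_t, 8>& alpha)
{
    std::array<int16_t, 8> lanes{};
    for (int i = 0; i < 8; ++i)
        lanes[i] = static_cast<int16_t>(alpha[i] * kCoefficients[i]);

    for (int width = 4; width >= 1; width /= 2)
        for (int i = 0; i < width; ++i)
            lanes[i] = static_cast<int16_t>(lanes[i] + lanes[i + width]);

    return lanes[0];
}
```

In the sketch, each pass of the outer reduction loop stands for one parallel addition, which is why three passes are enough for eight values.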
Lighting algorithms use lots of floating-point calculations as well. For example, the normalized dot product is calculated in the following way: a NEON register can hold up to four single-precision values, which here are the three coordinates of a vector plus the length of the vector as the fourth value. For two vectors X and Y, the normalized dot product is (X1*Y1+X2*Y2+X3*Y3)/(Length1*Length2). All four multiplications can be done by a single SIMD instruction. The rest of the operations (two additions and one division) are performed by VFP floating-point instructions.
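The register layout described above can be sketched in plain C++ as follows. Again, this is only a scalar model under my own naming (Float4 and normalizedDot are hypothetical, not names from the real code): the loop stands in for the single four-lane multiply, and the last line for the scalar additions and the division.

```cpp
#include <array>

// One "register": three coordinates followed by the precomputed length.
using Float4 = std::array<float, 4>;

float normalizedDot(const Float4& x, const Float4& y)
{
    // Models one four-lane SIMD multiply: three coordinate products
    // plus Length1 * Length2 in the fourth lane.
    Float4 products{};
    for (int i = 0; i < 4; ++i)
        products[i] = x[i] * y[i];

    // Two additions and one division, done with scalar instructions.
    return (products[0] + products[1] + products[2]) / products[3];
}
```

Storing the length in the otherwise unused fourth lane means the denominator comes for free from the same multiply that produces the three coordinate products.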
Since memory loads are costly operations, reducing them is essential for high performance. Although the lighting filter requires lots of arguments, we were able to keep them in the regular ARM and NEON registers. The stack contains only the saved registers.
Since the primary aim of using assembly is to increase performance, we should talk about how the resulting code affects the runtime. According to my measurements on a Cortex-A8 ARM-NEON CPU, the new code can be 4 times faster for large filters. This work is available in Bugzilla, and hopefully it will be part of WebKit soon.