Nitro-Extreme on ARM. What happens to nitro under extreme conditions?
Recently, a new branch called Nitro-Extreme has appeared in the WebKit Trac. The work is not finished yet, but it never hurts to take a look at the current revision. What is the big deal about this branch? The guys at Apple have changed the format of the core structure of JavaScriptCore: the JSValue. All JavaScript-level objects (from booleans to arrays) can be represented as JSValues, and many functions take JSValues as input and output arguments.

Until now, a JSValue was a 32 byte aligned pointer, which could also hold atomic data types when its low-order bits were non-zero. Unfortunately, double-precision floating-point numbers cannot be atomic types on 32-bit machines, because pointers are only 4 bytes long. In this branch, JSValue has been extended with another 4 bytes, so it is no longer necessary to allocate memory for doubles. This should (hopefully) reduce the amount of work performed by the garbage collector; garbage collection is an expensive operation that may take 20-30% of the total runtime. The downside of the new JSValue is that loading or storing one requires two CPU instructions instead of one, and despite cache improvements, memory accesses are still the bottleneck of CPU performance, especially on embedded systems.
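To make the trade-off concrete, here is a minimal sketch of the two encodings, assuming a little-endian 32-bit target. The struct names, tag values, and helper functions are my own illustration, not the actual JavaScriptCore JSValue code:

#include <cstdint>

// Old encoding: one machine word. JSCells are allocated on 32-byte boundaries,
// so a genuine cell pointer always has its low bits clear; setting a low bit
// marks an immediate value instead (e.g. a small integer). A double cannot be
// encoded this way, so it has to live in a garbage-collected heap cell.
struct OldValueSketch {
    uintptr_t word;                                   // 4 bytes on 32-bit ARM

    static OldValueSketch makeInt(int32_t i)          // real code would restrict this to 31-bit ints
    { return { (static_cast<uintptr_t>(static_cast<uint32_t>(i)) << 1) | 1u }; }

    bool isImmediateInt() const { return (word & 1u) != 0; }
    bool isCellPointer()  const { return (word & 1u) == 0; }
    // makeDouble() would have to allocate a heap cell and return its address.
};

// New encoding: eight bytes. The upper word acts as a type tag and a full
// double fits inline, so numbers no longer need heap cells (less GC work),
// at the price of two loads/stores per value instead of one.
struct NewValueSketch {
    static constexpr uint32_t IntTag  = 0xFFFFFFFFu;  // hypothetical tags; a real engine picks
    static constexpr uint32_t CellTag = 0xFFFFFFFEu;  // patterns that never occur as the high
                                                      // word of a stored double
    union {
        double asDouble;                                    // spans the full 8 bytes
        struct { int32_t payload; uint32_t tag; } halves;   // little-endian: tag is the high word
    } u;  // union-based punning mirrors what engines do; strictly portable code would memcpy

    static NewValueSketch makeDouble(double d) { NewValueSketch v; v.u.asDouble = d; return v; }
    static NewValueSketch makeInt(int32_t i)
    { NewValueSketch v; v.u.halves.payload = i; v.u.halves.tag = IntTag; return v; }

    bool isInt()    const { return u.halves.tag == IntTag; }
    bool isCell()   const { return u.halves.tag == CellTag; }
    bool isDouble() const { return u.halves.tag < CellTag; }
};

static_assert(sizeof(NewValueSketch) == 8, "the new value is two machine words");

The sketch shows exactly the two competing effects that the measurements below probe: doubles no longer need heap cells, so the garbage collector has less to do, but every value now costs two memory accesses instead of one.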
I made measurements with Nitro-Extreme on a Nokia N810 internet tablet equipped with an OMAP-2420 ARM CPU. The results were produced with the interpreter, since the branch does not support the JIT yet.

First, we start with SunSpider, as it is the default benchmark of WebKit (FROM is the trunk, TO is the Nitro-Extreme branch):

TEST                  COMPARISON             FROM                    TO                  DETAILS
=============================================================================
** TOTAL **:          1.064x as fast      52557.8ms +/- 1.7%     49393.3ms +/- 2.3%      significant
=============================================================================

  3d:                 1.28x as fast        9237.8ms +/- 4.1%      7216.0ms +/- 3.4%      significant
    cube:             1.66x as fast        2971.8ms +/- 6.0%      1793.0ms +/- 0.3%      significant
    morph:            ??                   3295.3ms +/- 3.7%      3427.3ms +/- 8.1%      not conclusive: might be *1.040x as slow*
    raytrace:         1.49x as fast        2970.8ms +/- 5.2%      1995.7ms +/- 2.9%      significant

  access:             ??                   6308.5ms +/- 3.9%      6395.3ms +/- 12.4%     not conclusive: might be *1.014x as slow*
    binary-trees:     1.25x as fast         722.8ms +/- 7.8%       579.7ms +/- 7.7%      significant
    fannkuch:         *2.89x as slow*      1009.0ms +/- 2.3%      2912.7ms +/- 25.8%     significant
    nbody:            2.34x as fast        4163.5ms +/- 5.4%      1781.7ms +/- 2.3%      significant
    nsieve:           *2.71x as slow*       413.3ms +/- 0.4%      1121.3ms +/- 1.6%      significant

  bitops:             *1.38x as slow*      2318.0ms +/- 2.7%      3193.0ms +/- 2.6%      significant
    3bit-bits-in-byte: *2.11x as slow*      329.0ms +/- 0.9%       694.0ms +/- 0.9%      significant
    bits-in-byte:     *1.59x as slow*       410.0ms +/- 0.5%       652.0ms +/- 12.6%     significant
    bitwise-and:      *1.32x as slow*       339.3ms +/- 1.8%       449.3ms +/- 1.8%      significant
    nsieve-bits:      *1.127x as slow*     1239.8ms +/- 4.9%      1397.7ms +/- 1.0%      significant

  controlflow:        *2.66x as slow*       302.3ms +/- 20.3%      805.3ms +/- 9.7%      significant
    recursive:        *2.66x as slow*       302.3ms +/- 20.3%      805.3ms +/- 9.7%      significant

  crypto:             ??                   2998.8ms +/- 3.4%      3031.3ms +/- 1.3%      not conclusive: might be *1.011x as slow*
    aes:              *1.42x as slow*       814.5ms +/- 1.5%      1155.0ms +/- 1.8%      significant
    md5:              1.155x as fast       1083.5ms +/- 4.6%       938.3ms +/- 7.0%      significant
    sha1:             1.174x as fast       1100.8ms +/- 5.8%       938.0ms +/- 5.1%      significant

  date:               -                    5473.5ms +/- 3.9%      5472.7ms +/- 6.5%
    format-tofte:     ??                   1929.8ms +/- 5.8%      1969.7ms +/- 6.5%      not conclusive: might be *1.021x as slow*
    format-xparb:     -                    3543.8ms +/- 4.2%      3503.0ms +/- 6.7%

  math:               1.34x as fast        8873.5ms +/- 4.7%      6645.7ms +/- 1.6%      significant
    cordic:           1.44x as fast        2979.8ms +/- 7.0%      2062.3ms +/- 7.2%      significant
    partial-sums:     1.52x as fast        4135.5ms +/- 11.8%     2724.3ms +/- 6.7%      significant
    spectral-norm:    ??                   1758.3ms +/- 8.9%      1859.0ms +/- 7.2%      not conclusive: might be *1.057x as slow*

  regexp:             -                    3940.0ms +/- 0.5%      3922.3ms +/- 1.2%
    dna:              -                    3940.0ms +/- 0.5%      3922.3ms +/- 1.2%

  string:             1.031x as fast      13105.5ms +/- 1.7%     12711.7ms +/- 2.6%      significant
    base64:           ??                   1513.8ms +/- 6.3%      1595.0ms +/- 4.9%      not conclusive: might be *1.054x as slow*
    fasta:            ??                   2780.0ms +/- 5.4%      2792.3ms +/- 9.6%      not conclusive: might be *1.004x as slow*
    tagcloud:         1.059x as fast       2829.8ms +/- 0.8%      2671.7ms +/- 1.8%      significant
    unpack-code:      -                    3755.0ms +/- 5.8%      3719.0ms +/- 4.5%
    validate-input:   1.152x as fast       2227.0ms +/- 4.1%      1933.7ms +/- 3.1%      significant

As you can see, speed increases greatly for both the math and the 3d groups, but the others (especially bitops and controlflow) suffer a serious performance loss. Overall, SunSpider performance improves slightly.

The next one is Google's V8 benchmark suite:

TEST                  COMPARISON             FROM                    TO                  DETAILS
=============================================================================
** TOTAL **:          *1.63x as slow*    165203.5ms +/- 14.5%   269875.0ms +/- 6.5%      significant
=============================================================================

  v8:                 *1.63x as slow*    165203.5ms +/- 14.5%   269875.0ms +/- 6.5%      significant
    crypto:           *4.91x as slow*     20308.5ms +/- 2.3%     99666.0ms +/- 5.3%      significant
    deltablue:        *1.22x as slow*     53850.0ms +/- 11.3%    65641.7ms +/- 11.1%     significant
    earley-boyer:     ??                  19729.5ms +/- 53.1%    21353.0ms +/- 11.4%     not conclusive: might be *1.082x as slow*
    raytrace:         1.21x as fast       24228.5ms +/- 11.5%    20084.0ms +/- 14.2%     significant
    richards:         *1.34x as slow*     47087.0ms +/- 36.5%    63130.3ms +/- 11.2%     significant

Only raytrace benefits from the branch. Note that crypto is nearly 5 times slower!

Finally, our WindScorpion benchmark suite follows:

TEST                  COMPARISON             FROM                    TO                  DETAILS
=============================================================================
** TOTAL **:          ??                 517314.0ms +/- NaN%    551942.0ms +/- NaN%      not conclusive: might be *1.067x as slow*
=============================================================================

  WS:                 ??                 517314.0ms +/- NaN%    551942.0ms +/- NaN%      not conclusive: might be *1.067x as slow*
    bubbleSort:       ??                  23101.0ms +/- NaN%     32583.0ms +/- NaN%      not conclusive: might be *1.41x as slow*
    des:              -                   51461.0ms +/- NaN%     30900.0ms +/- NaN%
    dicePoker:        ??                  24493.0ms +/- NaN%     25374.0ms +/- NaN%      not conclusive: might be *1.036x as slow*
    email:            ??                  43445.0ms +/- NaN%     48088.0ms +/- NaN%      not conclusive: might be *1.107x as slow*
    factor:           -                   22298.0ms +/- NaN%     22243.0ms +/- NaN%
    floyd:            ??                  35773.0ms +/- NaN%     45194.0ms +/- NaN%      not conclusive: might be *1.26x as slow*
    formatNumber:     ??                  39421.0ms +/- NaN%     42229.0ms +/- NaN%      not conclusive: might be *1.071x as slow*
    genetic:          ??                  45551.0ms +/- NaN%     47097.0ms +/- NaN%      not conclusive: might be *1.034x as slow*
    huffman:          -                   25344.0ms +/- NaN%     24755.0ms +/- NaN%
    IEEE754Conv:      ??                  49178.0ms +/- NaN%     60139.0ms +/- NaN%      not conclusive: might be *1.22x as slow*
    longFact:         ??                  44821.0ms +/- NaN%     45709.0ms +/- NaN%      not conclusive: might be *1.020x as slow*
    quickSort:        ??                  12237.0ms +/- NaN%     18869.0ms +/- NaN%      not conclusive: might be *1.54x as slow*
    redBlackTree:     ??                  23470.0ms +/- NaN%     31335.0ms +/- NaN%      not conclusive: might be *1.34x as slow*
    solve:            ??                  48657.0ms +/- NaN%     49261.0ms +/- NaN%      not conclusive: might be *1.012x as slow*
    xmlParser:        ??                  28064.0ms +/- NaN%     28166.0ms +/- NaN%      not conclusive: might be *1.004x as slow*

WindScorpion takes much longer to complete than the other suites, so we usually run it only once. To give the Nitro-Extreme branch a fair chance, we repeated the whole measurement several times and selected the best result. Since a standard deviation cannot be computed from a single sample, the comparison script reports every change as "not conclusive".

Can we look behind the raw runtimes? Yes, we can: we have a cycle-accurate Intel XScale simulator called XEEMU. The simulated CPU is configured at 600 MHz with 32 KB instruction and data caches, and system calls are handled by the Linux kernel (version 2.6.21.5). We selected two candidates from SunSpider: 3d-cube and controlflow-recursive. The former became much faster, the latter much slower. Let's look at 3d-cube first:

Executed instructions: 387910172 (Trunk)

The number of extra instructions only partly explains the trunk's longer runtime, so we need to look further. A memory stall cycle means that the CPU cannot execute the next instruction, because either the instruction has not yet been delivered by the instruction fetch stage, or a required operand has not yet been loaded from memory.

Number of stall cycles caused by memory: 257012628 (Trunk)

In this case the cache access pattern is much better for Nitro-Extreme, so it runs much faster. Perhaps the memory scan performed by the garbage collector caused extra cache line evictions for the trunk.

What about controlflow-recursive?

Executed instructions: 55813742 (Trunk)

Here the increase in executed instructions fully explains the increase in runtime; this is also confirmed by the other statistics generated by the simulator.
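As a rough sanity check of this kind of reasoning, the two simulator counters can be turned into a runtime estimate. The sketch below assumes a simplified single-issue, in-order model in which every executed instruction retires in one cycle and memory stalls simply add on top; that is an approximation of the XScale pipeline, not XEEMU's exact cost model:

#include <cstdint>
#include <cstdio>

// Simplified model: cycles ~= executed instructions + memory stall cycles.
static double approxSeconds(uint64_t instructions, uint64_t memoryStalls, double clockHz)
{
    return static_cast<double>(instructions + memoryStalls) / clockHz;
}

int main()
{
    // Trunk counters for 3d-cube quoted above.
    const uint64_t instructions = 387910172ULL;
    const uint64_t memoryStalls = 257012628ULL;

    const double seconds    = approxSeconds(instructions, memoryStalls, 600e6); // 600 MHz core
    const double stallShare = static_cast<double>(memoryStalls)
                            / static_cast<double>(instructions + memoryStalls);

    std::printf("~%.2f s, %.0f%% of cycles stalled on memory\n", seconds, stallShare * 100.0);
    return 0;
}

Under this simplified model the trunk numbers work out to roughly 1.07 seconds, with about 40% of the cycles spent waiting for memory, which is why the cache behaviour dominates this test.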
Update

There was a huge (nearly 5x) slowdown for v8-crypto, so I decided to investigate that benchmark as well (it took 10 hours on XEEMU, because cycle-accurate simulators are as slow as molasses in January):

CPU cycles: 6993234850 (Trunk)
Executed instructions: 4768205023 (Trunk)
IPC (instructions per cycle):
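For the trunk, this last ratio can be computed from the two counters above. A tiny sketch of the calculation (my own illustrative code, not part of the benchmark tooling):

#include <cstdint>
#include <cstdio>

int main()
{
    // Trunk counters for v8-crypto quoted above.
    const uint64_t cycles       = 6993234850ULL;
    const uint64_t instructions = 4768205023ULL;

    // IPC = executed instructions / CPU cycles; on a single-issue core a value
    // well below 1 means a large share of cycles is spent stalled.
    const double ipc = static_cast<double>(instructions) / static_cast<double>(cycles);
    std::printf("Trunk IPC: %.2f\n", ipc);   // roughly 0.68 for the numbers above
    return 0;
}

An IPC of roughly 0.68 already hints that, even on the trunk, a substantial part of v8-crypto's time on this core is spent waiting rather than executing.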
Anonymous (not verified) - 05/07/2009 - 18:09
"32 byte" -> "32 bit"
zoltan.herczeg - 05/08/2009 - 07:58
No, 32 bytes is correct. JSValues point to JSCells, and JSCells are aligned by a custom JavaScriptCore allocator to 32 bytes on 32-bit machines and 64 bytes on 64-bit machines.