Nitro-Extreme on ARM. What happens to nitro during extreme conditions?

Recently, a new branch called Nitro-Extreme appeared in the WebKit repository. The work is not finished yet, but it never hurts to take a look at the current revision.

What is the big deal about this branch? The engineers at Apple changed the format of the core structure of JavaScriptCore: the JSValue. All JavaScript-level objects (from booleans to arrays) are represented as JSValues, and many functions take JSValues as input and output arguments. A JSValue used to be a pointer to a 32-byte-aligned cell, which could also hold atomic data types when its low-order bits were non-zero. Unfortunately, double precision floating-point numbers cannot be stored as atomic types on 32-bit machines, because pointers are only 4 bytes long. In this branch, JSValue was extended by another 4 bytes, so it is no longer necessary to allocate heap memory for doubles. This should (hopefully) reduce the amount of work performed by the garbage collector, which is an expensive operation that may take 20-30% of the total runtime. The downside of the new JSValue is that loading and storing it requires two CPU instructions instead of one. Despite cache improvements, memory accesses are still the bottleneck of CPU performance, especially on embedded systems.
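The trade-off can be sketched in C. This is an illustrative model only, not the actual JavaScriptCore definitions: the type names, tag layout, and the choice of tag bits are assumptions made for the example (the real headers differ).

```c
#include <stdint.h>

/* Old (trunk) encoding on 32-bit machines: a JSValue is one pointer-sized
   word. Heap cells are 32-byte aligned, so a real cell pointer has zero
   low-order bits; non-zero low bits can therefore tag small immediates
   such as integers and booleans. A 8-byte double cannot fit and must be
   allocated on the garbage-collected heap. */
typedef uintptr_t OldJSValue;           /* hypothetical name */

static int old_is_immediate(OldJSValue v)
{
    return (v & 0x1f) != 0;             /* illustrative tag mask */
}

/* New (Nitro-Extreme) encoding: the value is widened to 8 bytes, so a
   double fits inline next to a tag word and needs no heap allocation.
   The cost: on a 32-bit CPU every load or store of a JSValue now moves
   two words instead of one. */
typedef struct {
    uint32_t tag;                       /* pointer / int32 / double ... */
    union {
        uint32_t asBits;
        int32_t  asInt32;
    } payload;                          /* a double spans tag + payload */
} NewJSValue;                           /* hypothetical name */
```

The doubling of the value size is exactly what the stall-cycle measurements later in this post probe: fewer GC scans, but twice the memory traffic per value.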

I made measurements with Nitro-Extreme on a Nokia N810 internet tablet equipped with an OMAP 2420 ARM CPU. The results were produced using the interpreter, since the branch does not yet support the JIT. First up is SunSpider, the default benchmark of WebKit:

TEST                   COMPARISON            FROM                 TO             DETAILS


** TOTAL **:           1.064x as fast    52557.8ms +/- 1.7%   49393.3ms +/- 2.3%     significant


  3d:                  1.28x as fast      9237.8ms +/- 4.1%    7216.0ms +/- 3.4%     significant
    cube:              1.66x as fast      2971.8ms +/- 6.0%    1793.0ms +/- 0.3%     significant
    morph:             ??                 3295.3ms +/- 3.7%    3427.3ms +/- 8.1%     not conclusive: might be *1.040x as slow*
    raytrace:          1.49x as fast      2970.8ms +/- 5.2%    1995.7ms +/- 2.9%     significant

  access:              ??                 6308.5ms +/- 3.9%    6395.3ms +/- 12.4%     not conclusive: might be *1.014x as slow*
    binary-trees:      1.25x as fast       722.8ms +/- 7.8%     579.7ms +/- 7.7%     significant
    fannkuch:          *2.89x as slow*    1009.0ms +/- 2.3%    2912.7ms +/- 25.8%     significant
    nbody:             2.34x as fast      4163.5ms +/- 5.4%    1781.7ms +/- 2.3%     significant
    nsieve:            *2.71x as slow*     413.3ms +/- 0.4%    1121.3ms +/- 1.6%     significant

  bitops:              *1.38x as slow*    2318.0ms +/- 2.7%    3193.0ms +/- 2.6%     significant
    3bit-bits-in-byte: *2.11x as slow*     329.0ms +/- 0.9%     694.0ms +/- 0.9%     significant
    bits-in-byte:      *1.59x as slow*     410.0ms +/- 0.5%     652.0ms +/- 12.6%     significant
    bitwise-and:       *1.32x as slow*     339.3ms +/- 1.8%     449.3ms +/- 1.8%     significant
    nsieve-bits:       *1.127x as slow*   1239.8ms +/- 4.9%    1397.7ms +/- 1.0%     significant

  controlflow:         *2.66x as slow*     302.3ms +/- 20.3%     805.3ms +/- 9.7%     significant
    recursive:         *2.66x as slow*     302.3ms +/- 20.3%     805.3ms +/- 9.7%     significant

  crypto:              ??                 2998.8ms +/- 3.4%    3031.3ms +/- 1.3%     not conclusive: might be *1.011x as slow*
    aes:               *1.42x as slow*     814.5ms +/- 1.5%    1155.0ms +/- 1.8%     significant
    md5:               1.155x as fast     1083.5ms +/- 4.6%     938.3ms +/- 7.0%     significant
    sha1:              1.174x as fast     1100.8ms +/- 5.8%     938.0ms +/- 5.1%     significant

  date:                -                  5473.5ms +/- 3.9%    5472.7ms +/- 6.5%
    format-tofte:      ??                 1929.8ms +/- 5.8%    1969.7ms +/- 6.5%     not conclusive: might be *1.021x as slow*
    format-xparb:      -                  3543.8ms +/- 4.2%    3503.0ms +/- 6.7%

  math:                1.34x as fast      8873.5ms +/- 4.7%    6645.7ms +/- 1.6%     significant
    cordic:            1.44x as fast      2979.8ms +/- 7.0%    2062.3ms +/- 7.2%     significant
    partial-sums:      1.52x as fast      4135.5ms +/- 11.8%    2724.3ms +/- 6.7%     significant
    spectral-norm:     ??                 1758.3ms +/- 8.9%    1859.0ms +/- 7.2%     not conclusive: might be *1.057x as slow*

  regexp:              -                  3940.0ms +/- 0.5%    3922.3ms +/- 1.2%
    dna:               -                  3940.0ms +/- 0.5%    3922.3ms +/- 1.2%

  string:              1.031x as fast    13105.5ms +/- 1.7%   12711.7ms +/- 2.6%     significant
    base64:            ??                 1513.8ms +/- 6.3%    1595.0ms +/- 4.9%     not conclusive: might be *1.054x as slow*
    fasta:             ??                 2780.0ms +/- 5.4%    2792.3ms +/- 9.6%     not conclusive: might be *1.004x as slow*
    tagcloud:          1.059x as fast     2829.8ms +/- 0.8%    2671.7ms +/- 1.8%     significant
    unpack-code:       -                  3755.0ms +/- 5.8%    3719.0ms +/- 4.5%
    validate-input:    1.152x as fast     2227.0ms +/- 4.1%    1933.7ms +/- 3.1%     significant

As you can see, speed greatly increases for both the math and 3d groups, but the others (especially bitops and controlflow) suffer a great performance loss. Overall, SunSpider performance is slightly improved.

The next one is Google's v8 benchmark set:

TEST              COMPARISON            FROM                 TO             DETAILS


** TOTAL **:      *1.63x as slow*   165203.5ms +/- 14.5%   269875.0ms +/- 6.5%     significant


  v8:             *1.63x as slow*   165203.5ms +/- 14.5%   269875.0ms +/- 6.5%     significant
    crypto:       *4.91x as slow*    20308.5ms +/- 2.3%    99666.0ms +/- 5.3%     significant
    deltablue:    *1.22x as slow*    53850.0ms +/- 11.3%    65641.7ms +/- 11.1%     significant
    earley-boyer: ??                 19729.5ms +/- 53.1%    21353.0ms +/- 11.4%     not conclusive: might be *1.082x as slow*
    raytrace:     1.21x as fast      24228.5ms +/- 11.5%    20084.0ms +/- 14.2%     significant
    richards:     *1.34x as slow*    47087.0ms +/- 36.5%    63130.3ms +/- 11.2%     significant

Only raytrace benefits from the branch. Note that crypto is nearly five times slower!

Finally, our WindScorpion benchmark suite follows:

TEST              COMPARISON            FROM                 TO             DETAILS


** TOTAL **:      ??                517314.0ms +/- NaN%   551942.0ms +/- NaN%     not conclusive: might be *1.067x as slow*


  WS:             ??                517314.0ms +/- NaN%   551942.0ms +/- NaN%     not conclusive: might be *1.067x as slow*
    bubbleSort:   ??                 23101.0ms +/- NaN%    32583.0ms +/- NaN%     not conclusive: might be *1.41x as slow*
    des:          -                  51461.0ms +/- NaN%    30900.0ms +/- NaN%
    dicePoker:    ??                 24493.0ms +/- NaN%    25374.0ms +/- NaN%     not conclusive: might be *1.036x as slow*
    email:        ??                 43445.0ms +/- NaN%    48088.0ms +/- NaN%     not conclusive: might be *1.107x as slow*
    factor:       -                  22298.0ms +/- NaN%    22243.0ms +/- NaN%
    floyd:        ??                 35773.0ms +/- NaN%    45194.0ms +/- NaN%     not conclusive: might be *1.26x as slow*
    formatNumber: ??                 39421.0ms +/- NaN%    42229.0ms +/- NaN%     not conclusive: might be *1.071x as slow*
    genetic:      ??                 45551.0ms +/- NaN%    47097.0ms +/- NaN%     not conclusive: might be *1.034x as slow*
    huffman:      -                  25344.0ms +/- NaN%    24755.0ms +/- NaN%
    IEEE754Conv:  ??                 49178.0ms +/- NaN%    60139.0ms +/- NaN%     not conclusive: might be *1.22x as slow*
    longFact:     ??                 44821.0ms +/- NaN%    45709.0ms +/- NaN%     not conclusive: might be *1.020x as slow*
    quickSort:    ??                 12237.0ms +/- NaN%    18869.0ms +/- NaN%     not conclusive: might be *1.54x as slow*
    redBlackTree: ??                 23470.0ms +/- NaN%    31335.0ms +/- NaN%     not conclusive: might be *1.34x as slow*
    solve:        ??                 48657.0ms +/- NaN%    49261.0ms +/- NaN%     not conclusive: might be *1.012x as slow*
    xmlParser:    ??                 28064.0ms +/- NaN%    28166.0ms +/- NaN%     not conclusive: might be *1.004x as slow*

WindScorpion takes much longer to complete than the other suites, so we usually run it only once. To give the Nitro-Extreme branch a fair chance, we repeated the whole measurement multiple times and selected the best result for each side. Since a standard deviation cannot be calculated from a single sample, the harness reports every change as "not conclusive" (and prints NaN% above).

Can we see behind the raw runtimes? Yes we can, since we have a cycle-accurate Intel XScale simulator called XEEMU. The simulated CPU is configured for 600 MHz with 32K instruction and data caches, and system calls are handled by the Linux kernel. We selected two candidates from SunSpider: 3d-cube and controlflow-recursive. The first became much faster, the latter much slower.

Let's see 3d-cube first:
A CPU cycle is the smallest time unit of the core; in our case it is 1/600,000,000 of a second.
CPU cycles: 881778528 (Trunk)
CPU cycles: 555313298 (Nitro-Extreme)
Nitro-Extreme runs 58% faster.

Executed instructions: 387910172 (Trunk)
Executed instructions: 307957274 (Nitro-Extreme)
Trunk requires 25% more ARM instructions.

The number of extra instructions only partly explains the trunk's longer runtime, so we need to look further. A memory stall cycle means the CPU cannot execute the next instruction, either because the instruction has not yet been delivered by the fetch stage or because a required input operand has not yet been loaded from memory.

Number of stall cycles caused by memory: 257012628 (Trunk)
Number of stall cycles caused by memory: 86277920 (Nitro-Extreme)

Memory stalls account for about 29% of the trunk's cycles but only about 16% of Nitro-Extreme's: the cache access pattern is much better for Nitro-Extreme, so it runs much faster. Perhaps the memory scan performed by the garbage collector caused extra cache line evictions in the trunk.

What about controlflow-recursive?
CPU cycles: 86169759 (Trunk)
CPU cycles: 215456144 (Nitro-Extreme)
Trunk runs 150% faster on an XScale core.

Executed instructions: 55813742 (Trunk)
Executed instructions: 134120509 (Nitro-Extreme)
Nitro-Extreme executes 140% more instructions.

Here the instruction increase fully explains the runtime increase. This is also supported by the other statistics generated by the simulator.


There was a huge (almost 5x) slowdown for v8-crypto, so I decided to investigate that benchmark as well (the run took 10 hours on XEEMU; cycle-accurate simulators are slow as molasses in January):

CPU cycles: 6993234850 (Trunk)
CPU cycles: 29652666950 (Nitro-Extreme)
Trunk runs 324% faster.

Executed instructions: 4768205023 (Trunk)
Executed instructions: 19312115853 (Nitro-Extreme)
Nitro-Extreme executes 305% more instructions.

IPC (instruction per cycle):
Trunk IPC: 0.68
Nitro-Extreme IPC: 0.65
The load on the core is nearly the same, so this is a similar case to controlflow-recursive: the extra executed instructions caused the runtime increase here as well.
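The quoted IPC figures can be re-derived from the raw counters. The `ipc()` helper below was written for this post and is not part of the simulator:

```c
/* IPC (instructions per cycle) = executed instructions / elapsed cycles. */
static double ipc(double instructions, double cycles)
{
    return instructions / cycles;
}

/* Plugging in the v8-crypto counters quoted above:
     ipc(4768205023.0,  6993234850.0)  ~ 0.68  (trunk)
     ipc(19312115853.0, 29652666950.0) ~ 0.65  (Nitro-Extreme)
   and the cycle ratio 29652666950 / 6993234850 ~ 4.24 agrees with the
   "324% faster" figure for the trunk. */
```

Near-identical IPC with vastly different cycle counts is the signature of a pure instruction-count problem rather than a memory-system problem.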

Anonymous (not verified) - 05/07/2009 - 18:09

"32 byte" -> "32 bit"

zoltan.herczeg - 05/08/2009 - 07:58

No, that is 32 bytes. JSValues point to JSCells, and JSCells are aligned by a custom JavaScriptCore allocator to 32 bytes on 32-bit machines and 64 bytes on 64-bit machines.
