jpauclair

Because standing still is going backward…

Flash Assembler (not ByteCode) : And you thought it couldn’t get faster

with 24 comments

In the last posts, I talked about what you can get from mm.cfg setups but I didn’t give any GREAT example of how to use that in an optimization context.
Well here it is:

var a:int = 1;
var b:int = 2;
b = a + a + a;

How could something this simple could be optimized? Can it?
Well yes it can! Not 10% faster… 300% faster!

In fact, Flash is spending precious CPU time working on Numbers data type, assuming that everything can be done with numbers and then converting the whole thing back to int.
If you try to add two Integers, flash will first generate the code as if both var were numbers, and make the float addition with those. But flash (or Jit?) take this code and convert it to a nice int+int operation. This is great, except that the optimization will work only if there is only two operators , not one more!
So b = a+a is pretty fast, but if you had that third “a”, you will be smashed with a nice Number + Number + Number + NumberToInteger operation. This cost A LOT!

Warning: ByteCode AND Assembler ahead!

With mm.cfg, you can get the whole ByteCode with associated Assembler (MIR) in the flashlog when you run a SWF.


The ByteCode

Let’s try some addition variation and generate this with mm.cfg::AS3Verbose=1
Simple addition: In this variation, we add the three vars and make no optimization. Result: two additions of numbers, and one final conversion to int.

 // b = a + a + a;
    25      getlocal1   // get “a”
    26      getlocal1   // get “a” again
    27      add         // Add both int  (no convert_i here)
    28      getlocal1   // get “a”
    29      add         // Add “a” to previous result
    30      convert_i   // Convert the whole result to integer
    31      setlocal2   // Operation is over, save in register

This variation includes a great optimization: Casting the first two elements. It splits the series of addition and let the JIT make its Integer optimization. Drawback is an access to ”findpropstrict” and a call of property. It also hold one more variable in a register but I was not able to determine if this is really a drawback yet.

// b = int(a + a) + a
    25      findpropstrict	int     // find property int (the conversion function)
    27      getlocal1     	        // get “a”
    28      getlocal1     	        // get “a” again
    29      add           	        // Add both int (AS Number?)
    30      callproperty  	int (1) // No number here, use int optimization instead
    33      getlocal1     	        // get i
    34      add           	        // Convert to integer? Optimize: Use integer add.
    35      convert_i     	        // Operation is over, save in register
    36      setlocal2

The last variation is splitting the calculus manually on multiple lines to make sure there is never more than one operation at any time. It forces the JIT to keep int all the way to the end.

// b = a + a
    // b += a
    20      getlocal1   // get “a”
    21      getlocal1   // get “a” again
    22      add         // Add both int
    23      convert_i   // Convert to integer? Optimize: Use integer add.
    24      setlocal2   // Operation 1 is over, save in register
    25      getlocal2   // Start second operation (reload var)
    26      getlocal1   // Prepare addition with “a”
    27      add         // Add both int
    28      convert_i   // Convert to integer? Optimize: Use integer add.
    29      setlocal2   // Operation 2 is over, save in register


The Benchmark

On my computer, using Grant Skinner Performance Test tool:

//Time: 15ms
 b = a + a + a
//Time: 5.8ms
b = int(a + a) + a
//Time: 5ms
b = a + a;
b += a;


The Assembler

How about some assembler now?

// b = a + a + a;
  25:getlocal1
       @77	use   @50 [1]
                        stack: int@77
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68
                         locals: tests::Bitwise?@16 int@77 int@55 int@59 Number@63
  26:getlocal1
                        stack: int@77 int@77
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68
                         locals: tests::Bitwise?@16 int@77 int@55 int@59 Number@63
  27:add
       @78	i2d   @77
        	cse   @78
       @79	fadd  @78 @78
                        stack: Number@79
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68
                         locals: tests::Bitwise?@16 int@77 int@55 int@59 Number@63
  28:getlocal1
                        stack: Number@79 int@77
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68
                         locals: tests::Bitwise?@16 int@77 int@55 int@59 Number@63
  29:add
        	cse   @78
       @80	fadd  @79 @78
                        stack: Number@80
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68
                         locals: tests::Bitwise?@16 int@77 int@55 int@59 Number@63
  30:convert_i
       @81	csop  AVMCORE_integer_d_sse2 (@80)
                        stack: int@81
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68
                         locals: tests::Bitwise?@16 int@77 int@55 int@59 Number@63
  31:setlocal2
                        stack:
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68
                         locals: tests::Bitwise?@16 int@77 int@81 int@59 Number@63

At the first addition of int, you can see two i2d operations transferring the integers into floats, and then a float addition, giving a “number” result staying on the stack.
Also, there is the final “AVMCORE_integer_d_sse2” which seem to be a great thing (using sse2) optimization, but this cost A LOT! It took almost 9 of the 15ms to process this conversion!

// b = int(a + a) + a
  25:findpropstrict int
       @77	imm   18784736
       @78	imm   18739432
       @79	cmop  MethodEnv::finddefNs (@3, @78, @77)
                        stack: global@79
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68
                         locals: tests::Bitwise?@16 int@50 int@55 int@59 Number@63
  27:getlocal1
       @81	use   @50 [1]
                        stack: global@79 int@81
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68
                         locals: tests::Bitwise?@16 int@81 int@55 int@59 Number@63
  28:getlocal1
                        stack: global@79 int@81 int@81
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68
                         locals: tests::Bitwise?@16 int@81 int@55 int@59 Number@63
  29:add
       @82	i2d   @81
        	cse   @82
       @83	fadd  @82 @82
                        stack: global@79 Number@83
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68
                         locals: tests::Bitwise?@16 int@81 int@55 int@59 Number@63
  30:callproperty int 1
       @84	add   @81 @81
     	dead   @83
     	dead   @82
     	dead   @82
                        stack: int@84
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68
                         locals: tests::Bitwise?@16 int@81 int@55 int@59 Number@63
  33:getlocal1
                        stack: int@84 int@81
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68
                         locals: tests::Bitwise?@16 int@81 int@55 int@59 Number@63
  34:add
       @85	i2d   @84
       @86	i2d   @81
       @87	fadd  @85 @86
                        stack: Number@87
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68
                         locals: tests::Bitwise?@16 int@81 int@55 int@59 Number@63
  35:convert_i
       @88	add   @84 @81
     	dead   @87
     	dead   @86
     	dead   @85
                        stack: int@88
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68
                         locals: tests::Bitwise?@16 int@81 int@55 int@59 Number@63
  36:setlocal2
                        stack:
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68
                         locals: tests::Bitwise?@16 int@81 int@88 int@59 Number@63

In this on take a look at the “findpropstrict” which is done in every loop, and then after the first addition there is the “callproperty int” removing in the same time useless float calculus.
This variation keeps the code easy to read. That’s one big advantage!

// b = a + a
    // b += a
  20:getlocal1
       @66	use   @44 [1]
                        stack: int@66
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@57
                         locals: tests::Bitwise?@15 int@66 int@49 int@53
  21:getlocal1
                        stack: int@66 int@66
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@57
                         locals: tests::Bitwise?@15 int@66 int@49 int@53
  22:add
       @67	i2d   @66
        	cse   @67
       @68	fadd  @67 @67
                        stack: Number@68
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@57
                         locals: tests::Bitwise?@15 int@66 int@49 int@53
  23:convert_i
       @69	add   @66 @66
     	dead   @68
     	dead   @67
     	dead   @67
                        stack: int@69
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@57
                         locals: tests::Bitwise?@15 int@66 int@49 int@53
  24:setlocal2
                        stack:
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@57
                         locals: tests::Bitwise?@15 int@66 int@69 int@53
  25:getlocal2
                        stack: int@69
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@57
                         locals: tests::Bitwise?@15 int@66 int@69 int@53
  26:getlocal1
                        stack: int@69 int@66
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@57
                         locals: tests::Bitwise?@15 int@66 int@69 int@53
  27:add
       @70	i2d   @69
       @71	i2d   @66
       @72	fadd  @70 @71
                        stack: Number@72
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@57
                         locals: tests::Bitwise?@15 int@66 int@69 int@53
  28:convert_i
       @73	add   @69 @66
     	dead   @72
     	dead   @71
     	dead   @70
                        stack: int@73
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@57
                         locals: tests::Bitwise?@15 int@66 int@69 int@53
  29:setlocal2
                        stack:
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@57
                         locals: tests::Bitwise?@15 int@66 int@73 int@53

This is the fastest variation. Instead of calling anything it just save the current state and let the optimization happen.

    22      add         // Add both int
    23      convert_i   // Convert to integer? Optimize: Use integer add.
    24      setlocal2   // Operation 1 is over, save in register
    25      getlocal2   // Start second operation (reload var)

After each operation, there is the integer optimization, and then a save to local register, followed by the reloading of the same register.
…but the code is a lot less easy to read.


Is this really useful information?

Well for everyday work, I guess not. Still it’s good to know how thing works.
There is some special application that could benefits efforts like 3D rendering engines. Physics engine etc.
For example, in a 3D space where you would like to map a 3 dimensional coordinate into one ByteArray, you could use:

Pos1D = (50 * 50 * z) + (50 * y + x);

And convert this to:

pos1D = 50*50;
pos1D *= z;
pos1D += int(50 * y);
pos1D += x;

The result is the same, except it’s 225% faster. Kinda cool.


Conclusion

I personally consider this as a bug. I hope Adobe (or Tamarin/NanoJIT team?) will fix it in 10.1 (didn’t try the optimization there yet).
But if fixed by Adobe in the next player release, this could mean great optimization for everybody!

Also, some compilers could do that in the background while generating the ByteCode (Yes Nicolas C. I’m talking to you! :) ).

Also, I did some minor modification in the Base64 Class, I will commit that a bit later but instead of doing “writeInt” I’m using for bytearray[x] instead, this eliminate 3 bitOr and direct access in byte array is a lot faster than using wrinteInt. I gained almost 25% by doing this optimization. I guess I’m now over 1000% faster than AS3Crypto.

add to del.icio.usAdd to Blinkslistadd to furlDigg itadd to ma.gnoliaStumble It!add to simpyseed the vineTailRankpost to facebook


Written by jpauclair

March 15, 2010 at 9:32 pm

24 Responses

Subscribe to comments with RSS.

  1. Ouh my… Never thought that I will agree with Steve Jobs on anything(was apple neutral few years back but becoming Apple hater slowly) but stuff like this does produce words like “Adobe is lazy”…

    Why such trivial optimizations are possible and compilers does not do them? Why developer should know about such ridiculous code optimizations like this
    b = a + a;
    b += a;

    Those two are rhetorical questions…

    wonderwhy-er

    March 16, 2010 at 2:43 am

  2. I don’t get it – can I draw use from this if I don’t know how to write Assembler?

    Like the example you write right before “Conclusion” – would that be 225% faster?
    Or does it require working with Assembler first?

    It sounds really interesting if it’s something that can speed up regular calculations, could be useful in game programming which is something I do.

    This is all way over my head, but I love reading stuff like this. Really liked your previous posts about mm.cfg!

    Karl Macklin

    March 16, 2010 at 4:13 am

    • You don’t need any assembler! It’s just eye candy for geeks who want to understand the “Why” and not the “How” to optimize ;)

      You only need to make your code on multiple line or cast operations as int. that’s all!

      jpauclair

      March 16, 2010 at 5:49 am

  3. Interesting article.

    Can I ask what your setup is? What version of the Flex compiler (or CS compiler), and what version of the Flash player you are running it on?

    I’m curious because I tried testing similar code over 1,000,000 iterations in the performance test harness, and both functions executed in the same time – ~65ms in total. The bytecode output I get appears to be the same.

    I was testing using the Flex 3.5.0.12683 and the release Flash player version WIN 10,0,22,87.

    You can see the code for the functions I tested at http://gist.github.com/333822.

    Great articles!

    David Wagner

    March 16, 2010 at 5:31 am

    • Try to look at the assembler flash is generating.
      search the flashlog file looking for your function name. You should find something like verify*YourFunctionName
      Check if its doing the code optimization on int.
      It might just be my computer which handle badly sse2 optimization but it would rather surprise me.
      I’m on WinXP SP2 on a Core2 Duo
      I tested on both flash player external and flash player plugin for Chrome.

      jpauclair

      March 16, 2010 at 5:47 am

      • Cheers, I’ll have a look at the disassembly tonight when I get home. I’m pretty sure this article is correct, but I’m interested in whether or not the process of generating the assembler is changing the execution paths for the Flash VM – I’d really hope that they don’t have all that debugging in there wrapped in if’s :)

        Just to check – when you were doing the comparative timings, you were running it on a release version of the Flash player or plugin rather than the debug version? And you weren’t using any special options in mm.cfg?

        How many iterations did you use for your timings? I’ve tried a variety of values for the performance test running on a release version of the Flash player, but the difference between the functions is so small it’s likely to be within the margin of error in the timers. It’s literally down to values such as 65ms vs 67ms for the non-optimised vs optimised versions of the function call, with 1,000,000 iterations.

        My machine here is a Core 2 Quad CPU, so it’ll be interesting to compare it to my Mac at home, which is traditionally a bit duff at running Flash :)

        David Wagner

        March 16, 2010 at 8:05 am

  4. I love AS3, and AVM2 is loads better than AVM1 was, but this is pathetic, really.

    I don’t know the reasons behind the lack of even basic optimizations in both the bytecode compiler and the JIT.

    The bytecode compiler is just sad – in the year 2010, there’s no valid reason for a compiler to both generate poor code *and* run slower than molasses in January (Here in Minnesota. Running uphill. Into a stiff headwind.)

    In the case of AVM2, I’m assuming that much of the problem is the lack of developer time and the prioritization of new features over optimization and correctness. That has worked out for Flash so far, but Flash is no longer the only game in town – it has many competitors nipping at its heels.

    All props to what the AVM2 team has done so far, but Adobe had best either hire more compiler engineers, open source AVM2, or both, or I don’t think they’re going to maintain their lead. Ubiquity is a strong defense, but that only works until something else provides enough capability to drive a switch. Once that switch reaches critical mass, it’s hard, if not impossible, to turn the tide.

    For reference, I ran similar tests to those shown here, with the following setup and results:

    * 5,000,000 reps
    * mean of three runs
    * release Flash Player v10.0.45.2 in FF 3.6
    * 1.6 GHz Atom (Samy NC10)
    * All variables declared as :int.
    * Accumulator assignment in loops to eliminate chance of calculation being optimized away (if only that were a problem ….).

    Non-optimized t = 541 ms.
    =========================
    for (i = 0; i < REPS; ++i) {
        Pos1D = (50 * 50 * z) + (50 * y + x);
        acc += Pos1D;
    }
    
    Optimized (from here) t = 187 ms.
    ==================================
    for (i = 0; i < REPS; ++i) {
        Pos1D = 50 * 50;
        Pos1D *= z;
        Pos1D += int(50 * y);
        Pos1D += x;
        acc += Pos1D;
    }
    
    Optimized one-liner t = 186 ms.
    ================================
    for (i = 0; i < REPS; ++i) {
        Pos1D = int(int(50 * 50) * z) + int(int(50 * y) + x);
        acc += Pos1D;
    }
    

    I tried the one-liner out of curiosity, to see a) how clearly it read b) if the additional explicit int casts might change the speed any. It looks like it's exactly as fast, and might be more clear in simple cases.

    [Sheesh. I need to stop leaving these long comments on various blogs and just start my own!]

    Sindisil

    March 16, 2010 at 11:30 am

  5. I just tried with flashPlayer 10.1 beta 3 active-x
    I can’t see any noticable difference between any method.

    I just filled a bug because I am completly unable to trace into flashlogs.txt with the 10.1 beta 2 and 3.

    https://bugs.adobe.com/jira/browse/FP-4161

    Anyone experienced the same bug?

    jpauclair

    March 16, 2010 at 12:56 pm

    • Excellent! Glad to hear that the AVM2 team has been given some cycles to spend on JIT tech, and not just adding features. I’ll have to load up the beta and give it a try on some of my testing code to see where else things have improved!

      (BTW – thanks for editing my previous comment!)

      Sindisil

      March 16, 2010 at 1:03 pm

    • Well, your bug report is not accessible (jira says it’s a permission violation), but yes.

      I’m getting zero length files with the linux standalone 10,1,53,64 players :(

      Have you had any luck tracing into flashlogs.txt with *any* 10.1 players?

      yonatan

      June 10, 2010 at 5:23 pm

      • Sorry, correction, AS3Trace works, AS3Verbose doesn’t seem to.

        yonatan

        June 12, 2010 at 3:22 am

      • That has been the case for all 10.1 RC
        I guess it’s because of the LLVM update..
        I really hope they will put it back!

        I can’t get any traces out of projector debugger 10.1
        This is also an issue.. :/

        jpauclair

        June 12, 2010 at 6:23 am

  6. I find this fascinating… and infuriating!

    @Sindisil – I can confirm your tests and agree that there’s no good reason for optimization this poor.

    Now if you’ll excuse me, I have a lot of code to go through. :(

    Thank you again for another great article Jean-Philippe!

    Jackson Dunstan

    March 16, 2010 at 1:23 pm

  7. Oh, Nicoals! Just tried with haxe.
    The ByteCode is perfectly clean!
    b = a + a + a is the fastest option,
    follow by the cast, and the multiline is last.

    Nice job haXe guys!

    jpauclair

    March 17, 2010 at 3:34 pm

    • It’s been known for long time that MXMLC produces underoptimized bytecode for integer manipulation code ;) I’m not sure that Adobe will fix that anytime soon since that would break some applications because the semantics of int32 overflow would change.

      haXe also produces quite nice bytecode for integer loop counters, if you want to look at it ;)

      Nicolas

      March 23, 2010 at 3:39 pm

  8. [...] Flash Assembler (not ByteCode) : And you thought it couldn’t get faster [...]

  9. [...] Faster arithmetic – Flash Assembler (not ByteCode) : And you thought it couldn’t get faster [...]

  10. Do you know if this has been fixed in 10.1?

    Agnes

    May 19, 2010 at 3:56 am

    • I don’t know.
      But 10.1 is the player, not the compiler.
      And this is a compiler problem.
      The generated Opcode is just not optimized.

      jpauclair

      May 19, 2010 at 7:28 am

  11. [...] Flash Assembler (not ByteCode) : And you thought it couldn’t get faster [...]

  12. Looking at the source of Tamarin, it looks like convert_i, and, int the constructor and function, do the same thing, which is right here:

    http://hg.mozilla.org/tamarin-redux/file/54d9cd1eb2c3/core/AvmCore-inlines.h#l406

    Jasper St. Pierre

    June 5, 2010 at 8:35 am

    • I’m note sure of what you mean? Did I made some kind of bad asumption?

      jpauclair

      June 6, 2010 at 8:59 pm

      • Not sure. Why are you using findpropstrict and then callpropstrict, if convert_i does the same thing.

        Jasper St. Pierre

        June 7, 2010 at 12:01 am

      • Bah, it’s my inability to read. I didn’t realize that this was AS3 that was generated, I thought this was bytecode tweaking. I was wondering why you did the findprop/callprop dance if you took it out in the end.

        Nevermind, I can’t comprehend basic English.

        Jasper St. Pierre

        June 7, 2010 at 12:03 am


Leave a Reply