Flash Assembler (not ByteCode) : And you thought it couldn’t get faster

March 15, 2010

In the last posts, I talked about what you can get from mm.cfg setups but I didn’t give any GREAT example of how to use that in an optimization context.
Well here it is:

var a:int = 1;
var b:int = 2;
b = a + a + a;

How could something this simple could be optimized? Can it?
Well yes it can! Not 10% faster… 300% faster!

In fact, Flash is spending precious CPU time working on Numbers data type, assuming that everything can be done with numbers and then converting the whole thing back to int.
If you try to add two Integers, flash will first generate the code as if both var were numbers, and make the float addition with those. But flash (or Jit?) take this code and convert it to a nice int+int operation. This is great, except that the optimization will work only if there is only two operators , not one more!
So b = a+a is pretty fast, but if you had that third “a”, you will be smashed with a nice Number + Number + Number + NumberToInteger operation. This cost A LOT!

Warning: ByteCode AND Assembler ahead!

With mm.cfg, you can get the whole ByteCode with associated Assembler (MIR) in the flashlog when you run a SWF.

The ByteCode

Let’s try some addition variation and generate this with mm.cfg::AS3Verbose=1
Simple addition: In this variation, we add the three vars and make no optimization. Result: two additions of numbers, and one final conversion to int.

 // b = a + a + a;
    25      getlocal1   // get “a”  	  	
    26      getlocal1   // get “a” again  	
    27      add         // Add both int  (no convert_i here) 
    28      getlocal1   // get “a”  	
    29      add         // Add “a” to previous result  	
    30      convert_i   // Convert the whole result to integer  	
    31      setlocal2   // Operation is over, save in register

This variation includes a great optimization: Casting the first two elements. It splits the series of addition and let the JIT make its Integer optimization. Drawback is an access to ”findpropstrict” and a call of property. It also hold one more variable in a register but I was not able to determine if this is really a drawback yet.

// b = int(a + a) + a
    25      findpropstrict	int     // find property int (the conversion function)
    27      getlocal1     	        // get “a”  	  	
    28      getlocal1     	        // get “a” again  	
    29      add           	        // Add both int (AS Number?)
    30      callproperty  	int (1) // No number here, use int optimization instead
    33      getlocal1     	        // get i  
    34      add           	        // Convert to integer? Optimize: Use integer add.  	  	
    35      convert_i     	        // Operation is over, save in register  	
    36      setlocal2

The last variation is splitting the calculus manually on multiple lines to make sure there is never more than one operation at any time. It forces the JIT to keep int all the way to the end.

// b = a + a
    // b += a
    20      getlocal1   // get “a”  	
    21      getlocal1   // get “a” again  	
    22      add         // Add both int  	
    23      convert_i   // Convert to integer? Optimize: Use integer add.  	
    24      setlocal2   // Operation 1 is over, save in register	
    25      getlocal2   // Start second operation (reload var)  	
    26      getlocal1   // Prepare addition with “a”  	
    27      add         // Add both int
    28      convert_i   // Convert to integer? Optimize: Use integer add.  	  	
    29      setlocal2   // Operation 2 is over, save in register

The Benchmark

On my computer, using Grant Skinner Performance Test tool:

//Time: 15ms
 b = a + a + a

//Time: 5.8ms
b = int(a + a) + a

//Time: 5ms
b = a + a;
b += a;

The Assembler

How about some assembler now?

// b = a + a + a;
  25:getlocal1
       @77	use   @50 [1]
                        stack: int@77
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68 
                         locals: tests::Bitwise?@16 int@77 int@55 int@59 Number@63 
  26:getlocal1
                        stack: int@77 int@77
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68 
                         locals: tests::Bitwise?@16 int@77 int@55 int@59 Number@63 
  27:add
       @78	i2d   @77
        	cse   @78
       @79	fadd  @78 @78
                        stack: Number@79
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68 
                         locals: tests::Bitwise?@16 int@77 int@55 int@59 Number@63 
  28:getlocal1
                        stack: Number@79 int@77
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68 
                         locals: tests::Bitwise?@16 int@77 int@55 int@59 Number@63 
  29:add
        	cse   @78
       @80	fadd  @79 @78
                        stack: Number@80
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68 
                         locals: tests::Bitwise?@16 int@77 int@55 int@59 Number@63 
  30:convert_i
       @81	csop  AVMCORE_integer_d_sse2 (@80)
                        stack: int@81
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68 
                         locals: tests::Bitwise?@16 int@77 int@55 int@59 Number@63 
  31:setlocal2
                        stack:
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68 
                         locals: tests::Bitwise?@16 int@77 int@81 int@59 Number@63

At the first addition of int, you can see two i2d operations transferring the integers into floats, and then a float addition, giving a “number” result staying on the stack.
Also, there is the final “AVMCORE_integer_d_sse2” which seem to be a great thing (using sse2) optimization, but this cost A LOT! It took almost 9 of the 15ms to process this conversion!

// b = int(a + a) + a
  25:findpropstrict int
       @77	imm   18784736
       @78	imm   18739432
       @79	cmop  MethodEnv::finddefNs (@3, @78, @77)
                        stack: global@79
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68 
                         locals: tests::Bitwise?@16 int@50 int@55 int@59 Number@63 
  27:getlocal1
       @81	use   @50 [1]
                        stack: global@79 int@81
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68 
                         locals: tests::Bitwise?@16 int@81 int@55 int@59 Number@63 
  28:getlocal1
                        stack: global@79 int@81 int@81
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68 
                         locals: tests::Bitwise?@16 int@81 int@55 int@59 Number@63 
  29:add
       @82	i2d   @81
        	cse   @82
       @83	fadd  @82 @82
                        stack: global@79 Number@83
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68 
                         locals: tests::Bitwise?@16 int@81 int@55 int@59 Number@63 
  30:callproperty int 1
       @84	add   @81 @81
     	dead   @83
     	dead   @82
     	dead   @82
                        stack: int@84
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68 
                         locals: tests::Bitwise?@16 int@81 int@55 int@59 Number@63 
  33:getlocal1
                        stack: int@84 int@81
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68 
                         locals: tests::Bitwise?@16 int@81 int@55 int@59 Number@63 
  34:add
       @85	i2d   @84
       @86	i2d   @81
       @87	fadd  @85 @86
                        stack: Number@87
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68 
                         locals: tests::Bitwise?@16 int@81 int@55 int@59 Number@63 
  35:convert_i
       @88	add   @84 @81
     	dead   @87
     	dead   @86
     	dead   @85
                        stack: int@88
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68 
                         locals: tests::Bitwise?@16 int@81 int@55 int@59 Number@63 
  36:setlocal2
                        stack:
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@68 
                         locals: tests::Bitwise?@16 int@81 int@88 int@59 Number@63

In this on take a look at the “findpropstrict” which is done in every loop, and then after the first addition there is the “callproperty int” removing in the same time useless float calculus.
This variation keeps the code easy to read. That’s one big advantage!

// b = a + a
    // b += a
  20:getlocal1
       @66	use   @44 [1]
                        stack: int@66
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@57 
                         locals: tests::Bitwise?@15 int@66 int@49 int@53 
  21:getlocal1
                        stack: int@66 int@66
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@57 
                         locals: tests::Bitwise?@15 int@66 int@49 int@53 
  22:add
       @67	i2d   @66
        	cse   @67
       @68	fadd  @67 @67
                        stack: Number@68
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@57 
                         locals: tests::Bitwise?@15 int@66 int@49 int@53 
  23:convert_i
       @69	add   @66 @66
     	dead   @68
     	dead   @67
     	dead   @67
                        stack: int@69
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@57 
                         locals: tests::Bitwise?@15 int@66 int@49 int@53 
  24:setlocal2
                        stack:
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@57 
                         locals: tests::Bitwise?@15 int@66 int@69 int@53 
  25:getlocal2
                        stack: int@69
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@57 
                         locals: tests::Bitwise?@15 int@66 int@69 int@53 
  26:getlocal1
                        stack: int@69 int@66
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@57 
                         locals: tests::Bitwise?@15 int@66 int@69 int@53 
  27:add
       @70	i2d   @69
       @71	i2d   @66
       @72	fadd  @70 @71
                        stack: Number@72
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@57 
                         locals: tests::Bitwise?@15 int@66 int@69 int@53 
  28:convert_i
       @73	add   @69 @66
     	dead   @72
     	dead   @71
     	dead   @70
                        stack: int@73
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@57 
                         locals: tests::Bitwise?@15 int@66 int@69 int@53 
  29:setlocal2
                        stack:
                        scope: [global Object$ flash.events::EventDispatcher$ com.gskinner.performance::TestSuite$ tests::Bitwise$] tests::Bitwise@57 
                         locals: tests::Bitwise?@15 int@66 int@73 int@53

This is the fastest variation. Instead of calling anything it just save the current state and let the optimization happen.

    22      add         // Add both int  	
    23      convert_i   // Convert to integer? Optimize: Use integer add.  	
    24      setlocal2   // Operation 1 is over, save in register	
    25      getlocal2   // Start second operation (reload var)

After each operation, there is the integer optimization, and then a save to local register, followed by the reloading of the same register.
…but the code is a lot less easy to read.

Is this really useful information?

Well for everyday work, I guess not. Still it’s good to know how thing works.
There is some special application that could benefits efforts like 3D rendering engines. Physics engine etc.
For example, in a 3D space where you would like to map a 3 dimensional coordinate into one ByteArray, you could use:

Pos1D = (50 * 50 * z) + (50 * y + x);

And convert this to:

pos1D = 50*50;
pos1D *= z;
pos1D += int(50 * y);
pos1D += x;

The result is the same, except it’s 225% faster. Kinda cool.

Conclusion

I personally consider this as a bug. I hope Adobe (or Tamarin/NanoJIT team?) will fix it in 10.1 (didn’t try the optimization there yet).
But if fixed by Adobe in the next player release, this could mean great optimization for everybody!

Also, some compilers could do that in the background while generating the ByteCode (Yes Nicolas C. I’m talking to you! 🙂 ).

Also, I did some minor modification in the Base64 Class, I will commit that a bit later but instead of doing “writeInt” I’m using for bytearray[x] instead, this eliminate 3 bitOr and direct access in byte array is a lot faster than using wrinteInt. I gained almost 25% by doing this optimization. I guess I’m now over 1000% faster than AS3Crypto.

From → actionscript, general

32 Comments

wonderwhy-er permalink

Ouh my… Never thought that I will agree with Steve Jobs on anything(was apple neutral few years back but becoming Apple hater slowly) but stuff like this does produce words like “Adobe is lazy”…

Why such trivial optimizations are possible and compilers does not do them? Why developer should know about such ridiculous code optimizations like this
b = a + a;
b += a;

Those two are rhetorical questions…
- Avari permalink
  
  dh33OE I’m out of league here. Too much brain power on display!
Karl Macklin permalink

I don’t get it – can I draw use from this if I don’t know how to write Assembler?

Like the example you write right before “Conclusion” – would that be 225% faster?
Or does it require working with Assembler first?

It sounds really interesting if it’s something that can speed up regular calculations, could be useful in game programming which is something I do.

This is all way over my head, but I love reading stuff like this. Really liked your previous posts about mm.cfg!
- jpauclair permalink
  
  You don’t need any assembler! It’s just eye candy for geeks who want to understand the “Why” and not the “How” to optimize 😉
  
  You only need to make your code on multiple line or cast operations as int. that’s all!
David Wagner permalink

Interesting article.

Can I ask what your setup is? What version of the Flex compiler (or CS compiler), and what version of the Flash player you are running it on?

I’m curious because I tried testing similar code over 1,000,000 iterations in the performance test harness, and both functions executed in the same time – ~65ms in total. The bytecode output I get appears to be the same.

I was testing using the Flex 3.5.0.12683 and the release Flash player version WIN 10,0,22,87.

You can see the code for the functions I tested at http://gist.github.com/333822.

Great articles!
- jpauclair permalink
  
  Try to look at the assembler flash is generating.
  search the flashlog file looking for your function name. You should find something like verify*YourFunctionName
  Check if its doing the code optimization on int.
  It might just be my computer which handle badly sse2 optimization but it would rather surprise me.
  I’m on WinXP SP2 on a Core2 Duo
  I tested on both flash player external and flash player plugin for Chrome.
  - David Wagner permalink
    
    Cheers, I’ll have a look at the disassembly tonight when I get home. I’m pretty sure this article is correct, but I’m interested in whether or not the process of generating the assembler is changing the execution paths for the Flash VM – I’d really hope that they don’t have all that debugging in there wrapped in if’s 🙂
    
    Just to check – when you were doing the comparative timings, you were running it on a release version of the Flash player or plugin rather than the debug version? And you weren’t using any special options in mm.cfg?
    
    How many iterations did you use for your timings? I’ve tried a variety of values for the performance test running on a release version of the Flash player, but the difference between the functions is so small it’s likely to be within the margin of error in the timers. It’s literally down to values such as 65ms vs 67ms for the non-optimised vs optimised versions of the function call, with 1,000,000 iterations.
    
    My machine here is a Core 2 Quad CPU, so it’ll be interesting to compare it to my Mac at home, which is traditionally a bit duff at running Flash 🙂
Sindisil permalink

I love AS3, and AVM2 is loads better than AVM1 was, but this is pathetic, really.

I don’t know the reasons behind the lack of even basic optimizations in both the bytecode compiler and the JIT.

The bytecode compiler is just sad – in the year 2010, there’s no valid reason for a compiler to both generate poor code *and* run slower than molasses in January (Here in Minnesota. Running uphill. Into a stiff headwind.)

In the case of AVM2, I’m assuming that much of the problem is the lack of developer time and the prioritization of new features over optimization and correctness. That has worked out for Flash so far, but Flash is no longer the only game in town – it has many competitors nipping at its heels.

All props to what the AVM2 team has done so far, but Adobe had best either hire more compiler engineers, open source AVM2, or both, or I don’t think they’re going to maintain their lead. Ubiquity is a strong defense, but that only works until something else provides enough capability to drive a switch. Once that switch reaches critical mass, it’s hard, if not impossible, to turn the tide.

For reference, I ran similar tests to those shown here, with the following setup and results:

* 5,000,000 reps
* mean of three runs
* release Flash Player v10.0.45.2 in FF 3.6
* 1.6 GHz Atom (Samy NC10)
* All variables declared as :int.
* Accumulator assignment in loops to eliminate chance of calculation being optimized away (if only that were a problem ….).
```
Non-optimized t = 541 ms.
=========================
for (i = 0; i < REPS; ++i) {
    Pos1D = (50 * 50 * z) + (50 * y + x);
    acc += Pos1D;
}

Optimized (from here) t = 187 ms.
==================================
for (i = 0; i < REPS; ++i) {
    Pos1D = 50 * 50;
    Pos1D *= z;
    Pos1D += int(50 * y);
    Pos1D += x;
    acc += Pos1D;
}

Optimized one-liner t = 186 ms.
================================
for (i = 0; i < REPS; ++i) {
    Pos1D = int(int(50 * 50) * z) + int(int(50 * y) + x);
    acc += Pos1D;
}
```
I tried the one-liner out of curiosity, to see a) how clearly it read b) if the additional explicit int casts might change the speed any. It looks like it's exactly as fast, and might be more clear in simple cases.

[Sheesh. I need to stop leaving these long comments on various blogs and just start my own!]
- chamberlainpi permalink
  
  I much prefer your one-liner approach! The only assembly (or multiple-line statements) I need to write should be AGAL, not AS3!
  
  Thanks for sharing this Sindisil!
jpauclair permalink

I just tried with flashPlayer 10.1 beta 3 active-x
I can’t see any noticable difference between any method.

I just filled a bug because I am completly unable to trace into flashlogs.txt with the 10.1 beta 2 and 3.

https://bugs.adobe.com/jira/browse/FP-4161

Anyone experienced the same bug?
- Sindisil permalink
  
  Excellent! Glad to hear that the AVM2 team has been given some cycles to spend on JIT tech, and not just adding features. I’ll have to load up the beta and give it a try on some of my testing code to see where else things have improved!
  
  (BTW – thanks for editing my previous comment!)
- yonatan permalink
  
  Well, your bug report is not accessible (jira says it’s a permission violation), but yes.
  
  I’m getting zero length files with the linux standalone 10,1,53,64 players 😦
  
  Have you had any luck tracing into flashlogs.txt with *any* 10.1 players?
  - yonatan permalink
    
    Sorry, correction, AS3Trace works, AS3Verbose doesn’t seem to.
  - jpauclair permalink
    
    That has been the case for all 10.1 RC
    I guess it’s because of the LLVM update..
    I really hope they will put it back!
Jackson Dunstan permalink

I find this fascinating… and infuriating!

@Sindisil – I can confirm your tests and agree that there’s no good reason for optimization this poor.

Now if you’ll excuse me, I have a lot of code to go through. 😦

Thank you again for another great article Jean-Philippe!
jpauclair permalink

Oh, Nicoals! Just tried with haxe.
The ByteCode is perfectly clean!
b = a + a + a is the fastest option,
follow by the cast, and the multiline is last.

Nice job haXe guys!
- Nicolas permalink
  
  It’s been known for long time that MXMLC produces underoptimized bytecode for integer manipulation code 😉 I’m not sure that Adobe will fix that anytime soon since that would break some applications because the semantics of int32 overflow would change.
  
  haXe also produces quite nice bytecode for integer loop counters, if you want to look at it 😉
Agnes permalink

Do you know if this has been fixed in 10.1?
- jpauclair permalink
  
  I don’t know.
  But 10.1 is the player, not the compiler.
  And this is a compiler problem.
  The generated Opcode is just not optimized.
Jasper St. Pierre permalink

Looking at the source of Tamarin, it looks like convert_i, and, int the constructor and function, do the same thing, which is right here:

http://hg.mozilla.org/tamarin-redux/file/54d9cd1eb2c3/core/AvmCore-inlines.h#l406
- jpauclair permalink
  
  I’m note sure of what you mean? Did I made some kind of bad asumption?
  - Jasper St. Pierre permalink
    
    Not sure. Why are you using findpropstrict and then callpropstrict, if convert_i does the same thing.
  - Jasper St. Pierre permalink
    
    Bah, it’s my inability to read. I didn’t realize that this was AS3 that was generated, I thought this was bytecode tweaking. I was wondering why you did the findprop/callprop dance if you took it out in the end.
    
    Nevermind, I can’t comprehend basic English.
jwopitz permalink

I don’t know if the compiler has been optimized since you posted this but I am consistently getting the opposite results.

[CODE]
var a:int = 7, b:int = 31, c:int, i:int, m:int = 1000000;
var t:uint = getTimer();

while ( i < m )
{
c = a + a + b;
i++;
}
trace( "complex", getTimer() – t );
i = 0;
t = getTimer();

while ( i < m )
{
c = a + a;
c += b;
i++;
}
trace( "multiline", getTimer() – t );
[CODE]
- jwopitz permalink
  
  Forgot to post the results (averaging over a half dozen runs):
  complex 114
  multiline 147
  
  I also tested it such that all the variable declarations at the top had their own new line (which I didn’t think would make a difference) and it didn’t make a difference.
  
  You say this is a compiler optimization issue not a player issue… I am testing this in the latest Flash Builder Plugin for Eclipse on a PC (build 272416 targeting the 4.1.0.16076 SDK)
  - jpauclair permalink
    
    This was almost a year ago.
    I think they solved it a few months ago.
thienhaflash permalink

Great post as usual ! thanks man
Youtube.com permalink

Good day I am so grateful I found your site, I really found you by mistake, while I was looking on Bing for something else, Nonetheless I am here now and would just like to say thank you for a marvelous post and a all
round thrilling blog (I also love the theme/design), I don’t have time to
look over it all at the minute but I have bookmarked it and also
added your RSS feeds, so when I have time I will be
back to read a lot more, Please do keep up the fantastic b.