Friday, November 19, 2010

The Illusion of Performance Hacks

Don't try a performance tip until you learn how your compiler of choice works against that type of optimization; you might be another victim of premature optimization which, according to Donald Knuth; is the source of all evil. (In addition to teletubbies, they scare the living hell out of me)

I'll give you an example:
The XOR swap algorithm (really interesting stuff here) is a method to swap values of two variables without using a third temporary variable! Great, we just saved the overhead incurred by creating a third variable, right? RIGHT?

Well, I offer my deep condolences, but not in all cases; actually not in most cases. Modern compilers are very very good at optimizations themselves, and so using hacks like that can defeat the compiler's good optimization techniques and rendering it useless.
In our case, XOR swapping is done like this:

char a = 'A';
char b = 'B';
a ^=b; //<-- this means a = a^b;

The result is the values are swapped between the two variables in the same number of operations without using a third variable, which sounds as a good performance catch...

... Except the compiler already does its own optimization whenever it encounters an algorithm that it finds to be a swapping algorithm by using the same number of registers used as in XOR swap; all while performing the swap operation using the registers faster than going through the ALU for the XOR operation! Actually some CPU architectures can support internal swapping (simply exchanging memory addresses of variables) or, if supported, using a single
XCHG (exchange) instruction
. Yes, compilers are cool like that.

Another reason why the XOR swap might be bad optimization is that modern processors work on pipelines but the XOR operation flow forces the processor to perform XORing sequentially because each step in XOR swap depends on the previous one.

If you use XOR swap instead of regular swap, the compiler will not be able to use its own optimized swap. 

There is another reason that will surely result in programmers pulling their hair out if they did use the XOR swap trick. If you tried to XOR-swap two variables with the same value; you will lose data! XORing something with itself equals zero! Traditional swap doesn't have this problem.

Another similar hack is try swapping without using a third variable is by using:

a =a-b;
b = a+b;
a = -1*(a-b);

Which still suffers from the same problem. The reality is that the compiler is a lot better than most programmers at optimization, and that the best approach before you implement an optimization trick is to research it first, see how the compiler tries to optimize such a case, and then some sweet, sweet benchmarking. 


To cut the chase for you, unless you are experiencing extreme register spills (CPU register shortages) or developing for microcontrollers and smart cards with very limited RAM, you should use the regular swap algorithm.

Assembly language nuts might respond by reminding us that the assembly instruction xor is often used to zero out a register, and they are correct. To zero out a register, it is more efficient (especially in code size) to xor it with itself than to use a mov instruction, and it is especially useful for writing shellcodes since you won't have to write null bytes that break the vulnerable string functions that you may use to inject the shellcode. 


But we're not talking about zeroing out registers here, are we? We're talking about swapping. The two cases are disjoint and unrelated.


Trivia: Can you tell why xor is used to zero out a register instead of subtracting it by itself with a sub instruction?


  1. So what you're saying is that I should avoid manually optimizing my code in hope that the compiler would recognize relevant parts of the code and optimize them for me, right?

    Me thinks you're relying on the compiler too much. It might detect what you're trying to do and then again, it might not. It makes no sense to avoid optimizing your code altogether just to maximize the chances of your code being automatically optimized.

    Also, although gcc is fairly clever when it comes to optimization, not all c compilers are. There's no guaranty that other compilers will attempt to optimize your code let alone detect what you're doing.

    Besides, I like writing code that I have a hard time understanding, and I love using prefixed increment operator whenever possible which are rumored to be slightly more efficient than their postfixed counterparts. please don't take that simple joy from me :)

    You're pipeline argument makes some sense especially since some optimization techniques are rather convoluted beyond both human and optimizer comprehension/recognition. (but your examples have dependencies, so I don't think it will have much effect on pipelining).

    I really enjoyed your post as you can probably tell by this long comment. Keep'em coming.

  2. Well, no, I'm not saying saying you should avoid optimizing your code. By all means, you should. Especially since the compiler may not understand what you're really trying to do (unless it's obvious, like swapping values), and so you should take care of optimization, given two conditions:

    1- You try to have an idea of how the compiler optimizes code, in order to avoid messing up its optimization chances. In other words if you understand the compiler, you can write code in a way that the compiler can efficiently optimize.
    2- You perform code profiling before and after optimization. You might have a bottleneck elsewhere. Your optimization might mess up the efficiency of the code somewhere...etc

    Optimization is a tricky game. It can take your focus off something important, like code readability, correctness or security. Many times vulnerabilities are created by a programmer wanting to optimize the hell out of a function.

    Your best chances is to optimize your logic itself rather than the code. Try to optimize the method you want to use, then implement its code in smaller components. This way profiling can be easier, and the compiler will have a better chance of optimizing each chunk alone. This is how you would cooperate with the compiler.