Which is fastest: integer addition or XOR?
The bitwise exclusive or (e.g., 1110 XOR 1001 = 0111) looks simpler to compute than integer addition (e.g., 2 + 9 = 11). Some research articles claim that XOR is faster. It appears to be Computer Science folklore. But is it true?
Which line runs faster? (The symbol “^” is the XOR.)
for(int k = 0; k < N; ++k) sum+= k;
for(int k = 0; k < N; ++k) sum^= k;
My result: In C++ and Java, both run at the same speed (within 1%).
Disclaimer: I’d be delighted if you could prove me wrong. Please provide Java or C++ source code.
Montreal, Canada 
Facebook
Friendfeed
LinkedIn
SlideShare
Delicious
Unless there is more context here, there should be no difference, as both ops map to a single-cycle operation on modern processors. Arbitrary-precision numbers maybe?
However, at the pure hardware level, XOR is faster than addition since there is no carry bit, but other details obscure that and for all practical purposes, they run at the same speed as instructions.
Comment by Eric LaForest — 12/3/2010 @ 20:32
They’re probably both the same number of machine cycles, if only because there’s no way for an ALU to take advantage of fewer stages of transistors in the XOR versus the ADD (the outputs of the combinatorial logic are just latched into the output after the same amount of time regardless of the operation of that type. Is integer multiplication any different these days (I’m too lazy to check for myself)?
Comment by Mike Stiber — 12/3/2010 @ 20:32
@LaForest Yes, I agree.
@Stiber
Multiplication might run at the same speed, but you have to use a different test.
Comment by Daniel Lemire — 12/3/2010 @ 20:48
The answer is going to depend on the CPU.
For anything like current desktop and server CPUs, they use a ridiculous amount of hardware to accelerate a single instruction stream. Both operations are common and pretty much dead simple, so I’d expect both to be among the very-fastest instructions.
This is likely true of embedded CPUs, like those used in the iPad and cell phones.
Multiplication might show a bigger difference (compared to simpler operations) between high-end and low-end CPUs.
Comment by Preston L. Bannister — 12/3/2010 @ 22:17
XOr requires less transistor than addition. This is one measure of complexity on which XOr wins.
XOr would probably be harder to teach to my kid than sums. This is a measure of complexity on which addition wins.
We have a stalemate…
(As a sidenote, have you thought of updating your spam protection mechanism to ask for XOrs instead of sums?
)
Comment by Philippe Beaudoin — 12/3/2010 @ 23:23
I’m finding that when I compile with gcc, xor is 10-20% faster. When I compile the same code with g++, add is about 10% faster. Here is the code I used: http://pastebin.com/PzmmFsai .
uname -a ouputs:
Linux lappy 2.6.32-ARCH #1 SMP PREEMPT Tue Feb 23 19:43:46 CET 2010 x86_64 Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz GenuineIntel GNU/Linux
And I’m using gcc 4.4.3.
Comment by Drew Frank — 13/3/2010 @ 3:44
Isn’t this because there are complex carry lookahead adder circuitry already incorporated in the hardware ? XOR is a bit wise parallel simpler operation. So in that sense, ADD finishes at the same
same as XOR because hardware has been optimised for AND.
Comment by Harisankar H — 13/3/2010 @ 4:22
Well Daniel, you have achieved a measure of success with your weblog – the spammers have arrived, despite your protection.
Either that, or we have someone (a student?) with an odd sense of humor.
Comment by Preston L. Bannister — 13/3/2010 @ 6:47
@Frank With the same code, and testing over many runs, I get that both run at the same speed.
I have modified your code so that it loops around more:
http://pastebin.com/mEZBBYKZ
$ g++ -O2 -o code1 code1.cpp
$ g++ -O2 -o code2 code2.cpp
$ time ./code1 9999999
real 0m5.953s
user 0m5.944s
sys 0m0.002s
$ time ./code2 9999999
real 0m5.974s
user 0m5.971s
sys 0m0.002s
Comment by Daniel Lemire — 13/3/2010 @ 10:44
@Bannister I have a bunch of counter-measures against spammers for this blog. They work very well, but a small percentage of spam is unavoidable. I prune spam comments every day, believe it or not.
Comment by Daniel Lemire — 13/3/2010 @ 10:47
Folk, you are wasting your time comparing XOR against addition – both are dirt cheap and going to take the minimum amount of time possible for an instruction that takes two operands and stores a result.
The difference in complexity between the two operations (if any) is insignificant compared to the transistor budget of the CPU designer, and has been for a couple decades.
If you find a difference, it will be to either (1) a measurement error, or (2) something wonky in your favorite language interpreter.
I used to pay a lot of attention to CPU architecture, counting clock cycles per instruction, and writing benchmarks to verify the variants of instruction sequences. As the CPU designers got a bigger transistor budget, they dropped in bigger circuit blocks to perform common operations in a single cycle, or as close as possible.
I was delighted when CPUs got barrel shifters. Ever try to write efficient graphics ops when bit-shift time is linear to the size of the shift? That was a long time back. (And I found a solution.)
When the Intel 486 came out, I lost interest. Most common operations were at that point were very fast.
There is another aspect to this. The unique logic in a CPU is like lines of code in software (and VLSI design became more like software design). Since then CPUs have become very large programs indeed. Large programs almost invariably have bugs. We normally attribute failures to bugs in software, but some are bugs in hardware, and we do not know the proportion.
With current CPUs, massive hardware and overlapped speculative execution has even removed or minimized the cost of subroutine calls and pointer chasing – in many cases. Instruction clock-cycle counting has not been effective for a long time.
Comment by Preston L. Bannister — 13/3/2010 @ 15:08