The bitwise exclusive or (e.g., 1110 XOR 1001 = 0111) looks simpler to compute than integer addition (e.g., 2 + 9 = 11). Some research articles claim that XOR is faster. It appears to be Computer Science folklore. But is it true?

Which line runs faster? (The symbol “^” is the XOR.)

for(int k = 0; k < N; ++k) sum+= k;

for(int k = 0; k < N; ++k) sum^= k;

My result: In C++ and Java, both run at the same speed (within 1%).

Disclaimer: I’d be delighted if you could prove me wrong. Please provide Java or C++ source code.

15 Comments »

  1. Unless there is more context here, there should be no difference, as both ops map to a single-cycle operation on modern processors. Arbitrary-precision numbers maybe?

    However, at the pure hardware level, XOR is faster than addition since there is no carry bit, but other details obscure that and for all practical purposes, they run at the same speed as instructions.

    Comment by Eric LaForest — 12/3/2010 @ 20:32

  2. They’re probably both the same number of machine cycles, if only because there’s no way for an ALU to take advantage of fewer stages of transistors in the XOR versus the ADD (the outputs of the combinatorial logic are just latched into the output after the same amount of time regardless of the operation of that type. Is integer multiplication any different these days (I’m too lazy to check for myself)?

    Comment by Mike Stiber — 12/3/2010 @ 20:32

  3. @LaForest Yes, I agree.

    @Stiber

    Multiplication might run at the same speed, but you have to use a different test.

    Comment by Daniel Lemire — 12/3/2010 @ 20:48

  4. The answer is going to depend on the CPU.

    For anything like current desktop and server CPUs, they use a ridiculous amount of hardware to accelerate a single instruction stream. Both operations are common and pretty much dead simple, so I’d expect both to be among the very-fastest instructions.

    This is likely true of embedded CPUs, like those used in the iPad and cell phones.

    Multiplication might show a bigger difference (compared to simpler operations) between high-end and low-end CPUs.

    Comment by Preston L. Bannister — 12/3/2010 @ 22:17

  5. XOr requires less transistor than addition. This is one measure of complexity on which XOr wins.

    XOr would probably be harder to teach to my kid than sums. This is a measure of complexity on which addition wins.

    We have a stalemate…

    (As a sidenote, have you thought of updating your spam protection mechanism to ask for XOrs instead of sums? ;) )

    Comment by Philippe Beaudoin — 12/3/2010 @ 23:23

  6. I’m finding that when I compile with gcc, xor is 10-20% faster. When I compile the same code with g++, add is about 10% faster. Here is the code I used: http://pastebin.com/PzmmFsai .

    uname -a ouputs:
    Linux lappy 2.6.32-ARCH #1 SMP PREEMPT Tue Feb 23 19:43:46 CET 2010 x86_64 Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz GenuineIntel GNU/Linux

    And I’m using gcc 4.4.3.

    Comment by Drew Frank — 13/3/2010 @ 3:44

  7. Isn’t this because there are complex carry lookahead adder circuitry already incorporated in the hardware ? XOR is a bit wise parallel simpler operation. So in that sense, ADD finishes at the same
    same as XOR because hardware has been optimised for AND.

    Comment by Harisankar H — 13/3/2010 @ 4:22

  8. Well Daniel, you have achieved a measure of success with your weblog – the spammers have arrived, despite your protection.

    Either that, or we have someone (a student?) with an odd sense of humor.

    Comment by Preston L. Bannister — 13/3/2010 @ 6:47

  9. @Frank With the same code, and testing over many runs, I get that both run at the same speed.

    I have modified your code so that it loops around more:

    http://pastebin.com/mEZBBYKZ


    $ g++ -O2 -o code1 code1.cpp
    $ g++ -O2 -o code2 code2.cpp
    $ time ./code1 9999999
    real 0m5.953s
    user 0m5.944s
    sys 0m0.002s
    $ time ./code2 9999999
    real 0m5.974s
    user 0m5.971s
    sys 0m0.002s

    Comment by Daniel Lemire — 13/3/2010 @ 10:44

  10. @Bannister I have a bunch of counter-measures against spammers for this blog. They work very well, but a small percentage of spam is unavoidable. I prune spam comments every day, believe it or not.

    Comment by Daniel Lemire — 13/3/2010 @ 10:47

  11. Folk, you are wasting your time comparing XOR against addition – both are dirt cheap and going to take the minimum amount of time possible for an instruction that takes two operands and stores a result.

    The difference in complexity between the two operations (if any) is insignificant compared to the transistor budget of the CPU designer, and has been for a couple decades.

    If you find a difference, it will be to either (1) a measurement error, or (2) something wonky in your favorite language interpreter.

    I used to pay a lot of attention to CPU architecture, counting clock cycles per instruction, and writing benchmarks to verify the variants of instruction sequences. As the CPU designers got a bigger transistor budget, they dropped in bigger circuit blocks to perform common operations in a single cycle, or as close as possible.

    I was delighted when CPUs got barrel shifters. Ever try to write efficient graphics ops when bit-shift time is linear to the size of the shift? That was a long time back. (And I found a solution.)

    When the Intel 486 came out, I lost interest. Most common operations were at that point were very fast.

    There is another aspect to this. The unique logic in a CPU is like lines of code in software (and VLSI design became more like software design). Since then CPUs have become very large programs indeed. Large programs almost invariably have bugs. We normally attribute failures to bugs in software, but some are bugs in hardware, and we do not know the proportion.

    With current CPUs, massive hardware and overlapped speculative execution has even removed or minimized the cost of subroutine calls and pointer chasing – in many cases. Instruction clock-cycle counting has not been effective for a long time.

    Comment by Preston L. Bannister — 13/3/2010 @ 15:08

  12. FYI: Any test that counts the number of cpu cycles is going to have issues (this is essentially due to the Heisenberg uncertainty principle). Also, to answer a question from above, multiplication (i.e. the IMULT assembly command) usually runs at about 4 clock cycles. However, this is only when you are doing a lot of multiplication. If you are doing only a few this speed is slower. See an discussion of optimized multiplexers for more understanding. Also, if the two number being multiplied are greater than the size of register (ie. 32 bits or 64 bits) the speed is O(lg(n)) where n is the number of bits. While this may seem odd to mention, it is very common on any system doing cryptological calculation (particular those for the ElGamal and RSA encryption systems).

    Comment by David Hessler — 12/3/2011 @ 22:48

  13. I have done the same experiment but with xor and sub, since k -= k is 0 and k ^= k is also 0
    the result is :
    sub : 151 secs. (7 tries)
    xor : 171 secs. (also 7 tries)
    this shows that sub is faster

    Code:

    #include

    using namespace std;

    int main()
    {
    int k = 1000;
    for(int i= 0 ; i<10000; i++)
    {
    k -= k;
    cout << i << endl;
    k = 1000;

    }
    }

    and

    #include

    using namespace std;

    int main()
    {
    int k = 1000;
    for(int i= 0 ; i<10000; i++)
    {
    k ^= k;
    cout << i << endl;
    k = 1000;

    }
    }

    Comment by ethanara — 10/9/2011 @ 5:19

  14. Addition takes longer to compute in *hardware* because the carry bit has to be propogated through N sequential calculations — each of which involves gate delays. When computing XOR, each bit can be calculated in parallel.

    Whether it takes longer on a particular hardware implementation is implementation-dependent. Instruction timing on state-of-the-art Intel processors is complicated, to say the least. But, according to the Intel Architecture Optimization Manula, On intel architectures, The pair of integer ALU pipeline stages can each execute xor and addition operations in one clock cycle. They execute at exactly the same speed.

    That’s true for Pentium and later processors. It’s possible that smaller and leaner processors do have different execution times for XOR and ADD; but I would think that — given the prevalence of addition operations in computing, that even tiny processors would use the propogation delay time of an addition operation would for the low limit for a single pipleline stage in any modern microprocess, and all or almost all ancient ones as well. (Despite having taken high-school latin, I also think your spam filter is inappropriate).

    Comment by ed rowland — 10/9/2011 @ 23:51

  15. Would it not be wise and slightly more accurate if you were using unsigned integers?

    Comment by Aloz1 — 19/11/2011 @ 17:21

Leave a comment

Warning: When entering a long comment, please ensure that you make copy of your text prior to submitting it. If the server should fail or if you hit a bug, you might lose your work. I am not responsible for your lost effort.

To spammers: I carefully review every single post and make sure that spam gets deleted. You are wasting your time if you are manually entering spam using this form. Read my terms of use to see what I consider to be abusive.

Example: duo plus septem is '9'. The numbers are expressed in latin numerals but you should give your answers using ordinary digits.

 

« Blog's main page

Powered by WordPress