Is division slower than multiplication?

Blog

Directory

Submitted by srchvrs on Mon, 11/18/2013 - 02:48

I was discussing efficiency issues, related to nearest-neighbor searching, with Yury Malkov. One topic was: "is division slower than multiplication?" I personally believed that there should not be much difference, as in the case of other simple arithmetic operations. Yury pointed out that division was much slower. In particular, according to of Agner Fog, not only division has high latency (~20 CPU cycles), but also a low throughput (0.05 division per CPU cycle).

As I explained during my presentation at SISAP 2013, optimizing computation of a distance function can be much more important than designing a new data structure. In that, efficient computation of quotients can be crucial to some distance functions. So, it is important to know if multiplication is faster than division. If it is faster, then how much faster?
This topic is not new, but I have not found sufficiently thorough tests for recent CPUs. Thus, I have run my own and have come to a conclusion that multiplication is, indeed, faster. Division has higher latency than multiplication and in some cases this difference can be crucial. In my tests, there was a 2-6x difference. Last, but not least, when there are data dependencies, multiplication can also be slow and take several CPU cycles to complete. The details are below.

The code can be found on GitHub (module testdiv.cpp). To compile, I used the following command (the flag -Ofast is employed):

make -f Makefile.gcc_Ofast

Even though there is a makefile for the Intel compiler, I don't recommend using it. Intel can "cheat" and skip computations. I tested using the laptop version of Core i7. Similar results were obtained using the server version of Core i7 and a relatively modern AMD processor. Note that, in addition, to complete runtime, I compute the number of divisions/multiplications per second. This is somewhat inaccurate, because other operations (such as additions) also take some time. Yet, because multiplications and divisions appear to be considerably more expensive than other operations, this can serve as a rough indicator of throughput in different setups.

In the first test (testDivDataDep0), I deliberately introduce data dependencies. The result of one division becomes an argument of the next one:

 for(size_t i = 0; i < rep; ++i) {
     c1+=b1/c4;
     c2+=b2/c1;
     c3+=b3/c2;
     c4+=b4/c3;
 }

Similarly, see the function testMulDataDep0, I introduce dependencies for the multiplication:

for(size_t i = 0; i < rep; ++i) {
    c1+=b1*c4;
    c2+=b2*c1;
    c3+=b3*c2;
    c4+=b4*c3;
}

The functions testMulMalkovDataDep0 and testDivMalkovDataDep0 are almost identical, except one uses multiplication and another uses division. Measurements, show that testDivMalkovDataDep, which involves division, takes twice as long to finish. I can compute about 200 million divisions per second and about 400 million multiplications per second.

Let's now rewrite the code a bit, so our dependencies are "milder": To compute an i-th operation, we need to know the result of the (i-4)-th operation .The function to test efficiency of division (testDivDataDep1) now contains the following code:

for(size_t i = 0; i < rep; ++i) {
    c1+=b1/c1;
    c2+=b2/c2;
    c3+=b3/c3;
    c4+=b4/c4;
}

Similarly, we modify the function to carry out multiplications (testMulDataDep1):

for(size_t i = 0; i < rep; ++i) {
    c1+=b1*c1;
    c2+=b2*c2;
    c3+=b3*c3;
    c4+=b4*c4;
}

I can now carry out 450 million divisions and 2.5 billion multiplications per second. The overall runtime of the function that tests efficiency of division is 5 times as long compared to the function that tests efficiency of multiplications. In addition, the run-time of the function testMulMalkovDataDep0 is 5 times as long as the runtime of the function testMulMalkovDataDep1. To my understanding, the reason for such a difference is that computation of multiplications in testMulMalkovDataDep0 takes much longer than in testMulMalkovDataDep1 (due to data dependencies). What can we conclude at this point? Apparently, the latency of division is higher than that of multiplication. However, in the presence of data dependencies, multiplication can also be slow and take several CPU cycles to complete.

To conclude, I re-iterate that there appears to be some difference between multiplication and divisions. This difference does exist even in top-notch CPUs. Further critical comments and suggestions are appreciated.

Acknowledgements: I thank Yury Malkov for the discussion and scrutinizing some of my tests.

UPDATE 1: I have also tried to implement an old trick, where you reduce the number of divisions at the expense of carrying out additional multiplications. It is implemented in the function testDiv2Once. It does help to improve speed, but not always. In particular, it is not especially useful for single-precision floating-point numbers. Yet, you can get a 60% reduction of run-time for the type long double. To see this, change the value of constant USE_ONLY_FLOAT to false.

UPDATE 2: It is also interesting that vectorization (at least for single-precision floating-point operations) does not help to improve the speed of multiplication. Yet, it may boost the throughput of division in 4 times. Please, see the code here.

UPDATE 3: Maxim Zakharov ran my code for the ARM CPU. And the results (see the comments) are similar to that of Intel: division is about 3-6 times slower than multiplication.

srchvrs's blog

Comments

Submitted by Maxim Zakharov (not verified) on Wed, 11/20/2013 - 20:51

Below is the result of testdiv execution on RaspberryPi, which is an ARMv6 platform. SIMD tests were mocked as it seems there is no appropriate intrinsic data types for ARM platform.
It looks like multiplication is 3 to 8 times faster for ARM.

Thank for pointing out this fact, I will review my ranking function according it.

testDivDataDep0
Ignore: 4.35421e+06
Test WITH data dependencies
DIVs computed: 25600000, time 1103.79 ms, type: f
Milllions of DIVs per sec: 23.1929
=============================
testDivDataDep0
Ignore: 4.35445e+06
Test WITH data dependencies
DIVs computed: 25600000, time 1629.09 ms, type: d
Milllions of DIVs per sec: 15.7143
=============================
testDivDataDep0
Ignore: 4.35445e+06
Test WITH data dependencies
DIVs computed: 25600000, time 1610.42 ms, type: e
Milllions of DIVs per sec: 15.8965
=============================
testDivDataDep1
Ignore: 4.39545e+06
Test WITH data dependencies
DIVs computed: 25600000, time 737.966 ms, type: f
Milllions of DIVs per sec: 34.6899
=============================
testDivDataDep1
Ignore: 4.39296e+06
Test WITH data dependencies
DIVs computed: 25600000, time 1272.87 ms, type: d
Milllions of DIVs per sec: 20.1121
=============================
testDivDataDep1
Ignore: 4.39296e+06
Test WITH data dependencies
DIVs computed: 25600000, time 1265.07 ms, type: e
Milllions of DIVs per sec: 20.2361
=============================
testDivDataDep1_SIMD
Ignore: 0
Test WITH data dependencies
DIVs computed: 25600000, time 0.002 ms, type: f
Milllions of DIVs per sec: 1.28e+07
=============================
testDiv2AtOnce
Ignore: 4.46317e+06
Test WITH data dependencies
DIVs computed: 25600000, time 633.01 ms, type: f
Milllions of DIVs per sec: 40.4417
=============================
testDiv2AtOnce
Ignore: 4.45573e+06
Test WITH data dependencies
DIVs computed: 25600000, time 930.373 ms, type: d
Milllions of DIVs per sec: 27.5158
=============================
testDiv2AtOnce
Ignore: 4.45573e+06
Test WITH data dependencies
DIVs computed: 25600000, time 933.908 ms, type: e
Milllions of DIVs per sec: 27.4117
=============================
testMulDataDep0
Ignore: 5.20709e+37
Test WITH data dependencies
MULs computed: 25600000, time 359.797 ms, type: f
Milllions of MULs per sec: 71.1512
=============================
testMulDataDep0
Ignore: 5.20821e+37
Test WITH data dependencies
MULs computed: 25600000, time 393.27 ms, type: d
Milllions of MULs per sec: 65.0952
=============================
testMulDataDep0
Ignore: 5.20821e+37
Test WITH data dependencies
MULs computed: 25600000, time 397.355 ms, type: e
Milllions of MULs per sec: 64.426
=============================
testMulDataDep1
Ignore: 4.89995e+23
Test WITH data dependencies
MULs computed: 25600000, time 128.224 ms, type: f
Milllions of MULs per sec: 199.651
=============================
testMulDataDep1
Ignore: 4.90027e+23
Test WITH data dependencies
MULs computed: 25600000, time 147.656 ms, type: d
Milllions of MULs per sec: 173.376
=============================
testMulDataDep1
Ignore: 4.90027e+23
Test WITH data dependencies
MULs computed: 25600000, time 143.433 ms, type: e
Milllions of MULs per sec: 178.481
=============================
testMulDataDep1_SIMD
Ignore: 0
Test WITH data dependencies
DIVs computed: 25600000, time 0.002 ms, type: f
Milllions of DIVs per sec: 1.28e+07
=============================

Submitted by srchvrs on Thu, 11/21/2013 - 01:18

This is great, thank you! Surprisingly, the difference between division and multiplication is similar to that of the newest Intel, but is slightly larger.

In addition, the trick to reduce the number of divisions through multiplication by reciprocals results in larger performance boost.

You are here

Comments