UPDATE: As pointed out here, this benchmark seems to have several issues; in particular, the compiler seems to vectorize the scalar addition, which is pretty awesome. I apologize for possibly misleading the readers. This issue clearly needs revisiting and "rebenchmarking".
Imagine that you have 4 floating-point values stored in a 128-bit SSE register, or 8 floating-point values stored in a 256-bit AVX register. How do you sum them up efficiently? One naive solution involves storing the values of the register to memory and summing them up using scalar operations. Is it possible to vectorize the summation efficiently? In other words, is it possible to sum up using SSE or AVX operations?
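To make the naive baseline concrete, here is a minimal sketch of what "spill to memory and add with scalar operations" looks like. The function names and the exact form are mine, not taken from the benchmark code.

```cpp
#include <immintrin.h>

// Naive baseline: write all lanes to a temporary buffer,
// then add them with ordinary scalar additions.
static inline float sum_sse_scalar(__m128 v) {
    alignas(16) float buf[4];
    _mm_store_ps(buf, v);                  // spill the 4 lanes to memory
    return buf[0] + buf[1] + buf[2] + buf[3];
}

static inline float sum_avx_scalar(__m256 v) {
    alignas(32) float buf[8];
    _mm256_store_ps(buf, v);               // spill the 8 lanes to memory
    float s = 0.0f;
    for (int i = 0; i < 8; ++i) s += buf[i];
    return s;
}
```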
I tried some obvious solutions, in particular an approach that uses the so-called "horizontal" addition. I also tried to extract the floating-point values using SIMD operations and sum them with scalar additions, but without storing the original floating-point values to memory. Both variants are sketched below.
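The following sketches show roughly what these two variants look like for a 128-bit register; they are illustrative rather than copies of the benchmarked code.

```cpp
#include <immintrin.h>

// Variant 1: "horizontal" addition (SSE3 _mm_hadd_ps).
// Two hadd steps leave the full sum replicated in every lane;
// the last step extracts lane 0.
static inline float sum_sse_hadd(__m128 v) {
    __m128 h = _mm_hadd_ps(v, v);   // (a0+a1, a2+a3, a0+a1, a2+a3)
    h = _mm_hadd_ps(h, h);          // full sum in all four lanes
    return _mm_cvtss_f32(h);
}

// Variant 2: extract the lanes with SIMD shuffles (no trip through memory)
// and add them with scalar additions.
static inline float sum_sse_extract(__m128 v) {
    float s0 = _mm_cvtss_f32(v);
    float s1 = _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1)));
    float s2 = _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2)));
    float s3 = _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3)));
    return (s0 + s1) + (s2 + s3);
}
```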
None of these approaches worked faster than the simpler scalar approach. Maybe the reason is that the SIMD operations involved have relatively high latency. In contrast, scalar additions have lower latency, and the CPU can execute several of them in parallel thanks to superscalar execution. Yet, I am not 100% sure that this explanation is correct. In any case, it seems to be hard to sum up the values of SSE/AVX registers efficiently. As usual, my code is available for scrutiny.