For most coding jobs, it isn't worth dropping down to assembly instructions. The added complexity and obscurity usually outweigh the relatively modest gains. Compilers have become quite adept at generating code, and processors are so fast that it is difficult to win meaningful speed by hand-tweaking a small piece of code. That changes when you introduce SIMD instructions and need to decode lots of bitsets quickly. Intel's fancy AVX-512 SIMD instructions can offer significant performance gains with relatively little custom assembly.
Like many software engineers, [Daniel Lemire] deals with lots of bitsets (a set of ints/enums encoded into a binary number, each bit corresponding to a different integer or enum). Rather than just checking whether a specific flag is present (a bitwise AND), [Daniel] wanted to know all of the flags set in a given bitset. The simplest way is to iterate through them all, like so:
while (word != 0) {
  result[i] = trailingzeroes(word);
  word = word & (word - 1);
  i++;
}
Naive versions of this loop are likely to suffer branch mispredictions, and either you or the compiler will unroll the loop to speed it up. However, the AVX-512 instruction set on the latest Intel processors contains some handy instructions for this sort of thing. The instruction is vpcompressd, and Intel provides a simple, if not exactly memorably named, C/C++ function called _mm512_mask_compressstoreu_epi32.
The function produces an array of integers, and you can use the famous popcnt instruction to get the number of set bits. Some preliminary benchmarks show the AVX-512 version using 45% fewer cycles. You may be wondering: since wide 512-bit registers are used, doesn't the processor downclock? It does. But even with the downclocking, the SIMD version is still 33% faster. The code is on GitHub if you want to try it yourself.