assembly - New AVX-instructions syntax
I had C code written with Intel intrinsics. After I compiled it first with AVX and then with SSSE3 flags, I got two quite different assembly outputs. E.g.:
AVX:
vpunpckhbw %xmm0, %xmm1, %xmm2
SSSE3:
movdqa %xmm0, %xmm2
punpckhbw %xmm1, %xmm2
It's clear that vpunpckhbw is just punpckhbw using the AVX three-operand syntax. But are the latency and throughput of the first instruction equivalent to the latency and throughput of the last two combined? Or does the answer depend on the architecture I'm using? It's an Intel Core i5-6500, by the way.
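For reference, here is a minimal sketch of the kind of intrinsic code that leads to these instructions, assuming GCC or Clang on x86-64. The function name unpack_high_bytes is hypothetical and not taken from the original code. Built with -mavx, a compiler would typically emit the VEX-encoded vpunpckhbw; built with -mssse3, it would typically emit punpckhbw, preceded by a movdqa only when one of the inputs has to stay alive afterwards.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; punpckhbw is an SSE2 instruction */

/* Hypothetical helper (an assumption, not the original code):
 * interleave the high eight bytes of a and b, giving
 * a[8], b[8], a[9], b[9], ..., a[15], b[15]. */
__m128i unpack_high_bytes(__m128i a, __m128i b)
{
    return _mm_unpackhi_epi8(a, b);
}
```

The same source yields both encodings; only the compiler flags decide whether the destructive two-operand form or the three-operand form is used.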
I tried to search for an answer in Agner Fog's instruction tables but couldn't find one. The Intel specifications didn't help either (however, it's possible I missed the one I needed).
Is it better to use the new AVX syntax if possible?
I think the first question to ask is whether folded instructions are better than a non-folded instruction pair. Folding takes a pair of read and modify instructions like this
vmovdqa %xmm0, %xmm2
vpunpckhbw %xmm2, %xmm1, %xmm1
and "folds" them into one combined instruction
vpunpckhbw %xmm0, %xmm1, %xmm2
Since Ivy Bridge, a register-to-register move instruction can have zero latency and can use zero execution ports. However, the unfolded instruction pair still counts as two instructions in the front-end and can therefore affect the overall throughput. The folded instruction counts as only one instruction in the front-end, which lowers the pressure on the front-end without any side effects. This could increase the overall throughput.
However, for memory-to-register moves, folding may have a side effect (there is some debate about this) even if it lowers pressure on the front-end. The reason is that the out-of-order engine, from the front-end's point of view, sees only the folded instruction (assuming this answer is correct). If for some reason it would be more optimal to reorder the memory read (which does require execution ports and has latency) independently of the other operations in the folded instruction, the out-of-order engine won't be able to take advantage of that. I observed this for the first time here.
For your particular operation the AVX syntax is better, since it folds the register-to-register move. However, if you had a memory-to-register move, the folded AVX instruction could perform worse than the unfolded SSE instruction pair in some cases.
Note that, in general, it should still be better to use VEX-encoded instructions. But I think most compilers, if not all, assume folding is better, so you have no way to control the folding except with assembly (not with intrinsics) or, in some cases, by telling the compiler not to compile with AVX.
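As an illustration of steering this from C, here is a minimal sketch assuming GCC or Clang extended inline assembly. It is one commonly used trick, not something described above, and the helper name unpack_high_unfolded is hypothetical. An empty asm statement with an "x" (SSE register) constraint makes the loaded value opaque to the optimizer at that point, which usually keeps the load as a separate instruction instead of a memory operand of vpunpckhbw. Whether that actually helps performance is exactly the open question discussed above, so treat it as something to measure.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper (an assumption for illustration): load *p into a
 * register first, then unpack the high bytes of b and the loaded value. */
__m128i unpack_high_unfolded(const __m128i *p, __m128i b)
{
    __m128i tmp = _mm_loadu_si128(p);  /* unaligned load from memory */

    /* Empty asm with a read-write "x" constraint: emits no instruction,
     * but the compiler must have tmp in an XMM register here and cannot
     * see through the statement, so the load is not merged into the
     * following unpack. Assumes GCC/Clang extended asm syntax. */
    __asm__ ("" : "+x"(tmp));

    return _mm_unpackhi_epi8(b, tmp);
}
```

Comparing the generated assembly with and without the asm line (for example with -S and -mavx) shows whether the compiler folded the load in each case.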