SuccessChanges

Summary

  1. [AMDGPU] Simplify the exclusive scan used for optimized atomics Summary: Change the scan algorithm to use only power-of-two shifts (1, 2, 4, 8, 16, 32) instead of starting off shifting by 1, 2 and 3 and then doing a 3-way ADD, because: 1. It simplifies the compiler a little. 2. It minimizes vgpr pressure because each instruction is now of the form vn = vn + vn << c. 3. It is more friendly to the DPP combiner, which currently can't combine into an ADD3 instruction. Because of #2 and #3 the end result is improved from this: v_add_u32_dpp v4, v3, v3 row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0 v_mov_b32_dpp v5, v3 row_shr:2 row_mask:0xf bank_mask:0xf v_mov_b32_dpp v1, v3 row_shr:3 row_mask:0xf bank_mask:0xf v_add3_u32 v1, v4, v5, v1 s_nop 1 v_add_u32_dpp v1, v1, v1 row_shr:4 row_mask:0xf bank_mask:0xe s_nop 1 v_add_u32_dpp v1, v1, v1 row_shr:8 row_mask:0xf bank_mask:0xc s_nop 1 v_add_u32_dpp v1, v1, v1 row_bcast:15 row_mask:0xa bank_mask:0xf s_nop 1 v_add_u32_dpp v1, v1, v1 row_bcast:31 row_mask:0xc bank_mask:0xf To this: v_add_u32_dpp v1, v1, v1 row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0 s_nop 1 v_add_u32_dpp v1, v1, v1 row_shr:2 row_mask:0xf bank_mask:0xf bound_ctrl:0 s_nop 1 v_add_u32_dpp v1, v1, v1 row_shr:4 row_mask:0xf bank_mask:0xe s_nop 1 v_add_u32_dpp v1, v1, v1 row_shr:8 row_mask:0xf bank_mask:0xc s_nop 1 v_add_u32_dpp v1, v1, v1 row_bcast:15 row_mask:0xa bank_mask:0xf s_nop 1 v_add_u32_dpp v1, v1, v1 row_bcast:31 row_mask:0xc bank_mask:0xf I.e. two fewer computational instructions, one extra nop where we could schedule something else. Reviewers: arsenm, sheredom, critson, rampitec, vpykhtin Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D64411
  2. [Loop Peeling] Enable peeling of multiple exits by default. Enable loop peeling with multiple exits where all non-latch exits ends up with deopt by default. Reviewers: reames, fhahn Reviewed By: reames Subscribers: xbolva00, hiraditya, zzheng, llvm-commits Differential Revision: https://reviews.llvm.org/D64619
Revision 366543 by foad:
[AMDGPU] Simplify the exclusive scan used for optimized atomics

Summary:
Change the scan algorithm to use only power-of-two shifts (1, 2, 4, 8,
16, 32) instead of starting off shifting by 1, 2 and 3 and then doing
a 3-way ADD, because:

1. It simplifies the compiler a little.
2. It minimizes vgpr pressure because each instruction is now of the
   form vn = vn + vn << c.
3. It is more friendly to the DPP combiner, which currently can't
   combine into an ADD3 instruction.

Because of #2 and #3 the end result is improved from this:

  v_add_u32_dpp v4, v3, v3  row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0
  v_mov_b32_dpp v5, v3  row_shr:2 row_mask:0xf bank_mask:0xf
  v_mov_b32_dpp v1, v3  row_shr:3 row_mask:0xf bank_mask:0xf
  v_add3_u32 v1, v4, v5, v1
  s_nop 1
  v_add_u32_dpp v1, v1, v1  row_shr:4 row_mask:0xf bank_mask:0xe
  s_nop 1
  v_add_u32_dpp v1, v1, v1  row_shr:8 row_mask:0xf bank_mask:0xc
  s_nop 1
  v_add_u32_dpp v1, v1, v1  row_bcast:15 row_mask:0xa bank_mask:0xf
  s_nop 1
  v_add_u32_dpp v1, v1, v1  row_bcast:31 row_mask:0xc bank_mask:0xf

To this:

  v_add_u32_dpp v1, v1, v1  row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0
  s_nop 1
  v_add_u32_dpp v1, v1, v1  row_shr:2 row_mask:0xf bank_mask:0xf bound_ctrl:0
  s_nop 1
  v_add_u32_dpp v1, v1, v1  row_shr:4 row_mask:0xf bank_mask:0xe
  s_nop 1
  v_add_u32_dpp v1, v1, v1  row_shr:8 row_mask:0xf bank_mask:0xc
  s_nop 1
  v_add_u32_dpp v1, v1, v1  row_bcast:15 row_mask:0xa bank_mask:0xf
  s_nop 1
  v_add_u32_dpp v1, v1, v1  row_bcast:31 row_mask:0xc bank_mask:0xf

I.e. two fewer computational instructions, one extra nop where we could
schedule something else.

Reviewers: arsenm, sheredom, critson, rampitec, vpykhtin

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D64411
Change TypePath in RepositoryPath in Workspace
The file was modified/llvm/trunk/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp (diff)llvm.src/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
The file was modified/llvm/trunk/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll (diff)llvm.src/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll
Revision 366542 by skatkov:
[Loop Peeling] Enable peeling of multiple exits by default.

Enable loop peeling with multiple exits where all non-latch exits
ends up with deopt by default.

Reviewers: reames, fhahn
Reviewed By: reames
Subscribers: xbolva00, hiraditya, zzheng, llvm-commits
Differential Revision: https://reviews.llvm.org/D64619
Change TypePath in RepositoryPath in Workspace
The file was modified/llvm/trunk/lib/Transforms/Utils/LoopUnrollPeel.cpp (diff)llvm.src/lib/Transforms/Utils/LoopUnrollPeel.cpp