1. [InstCombine] add ptr difference tests; NFC (details)
  2. [InstCombine] improve fold of pointer differences (details)
  3. [SelectionDAG][X86][ARM] Teach ExpandIntRes_ABS to use sra+add+xor expansion when ADDCARRY is supported. (details)
  4. [SCCP] Compute ranges for supported intrinsics (details)
  5. [KnownBits] Avoid some copies (NFC) (details)
  6. Reland [SimplifyCFG][LoopRotate] SimplifyCFG: disable common instruction hoisting by default, enable late in pipeline (details)
Commit 70207816e35771459d053ab9faf75a50a4cb92fb by spatel
[InstCombine] add ptr difference tests; NFC
The file was modifiedllvm/test/Transforms/InstCombine/sub-gep.ll
Commit 8b300679192b317aa91a28e781fcf60d4416b0d6 by spatel
[InstCombine] improve fold of pointer differences

This was supposed to be an NFC cleanup, but there's
a real logic difference (did not drop 'nsw') visible
in some tests in addition to an efficiency improvement.

This is because in the case where we have 2 GEPs,
the code was *always* swapping the operands and
negating the result. But if we have 2 GEPs, we
should *never* need swapping/negation AFAICT.

This is part of improving flags propagation noticed
with PR47430.
The file was modifiedllvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp
The file was modifiedllvm/test/Transforms/InstCombine/sub.ll
The file was modifiedllvm/test/Transforms/InstCombine/sub-gep.ll
Commit da79b1eecc65171f6ca0cda9b4f1970bd1503c17 by craig.topper
[SelectionDAG][X86][ARM] Teach ExpandIntRes_ABS to use sra+add+xor expansion when ADDCARRY is supported.

Rather than using SELECT instructions, use SRA, UADDO/ADDCARRY and
XORs to expand ABS. This is the multi-part version of the sequence
we use in LegalizeDAG.

It's also the same as the Custom sequence uses for i64 on 32-bit
and i128 on 64-bit. So we can remove the X86 customization.

Reviewed By: RKSimon

Differential Revision:
The file was modifiedllvm/test/CodeGen/X86/abs.ll
The file was modifiedllvm/lib/Target/X86/X86ISelLowering.cpp
The file was modifiedllvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
The file was modifiedllvm/test/CodeGen/Thumb2/mve-abs.ll
The file was modifiedllvm/test/CodeGen/X86/iabs.ll
Commit 9fb46a452d4e5666828c95610ceac8dcd9e4ce16 by nikita.ppv
[SCCP] Compute ranges for supported intrinsics

For intrinsics supported by ConstantRange, compute the result range
based on the argument ranges. We do this independently of whether
some or all of the input ranges are full, as we can often still
constrain the result in some way.

Differential Revision:
The file was modifiedllvm/lib/Transforms/Scalar/SCCP.cpp
The file was modifiedllvm/test/Transforms/SCCP/intrinsics.ll
Commit ddab4cd83ea31141aaada424dccf94278482ee88 by nikita.ppv
[KnownBits] Avoid some copies (NFC)

These lambdas don't need copies, use const reference.
The file was modifiedllvm/lib/Support/KnownBits.cpp
Commit bb7d3af1139c36270bc9948605e06f40e4c51541 by lebedev.ri
Reland [SimplifyCFG][LoopRotate] SimplifyCFG: disable common instruction hoisting by default, enable late in pipeline

This was reverted in 503deec2183d466dad64b763bab4e15fd8804239
because it caused gigantic increase (3x) in branch mispredictions
in certain benchmarks on certain CPU's,

It has since been investigated and here are the results:
> It's an amazingly severe regression, but it's also all due to branch
> mispredicts (about 3x without this). The code layout looks ok so there's
> probably something else to deal with. I'm not sure there's anything we can
> reasonably do so we'll just have to take the hit for now and wait for
> another code reorganization to make the branch predictor a bit more happy :)
> Thanks for giving us some time to investigate and feel free to recommit
> whenever you'd like.
> -eric

So let's just reland this.
Original commit message:

I've been looking at missed vectorizations in one codebase.
One particular thing that stands out is that some of the loops
reach vectorizer in a rather mangled form, with weird PHI's,
and some of the loops aren't even in a rotated form.

After taking a more detailed look, that happened because
the loop's headers were too big by then. It is evident that
SimplifyCFG's common code hoisting transform is at fault there,
because the pattern it handles is precisely the unrotated
loop basic block structure.

Surprizingly, `SimplifyCFGOpt::HoistThenElseCodeToIf()` is enabled
by default, and is always run, unlike it's friend, common code sinking
transform, `SinkCommonCodeFromPredecessors()`, which is not enabled
by default and is only run once very late in the pipeline.

I'm proposing to harmonize this, and disable common code hoisting
until //late// in pipeline. Definition of //late// may vary,
here currently i've picked the same one as for code sinking,
but i suppose we could enable it as soon as right after
loop rotation happens.

Experimentation shows that this does indeed unsurprizingly help,
more loops got rotated, although other issues remain elsewhere.

Now, this undoubtedly seriously shakes phase ordering.
This will undoubtedly be a mixed bag in terms of both compile- and
run- time performance, codesize. Since we no longer aggressively
hoist+deduplicate common code, we don't pay the price of said hoisting
(which wasn't big). That may allow more loops to be rotated,
so we pay that price. That, in turn, that may enable all the transforms
that require canonical (rotated) loop form, including but not limited to
vectorization, so we pay that too. And in general, no deduplication means
more [duplicate] instructions going through the optimizations. But there's still
late hoisting, some of them will be caught late.

As per benchmarks i've run {F12360204}, this is mostly within the noise,
there are some small improvements, some small regressions.
One big regression i saw i fixed in rG8d487668d09fb0e4e54f36207f07c1480ffabbfd, but i'm sure
this will expose many more pre-existing missed optimizations, as usual :S thoughts on this:
* this does regress compile-time by +0.5% geomean (unsurprizingly)
* size impact varies; for ThinLTO it's actually an improvement

The largest fallout appears to be in GVN's load partial redundancy
elimination, it spends *much* more time in
Non-local `MemoryDependenceResults` is widely-known to be, uh, costly.
There does not appear to be a proper solution to this issue,
other than silencing the compile-time performance regression
by tuning cut-off thresholds in `MemoryDependenceResults`,
at the cost of potentially regressing run-time performance.
D84609 attempts to move in that direction, but the path is unclear
and is going to take some time.

If we look at stats before/after diffs, some excerpts:
* RawSpeed (the target) {F12360200}
  * -14 (-73.68%) loops not rotated due to the header size (yay)
  * -272 (-0.67%) `"Number of live out of a loop variables"` - good for vectorizer
  * -3937 (-64.19%) common instructions hoisted
  * +561 (+0.06%) x86 asm instructions
  * -2 basic blocks
  * +2418 (+0.11%) IR instructions
* vanilla test-suite + RawSpeed + darktable  {F12360201}
  * -36396 (-65.29%) common instructions hoisted
  * +1676 (+0.02%) x86 asm instructions
  * +662 (+0.06%) basic blocks
  * +4395 (+0.04%) IR instructions

It is likely to be sub-optimal for when optimizing for code size,
so one might want to change tune pipeline by enabling sinking/hoisting
when optimizing for size.

Reviewed By: mkazantsev

Differential Revision:

This reverts commit 503deec2183d466dad64b763bab4e15fd8804239.
The file was modifiedllvm/lib/Transforms/Scalar/SimplifyCFGPass.cpp
The file was modifiedllvm/test/Transforms/PGOProfile/chr.ll
The file was modifiedllvm/lib/Target/AArch64/AArch64TargetMachine.cpp
The file was modifiedllvm/test/Transforms/SimplifyCFG/common-code-hoisting.ll
The file was modifiedllvm/test/Transforms/PhaseOrdering/loop-rotation-vs-common-code-hoisting.ll
The file was modifiedllvm/include/llvm/Transforms/Utils/SimplifyCFGOptions.h
The file was modifiedllvm/lib/Target/Hexagon/HexagonTargetMachine.cpp
The file was modifiedllvm/lib/Passes/PassBuilder.cpp
The file was modifiedllvm/lib/Target/ARM/ARMTargetMachine.cpp
The file was modifiedllvm/lib/Transforms/IPO/PassManagerBuilder.cpp