FailedChanges

Summary

  1. Implement std::condition_variable via pthread_cond_clockwait() where available
  2. [Clang][Codegen] Disable arm_acle.c test.
  3. [Clang][Codegen] Relax available-externally-suppress.c test
  4. [X86][AVX] matchShuffleWithSHUFPD - add support for zeroable operands
  5. [ARM] A predicate cast of a predicate cast is a predicate cast
  6. [OPENMP]Fix parsing/sema for function templates with declare simd.
  7. [SimplifyCFG] FoldTwoEntryPHINode(): consider *total* speculation cost, not per-BB cost
  8. [InstCombine] remove unneeded one-use checks for icmp fold
Revision 372016 by danalbert:
Implement std::condition_variable via pthread_cond_clockwait() where available

std::condition_variable is currently implemented via
pthread_cond_timedwait() on systems that use pthread. This is
problematic, since that function waits by default on CLOCK_REALTIME
and libc++ does not provide any mechanism to change from this
default.

Because of this, condition_variable::wait_until() waits using
CLOCK_REALTIME regardless of whether it is called with a
chrono::system_clock or a chrono::steady_clock parameter. This does
not conform to the C++ standard, as calling
condition_variable::wait_until() with a chrono::steady_clock parameter
should use CLOCK_MONOTONIC.

This is particularly problematic because CLOCK_REALTIME is subject to
discontinuous time adjustments, which may cause
condition_variable::wait_until() to time out immediately or to wait
indefinitely.

This change fixes the issue by using a new POSIX function,
pthread_cond_clockwait(), proposed at
http://austingroupbugs.net/view.php?id=1216. The new function is
similar to pthread_cond_timedwait() with the addition of a clock
parameter that allows it to wait using either CLOCK_REALTIME or
CLOCK_MONOTONIC, thus allowing condition_variable::wait_until() to
wait using CLOCK_REALTIME for chrono::system_clock and CLOCK_MONOTONIC
for chrono::steady_clock.

pthread_cond_clockwait() is implemented in glibc (2.30 and later) and
Android's bionic (Android API version 30 and later).

This change additionally makes wait_for() and wait_until() with clocks
other than chrono::system_clock use CLOCK_MONOTONIC.
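
For illustration, a minimal sketch of the kind of call this change affects: a wait with a chrono::steady_clock deadline. The helper name wait_ready_until is hypothetical and not part of libc++; which underlying clock the wait actually uses depends on the standard library and libc in use.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical helper (not from libc++): blocks until `ready` becomes true
// or the steady_clock deadline passes. Returns the final value of `ready`.
bool wait_ready_until(std::mutex& m, std::condition_variable& cv,
                      const bool& ready,
                      std::chrono::steady_clock::time_point deadline) {
    assert(deadline.time_since_epoch().count() >= 0);
    std::unique_lock<std::mutex> lock(m);
    // A steady_clock deadline should not be affected by wall-clock jumps;
    // with this change libc++ can honour that via CLOCK_MONOTONIC where
    // pthread_cond_clockwait() is available.
    return cv.wait_until(lock, deadline, [&] { return ready; });
}
```

A caller would typically pass std::chrono::steady_clock::now() plus a duration as the deadline, so that a system time adjustment during the wait neither shortens nor lengthens it.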
| Change Type | Path in Repository | Path in Workspace |
| The file was modified | /libcxx/trunk/include/__config | libcxx.src/include/__config |
| The file was modified | /libcxx/trunk/include/__mutex_base | libcxx.src/include/__mutex_base |
| The file was modified | /libcxx/trunk/test/std/thread/thread.condition/thread.condition.condvar/wait_until.pass.cpp | libcxx.src/test/std/thread/thread.condition/thread.condition.condvar/wait_until.pass.cpp |
Revision 372015 by lebedevri:
[Clang][Codegen] Disable arm_acle.c test.

This test is broken by design. Clang codegen tests should not depend
on LLVM middle-end behaviour; they should *only* test Clang codegen.
Yet this test runs the whole optimization pipeline.
I've really tried to fix it, but it isn't just a few things
that depend on passes; everything there does.
| Change Type | Path in Repository | Path in Workspace |
| The file was modified | /cfe/trunk/test/CodeGen/arm_acle.c | clang.src/test/CodeGen/arm_acle.c |
Revision 372014 by lebedevri:
[Clang][Codegen] Relax available-externally-suppress.c test

That test is broken by design.
It depends on llvm middle-end behavior.
No clang codegen test should be doing that.
This one is salvageable by relaxing check lines.
| Change Type | Path in Repository | Path in Workspace |
| The file was modified | /cfe/trunk/test/CodeGen/available-externally-suppress.c | clang.src/test/CodeGen/available-externally-suppress.c |
Revision 372013 by rksimon:
[X86][AVX] matchShuffleWithSHUFPD - add support for zeroable operands

Determine if all of the uses of LHS/RHS operands can be replaced with a zero vector.
| Change Type | Path in Repository | Path in Workspace |
| The file was modified | /llvm/trunk/lib/Target/X86/X86ISelLowering.cpp | llvm.src/lib/Target/X86/X86ISelLowering.cpp |
| The file was modified | /llvm/trunk/test/CodeGen/X86/vector-shuffle-256-v4.ll | llvm.src/test/CodeGen/X86/vector-shuffle-256-v4.ll |
| The file was modified | /llvm/trunk/test/CodeGen/X86/vector-shuffle-512-v8.ll | llvm.src/test/CodeGen/X86/vector-shuffle-512-v8.ll |
Revision 372012 by dmgreen:
[ARM] A predicate cast of a predicate cast is a predicate cast

This adds some very basic folding of PREDICATE_CASTs, removing cases where they
are chained together. These would already be removed eventually, as they are
lowered to copies. This just allows it to happen earlier, which can help other
simplifications.
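
As a rough illustration of the fold described above, here is a toy model on a made-up expression tree. The Node type, the type strings and foldPredicateCast are all hypothetical; this is not the SelectionDAG code in ARMISelLowering.cpp, only a sketch of the idea that a cast of a cast can be rewritten as a single cast of the original operand.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Toy expression node (hypothetical, not an LLVM data structure).
struct Node {
    enum class Kind { Value, PredicateCast };
    Kind kind;
    std::string type;              // result type, e.g. "v16i1" or "i32"
    std::shared_ptr<Node> operand; // null for Value nodes
};

std::shared_ptr<Node> makeValue(const std::string& type) {
    return std::make_shared<Node>(Node{Node::Kind::Value, type, nullptr});
}

std::shared_ptr<Node> makeCast(const std::string& type, std::shared_ptr<Node> op) {
    assert(op && "a cast needs an operand");
    return std::make_shared<Node>(Node{Node::Kind::PredicateCast, type, std::move(op)});
}

// Rewrites cast(cast(...cast(x))) as a single cast(x) that keeps the
// outermost result type. Nodes that are not chained casts are returned as-is.
std::shared_ptr<Node> foldPredicateCast(const std::shared_ptr<Node>& n) {
    if (n->kind != Node::Kind::PredicateCast)
        return n;
    std::shared_ptr<Node> inner = n->operand;
    while (inner->kind == Node::Kind::PredicateCast)
        inner = inner->operand;
    if (inner == n->operand)
        return n; // operand was not a cast, nothing to fold
    return makeCast(n->type, inner);
}
```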

Differential Revision: https://reviews.llvm.org/D67591
| Change Type | Path in Repository | Path in Workspace |
| The file was modified | /llvm/trunk/lib/Target/ARM/ARMISelLowering.cpp | llvm.src/lib/Target/ARM/ARMISelLowering.cpp |
| The file was modified | /llvm/trunk/test/CodeGen/Thumb2/mve-masked-ldst.ll | llvm.src/test/CodeGen/Thumb2/mve-masked-ldst.ll |
| The file was modified | /llvm/trunk/test/CodeGen/Thumb2/mve-pred-bitcast.ll | llvm.src/test/CodeGen/Thumb2/mve-pred-bitcast.ll |
| The file was modified | /llvm/trunk/test/CodeGen/Thumb2/mve-pred-loadstore.ll | llvm.src/test/CodeGen/Thumb2/mve-pred-loadstore.ll |
Revision 372011 by abataev:
[OPENMP]Fix parsing/sema for function templates with declare simd.

Need to return the original declaration group with the FunctionTemplateDecl,
not the inner FunctionDecl, to correctly handle parsing of directives that
use the template's parameters.
| Change Type | Path in Repository | Path in Workspace |
| The file was modified | /cfe/trunk/lib/Sema/SemaOpenMP.cpp | clang.src/lib/Sema/SemaOpenMP.cpp |
| The file was modified | /cfe/trunk/test/OpenMP/declare_simd_ast_print.cpp | clang.src/test/OpenMP/declare_simd_ast_print.cpp |
Revision 372009 by lebedevri:
[SimplifyCFG] FoldTwoEntryPHINode(): consider *total* speculation cost, not per-BB cost

Summary:
Previously, if the threshold was 2, we were willing to speculatively
execute 2 cheap instructions in both basic blocks (thus we were willing
to speculatively execute cost = 4), but weren't willing to speculate
when one BB had 3 instructions and the other one had no instructions,
even though that would have a total cost of 3.

This looks inconsistent to me.
I don't think `cmov`-like instructions will start executing
until both of their inputs are available: https://godbolt.org/z/zgHePf
So I don't see why the existing behavior is the correct one.

Also, let's add a separate `cl::opt` for this threshold,
with default=4, so it is not stricter than the previous threshold:
it will allow folding when there are 2 BBs each with cost=2.
And since the logic has changed, it will also allow folding when
one BB has cost=3 and the other cost=1, or there is only one BB with cost=4.
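
To make the difference concrete, here is a toy model of the two rules as described above. The function names are hypothetical and this is not the code in SimplifyCFG.cpp; it only encodes "each block within a per-block budget" versus "both blocks together within a total budget".

```cpp
#include <cassert>

// Old rule as described: each basic block is checked against the
// per-block threshold on its own.
bool oldWouldFold(unsigned costA, unsigned costB, unsigned perBlockThreshold = 2) {
    return costA <= perBlockThreshold && costB <= perBlockThreshold;
}

// New rule as described: the summed cost of both blocks is checked
// against a single total threshold.
bool newWouldFold(unsigned costA, unsigned costB, unsigned totalThreshold = 4) {
    assert(costA <= totalThreshold + costA); // guards against unsigned wrap in the sum
    return costA + costB <= totalThreshold;
}
```

In this model the 2+2 case folds under both rules, while 3+0, 3+1 and 4+0 fold only under the total-cost rule, and 3+2 folds under neither.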

This is an alternative solution to D65148:
This fix is mainly motivated by `signbit-like-value-extension.ll` test.
That pattern comes up in JPEG decoding, see e.g.
`Figure F.12 – Extending the sign bit of a decoded value in V`
of `ITU T.81` (JPEG specification).
That branch is not predictable, and it is within the innermost loop,
so the fact that that pattern ends up being stuck with a branch
instead of `select` (i.e. `CMOV` for x86) is unlikely to be beneficial.
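
For readers unfamiliar with that pattern, here is a sketch of the sign-extension procedure as I understand it from the JPEG specification, written once with a branch and once in a form that maps naturally to a `select`. The function names are hypothetical, the code is not taken from the test file, and it assumes 1 <= t < 31 so the shifts are well defined.

```cpp
#include <cassert>

// Branchy form: a decoded t-bit value v below 2^(t-1) represents a
// negative number and is shifted down by 2^t - 1.
int extendBranchy(int v, int t) {
    assert(t >= 1 && t < 31);
    int vt = 1 << (t - 1);
    if (v < vt)
        v += 1 - (1 << t); // same value as (-1 << t) + 1, without shifting a negative
    return v;
}

// Select form: both candidate results are computed unconditionally and
// one is chosen, which a backend can lower to a conditional move.
int extendSelect(int v, int t) {
    assert(t >= 1 && t < 31);
    int vt = 1 << (t - 1);
    int adjusted = v + (1 - (1 << t));
    return v < vt ? adjusted : v;
}
```

Both forms compute the same result; the second only costs one extra add on the path where no adjustment is needed, which is the kind of cheap speculation the total-cost threshold is meant to allow.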

This has great results on the final assembly (vanilla test-suite + RawSpeed): (metric pass - D67240)
| metric                                 |     old |     new | delta |      % |
| x86-mi-counting.NumMachineFunctions    |   37720 |   37721 |     1 |  0.00% |
| x86-mi-counting.NumMachineBasicBlocks  |  773545 |  771181 | -2364 | -0.31% |
| x86-mi-counting.NumMachineInstructions | 7488843 | 7486442 | -2401 | -0.03% |
| x86-mi-counting.NumUncondBR            |  135770 |  135543 |  -227 | -0.17% |
| x86-mi-counting.NumCondBR              |  423753 |  422187 | -1566 | -0.37% |
| x86-mi-counting.NumCMOV                |   24815 |   25731 |   916 |  3.69% |
| x86-mi-counting.NumVecBlend            |      17 |      17 |     0 |  0.00% |

We significantly decrease basic block count, notably decrease instruction count,
significantly decrease branch count and very significantly increase `cmov` count.

Performance-wise, unsurprisingly, this has a great effect on
the target RawSpeed benchmark. I'm seeing 5 **major** improvements:
```
Benchmark                                                                                             Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_pvalue                                 0.0000          0.0000      U Test, Repetitions: 49 vs 49
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_mean                                  -0.3064         -0.3064      226.9913      157.4452      226.9800      157.4384
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_median                                -0.3057         -0.3057      226.8407      157.4926      226.8282      157.4828
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_stddev                                -0.4985         -0.4954        0.3051        0.1530        0.3040        0.1534
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_mean                                   -0.1747         -0.1747       80.4787       66.4227       80.4771       66.4146
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_median                                 -0.1742         -0.1743       80.4686       66.4542       80.4690       66.4436
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_stddev                                 +0.6089         +0.5797        0.0670        0.1078        0.0673        0.1062
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_pvalue                                 0.0000          0.0000      U Test, Repetitions: 49 vs 49
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_mean                                  -0.1598         -0.1598      171.6996      144.2575      171.6915      144.2538
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_median                                -0.1598         -0.1597      171.7109      144.2755      171.7018      144.2766
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_stddev                                +0.4024         +0.3850        0.0847        0.1187        0.0848        0.1175
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_mean                                   -0.0550         -0.0551      280.3046      264.8800      280.3017      264.8559
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_median                                 -0.0554         -0.0554      280.2628      264.7360      280.2574      264.7297
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_stddev                                 +0.7005         +0.7041        0.2779        0.4725        0.2775        0.4729
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_mean                                   -0.0354         -0.0355      316.7396      305.5208      316.7342      305.4890
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_median                                 -0.0354         -0.0356      316.6969      305.4798      316.6917      305.4324
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_stddev                                 +0.0493         +0.0330        0.3562        0.3737        0.3563        0.3681
```

That being said, it's always best-effort, so there will likely
be cases where this worsens things.

Reviewers: efriedma, craig.topper, dmgreen, jmolloy, fhahn, Carrot, hfinkel, chandlerc

Reviewed By: jmolloy

Subscribers: xbolva00, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D67318
| Change Type | Path in Repository | Path in Workspace |
| The file was modified | /llvm/trunk/lib/Transforms/Utils/SimplifyCFG.cpp | llvm.src/lib/Transforms/Utils/SimplifyCFG.cpp |
| The file was modified | /llvm/trunk/test/Transforms/IndVarSimplify/loop_evaluate_1.ll | llvm.src/test/Transforms/IndVarSimplify/loop_evaluate_1.ll |
| The file was modified | /llvm/trunk/test/Transforms/PGOProfile/chr.ll | llvm.src/test/Transforms/PGOProfile/chr.ll |
| The file was modified | /llvm/trunk/test/Transforms/SimplifyCFG/PhiEliminate3.ll | llvm.src/test/Transforms/SimplifyCFG/PhiEliminate3.ll |
| The file was modified | /llvm/trunk/test/Transforms/SimplifyCFG/SpeculativeExec.ll | llvm.src/test/Transforms/SimplifyCFG/SpeculativeExec.ll |
| The file was modified | /llvm/trunk/test/Transforms/SimplifyCFG/X86/speculate-cttz-ctlz.ll | llvm.src/test/Transforms/SimplifyCFG/X86/speculate-cttz-ctlz.ll |
| The file was modified | /llvm/trunk/test/Transforms/SimplifyCFG/X86/switch_to_lookup_table.ll | llvm.src/test/Transforms/SimplifyCFG/X86/switch_to_lookup_table.ll |
| The file was modified | /llvm/trunk/test/Transforms/SimplifyCFG/safe-abs.ll | llvm.src/test/Transforms/SimplifyCFG/safe-abs.ll |
| The file was modified | /llvm/trunk/test/Transforms/SimplifyCFG/safe-low-bit-extract.ll | llvm.src/test/Transforms/SimplifyCFG/safe-low-bit-extract.ll |
| The file was modified | /llvm/trunk/test/Transforms/SimplifyCFG/signbit-like-value-extension.ll | llvm.src/test/Transforms/SimplifyCFG/signbit-like-value-extension.ll |
| The file was modified | /llvm/trunk/test/Transforms/SimplifyCFG/speculate-math.ll | llvm.src/test/Transforms/SimplifyCFG/speculate-math.ll |
Revision 372007 by spatel:
[InstCombine] remove unneeded one-use checks for icmp fold

Related folds were added in:
rL125734
...the code comment about register pressure is discussed in
more detail in:
https://bugs.llvm.org/show_bug.cgi?id=2698

But 10 years later, perf testing bzip2 with this change now
shows a slight (0.2% average) improvement on Haswell, although
that's probably within test noise.

Given that this is IR canonicalization, we shouldn't be worried
about register pressure though; the backend should be able to
adjust for that as needed.

This is part of solving PR43310 the theoretically right way:
https://bugs.llvm.org/show_bug.cgi?id=43310
...i.e., if we don't cripple basic transforms, then we won't
need to add special-case code to detect larger patterns.

rL371940 and rL371981 are related patches in this series.
| Change Type | Path in Repository | Path in Workspace |
| The file was modified | /llvm/trunk/lib/Transforms/InstCombine/InstCombineCompares.cpp | llvm.src/lib/Transforms/InstCombine/InstCombineCompares.cpp |
| The file was modified | /llvm/trunk/test/Transforms/InstCombine/icmp-add.ll | llvm.src/test/Transforms/InstCombine/icmp-add.ll |