FailedChanges

Summary

  1. [NFC][LoopIdiom] Move one bcmp test into the proper place (details)
  2. [NFC][LoopIdiom] Add bcmp loop idiom miscompile test from PR43206. (details)
  3. [LoopIdiomRecognize] Recommit: BCmp loop idiom recognition (details)
  4. [clang-format] Proposal for clang-format to give compiler style warnings (details)
Commit c41e9f6bbfd0d58655464edd8b691a6e24ec781d by lebedev.ri
[NFC][LoopIdiom] Move one bcmp test into the proper place
llvm-svn: 374660
The file was modifiedllvm/test/Transforms/LoopIdiom/bcmp-basic.ll
The file was modifiedllvm/test/Transforms/LoopIdiom/bcmp-negative-tests.ll
Commit 45539737ddb7bf0e6ade616cccc6502a830e1d50 by lebedev.ri
[NFC][LoopIdiom] Add bcmp loop idiom miscompile test from PR43206.
The transform forgot to check SCEV loop scopes.
https://bugs.llvm.org/show_bug.cgi?id=43206
llvm-svn: 374661
The file was modifiedllvm/test/Transforms/LoopIdiom/bcmp-negative-tests.ll
Commit 76cdcf25b883751d83402baea6316772aa73865c by lebedev.ri
[LoopIdiomRecognize] Recommit: BCmp loop idiom recognition
Summary: This is a recommit, this originally landed in rL370454 but was
subsequently reverted in  rL370788 due to
https://bugs.llvm.org/show_bug.cgi?id=43206 The reduced testcase was
added to bcmp-negative-tests.ll as @pr43206_different_loops - we must
ensure that the SCEV's we got are both for the same loop we are
currently investigating.
Original commit message:
@mclow.lists brought up this issue up in IRC. It is a reasonably common
problem to compare some two values for equality. Those may be just some
integers, strings or arrays of integers.
In C, there is `memcmp()`, `bcmp()` functions. In C++, there exists
`std::equal()` algorithm. One can also write that function manually.
libstdc++'s `std::equal()` is specialized to directly call `memcmp()`
for various types, but not `std::byte` from C++2a.
https://godbolt.org/z/mx2ejJ
libc++ does not do anything like that, it simply relies on simple C++'s
`operator==()`. https://godbolt.org/z/er0Zwf (GOOD!)
So likely, there exists a certain performance opportunities. Let's
compare performance of naive `std::equal()` (no `memcmp()`) with one
that is using `memcmp()` (in this case, compiled with modified
compiler). {F8768213}
```
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iterator>
#include <limits>
#include <random>
#include <type_traits>
#include <utility>
#include <vector>
#include "benchmark/benchmark.h"
template <class T> bool equal(T* a, T* a_end, T* b) noexcept {
for (; a != a_end; ++a, ++b) {
   if (*a != *b) return false;
}
return true;
}
template <typename T> std::vector<T> getVectorOfRandomNumbers(size_t
count) {
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<T> dis(std::numeric_limits<T>::min(),
                                      std::numeric_limits<T>::max());
std::vector<T> v;
v.reserve(count);
std::generate_n(std::back_inserter(v), count,
                 [&dis, &gen]() { return dis(gen); });
assert(v.size() == count);
return v;
}
struct Identical {
template <typename T>
static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
   auto Tmp = getVectorOfRandomNumbers<T>(count);
   return std::make_pair(Tmp, std::move(Tmp));
}
};
struct InequalHalfway {
template <typename T>
static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
   auto V0 = getVectorOfRandomNumbers<T>(count);
   auto V1 = V0;
   V1[V1.size() / size_t(2)]++;  // just change the value.
   return std::make_pair(std::move(V0), std::move(V1));
}
};
template <class T, class Gen> void BM_bcmp(benchmark::State& state) {
const size_t Length = state.range(0);
  const std::pair<std::vector<T>, std::vector<T>> Data =
     Gen::template Gen<T>(Length);
const std::vector<T>& a = Data.first;
const std::vector<T>& b = Data.second;
assert(a.size() == Length && b.size() == a.size());
  benchmark::ClobberMemory();
benchmark::DoNotOptimize(a);
benchmark::DoNotOptimize(a.data());
benchmark::DoNotOptimize(b);
benchmark::DoNotOptimize(b.data());
  for (auto _ : state) {
   const bool is_equal = equal(a.data(), a.data() + a.size(), b.data());
   benchmark::DoNotOptimize(is_equal);
}
state.SetComplexityN(Length);
state.counters["eltcnt"] =
     benchmark::Counter(Length,
benchmark::Counter::kIsIterationInvariant);
state.counters["eltcnt/sec"] =
     benchmark::Counter(Length,
benchmark::Counter::kIsIterationInvariantRate);
const size_t BytesRead = 2 * sizeof(T) * Length;
state.counters["bytes_read/iteration"] =
     benchmark::Counter(BytesRead, benchmark::Counter::kDefaults,
                        benchmark::Counter::OneK::kIs1024);
state.counters["bytes_read/sec"] = benchmark::Counter(
     BytesRead, benchmark::Counter::kIsIterationInvariantRate,
     benchmark::Counter::OneK::kIs1024);
}
template <typename T> static void
CustomArguments(benchmark::internal::Benchmark* b) {
const size_t L2SizeBytes = []() {
   for (const benchmark::CPUInfo::CacheInfo& I :
        benchmark::CPUInfo::Get().caches) {
     if (I.level == 2) return I.size;
   }
   return 0;
}();
// What is the largest range we can check to always fit within given L2
cache?
const size_t MaxLen = L2SizeBytes / /*total bufs*/ 2 /
                       /*maximal elt size*/ sizeof(T) / /*safety
margin*/ 2;
b->RangeMultiplier(2)->Range(1, MaxLen)->Complexity(benchmark::oN);
}
BENCHMARK_TEMPLATE(BM_bcmp, uint8_t, Identical)
   ->Apply(CustomArguments<uint8_t>); BENCHMARK_TEMPLATE(BM_bcmp,
uint16_t, Identical)
   ->Apply(CustomArguments<uint16_t>); BENCHMARK_TEMPLATE(BM_bcmp,
uint32_t, Identical)
   ->Apply(CustomArguments<uint32_t>); BENCHMARK_TEMPLATE(BM_bcmp,
uint64_t, Identical)
   ->Apply(CustomArguments<uint64_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint8_t, InequalHalfway)
   ->Apply(CustomArguments<uint8_t>); BENCHMARK_TEMPLATE(BM_bcmp,
uint16_t, InequalHalfway)
   ->Apply(CustomArguments<uint16_t>); BENCHMARK_TEMPLATE(BM_bcmp,
uint32_t, InequalHalfway)
   ->Apply(CustomArguments<uint32_t>); BENCHMARK_TEMPLATE(BM_bcmp,
uint64_t, InequalHalfway)
   ->Apply(CustomArguments<uint64_t>);
```
{F8768210}
```
$ ~/src/googlebenchmark/tools/compare.py --no-utest benchmarks
build-{old,new}/test/llvm-bcmp-bench RUNNING:
build-old/test/llvm-bcmp-bench --benchmark_out=/tmp/tmpb6PEUx 2019-04-25
21:17:11 Running build-old/test/llvm-bcmp-bench Run on (8 X 4000 MHz CPU
s) CPU Caches:
L1 Data 16K (x8)
L1 Instruction 64K (x4)
L2 Unified 2048K (x4)
L3 Unified 8192K (x1) Load Average: 0.65, 3.90, 4.14
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU 
Iterations UserCounters...
---------------------------------------------------------------------------------------------------
<...> BM_bcmp<uint8_t, Identical>/512000           432131 ns     
432101 ns         1613 bytes_read/iteration=1000k
bytes_read/sec=2.20706G/s eltcnt=825.856M eltcnt/sec=1.18491G/s
BM_bcmp<uint8_t, Identical>_BigO               0.86 N          0.86 N
BM_bcmp<uint8_t, Identical>_RMS                   8 %             8 %
<...> BM_bcmp<uint16_t, Identical>/256000          161408 ns     
161409 ns         4027 bytes_read/iteration=1000k
bytes_read/sec=5.90843G/s eltcnt=1030.91M eltcnt/sec=1.58603G/s
BM_bcmp<uint16_t, Identical>_BigO              0.67 N          0.67 N
BM_bcmp<uint16_t, Identical>_RMS                 25 %            25 %
<...> BM_bcmp<uint32_t, Identical>/128000           81497 ns      
81488 ns         8415 bytes_read/iteration=1000k
bytes_read/sec=11.7032G/s eltcnt=1077.12M eltcnt/sec=1.57078G/s
BM_bcmp<uint32_t, Identical>_BigO              0.71 N          0.71 N
BM_bcmp<uint32_t, Identical>_RMS                 42 %            42 %
<...> BM_bcmp<uint64_t, Identical>/64000            50138 ns      
50138 ns        10909 bytes_read/iteration=1000k
bytes_read/sec=19.0209G/s eltcnt=698.176M eltcnt/sec=1.27647G/s
BM_bcmp<uint64_t, Identical>_BigO              0.84 N          0.84 N
BM_bcmp<uint64_t, Identical>_RMS                 27 %            27 %
<...> BM_bcmp<uint8_t, InequalHalfway>/512000      192405 ns     
192392 ns         3638 bytes_read/iteration=1000k
bytes_read/sec=4.95694G/s eltcnt=1.86266G eltcnt/sec=2.66124G/s
BM_bcmp<uint8_t, InequalHalfway>_BigO          0.38 N          0.38 N
BM_bcmp<uint8_t, InequalHalfway>_RMS              3 %             3 %
<...> BM_bcmp<uint16_t, InequalHalfway>/256000     127858 ns     
127860 ns         5477 bytes_read/iteration=1000k
bytes_read/sec=7.45873G/s eltcnt=1.40211G eltcnt/sec=2.00219G/s
BM_bcmp<uint16_t, InequalHalfway>_BigO         0.50 N          0.50 N
BM_bcmp<uint16_t, InequalHalfway>_RMS             0 %             0 %
<...> BM_bcmp<uint32_t, InequalHalfway>/128000      49140 ns      
49140 ns        14281 bytes_read/iteration=1000k
bytes_read/sec=19.4072G/s eltcnt=1.82797G eltcnt/sec=2.60478G/s
BM_bcmp<uint32_t, InequalHalfway>_BigO         0.40 N          0.40 N
BM_bcmp<uint32_t, InequalHalfway>_RMS            18 %            18 %
<...> BM_bcmp<uint64_t, InequalHalfway>/64000       32101 ns      
32099 ns        21786 bytes_read/iteration=1000k
bytes_read/sec=29.7101G/s eltcnt=1.3943G eltcnt/sec=1.99381G/s
BM_bcmp<uint64_t, InequalHalfway>_BigO         0.50 N          0.50 N
BM_bcmp<uint64_t, InequalHalfway>_RMS             1 %             1 %
RUNNING: build-new/test/llvm-bcmp-bench --benchmark_out=/tmp/tmpQ46PP0
2019-04-25 21:19:29 Running build-new/test/llvm-bcmp-bench Run on (8 X
4000 MHz CPU s) CPU Caches:
L1 Data 16K (x8)
L1 Instruction 64K (x4)
L2 Unified 2048K (x4)
L3 Unified 8192K (x1) Load Average: 1.01, 2.85, 3.71
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU 
Iterations UserCounters...
---------------------------------------------------------------------------------------------------
<...> BM_bcmp<uint8_t, Identical>/512000            18593 ns      
18590 ns        37565 bytes_read/iteration=1000k
bytes_read/sec=51.2991G/s eltcnt=19.2333G eltcnt/sec=27.541G/s
BM_bcmp<uint8_t, Identical>_BigO               0.04 N          0.04 N
BM_bcmp<uint8_t, Identical>_RMS                  37 %            37 %
<...> BM_bcmp<uint16_t, Identical>/256000           18950 ns      
18948 ns        37223 bytes_read/iteration=1000k
bytes_read/sec=50.3324G/s eltcnt=9.52909G eltcnt/sec=13.511G/s
BM_bcmp<uint16_t, Identical>_BigO              0.08 N          0.08 N
BM_bcmp<uint16_t, Identical>_RMS                 34 %            34 %
<...> BM_bcmp<uint32_t, Identical>/128000           18627 ns      
18627 ns        37895 bytes_read/iteration=1000k
bytes_read/sec=51.198G/s eltcnt=4.85056G eltcnt/sec=6.87168G/s
BM_bcmp<uint32_t, Identical>_BigO              0.16 N          0.16 N
BM_bcmp<uint32_t, Identical>_RMS                 35 %            35 %
<...> BM_bcmp<uint64_t, Identical>/64000            18855 ns      
18855 ns        37458 bytes_read/iteration=1000k
bytes_read/sec=50.5791G/s eltcnt=2.39731G eltcnt/sec=3.3943G/s
BM_bcmp<uint64_t, Identical>_BigO              0.32 N          0.32 N
BM_bcmp<uint64_t, Identical>_RMS                 33 %            33 %
<...> BM_bcmp<uint8_t, InequalHalfway>/512000        9570 ns       
9569 ns        73500 bytes_read/iteration=1000k
bytes_read/sec=99.6601G/s eltcnt=37.632G eltcnt/sec=53.5046G/s
BM_bcmp<uint8_t, InequalHalfway>_BigO          0.02 N          0.02 N
BM_bcmp<uint8_t, InequalHalfway>_RMS             29 %            29 %
<...> BM_bcmp<uint16_t, InequalHalfway>/256000       9547 ns       
9547 ns        74343 bytes_read/iteration=1000k
bytes_read/sec=99.8971G/s eltcnt=19.0318G eltcnt/sec=26.8159G/s
BM_bcmp<uint16_t, InequalHalfway>_BigO         0.04 N          0.04 N
BM_bcmp<uint16_t, InequalHalfway>_RMS            29 %            29 %
<...> BM_bcmp<uint32_t, InequalHalfway>/128000       9396 ns       
9394 ns        73521 bytes_read/iteration=1000k
bytes_read/sec=101.518G/s eltcnt=9.41069G eltcnt/sec=13.6255G/s
BM_bcmp<uint32_t, InequalHalfway>_BigO         0.08 N          0.08 N
BM_bcmp<uint32_t, InequalHalfway>_RMS            30 %            30 %
<...> BM_bcmp<uint64_t, InequalHalfway>/64000        9499 ns       
9498 ns        73802 bytes_read/iteration=1000k
bytes_read/sec=100.405G/s eltcnt=4.72333G eltcnt/sec=6.73808G/s
BM_bcmp<uint64_t, InequalHalfway>_BigO         0.16 N          0.16 N
BM_bcmp<uint64_t, InequalHalfway>_RMS            28 %            28 %
Comparing build-old/test/llvm-bcmp-bench to
build-new/test/llvm-bcmp-bench Benchmark                               
                 Time             CPU      Time Old      Time New     
CPU Old       CPU New
---------------------------------------------------------------------------------------------------------------------------------------
<...> BM_bcmp<uint8_t, Identical>/512000                      -0.9570  
     -0.9570        432131         18593        432101         18590
<...> BM_bcmp<uint16_t, Identical>/256000                     -0.8826  
     -0.8826        161408         18950        161409         18948
<...> BM_bcmp<uint32_t, Identical>/128000                     -0.7714  
     -0.7714         81497         18627         81488         18627
<...> BM_bcmp<uint64_t, Identical>/64000                      -0.6239  
     -0.6239         50138         18855         50138         18855
<...> BM_bcmp<uint8_t, InequalHalfway>/512000                 -0.9503  
     -0.9503        192405          9570        192392          9569
<...> BM_bcmp<uint16_t, InequalHalfway>/256000                -0.9253  
     -0.9253        127858          9547        127860          9547
<...> BM_bcmp<uint32_t, InequalHalfway>/128000                -0.8088  
     -0.8088         49140          9396         49140          9394
<...> BM_bcmp<uint64_t, InequalHalfway>/64000                 -0.7041  
     -0.7041         32101          9499         32099          9498
```
What can we tell from the benchmark?
* Performance of naive equality check somewhat improves with element
size,
maxing out at eltcnt/sec=1.58603G/s for uint16_t, or
bytes_read/sec=19.0209G/s
for uint64_t. I think, that instability implies performance problems.
* Performance of `memcmp()`-aware benchmark always maxes out at around
bytes_read/sec=51.2991G/s for every type. That is 2.6x the throughput
of the
naive variant!
* eltcnt/sec metric for the `memcmp()`-aware benchmark maxes out at
eltcnt/sec=27.541G/s for uint8_t (was: eltcnt/sec=1.18491G/s, so 24x)
and
linearly decreases with element size.
For uint64_t, it's ~4x+ the elements/second.
* The call obvious is more pricey than the loop, with small element
count.
As it can be seen from the full output {F8768210}, the `memcmp()` is
almost
universally worse, independent of the element size (and thus buffer
size) when
element count is less than 8.
So all in all, bcmp idiom does indeed pose untapped performance
headroom. This diff does implement said idiom recognition. I think a
reasonable test coverage is present, but do tell if there is anything
obvious missing.
Now, quality. This does succeed to build and pass the test-suite, at
least without any non-bundled elements. {F8768216} {F8768217} This
transform fires 91 times:
```
$ /build/test-suite/utils/compare.py -m loop-idiom.NumBCmp
result-new.json Tests: 1149 Metric: loop-idiom.NumBCmp
Program                                         result-new
MultiSourc...Benchmarks/7zip/7zip-benchmark    79.00
MultiSource/Applications/d/make_dparser         3.00
SingleSource/UnitTests/vla                      2.00
MultiSource/Applications/Burg/burg              1.00
MultiSourc.../Applications/JM/lencod/lencod     1.00
MultiSource/Applications/lemon/lemon            1.00
MultiSource/Benchmarks/Bullet/bullet            1.00
MultiSourc...e/Benchmarks/MallocBench/gs/gs     1.00
MultiSourc...gs-C/TimberWolfMC/timberwolfmc     1.00
MultiSourc...Prolangs-C/simulator/simulator     1.00
``` The size changes are: I'm not sure what's going on with
SingleSource/UnitTests/vla.test yet, did not look.
```
$ /build/test-suite/utils/compare.py -m size..text result-{old,new}.json
--filter-hash Tests: 1149 Same hash: 907 (filtered out) Remaining: 242
Metric: size..text
Program                                        result-old result-new
diff test-suite...ingleSource/UnitTests/vla.test   753.00     833.00   
10.6% test-suite...marks/7zip/7zip-benchmark.test   1001697.00 966657.00
-3.5% test-suite...ngs-C/simulator/simulator.test   32369.00   32321.00
  -0.1% test-suite...plications/d/make_dparser.test   89585.00 
89505.00   -0.1% test-suite...ce/Applications/Burg/burg.test   40817.00
40785.00   -0.1% test-suite.../Applications/lemon/lemon.test   47281.00
  47249.00   -0.1% test-suite...TimberWolfMC/timberwolfmc.test 
250065.00  250113.00   0.0% test-suite...chmarks/MallocBench/gs/gs.test
149889.00  149873.00  -0.0% test-suite...ications/JM/lencod/lencod.test
  769585.00  769569.00  -0.0%
test-suite.../Benchmarks/Bullet/bullet.test   770049.00  770049.00 
0.0% test-suite...HMARK_ANISTROPIC_DIFFUSION/128    NaN        NaN     
nan% test-suite...HMARK_ANISTROPIC_DIFFUSION/256    NaN        NaN    
  nan% test-suite...CHMARK_ANISTROPIC_DIFFUSION/64    NaN        NaN   
   nan% test-suite...CHMARK_ANISTROPIC_DIFFUSION/32    NaN        NaN  
    nan% test-suite...ENCHMARK_BILATERAL_FILTER/64/4    NaN        NaN 
     nan% Geomean difference                                           
      nan%
        result-old    result-new       diff count  1.000000e+01
10.00000      10.000000 mean   3.152090e+05  311695.40000  0.006749 std
  3.790398e+05  372091.42232  0.036605 min    7.530000e+02  833.00000  
-0.034981 25%    4.243300e+04  42401.00000  -0.000866 50%  
1.197370e+05  119689.00000 -0.000392 75%    6.397050e+05  639705.00000
-0.000005 max    1.001697e+06  966657.00000  0.106242
```
I don't have timings though.
And now to the code. The basic idea is to completely replace the whole
loop. If we can't fully kill it, don't transform. I have left one or two
comments in the code, so hopefully it can be understood.
Also, there is a few TODO's that i have left for follow-ups:
* widening of `memcmp()`/`bcmp()`
* step smaller than the comparison size
* Metadata propagation
* more than two blocks as long as there is still a single backedge?
* ???
Reviewers: reames, fhahn, mkazantsev, chandlerc, craig.topper, courbet
Reviewed By: courbet
Subscribers: miyuki, hiraditya, xbolva00, nikic, jfb, gchatelet,
courbet, llvm-commits, mclow.lists
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D61144
llvm-svn: 374662
The file was modifiedllvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
The file was modifiedllvm/test/Transforms/LoopIdiom/bcmp-widening.ll
The file was modifiedllvm/test/Transforms/LoopIdiom/bcmp-debugify-remarks.ll
The file was modifiedllvm/test/Transforms/LoopIdiom/bcmp-basic.ll
The file was modifiedllvm/docs/ReleaseNotes.rst
Commit 1f20bc17d005fb37816fac95955ec4a36f9de26f by mydeveloperday
[clang-format] Proposal for clang-format to give compiler style warnings
Summary: Related somewhat to {D29039}
On seeing a quote on twitter by @invalidop
> If it's not formatted with clang-format it's a build error.
This made me want to change the way I use clang-format into a tool that
could optionally show me where my source code violates clang-format
syle.
When I'm making a change to clang-format itself, one thing I like to do
to test the change is to ensure I didn't cause a huge wave of changes,
what I want to do is simply run this on a known formatted directory and
see if any new differences arrive in a manner I'm used to.
This started me thinking that we should allow build systems to run
clang-format on a whole tree and emit compiler style warnings about
files that fail clang-format in a form that would make them as a warning
in most build systems and because those build systems range in their
construction I don't think its unreasonable to NOT expect them to have
to do the directory searching or parsing the output replacements
themselves, but simply transform that into an error code when there are
changes required.
I am starting this by suggesing adding a -n or -dry-run command line
argument which would emit a warning/error of the form
Support for various common compiler command line argumuments like
'-Werror' and '-ferror-limit' could make this very flexible to be
integrated into build systems and CI systems.
```
> $ /usr/bin/clang-format --dry-run ClangFormat.cpp -ferror-limit=3
-fcolor-diagnostics
> ClangFormat.cpp:54:29: warning: code should be clang-formatted
[-Wclang-format-violations]
> static cl::list<std::string>
>                             ^
> ClangFormat.cpp:55:20: warning: code should be clang-formatted
[-Wclang-format-violations]
> LineRanges("lines", cl::desc("<start line>:<end line> - format a range
of\n"
>                    ^
> ClangFormat.cpp:55:77: warning: code should be clang-formatted
[-Wclang-format-violations]
> LineRanges("lines", cl::desc("<start line>:<end line> - format a range
of\n"
>                                                                      
     ^
```
Reviewers: mitchell-stellar, klimek, owenpan
Reviewed By: klimek
Subscribers: mgorny, cfe-commits
Tags: #clang-format, #clang-tools-extra, #clang
Differential Revision: https://reviews.llvm.org/D68554
llvm-svn: 374663
The file was addedclang/test/Format/dry-run.cpp
The file was modifiedclang/tools/clang-format/ClangFormat.cpp
The file was addedclang/test/Format/dry-run-alias.cpp
The file was modifiedclang/tools/clang-format/CMakeLists.txt