Changes

Summary

  1. Bump the value of __STDC_VERSION__ in -std=c2x mode
  2. This patch supports the following checks for THREADPRIVATE Directive:
  3. [X86][Costmodel] Load/store i16 Stride=6 VF=32 interleaving costs
  4. [X86][Costmodel] Load/store i32 Stride=3 VF=32 interleaving costs
  5. [X86][Costmodel] Load/store i32 Stride=4 VF=32 interleaving costs
  6. [X86][Costmodel] Load/store i64 Stride=2 VF=32 interleaving costs
  7. [X86][Costmodel] Load/store i64 Stride=4 VF=16 interleaving costs
  8. [ConstantRange] Add fast signed multiply
Commit c8be7743acc7e8ea32ba9985c1d57c38f0eab010 by aaron
Bump the value of __STDC_VERSION__ in -std=c2x mode

Previously, we reported the same value as for C17; now we report 202000L, which
is the same value currently used by GCC.

Once C23 ships, this value will be bumped to the correct date.
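
As an illustration of what the new value means for user code, here is a minimal sketch of a helper that classifies a `__STDC_VERSION__` value. The function name `c_standard_name` and the returned strings are hypothetical; only the 202000L placeholder comes from the commit message, and the older values are the standard ones for C99, C11 and C17.

```c
/* Hypothetical helper (not part of Clang): map a __STDC_VERSION__ value
   to a readable name. 202000L is the placeholder the commit reports in
   -std=c2x mode; it is expected to change once C23 ships. */
const char *c_standard_name(long version) {
    if (version >= 202000L) return "C2x (placeholder value)";
    if (version >= 201710L) return "C17";
    if (version >= 201112L) return "C11";
    if (version >= 199901L) return "C99";
    return "C95 or earlier";
}
```

A caller would typically pass the macro itself, as in `c_standard_name(__STDC_VERSION__)`, keeping in mind that the macro may be undefined in C90 mode.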
The file was added clang/test/Preprocessor/c2x.c
The file was modified clang/lib/Frontend/InitPreprocessor.cpp
The file was modified clang/docs/ReleaseNotes.rst
Commit dd8c8d4b7cee7cb58b40e0456d656d68a31ef3b4 by qiaopeixin
This patch supports the following checks for THREADPRIVATE Directive:
```
[5.1] 2.21.2 THREADPRIVATE Directive
A variable that appears in a threadprivate directive must be declared in
the scope of a module or have the SAVE attribute, either explicitly or
implicitly.
A variable that appears in a threadprivate directive must not be an
element of a common block or appear in an EQUIVALENCE statement.
```

This patch supports the following checks for DECLARE TARGET Directive:
```
[5.1] 2.14.7 Declare Target Directive
A variable that is part of another variable (as an array, structure
element or type parameter inquiry) cannot appear in a declare
target directive.
A variable that appears in a declare target directive must be declared
in the scope of a module or have the SAVE attribute, either explicitly
or implicitly.
A variable that appears in a declare target directive must not be an
element of a common block or appear in an EQUIVALENCE statement.
```

As the Fortran 2018 standard [8.5.16] states, a variable, common block, or
procedure pointer declared in the scoping unit of a main program,
module, or submodule implicitly has the SAVE attribute, which may be
confirmed by explicit specification.

Reviewed By: kiranchandramohan

Differential Revision: https://reviews.llvm.org/D109864
The file was modified flang/lib/Semantics/check-omp-structure.h
The file was modified flang/test/Semantics/omp-declarative-directive.f90
The file was added flang/test/Semantics/omp-declare-target02.f90
The file was modified flang/lib/Semantics/check-omp-structure.cpp
The file was added flang/test/Semantics/omp-threadprivate02.f90
The file was added flang/test/Semantics/omp-declare-target01.f90
Commit 887acf6842cb48e7c51728ed8d81fc5ab0425403 by lebedev.ri
[X86][Costmodel] Load/store i16 Stride=6 VF=32 interleaving costs

A few more tuples are being queried after D111546. It might be good to model them.
They all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/YTeT9M7fW - for Intel CPUs, `Block RThroughput: <=212.0`; for Ryzen CPUs, `Block RThroughput: <=64.0`
So we could pick a cost of `212`.

For store we have:
https://godbolt.org/z/vc954KEGP - for Intel CPUs, `Block RThroughput: <=90.0`; for Ryzen CPUs, `Block RThroughput: <=24.0`
So we could pick a cost of `90`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.
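
The way the costs in this series of commits are chosen appears to be "take the worst reciprocal-throughput bound across the CPU families that were measured". A minimal sketch of that rule follows, assuming two families and rounding to the nearest integer; the function `pick_interleave_cost` is hypothetical and is not how X86TargetTransformInfo.cpp stores its cost tables.

```c
/* Hypothetical helper (not LLVM code): choose one cost for an
   interleaving pattern as the larger of the two reciprocal-throughput
   bounds reported for the Intel and Ryzen sched models. */
unsigned pick_interleave_cost(double intel_rthroughput,
                              double ryzen_rthroughput) {
    double worst = intel_rthroughput > ryzen_rthroughput
                       ? intel_rthroughput
                       : ryzen_rthroughput;
    return (unsigned)(worst + 0.5); /* round to the nearest integer */
}
```

With the numbers quoted above, `pick_interleave_cost(212.0, 64.0)` gives 212 for the load and `pick_interleave_cost(90.0, 24.0)` gives 90 for the store.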

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111940
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-load-i16-stride-6.ll
The file was modified llvm/lib/Target/X86/X86TargetTransformInfo.cpp
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-store-i16-stride-6.ll
Commit 4b76a74b4283362f69748c4d0a5bc22b1237ced0 by lebedev.ri
[X86][Costmodel] Load/store i32 Stride=3 VF=32 interleaving costs

A few more tuples are being queried after D111546. It might be good to model them.
They all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/s5b6E6jsP - for Intel CPUs, `Block RThroughput: <=32.0`; for Ryzen CPUs, `Block RThroughput: <=24.0`
So we could pick a cost of `32`.

For store we have:
https://godbolt.org/z/efh99d93b - for Intel CPUs, `Block RThroughput: <=48.0`; for Ryzen CPUs, `Block RThroughput: <=32.0`
So we could pick a cost of `48`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111942
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-3-indices-0uu.ll
The file was modified llvm/lib/Target/X86/X86TargetTransformInfo.cpp
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-3-indices-01u.ll
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-3.ll
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-store-i32-stride-3.ll
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-load-f32-stride-3.ll
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-store-f32-stride-3.ll
Commit 3a6a9f74d3a59beb359a9968ac27dcf97d072b3a by lebedev.ri
[X86][Costmodel] Load/store i32 Stride=4 VF=32 interleaving costs

A few more tuples are being queried after D111546. It might be good to model them.
They all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/11rcvdreP - for Intel CPUs, `Block RThroughput: <=68.0`; for Ryzen CPUs, `Block RThroughput: <=48.0`
So we could pick a cost of `68`.

For store we have:
https://godbolt.org/z/6aM11fWcP - for Intel CPUs, `Block RThroughput: <=64.0`; for Ryzen CPUs, `Block RThroughput: <=32.0`
So we could pick a cost of `64`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111943
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-load-f32-stride-4.ll
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-store-i32-stride-4.ll
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4-indices-01uu.ll
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4.ll
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-store-f32-stride-4.ll
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4-indices-012u.ll
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4-indices-0uuu.ll
The file was modified llvm/lib/Target/X86/X86TargetTransformInfo.cpp
Commit 3274ce3a287dcd4d02b4d2c7a2bf60e942836e06 by lebedev.ri
[X86][Costmodel] Load/store i64 Stride=2 VF=32 interleaving costs

A few more tuples are being queried after D111546. It might be good to model them.
They all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/MTaKboejM - for Intel CPUs, `Block RThroughput: =32.0`; for Ryzen CPUs, `Block RThroughput: <=16.0`
So we could pick a cost of `32`.

For store we have:
https://godbolt.org/z/v7xPj3Wd4 - for Intel CPUs, `Block RThroughput: =32.0`; for Ryzen CPUs, `Block RThroughput: <=32.0`
So we could pick a cost of `32`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111944
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-load-i64-stride-2.ll
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-store-f64-stride-2.ll
The file was modified llvm/lib/Target/X86/X86TargetTransformInfo.cpp
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-load-f64-stride-2.ll
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-store-i64-stride-2.ll
Commit 91373bf12ec66591addf56b9f447ec9befd6ddae by lebedev.ri
[X86][Costmodel] Load/store i64 Stride=4 VF=16 interleaving costs

A few more tuples are being queried after D111546. It might be good to model them.
They all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/9bnKrefcG - for Intel CPUs, `Block RThroughput: =40.0`; for Ryzen CPUs, `Block RThroughput: =16.0`
So we could pick a cost of `40`.

For store we have:
https://godbolt.org/z/5s3s14dEY - for Intel CPUs, `Block RThroughput: =40.0`; for Ryzen CPUs, `Block RThroughput: =16.0`
So we could pick a cost of `40`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111945
The file was modified llvm/lib/Target/X86/X86TargetTransformInfo.cpp
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-store-f64-stride-4.ll
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-load-i64-stride-4.ll
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-load-f64-stride-4.ll
The file was modified llvm/test/Analysis/CostModel/X86/interleaved-store-i64-stride-4.ll
Commit 274b2439f8392796e04e366ce5ff47434bd077e1 by nikita.ppv
[ConstantRange] Add fast signed multiply

The multiply() implementation is very slow -- it performs six
multiplications in double the bitwidth, which means that it will
typically work on allocated APInts and bypass fast-path
implementations. Add an additional implementation that doesn't
try to produce anything better than a full range if overflow is
possible. At least for the BasicAA use-case, we really don't care
about more precise modeling of overflow behavior. The current
use of multiply() is fine while the implementation is limited to
a single index, but extending it to the multiple-index case makes
the compile-time impact untenable.
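
The idea described above can be sketched roughly as follows. This is not the LLVM implementation: `SRange` and `smul_fast_sketch` are hypothetical names, the range is a plain closed interval of `int32_t` rather than a `ConstantRange` over `APInt`, and `__builtin_mul_overflow` is a GCC/Clang builtin used here as a stand-in for same-width overflow-checked multiplication.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical closed signed interval [lo, hi]; assumes lo <= hi. */
typedef struct { int32_t lo, hi; } SRange;

/* Multiply two ranges at their native width. If any corner product
   overflows, give up and return the full range instead of trying to
   model the wrapped result precisely. */
SRange smul_fast_sketch(SRange a, SRange b) {
    int32_t p[4];
    bool overflow = false;
    overflow |= __builtin_mul_overflow(a.lo, b.lo, &p[0]);
    overflow |= __builtin_mul_overflow(a.lo, b.hi, &p[1]);
    overflow |= __builtin_mul_overflow(a.hi, b.lo, &p[2]);
    overflow |= __builtin_mul_overflow(a.hi, b.hi, &p[3]);
    if (overflow)
        return (SRange){ INT32_MIN, INT32_MAX }; /* full range */

    /* No overflow: the product range is bounded by the corner products. */
    int32_t lo = p[0], hi = p[0];
    for (int i = 1; i < 4; ++i) {
        if (p[i] < lo) lo = p[i];
        if (p[i] > hi) hi = p[i];
    }
    return (SRange){ lo, hi };
}
```

The trade-off is the one the commit message names: four same-width multiplications and no wider temporaries, at the price of a less precise answer whenever overflow is possible.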
The file was modified llvm/lib/IR/ConstantRange.cpp
The file was modified llvm/include/llvm/IR/ConstantRange.h
The file was modified llvm/unittests/IR/ConstantRangeTest.cpp
The file was modified llvm/lib/Analysis/BasicAliasAnalysis.cpp