[X86] Lower the cost of avx512 horizontal bool and/or reductions to (details)
[X86] Lower the cost of avx512 horizontal bool and/or reductions to 2*log2(bitwidth)+1 for legal types. This better represents the kshift+binop we'd get for each stage before the final extract. Its likely we'll do even better by doing a kmov and a cmp with a GPR, but this is a good start. The default handling was costing a worst case single source permute shuffle of the vector before the binop. This worst case assumes the shuffle might have to be emulated with extracts and inserts. But since we know we're doing a reduction we can assume we'll get kshift lowering. There's still some room for improvement here, but this is much better than it was.