Use native _Float16 for AVX512FP16 and update vectorization.

This makes scalar operations faster, since they now run natively on
`_Float16`.  The half and quarter packets also use the native type when
it is available.
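
As a rough illustration of the idea (not the code in this change), the sketch below selects the native type only when the compiler targets AVX512-FP16. The names `half_storage`, `kNativeHalf`, and `madd` are hypothetical, and the fallback branch is a simplified stand-in that computes in `float` without rounding to 16 bits.

```cpp
// Hypothetical sketch; names are illustrative and not part of Eigen.
#if defined(__AVX512FP16__)
// The compiler targets AVX512-FP16, so _Float16 arithmetic can be
// emitted as native half-precision instructions.
using half_storage = _Float16;
constexpr bool kNativeHalf = true;
#else
// Simplified fallback: compute in float. A real half type would also
// round each result back to 16 bits.
using half_storage = float;
constexpr bool kNativeHalf = false;
#endif

// Scalar multiply-add written once against whichever type was selected.
inline half_storage madd(half_storage a, half_storage b, half_storage c) {
  return a * b + c;
}
```

Compiling with `-mavx512fp16` on a compiler that supports it takes the native branch; otherwise the fallback is used and the calling code stays unchanged.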

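Benchmark improvement (the first two columns are the relative change in
time; negative means faster).  `Eigen::half` is about 32% faster in
`BM_CalcMat` and about 78% faster in `BM_CalcVec`, while `float` and
`_Float16` change by less than 1%: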
```
Comparing ./2910_without_float16 to ./2910_with_float16
Benchmark                                               Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------------------------------------
BM_CalcMat<float>/10000/768/500                      -0.0041         -0.0040      58276392      58039442      58273420      58039582
BM_CalcMat<_Float16>/10000/768/500                   +0.0073         +0.0073     642506339     647214446     642481384     647188303
BM_CalcMat<Eigen::half>/10000/768/500                -0.3170         -0.3170      92511115      63182101      92506771      63179258
BM_CalcVec<float>/10000/768/500                      +0.0022         +0.0022       5198157       5209469       5197913       5209334
BM_CalcVec<_Float16>/10000/768/500                   +0.0025         +0.0026      10133324      10159111      10132641      10158507
BM_CalcVec<Eigen::half>/10000/768/500                -0.7760         -0.7760      45337937      10156952      45336532      10156389
OVERALL_GEOMEAN                                      -0.2677         -0.2677             0             0             0             0
```

Fixes #2910.
15 files changed