Performs a matrix fused multiply-accumulate operation on 16x16x128 submatrices for f8f6f4 data types.
#include <amd_xdlops.hpp>
| template<class FloatC> |
| static __device__ void | Run (const f8x32_t &reg_a, const f8x32_t &reg_b, FloatC &reg_c) |
| template<class FloatC> |
| static __device__ void | Run (const bf8x32_t &reg_a, const bf8x32_t &reg_b, FloatC &reg_c) |
| template<class FloatC> |
| static __device__ void | Run (const bf8x32_t &reg_a, const f8x32_t &reg_b, FloatC &reg_c) |
| template<class FloatC> |
| static __device__ void | Run (const f8x32_t &reg_a, const bf8x32_t &reg_b, FloatC &reg_c) |
| template<class FloatC> |
| static __device__ void | Run (const f4x32_t &reg_a, const f4x32_t &reg_b, FloatC &reg_c) |
| template<class FloatC> |
| static __device__ void | Run (const f6x32_t &reg_a, const f6x32_t &reg_b, FloatC &reg_c) |
| template<class FloatC> |
| static __device__ void | Run (const bf6x32_t &reg_a, const bf6x32_t &reg_b, FloatC &reg_c) |
- Note
- Every overload calls the scaled version of the instruction, because the backend does not support the original (unscaled) instruction. This is the intended use. A backend optimization selects the unscaled operation when the scale is 0.
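To make the 16x16x128 shape concrete, here is a minimal host-side sketch of the arithmetic one such operation performs: a 16x16 accumulator tile C is updated with the product of a 16x128 tile A and a 128x16 tile B. It is not the device intrinsic and not part of amd_xdlops.hpp. The names `TileA`, `TileB`, `TileC`, and `reference_mfma_16x16x128` are hypothetical, the row-major layout is an assumption, and plain `float` stands in for the already-decoded f8/f6/f4 values. How elements are distributed across lanes and registers on the device is not modeled.

```cpp
#include <array>
#include <cstddef>

// Tile shape named in the description: C (16x16) += A (16x128) * B (128x16).
constexpr std::size_t M = 16;
constexpr std::size_t N = 16;
constexpr std::size_t K = 128;

// Hypothetical host-side tiles, stored row-major. float stands in for the
// decoded low-precision inputs; the real operands are packed vector types.
using TileA = std::array<float, M * K>;
using TileB = std::array<float, K * N>;
using TileC = std::array<float, M * N>;

// Hypothetical reference routine (an illustration, not the library API):
// accumulates A * B into C, mirroring the role of the FloatC &reg_c argument.
void reference_mfma_16x16x128(const TileA& a, const TileB& b, TileC& c)
{
    for(std::size_t m = 0; m < M; ++m)
    {
        for(std::size_t n = 0; n < N; ++n)
        {
            float acc = c[m * N + n]; // start from the existing accumulator
            for(std::size_t k = 0; k < K; ++k)
            {
                acc += a[m * K + k] * b[k * N + n];
            }
            c[m * N + n] = acc;
        }
    }
}
```

A routine like this can serve as a correctness baseline when comparing device results for any of the operand type combinations listed above, provided the inputs are first converted to `float` on the host.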
The documentation for this struct was generated from the following file: