FPGAs combine HBM2 memory with reconfigurable pipeline logic to efficiently handle memory access patterns that other parallel architectures might struggle with. This is demonstrated here using a 2D FFT, which requires a matrix transpose. We can overlap the transpose operation with computation using minimal buffering, making it almost free from a throughput perspective. Heavily optimized FPGA FFTs can benefit significantly from reduced precision, but for easier comparison with other technologies we use floating point in this example.
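To make the role of the transpose concrete, here is a minimal host-side sketch of the row-column method for a 2D FFT: 1D FFTs over the rows, a transpose, 1D FFTs over the rows again, and a final transpose. It is plain C++ rather than SYCL, and it is not the BittWare kernel. The names `fft1d`, `transpose` and `fft2d`, the use of single-precision `float`, and the restriction to square power-of-two sizes are all assumptions made for illustration. The transpose here runs as a separate pass over memory, whereas the FPGA design described above overlaps it with computation.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

// Assumption: single precision; the article only says "floating point".
using cfloat = std::complex<float>;

// In-place iterative radix-2 FFT of one row of length n (n must be a power of two).
void fft1d(cfloat* x, std::size_t n) {
    assert(n != 0 && (n & (n - 1)) == 0);
    // Bit-reversal permutation so the butterflies can run in place.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(x[i], x[j]);
    }
    // Butterfly stages of length 2, 4, ..., n.
    const float pi = 3.14159265358979323846f;
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const float ang = -2.0f * pi / static_cast<float>(len);
        for (std::size_t i = 0; i < n; i += len) {
            for (std::size_t k = 0; k < len / 2; ++k) {
                const cfloat w = std::polar(1.0f, ang * static_cast<float>(k));
                const cfloat u = x[i + k];
                const cfloat v = x[i + k + len / 2] * w;
                x[i + k] = u + v;
                x[i + k + len / 2] = u - v;
            }
        }
    }
}

// In-place transpose of a square n x n matrix stored in row-major order.
void transpose(std::vector<cfloat>& m, std::size_t n) {
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = r + 1; c < n; ++c)
            std::swap(m[r * n + c], m[c * n + r]);
}

// 2D FFT by the row-column method: row FFTs, transpose, row FFTs, transpose.
void fft2d(std::vector<cfloat>& m, std::size_t n) {
    assert(m.size() == n * n);
    for (std::size_t r = 0; r < n; ++r) fft1d(&m[r * n], n);
    transpose(m, n);  // columns become rows, so the second pass reads contiguously
    for (std::size_t r = 0; r < n; ++r) fft1d(&m[r * n], n);
    transpose(m, n);  // restore the original row-major orientation
}
```

Calling `fft2d(m, n)` on a row-major `n * n` vector transforms it in place. In this sketch the two transposes touch every element without doing any arithmetic, which is the kind of memory traffic the FPGA design hides by overlapping it with the FFT pipeline.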
BittWare previously created a 2D FFT kernel for FPGAs using Intel’s OpenCL compiler. We have now rewritten that code for the same 520N-MX card to leverage Intel’s oneAPI programming model, specifically its SYCL programming language. We achieved similar performance to the OpenCL version.
The peak HBM2 bandwidth on the Stratix 10 MX for a batch-of-1 implementation, with two independent 2D FFT kernels in the same device, is 291 GB/s. With pipelining/batching, a peak bandwidth of 337 GB/s is possible.
The key benefit we find in using high-level tools is a significant reduction in development time.