Efficient FPGA Implementation of Conjugate Gradient Methods for Laplacian System using HLS
/ Authors
/ Abstract
In the conjugate gradient (CG) method for solving partial differential equations, each iteration requires matrix-vector operations; these can be implemented as 5-point finite difference stencil operations on the grid without explicitly constructing the matrix. We show that a pipelined and superscalar design using high-level synthesis (HLS) leads to a significant reduction in latency for the two CG variants considered. Comparing the two, we show that the second variant has roughly half the latency of the first for the same degree of superscalarity. This reduction in latency for the newer CG variant is due to the parallel implementation of the stencil operation on subdomains of the grid and to the overlap of these stencil operations with dot product operations. In a superscalar design for the stencil operation, the computational domain needs to be partitioned and boundary data needs to be copied, which requires padding. We propose a novel traversal of the grid for 2D domain decomposition that reduces the latency cost of padding. The FPGA implementation of CG is roughly 7 times faster than a state-of-the-art sequential implementation and roughly 4 times faster than a state-of-the-art CUDA library parallel implementation for a linear system of dimension 10000 x 10000.
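The matrix-free idea in the abstract can be illustrated with a minimal software sketch. It is the standard textbook CG iteration, not either of the two variants compared in the paper and not the HLS design. The function names (`apply_laplacian`, `dot`, `conjugate_gradient`), the zero Dirichlet boundary, and the unit grid spacing are assumptions made for illustration only.

```python
def apply_laplacian(u, n):
    """Apply the 5-point stencil (4*center minus the 4 neighbours) to an
    n x n grid stored row-major in u. Assumes zero Dirichlet boundary values
    and unit grid spacing; the matrix itself is never formed."""
    out = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            k = i * n + j
            acc = 4.0 * u[k]
            if i > 0:
                acc -= u[k - n]   # neighbour in the previous row
            if i < n - 1:
                acc -= u[k + n]   # neighbour in the next row
            if j > 0:
                acc -= u[k - 1]   # left neighbour
            if j < n - 1:
                acc -= u[k + 1]   # right neighbour
            out[k] = acc
    return out


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(b, n, tol=1e-10, max_iter=1000):
    """Textbook CG for A x = b, where A is applied via apply_laplacian.
    Returns the approximate solution and the number of iterations used."""
    x = [0.0] * (n * n)
    r = list(b)            # residual b - A x with x = 0
    p = list(r)            # initial search direction
    rs_old = dot(r, r)
    for it in range(max_iter):
        if rs_old ** 0.5 < tol:
            return x, it
        ap = apply_laplacian(p, n)          # the stencil replaces the mat-vec
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter
```

Each iteration contains one stencil application and two dot products. These are the operations the abstract describes parallelizing across subdomains and overlapping with one another, although this sequential sketch does neither.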
Journal: Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays