Opt4GPTQ: Co-Optimizing Memory and Computation for 4-bit GPTQ Quantized LLM Inference on Heterogeneous Platforms
/ Authors
/ Abstract
The increasing adoption of large language models (LLMs) on heterogeneous computing platforms poses significant challenges to achieving high inference efficiency. To address these efficiency bottlenecks across diverse platforms, this paper proposes Opt4GPTQ, a practical optimization method designed for 4-bit GPTQ quantized LLMs inference on heterogeneous AI accelerators. Built upon the vLLM serving system, Opt4GPTQ integrates three platform-level optimization strategies: Shared Memory Buffering Optimization (SMB-Opt), which caches frequently accessed data in shared memory and employs single-threaded writes; Vectorized Memory Loading Optimization (VML-Opt), which utilizes vectorized memory operations for efficient data loading; and Inline Assembly Optimization (ILA-Opt), which directly leverages hardwarenative vector half-precision addition and fused multiply-accumulate instructions. Experimental results show that Opt4GPTQ effectively improves performance across various models while maintaining original model accuracy, achieving throughput gains of up to 84.42%. This work highlights the critical role of platformlevel engineering in enabling efficient LLMs inference on emerging architectures and provides valuable methodologies for future heterogeneous platform adaptation.
Journal: ArXiv