FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

Published in USENIX Annual Technical Conference (ATC), 2024

Designed and implemented a GPU kernel with unified Tensor Core support for various quantization bit-widths. Developed end-to-end support for quantized inference, yielding a $1.69\times$ to $2.65\times$ throughput improvement on LLaMA-70B. Widely adopted by industrial frameworks, including Microsoft/DeepSpeed and PyTorch/AO.


Recommended citation: Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song. "FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design." USENIX Annual Technical Conference (ATC), 2024.