Conference Talk: Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs
Talk, USENIX Annual Technical Conference (ATC '24), Santa Clara, CA, USA
Presented our full paper on FP6-LLM at the USENIX Annual Technical Conference (ATC '24). The talk covered the compiler and runtime co-design that enables unified Tensor Core support for 6-bit quantized LLM inference, and reported an end-to-end throughput speedup of up to 2.65x over the FP16 baseline on LLaMA-70B.