Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost

Published in Annual Conference on Machine Learning and Systems (MLSys), 2026

Kitty is an algorithm-system co-design for accurate 2-bit KV cache quantization. It reduces GPU memory consumption and increases throughput during LLM inference, and it is designed and optimized for efficient large language model serving.
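The summary above stays at a high level. To give a rough sense of what channel-wise mixed precision means for a KV cache, here is a minimal NumPy sketch. It assumes keys are quantized per channel with asymmetric uniform quantization, and it uses a hypothetical rule (boost the widest-range channels from 2 to 4 bits) to choose which channels get extra precision. The function names, the 4-bit boost width, and the selection rule are illustrative assumptions, not Kitty's actual algorithm or kernels. The code only simulates quantization (quantize, then dequantize) and does not pack bits.

```python
import numpy as np


def quantize_channel(x, bits):
    """Asymmetric uniform quantization of one channel (1-D array) to `bits` bits."""
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0  # avoid divide-by-zero on constant channels
    codes = np.clip(np.round((x - lo) / scale), 0, levels).astype(np.uint8)
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map integer codes back to approximate real values."""
    return codes.astype(np.float32) * scale + lo


def quantize_with_boost(k, n_boost, low_bits=2, high_bits=4):
    """Fake-quantize a [tokens, channels] key block: most channels use `low_bits`,
    and the `n_boost` channels with the widest value range use `high_bits`.
    The range-based selection rule is a hypothetical stand-in heuristic."""
    ranges = k.max(axis=0) - k.min(axis=0)
    boosted = np.argsort(ranges)[len(ranges) - n_boost:]  # widest-range channels
    bits = np.full(k.shape[1], low_bits)
    bits[boosted] = high_bits
    k_hat = np.empty_like(k)
    for c in range(k.shape[1]):
        codes, scale, lo = quantize_channel(k[:, c], int(bits[c]))
        k_hat[:, c] = dequantize(codes, scale, lo)
    return k_hat, bits


rng = np.random.default_rng(0)
k = rng.standard_normal((64, 128)).astype(np.float32)
k[:, :4] *= 10.0  # simulate a few outlier channels
k_hat, bits = quantize_with_boost(k, n_boost=16)
print("average bits per element:", bits.mean())  # (16*4 + 112*2) / 128 = 2.25
print("size vs FP16:", 16 / bits.mean())
```

With 16 of 128 channels boosted, the average is 2.25 bits per element, about 7.1x smaller than FP16 before counting each channel's scale and zero-point. The per-channel loop keeps the sketch readable. A serving system would instead pack the codes densely and dequantize inside its attention kernels.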


Recommended citation: Haojun Xia, Xiaoxia Wu, Jisen Li, Tsai-chuan Wu, Junxiong Wang, Jue Wang, Chenxi Li, Aman Singhal, Alay Dilipbhai Shah, Alpay Ariyak, Donglin Zhuang, Zhongzhu Zhou, Ben Athiwaratkun, Zhen Zheng, Shuaiwen Song. "Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost." Annual Conference on Machine Learning and Systems (MLSys), 2026.


Dr. Haojun Xia
