Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Published in International Conference on Very Large Data Bases (VLDB), 2024

Identified and analyzed the HBM (high-bandwidth memory) bandwidth bottleneck in LLM inference. Built an efficient LLM acceleration framework that provides runtime support for inference with unstructured sparsity, reducing inference costs by up to 50% for models such as OPT-175B.

Recommended citation: Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song. "Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity." International Conference on Very Large Data Bases (VLDB), 2024.