Posts by Collection

portfolio

publications

LAP: A Lightweight Automata Processor for Pattern Matching Tasks

Published in Design, Automation and Test in Europe Conference (DATE), 2021

Designed and instantiated a lightweight pattern-matching hardware processor on real FPGA boards, integrating it with ARM CPUs.

Recommended citation: Haojun Xia, Lei Gong, Chao Wang, Xianglan Chen, Xuehai Zhou. "LAP: A Lightweight Automata Processor for Pattern Matching Tasks." In Proceedings of Design, Automation and Test in Europe Conference (DATE), 2021.

η-LSTM: Co-Designing Highly-Efficient Large LSTM Training via Exploiting Memory-Saving and Architectural Design Opportunities

Published in International Symposium on Computer Architecture (ISCA), 2021

Prototyped an efficient hardware architecture for large-scale LSTM network training, utilizing variable compression and cell skipping.

Recommended citation: Xingyao Zhang, Haojun Xia, Donglin Zhuang, Hao Sun, Xin Fu, Michael Taylor, Shuaiwen Leon Song. "η-LSTM: Co-Designing Highly-Efficient Large LSTM Training via Exploiting Memory-Saving and Architectural Design Opportunities." In Proceedings of International Symposium on Computer Architecture (ISCA), 2021.

Master’s Thesis: The design and implementation of a lightweight automata processor

Published in University of Science and Technology of China (USTC), 2021

This thesis focuses on the architecture design, Verilog HDL prototyping, and hardware-software co-design of a high-performance, memory-efficient lightweight automata processing engine for large-scale pattern matching.

Recommended citation: Haojun Xia. "The design and implementation of a lightweight automata processor." Master's Thesis, University of Science and Technology of China, 2021.

Shift-BNN: Highly-Efficient Probabilistic Bayesian Neural Network Training via Memory-Friendly Pattern Retrieving

Published in IEEE/ACM International Symposium on Microarchitecture (MICRO), 2021

Co-designed a hardware accelerator and memory optimization strategy for highly efficient Bayesian Neural Network (BNN) training on server and edge devices.

Recommended citation: Qiyu Wan, Haojun Xia, Xingyao Zhang, Lening Wang, Shuaiwen Leon Song, Xin Fu. "Shift-BNN: Highly-Efficient Probabilistic Bayesian Neural Network Training via Memory-Friendly Pattern Retrieving." In Proceedings of the 54th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2021.

FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

Published in USENIX Annual Technical Conference (ATC), 2024

We designed and implemented a GPU kernel with unified Tensor Core support for various quantization bit-widths, achieving up to 2.65x throughput improvement on LLaMA-70B.

Recommended citation: Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song. "FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design." USENIX Annual Technical Conference (ATC), 2024.

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Published in International Conference on Very Large Data Bases (VLDB), 2024

This work develops a highly efficient LLM acceleration framework that provides runtime support for LLM inference with unstructured sparsity, reducing inference costs by up to 50%.

Recommended citation: Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song. "Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity." International Conference on Very Large Data Bases (VLDB), 2024.

Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost

Published in Annual Conference on Machine Learning and Systems (MLSys), 2026

This paper introduces an algorithm-system co-design for accurate 2-bit KV cache quantization, significantly reducing GPU memory consumption and increasing inference throughput during LLM inference.

Recommended citation: Haojun Xia, Xiaoxia Wu, Jisen Li, Tsai-chuan Wu, Junxiong Wang, Jue Wang, Chenxi Li, Aman Singhal, Alay Dilipbhai Shah, Alpay Ariyak, Donglin Zhuang, Zhongzhu Zhou, Ben Athiwaratkun, Zhen Zheng, Shuaiwen Song. "Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost." Annual Conference on Machine Learning and Systems (MLSys), 2026.

PhD Thesis: Compression-Driven Memory-Efficient and High-Throughput GPU Systems for LLM Inference

Published in The University of Sydney (USYD), 2026

My doctoral dissertation focuses on alleviating the memory wall and communication bottlenecks in Large Language Model (LLM) inference through algorithm-system co-design, including low-bit quantization and unstructured sparsity acceleration.

Recommended citation: Haojun Xia. "Compression-Driven Memory-Efficient and High-Throughput GPU Systems for LLM Inference." PhD Thesis, The University of Sydney, 2026.

talks

teaching

Teaching Assistant - Digital Circuit Experiments

Undergraduate Lab Course, University of Science and Technology of China (USTC), 2016

Served as a Lab Teaching Assistant for Digital Circuit Experiments in Fall 2016. Guided undergraduate students through hands-on hardware laboratory sessions, helped troubleshoot hardware prototype testing, and graded experimental reports.

Teaching Assistant - Digital Circuit Theory

Undergraduate Course, University of Science and Technology of China (USTC), 2017

Served as a Teaching Assistant for the Digital Circuit Theory course in Fall 2017. Assisted students in mastering fundamental logic design, sequential circuits, and theoretical building blocks of digital hardware.

Teaching Assistant - Computer Architecture

Undergraduate Course, University of Science and Technology of China (USTC), 2018

Served as a Teaching Assistant for the Computer Architecture course in Spring 2018. Guided students through advanced computer architecture topics, managed course logistics, and evaluated academic performance.

Teaching Assistant - Computer Architecture

Undergraduate Course, University of Science and Technology of China (USTC), 2019

Served as a Teaching Assistant for the Computer Architecture course in Spring 2019. Duties included holding office hours, grading assignments, and helping undergraduate students understand core computer architecture concepts.