CV

Education

Ph.D. in Computer Science, The University of Sydney (USYD), 2022 – 2026
- Advisor: Shuaiwen Leon Song
- Thesis: Compression-Driven Memory-Efficient and High-Throughput GPU Systems for LLM Inference.
M.S. in Computer Architecture, University of Science and Technology of China (USTC), 2018 – 2021
- Thesis: The design and implementation of a lightweight automata processor.
- Honors: First Class Scholarship
B.S. in Computer Science and Technology, University of Science and Technology of China (USTC), 2014 – 2018
- Thesis: FPGA Based CNN Accelerator Design.
- Honors: Talent Program

Work Experience

Research Software Engineer, Google Research
- Nov 2025 – Present (Full-time, On-site)
- Duties: Optimizing production-level TPU-based LLM inference systems.
Research Consultant, Together AI
- Mar 2024 – Sep 2024 (Part-time, Remote)
- Duties: Optimizing the LLM inference system, identifying and mitigating performance issues in LoRA inference, and developing the FP8 MHA Decoding GPU kernel using OpenAI Triton.
Research Intern, Alibaba Cloud
- Feb 2022 – Aug 2023 (Remote, Unpaid)
- Duties: Continued the earlier research project on large-scale ML model acceleration frameworks.
Research Intern, Alibaba Cloud
- Aug 2021 – Jan 2022 (On-site)
- Duties: Part of the Alibaba Innovative Research (AIR) program. Investigated SOTA system support for LLMs and researched and developed a novel large-scale ML model acceleration framework.

Skills

Research Interests: Performance Optimization, Machine Learning Systems, Runtime Systems, Computer Architecture, Domain-Specific Architectures, GPU/TPU Kernel Design
Programming Languages: JAX, Pallas (TPU), C/C++, Python, CUDA (GPU), Triton (GPU), Verilog HDL (FPGA)
Software & Frameworks: Machine Learning Frameworks (e.g., PyTorch, Hugging Face Transformers, FasterTransformer), Embedded System Design and Implementation (e.g., Xilinx FPGA + ARM CPUs)

Selected Publications

[MLSys’26] Haojun Xia, Xiaoxia Wu, Jisen Li, et al. “Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost.”
[ATC’24] Haojun Xia, Zhen Zheng, Xiaoxia Wu, et al. “FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design.”
[VLDB’24] Haojun Xia, Zhen Zheng, Yuchao Li, et al. “Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity.”
[OSDI’24] Donglin Zhuang, Zhen Zheng, Haojun Xia, et al. “MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures.”

PhD Thesis: Compression-Driven Memory-Efficient and High-Throughput GPU Systems for LLM Inference

Haojun Xia. "Compression-Driven Memory-Efficient and High-Throughput GPU Systems for LLM Inference." PhD Thesis, The University of Sydney, 2026.

Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost

Haojun Xia, Xiaoxia Wu, Jisen Li, Tsai-chuan Wu, Junxiong Wang, Jue Wang, Chenxi Li, Aman Singhal, Alay Dilipbhai Shah, Alpay Ariyak, Donglin Zhuang, Zhongzhu Zhou, Ben Athiwaratkun, Zhen Zheng, Shuaiwen Song. "Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost." Annual Conference on Machine Learning and Systems (MLSys), 2026.

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song. "Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity." International Conference on Very Large Data Bases (VLDB), 2024.

FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song. "FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design." USENIX Annual Technical Conference (ATC), 2024.

Shift-BNN: Highly-Efficient Probabilistic Bayesian Neural Network Training via Memory-Friendly Pattern Retrieving

Qiyu Wan, Haojun Xia, Xingyao Zhang, Lening Wang, Shuaiwen Leon Song, Xin Fu. "Shift-BNN: Highly-Efficient Probabilistic Bayesian Neural Network Training via Memory-Friendly Pattern Retrieving." In Proceedings of the 54th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2021.

Master’s Thesis: The design and implementation of a lightweight automata processor

Haojun Xia. "The design and implementation of a lightweight automata processor." Master's Thesis, University of Science and Technology of China, 2021.

η-LSTM: Co-Designing Highly-Efficient Large LSTM Training via Exploiting Memory-Saving and Architectural Design Opportunities

Xingyao Zhang, Haojun Xia, Donglin Zhuang, Hao Sun, Xin Fu, Michael Taylor, Shuaiwen Leon Song. "η-LSTM: Co-Designing Highly-Efficient Large LSTM Training via Exploiting Memory-Saving and Architectural Design Opportunities." In Proceedings of the International Symposium on Computer Architecture (ISCA), 2021.

LAP: A Lightweight Automata Processor for Pattern Matching Tasks

Haojun Xia, Lei Gong, Chao Wang, Xianglan Chen, Xuehai Zhou. "LAP: A Lightweight Automata Processor for Pattern Matching Tasks." In Proceedings of the Design, Automation and Test in Europe Conference (DATE), 2021.

Talks

Conference Talk: Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs

July 12, 2024

Talk at USENIX Annual Technical Conference (ATC '24), Santa Clara, CA, USA

Teaching

Teaching Assistant, USTC
- Mar 2019 – Jun 2019: Computer Architecture 2019
- Mar 2018 – Jun 2018: Computer Architecture 2018
- Sep 2017 – Dec 2017: Digital Circuit Theory 2017
- Sep 2016 – Dec 2016: Digital Circuit Experiments 2016

Honors and Achievements

2022 – 2026: Faculty of Engineering Research Scholarship, PhD study, USYD
2021: Outstanding Graduates (Top 15%), Master’s study, USTC
2020: Suzhou Park Scholarship, Master’s study, USTC
2018 – 2021: First Class Academic Scholarship, Master’s study, USTC
2018: Yang Yuanqing Education Fund - Top Research Scholarship, Bachelor’s study, USTC
2018: Outstanding Graduates (Top 15%), Bachelor’s study, USTC
2014 – 2018: Talent program in computer science and technology, USTC

Dr. Haojun Xia
