Hi, this is Wei Huang (黄炜)’s website! I am currently a Ph.D. student advised by Prof. Xiaojuan Qi and Prof. Shiming Zhang, and co-supervised by Prof. Zhongrui Wang. Previously, I obtained my bachelor’s degree in computer science (Jun 2023) from Beihang University, where I was advised by Prof. Si Liu and also worked with Prof. Xianglong Liu.

⛵ Now, I am fortunate to be collaborating closely with Dr. Yukang Chen and Dr. Ligeng Zhu on the Efficient-Large-Model project, led by Prof. Song Han.

I am currently conducting research on efficient/tiny deep learning and its applications, including:

🚀 Efficient AI: Efficiency of large language/vision-language models and diffusion models (e.g., model quantization/binarization).

🔥 Brain-inspired Computing: Neuromorphic computing and hardware acceleration (e.g., spiking neural networks, SNNs).

Edge AI: Edge AI for wearable devices and digital health.

🔥 News

  • 2025.05:  🎉🎉 One paper on structured mixed-precision low-bit quantization for LLMs (SliM-LLM) is accepted by ICML’25!
  • 2025.02:  🎉🎉 One paper on an efficient fine-grained chain-of-thought video understanding framework (VideoEspresso) is accepted by CVPR’25 as an Oral paper!
  • 2025.01:  🎉🎉 One paper on MoE-LLM compression (MC-MoE) and two papers on data efficiency and dynamic neural networks (InfoMax: data pruning; From-Layers-to-States: dynamic neural network layers) are accepted by ICLR’25!
  • 2024.12:  🎉🎉 One technical report (An Empirical Study of LLaMA3 Quantization: From LLMs to MLLMs) is accepted by Visual Intelligence!
  • 2024.12:  🎉🎉 One review on AI in wearable diabetes management is accepted by Advanced Intelligent Systems!
  • 2024.05:  🎉🎉 One paper on SNN security on RRAM (SNNGX) is accepted by ICCAD’24!
  • 2024.04:  🎉🎉 BiLLM is accepted by ICML’24!
  • 2024.02:  We release BiLLM: Pushing the Limit of Post-Training Quantization for LLMs, the first post-training quantization work pushing LLMs to nearly 1 bit. Please check our paper and code!

💬 Invited Talks and Reports

  • 2024.05: BiLLM was reported by IEEE Spectrum. Thanks to Matthew for the interview and report. Please see the link.
  • 2024.05: AI-Time online talk on BiLLM. Please see the video.
  • 2024.04: Our empirical study How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study (new version: An Empirical Study of LLaMA3 Quantization: From LLMs to MLLMs) was reported by QbitAI (量子位). Please see the link.
  • 2024.03: Our BiLLM: Pushing the Limit of Post-Training Quantization for LLMs was reported by QbitAI (量子位). Please see the link.

📝 Publications

ICML 2025

SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Qinshuo Liu, Xianglong Liu, Luca Benini, Michele Magno, Shiming Zhang, Xiaojuan Qi

  • A novel scheme that observes and verifies the structured clustering of salient elements in LLM weight matrices.
  • The first group-wise mixed-precision quantization framework for LLMs (a toy sketch of the idea follows this entry).
  • Serves as a plug-and-play approach for GPTQ/OmniQuant/…, improving these inference-friendly methods under low-bit quantization.
[paper] [code] [abstract]
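
For readers unfamiliar with group-wise mixed-precision quantization, here is a minimal toy sketch of the general idea only (not SliM-LLM’s actual algorithm): weights are split into groups, each group’s salience is estimated (here simply by mean absolute magnitude, a stand-in for the paper’s salience metric), and higher-salience groups receive more bits. All function and variable names are illustrative.

```python
import numpy as np

def quantize_group(w, bits):
    """Uniform symmetric quantization of a weight group to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax + 1e-12
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale  # return dequantized weights so the error can be inspected

def groupwise_mixed_precision(W, group_size=128, low_bits=2, high_bits=4):
    """Toy group-wise mixed precision: groups with above-median salience get
    `high_bits`, the rest get `low_bits` (salience here is just mean |w|)."""
    out = np.empty_like(W)
    n_groups = W.shape[1] // group_size
    salience = np.array([np.abs(W[:, g * group_size:(g + 1) * group_size]).mean()
                         for g in range(n_groups)])
    threshold = np.median(salience)
    for g in range(n_groups):
        cols = slice(g * group_size, (g + 1) * group_size)
        bits = high_bits if salience[g] >= threshold else low_bits
        out[:, cols] = quantize_group(W[:, cols], bits)
    return out

if __name__ == "__main__":
    W = np.random.randn(256, 1024).astype(np.float32)
    W_q = groupwise_mixed_precision(W)
    print("mean |W - W_q|:", np.abs(W - W_q).mean())
```
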
CVPR 2025 Oral

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si Liu

  • A novel dataset designed to enhance video reasoning by addressing the limitations of existing datasets in terms of scale and granularity.
  • We propose a Hybrid LVLMs Collaboration framework that achieves cost-effective and accurate video reasoning, outperforming baseline models on the majority of tasks in our proposed benchmark.
  • VideoEspresso sets a new starting point in video reasoning, offering rich annotations that facilitate advanced multimodal understanding.
ICLR 2025

MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More

Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, Xiaojuan Qi

  • Accurate weight-only quantization for MoE-LLMs (1.5–2.5 bits per weight).
  • Efficient online dynamic pruning (additional compression ratio > 10%).
  • MC-MoE integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with little accuracy loss, ensuring an optimal trade-off between performance and efficiency (a toy sketch of the bit-allocation and pruning idea follows this entry).
  • For instance, at 2.54 bits, MC-MoE compresses 76.6% of the model, with only a 3.8% average accuracy loss. During dynamic inference, we further reduce activated parameters by 15%, with a performance drop of less than 0.6%.
[paper] [code] [abstract]
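
As a rough illustration of the two ingredients above, the hypothetical sketch below gives more bits to more important experts under an average bit-width budget and drops weakly routed experts at inference time. It is a conceptual toy, not MC-MoE’s actual allocation or pruning criterion; all names and thresholds are made up.

```python
import numpy as np

def allocate_expert_bits(importance, budget_bits, choices=(1.5, 2.0, 2.5)):
    """Toy bit allocation: give more important experts a higher bit-width while
    keeping the average bit-width at or below `budget_bits`."""
    bits = np.full(len(importance), min(choices))
    for e in np.argsort(importance)[::-1]:          # most important experts first
        for b in sorted(choices, reverse=True):
            trial = bits.copy()
            trial[e] = b
            if trial.mean() <= budget_bits:
                bits = trial
                break
    return bits

def dynamic_expert_pruning(router_logits, k=2, rel_threshold=0.2):
    """Toy online pruning: route to the top-k experts, then drop any expert whose
    routing weight is below `rel_threshold` times the strongest expert's weight."""
    probs = np.exp(router_logits - router_logits.max())
    probs /= probs.sum()
    topk = np.argsort(probs)[::-1][:k]
    return [int(e) for e in topk if probs[e] >= rel_threshold * probs[topk[0]]]

if __name__ == "__main__":
    importance = np.random.rand(8)                  # e.g. expert routing frequency
    print("per-expert bits:", allocate_expert_bits(importance, budget_bits=2.0))
    print("active experts :", dynamic_expert_pruning(np.random.randn(8)))
```
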
ICLR 2025

Data Pruning by Information Maximization

Haoru Tan, Sitong Wu, Wei Huang, Shizhen Zhao, Xiaojuan Qi

  • A new coreset algorithm designed to maximize overall information by accounting for each sample’s individual contribution while reducing information overlap, with a simultaneous focus on maintaining diversity and importance (a toy greedy variant of this idea is sketched after this entry).
  • An efficient gradient-based solver, enhanced by sparsification techniques and dataset partitioning strategies, makes InfoMax scale to large datasets.
  • InfoMax consistently achieves the best performance and outperforms state-of-the-art schemes across a range of tasks, including image classification, vision-language pre-training, and large language model supervised fine-tuning.
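
The toy greedy selection below conveys only the “individual contribution minus redundancy” intuition; InfoMax itself uses a gradient-based solver with sparsification and dataset partitioning rather than this greedy loop, and the importance and similarity measures here are placeholders.

```python
import numpy as np

def greedy_coreset(features, importance, k, lam=0.5):
    """Toy coreset selection: repeatedly pick the sample whose importance minus
    its maximum similarity to the already-selected set is largest."""
    feats = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    selected = []
    max_sim = np.zeros(len(feats))          # current overlap with the coreset
    for _ in range(k):
        scores = importance - lam * max_sim
        scores[selected] = -np.inf          # never pick the same sample twice
        pick = int(np.argmax(scores))
        selected.append(pick)
        max_sim = np.maximum(max_sim, feats @ feats[pick])
    return selected

if __name__ == "__main__":
    X = np.random.randn(1000, 64)           # e.g. per-sample embeddings
    imp = np.random.rand(1000)              # e.g. per-sample loss or gradient norm
    print("selected indices:", greedy_coreset(X, imp, k=10))
```
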
ICLR 2025

From Layers to States: A State Space Model Perspective to Deep Neural Network Layer Dynamics

Qinshuo Liu, Weiqin Zhao, Wei Huang, Yanwen Fang, Lequan Yu, Guodong Li

  • For a deep neural network, we treat the outputs of layers as states of a continuous process and leverage state space models (SSMs) to design the aggregation of layers. To the best of our knowledge, this is the first time such a perspective has been presented (a toy linear-SSM illustration follows this entry).
  • This leads to a lightweight module, the Selective State Space Model Layer Aggregation (S6LA) module, which conceptualizes a neural network as a selective state space model (S6) and handles layer interactions through the selective mechanism used for long-sequence modeling.
  • Compared with other SOTA convolutional and transformer-based layer aggregation models, S6LA demonstrates superior performance in classification, detection, and instance segmentation tasks.
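
To make the layers-as-states view concrete, here is a minimal sketch that runs a plain linear state-space recurrence over per-layer feature vectors. It is only an illustration of the perspective, assuming random (untrained) A, B, C matrices; the actual S6LA module uses a selective, input-dependent SSM.

```python
import numpy as np

def ssm_layer_aggregation(layer_outputs, state_dim=32, seed=0):
    """Toy aggregation of per-layer features with a linear state-space recurrence:
    h_{l+1} = A h_l + B x_l,  y_l = C h_l  (A, B, C are random here, learned in practice)."""
    rng = np.random.default_rng(seed)
    d = layer_outputs[0].shape[-1]
    A = rng.normal(scale=0.1, size=(state_dim, state_dim))
    B = rng.normal(scale=0.1, size=(state_dim, d))
    C = rng.normal(scale=0.1, size=(d, state_dim))
    h = np.zeros(state_dim)
    aggregated = []
    for x in layer_outputs:                  # one feature vector per layer
        h = A @ h + B @ x
        aggregated.append(C @ h)
    return aggregated

if __name__ == "__main__":
    outputs = [np.random.randn(256) for _ in range(12)]   # 12 layers, 256-dim features
    print(len(ssm_layer_aggregation(outputs)), "aggregated features")
```
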
ICCAD 2024

SNNGX: Securing Spiking Neural Networks with Genetic XOR Encryption on RRAM-based Neuromorphic Accelerator

Kwunhang Wong, Songqi Wang, Wei Huang, Xinyuan Zhang, Yangu He, Karl M.H. Lai, Yuzhong Jiao, Ning Lin, Xiaojuan Qi, Xiaoming Chen, Zhongrui Wang

  • The first IP protection scheme specifically for SNNs, leveraging a genetic algorithm combined with classic XOR encryption to secure the networks against unauthorized access and tampering (the XOR mechanics are sketched after this entry).
  • A flexible solution for securing SNNs across various applications, especially in critical domains like biomedical applications where model security is paramount.
[paper] [code] [abstract]
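
The sketch below shows only the XOR mechanics on quantized weights: a subset of weight bytes is XOR-ed with a secret key, and the same operation restores them exactly. In SNNGX the positions to encrypt are chosen by a genetic algorithm; here they are random placeholders, so this is a conceptual illustration rather than the actual scheme.

```python
import numpy as np

def xor_encrypt(weights_int8, key, fraction=0.05, seed=0):
    """Toy XOR encryption of a fraction of int8 weight bytes.
    The positions are random here; the real scheme picks them with a genetic algorithm."""
    rng = np.random.default_rng(seed)
    flat = weights_int8.view(np.uint8).copy().ravel()
    idx = rng.choice(flat.size, size=int(fraction * flat.size), replace=False)
    flat[idx] ^= np.uint8(key)               # XOR is its own inverse
    return flat.view(np.int8).reshape(weights_int8.shape), idx

def xor_decrypt(encrypted, idx, key):
    """Apply the same XOR with the correct key to restore the original weights."""
    flat = encrypted.view(np.uint8).copy().ravel()
    flat[idx] ^= np.uint8(key)
    return flat.view(np.int8).reshape(encrypted.shape)

if __name__ == "__main__":
    w = np.random.randint(-128, 127, size=(64, 64), dtype=np.int8)
    enc, idx = xor_encrypt(w, key=0xA7)
    assert np.array_equal(xor_decrypt(enc, idx, key=0xA7), w)
    print("weights restored exactly with the correct key")
```
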
Visual Intelligence

An Empirical Study of LLaMA3 Quantization: From LLMs to MLLMs

Wei Huang, Xingyu Zheng, Xudong Ma, Haotong Qin, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, Michele Magno

  • Explores the performance of LLaMA3-series models under existing post-training quantization and LoRA-finetuning methods.
  • Points out the significant performance loss of LLaMA3-based MLLMs under low-bit post-training quantization.
  • Highlights the significant low-bit-width performance gap that needs to be bridged in future developments.
[paper] [code] [abstract]
ICML 2024

BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi

  • Compresses LLM weights to as low as 1.08–1.1 bits and exceeds the performance of previous quantization methods at 2-bit or even 3-bit.
  • Implements high-performance binary LLMs in post-training quantization (PTQ) mode, efficiently achieving near-1-bit LLM compression without additional training or backpropagation (a toy binarization sketch follows this entry).
[paper] [code] [abstract]
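
For a flavor of how an average bit-width slightly above 1 can arise, the toy sketch below binarizes every weight to 1 bit and spends a second residual bit only on the most salient columns (selected here by column norm, a simple stand-in for BiLLM’s Hessian-based salience and bell-shaped splitting). It is not BiLLM’s actual procedure.

```python
import numpy as np

def binarize(w):
    """Standard sign/scale binarization: w ~ alpha * sign(w), alpha = mean |w|."""
    alpha = np.abs(w).mean()
    return alpha * np.sign(w)

def binarize_with_salient_residual(W, salient_frac=0.1):
    """Toy near-1-bit scheme: every weight gets 1 bit; the most salient columns
    additionally get a second bit that encodes the residual error."""
    W_hat = np.apply_along_axis(binarize, 0, W)          # column-wise 1-bit pass
    norms = np.linalg.norm(W, axis=0)
    salient = np.argsort(norms)[::-1][: int(salient_frac * W.shape[1])]
    residual = W[:, salient] - W_hat[:, salient]
    W_hat[:, salient] += np.apply_along_axis(binarize, 0, residual)
    avg_bits = 1.0 + salient_frac                        # ~1.1 bits/weight (ignoring scales)
    return W_hat, avg_bits

if __name__ == "__main__":
    W = np.random.randn(512, 512).astype(np.float32)
    W_hat, bits = binarize_with_salient_residual(W)
    print(f"~{bits:.2f} bits/weight, mean abs error {np.abs(W - W_hat).mean():.4f}")
```
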
arXiv

On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks

Wei Huang, Haotong Qin, Yangdong Liu, Jingzhuo Liang, Yulun Zhang, Ying Li, Xianglong Liu

  • Combines IP-core-level runtime clock and power awareness with network sensitivity, achieving a better balance of computational efficiency and accuracy on edge devices (a toy bit-allocation sketch follows this entry).
  • Allows target networks to be compressed and deployed with high accuracy on edge chips with limited computational resources and ultra-low power consumption.
  • Efficiently performs online quantization and optimization without additional devices or data access.
[paper] [abstract]
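
As a conceptual illustration of trading accuracy sensitivity against hardware cost, the toy search below enumerates per-layer bit-widths and keeps the assignment with the lowest sensitivity-weighted error under a cost budget. The sensitivity and cost numbers are invented, and the paper’s on-chip, IP-core-level runtime awareness is not modeled here.

```python
import itertools
import numpy as np

def hardware_aware_bit_assignment(sensitivity, cost_per_bit, budget, choices=(2, 4, 8)):
    """Toy mixed-precision search: enumerate per-layer bit-widths and keep the
    assignment with the lowest total sensitivity-weighted error whose estimated
    hardware cost (e.g. energy or latency) stays within `budget`."""
    best, best_err = None, np.inf
    for assignment in itertools.product(choices, repeat=len(sensitivity)):
        cost = sum(cost_per_bit[b] for b in assignment)
        if cost > budget:
            continue
        # Proxy error: quantization error ~ 2^{-bits}, weighted by layer sensitivity.
        err = sum(s * 2.0 ** (-b) for s, b in zip(sensitivity, assignment))
        if err < best_err:
            best, best_err = assignment, err
    return best, best_err

if __name__ == "__main__":
    sensitivity = [3.0, 1.0, 0.5, 2.0]            # e.g. from a loss-perturbation probe
    cost_per_bit = {2: 1.0, 4: 1.8, 8: 3.5}       # e.g. measured energy per layer per bit-width
    print(hardware_aware_bit_assignment(sensitivity, cost_per_bit, budget=8.0))
```
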

📖 Education

  • 2023.09 - now, Ph.D. Student in the Department of Electrical and Electronic Engineering, The University of Hong Kong.
  • 2019.09 - 2023.06, B.Eng. in Computer Science, School of Computer Science and Engineering, Beihang University.

🗒️ Academic Services

  • Conference: ICLR, NeurIPS, ICML, ECCV, AISTATS, ICCV.
  • Journal: Neural Networks.
  • Program Committee member for Practical Deep Learning Workshop, IEEE CAI 2024.

🎖 Honors and Awards

  • 2023 Outstanding Graduate, Beihang University.
  • 2023 Outstanding Project of the 16th National College Student Innovation and Entrepreneurship Competition, China.
  • 2022 Outstanding Project of the 15th National College Student Innovation and Entrepreneurship Competition, China.

💻 Internships & Teaching Services

  • 2025.06 - 2025.09, Multimodal Large Language Model Intern, NVIDIA.
  • 2022.09 - 2023.01, AI Algorithm Intern (model inference acceleration), Enflame, China.
  • 2022.08 - 2023.01, TA for Frontiers in Artificial Intelligence, Beihang University.
  • 2022.08 - 2023.01, TA for Computer Hardware Basics, head of the TA team, Beihang University.
  • 2021.08 - 2022.01, TA for Computer Hardware Basics, head of the TA team, Beihang University.
  • 2021.03 - 2021.06, TA for Discrete Mathematics, head of the TA team, Beihang University.