Hi, this is Wei Huang (黄炜)'s website! I have been a Ph.D. student in the Computer Vision and Machine Intelligence Lab (CVMI Lab) @ HKU and the Wearable, Intelligent and Soft Electronics Lab (WISE Lab) since September 2023, advised by Prof. Xiaojuan Qi and Prof. Shiming Zhang. I am also co-supervised by Prof. Zhongrui Wang. Previously, I obtained my bachelor's degree in computer science (June 2023) from Beihang University, where I was advised by Prof. Si Liu and also worked with Prof. Xianglong Liu.
I am currently conducting research on efficient/tiny deep learning and its applications, including:
🚀 Efficient AI: efficiency of large language/vision-language models and diffusion models (e.g., model quantization/binarization).
🔥 Brain-inspired Computing: neuromorphic computing and hardware acceleration (e.g., spiking neural networks, SNNs).
⌚ Edge AI: edge AI for wearables and digital health.
I'm actively seeking internship and visiting opportunities. If you have any available, I would greatly appreciate it if you could reach out. Thank you!
🔥 News
- 2024.10: Release MC-MoE, a mixture compressor for MoE LLMs that combines static quantization and dynamic pruning. Please check our paper and code!
- 2024.05: 🎉🎉 One co-authored paper is accepted by ICCAD'24!
- 2024.05: Release SliM-LLM, a plug-and-play group-wise mixed-precision quantization framework for 2-bit LLMs. Please check our paper, code and huggingface!
- 2024.04: Release An Empirical Study of LLaMA3 Quantization: From LLMs to MLLMs, an empirical study of the performance of low-bit quantized LLMs/MLLMs based on LLaMA-3. Please check our paper, code and huggingface!
- 2024.04: 🎉🎉 BiLLM is accepted by ICML’24!
- 2024.02: Release BiLLM: Pushing the Limit of Post-Training Quantization for LLMs, the first post-training quantization work pushing LLMs to nearly 1 bit. Please check our paper and code!
- 2023.09: Release OHQ, an on-chip hardware-aware mixed-precision quantization method. Please check our paper!
- 2022.10: Release VLSNR, a multimodal news recommendation system. Please check our paper and code!
💬 Invited Talks and Reports
- 2024.05: BiLLM was reported by IEEE Spectrum. Thanks to Matthew for the interview and report. Please see the link.
- 2024.05: AI-Time online talk on BiLLM. Please see the video.
- 2024.04: Our empirical study How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study (new version: An Empirical Study of LLaMA3 Quantization: From LLMs to MLLMs) was reported by QbitAI (量子位). Please see the link.
- 2024.03: Our BiLLM: Pushing the Limit of Post-Training Quantization for LLMs was reported by QbitAI (量子位). Please see the link.
📝 Publications
arXiv
MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More
Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, Xiaojuan Qi
- MC-MoE enables accurate weight-only quantization (weights at 1.5~2.5 bits).
- MC-MoE performs efficient online dynamic pruning (additional compression ratio > 10%).
- MC-MoE integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with less accuracy loss, ensuring an optimal trade-off between performance and efficiency.
- For instance, at 2.54 bits, MC-MoE compresses 76.6% of the model, with only a 3.8% average accuracy loss. During dynamic inference, we further reduce activated parameters by 15%, with a performance drop of less than 0.6%.
[paper]
[code]
[abstract]
Mixture-of-Experts large language models (MoE-LLMs) mark a significant step forward for language models; however, they encounter two critical challenges in practice: 1) expert parameters lead to considerable memory consumption and loading latency; and 2) the current activated experts are redundant, as many tokens may only require a single expert. Motivated by these issues, we investigate the MoE-LLMs and make two key observations: a) different experts exhibit varying behaviors on activation reconstruction error, routing scores, and activated frequencies, highlighting their differing importance, and b) not all tokens are equally important -- only a small subset is critical. Building on these insights, we propose MC-MoE, a training-free Mixture-Compressor for MoE-LLMs, which leverages the significance of both experts and tokens to achieve extreme compression. First, to mitigate storage and loading overheads, we introduce Pre-Loading Mixed-Precision Quantization, which formulates the adaptive bit-width allocation as a Linear Programming problem, where the objective function balances multiple factors reflecting the importance of each expert. Additionally, we develop Online Dynamic Pruning, which identifies important tokens to retain and dynamically selects activated experts for other tokens during inference to optimize efficiency while maintaining performance. Our MC-MoE integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with less accuracy loss, ensuring an optimal trade-off between performance and efficiency. Extensive experiments confirm the effectiveness of our approach. For instance, at 2.54 bits, MC-MoE compresses 76.6% of the model, with only a 3.8% average accuracy loss. During dynamic inference, we further reduce activated parameters by 15%, with a performance drop of less than 0.6%.
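To make the pre-loading quantization step above concrete, here is a minimal sketch of allocating per-expert bit-widths with a linear-programming relaxation. The importance scores, candidate bit set, and budget are illustrative assumptions rather than the released MC-MoE implementation (the paper's objective combines reconstruction error, routing scores, and activation frequency).

```python
# Hypothetical sketch: per-expert bit allocation via an LP relaxation (SciPy).
# Importance scores and the candidate bit set are illustrative, not MC-MoE's code.
import numpy as np
from scipy.optimize import linprog

num_experts = 8
bit_choices = np.array([1.5, 2.0, 2.5])           # candidate per-expert bit-widths
importance = np.random.rand(num_experts)          # placeholder expert-importance scores
target_avg_bits = 2.0                             # average-bit budget across experts

# x[e, b] in [0, 1]: fraction of expert e assigned bit_choices[b].
# Maximize importance-weighted bits  <=>  minimize the negative objective.
c = -(importance[:, None] * bit_choices[None, :]).ravel()

# Each expert's assignment fractions sum to 1.
A_eq = np.kron(np.eye(num_experts), np.ones(len(bit_choices)))
b_eq = np.ones(num_experts)

# The average assigned bit-width stays within the budget.
A_ub = np.tile(bit_choices, num_experts)[None, :] / num_experts
b_ub = np.array([target_avg_bits])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=(0, 1), method="highs")
assignment = res.x.reshape(num_experts, len(bit_choices))
chosen_bits = bit_choices[assignment.argmax(axis=1)]   # round the relaxation per expert
print("per-expert bits:", chosen_bits)
```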
arXiv
SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models
Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Xianglong Liu, Luca Benini, Michele Magno, Xiaojuan Qi
- A novel scheme that observes and verifies the structured clustering of salient elements in LLM weight matrices.
- The first group-wise mixed-precision quantization framework for LLMs.
- Serves as a plug-and-play approach for GPTQ/OmniQuant/…, improving these inference-friendly methods under low-bit quantization.
[paper]
[code]
[abstract]
Large language models (LLMs) achieve remarkable performance in natural language understanding but require substantial computation and memory resources. Post-training quantization (PTQ) is a powerful compression technique extensively investigated in LLMs. However, existing PTQ methods are still not ideal in terms of accuracy and efficiency, especially at bit-widths below 4. Standard PTQ methods using group-wise quantization struggle to quantize LLMs accurately at such low bit-widths, while advanced methods that retain high-precision weights element-wise find it hard to realize their theoretical hardware efficiency. This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM. The scheme exploits the salience distribution of weights to determine optimal bit-width and quantizers for accurate LLM quantization, while aligning bit-width partition to groups for compact memory usage and fast integer inference. Specifically, the proposed SliM-LLM mainly relies on two novel techniques: (1) Salience-Determined Bit Allocation utilizes the clustering characteristics of salience distribution to allocate the bit-widths of each group, increasing the accuracy of quantized LLMs and maintaining the inference efficiency; (2) Salience-Weighted Quantizer Calibration optimizes the parameters of the quantizer by considering the element-wise salience within the group, balancing the maintenance of salient information and minimization of errors. Comprehensive experiments show that SliM-LLM significantly improves the accuracy of LLMs at ultra-low bits, e.g., 2-bit LLaMA-7B achieves a 5.5-times memory saving compared to the original model on NVIDIA A800 GPUs, and a 48% decrease in perplexity compared to the state-of-the-art gradient-free PTQ method. Moreover, SliM-LLM+, which extends SliM-LLM with gradient-based quantizers, further reduces perplexity by 35.1%.
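As a rough illustration of the group-wise idea, the sketch below assigns higher bit-widths to higher-salience weight groups under a fixed average-bit budget and then applies plain round-to-nearest per group. The salience proxy, group size, and bit split are assumptions, not SliM-LLM's actual criteria.

```python
# Hedged sketch of salience-driven, group-wise bit allocation (all numbers illustrative).
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)    # one weight matrix
act_sq = rng.random(1024).astype(np.float32)                # calibration statistic per input channel
group_size = 128
num_groups = W.shape[1] // group_size

# Per-group salience: weight energy scaled by the activation statistic,
# a stand-in for the paper's salience definition.
salience = np.array([
    float((W[:, g * group_size:(g + 1) * group_size] ** 2
           * act_sq[g * group_size:(g + 1) * group_size]).sum())
    for g in range(num_groups)
])

# Keep the average at 2 bits: the most salient quarter of groups gets 3 bits,
# the least salient quarter gets 1 bit, and the rest stay at 2 bits.
order = np.argsort(salience)
bits = np.full(num_groups, 2, dtype=int)
quarter = num_groups // 4
bits[order[-quarter:]] = 3
bits[order[:quarter]] = 1

def quantize_group(w, n_bits):
    """Plain asymmetric round-to-nearest inside one group."""
    qmax = 2 ** n_bits - 1
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = np.clip(np.round((w - lo) / scale), 0, qmax)
    return (q * scale + lo).astype(np.float32)

W_q = np.concatenate(
    [quantize_group(W[:, g * group_size:(g + 1) * group_size], bits[g])
     for g in range(num_groups)], axis=1)
print("average bits:", bits.mean(), "| reconstruction MSE:", float(((W - W_q) ** 2).mean()))
```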
ICCAD 2024
SNNGX: Securing Spiking Neural Networks with Genetic XOR Encryption on RRAM-based Neuromorphic Accelerator
Kwunhang Wong, Songqi Wang, Wei Huang, Xinyuan Zhang, Yangu He, Karl M.H. Lai, Yuzhong Jiao, Ning Lin, Xiaojuan Qi, Xiaoming Chen, Zhongrui Wang
- The first IP protection scheme specifically for SNNs, leveraging a genetic algorithm combined with classic XOR encryption to secure the networks against unauthorized access and tampering.
- A flexible solution for securing SNNs across various applications, especially in critical domains like biomedical applications where model security is paramount.
[paper]
[code]
[abstract]
Biologically plausible Spiking Neural Networks (SNNs), characterized by spike sparsity, are gaining tremendous attention for intelligent edge devices and critical biomedical applications, compared to artificial neural networks (ANNs). However, there is a considerable risk from malicious attempts to extract white-box information (i.e., weights) from SNNs, as attackers could exploit well-trained SNNs for profit and white-box adversarial concerns. There is a dire need for intellectual property (IP) protective measures. In this paper, we present a novel secure software-hardware co-designed RRAM-based neuromorphic accelerator for protecting the IP of SNNs. Software-wise, we design a tailored genetic algorithm with classic XOR encryption to target the least number of weights that need encryption. From a hardware perspective, we develop a low-energy decryption module, meticulously designed to provide zero decryption latency. Extensive results from various datasets, including NMNIST, DVSGesture, EEGMMIDB, Braille Letter, and SHD, demonstrate that our proposed method effectively secures SNNs by encrypting a minimal fraction of stealthy weights, only 0.00005% to 0.016% of weight bits. Additionally, it achieves a substantial reduction in energy consumption, ranging from 59x to 6780x, and significantly lowers decryption latency, ranging from 175x to 4250x. Moreover, our method requires as little as one sample per class in the dataset for encryption and addresses problems to which Hessian/gradient-based search is insensitive. This strategy offers a highly efficient and flexible solution for securing SNNs in diverse applications.
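The sketch below illustrates only the XOR step on a minimal subset of weight bits: it flips selected IEEE-754 bits with a key and recovers them by XOR-ing again. The random index selection and the key are placeholders for the genetic search and hardware decryption module described in the paper.

```python
# Illustrative XOR encryption of a tiny fraction of weight bits (NumPy only).
# The random index selection and key are placeholders for SNNGX's genetic search
# and on-accelerator decryption module.
import numpy as np

rng = np.random.default_rng(42)
weights = rng.standard_normal(100_000).astype(np.float32)

# Pick a minimal set of "stealthy" weights to encrypt.
num_encrypted = max(1, int(1e-4 * weights.size))
idx = rng.choice(weights.size, size=num_encrypted, replace=False)

key = np.uint32(0x80400000)  # hypothetical mask flipping the sign bit and one exponent bit

def xor_bits(values, mask):
    """XOR the raw IEEE-754 bit patterns of float32 values with a 32-bit mask."""
    return (values.view(np.uint32) ^ mask).view(np.float32)

encrypted = weights.copy()
encrypted[idx] = xor_bits(weights[idx], key)     # this version is what would be shipped
restored = encrypted.copy()
restored[idx] = xor_bits(encrypted[idx], key)    # XOR with the same key is its own inverse

assert np.array_equal(restored, weights)
print(f"encrypted {num_encrypted} of {weights.size} weights")
```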
arXiv
An Empirical Study of LLaMA3 Quantization: From LLMs to MLLMs
Wei Huang, Xingyu Zheng, Xudong Ma, Haotong Qin, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, Michele Magno
- Explore the performance of LLaMA3 series models under existing post-training quantization and LoRA-finetuning methods.
- Point out the significant performance loss of MLLMs based on LLaMA3 under low-bit post-training quantization.
- Highlight the significant performance gap at low bit-widths that needs to be bridged in future developments.
[paper]
[code]
[abstract]
The LLaMA family has become one of the most powerful open-source Large Language Models (LLMs) and a popular LLM backbone for Multimodal Large Language Models (MLLMs), widely applied in Computer Vision (CV) and Natural Language Understanding (NLU) tasks. Notably, LLaMA3 models have recently been released and achieve impressive performance across various tasks with super-large-scale pre-training on over 15T tokens of data. Given the wide application of low-bit quantization for LLMs in resource-limited scenarios, we explore LLaMA3's capabilities when quantized to low bit-widths. This exploration can potentially unveil new insights and challenges for low-bit quantization of LLaMA3 and other forthcoming LLMs, especially in addressing the performance degradation problems encountered in LLM compression. Specifically, we comprehensively evaluate 10 existing post-training quantization and LoRA-finetuning methods on LLaMA3 at 1-8 bits and diverse datasets to reveal LLaMA3's low-bit quantization performance. To uncover the capabilities of low-bit quantized MLLMs, we assess the performance of the LLaMA3-based LLaVA-Next-8B model under 2-4 ultra-low bits with post-training quantization methods. Our experimental results indicate that LLaMA3 still suffers non-negligible degradation in linguistic and visual contexts, particularly at ultra-low bit-widths. This highlights the significant performance gap at low bit-widths that needs to be bridged in future developments. We expect that this empirical study will prove valuable in advancing future models, driving LLMs and MLLMs to achieve higher accuracy at lower bit-widths to enhance practicality.
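For context, a typical evaluation loop behind such perplexity numbers looks roughly like the following; the checkpoint name, evaluation text file, and window size are assumptions, and the quantization method itself (GPTQ, AWQ, etc.) would be applied to the model before this measurement.

```python
# Rough perplexity-evaluation loop (checkpoint, text file, and window length assumed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"        # assumed checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto")
model.eval()                                     # quantization (GPTQ/AWQ/...) would happen before this

text = open("eval_corpus.txt").read()            # placeholder evaluation text
ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

window = 2048
nlls = []
with torch.no_grad():
    for start in range(0, ids.size(1) - window, window):
        chunk = ids[:, start:start + window]
        loss = model(chunk, labels=chunk).loss   # mean next-token negative log-likelihood
        nlls.append(loss)
ppl = torch.exp(torch.stack(nlls).mean())        # exp of the average NLL over windows
print("perplexity:", ppl.item())
```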
ICML 2024
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi
- Compresses LLM weights to as low as 1.08-1.1 bits and exceeds the performance of previous quantization methods at 2-bit or even 3-bit.
- Implements a high-performance binary LLM in PTQ mode, efficiently achieving 1-bit LLM compression without additional training or backpropagation.
[paper]
[code]
[abstract]
Pretrained large language models (LLMs) exhibit exceptional general language processing capabilities but come with significant demands on memory and computational resources. As a powerful compression technology, binarization can reduce model weights to a mere 1 bit, lowering the expensive computation and memory requirements. However, existing quantization techniques fall short of maintaining LLM performance under ultra-low bit-widths. In response to this challenge, we present BiLLM, a groundbreaking 1-bit post-training quantization scheme tailored for pretrained LLMs. Based on the weight distribution of LLMs, BiLLM first identifies and structurally selects salient weights, and minimizes the compression loss through an effective binary residual approximation strategy. Moreover, considering the bell-shaped distribution of the non-salient weights, we propose an optimal splitting search to group and binarize them accurately. BiLLM achieves, for the first time, high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families and evaluation metrics, outperforming SOTA quantization methods for LLMs by significant margins. Moreover, BiLLM enables the binarization of an LLM with 7 billion weights within 0.5 hours on a single GPU, demonstrating satisfactory time efficiency.
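A simplified sketch of the residual-binarization idea: salient columns get a second binary pass on their residual, while non-salient columns get a single pass. The energy-based salience criterion and per-column scales below are stand-ins for BiLLM's Hessian-aware selection and bell-shaped splitting search.

```python
# Simplified residual binarization: salient columns get a second binary pass.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)

def binarize(w):
    """One-bit approximation with a per-column scale: alpha * sign(w)."""
    alpha = np.abs(w).mean(axis=0, keepdims=True)
    return alpha * np.sign(w)

# Treat the highest-energy columns as "salient"
# (BiLLM's selection is Hessian-aware and structural; this is only a proxy).
col_energy = (W ** 2).sum(axis=0)
salient = col_energy >= np.quantile(col_energy, 0.9)

W_hat = np.empty_like(W)
W_hat[:, ~salient] = binarize(W[:, ~salient])                 # single pass for non-salient columns
first = binarize(W[:, salient])
W_hat[:, salient] = first + binarize(W[:, salient] - first)   # binarize the residual, too

print("relative error:", float(np.linalg.norm(W - W_hat) / np.linalg.norm(W)))
```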
arXiv
On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks
Wei Huang, Haotong Qin, Yangdong Liu, Jingzhuo Liang, Yulun Zhang, Ying Li, Xianglong Liu
- Combine IP-core-level chip runtime clock and power awareness with network sensitivity, achieving a better balance of computational efficiency and accuracy on edge devices.
- Allow target networks to be compressed and deployed with high accuracy on edge chips with limited computational resources and ultra-low power consumption.
- Efficiently perform online quantization and optimization without additional devices or data access.
[paper]
[abstract]
Low-bit quantization emerges as one of the most promising compression approaches for deploying deep neural networks on edge devices. Mixed-precision quantization leverages a mixture of bit-widths to unleash the accuracy and efficiency potential of quantized models. However, existing mixed-precision quantization methods rely on simulations on high-performance devices to achieve accuracy and efficiency trade-offs in immense search spaces. This leads to a non-negligible gap between the estimated efficiency metrics and the actual hardware, which keeps quantized models far from optimal accuracy and efficiency and also causes the quantization process to rely on additional high-performance devices. In this paper, we propose an On-Chip Hardware-Aware Quantization (OHQ) framework, performing hardware-aware mixed-precision quantization on deployed edge devices to achieve accurate and efficient computing. Specifically, for efficiency metrics, we build an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator and avoid optimization errors caused by inaccurate simulation. For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario, getting rid of the dependence of the quantization process on high computing power. By synthesizing insights from quantized models and hardware through linear optimization, we can obtain optimized bit-width configurations to achieve outstanding performance on accuracy and efficiency. We evaluate inference accuracy and acceleration with quantization for various architectures and compression ratios on hardware. OHQ achieves 70% and 73% accuracy for ResNet-18 and MobileNetV3, respectively, and can reduce latency by 15~30% compared to INT8 in real deployment.
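The paper solves the bit-width assignment with linear optimization over on-chip measurements; the snippet below substitutes a simpler greedy trade-off between layer sensitivity and measured latency, just to illustrate the kind of decision OHQ automates (all numbers are made up).

```python
# Greedy stand-in for the bit-width assignment OHQ solves with linear optimization.
# Sensitivity scores and on-chip latency measurements below are made up.
layers = ["conv1", "layer1", "layer2", "layer3", "layer4", "fc"]
sensitivity = {"conv1": 0.9, "layer1": 0.4, "layer2": 0.5,
               "layer3": 0.7, "layer4": 0.8, "fc": 0.3}
latency_ms = {  # hypothetical measured latency per (bit-width, layer)
    4: {"conv1": 0.6, "layer1": 1.0, "layer2": 1.2, "layer3": 1.5, "layer4": 1.8, "fc": 0.4},
    8: {"conv1": 1.1, "layer1": 1.9, "layer2": 2.3, "layer3": 2.9, "layer4": 3.4, "fc": 0.7},
}
latency_budget_ms = 9.0

bits = {name: 4 for name in layers}              # start from the cheapest configuration
total = sum(latency_ms[4][n] for n in layers)

# Repeatedly upgrade the layer with the best sensitivity gain per extra millisecond.
while True:
    candidates = [(sensitivity[n] / (latency_ms[8][n] - latency_ms[4][n]), n)
                  for n in layers
                  if bits[n] == 4 and total + latency_ms[8][n] - latency_ms[4][n] <= latency_budget_ms]
    if not candidates:
        break
    _, best = max(candidates)
    total += latency_ms[8][best] - latency_ms[4][best]
    bits[best] = 8

print(bits, f"| estimated latency: {total:.1f} ms")
```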
arXiv
VLSNR: Vision-Linguistics Coordination Time Sequence-aware News Recommendation
Songhao Han*, Wei Huang*, Xiaotian Luan*
- Constructs a large-scale multimodal news recommendation dataset, V-MIND, which facilitates future research on multimodal news recommendation and improves the learning results of VLSNR.
- Integrates visual and textual news information in time series to learn click-preference trends.
[paper]
[code]
[abstract]
News representation and user-oriented modeling are both essential for news recommendation. Most existing methods are based on textual information but ignore the visual information and users' dynamic interests. However, compared to text-only content, multimodal semantics is beneficial for enhancing the comprehension of users' temporal and long-lasting interests. In our work, we propose a vision-linguistics coordinated, time-sequence-aware news recommendation framework. Firstly, a pretrained multimodal encoder is applied to embed images and texts into the same feature space. Then a self-attention network is used to learn the chronological sequence. Additionally, an attentional GRU network is proposed to adequately model user preference over time. Finally, the click history and user representation are embedded to calculate the ranking scores for candidate news. Furthermore, we also construct a large-scale multimodal news recommendation dataset, V-MIND. Experimental results show that our model outperforms baselines and achieves SOTA on our independently constructed dataset.
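A condensed PyTorch skeleton of the pipeline described above (multimodal fusion, temporal self-attention, attentional GRU, dot-product ranking); dimensions, module choices, and the fusion layer are assumptions rather than the released VLSNR architecture.

```python
# Hedged skeleton of a VLSNR-style pipeline; not the released implementation.
import torch
import torch.nn as nn

class VLSNRSketch(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Stand-in for a pretrained multimodal encoder mapping image+text
        # features of each news item into a shared space.
        self.fuse = nn.Linear(2 * dim, dim)
        # Self-attention over the chronological click sequence.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Attentional GRU modeling the evolution of user preference over time.
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.query = nn.Linear(dim, 1)

    def encode_news(self, img_feat, txt_feat):
        return self.fuse(torch.cat([img_feat, txt_feat], dim=-1))

    def forward(self, hist_img, hist_txt, cand_img, cand_txt):
        hist = self.encode_news(hist_img, hist_txt)          # (B, T, D) clicked history
        hist, _ = self.attn(hist, hist, hist)                 # temporal self-attention
        seq, _ = self.gru(hist)                               # (B, T, D)
        weights = torch.softmax(self.query(seq), dim=1)       # additive attention over time
        user = (weights * seq).sum(dim=1)                     # (B, D) user representation
        cand = self.encode_news(cand_img, cand_txt)           # (B, N, D) candidate news
        return torch.einsum("bd,bnd->bn", user, cand)         # ranking scores

# Shapes only; real inputs would come from a pretrained CLIP-style encoder.
model = VLSNRSketch()
scores = model(torch.randn(2, 10, 512), torch.randn(2, 10, 512),
               torch.randn(2, 5, 512), torch.randn(2, 5, 512))
print(scores.shape)  # torch.Size([2, 5])
```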
📖 Educations
- 2023.09 - now, Ph.D. Student in the Department of Electrical and Electronic Engineering, The University of Hong Kong.
- 2019.09 - 2023.06, B.Eng. in Computer Science, School of Computer Science and Engineering, Beihang University.
🗒️ Academic Services
- Conference: ICLR, NeurIPS, ICML, ECCV, AISTATS.
- Journal: Neural Networks.
- Program Committee member for Practical Deep Learning Workshop, IEEE CAI 2024.
🎖 Honors and Awards
- 2019-2023(B.Eng.):
Outstanding Graduate, Beihang University (2023).
Outstanding Project of the 16th National College Student Innovation and Entrepreneurship Competition, China (2023).
Outstanding Project of the 15th National College Student Innovation and Entrepreneurship Competition, China (2022).
Second-class of the Social Practice Scholarship, Beihang University (2022).
Third-class of the Innovation and Entrepreneurship Scholarship, Beihang University (2021).
Second-class of the Subject Competition Scholarship, Beihang University (2022).
3rd Prize of the 32nd “Feng Ru Cup” Competition (2022).
Second-class scholarship, Beihang University (2022).
3rd “Lan Qiao Cup” programming competition (Python), Beijing (2022).
Second-class of the Social Practice Scholarship, Beihang University (2021).
Second-class of the Subject Competition Scholarship, Beihang University (2021).
Outstanding Teaching Assistant, Beihang University (2021).
2nd Prize of the 31st “Feng Ru Cup” Competition (2020).
First-class scholarship, Beihang University (2020).
💻 Internships & Teaching Services
- 2022.09 - 2023.01, AI algorithm internship on model inference acceleration, Enflame, China.
- 2022.08 - 2023.01, TA for Frontiers in Artificial Intelligence, Beihang University.
- 2022.08 - 2023.01, TA for Computer Hardware Basics (head of the TA team), Beihang University.
- 2021.08 - 2022.01, TA for Computer Hardware Basics (head of the TA team), Beihang University.
- 2021.03 - 2021.06, TA for Discrete Mathematics (head of the TA team), Beihang University.