Hi, this is Wei Huang (黄炜)’s website!

I am currently a Ph.D. student at HKU, supervised by Prof. Xiaojuan Qi and Prof. Shiming Zhang.

I obtained my bachelor’s degree in June 2023, supervised by Prof. Si Liu.

Now, I am fortunate to be interning at NVIDIA Research, working with Dr. Yukang Chen under the supervision of Prof. Song Han. I am also guided by Dr. Hongxu Yin and Dr. Sifei Liu.

I focus on efficient & tiny deep learning for lightweight, long-sequence, and fast AI.

This direction covers, but is not limited to, the following topics:

🚀 Efficient Compression: Compression of LLMs, VLMs, and diffusion models (ultra-low-bit quantization, pruning, and sparsity).

🧠 Efficient Reasoning: Reinforcement learning for long-sequence, low-cost reasoning in LLMs and VLMs.

🎬 Efficient Generation: Real-time and interactive long-video generation.

Wearable AI: Edge AI for wearable contexts and for sensitive organic electrochemical transistors (OECTs).

🔥 Brain-inspired Computing: Neuromorphic computing and hardware acceleration (e.g., spiking neural networks, SNNs).

🔥 News

  • 2025.09:  🎉🎉 Two papers are accepted by NeurIPS’25! One on scaling long-video reasoning (Long-RL: Scaling RL to Long Videos) and one on a unified reasoning model (MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO). All code is open-sourced now!
  • 2025.05:  🎉🎉 One paper on structured mixed-precision low-bit quantization for LLMs (SliM-LLM) is accepted by ICML’25! The code is open-sourced now!
  • 2025.02:  🎉🎉 One paper on an efficient fine-grained chain-of-thought video understanding framework (VideoEspresso) is accepted by CVPR’25 as an Oral paper (top 0.73%)! The code is open-sourced now!
  • 2025.01:  🎉🎉 Three papers are accepted by ICLR’25! One on MoE-LLM compression (MC-MoE) and two on data efficiency and dynamic neural networks (InfoMax on data pruning; From-Layers-to-States on dynamic neural network layers). All code is open-sourced now!
  • 2024.12:  🎉🎉 One technical report is accepted by Visual Intelligence!
  • 2024.05:  🎉🎉 One paper on SNN security on RRAM is accepted by ICCAD’24! The code is open-sourced now!
  • 2024.04:  🎉🎉 One paper on post-training binary quantization of LLMs is accepted by ICML’24! The code is open-sourced now!

💬 Invited Talks and Reports

  • 2025.07: Our Scaling RL to Long Videos was reported by 机器之心. Please see the link.
  • 2025.06: AI-Time online talk on VideoEspresso. Please see the video.
  • 2024.05: BiLLM was reported by IEEE Spectrum. Thanks to Matthew for the interview and report. Please see the link.
  • 2024.05: AI-Time online talk on BiLLM. Please see the video.
  • 2024.04: Our empirical study How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study (new version: An Empirical Study of LLaMA3 Quantization: From LLMs to MLLMs) was reported by QbitAI (量子位). Please see the link.
  • 2024.03: Our BiLLM: Pushing the Limit of Post-Training Quantization for LLMs was reported by QbitAI (量子位). Please see the link.

📝 Publications

NeurIPS 2025

Scaling RL to Long Videos

Yukang Chen*, Wei Huang*, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han

  • MR-SP infrastructure with sequence parallelism and vLLM-based cached-embedding rollouts, enabling RL on up to 8,192 frames (hour-long videos) on 8×A100 GPUs with a 2.1× speedup, and strong results with LongVILA-R1-7B (VideoMME 65.1% without subtitles, 71.1% with subtitles).
  • Two-stage pipeline combining CoT-SFT and RL to scale reasoning for long-horizon video understanding.
  • LongVideo-Reason (104K long-video QA pairs) with high-quality chain-of-thought annotations across diverse domains.
[paper] [code] [abstract]
ICML 2025

SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Qinshuo Liu, Xianglong Liu, Luca Benini, Michele Magno, Shiming Zhang, Xiaojuan Qi

  • A novel scheme that observes and proves the structured clustering of salient elements in LLM weight matrices.
  • The first group-wise mixed-precision quantization framework for LLMs.
  • Serves as a plug-and-play approach for GPTQ, OmniQuant, and other methods, improving these inference-friendly methods under low-bit quantization.
[paper] [code] [abstract]
CVPR 2025 Oral

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si Liu

  • A novel dataset designed to enhance video reasoning by addressing the limitations of existing datasets in terms of scale and granularity.
  • A Hybrid LVLMs Collaboration framework that achieves cost-effective and accurate video reasoning, outperforming baseline models on the majority of tasks in our proposed benchmark.
  • VideoEspresso sets a new starting point in video reasoning, offering rich annotations that facilitate advanced multimodal understanding.
ICLR 2025

MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More

Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, Xiaojuan Qi

  • MC-MoE enables accurate weight-only quantization (1.5–2.5 bits per weight).
  • MC-MoE enables efficient online dynamic pruning (additional compression ratio > 10%).
  • MC-MoE integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with minimal accuracy loss, ensuring an optimal trade-off between performance and efficiency.
  • For instance, at 2.54 bits, MC-MoE compresses 76.6% of the model, with only a 3.8% average accuracy loss. During dynamic inference, we further reduce activated parameters by 15%, with a performance drop of less than 0.6%.
[paper] [code] [abstract]
Visual Intelligence

An Empirical Study of LLaMA3 Quantization: From LLMs to MLLMs

Wei Huang, Xingyu Zheng, Xudong Ma, Haotong Qin, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, Michele Magno

  • Explores the performance of LLaMA3-series models under existing post-training quantization and LoRA fine-tuning methods.
  • Points out the significant performance loss of LLaMA3-based MLLMs under low-bit post-training quantization.
  • Highlights the significant low-bit-width performance gap that needs to be bridged in future developments.
[paper] [code] [abstract]
ICML 2024

BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi

  • Compresses LLM weights to as low as 1.08–1.1 bits, exceeding the performance of previous quantization methods at 2-bit or even 3-bit.
  • Implements high-performance binary LLMs in a post-training quantization (PTQ) setting, efficiently achieving 1-bit LLM compression without additional training or backpropagation.
[paper] [code] [abstract]
arXiv

On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks

Wei Huang, Haotong Qin, Yangdong Liu, Jingzhuo Liang, Yulun Zhang, Ying Li, Xianglong Liu

  • Combines IP-core-level awareness of chip runtime clock and power with network sensitivity, achieving a better balance of computational efficiency and accuracy on edge devices.
  • Allows target networks to be compressed and deployed with high accuracy on edge chips with limited computational resources and ultra-low power consumption.
  • Efficiently performs online quantization and optimization without additional devices or data access.
[paper] [abstract]

📖 Education

  • 2023.09 - Now, Ph.D. Student, Department of Electrical and Electronic Engineering, The University of Hong Kong.
  • 2019.09 - 2023.06, B.Eng. in Computer Science, School of Computer Science and Engineering, Beihang University.

🗒️ Academic Services

  • Conference Reviewer: ICLR, NeurIPS, ICML, ECCV, AISTATS, ICCV.
  • Journal Reviewer: Neural Networks.
  • Program Committee member for Practical Deep Learning Workshop, IEEE CAI 2024.

🎖 Honors and Awards

  • 2023 Outstanding Graduate, Beihang University.
  • 2023 Outstanding Project of the 16th National College Student Innovation and Entrepreneurship Competition, China.
  • 2022 Outstanding Project of the 15th National College Student Innovation and Entrepreneurship Competition, China.

💻 Internships & Teaching Services

  • 2025.06 - Now, Multimodal Large Language Model Intern, NVIDIA.
  • 2022.09 - 2023.01, AI Algorithm Intern (model inference acceleration), Enflame, China.
  • 2022.08 - 2023.01, TA for Frontiers in Artificial Intelligence, Beihang University.
  • 2022.08 - 2023.01, TA for Computer Hardware Basics (head of the TA team), Beihang University.
  • 2021.08 - 2022.01, TA for Computer Hardware Basics (head of the TA team), Beihang University.
  • 2021.03 - 2021.06, TA for Discrete Mathematics (head of the TA team), Beihang University.