LLaVA-FA: Learning Fourier Approximation for Compressing Large Multimodal Models¶

Conference: ICLR 2026
arXiv: 2602.00135
Code: None
Area: Multimodal Large Language Models
Keywords: Model Compression, Fourier Transform, Low-Rank Decomposition, Quantization, Multimodal Language Models

TL;DR¶

This paper proposes LLaVA-FA, an efficient compression method for large multimodal models (LMMs) that performs joint low-rank and quantization weight approximation in the frequency domain. By exploiting the decorrelation property and conjugate symmetry of the Fourier transform, the method achieves more compact and accurate weight representations. It further introduces PolarQuant (polar coordinate quantization) and ODC (Optional Diagonal Calibration), surpassing existing efficient multimodal models on multiple benchmarks with minimal active parameters and computational cost.

Background & Motivation¶

Large multimodal models (LMMs) demonstrate strong performance on vision-language tasks, but their substantial computational and memory requirements hinder practical deployment. For instance, training LLaVA-70B requires over 800 GPU hours on A100s.

Limitations of existing compression methods:

Decoupled low-rank decomposition and quantization: Existing methods (e.g., LoRD, ASVD, LQER) treat low-rank decomposition and quantization independently. The low-rank selection stage is oblivious to subsequent quantization noise, causing reconstruction errors to compound.

Underutilization of multimodal redundancy: Unlike text-only LLMs, large vision-language models also carry cross-modal adapters fed by image encoders. The rank of these adapters inflates across visual domains, so a low-rank-plus-quantization scheme that is adequate for text-only models still leaves multimodal models "bloated".

Dependence on calibration data: Many compression methods require large-scale calibration datasets to estimate Hessian matrices.

Core Problem: How to aggressively compress learnable parameters while preserving the performance of multimodal models?

Key Observation: The Fourier transform possesses strong expressive power for data compression, since extremely sparse spectral information can reconstruct high-fidelity signals. Crucially, even for weight matrices lacking spatial semantics, the Fourier transform effectively handles approximation problems. The authors find that:
- LMM weight matrices exhibit more compact singular value distributions in the frequency domain
- At the same rank, the cumulative approximation error of frequency-domain low-rank approximation is smaller than in the spatial domain
- The conjugate symmetry of the Fourier transform saves nearly half of the learnable parameters
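
The conjugate-symmetry observation can be checked on a toy matrix. The following is a minimal NumPy sketch, not the authors' code: the matrix `W` and its size are arbitrary assumptions. It shows that the DFT of a real matrix is mirror-symmetric under conjugation, and that the non-redundant half returned by `np.fft.rfft2` is enough to recover the matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 8))          # toy real-valued "weight matrix" (assumption)

F_full = np.fft.fft2(W)                  # full complex spectrum, shape (6, 8)
F_half = np.fft.rfft2(W)                 # non-redundant half, shape (6, 8 // 2 + 1)

# Conjugate symmetry of a real input: F[k, l] == conj(F[-k mod m, -l mod n]).
m, n = W.shape
k = (-np.arange(m)) % m
l = (-np.arange(n)) % n
mirrored = np.conj(F_full[np.ix_(k, l)])
symmetric = np.allclose(F_full, mirrored)

# The half spectrum alone recovers W up to floating-point error.
W_back = np.fft.irfft2(F_half, s=W.shape)
max_err = np.max(np.abs(W - W_back))
```

The half spectrum keeps `n // 2 + 1` of the `n` columns, which is why the saving is "nearly" rather than exactly half of the complex coefficients.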

Method¶

Overall Architecture¶

The core idea of LLaVA-FA is to shift weight matrix compression from the spatial domain to the frequency domain:
1. Apply Discrete Fourier Transform (DFT) to weight matrices
2. Perform low-rank decomposition in the frequency domain
3. Apply PolarQuant to the resulting complex-valued matrices
4. Optionally apply ODC to eliminate the need for large-scale calibration data

Key Designs¶

Frequency-domain low-rank decomposition: Exploiting the decorrelation capability of DFT
- Transform weight matrix \(W\) to \(\hat{W} = \text{DFT}(W)\)
- Apply truncated SVD to \(\hat{W}\) in the frequency domain to obtain the low-rank approximation \(\hat{W}_r = U_r \Sigma_r V_r^H\)
- Theoretical guarantee: The Frobenius error of same-rank low-rank approximation in the frequency domain is smaller than in the spatial domain (formally proved in the paper)
- Conjugate symmetry: The DFT of a real-valued matrix satisfies conjugate symmetry, requiring only half the coefficients to be stored, further compressing parameter count
- Energy compaction: DFT concentrates information into a small number of frequency components, making truncation more effective
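
The mechanics of the first two bullets (DFT, then truncated SVD) can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the paper's implementation: the helper names `freq_lowrank` and `reconstruct`, the orthonormal 2-D DFT, and the toy matrix are all hypothetical choices. Note that an orthonormal DFT applied to the whole matrix leaves its singular values unchanged, so this sketch only illustrates the pipeline; the error advantage claimed in the paper depends on construction details that this summary does not specify.

```python
import numpy as np

def freq_lowrank(W, rank):
    """Rank-`rank` truncated SVD of the 2-D DFT of W (hypothetical helper)."""
    W_hat = np.fft.fft2(W, norm="ortho")                  # W_hat = DFT(W)
    U, s, Vh = np.linalg.svd(W_hat, full_matrices=False)  # W_hat = U diag(s) Vh
    return U[:, :rank], s[:rank], Vh[:rank, :]            # keep the top-`rank` triplets

def reconstruct(U_r, s_r, Vh_r):
    """Rebuild the low-rank spectrum and map it back to the spatial domain."""
    W_hat_r = (U_r * s_r) @ Vh_r                          # U_r diag(s_r) Vh_r
    return np.fft.ifft2(W_hat_r, norm="ortho").real       # drop residual imaginary part

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))                         # toy weight matrix (assumption)

U_r, s_r, Vh_r = freq_lowrank(W, rank=8)
W_r = reconstruct(U_r, s_r, Vh_r)
rel_err = np.linalg.norm(W - W_r) / np.linalg.norm(W)     # relative Frobenius error

# Sanity check: keeping every singular triplet reproduces W.
W_full = reconstruct(*freq_lowrank(W, rank=min(W.shape)))
```
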
PolarQuant (Polar Coordinate Quantization): A quantization scheme designed for complex-valued matrices
- Frequency-domain low-rank decomposition yields complex-valued matrices, making conventional real-valued quantization schemes inapplicable
- PolarQuant decomposes each complex number \(z = r e^{i\theta}\) into magnitude \(r\) and phase \(\theta\)
- Uniform quantization is applied to magnitude and phase separately, each with an independent scaling factor
- Compared to directly quantizing real and imaginary parts, the polar representation preserves complex number structure and avoids severe phase information loss at low bit-widths
- Supports ultra-low bit-width quantization in the 2–4 bit range
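
A minimal sketch of the magnitude/phase idea described above, assuming per-tensor uniform quantizers. The function names, the choice of a fixed phase step of 2π / 2^bits, and the per-tensor magnitude scale are assumptions for illustration; the paper's exact quantizer may differ.

```python
import numpy as np

def polar_quantize(Z, bits_mag=4, bits_phase=4):
    """Quantize magnitude and phase of a complex array separately (sketch)."""
    mag = np.abs(Z)
    phase = np.angle(Z)                                   # values in [-pi, pi]
    levels_mag = 2 ** bits_mag - 1
    mag_scale = max(float(mag.max()), 1e-12) / levels_mag # magnitude step size
    q_mag = np.round(mag / mag_scale).astype(np.int32)    # codes 0 .. levels_mag
    n_phase = 2 ** bits_phase
    phase_scale = 2 * np.pi / n_phase                     # phase step size
    # Phase is circular, so wrap the rounded code into 0 .. n_phase - 1.
    q_phase = np.round(phase / phase_scale).astype(np.int32) % n_phase
    return q_mag, mag_scale, q_phase, phase_scale

def polar_dequantize(q_mag, mag_scale, q_phase, phase_scale):
    """Map integer codes back to complex values r * exp(i * theta)."""
    return (q_mag * mag_scale) * np.exp(1j * q_phase * phase_scale)

rng = np.random.default_rng(0)
Z = rng.standard_normal((32, 8)) + 1j * rng.standard_normal((32, 8))

q_mag, mag_scale, q_phase, phase_scale = polar_quantize(Z, bits_mag=4, bits_phase=4)
Z_q = polar_dequantize(q_mag, mag_scale, q_phase, phase_scale)
max_err = np.max(np.abs(Z - Z_q))
# Worst case: half a magnitude step plus half a phase step at the largest radius.
bound = mag_scale / 2 + np.abs(Z).max() * phase_scale / 2
```

Because the phase codes wrap around, an angle near π and an angle near -π map to neighbouring codes rather than opposite ends of the range.
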
ODC (Optional Diagonal Calibration): A calibration scheme requiring no large-scale calibration data
- Traditional methods require full Hessian matrices to calibrate compression errors, incurring high computational cost and demanding large representative datasets
- ODC leverages the empirical observation that Hessian matrices of deep networks are often diagonally dominant or low-rank structured
- Row/column means are used to approximate the full Hessian, substantially reducing computational complexity
- The compression pipeline becomes fully calibration-data-free, or requires only a minimal amount of data
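
To make the diagonal idea concrete, here is a small sketch comparing a full second-order error proxy with its diagonal counterpart. It is not the paper's ODC procedure: the proxy Hessian `H = X^T X / N`, the toy activations `X`, the error matrix `dW`, and both helper functions are assumptions, and the row/column-mean construction mentioned above is not reproduced here. The sketch only shows that the diagonal can be obtained from column means of squared activations without ever forming the full matrix.

```python
import numpy as np

def full_weighted_error(dW, H):
    """Second-order error proxy tr(dW H dW^T) using the full matrix H."""
    return float(np.trace(dW @ H @ dW.T))

def diag_weighted_error(dW, h_diag):
    """Same proxy with H replaced by diag(h_diag): sum_ij dW_ij^2 * h_j."""
    return float(np.sum(dW ** 2 * h_diag))

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 16))          # hypothetical activations, shape (N, d_in)
dW = 0.01 * rng.standard_normal((8, 16))    # hypothetical compression error, (d_out, d_in)

H = X.T @ X / X.shape[0]                    # full proxy Hessian, d_in x d_in
h_diag = (X ** 2).mean(axis=0)              # its diagonal, via column means of X**2

err_full = full_weighted_error(dW, H)
err_diag = diag_weighted_error(dW, h_diag)
```

The diagonal version needs O(d_in) storage instead of O(d_in^2), which is the kind of saving the ODC bullets describe.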

Loss & Training¶

LLaVA-FA is a post-training compression method that requires no additional fine-tuning or training:
- Compression is applied directly to pre-trained weights via DFT → truncated low-rank decomposition → PolarQuant
- ODC calibration is also a one-time forward computation
- The compressed model is directly deployable for inference without retraining
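
The training-free chain DFT → truncated low-rank decomposition → polar quantization can be sketched end to end on a single matrix. Everything below is a hedged illustration: `quantize_polar`, `compress_weight`, the choice to split the singular values evenly between the two factors, and the synthetic near-low-rank matrix are assumptions, and ODC is omitted.

```python
import numpy as np

def quantize_polar(Z, bits):
    """Uniformly quantize magnitude and phase, then dequantize (sketch)."""
    mag, phase = np.abs(Z), np.angle(Z)
    mag_scale = max(float(mag.max()), 1e-12) / (2 ** bits - 1)
    phase_scale = 2 * np.pi / 2 ** bits
    mag_q = np.round(mag / mag_scale) * mag_scale
    phase_q = np.round(phase / phase_scale) * phase_scale
    return mag_q * np.exp(1j * phase_q)

def compress_weight(W, rank, bits):
    """DFT -> truncated SVD -> polar quantization of the factors -> inverse DFT."""
    W_hat = np.fft.fft2(W, norm="ortho")
    U, s, Vh = np.linalg.svd(W_hat, full_matrices=False)
    root = np.sqrt(s[:rank])
    A = U[:, :rank] * root                  # left factor, shape (m, rank)
    B = root[:, None] * Vh[:rank]           # right factor, shape (rank, n)
    A_q, B_q = quantize_polar(A, bits), quantize_polar(B, bits)
    return np.fft.ifft2(A_q @ B_q, norm="ortho").real

rng = np.random.default_rng(0)
# Synthetic matrix that is close to rank 8 (assumption, for a clear comparison).
W = rng.standard_normal((32, 8)) @ rng.standard_normal((8, 32))
W += 0.01 * rng.standard_normal((32, 32))

W_lo = compress_weight(W, rank=8, bits=2)
W_hi = compress_weight(W, rank=8, bits=12)
err_lo = np.linalg.norm(W - W_lo) / np.linalg.norm(W)
err_hi = np.linalg.norm(W - W_hi) / np.linalg.norm(W)
```

No gradient step appears anywhere in the sketch, which matches the post-training character described above; only forward linear algebra on the stored weights is involved.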

This makes the compression pipeline extremely lightweight and practical.

Key Experimental Results¶

Main Results¶

The method is evaluated on multiple vision-language benchmarks covering both understanding and hallucination tasks:

Method	Active Params	VQAv2	GQA	SQA	POPE	Avg
LLaVA-1.5 (Original)	7B	Baseline	Baseline	Baseline	Baseline	Baseline
ASVD + Q	~2B	Lower	Lower	Lower	Lower	Lower
LQER + Q	~2B	Medium	Medium	Medium	Medium	Medium
LLaVA-FA	Fewest	Highest	Highest	Highest	Highest	Highest

LLaVA-FA is reported to surpass the compared efficient multimodal models on every listed benchmark while maintaining the fewest active parameters and the lowest computational cost. The table above gives relative rankings only, not numeric scores.

Ablation Study¶

Configuration	Result	Note
Spatial-domain vs. frequency-domain low-rank	Frequency domain superior	Smaller reconstruction error at same rank
Real/imaginary quantization vs. PolarQuant	PolarQuant superior	Preserves complex structure; more complete phase information
Without ODC vs. with ODC	ODC provides clear gains	Particularly significant at low bit-widths
Low-rank only vs. low-rank + quantization	Joint approach optimal	Joint optimization in frequency domain avoids error compounding
Different bit-widths	4-bit offers best trade-off	2-bit incurs notable degradation; 4-bit approaches full precision

Key Findings¶

Frequency-domain low-rank approximation genuinely outperforms the spatial domain: This is not merely an empirical observation — the paper provides a theoretical proof that the decorrelation property of DFT accelerates singular value decay.
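Conjugate symmetry yields nearly "free" 2× compression: Because the spectrum of a real-valued matrix is conjugate-symmetric, only about half of the complex coefficients need to be stored, with no loss of precision.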
PolarQuant is critical at low bit-widths: At 2–4 bits, directly quantizing real and imaginary parts severely distorts phase information.
ODC eliminates the calibration data bottleneck: The diagonal approximation is sufficiently accurate for deep network Hessians, making compression truly plug-and-play.

Highlights & Insights¶

A pioneering frequency-domain compression perspective: Shifting neural network weight compression from the spatial domain to the frequency domain is a novel direction, fully exploiting three key properties of the Fourier transform: decorrelation, conjugate symmetry, and energy compaction.
Integration of theory and practice: The method is supported not only by empirical results but also by rigorous mathematical proofs that frequency-domain low-rank approximation outperforms its spatial-domain counterpart.
End-to-end consistent design philosophy: From low-rank decomposition to quantization to calibration, every step is handled consistently in the frequency/complex domain, avoiding information loss from cross-domain conversion.
Extremely low deployment barrier: Post-training compression, little or no calibration data, and no fine-tuning make the method highly practical.
Targeted design for multimodal models: The paper explicitly identifies that multimodal models face greater compression challenges than pure-text LLMs due to cross-modal adapter redundancy.

Limitations & Future Work¶

Frequency-to-spatial domain conversion at inference: Although storage is compressed, the inference pipeline may incur additional DFT/IDFT computational overhead.
Primarily validated on the LLaVA series: Applicability to other architectures (e.g., Qwen-VL, InternVL) requires further investigation.
Ultra-low bit-width (1–2 bit) performance: Performance degradation remains noticeable at extremely low bit-widths; combining with knowledge distillation or similar techniques may be necessary.
Hardware support: Complex arithmetic and polar coordinate quantization may lack native acceleration support in existing inference engines.
Adaptation to dynamic scenarios: Fixed truncation rank may not be optimal across all layers; adaptive rank selection strategies are worth exploring.

Related Work & Insights¶

LoRA / QLoRA: Pioneering work on low-rank adaptation, but operating in the spatial domain.
ASVD / FWSVD: SVD-based weight compression that handles low-rank decomposition and quantization separately.
LQER: Combines low-rank decomposition with quantization, but in the spatial domain.
GPTQ / AWQ: Quantization-only schemes without low-rank decomposition.

The core insight of this paper is that mature frequency-domain analysis tools from signal processing can be introduced into deep learning model compression, opening an entirely new research direction. Future work may extend this to other transform domains (e.g., wavelet transforms) or other model architectures.

Rating¶

Novelty: ⭐⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐
Value: ⭐⭐⭐⭐