# Highly Efficient and Effective LLMs with Multi-Boolean Architectures
Conference: ICLR 2026
arXiv: 2505.22811
Code: None
Area: Model Compression
Keywords: Weight Binarization, Boolean Parameters, Ultra-Low-Bit Quantization, Large Language Models, Direct Finetuning
## TL;DR
This paper proposes a novel framework that represents LLM weights as multi-kernel Boolean parameters, enabling, for the first time, direct finetuning of large language models entirely within the Boolean domain—without requiring full-precision latent weights. The approach simultaneously surpasses existing ultra-low-bit quantization and binarization methods in both representational capacity and computational efficiency.
## Background & Motivation
Weight binarization is a powerful strategy for reducing the complexity of large language models, compressing weights from 32-bit floating point to 1 bit. This theoretically yields a 32× compression ratio and significant inference acceleration (replacing multiplications with additions and subtractions).
The fundamental dilemma of existing binarization methods:
**Post-training binarization:**
- Straightforward in approach: directly binarizes trained weights
- Incurs severe performance degradation: the information loss from 1-bit quantization is substantial, causing sharp drops in model quality
- For large language models, such degradation is often unacceptable
**Training-aware binarization:**
- Performs binarization during training/finetuning, adjusting binary weights via gradient signals
- Requires maintaining full-precision latent weights to accumulate gradients
- Forward passes use binary weights; backward passes update the full-precision latents
- Problem: latent weights introduce additional complexity and memory overhead, severely limiting efficiency gains
- The representational capacity of binary weights is inherently limited (each weight has only two states: +1/−1)
Key Challenge: Post-training methods are too coarse; training-aware methods are too cumbersome. Can one finetune directly in the Boolean domain without full-precision latent weights?
## Method
### Overall Architecture
The paper proposes representing the weight matrices of an LLM as a weighted combination of multiple Boolean matrices—the multi-kernel Boolean architecture. Each weight element is no longer a single value from \(\{+1, -1\}\), but a linear combination of multiple Boolean kernels, substantially enhancing representational capacity while preserving the Boolean nature of computation.
### Key Designs
**1. Multi-Kernel Boolean Parameter Representation**
- Standard binarization: \(W \approx \alpha \cdot B\), where \(B \in \{-1, +1\}^{m \times n}\) and \(\alpha\) is a scaling factor
- Multi-kernel Boolean: \(W \approx \sum_{k=1}^{K} \alpha_k \cdot B_k\), using a weighted sum of \(K\) Boolean matrices \(B_k\)
- Each \(B_k\) is an independent Boolean matrix; \(\alpha_k\) is the corresponding scaling factor
- The combination of \(K\) kernels can represent \(2^K\) distinct weight levels (vs. 2 levels in single-kernel binarization)
- \(K=2\) is equivalent to approximately 2-bit quantization; \(K=3\) to approximately 3-bit
- Key advantage: matrix multiplication \(Wx\) decomposes into \(K\) Boolean matrix–vector products, efficiently implementable via XNOR+popcount
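To make the representation concrete, below is a minimal numpy sketch (the paper's code is not released, so this illustrates the idea rather than the authors' implementation). It builds the \(K\) kernels greedily: each \(B_k\) binarizes the current residual, and each \(\alpha_k\) is the closed-form least-squares scale for that kernel.

```python
import numpy as np

def multi_kernel_decompose(W: np.ndarray, K: int):
    """Greedy residual binarization: W ≈ sum_k alpha_k * B_k."""
    residual = W.astype(np.float64).copy()
    kernels, scales = [], []
    for _ in range(K):
        B = np.where(residual >= 0, 1.0, -1.0)  # Boolean kernel in {-1, +1}
        alpha = np.abs(residual).mean()         # argmin_a ||residual - a*B||_F^2
        kernels.append(B)
        scales.append(alpha)
        residual -= alpha * B                   # the next kernel models what is left
    return kernels, scales

# Toy check: the approximation error shrinks as K grows.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
for K in (1, 2, 3):
    Bs, alphas = multi_kernel_decompose(W, K)
    W_hat = sum(a * B for a, B in zip(alphas, Bs))
    print(K, np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

The shrinking relative error with larger \(K\) is the capacity argument above in action: each added kernel refines the residual left by the previous ones.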
**2. Direct Finetuning in the Boolean Domain**: the paper's central contribution.
- Challenge: Boolean variables \(\{-1, +1\}\) are discrete and cannot be directly optimized via continuous gradient descent
- Conventional approach: maintain full-precision latent weights \(W_{\text{latent}} \in \mathbb{R}^{m \times n}\), derive Boolean weights via \(\text{sign}(W_{\text{latent}})\), and update \(W_{\text{latent}}\) through gradients (straight-through estimator, STE)
- Proposed approach: Completely eliminates latent weights and updates directly in the Boolean domain
- Core idea: Boolean flips are modeled as probabilistic events; the decision to flip each Boolean element is based on the expected improvement in the loss resulting from that flip
- This avoids the gradient bias introduced by STE and eliminates the need to store full-precision weights
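The paper's exact flip rule is not reproduced here, so the following is a minimal sketch of the idea under stated assumptions: a layer-reconstruction proxy loss stands in for the true training loss, and the expected improvement of a flip is estimated to first order from the gradient. Flipping \(B_k[i,j]\) perturbs the effective weight by \(-2\alpha_k B_k[i,j]\), so the predicted loss change is the gradient times that perturbation.

```python
import numpy as np

def flip_step(B_list, alphas, W_target, lr=0.1, rng=None):
    """One stochastic Boolean update on a reconstruction proxy loss.

    Proxy: L = 0.5 * ||W_eff - W_target||_F^2, so dL/dW_eff = W_eff - W_target.
    A flip of B_k[i,j] changes W_eff[i,j] by -2 * alpha_k * B_k[i,j]; elements
    whose first-order predicted loss change is negative are flipped with a
    probability that grows with the predicted improvement.
    """
    rng = rng or np.random.default_rng()
    W_eff = sum(a * B for a, B in zip(alphas, B_list))
    grad = W_eff - W_target                      # gradient of the proxy loss
    for a, B in zip(alphas, B_list):
        delta = grad * (-2.0 * a * B)            # predicted loss change per flip
        improve = np.clip(-delta, 0.0, None)     # positive where a flip helps
        p = 1.0 - np.exp(-lr * improve)          # flip probability in [0, 1)
        # Note: simultaneous flips interact (the gradient goes stale), which is
        # why the step is stochastic and damped by lr rather than greedy.
        B *= np.where(rng.random(B.shape) < p, -1.0, 1.0)
    return B_list

# Toy usage: repeated flip steps drive W_eff toward the target.
rng = np.random.default_rng(1)
W_t = rng.normal(size=(8, 8))
B_list = [np.where(rng.random((8, 8)) < 0.5, -1.0, 1.0) for _ in range(2)]
for _ in range(100):
    flip_step(B_list, [0.8, 0.3], W_t, rng=rng)
```

Note that no latent real-valued copy of the weights is kept anywhere: the only trainable state is the Boolean kernels themselves, which is the point of the design.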
**3. Efficient Optimization of Scaling Factors**
- Once the Boolean matrices \(B_k\) are fixed, the scaling factors \(\alpha_k\) can be solved rapidly via a closed-form least-squares solution
- Alternating optimization: update \(B\) with \(\alpha\) fixed, then update \(\alpha\) with \(B\) fixed
- Convergence is fast, typically requiring only a few iterations
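A hedged sketch of both pieces (the function names and update order are my own; the paper may interleave the steps differently): with the kernels fixed, the optimal scales solve a \(K \times K\) least-squares system; with the scales fixed (and assumed non-negative), each kernel is refreshed as the sign of the residual left by the other kernels.

```python
import numpy as np

def solve_scales(B_list, W):
    """Closed-form least squares: minimize ||W - sum_k alpha_k * B_k||_F^2.

    Stacking the flattened kernels as columns reduces this to an ordinary
    K-variable linear least-squares problem.
    """
    V = np.stack([B.ravel() for B in B_list], axis=1)   # shape (m*n, K)
    alpha, *_ = np.linalg.lstsq(V, W.ravel(), rcond=None)
    return alpha

def alternate(W, B_list, n_iter=10):
    """Alternate alpha- and B-updates; assumes the solved scales stay >= 0."""
    for _ in range(n_iter):
        alpha = solve_scales(B_list, W)
        for k in range(len(B_list)):
            rest = sum(alpha[j] * B_list[j] for j in range(len(B_list)) if j != k)
            B_list[k] = np.where(W - rest >= 0, 1.0, -1.0)  # sign of the residual
    return B_list, solve_scales(B_list, W)

# Toy usage: a few iterations already give a stable decomposition.
rng = np.random.default_rng(2)
W = rng.normal(size=(32, 32))
B_list = [np.where(rng.random((32, 32)) < 0.5, -1.0, 1.0) for _ in range(3)]
B_list, alpha = alternate(W, B_list)
```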
**4. Group Quantization Strategy**
- Weight matrices are partitioned into groups by column or by block, with each group assigned independent scaling factors
- Improves granularity, achieving significant accuracy gains at the cost of a small number of additional parameters (scaling factors)
- Group size serves as a trade-off parameter between accuracy and compression ratio
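Concretely, group quantization here just means per-group scale vectors instead of one scale per matrix. A minimal single-kernel sketch, assuming column-wise groups of size `group_size` (the paper's grouping scheme may differ in detail):

```python
import numpy as np

def groupwise_binarize(W: np.ndarray, group_size: int = 128):
    """Single-kernel binarization with one scale per (row, column-group)."""
    m, n = W.shape
    assert n % group_size == 0, "columns must divide evenly into groups"
    B = np.where(W >= 0, 1.0, -1.0)
    groups = np.abs(W).reshape(m, n // group_size, group_size)
    alphas = groups.mean(axis=2)                # least-squares scale per group
    return B, alphas

# Smaller groups track the weight distribution more closely.
rng = np.random.default_rng(3)
W = rng.normal(size=(8, 64))
for g in (64, 16):
    B, alphas = groupwise_binarize(W, g)
    W_hat = (B.reshape(8, -1, g) * alphas[:, :, None]).reshape(8, 64)
    print(g, np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

The extra parameters are exactly the \(m \cdot n / g\) scales per kernel, which is the accuracy-versus-compression trade-off the group size controls.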
### Loss & Training
Finetuning employs the standard language modeling cross-entropy loss:
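For a token sequence \(x_1, \dots, x_T\) this is the usual next-token objective (any paper-specific auxiliary terms are not reproduced in this summary):

\[
\mathcal{L}_{\text{CE}}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
\]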
Training procedure:
1. Initialization: starting from the pretrained weights, determine the initial multi-kernel Boolean matrices via SVD or greedy search
2. Finetuning: alternately optimize the Boolean matrices and scaling factors on a small-scale dataset
3. Large-scale training data is not required; typically a few thousand samples suffice for effective finetuning
4. Memory consumption during finetuning is far lower than in conventional training-aware methods (no full-precision latent weights required)
## Key Experimental Results
### Main Results
Evaluated on multiple LLM architectures (including the LLaMA family), measuring perplexity and downstream task performance:
| Method | Bit-Width | LLaMA-7B PPL ↓ | LLaMA-13B PPL ↓ | Compression Ratio |
|---|---|---|---|---|
| Full Precision | 16 bit | Baseline | Baseline | 1× |
| GPTQ | 3 bit | Moderate | Moderate | ~5× |
| RTN | 2 bit | High | High | ~8× |
| BiLLM | 1 bit | Very High | Very High | ~16× |
| OneBit | 1 bit | High | High | ~16× |
| Multi-Kernel Boolean (K=2) | ~1.5 bit | Low | Low | ~10× |
| Multi-Kernel Boolean (K=3) | ~2 bit | Lowest | Lowest | ~8× |
In the ultra-low-bit (1–2 bit) regime, the multi-kernel Boolean method significantly outperforms all existing binarization and quantization techniques.
### Ablation Study
| Configuration | Effect |
|---|---|
| K=1 (standard binarization) | Worst performance, highest compression |
| K=2 | Large performance improvement, approaching 2-bit quantization |
| K=3 | Further improvement, competitive with 3-bit GPTQ |
| With vs. without latent weights | Direct Boolean-domain finetuning performs no worse than latent-weight-based methods |
| Group size 128 vs. 256 vs. full-layer | Smaller groups yield higher accuracy; 128 is the common choice |
| Different initialization strategies | SVD initialization outperforms random initialization |
### Key Findings
- Multi-kernel Boolean substantially improves representational capacity: The perplexity reduction from \(K=1\) to \(K=2\) far exceeds the gain from moving from 2-bit to 3-bit quantization
- Eliminating latent weights is viable: Direct finetuning in the Boolean domain not only avoids performance loss but also simplifies the training pipeline and reduces memory consumption
- Advantage is most pronounced at ultra-low bit-widths: In the 1–2 bit range, multi-kernel Boolean methods hold the largest margin over conventional quantization
- Cross-architecture generalization: The method performs consistently across models of different scales, including LLaMA-7B and LLaMA-13B
- Training efficiency is notable: Eliminating full-precision latent weights reduces finetuning memory consumption by approximately 50%
## Highlights & Insights
- Breaking the representational bottleneck of binarization: By combining multiple kernels, Boolean parameters are extended from 2 discrete values to \(2^K\) levels—a simple yet effective idea
- First direct finetuning in the Boolean domain: Eliminating reliance on full-precision latent weights represents an important theoretical and practical breakthrough, as it enables the entire training and inference pipeline to operate at extreme low precision
- Hardware-friendly: Multi-kernel Boolean multiplication reduces to \(K\) XNOR+popcount operations, enabling extremely high throughput on dedicated hardware (see the bit-level sketch after this list)
- Theoretical elegance: Multi-kernel binarization can be viewed as a structured form of low-bit quantization in which each quantization level is determined by a Boolean combination, offering an alternative theoretical perspective
- Strong practicality: Requires minimal finetuning data, low memory footprint, and straightforward deployment
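To ground the hardware-friendliness claim, here is a small pure-Python illustration (not a production kernel) of how a single ±1 dot product reduces to XNOR plus popcount over packed bits; a \(K\)-kernel matrix product is just \(K\) such passes followed by the \(\alpha_k\)-weighted sum.

```python
import numpy as np

def pack_pm1(v):
    """Pack a ±1 vector into an integer bitmask (+1 -> bit 1, -1 -> bit 0)."""
    bits = 0
    for i, x in enumerate(v):
        if x > 0:
            bits |= 1 << i
    return bits

def xnor_dot(a_bits, b_bits, n):
    """Dot product of two packed ±1 vectors of length n.

    XNOR marks positions where the signs agree; popcount counts them.
    Each agreement contributes +1 and each disagreement -1, hence
    dot = matches - (n - matches) = 2 * matches - n.
    """
    mask = (1 << n) - 1
    matches = (~(a_bits ^ b_bits) & mask).bit_count()   # requires Python >= 3.10
    return 2 * matches - n

# Check against an ordinary dot product.
rng = np.random.default_rng(4)
a = rng.choice([-1, 1], size=64)
b = rng.choice([-1, 1], size=64)
assert xnor_dot(pack_pm1(a), pack_pm1(b), 64) == int(a @ b)
```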
## Limitations & Future Work
- Inference speedup depends on specialized hardware: Although XNOR+popcount is theoretically fast, existing GPU hardware support for Boolean operations is limited, and practical speedups may fall short of theoretical projections
- Diminishing returns for larger K: Improvements beyond \(K=4\) may be marginal, while additional kernels increase parallelism demands
- Validated only on language models: Applicability to vision models, multimodal models, and other domains requires further experimentation
- Impact of finetuning data selection: The paper does not thoroughly analyze how different finetuning datasets affect final performance
- Integration with knowledge distillation: Guiding a Boolean student model with a full-precision teacher may further improve performance
- Boolean activations: The current work binarizes weights only; activations remain full-precision. Full Booleanization (weights + activations) is a more aggressive direction for future exploration
## Related Work & Insights
- BiLLM / OneBit: Existing LLM binarization methods using single-kernel Boolean representations, which suffer from severe performance degradation
- GPTQ / AWQ: Post-training quantization methods supporting 3–4 bits, with poor support for binarization
- BinaryBERT / BiBERT: Early work on binarizing BERT, but at a scale far smaller than modern LLMs
- QLoRA: Quantization combined with low-rank adaptation, but quantization precision is generally no lower than 4 bits
The core insight of this paper is to treat quantization as a representation problem in Boolean space, rather than a simple precision truncation. Multi-kernel Boolean parameters are fundamentally a structured approach to maximizing representational capacity under an extreme low-bit budget. Future directions include adaptive kernel allocation (different \(K\) for different layers), integration with sparsification, and empirical measurement of actual acceleration on inference hardware.
## Rating
- Novelty: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐