# Highly Efficient and Effective LLMs with Multi-Boolean Architectures
Conference: ICLR 2026
arXiv: 2505.22811
Code: None
Area: Model Compression
Keywords: Weight Binarization, Boolean Parameters, Ultra-Low-Bit Quantization, Large Language Models, Direct Finetuning
## TL;DR
This paper proposes a novel framework that represents LLM weights as multi-kernel Boolean parameters, enabling, for the first time, direct finetuning of large language models entirely within the Boolean domain—without requiring full-precision latent weights. The approach simultaneously surpasses existing ultra-low-bit quantization and binarization methods in both representational capacity and computational efficiency.
## Background & Motivation
Weight binarization is a powerful strategy for reducing the complexity of large language models, compressing weights from 32-bit floating point to 1 bit. This theoretically yields a 32× compression ratio and significant inference acceleration (replacing multiplications with additions and subtractions).
The fundamental dilemma of existing binarization methods:
**Post-training binarization:**
- Straightforward in approach: directly binarizes trained weights
- Incurs severe performance degradation: the information loss from 1-bit quantization is substantial, causing sharp drops in model quality
- For large language models, such degradation is often unacceptable
**Training-aware binarization:**
- Performs binarization during training/finetuning, adjusting binary weights via gradient signals
- Requires maintaining full-precision latent weights to accumulate gradients
- Forward passes use binary weights; backward passes update the full-precision latents
- Problem: latent weights introduce additional complexity and memory overhead, severely limiting efficiency gains
- The representational capacity of binary weights is inherently limited (each weight has only two states: +1/−1)
Key Challenge: Post-training methods are too coarse; training-aware methods are too cumbersome. Can one finetune directly in the Boolean domain without full-precision latent weights?
## Method
### Overall Architecture
The paper proposes representing the weight matrices of an LLM as a weighted combination of multiple Boolean matrices—the multi-kernel Boolean architecture. Each weight element is no longer a single value from \(\{+1, -1\}\), but a linear combination of multiple Boolean kernels, substantially enhancing representational capacity while preserving the Boolean nature of computation.
### Key Designs
**1. Multi-Kernel Boolean Parameter Representation**
- Standard binarization: \(W \approx \alpha \cdot B\), where \(B \in \{-1, +1\}^{m \times n}\) and \(\alpha\) is a scaling factor
- Multi-kernel Boolean: \(W \approx \sum_{k=1}^{K} \alpha_k \cdot B_k\), using a weighted sum of \(K\) Boolean matrices \(B_k\)
- Each \(B_k\) is an independent Boolean matrix; \(\alpha_k\) is the corresponding scaling factor
- The combination of \(K\) kernels can represent \(2^K\) distinct weight levels (vs. 2 levels in single-kernel binarization)
- \(K=2\) is equivalent to approximately 2-bit quantization; \(K=3\) to approximately 3-bit
- Key advantage: matrix multiplication \(Wx\) decomposes into \(K\) Boolean matrix–vector products, efficiently implementable via XNOR+popcount
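To make the representation concrete, below is a minimal numpy sketch (the paper's code is not released, so this illustrates the idea rather than the authors' implementation). It builds the \(K\) kernels greedily: each \(B_k\) binarizes the current residual, and each \(\alpha_k\) is the closed-form least-squares scale for that kernel.

```python
import numpy as np

def multi_kernel_decompose(W: np.ndarray, K: int):
    """Greedy residual binarization: W ≈ sum_k alpha_k * B_k."""
    residual = W.astype(np.float64).copy()
    kernels, scales = [], []
    for _ in range(K):
        B = np.where(residual >= 0, 1.0, -1.0)  # Boolean kernel in {-1, +1}
        alpha = np.abs(residual).mean()         # argmin_a ||residual - a*B||_F^2
        kernels.append(B)
        scales.append(alpha)
        residual -= alpha * B                   # the next kernel models what is left
    return kernels, scales

# Toy check: the approximation error shrinks as K grows.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
for K in (1, 2, 3):
    Bs, alphas = multi_kernel_decompose(W, K)
    W_hat = sum(a * B for a, B in zip(alphas, Bs))
    print(K, np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

The shrinking relative error with larger \(K\) is the capacity argument above in action: each added kernel refines the residual left by the previous ones.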
**2. Direct Finetuning in the Boolean Domain**: the paper's central contribution.
- Challenge: Boolean variables \(\{-1, +1\}\) are discrete and cannot be directly optimized via continuous gradient descent
- Conventional approach: maintain full-precision latent weights \(W_{\text{latent}} \in \mathbb{R}^{m \times n}\), derive Boolean weights via \(\text{sign}(W_{\text{latent}})\), and update \(W_{\text{latent}}\) through gradients (straight-through estimator, STE)
- Proposed approach: Completely eliminates latent weights and updates directly in the Boolean domain
- Core idea: Boolean flips are modeled as probabilistic events; the decision to flip each Boolean element is based on the expected improvement in the loss resulting from that flip
- This avoids the gradient bias introduced by STE and eliminates the need to store full-precision weights
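The paper's exact flip rule is not reproduced here, so the following is a minimal sketch of the idea under stated assumptions: a layer-reconstruction proxy loss stands in for the true training loss, and the expected improvement of a flip is estimated to first order from the gradient. Flipping \(B_k[i,j]\) perturbs the effective weight by \(-2\alpha_k B_k[i,j]\), so the predicted loss change is the gradient times that perturbation.

```python
import numpy as np

def flip_step(B_list, alphas, W_target, lr=0.1, rng=None):
    """One stochastic Boolean update on a reconstruction proxy loss.

    Proxy: L = 0.5 * ||W_eff - W_target||_F^2, so dL/dW_eff = W_eff - W_target.
    A flip of B_k[i,j] changes W_eff[i,j] by -2 * alpha_k * B_k[i,j]; elements
    whose first-order predicted loss change is negative are flipped with a
    probability that grows with the predicted improvement.
    """
    rng = rng or np.random.default_rng()
    W_eff = sum(a * B for a, B in zip(alphas, B_list))
    grad = W_eff - W_target                      # gradient of the proxy loss
    for a, B in zip(alphas, B_list):
        delta = grad * (-2.0 * a * B)            # predicted loss change per flip
        improve = np.clip(-delta, 0.0, None)     # positive where a flip helps
        p = 1.0 - np.exp(-lr * improve)          # flip probability in [0, 1)
        # Note: simultaneous flips interact (the gradient goes stale), which is
        # why the step is stochastic and damped by lr rather than greedy.
        B *= np.where(rng.random(B.shape) < p, -1.0, 1.0)
    return B_list

# Toy usage: repeated flip steps drive W_eff toward the target.
rng = np.random.default_rng(1)
W_t = rng.normal(size=(8, 8))
B_list = [np.where(rng.random((8, 8)) < 0.5, -1.0, 1.0) for _ in range(2)]
for _ in range(100):
    flip_step(B_list, [0.8, 0.3], W_t, rng=rng)
```

Note that no latent real-valued copy of the weights is kept anywhere: the only trainable state is the Boolean kernels themselves, which is the point of the design.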
**3. Efficient Optimization of Scaling Factors**
- Once the Boolean matrices \(B_k\) are fixed, the scaling factors \(\alpha_k\) can be solved rapidly via a closed-form least-squares solution
- Alternating optimization: update \(B\) with \(\alpha\) fixed, then update \(\alpha\) with \(B\) fixed
- Convergence is fast, typically requiring only a few iterations
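A hedged sketch of both pieces (the function names and update order are my own; the paper may interleave the steps differently): with the kernels fixed, the optimal scales solve a \(K \times K\) least-squares system; with the scales fixed (and assumed non-negative), each kernel is refreshed as the sign of the residual left by the other kernels.

```python
import numpy as np

def solve_scales(B_list, W):
    """Closed-form least squares: minimize ||W - sum_k alpha_k * B_k||_F^2.

    Stacking the flattened kernels as columns reduces this to an ordinary
    K-variable linear least-squares problem.
    """
    V = np.stack([B.ravel() for B in B_list], axis=1)   # shape (m*n, K)
    alpha, *_ = np.linalg.lstsq(V, W.ravel(), rcond=None)
    return alpha

def alternate(W, B_list, n_iter=10):
    """Alternate alpha- and B-updates; assumes the solved scales stay >= 0."""
    for _ in range(n_iter):
        alpha = solve_scales(B_list, W)
        for k in range(len(B_list)):
            rest = sum(alpha[j] * B_list[j] for j in range(len(B_list)) if j != k)
            B_list[k] = np.where(W - rest >= 0, 1.0, -1.0)  # sign of the residual
    return B_list, solve_scales(B_list, W)

# Toy usage: a few iterations already give a stable decomposition.
rng = np.random.default_rng(2)
W = rng.normal(size=(32, 32))
B_list = [np.where(rng.random((32, 32)) < 0.5, -1.0, 1.0) for _ in range(3)]
B_list, alpha = alternate(W, B_list)
```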
**4. Group Quantization Strategy**
- Weight matrices are partitioned into groups by column or by block, with each group assigned independent scaling factors
- Improves granularity, achieving significant accuracy gains at the cost of a small number of additional parameters (scaling factors)
- Group size serves as a trade-off parameter between accuracy and compression ratio
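Concretely, group quantization here just means per-group scale vectors instead of one scale per matrix. A minimal single-kernel sketch, assuming column-wise groups of size `group_size` (the paper's grouping scheme may differ in detail):

```python
import numpy as np

def groupwise_binarize(W: np.ndarray, group_size: int = 128):
    """Single-kernel binarization with one scale per (row, column-group)."""
    m, n = W.shape
    assert n % group_size == 0, "columns must divide evenly into groups"
    B = np.where(W >= 0, 1.0, -1.0)
    groups = np.abs(W).reshape(m, n // group_size, group_size)
    alphas = groups.mean(axis=2)                # least-squares scale per group
    return B, alphas

# Smaller groups track the weight distribution more closely.
rng = np.random.default_rng(3)
W = rng.normal(size=(8, 64))
for g in (64, 16):
    B, alphas = groupwise_binarize(W, g)
    W_hat = (B.reshape(8, -1, g) * alphas[:, :, None]).reshape(8, 64)
    print(g, np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

The extra parameters are exactly the \(m \cdot n / g\) scales per kernel, which is the accuracy-versus-compression trade-off the group size controls.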
### Loss & Training
Finetuning employs the standard language modeling cross-entropy loss:
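For a token sequence \(x_1, \dots, x_T\) this is the usual next-token objective (any paper-specific auxiliary terms are not reproduced in this summary):

\[
\mathcal{L}_{\text{CE}}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
\]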
Training procedure:
1. Initialization: starting from the pretrained weights, determine the initial multi-kernel Boolean matrices via SVD or greedy search
2. Finetuning: alternately optimize the Boolean matrices and scaling factors on a small-scale dataset
3. Large-scale training data is not required; typically a few thousand samples suffice for effective finetuning
4. Memory consumption during finetuning is far lower than in conventional training-aware methods (no full-precision latent weights required)
## Key Experimental Results
### Main Results
Evaluated on multiple LLM architectures (including the LLaMA family), measuring perplexity and downstream task performance:
| Method | Bit-Width | LLaMA-7B PPL ↓ | LLaMA-13B PPL ↓ | Compression Ratio |
|---|---|---|---|---|
| Full Precision | 16 bit | Baseline | Baseline | 1× |
| GPTQ | 3 bit | Moderate | Moderate | ~5× |
| RTN | 2 bit | High | High | ~8× |
| BiLLM | 1 bit | Very High | Very High | ~16× |
| OneBit | 1 bit | High | High | ~16× |
| Multi-Kernel Boolean (K=2) | ~1.5 bit | Low | Low | ~10× |
| Multi-Kernel Boolean (K=3) | ~2 bit | Lowest | Lowest | ~8× |
In the ultra-low-bit (1–2 bit) regime, the multi-kernel Boolean method significantly outperforms all existing binarization and quantization techniques.
### Ablation Study
| Configuration | Effect |
|---|---|
| K=1 (standard binarization) | Worst performance, highest compression |
| K=2 | Large performance improvement, approaching 2-bit quantization |
| K=3 | Further improvement, competitive with 3-bit GPTQ |
| With vs. without latent weights | Direct Boolean-domain finetuning performs no worse than latent-weight-based methods |
| Group size 128 vs. 256 vs. full-layer | Smaller groups yield higher accuracy; 128 is the common choice |
| Different initialization strategies | SVD initialization outperforms random initialization |
### Key Findings
- Multi-kernel Boolean substantially improves representational capacity: The perplexity reduction from \(K=1\) to \(K=2\) far exceeds the gain from moving from 2-bit to 3-bit quantization
- Eliminating latent weights is viable: Direct finetuning in the Boolean domain not only avoids performance loss but also simplifies the training pipeline and reduces memory consumption
- Advantage is most pronounced at ultra-low bit-widths: In the 1–2 bit range, multi-kernel Boolean methods hold the largest margin over conventional quantization
- Cross-architecture generalization: The method performs consistently across models of different scales, including LLaMA-7B and LLaMA-13B
- Training efficiency is notable: Eliminating full-precision latent weights reduces finetuning memory consumption by approximately 50%
## Highlights & Insights
- Breaking the representational bottleneck of binarization: By combining multiple kernels, Boolean parameters are extended from 2 discrete values to \(2^K\) levels—a simple yet effective idea
- First direct finetuning in the Boolean domain: Eliminating reliance on full-precision latent weights represents an important theoretical and practical breakthrough, as it enables the entire training and inference pipeline to operate at extreme low precision
- Hardware-friendly: Multi-kernel Boolean multiplication reduces to \(K\) XNOR+popcount operations, enabling extremely high throughput on dedicated hardware (see the bit-level sketch after this list)
- Theoretical elegance: Multi-kernel binarization can be viewed as a structured form of low-bit quantization in which each quantization level is determined by a Boolean combination, offering an alternative theoretical perspective
- Strong practicality: Requires minimal finetuning data, low memory footprint, and straightforward deployment
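To ground the hardware-friendliness claim, here is a small pure-Python illustration (not a production kernel) of how a single ±1 dot product reduces to XNOR plus popcount over packed bits; a \(K\)-kernel matrix product is just \(K\) such passes followed by the \(\alpha_k\)-weighted sum.

```python
import numpy as np

def pack_pm1(v):
    """Pack a ±1 vector into an integer bitmask (+1 -> bit 1, -1 -> bit 0)."""
    bits = 0
    for i, x in enumerate(v):
        if x > 0:
            bits |= 1 << i
    return bits

def xnor_dot(a_bits, b_bits, n):
    """Dot product of two packed ±1 vectors of length n.

    XNOR marks positions where the signs agree; popcount counts them.
    Each agreement contributes +1 and each disagreement -1, hence
    dot = matches - (n - matches) = 2 * matches - n.
    """
    mask = (1 << n) - 1
    matches = (~(a_bits ^ b_bits) & mask).bit_count()   # requires Python >= 3.10
    return 2 * matches - n

# Check against an ordinary dot product.
rng = np.random.default_rng(4)
a = rng.choice([-1, 1], size=64)
b = rng.choice([-1, 1], size=64)
assert xnor_dot(pack_pm1(a), pack_pm1(b), 64) == int(a @ b)
```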
## Limitations & Future Work
- Inference speedup depends on specialized hardware: Although XNOR+popcount is theoretically fast, existing GPU hardware support for Boolean operations is limited, and practical speedups may fall short of theoretical projections
- Diminishing returns for larger K: Improvements beyond \(K=4\) may be marginal, while additional kernels increase parallelism demands
- Validated only on language models: Applicability to vision models, multimodal models, and other domains requires further experimentation
- Impact of finetuning data selection: The paper does not thoroughly analyze how different finetuning datasets affect final performance
- Integration with knowledge distillation: Guiding a Boolean student model with a full-precision teacher may further improve performance
- Boolean activations: The current work binarizes weights only; activations remain full-precision. Full Booleanization (weights + activations) is a more aggressive direction for future exploration
## Related Work & Insights
- BiLLM / OneBit: Existing LLM binarization methods using single-kernel Boolean representations, which suffer from severe performance degradation
- GPTQ / AWQ: Post-training quantization methods supporting 3–4 bits, with poor support for binarization
- BinaryBERT / BiBERT: Early work on binarizing BERT, but at a scale far smaller than modern LLMs
- QLoRA: Quantization combined with low-rank adaptation, but quantization precision is generally no lower than 4 bits
The core insight of this paper is to treat quantization as a representation problem in Boolean space, rather than a simple precision truncation. Multi-kernel Boolean parameters are fundamentally a structured approach to maximizing representational capacity under an extreme low-bit budget. Future directions include adaptive kernel allocation (different \(K\) for different layers), integration with sparsification, and empirical measurement of actual acceleration on inference hardware.
## Rating
- Novelty: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐