
Highly Efficient and Effective LLMs with Multi-Boolean Architectures

Conference: ICLR 2026
arXiv: 2505.22811
Code: None
Area: Model Compression
Keywords: Weight Binarization, Boolean Parameters, Ultra-Low-Bit Quantization, Large Language Models, Direct Finetuning

TL;DR

This paper proposes a novel framework that represents LLM weights as multi-kernel Boolean parameters, enabling, for the first time, direct finetuning of large language models entirely within the Boolean domain—without requiring full-precision latent weights. The approach simultaneously surpasses existing ultra-low-bit quantization and binarization methods in both representational capacity and computational efficiency.

Background & Motivation

Weight binarization is a powerful strategy for reducing the complexity of large language models, compressing weights from 32-bit floating point to 1 bit. This theoretically yields a 32× compression ratio and significant inference acceleration (replacing multiplications with additions and subtractions).

The fundamental dilemma of existing binarization methods:

Post-training binarization:

  • Straightforward: directly binarizes the trained weights
  • Incurs severe performance degradation: the information loss from 1-bit quantization is substantial, causing sharp drops in model quality
  • For large language models, such degradation is often unacceptable

Training-aware binarization:

  • Performs binarization during training/finetuning, adjusting binary weights via gradient signals
  • Requires maintaining full-precision latent weights to accumulate gradients: forward passes use the binary weights, backward passes update the full-precision weights
  • Problem: the latent weights introduce additional complexity and memory overhead, severely limiting the efficiency gains
  • The representational capacity of binary weights is inherently limited (each weight has only two states: +1/−1)

Key Challenge: Post-training methods are too coarse; training-aware methods are too cumbersome. Can one finetune directly in the Boolean domain without full-precision latent weights?

Method

Overall Architecture

The paper proposes representing the weight matrices of an LLM as a weighted combination of multiple Boolean matrices—the multi-kernel Boolean architecture. Each weight element is no longer a single value from \(\{+1, -1\}\), but a linear combination of multiple Boolean kernels, substantially enhancing representational capacity while preserving the Boolean nature of computation.

Key Designs

  1. Multi-Kernel Boolean Parameter Representation:

    • Standard binarization: \(W \approx \alpha \cdot B\), where \(B \in \{-1, +1\}^{m \times n}\) and \(\alpha\) is a scaling factor
    • Multi-kernel Boolean: \(W \approx \sum_{k=1}^{K} \alpha_k \cdot B_k\), using a weighted sum of \(K\) Boolean matrices \(B_k\)
    • Each \(B_k\) is an independent Boolean matrix; \(\alpha_k\) is the corresponding scaling factor
    • The combination of \(K\) kernels can represent up to \(2^K\) distinct weight levels (vs. 2 levels in single-kernel binarization); for example, \(K=2\) with \(\alpha_1 \neq \alpha_2\) yields the four values \(\pm\alpha_1 \pm \alpha_2\)
    • \(K=2\) is therefore roughly equivalent to 2-bit quantization, and \(K=3\) to 3-bit
    • Key advantage: matrix multiplication \(Wx\) decomposes into \(K\) Boolean matrix–vector products, efficiently implementable via XNOR+popcount
  2. Direct Finetuning in the Boolean Domain: This is the paper's most central contribution.

    • Challenge: Boolean variables \(\{-1, +1\}\) are discrete and cannot be directly optimized via continuous gradient descent
    • Conventional approach: Maintain full-precision latent weights \(W_{\text{latent}} \in \mathbb{R}\), derive Boolean weights via \(\text{sign}(W_{\text{latent}})\), and update \(W_{\text{latent}}\) through gradients (straight-through estimator, STE)
    • Proposed approach: Completely eliminates latent weights and updates directly in the Boolean domain
    • Core Idea: Boolean flip operations are modeled as probabilistic events; the decision to flip each Boolean element is based on the expected improvement in the loss function with respect to that flip
    • This avoids the gradient bias introduced by STE and eliminates the need to store full-precision weights
  3. Efficient Optimization of Scaling Factors:

    • Once the Boolean matrices \(B_k\) are fixed, the scaling factors \(\alpha_k\) can be solved rapidly via a closed-form least-squares solution
    • Alternating optimization: update \(B\) with \(\alpha\) fixed, then update \(\alpha\) with \(B\) fixed
    • Convergence is fast, typically requiring only a few iterations (a minimal sketch of this alternating scheme follows this list)
  4. Group Quantization Strategy:

    • Weight matrices are partitioned into groups by column or by block, with each group assigned independent scaling factors
    • Improves granularity, achieving significant accuracy gains at the cost of a small number of additional parameters (scaling factors)
    • Group size serves as a trade-off parameter between accuracy and compression ratio
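A minimal NumPy sketch of designs 1 and 3, assuming a greedy residual initialization and a closed-form least-squares refit of the scales; function names such as `init_kernels` and `refit_alphas` are illustrative, since the paper's code is not released:

```python
import numpy as np

def init_kernels(W, K):
    """Greedy residual binarization: each kernel fits what the previous
    kernels left over, so W ~ sum_k alpha_k * B_k with B_k in {-1,+1}."""
    residual = W.copy()
    Bs, alphas = [], []
    for _ in range(K):
        B = np.where(residual >= 0, 1.0, -1.0)   # Boolean kernel for this pass
        alpha = np.abs(residual).mean()          # L2-optimal scale when B = sign(residual)
        Bs.append(B)
        alphas.append(alpha)
        residual = residual - alpha * B
    return Bs, np.array(alphas)

def refit_alphas(W, Bs):
    """With the Boolean matrices fixed, the K scales minimizing
    ||W - sum_k alpha_k * B_k||_F^2 solve a K-dimensional least-squares problem."""
    A = np.stack([B.ravel() for B in Bs], axis=1)   # (m*n, K) design matrix
    alphas, *_ = np.linalg.lstsq(A, W.ravel(), rcond=None)
    return alphas

W = np.random.randn(256, 256)
Bs, alphas = init_kernels(W, K=3)
alphas = refit_alphas(W, Bs)                         # closed-form alpha update
W_hat = sum(a * B for a, B in zip(alphas, Bs))
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

Under the group quantization strategy (design 4), the same procedure would run independently per column group, giving each group its own \(\alpha_k\) at the cost of a few extra scalars.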

Loss & Training

Finetuning employs the standard language modeling cross-entropy loss:

\[\mathcal{L} = -\sum_{t} \log P(x_t | x_{<t}; \{B_k, \alpha_k\})\]

Training procedure:

  1. Initialization: starting from the pretrained weights, determine the initial multi-kernel Boolean matrices via SVD or greedy search
  2. Finetuning: alternately optimize the Boolean matrices and scaling factors on a small-scale dataset (a sketch of the Boolean update step appears below)
  3. Large-scale training data is not required; typically a few thousand samples suffice for effective finetuning
  4. Memory consumption during finetuning is far lower than in conventional training-aware methods, since no full-precision latent weights are kept
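One plausible reading of the Boolean update in step 2, under a first-order model of the loss: flipping \(B_k[i,j]\) from \(s\) to \(-s\) changes the effective weight by \(-2\alpha_k s\), so the predicted loss change is \(g_{ij} \cdot (-2\alpha_k s)\) for gradient \(g\). This is a hedged sketch of that idea, not the paper's exact probabilistic rule:

```python
import numpy as np

def flip_step(Bs, alphas, grad, flip_rate=0.01, rng=None):
    """One probabilistic Boolean update: flip entries whose predicted
    first-order loss change is negative, with probability scaled by the
    size of the predicted improvement. No latent weights are stored."""
    rng = rng or np.random.default_rng(0)
    for B, alpha in zip(Bs, alphas):
        delta = grad * (-2.0 * alpha * B)      # predicted loss change per flip
        gain = np.where(delta < 0, -delta, 0.0)
        if gain.max() <= 0:
            continue                           # no flip improves the loss
        p = flip_rate * gain / gain.max()      # larger expected gain -> likelier flip
        flips = rng.random(B.shape) < p
        B[flips] *= -1.0                       # the update stays in {-1,+1}
    return Bs
```

An outer loop would alternate such flip steps with the closed-form `refit_alphas` above, matching the alternating optimization described in the method.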

Key Experimental Results

Main Results

Evaluated on multiple LLM architectures (including the LLaMA family), measuring perplexity and downstream task performance:

| Method | Bit-Width | LLaMA-7B PPL ↓ | LLaMA-13B PPL ↓ | Compression Ratio |
|---|---|---|---|---|
| Full Precision | 16 bit | Baseline | Baseline | 1× |
| GPTQ | 3 bit | Moderate | Moderate | ~5× |
| RTN | 2 bit | High | High | ~8× |
| BiLLM | 1 bit | Very High | Very High | ~16× |
| OneBit | 1 bit | High | High | ~16× |
| Multi-Kernel Boolean (K=2) | ~1.5 bit | Low | Low | ~10× |
| Multi-Kernel Boolean (K=3) | ~2 bit | Lowest | Lowest | ~8× |

In the ultra-low-bit (1–2 bit) regime, the multi-kernel Boolean method significantly outperforms all existing binarization and quantization techniques.

Ablation Study

| Configuration | Effect |
|---|---|
| K=1 (standard binarization) | Worst performance, highest compression |
| K=2 | Large performance improvement, approaching 2-bit quantization |
| K=3 | Further improvement, competitive with 3-bit GPTQ |
| With vs. without latent weights | Direct Boolean-domain finetuning performs no worse than latent-weight-based methods |
| Group size 128 vs. 256 vs. full-layer | Smaller groups yield higher accuracy; 128 is the common choice |
| Different initialization strategies | SVD initialization outperforms random initialization |

Key Findings

  1. Multi-kernel Boolean substantially improves representational capacity: The perplexity reduction from \(K=1\) to \(K=2\) far exceeds the gain from moving from 2-bit to 3-bit quantization
  2. Eliminating latent weights is viable: Direct finetuning in the Boolean domain not only avoids performance loss but also simplifies the training pipeline and reduces memory consumption
  3. Advantage is most pronounced at ultra-low bit-widths: In the 1–2 bit range, multi-kernel Boolean methods hold the largest margin over conventional quantization
  4. Cross-architecture generalization: The method performs consistently across models of different scales, including LLaMA-7B and LLaMA-13B
  5. Training efficiency is notable: Eliminating full-precision latent weights reduces finetuning memory consumption by approximately 50%

Highlights & Insights

  1. Breaking the representational bottleneck of binarization: By combining multiple kernels, Boolean parameters are extended from 2 discrete values to \(2^K\) levels—a simple yet effective idea
  2. First direct finetuning in the Boolean domain: Eliminating reliance on full-precision latent weights represents an important theoretical and practical breakthrough, as it enables the entire training and inference pipeline to operate at extreme low precision
  3. Hardware-friendly: Multi-kernel Boolean multiplication reduces to \(K\) XNOR+popcount operations, enabling extremely high throughput on dedicated hardware (see the sketch after this list)
  4. Theoretical elegance: Multi-kernel binarization can be viewed as a structured form of low-bit quantization in which each quantization level is determined by a Boolean combination, offering an alternative theoretical perspective
  5. Strong practicality: Requires minimal finetuning data, low memory footprint, and straightforward deployment
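To make the hardware-friendliness concrete, here is a toy XNOR+popcount dot product using Python ints as bit vectors. Real kernels pack bits into machine words, but the identity is the same: for \(\{-1,+1\}\) vectors of length \(n\) with \(m\) matching positions, \(a \cdot b = 2m - n\). The helper names are illustrative:

```python
def pack(signs):
    """Pack a {-1,+1} vector into an int bitmask (+1 -> 1, -1 -> 0)."""
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    return bits

def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two {-1,+1} vectors of length n from their bitmasks:
    XNOR marks matching positions, popcount counts them, dot = 2m - n."""
    matches = bin(~(a_bits ^ b_bits) & ((1 << n) - 1)).count("1")
    return 2 * matches - n

a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
print(xnor_popcount_dot(pack(a), pack(b), len(a)))  # -> 0, same as sum(x*y)
```

A multi-kernel layer would run \(K\) such bit-level products per output and combine them with the \(\alpha_k\) scales, which is where the claimed throughput advantage comes from.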

Limitations & Future Work

  1. Inference speedup depends on specialized hardware: Although XNOR+popcount is theoretically fast, existing GPU hardware support for Boolean operations is limited, and practical speedups may fall short of theoretical projections
  2. Diminishing returns for larger K: Improvements beyond \(K=4\) may be marginal, while additional kernels increase parallelism demands
  3. Validated only on language models: Applicability to vision models, multimodal models, and other domains requires further experimentation
  4. Impact of finetuning data selection: The paper does not thoroughly analyze how different finetuning datasets affect final performance
  5. Integration with knowledge distillation: Guiding a Boolean student model with a full-precision teacher may further improve performance
  6. Boolean activations: The current work binarizes weights only; activations remain full-precision. Full Booleanization (weights + activations) is a more aggressive direction for future exploration
Related Work

  • BiLLM / OneBit: Existing LLM binarization methods using single-kernel Boolean representations, which suffer from severe performance degradation
  • GPTQ / AWQ: Post-training quantization methods supporting 3–4 bits, with poor support for binarization
  • BinaryBERT / BiBERT: Early work on binarizing BERT, but at a scale far smaller than modern LLMs
  • QLoRA: Quantization combined with low-rank adaptation, but quantization precision is generally no lower than 4 bits

The core insight of this paper is to treat quantization as a representation problem in Boolean space, rather than a simple precision truncation. Multi-kernel Boolean parameters are fundamentally a structured approach to maximizing representational capacity under an extreme low-bit budget. Future directions include adaptive kernel allocation (different \(K\) for different layers), integration with sparsification, and empirical measurement of actual acceleration on inference hardware.

Rating

  • Novelty: ⭐⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐