What Makes an Ensemble (Un)interpretable?¶

Conference: ICML 2025
arXiv: 2506.08216
Code: None
Area: Interpretability
Keywords: ensemble learning, interpretability, model complexity, feature importance, transparency

TL;DR¶

This paper systematically investigates the interpretability of ensemble learning methods: which factors make ensemble models hard to interpret, and how to improve ensemble interpretability while maintaining predictive performance. It proposes a theoretical framework for quantifying ensemble interpretability and practical methods for constructing interpretable ensembles.

Background & Motivation¶

Background¶

Background: Ensemble methods (boosting, bagging, random forest, etc.) are among the most successful ML methods, but are often viewed as "black boxes".

Limitations of Prior Work: No systematic theory of ensemble interpretability exists. Two questions remain open: which factors make an ensemble uninterpretable, and whether interpretable base models imply an interpretable ensemble.

Key Challenge: Ensembles improve performance by combining simple models, but the combination process itself introduces complexity.

Goal: (1) Quantify ensemble interpretability; (2) Identify influencing factors; (3) Propose interpretable ensemble methods.

Key Insight: Define ensemble interpretability metrics based on the diversity, consistency, and redundancy of base models.

Core Idea: The uninterpretability of ensembles primarily stems from inconsistent feature usage among base models—when different base models rely on different features and lack structured patterns, the ensemble becomes uninterpretable.

Method¶

Overall Architecture¶

Define interpretability metrics -> Analyze influencing factors -> Propose methods for constructing interpretable ensembles.

Key Designs¶

Interpretability Metric: Based on the consistency of feature importances. If all base models use similar feature rankings, the ensemble is interpretable; if they differ, it is uninterpretable. This is formalized as a concentration measure of feature importance vectors.
Analysis of Influencing Factors: (a) Number of base models: more models lead to lower interpretability because inconsistencies accumulate; (b) Data heterogeneity: training on different subsets leads to divergent feature selection; (c) Boosting vs. bagging: boosting is more interpretable because sequential corrections drive feature usage to converge.
Interpretable Ensemble Methods: (a) Feature alignment regularization: encouraging base models to use similar features; (b) Pruning redundant base models: removing base models whose feature usage is inconsistent with the majority; (c) Post-hoc feature importance aggregation: combining the base models' feature importances with weights so that the global explanation is consistent.
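Methods (b) and (c) above can be illustrated with a small sketch. The summary does not say how "inconsistent with the majority" is measured or how the aggregation weights are chosen, so everything here is an assumption for illustration: the majority profile is taken as the element-wise median of the importance vectors, agreement is measured by cosine similarity, and the function names and the `min_similarity` threshold are hypothetical.

```python
import math
from statistics import median


def cosine_similarity(u, v):
    """Cosine similarity between two importance vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def majority_profile(importances):
    """Assumed stand-in for 'the majority': element-wise median across base models."""
    return [median(column) for column in zip(*importances)]


def prune_inconsistent(importances, min_similarity=0.8):
    """Return indices of base models whose feature usage agrees with the majority profile.

    min_similarity is a hypothetical threshold, not a value from the paper.
    """
    profile = majority_profile(importances)
    return [
        i for i, fi in enumerate(importances)
        if cosine_similarity(fi, profile) >= min_similarity
    ]


def aggregate_importances(importances, weights):
    """Weighted average of importance vectors, one weight per base model."""
    total = sum(weights)
    n_features = len(importances[0])
    return [
        sum(w * fi[k] for w, fi in zip(weights, importances)) / total
        for k in range(n_features)
    ]


# Toy data: three trees agree on feature 0, one tree relies almost entirely on feature 2.
importances = [
    [0.60, 0.30, 0.10],
    [0.50, 0.40, 0.10],
    [0.55, 0.35, 0.10],
    [0.05, 0.05, 0.90],
]
kept = prune_inconsistent(importances)
global_fi = aggregate_importances([importances[i] for i in kept], [1.0] * len(kept))
```

On this toy input the outlying fourth model is dropped and the aggregate is computed from the remaining three. The uniform weights are a placeholder; boosting weights or validation accuracies would be natural alternatives.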

Key Experimental Results¶

Interpretability Analysis¶

Ensemble Type	Interpretability Score	Accuracy
Random Forest (100 trees)	Low (0.3)	High
Bagging (10 trees)	Med (0.5)	High
AdaBoost (50 iterations)	Med-High (0.6)	High
Interpretable RF (Ours)	High (0.8)	High (drop <1%)

Ablation¶

Factor	Impact on Interpretability
Number of base models from 10→100	Interpretability decreased by 40%
Feature alignment regularization	Interpretability improved by 30%, accuracy dropped <2%
Pruning redundant base models	Interpretability improved by 25%, with 40% of base models removed

Key Findings¶

Boosting is inherently more interpretable than bagging—sequential learning causes feature usage to converge.
Ensembles with a small number of base models (10-20 trees) exhibit significantly better interpretability than large ones (100+).
Feature alignment regularization is the most effective method: in the ablation it improves interpretability by 30% while accuracy drops by less than 2%.

Highlights & Insights¶

"Inconsistent feature usage" is the root cause of uninterpretability—this insight is concise and instructive.
Boosting is more interpretable than bagging—counterintuitive but supported by theory (sequential correction forces feature alignment).
Practical methods (such as feature alignment regularization) are simple and easy to use, allowing direct integration into existing training pipelines.

Limitations & Future Work¶

The interpretability metric is based on feature importance and may not capture all explanatory dimensions (e.g., interaction effects).
The theoretical analysis is limited to linear or decision tree base models.
Analysis of deep ensembles (e.g., multi-model ensembles in NLP) is missing.
The relationship with post-hoc explanation methods like SHAP/LIME is not fully explored.
User studies are missing: it remains untested whether humans actually perceive ensembles with "consistent feature usage" as more interpretable.

Related Work¶

vs SHAP (Lundberg & Lee): Post-hoc explanation for single models; this work focuses on the intrinsic interpretability of the ensemble itself.
vs Interpretable ML (Rudin 2019): Advocates for inherently interpretable models; this work finds a balance within the ensemble framework.
vs Random Forest Feature Importance: Standard methods ignore inconsistency among base models; this work uses it as a core metric.

Rating¶

Novelty: ⭐⭐⭐⭐ The systematic study of ensemble interpretability is a new direction.
Experimental Thoroughness: ⭐⭐⭐⭐ Multi-scenario validation + ablation analysis.
Writing Quality: ⭐⭐⭐⭐ Clear.
Value: ⭐⭐⭐⭐ Substantial contribution to interpretable ML.
Overall: ⭐⭐⭐⭐ Interesting and practical work.

Supplementary Analysis¶

Mathematical Definition of Interpretability Metrics¶

An ensemble \(E\) contains \(M\) base models \(\{h_1, \dots, h_M\}\), with feature importance vectors \(fi_m = \mathrm{importance}(h_m)\).
- Feature Consistency: \(\mathrm{Consistency} = 1 - \mathrm{Var}(fi_1, \dots, fi_M) / \mathrm{max\_var}\)
- Effective Number of Base Models: Similar to Shannon entropy, measuring how many models "effectively participate" in the explanation.
- Redundancy: Mutual information of feature importances among base models; high redundancy means the same information is repeatedly used.
- Comprehensive Interpretability Score: \(\mathrm{Score} = \mathrm{Consistency} \cdot (1 - \mathrm{Redundancy})\)
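A minimal sketch of how these quantities could be computed is given below. The summary leaves several details open, so the following are assumptions rather than the paper's definitions: Var is taken as the mean squared Euclidean distance of the importance vectors from their mean, max_var is taken as 1 - 1/d (an upper bound on that variance when every vector is nonnegative and sums to 1 over d features), and the effective number of models is taken as the exponential of the Shannon entropy of the model weights. Redundancy is passed in as a number in [0, 1] because the summary does not specify how the mutual information is estimated. All function names are hypothetical.

```python
import math


def total_variance(importances):
    """Assumed Var(fi_1..fi_M): mean squared distance of each vector from the mean vector."""
    m = len(importances)
    center = [sum(column) / m for column in zip(*importances)]
    return sum(
        sum((a - c) ** 2 for a, c in zip(fi, center)) for fi in importances
    ) / m


def consistency(importances):
    """Consistency = 1 - Var / max_var, with max_var = 1 - 1/d as an assumed normalizer.

    Requires d >= 2 features and importance vectors that are nonnegative and sum to 1.
    """
    d = len(importances[0])
    if d < 2:
        raise ValueError("need at least two features")
    max_var = 1.0 - 1.0 / d
    return 1.0 - total_variance(importances) / max_var


def effective_models(weights):
    """Assumed 'effective number of base models': exp of the entropy of normalized weights."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return math.exp(-sum(p * math.log(p) for p in probs))


def interpretability_score(importances, redundancy):
    """Score = Consistency * (1 - Redundancy); redundancy is supplied by the caller."""
    return consistency(importances) * (1.0 - redundancy)


identical = [[0.7, 0.3], [0.7, 0.3], [0.7, 0.3]]   # every model uses the same features
disjoint = [[1.0, 0.0], [0.0, 1.0]]                 # models rely on different features
score_identical = interpretability_score(identical, redundancy=0.0)
score_disjoint = interpretability_score(disjoint, redundancy=0.0)
```

Under these assumptions, identical importance vectors give a consistency of 1 and fully disjoint one-hot vectors give 0, which matches the intended reading of the metric.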

Why Boosting is More Interpretable than Bagging¶

Bagging (e.g., RF): Each tree is trained independently on bootstrap samples → different samples emphasize different features → inconsistent feature usage.
Boosting (e.g., AdaBoost): Sequential learning, where each round focuses on misclassified samples from the previous round → all base models eventually converge to a similar feature focus.
Mathematical Description: In boosting, the sample weight distribution D_t at round t concentrates on the samples with the highest loss so far, so the features needed to fit those samples are prioritized by all subsequent models.
Empirical Validation: In experiments, the feature consistency score of AdaBoost is systematically higher than that of RF of the same depth.
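The reweighting mechanism behind the boosting argument can be made concrete with the standard discrete AdaBoost update for labels in {-1, +1}. This sketch shows only how D_t shifts toward misclassified samples; it does not by itself demonstrate that feature usage converges, which is the paper's claim as summarized above. The function name is hypothetical.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """One round of the discrete AdaBoost update.

    weights: current distribution D_t (nonnegative, sums to 1)
    y_true, y_pred: labels and weak-learner predictions in {-1, +1}
    Returns (D_{t+1}, alpha_t).
    """
    # Weighted error of the current weak learner under D_t.
    err = sum(w for w, y, p in zip(weights, y_true, y_pred) if y != p)
    if not 0.0 < err < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Correct samples are scaled by exp(-alpha), misclassified ones by exp(+alpha).
    unnormalized = [
        w * math.exp(-alpha * y * p) for w, y, p in zip(weights, y_true, y_pred)
    ]
    z = sum(unnormalized)
    return [w / z for w in unnormalized], alpha


# Five equally weighted samples; the weak learner misclassifies only the last one.
d_t = [0.2] * 5
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]
d_next, alpha = adaboost_reweight(d_t, y_true, y_pred)
```

After the update, the misclassified sample carries half of the total weight, so the next weak learner is pushed toward whichever features separate that sample.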

Detailed Design of Feature Alignment Regularization¶

When training the m-th base model, a regularization term is added:
- \(R_{align}(h_m) = \lambda \, \|fi_m - \bar{fi}\|^2\), where \(\bar{fi} = \frac{1}{m-1} \sum_{j<m} fi_j\) is the average feature importance of the previously trained models.
- \(\lambda\) controls the alignment strength: a large \(\lambda\) gives a more interpretable ensemble but may cause underfitting.
- In practice, \(\lambda\) between 0.01 and 0.1 yields the best results.
- Mutual information constraints can also be used instead of L2 regularization, but the computation is more complex.
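The penalty itself is straightforward to evaluate once the importance vectors are available. The sketch below computes only the regularization term; how it enters the training objective of a tree or other base learner is not described in this summary and is left out. Returning 0 for the first model, where no previous models exist and the average is undefined, is an assumption, as are the function name and the default `lam`.

```python
def alignment_penalty(fi_m, previous, lam=0.05):
    """R_align(h_m) = lam * ||fi_m - mean(previous)||^2.

    fi_m: feature importance vector of the model being trained
    previous: importance vectors fi_1..fi_{m-1} of already-trained models
    lam: alignment strength (0.05 is an illustrative default in the reported 0.01-0.1 range)
    """
    if not previous:
        # Assumption: the first base model has nothing to align with, so no penalty.
        return 0.0
    n = len(previous)
    mean_fi = [sum(column) / n for column in zip(*previous)]
    return lam * sum((a - b) ** 2 for a, b in zip(fi_m, mean_fi))


previous = [[0.6, 0.4], [0.4, 0.6]]          # average importance is [0.5, 0.5]
aligned = alignment_penalty([0.5, 0.5], previous, lam=0.1)
divergent = alignment_penalty([1.0, 0.0], previous, lam=0.1)
```

A candidate model whose importances match the running average incurs no penalty, while one that concentrates on a single feature is penalized in proportion to its squared distance from that average.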