🔄 Self-Supervised Learning

💬 ACL2026 · 1 paper note

LLMSurgeon: Diagnosing Data Mixture of Large Language Models: LLMSurgeon formalizes the question "what data was this LLM trained on?" as Data Mixture Surgery. It uses the soft confusion matrix of a proxy classifier to invert the domain distribution observed in generated text, which yields an estimate of the pre-training data mixture proportions while requiring access only to model outputs.
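
To make the confusion-matrix inversion concrete, here is a minimal sketch of one way such a correction could work. It is not taken from the paper: the function names `soft_confusion_matrix` and `estimate_mixture`, the least-squares solve, and the clip-and-renormalize step are all assumptions chosen for illustration. The idea is that if row i of a soft confusion matrix C holds the classifier's average predicted domain distribution for text truly from domain i, then the average prediction q on generated text is approximately Cᵀp, and the mixture p can be recovered by solving that linear system.

```python
import numpy as np


def soft_confusion_matrix(probs, labels, num_domains):
    """Hypothetical helper: row i is the mean predicted domain distribution
    for held-out texts whose true domain is i.

    probs  : (n_texts, num_domains) classifier probabilities
    labels : (n_texts,) true domain index of each text
    Assumes every domain has at least one labeled text.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    C = np.zeros((num_domains, num_domains))
    for i in range(num_domains):
        C[i] = probs[labels == i].mean(axis=0)
    return C


def estimate_mixture(C, gen_probs):
    """Hypothetical helper: estimate mixture proportions p from q ~= C.T @ p.

    C         : (num_domains, num_domains) soft confusion matrix
    gen_probs : (n_generated, num_domains) classifier probabilities on
                text sampled from the model under study
    """
    # Observed (biased) domain distribution of the generated text.
    q = np.asarray(gen_probs, dtype=float).mean(axis=0)
    # Least-squares solve of C.T @ p = q; an assumption, not the paper's solver.
    p, *_ = np.linalg.lstsq(C.T, q, rcond=None)
    # Project back to valid proportions: clip negatives, then renormalize.
    p = np.clip(p, 0.0, None)
    return p / p.sum()


if __name__ == "__main__":
    # Synthetic check with made-up numbers: three domains, known mixture.
    C = np.array([[0.80, 0.15, 0.05],
                  [0.10, 0.70, 0.20],
                  [0.05, 0.25, 0.70]])
    p_true = np.array([0.5, 0.3, 0.2])
    gen_probs = np.tile(C.T @ p_true, (4, 1))
    print(estimate_mixture(C, gen_probs))
```

The clip-and-renormalize step is a crude projection; with a noisy or ill-conditioned confusion matrix, a constrained solver over the probability simplex would likely be a better choice.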