AutoMalDesc: Large-Scale Script Analysis for Cyber Threat Research

Conference: AAAI 2026 · arXiv: 2511.13333 · Code: https://github.com/CrowdStrike/automaldesc · Area: Text Generation · Keywords: Malware Analysis, Self-Improving LLM, Self-Paced Learning, Static Analysis, Script Security

TL;DR

This paper proposes AutoMalDesc, an automated static analysis framework for malware classification and behavior description across five scripting languages. It employs an iterative self-paced learning pipeline: starting from 900 expert-annotated seed samples, it fine-tunes Llama-3.3-70B via LoRA to generate pseudo-labels, applies multi-stage quality filtering to obtain 101K samples, and trains a V2 model. The approach improves Batch script detection accuracy from 52.7% to 82.4%.

Background & Motivation

State of the Field

Background: Cybersecurity operations require static analysis of malicious scripts, paired with natural-language explanations of their behavior. Existing approaches rely on YARA rules and sandbox detonation, which offer limited coverage and depend heavily on human experts.

Limitations of Prior Work: (1) Expert annotation is prohibitively costly and difficult to scale across the large volume of variants spanning five scripting languages; (2) General-purpose LLMs exhibit insufficient understanding of malware behavior (the base model achieves only 52.7% on Batch script detection); (3) Large-scale, high-quality datasets of malicious script descriptions are lacking.

Key Challenge: High-quality annotated data is scarce, yet robust malware analysis models require large and diverse training sets.

Goal: Overcome the annotation bottleneck using a small seed set combined with a self-paced training strategy.

Key Insight: Sandbox behavioral reports serve as a knowledge bridge, converting runtime behavior into training signals.

Core Idea: Beginning from 900 expert-annotated seeds, iterative self-paced learning progressively expands the training set to over 100K high-quality malicious script analysis examples.

Method

Overall Architecture

Seed set (900) → LoRA fine-tuning V1 → V1 annotates 157K unlabeled scripts → Four-stage quality filtering (→101K) → V2 training. Three tasks: malware detection, language identification, and behavior description.
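
To make the flow concrete, here is a minimal Python sketch of the self-paced loop. The helpers train_lora, generate_labels, and quality_filter are hypothetical placeholders for the fine-tuning, inference, and filtering stages; the paper's actual implementation is not specified here.

```python
from typing import Callable, List

def self_paced_pipeline(
    seed_set: List[dict],
    unlabeled_scripts: List[str],
    train_lora: Callable,       # fine-tunes Llama-3.3-70B with LoRA
    generate_labels: Callable,  # runs inference to pseudo-label scripts
    quality_filter: Callable,   # the four-stage filter described below
):
    # Round 1: fine-tune on the 900 expert-annotated seeds.
    v1 = train_lora(data=seed_set, rank=8, alpha=16, epochs=11)
    # V1 pseudo-labels the unlabeled corpus (157K scripts in the paper).
    pseudo_labeled = generate_labels(v1, unlabeled_scripts)
    # Quality filtering keeps only high-confidence samples (~101K survive).
    curated = quality_filter(pseudo_labeled)
    # Round 2: retrain with a larger LoRA rank on seeds + curated data.
    v2 = train_lora(data=seed_set + curated, rank=16, alpha=32, epochs=13)
    return v2
```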

Key Designs

  1. Seed Dataset Construction: 900 scripts across five languages, verified via YARA rules, sandbox detonation, and expert validation. Descriptions are generated by Llama-3.3-70B conditioned on sandbox reports (\(\tau=0.3\)).

  2. Four-Stage Quality Filtering (sketched in code after this list): (1) Syntax check (parseable JSON); (2) Consensus check (label consistency across \(\tau=0.4/0.6/0.8\)); (3) Confidence check (logit probability \(\geq 90\%\)); (4) Coherence check (Phi-3.5-Mini validates summary–label alignment).

  3. Training Configuration: V1: LoRA rank=8, \(\alpha=16\), 11 epochs; V2: rank=16, \(\alpha=32\), 13 epochs. Context length of 16K.
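
Below is a minimal sketch of the four-stage filter. The per-sample schema (raw_output, label_t04/label_t06/label_t08, label_prob) and the coherence_check callable standing in for the Phi-3.5-Mini judge are assumptions for illustration, not the paper's exact interface.

```python
import json

def four_stage_filter(samples, coherence_check):
    kept = []
    for s in samples:
        # Stage 1: syntax check -- the model output must parse as JSON.
        try:
            parsed = json.loads(s["raw_output"])
        except json.JSONDecodeError:
            continue
        # Stage 2: consensus check -- labels sampled at temperatures
        # 0.4, 0.6, and 0.8 must all agree.
        labels = {s["label_t04"], s["label_t06"], s["label_t08"]}
        if len(labels) != 1:
            continue
        # Stage 3: confidence check -- logit-derived label probability >= 90%.
        if s["label_prob"] < 0.90:
            continue
        # Stage 4: coherence check -- an external judge model verifies
        # that the generated summary is consistent with the label.
        if not coherence_check(parsed["summary"], labels.pop()):
            continue
        kept.append(s)
    return kept
```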

Loss & Training

Standard language model cross-entropy loss with LoRA fine-tuning.
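
A minimal sketch of this setup with Hugging Face transformers and peft, using the reported V1 hyperparameters (rank 8, \(\alpha=16\)). The target_modules choice and the training stack are assumptions; the paper's exact configuration is not given here.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct"  # base model per the paper
)
lora_cfg = LoraConfig(
    r=8,              # V1 rank (V2 uses r=16)
    lora_alpha=16,    # V1 alpha (V2 uses 32)
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # an assumption; common default
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
# Training then minimizes the standard next-token cross-entropy loss,
# e.g. via transformers.Trainer with a causal-LM data collator.
```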

Key Experimental Results

Main Results

Malware detection accuracy (%):

Model   Bash   Batch   JS     PS     Python   Avg.
Base    92.4   52.7    90.8   94.2   89.8     83.1
V1      93.2   77.9    90.6   95.3   91.8     89.3
V2      96.3   82.4    92.2   95.3   92.6     91.5

Ablation Study

  • V3 exploration: label accuracy of 91.53% (on par with V2); performance saturates after two rounds.
  • McNemar's test: the V2-over-V1 improvement is statistically significant (\(p < 10^{-5}\)); see the sketch below.
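
A sketch of how such a paired comparison can be run with statsmodels; the 2×2 counts below are illustrative placeholders, not the paper's actual numbers.

```python
from statsmodels.stats.contingency_tables import mcnemar

# Rows: V1 correct / V1 wrong; columns: V2 correct / V2 wrong.
# Placeholder counts over the 3,600 paired test predictions.
table = [[3100, 40],   # both correct | only V1 correct
         [300, 160]]   # only V2 correct | both wrong

result = mcnemar(table, exact=False, correction=True)
print(f"statistic={result.statistic:.2f}, p-value={result.pvalue:.2e}")
# A p-value below 1e-5 would indicate the V2-over-V1 gain is
# statistically significant, as reported in the paper.
```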

Key Findings

  • Batch script malware detection improves from near-random (52.7%) to 82.4% — the most substantial gain observed.
  • Performance saturates after two iterations (V3 yields no significant improvement).

Highlights & Insights

  • Industrial Deployment: Developed by CrowdStrike, the framework is a real-world solution operating at the scale of 157K scripts.
  • Four-Stage Filtering as the Core Quality Guarantee: Multi-temperature consensus combined with logit confidence scoring and external-model coherence verification ensures training data quality.

Limitations & Future Work

  • Hallucination is not fully resolved (V2 still produces 14 hallucinated descriptions).
  • Coverage is limited to five scripting languages; binary malware is not addressed.
  • The approach demonstrates the applicability of self-paced learning to domains with scarce expert knowledge, with potential transfer to other security analysis tasks.

Rating

  • Novelty: ⭐⭐⭐⭐ The pipeline of sandbox reports → LLM seeds → self-paced expansion is novel and practical.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on 3,600 test samples with multi-round iteration comparisons and McNemar's statistical testing.
  • Writing Quality: ⭐⭐⭐⭐ Clear industrial paper style.
  • Value: ⭐⭐⭐⭐⭐ Directly serves the cybersecurity industry; code and data are open-sourced.