Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models
Gongbo Zhang, Wen Wang, Ye Tian, Li Yuan
“A distillation framework enables training small diffusion language models from larger ones with different architectures and tokenizers by adapting to timestep-dependent noise and cross-tokenizer objectives.”
Why it matters
Diffusion language models are appealing because they generate all tokens in parallel (faster inference than autoregressive models) and see bidirectional context. But they still demand billions of parameters to be competitive. Distillation usually cuts inference cost by reducing steps within one model type, or shrinks the model within one architecture. The real bottleneck nobody solved: the teacher and student almost never share the same tokenizer or attention scheme in practice, yet knowledge transfer across those gaps was a dead zone. This matters because practitioners want small, fast diffusion models deployed on edge hardware, but they inherit pretrained teachers with arbitrary design choices. TIDE unblocks that workflow.
Method
- Setup: Distill 8B dense and 16B mixture-of-experts (MoE) diffusion LLM teachers into a 0.6B student. Teacher and student differ in architecture, attention mechanism, and tokenizer. Evaluation spans eight benchmarks including code (HumanEval), math (MATH), and language understanding (MMLU).
- TIDAL (Timestep-Importance Dependent Adaptive Loss): Modulates the distillation loss weight jointly across training progress and diffusion timestep. The key insight: early timesteps (high noise) make teacher predictions unreliable, so gradients from those steps should be downweighted, while late timesteps (low noise) are more trustworthy. The weight is also modulated by a curriculum, since early training stages may benefit from a higher distillation weight (a sketch of one possible form follows this list).
- CompDemo (Complementary Demonstration): Enriches the teacher's context by splitting masks complementarily before each forward pass. Under heavy masking (where the student must predict many tokens), the teacher can struggle; CompDemo patches this by ensuring the teacher sees diverse mask patterns during distillation, improving its ability to provide useful gradients in sparse-prediction regimes (one plausible implementation is sketched after this list).
- Reverse CALM (Chunk-level Approximate Likelihood Matching, inverted): Adapts CALM to handle tokenizer mismatch. Instead of matching chunk-level likelihoods directly (which requires aligned vocabularies), Reverse CALM inverts the objective, yielding bounded gradients and filtering noise from both ends. This lets the student learn from a teacher with a different BPE or SentencePiece vocabulary (a hedged sketch follows the list).
- The three components are modular: TIDAL can be swapped for uniform weighting, CompDemo for standard masking, and Reverse CALM for single-tokenizer objectives. The authors run ablations to isolate each component's contribution.
- Training uses a knowledge distillation loss that matches student predictions to frozen teacher outputs; gradients flow only through the student. Diffusion timesteps are sampled uniformly, and TIDAL reweights the resulting gradients. No compute cost (e.g., wall-clock training time or GPU hours) is mentioned.
- Assumes teacher and student both use diffusion (masked language modeling) training. Does not address distillation from autoregressive teachers to diffusion students or vice versa, though the mention of heterogeneous pipelines suggests some architectural flexibility.
- Benchmarks: HumanEval, MATH, MMLU, GSM8K, and four others (exact list not in abstract). Baseline comparisons include an autoregressive (AR) 0.6B model, presumably dense only.
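The abstract does not give TIDAL's functional form. Below is a minimal sketch of what joint timestep-and-progress weighting could look like, assuming a polynomial term that downweights high-noise timesteps and a linear curriculum over training; `alpha`, `w0`, `w1`, and the convention that `t` near 1 means heavy noise are hypothetical, not taken from the paper.

```python
import torch

def tidal_weight(t: torch.Tensor, step: int, total_steps: int,
                 alpha: float = 2.0, w0: float = 1.0, w1: float = 0.3) -> torch.Tensor:
    """Per-sample distillation weight (illustrative, not the paper's formula).

    t     -- diffusion timesteps in [0, 1]; assume t near 1 means heavy noise,
             where teacher predictions are least reliable.
    step  -- current optimizer step, driving the curriculum schedule.
    """
    timestep_term = (1.0 - t).clamp(min=0.0) ** alpha   # downweight high noise
    progress = step / max(total_steps, 1)
    curriculum_term = w0 + (w1 - w0) * progress         # linear decay w0 -> w1
    return timestep_term * curriculum_term

# Usage in a distillation loop, with per-sample KD losses of shape [batch]:
# t = torch.rand(batch_size)                # uniform timestep sampling
# w = tidal_weight(t, step, total_steps)
# loss = (w * kd_loss_per_sample).mean()    # TIDAL reweighting
```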
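CompDemo is described only as enriching the teacher's context via complementary mask splitting. One plausible reading, sketched below under that assumption: split the masked positions into two complementary halves and run the teacher twice, each pass revealing one half from the clean sequence so the other half is predicted with denser context. `teacher` is assumed to be a callable returning per-position logits; none of these names come from the paper.

```python
import torch

def complementary_teacher_logits(teacher, x_masked, x_clean, mask):
    """Hypothetical CompDemo-style teacher pass (one reading of the abstract).

    x_masked -- [B, L] token ids with mask tokens at `mask` positions
    x_clean  -- [B, L] ground-truth token ids
    mask     -- [B, L] bool, True where the student must predict
    """
    # Randomly split the masked positions into two complementary halves.
    coin = torch.rand(mask.shape, device=mask.device) < 0.5
    split_a, split_b = mask & coin, mask & ~coin

    # Pass A: reveal split_b from the clean sequence so split_a is predicted
    # with denser context; pass B is the complement.
    logits_a = teacher(torch.where(split_b, x_clean, x_masked))  # [B, L, V]
    logits_b = teacher(torch.where(split_a, x_clean, x_masked))

    # Take each pass's predictions where that pass kept the position masked.
    return torch.where(split_a.unsqueeze(-1), logits_a, logits_b)
```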
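The abstract specifies Reverse CALM only by its properties: inverted chunk-level likelihood matching, bounded gradients, dual-end noise filtering. Below is a sketch of a loss with those properties, assuming per-chunk log-likelihoods have already been computed under each model's own tokenizer (summing subword log-probs over the same text span, which makes the comparison tokenizer-agnostic). The clamp bounds `low`/`high` are hypothetical, and the exact sense in which the paper inverts CALM is not recoverable from the abstract.

```python
import torch

def reverse_calm_loss(ll_student: torch.Tensor, ll_teacher: torch.Tensor,
                      low: float = -5.0, high: float = 5.0) -> torch.Tensor:
    """Bounded chunk-likelihood matching (illustrative, not the paper's loss).

    ll_student, ll_teacher -- [num_chunks] log-likelihoods of the same text
    chunks under each model's own tokenizer.
    """
    # Clamp the per-chunk gap: chunks where either model is an extreme outlier
    # get zero gradient (a form of dual-end filtering), and inside the bounds
    # the gradient magnitude is at most 2 * max(|low|, |high|).
    gap = (ll_teacher.detach() - ll_student).clamp(low, high)
    return gap.pow(2).mean()
```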
Result
The distilled 0.6B student outperforms the baseline by an average of 1.53 points across eight benchmarks. Code generation (HumanEval) shows the largest gap: 48.78 vs. 32.3 for an autoregressive baseline of the same size, a margin of roughly 16.5 points, suggesting diffusion's parallel structure and bidirectional context are particularly valuable for code tasks. On language understanding and math (MMLU; GSM8K, implied by the benchmark suite), gains are more modest but consistent. The paper positions this as beating the AR baseline; the emphasis is on bridging the cross-architecture gap, not on beating prior single-architecture distillation methods on speed. No wall-clock inference time or distillation training cost is reported in the abstract.
Caveats
The abstract does not report how the student compares to an undistilled 0.6B diffusion baseline or to other diffusion models of similar size, so it is unclear whether the boost comes solely from the better distillation technique or partly from inheriting diffusion's structural advantages. The compute cost of distillation (e.g., how many more FLOPs than training a 0.6B model from scratch?) is not reported. TIDAL and Reverse CALM introduce hyperparameters (e.g., timestep-weighting schedules, mask complementarity patterns) whose sensitivity is not explored in the abstract. Tokenizer mismatch is addressed, but the abstract does not clarify how large a gap the method can bridge: if teacher and student vocabularies overlap by only 10%, does Reverse CALM still work? The method is limited to diffusion-to-diffusion transfer; extending to AR-to-diffusion or diffusion-to-AR is left open. Evaluation uses established benchmarks but no new code-generation or structured-prediction tasks designed to stress diffusion's bidirectional context.
Builds on
- Minixhofer et al., 2025 (Universal cross-tokenizer distillation via approximate likelihood matching): Introduces CALM (chunk-level approximate likelihood matching) for matching likelihoods across different tokenizers. TIDE adapts and inverts this into Reverse CALM, flipping the optimization direction to yield bounded gradients and dual-end noise filtering for the discrete-diffusion case.
- Shing et al., 2025 (TAID: temporally adaptive interpolated distillation for efficient knowledge transfer in language models): Proposes adaptive weighting of the distillation loss over training stages. TIDE extends this idea with per-timestep adaptation (TIDAL), recognizing that diffusion timesteps introduce noise-dependent reliability variation that autoregressive models don't have.
- Arriola et al., 2025 (Block diffusion: interpolating between autoregressive and diffusion language models): Explores hybrid AR-diffusion architectures. Relevant context for TIDE's position in the broader landscape of mixing model types, though TIDE focuses on pure diffusion-to-diffusion distillation rather than interpolation.
Original abstract
Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross-architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross-architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; and (3) Reverse CALM, a cross-tokenizer objective that inverts chunk-level likelihood matching, yielding bounded gradients and dual-end noise filtering. Distilling 8B dense and 16B MoE teachers into a 0.6B student via two heterogeneous pipelines outperforms the baseline by an average of 1.53 points across eight benchmarks, yielding notable gains in code generation, where HumanEval scores reach 48.78 compared to 32.3 for the AR baseline.