Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models
Gongbo Zhang, Wen Wang, Ye Tian, Li Yuan
“A distillation framework enables training small diffusion language models from larger ones with different architectures and tokenizers by adapting to timestep-dependent noise and cross-tokenizer objectives.”
Why it matters
Diffusion language models are appealing because they generate all tokens in parallel (faster inference than autoregressive models) and see bidirectional context. But they still demand billions of parameters to be competitive. Distillation usually cuts inference cost by reducing steps within one model type, or shrinks the model within one architecture. The real bottleneck nobody solved: the teacher and student almost never share the same tokenizer or attention scheme in practice, yet knowledge transfer across those gaps was a dead zone. This matters because practitioners want small, fast diffusion models deployed on edge hardware, but they inherit pretrained teachers with arbitrary design choices. TIDE unblocks that workflow.
Method
- Setup: Distill 8B dense and 16B mixture-of-experts (MoE) diffusion LLM teachers into a 0.6B student. Teacher and student differ in architecture, attention mechanism, and tokenizer. Evaluation spans eight benchmarks including code (HumanEval), math (MATH), and language understanding (MMLU).
- TIDAL (Timestep-Importance Dependent Adaptive Loss): Modulates the distillation loss weight jointly across training progress and diffusion timestep. The key insight: early timesteps (high noise) make teacher predictions unreliable, so gradients from those steps should be downweighted, while late timesteps (low noise) are more trustworthy. The weight is also modulated by a curriculum, since early training stages may benefit from a higher distillation weight (a sketch of one possible form follows this list).
- CompDemo (Complementary Demonstration): Enriches the teacher's context by splitting masks complementarily before each forward pass. Under heavy masking (where the student must predict many tokens), the teacher can struggle; CompDemo patches this by ensuring the teacher sees diverse mask patterns during distillation, improving its ability to provide useful gradients in sparse-prediction regimes (one plausible implementation is sketched after this list).
- Reverse CALM (Chunk-level Approximate Likelihood Matching, inverted): Adapts CALM to handle tokenizer mismatch. Instead of matching chunk-level likelihoods directly (which requires aligned vocabularies), Reverse CALM inverts the objective, yielding bounded gradients and filtering noise from both ends. This lets the student learn from a teacher with a different BPE or SentencePiece vocabulary (a hedged sketch follows the list).
- The three components are modular: TIDAL can be swapped for uniform weighting, CompDemo for standard masking, and Reverse CALM for single-tokenizer objectives. The authors run ablations to isolate each component's contribution.
- Training uses a knowledge distillation loss that matches student predictions to frozen teacher outputs; gradients flow only through the student. Diffusion timesteps are sampled uniformly, and TIDAL reweights the resulting gradients. No compute cost (e.g., wall-clock training time or GPU hours) is mentioned.
- Assumes teacher and student both use diffusion (masked language modeling) training. Does not address distillation from autoregressive teachers to diffusion students or vice versa, though the mention of heterogeneous pipelines suggests some architectural flexibility.
- Benchmarks: HumanEval, MATH, MMLU, GSM8K, and four others (exact list not in abstract). Baseline comparisons include an autoregressive (AR) 0.6B model, presumably dense only.
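The abstract does not give TIDAL's functional form. Below is a minimal sketch of what joint timestep-and-progress weighting could look like, assuming a polynomial term that downweights high-noise timesteps and a linear curriculum over training; `alpha`, `w0`, `w1`, and the convention that `t` near 1 means heavy noise are hypothetical, not taken from the paper.

```python
import torch

def tidal_weight(t: torch.Tensor, step: int, total_steps: int,
                 alpha: float = 2.0, w0: float = 1.0, w1: float = 0.3) -> torch.Tensor:
    """Per-sample distillation weight (illustrative, not the paper's formula).

    t     -- diffusion timesteps in [0, 1]; assume t near 1 means heavy noise,
             where teacher predictions are least reliable.
    step  -- current optimizer step, driving the curriculum schedule.
    """
    timestep_term = (1.0 - t).clamp(min=0.0) ** alpha   # downweight high noise
    progress = step / max(total_steps, 1)
    curriculum_term = w0 + (w1 - w0) * progress         # linear decay w0 -> w1
    return timestep_term * curriculum_term

# Usage in a distillation loop, with per-sample KD losses of shape [batch]:
# t = torch.rand(batch_size)                # uniform timestep sampling
# w = tidal_weight(t, step, total_steps)
# loss = (w * kd_loss_per_sample).mean()    # TIDAL reweighting
```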
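CompDemo is described only as enriching the teacher's context via complementary mask splitting. One plausible reading, sketched below under that assumption: split the masked positions into two complementary halves and run the teacher twice, each pass revealing one half from the clean sequence so the other half is predicted with denser context. `teacher` is assumed to be a callable returning per-position logits; none of these names come from the paper.

```python
import torch

def complementary_teacher_logits(teacher, x_masked, x_clean, mask):
    """Hypothetical CompDemo-style teacher pass (one reading of the abstract).

    x_masked -- [B, L] token ids with mask tokens at `mask` positions
    x_clean  -- [B, L] ground-truth token ids
    mask     -- [B, L] bool, True where the student must predict
    """
    # Randomly split the masked positions into two complementary halves.
    coin = torch.rand(mask.shape, device=mask.device) < 0.5
    split_a, split_b = mask & coin, mask & ~coin

    # Pass A: reveal split_b from the clean sequence so split_a is predicted
    # with denser context; pass B is the complement.
    logits_a = teacher(torch.where(split_b, x_clean, x_masked))  # [B, L, V]
    logits_b = teacher(torch.where(split_a, x_clean, x_masked))

    # Take each pass's predictions where that pass kept the position masked.
    return torch.where(split_a.unsqueeze(-1), logits_a, logits_b)
```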
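The abstract specifies Reverse CALM only by its properties: inverted chunk-level likelihood matching, bounded gradients, dual-end noise filtering. Below is a sketch of a loss with those properties, assuming per-chunk log-likelihoods have already been computed under each model's own tokenizer (summing subword log-probs over the same text span, which makes the comparison tokenizer-agnostic). The clamp bounds `low`/`high` are hypothetical, and the exact sense in which the paper inverts CALM is not recoverable from the abstract.

```python
import torch

def reverse_calm_loss(ll_student: torch.Tensor, ll_teacher: torch.Tensor,
                      low: float = -5.0, high: float = 5.0) -> torch.Tensor:
    """Bounded chunk-likelihood matching (illustrative, not the paper's loss).

    ll_student, ll_teacher -- [num_chunks] log-likelihoods of the same text
    chunks under each model's own tokenizer.
    """
    # Clamp the per-chunk gap: chunks where either model is an extreme outlier
    # get zero gradient (a form of dual-end filtering), and inside the bounds
    # the gradient magnitude is at most 2 * max(|low|, |high|).
    gap = (ll_teacher.detach() - ll_student).clamp(low, high)
    return gap.pow(2).mean()
```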
Result
The distilled 0.6B student outperforms the baseline by an average of 1.53 points across eight benchmarks. Code generation (HumanEval) shows the largest gap: 48.78 vs. 32.3 for an autoregressive baseline of the same size, a margin of roughly 16.5 points, suggesting diffusion's parallel structure and bidirectional context are particularly valuable for code tasks. On language understanding and math (MMLU; GSM8K, implied by the benchmark suite), gains are more modest but consistent. The paper positions this as beating the AR baseline; the emphasis is on bridging the cross-architecture gap, not on beating prior single-architecture distillation methods on speed. No wall-clock inference time or distillation training cost is reported in the abstract.
Caveats
The abstract does not report how the student compares to an undistilled 0.6B diffusion baseline or to other diffusion models of similar size, so it is unclear whether the boost comes solely from the better distillation technique or partly from inheriting diffusion's structural advantages. The compute cost of distillation (e.g., how many more FLOPs than training a 0.6B model from scratch?) is not reported. TIDAL and Reverse CALM introduce hyperparameters (e.g., timestep-weighting schedules, mask complementarity patterns) whose sensitivity is not explored in the abstract. Tokenizer mismatch is addressed, but the abstract does not clarify how large a gap the method can bridge: if teacher and student vocabularies overlap by only 10%, does Reverse CALM still work? The method is limited to diffusion-to-diffusion transfer; extending to AR-to-diffusion or diffusion-to-AR is left open. Evaluation uses established benchmarks but no new code-generation or structured-prediction tasks designed to stress diffusion's bidirectional context.
Builds on
- Minixhofer et al., 2025 (Universal cross-tokenizer distillation via approximate likelihood matching): Introduces CALM (chunk-level approximate likelihood matching) for matching likelihoods across different tokenizers. TIDE adapts and inverts this into Reverse CALM, flipping the optimization direction to yield bounded gradients and dual-end noise filtering for the discrete-diffusion case.
- Shing et al., 2025 (TAID: temporally adaptive interpolated distillation for efficient knowledge transfer in language models): Proposes adaptive weighting of the distillation loss over training stages. TIDE extends this idea with per-timestep adaptation (TIDAL), recognizing that diffusion timesteps introduce noise-dependent reliability variation that autoregressive models don't have.
- Arriola et al., 2025 (Block diffusion: interpolating between autoregressive and diffusion language models): Explores hybrid AR-diffusion architectures. Relevant context for TIDE's position in the broader landscape of mixing model types, though TIDE focuses on pure diffusion-to-diffusion distillation rather than interpolation.
Original abstract
Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross-architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross-architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; and (3) Reverse CALM, a cross-tokenizer objective that inverts chunk-level likelihood matching, yielding bounded gradients and dual-end noise filtering. Distilling 8B dense and 16B MoE teachers into a 0.6B student via two heterogeneous pipelines outperforms the baseline by an average of 1.53 points across eight benchmarks, yielding notable gains in code generation, where HumanEval scores reach 48.78 compared to 32.3 for the AR baseline.