Perceptum Subscribe
SUN MAY 03
Daily issue

Percepti's daily briefing.

Five fresh ML, stats, and signal-processing papers worth your coffee.

2604.27906v1 · Apr 30 · AI · Language

From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction

Alex Petrov, Alexander Gusak, Denis Mukha, Dima Korolev

Instead of letting AI assistants remember things by searching through old chat transcripts, this paper builds memory like a database with strict rules about what to store.

Why it matters

If you've used an AI assistant that forgets your preferences, contradicts itself, or invents details about you, you've felt the limits of today's memory designs. Production agents — think customer service bots, personal assistants, coding copilots — need to track changing facts (your address, your current project status, who reports to whom) reliably. Treating memory like a search engine over old text works for vague recall ('what did we discuss last month?') but fails for crisp questions ('what is the user's shipping address?'). This paper argues that memory should look more like a well-maintained spreadsheet than a diary, and shows that this shift gives big accuracy gains.

Method

  • — Define a schema up front: a structured list of what objects exist (e.g., 'person', 'project'), what fields each has (e.g., 'email', 'deadline'), and which fields must never be guessed.
  • — When new information arrives (a chat message, a document), run a multi-step write process: first detect what objects are present, then which fields are mentioned, then extract the actual values.
  • — Each step has validation gates that check the output against the schema. If something fails, the system retries locally instead of corrupting the record.
  • — A 'judge-in-the-loop' configuration adds an extra model pass to double-check extracted values before they're committed.
  • — Reading from memory becomes a structured query (like asking a database) rather than asking the model to re-read old text and infer answers each time.
  • — The system, called xmemory, is tested on both a structured extraction benchmark and an end-to-end memory benchmark where an agent has to handle facts, updates, deletions, and unknown values.
  • — Caveat: it requires designing schemas in advance, which is more upfront engineering than just dumping text into a vector store.

Result

On a structured extraction benchmark, the judge-in-the-loop version reaches about 90% accuracy at correctly identifying objects and about 63% at producing fully correct outputs, beating frontier models with built-in structured output features. On an end-to-end memory benchmark, xmemory hits 97.1% F1, while competing systems land between 80% and 87%. On a higher-level application task, xmemory scores 95.2% accuracy, beating both specialized memory products and harnesses built around top commercial AI assistants. The headline finding: for memory tasks that demand stable facts, the architecture of how you write and validate matters more than how big or smart the underlying model is.

Caveats

The approach assumes you can define a useful schema ahead of time, which works for known domains (CRM, support, project tracking) but is harder for open-ended exploration where you don't know what matters yet. The benchmarks are partly the authors' own, so independent replication will be important. Schema-grounded memory also has more moving parts — extraction, validation, retries — meaning higher cost per write and more engineering to maintain as needs evolve. Skeptics will point out that schemas can become stale or wrong, and that real users often say things the schema didn't anticipate. The paper does not fully address how schemas evolve over time or how the system degrades when input doesn't fit any known object type.

Builds on

  • Chhikara et al., 2025

    Mem0 is a recent production memory system for AI agents that the authors compare against as a baseline. xmemory differs by enforcing schema-grounded writes and validation gates rather than relying primarily on retrieval over stored text.

  • Lewis et al., 2020

    Retrieval-augmented generation (RAG) is the dominant pattern this paper pushes back against. RAG fetches relevant text chunks at read time; xmemory instead does the heavy interpretation at write time so reads are clean queries over verified records.

  • Maharana et al., 2024

    LoCoMo introduced a benchmark for very long-term conversational memory in agents. The authors use this lineage to evaluate end-to-end memory and argue retrieval-only systems underperform on stateful operations.

  • Madaan et al., 2023

    Self-Refine showed that letting models iteratively critique and revise their own outputs improves quality. xmemory borrows this iterative-refinement idea but channels it through schema validation rather than free-form self-feedback.

Original abstract

Persistent AI memory is often reduced to a retrieval problem: store prior interactions as text, embed them, and ask the model to recover relevant context later. This design is useful for thematic recall, but it is mismatched to the kinds of memory that agents need in production: exact facts, current state, updates and deletions, aggregation, relations, negative queries, and explicit unknowns. These operations require memory to behave less like search and more like a system of record. This paper argues that reliable external AI memory must be schema-grounded. Schemas define what must be remembered, what may be ignored, and which values must never be inferred. We present an iterative, schema-aware write path that decomposes memory ingestion into object detection, field detection, and field-value extraction, with validation gates, local retries, and stateful prompt control. The result shifts interpretation from the read path to the write path: reads become constrained queries over verified records rather than repeated inference over retrieved prose. We evaluate this design on structured extraction and end-to-end memory benchmarks. On the extraction benchmark, the judge-in-the-loop configuration reaches 90.42% object-level accuracy and 62.67% output accuracy, above all tested frontier structured-output baselines. On our end-to-end memory benchmark, xmemory reaches 97.10% F1, compared with 80.16%-87.24% across the third-party baselines. On the application-level task, xmemory reaches 95.2% accuracy, outperforming specialised memory systems, code-generated Markdown harnesses, and customer-facing frontier-model application harnesses. The results show that, for memory workloads requiring stable facts and stateful computation, architecture matters more than retrieval scale or model strength alone.

xmemory replaces retrieval-style AI memory with a schema-grounded write path that decomposes ingestion into object, field, and value extraction with validation gates, achieving 97.1% F1 on an end-to-end memory benchmark.

Why it matters

Most current 'long-term memory' for LLM agents — Mem0, LoCoMo-style systems, vanilla RAG with vector stores — is fundamentally a dense retrieval pipeline: store conversation turns, embed them, fetch top-k at query time, and let the model re-interpret. This works for thematic recall ('what did we talk about?') but breaks on the operations production agents actually need: exact-fact lookup, state updates, deletes, aggregation, relations, negative queries ('does the user have any open tickets?'), and explicit unknowns. The paper's thesis — that memory must behave like a system of record, not a search index — directly addresses where retrieval-based memory hits a wall, and the strong end-to-end numbers (97.1% F1 vs 80–87% for baselines) suggest the architectural shift is doing real work that scaling models alone won't fix.

Method

  • — Schema definition: developers (or an automated process) define an explicit schema specifying object types, fields per object, and per-field constraints. Crucially, the schema marks which fields must never be inferred — values must be extracted verbatim or left as explicit unknowns.
  • — Iterative write path decomposed into three stages: (1) object detection — identify which schema objects are referenced in an input; (2) field detection — for each detected object, determine which fields are addressed; (3) field-value extraction — extract the actual value for each detected field. Each stage is a separate prompted call with narrower scope than monolithic JSON extraction.
  • — Validation gates at each stage check outputs against the schema (type, enum membership, format constraints from JSON Schema 2020-12). On failure, the system performs local retries — re-prompting just the failed sub-task rather than rerunning the whole pipeline.
  • — Stateful prompt control: the prompts at later stages are conditioned on validated outputs from earlier stages, narrowing the search space and reducing the chance of the model fabricating fields not in the schema.
  • Judge-in-the-loop variant: an additional LLM pass evaluates extracted values for correctness, used in the highest-accuracy configuration on the extraction benchmark.
  • — Read path: queries hit the structured store as constrained operations (lookups, filters, aggregations) over verified records. Interpretation has been shifted left — to write time — so reads are deterministic given the records.
  • — Evaluation on three tracks: (a) a structured-extraction benchmark vs frontier structured-output baselines (presumably OpenAI/Anthropic/Google function-calling or JSON-mode), (b) an end-to-end memory benchmark with the operations listed above, (c) an application-level task comparing against specialized memory systems (Mem0), Markdown-file harnesses produced via code generation, and customer-facing frontier-model assistant harnesses.
  • — Assumptions: a useful schema can be defined ahead of time; the cost of multiple LLM calls per write is acceptable; the domain is stable enough that schema evolution is manageable.
  • — Caveats acknowledged in framing: this design trades upfront schema engineering and write-time compute for read-time reliability.

Result

On the structured-extraction benchmark, the judge-in-the-loop xmemory configuration reaches 90.42% object-level accuracy and 62.67% output accuracy, surpassing all tested frontier structured-output baselines. The gap between object-level and output accuracy (≈28 points) suggests the system is good at identifying what's there but stricter end-to-end extraction is harder — consistent with field-value extraction being the bottleneck. On the end-to-end memory benchmark, xmemory achieves 97.10% F1, compared to a 80.16–87.24% range across third-party baselines — a roughly 10–17 point improvement, which is large for a memory benchmark covering updates, deletes, and unknowns where retrieval systems typically fail. On the application-level task, xmemory hits 95.2% accuracy, beating specialized memory systems (Mem0-style), code-generated Markdown harnesses (a strong simple baseline where the agent writes notes to files), and customer-facing frontier-model application harnesses. The authors' takeaway: for memory workloads dominated by stable-fact and stateful-computation requirements, architectural choices dominate raw model strength or retrieval scale.

Caveats

Several concerns a teammate should flag: (1) Self-built benchmarks are part of the evaluation, so the comparison against Mem0/LoCoMo-style baselines may not be fully apples-to-apples — independent replication on shared benchmarks like LoCoMo with identical splits would strengthen the case. (2) The approach inherits all the costs of multi-stage extraction: latency and token spend per write are substantially higher than 'embed and store', and the judge-in-the-loop variant compounds this. The paper does not appear to report cost or latency curves vs accuracy. (3) Schema design is the elephant in the room — schemas need maintenance, must evolve as user needs change, and the paper cites schema-evolution work (Hernández Chillón et al.) but does not benchmark how xmemory degrades when the schema is mis-specified or incomplete. (4) Out-of-schema content is by construction discarded or stored lossily; for open-ended assistants this is a real limitation. (5) The 62.67% output accuracy on extraction is honestly not high — production use will likely require domain-specific schema tuning. (6) No reported ablations (in the abstract) on which component matters most: object-vs-field decomposition, validation gates, local retries, or the judge. (7) Performance on negative queries and explicit-unknown handling is the strongest theoretical claim but the abstract doesn't break out per-operation numbers. Likely pushback: retrieval advocates will note that hybrid systems (RAG plus structured extraction) are common and that xmemory's gains may shrink against well-tuned hybrids; large-context advocates will argue long-context models with careful prompting can match this without schema overhead.

Builds on

  • Chhikara et al., 2025

    Mem0 is a production-oriented long-term memory system for AI agents and a primary baseline. xmemory differs by treating writes as schema-validated structured extraction rather than relying on retrieval over stored memory items, and reports substantially higher end-to-end F1.

  • Lewis et al., 2020

    RAG is the architectural pattern this paper positions itself against. xmemory inverts the typical RAG balance by moving interpretation from read time (retrieve-then-generate) to write time (extract-then-validate), arguing that retrieval is mismatched to stateful memory operations.

  • Madaan et al., 2023

    Self-Refine demonstrated iterative self-feedback as a way to improve LLM outputs. xmemory uses iterative refinement but constrains it via schema validation gates and local retries, rather than free-form critique, which makes the loop converge to verifiable structure.

  • Maharana et al., 2024

    LoCoMo introduced a long-term conversational memory benchmark. The paper uses this lineage of end-to-end memory evaluation and argues that retrieval-based systems benchmarked on LoCoMo-style tasks systematically underperform on operations beyond thematic recall.

Original abstract

Persistent AI memory is often reduced to a retrieval problem: store prior interactions as text, embed them, and ask the model to recover relevant context later. This design is useful for thematic recall, but it is mismatched to the kinds of memory that agents need in production: exact facts, current state, updates and deletions, aggregation, relations, negative queries, and explicit unknowns. These operations require memory to behave less like search and more like a system of record. This paper argues that reliable external AI memory must be schema-grounded. Schemas define what must be remembered, what may be ignored, and which values must never be inferred. We present an iterative, schema-aware write path that decomposes memory ingestion into object detection, field detection, and field-value extraction, with validation gates, local retries, and stateful prompt control. The result shifts interpretation from the read path to the write path: reads become constrained queries over verified records rather than repeated inference over retrieved prose. We evaluate this design on structured extraction and end-to-end memory benchmarks. On the extraction benchmark, the judge-in-the-loop configuration reaches 90.42% object-level accuracy and 62.67% output accuracy, above all tested frontier structured-output baselines. On our end-to-end memory benchmark, xmemory reaches 97.10% F1, compared with 80.16%-87.24% across the third-party baselines. On the application-level task, xmemory reaches 95.2% accuracy, outperforming specialised memory systems, code-generated Markdown harnesses, and customer-facing frontier-model application harnesses. The results show that, for memory workloads requiring stable facts and stateful computation, architecture matters more than retrieval scale or model strength alone.

xmemory operationalizes schema-grounded memory by decomposing ingestion into staged object/field/value extraction with validation gates and local retries, shifting interpretive load from read time to write time and reporting 97.10% F1 on an end-to-end memory benchmark.

Why it matters

The agentic-memory literature has been converging on retrieval-centric designs (Mem0, LoCoMo-style benchmarks, GraphRAG variants) that inherit the well-documented pathologies of long-context inference: lost-in-the-middle (Liu et al., 2024), context rot (Hong et al., 2025), and the read-time hallucination dynamics analyzed by Kalai et al. (2025). These pathologies are particularly damaging for the operations production agents actually need — exact-fact CRUD, aggregation, negation, explicit unknowns — which are not naturally expressible as similarity search. The paper's framing (interpretation belongs on the write path) recasts memory design in terms compatible with database semantics: validated records, constrained queries, deterministic reads. If the empirical claims hold up under independent replication, this is a substantive shift in how to architect agent memory, comparable to the move from grep-style IR to relational databases for transactional workloads. The application-level result (95.2% beating specialized memory systems and frontier-assistant harnesses) is the strongest signal that the architectural argument has teeth.

Method

  • — System decomposition: writes are factored into a pipeline of (i) object detection — given an input segment, identify schema objects referenced; (ii) field detection — per detected object, identify mentioned fields; (iii) field-value extraction — extract per-field values, with per-field constraints (verbatim, enum, type, format). Each stage is a separate prompted call.
  • — Validation gates: outputs at each stage are validated against JSON Schema 2020-12 constraints. Failures trigger local retries scoped to the failing sub-task with adjusted prompts (presumably with the failure mode surfaced as feedback). This localizes the cost of failure rather than restarting the full pipeline, and avoids the global re-extraction cost characteristic of grammar-aligned decoding (Park et al., 2024) or single-shot constrained generation.
  • — Stateful prompt control: later stages are conditioned on validated outputs of earlier stages, which both narrows the candidate space and prevents the value-extraction stage from inventing fields the schema does not permit. This is a structured analogue of self-refinement (Madaan et al., 2023) but with verifiable termination criteria.
  • — Explicit-unknown discipline: the schema designates fields where inference is forbidden — values must be extractable verbatim or marked unknown. This directly engages with the hallucination-as-overconfidence framing of Kalai et al. (2025) by removing the option to fabricate.
  • — Judge-in-the-loop configuration: an additional LLM pass scores extracted values pre-commit. Used in the highest-accuracy benchmark configuration. Functions as an external verifier akin to LLM-as-judge patterns; the abstract does not specify whether the judge has access to the source segment, the schema, or both, nor whether it can trigger retries vs only veto.
  • — Read path: structured queries (lookup, filter, aggregate, negate, exists) over verified records. Reads are deterministic given the store state, which is the property that enables stateful operations like deletes and updates to compose correctly.
  • — Evaluation: three benchmarks — (a) a structured extraction benchmark with frontier structured-output baselines; (b) an end-to-end memory benchmark presumably covering CRUD, aggregation, negation, unknowns; (c) an application-level task with specialized memory systems (Mem0), code-generated Markdown harnesses (a notably strong simple baseline), and frontier-model assistant harnesses.
  • — Reported numbers: judge-in-the-loop reaches 90.42% object-level / 62.67% output accuracy on extraction; 97.10% F1 on end-to-end memory vs 80.16–87.24% baseline range; 95.2% on the application-level task.
  • — Stated assumptions: schema can be specified a priori; multi-call write cost is acceptable for the workload; domain stability is sufficient that schema evolution does not dominate operational cost.
  • — Implicit assumptions worth flagging: (i) the input segments are sufficiently bounded that field-detection has manageable cardinality; (ii) the schema captures enough of the semantic surface that out-of-schema content is genuinely discardable; (iii) the judge is uncorrelated enough with the extractor to add signal — this is the standard concern with same-family LLM-as-judge configurations.

Result

Headline metrics: 90.42% object-level / 62.67% output accuracy on structured extraction (judge-in-the-loop), claimed above 'all tested frontier structured-output baselines'; 97.10% F1 on the end-to-end memory benchmark vs 80.16–87.24% across third-party baselines; 95.2% on the application-level task vs Mem0-class systems, code-generated Markdown harnesses, and frontier assistant harnesses. Several observations: (1) The ~28-point gap between object-level (90.42%) and output (62.67%) accuracy on the extraction benchmark is the most informative single number — it locates the bottleneck firmly in field-value extraction rather than object recognition, and is consistent with prior reports that JSON-mode/structured-output models still struggle with verbatim value fidelity at scale. (2) The 10–17 point F1 gap on end-to-end memory is large enough to be meaningful even discounting benchmark-construction effects, but per-operation breakdowns (updates vs deletes vs negation vs unknowns) are not visible in the abstract and would be the most diagnostic data. (3) Outperforming code-generated Markdown harnesses is non-trivial: writing notes to a Markdown file is a surprisingly strong baseline for many memory tasks (it preserves verbatim text, supports updates via overwrite, and inherits the model's natural language reading skills), so a clean win here is the strongest evidence that schema grounding adds real capability beyond 'just write it down'. (4) Beating customer-facing frontier-model assistant harnesses suggests the gap is not closable by raw model strength alone within the tested horizon, supporting the architecture-over-scale framing that aligns with He et al. (2025) on agentic system design.

Caveats

Methodological concerns and missing analyses: (1) Benchmark provenance — at least one of the benchmarks appears to be authors' own; without released splits and protocols, the claim that xmemory beats baselines on a self-constructed benchmark is weak even if true. The community standard would be reporting on LoCoMo (Maharana et al., 2024), MEMTRACK (Deshpande et al., 2025), and Letta's benchmarking suite with identical configurations. (2) Cost/latency reporting — the architecture has a high write-side compute multiplier (3+ stages, retries, optional judge). Production adoption hinges on the cost-accuracy frontier, which the abstract does not characterize. A pareto plot vs Mem0/RAG would be essential. (3) Ablation gap — without component ablations the contribution attribution is unclear. The four candidate sources of gain (object/field decomposition, validation gates, local retries, judge) need to be teased apart; the 90.42 vs 62.67 split hints that judge is doing heavy lifting on object-level metrics specifically. (4) Schema dependence and brittleness — no reported analysis of degradation under schema misspecification, schema drift, or out-of-distribution inputs. The Hernández Chillón et al. (2024) line of NoSQL schema evolution is cited but apparently not experimentally engaged. (5) Judge correlation risk — if the judge is the same model family as the extractor, gains may be partially illusory under same-family bias; cross-family judge ablation would address this. (6) Output accuracy at 62.67% is not deployment-grade for many enterprise extraction settings, which suggests the technique still requires domain-specific schema engineering and possibly fine-tuning to be production-ready. (7) The 'never infer' discipline is rhetorically clean but operationally fragile: real users coreference, paraphrase, and abbreviate, so the boundary between 'extract verbatim' and 'lightly normalize' will leak in practice. (8) No discussion (in the abstract) of provenance tracking — Buneman et al. (2001) is cited and would be a natural fit for justifying which extraction supports which record, but operational provenance behavior is unclear. (9) The end-to-end benchmark presumably mixes operations; without per-op F1 it's hard to know whether xmemory wins uniformly or only on a subset (suspicion: gains are concentrated on negation and explicit unknowns, where retrieval baselines structurally fail). Likely pushback: retrieval advocates will argue that hybrid retrieval+extraction systems would close most of the gap; long-context advocates will argue that frontier models with structured prompting and a million-token context can avoid the schema-engineering tax; database purists will (rightly) note that this is reinventing well-understood ETL pipelines with LLMs in the loop and ask whether the LLM is even necessary in the extraction stages once the schema is fixed. Failure modes to probe: schema-incomplete inputs (information silently dropped), schema-mis-specified inputs (information mis-categorized), high-cardinality field detection (the field-detection stage's prompt grows with schema size — scaling behavior unclear), adversarial inputs designed to trick the value-extractor into spurious verbatim quoting. Follow-up directions worth pursuing: (a) automated schema induction from interaction logs to amortize the schema-engineering cost; (b) information-theoretic accounting along the lines of He et al. (2025) and Shannon (1948) — specifically, characterizing the bits of interpretation moved from read to write and the implied compression ratio; (c) integration with provenance tracking (Buneman et al., 2001) and grammar-aligned decoding (Park et al., 2024) for the value-extraction stage; (d) a principled treatment of schema evolution that preserves prior records under schema changes; (e) comparison to OneKE (Luo et al., 2025) and Dagdelen et al. (2024) on shared structured-extraction benchmarks; (f) hybrid designs where the structured store is the primary memory but a small RAG channel handles open-ended thematic recall, formalizing the boundary between system-of-record and system-of-engagement memory.

Builds on

  • Chhikara et al., 2025

    Mem0 is the canonical production-oriented LLM agent memory system and a head-to-head baseline. xmemory diverges by inverting the read/write balance: where Mem0 emphasizes scalable retrieval over stored memory items, xmemory enforces schema-validated structured writes so that reads degenerate to constrained queries.

  • Madaan et al., 2023

    Self-Refine established iterative self-feedback as an inference-time improvement loop. xmemory adapts this idea but constrains the loop with schema-validation termination criteria and localizes retries to failing sub-tasks, converting an open-ended critique loop into a bounded, verifiable refinement.

  • Kalai et al., 2025

    Argues that hallucination is partially a calibration/incentive problem — models prefer plausible answers to admitting unknowns. xmemory operationalizes a counter-incentive at the system level by making 'unknown' a first-class schema value and forbidding inference on protected fields.

  • Maharana et al., 2024

    LoCoMo formalized very-long-term conversational memory evaluation. The paper inherits this evaluation framing for end-to-end memory tasks and uses it to argue retrieval-only memory systems systematically underperform on stateful operations beyond thematic recall.

Original abstract

Persistent AI memory is often reduced to a retrieval problem: store prior interactions as text, embed them, and ask the model to recover relevant context later. This design is useful for thematic recall, but it is mismatched to the kinds of memory that agents need in production: exact facts, current state, updates and deletions, aggregation, relations, negative queries, and explicit unknowns. These operations require memory to behave less like search and more like a system of record. This paper argues that reliable external AI memory must be schema-grounded. Schemas define what must be remembered, what may be ignored, and which values must never be inferred. We present an iterative, schema-aware write path that decomposes memory ingestion into object detection, field detection, and field-value extraction, with validation gates, local retries, and stateful prompt control. The result shifts interpretation from the read path to the write path: reads become constrained queries over verified records rather than repeated inference over retrieved prose. We evaluate this design on structured extraction and end-to-end memory benchmarks. On the extraction benchmark, the judge-in-the-loop configuration reaches 90.42% object-level accuracy and 62.67% output accuracy, above all tested frontier structured-output baselines. On our end-to-end memory benchmark, xmemory reaches 97.10% F1, compared with 80.16%-87.24% across the third-party baselines. On the application-level task, xmemory reaches 95.2% accuracy, outperforming specialised memory systems, code-generated Markdown harnesses, and customer-facing frontier-model application harnesses. The results show that, for memory workloads requiring stable facts and stateful computation, architecture matters more than retrieval scale or model strength alone.

2604.28156v1 · Apr 30 · Robotics · AI · Machine Learning

FlexiTac: A Low-Cost, Open-Source, Scalable Tactile Sensing Solution for Robotic Systems

Binghao Huang, Yunzhu Li

FlexiTac is a cheap, open-source touch-sensor kit that snaps onto robot grippers so robots can feel what they're handling.

Why it matters

Robots are great at seeing but bad at feeling. Vision alone struggles with tasks like plugging in a cable, picking up a soft fruit without crushing it, or knowing when a grip is slipping. Good touch sensors usually cost a lot or require a specialist to build, which keeps tactile robotics out of reach for most labs and startups. FlexiTac is a 'plug-in' kit anyone can build with off-the-shelf parts and use today, lowering the barrier to giving robots a sense of touch and making research results easier to reproduce across different teams.

Method

  • — Sensor pads are built like a thin sandwich: two flexible printed circuit boards (FPCs) on the outside with electrode patterns printed on them, and a pressure-sensitive material called Velostat in the middle.
  • — When you press on the pad, the Velostat's electrical resistance changes at that spot. By reading the grid of electrodes, you get a heatmap of where and how hard the pad is being touched.
  • — Because the electrodes are printed directly into the FPCs (instead of being hand-wired), pads are quick to manufacture and consistent from one to the next.
  • — A small custom circuit board reads all the electrode channels at once and streams the data to a computer at 100 times per second over a normal serial cable.
  • — The pads come in different shapes - small fingertip versions and larger 'mat' versions - and bend easily, so they fit on rigid two-finger grippers, soft grippers, or larger surfaces without redesigning the robot.
  • — The authors show the system works with several modern robot-learning recipes, including combining touch with 3D vision, transferring skills between different robots, and training in simulation before deploying on the real robot.
  • — Everything (designs, firmware, software) is released as open source so others can build it themselves.

Result

The paper demonstrates that FlexiTac can be mounted on multiple robot platforms without major mechanical changes and that it plugs into modern tactile-learning pipelines: fusing touch with 3D vision for contact-aware decisions, transferring learned skills between different robot bodies, and doing 'real-to-sim-to-real' training where a policy is fine-tuned in a fast GPU-based touch simulator before going back to hardware. The abstract emphasizes practicality - low cost, fast to fabricate, 100 Hz streaming, repeatable builds - rather than head-to-head benchmark numbers, so concrete accuracy or success-rate figures are not given in the abstract itself.

Caveats

Piezoresistive films like Velostat are known to drift over time, respond differently at different temperatures, and can suffer from hysteresis (the reading depends partly on what just happened, not only the current pressure). The abstract doesn't quantify durability, calibration stability, or spatial resolution compared with camera-based sensors like GelSight or DIGIT, which generally give richer contact information. 'Low-cost and open-source' is a real win, but the proof will be in whether other labs can actually reproduce the pads with comparable performance and whether the sensors hold up over thousands of grasps. Expect pushback from groups using high-resolution vision-based tactile sensors who will want apples-to-apples comparisons on standard manipulation benchmarks.

Builds on

  • Huang et al, 2024

    Earlier work (3D-ViTac) by one of the same authors showed that combining 3D vision with tactile signals helps robots do fine manipulation. FlexiTac provides a cheaper, more scalable hardware backbone for that kind of visuo-tactile learning.

  • Bhirangi et al, 2024

    AnySkin pushed the idea of plug-and-play tactile skins for robots. FlexiTac shares the 'easy to attach, easy to replace' philosophy but uses a piezoresistive FPC-Velostat-FPC stack aimed at dense pressure maps and high fabrication throughput.

  • Lambeta et al, 2020

    DIGIT is a popular low-cost vision-based tactile sensor. FlexiTac targets a different trade-off: thinner, more flexible, easier to scale to large areas, at the cost of the rich image-like signal a camera-based sensor provides.

  • Luo et al, 2021

    Showed that piezoresistive textile sensors can capture human-environment interactions at scale. FlexiTac applies a similar materials philosophy to robot end-effectors with FPC-integrated electrodes for repeatable manufacturing.

Original abstract

We present FlexiTac, a low-cost, open-source, and scalable piezoresistive tactile sensing solution designed for robotic end-effectors. FlexiTac is a practical "plug-in" module consisting of (i) thin, flexible tactile sensor pads that provide dense tactile signals and (ii) a compact multi-channel readout board that streams synchronized measurements for real-time control and large-scale data collection. FlexiTac pads adopt a sealed three-layer laminate stack (FPC-Velostat-FPC) with electrode patterns directly integrated into flexible printed circuits, substantially improving fabrication throughput and repeatability while maintaining mechanical compliance for deployment on both rigid and soft grippers. The readout electronics use widely available, low-cost components and stream tactile signals to a host computer at 100 Hz via serial communication. Across multiple configurations, including fingertip pads and larger tactile mats, FlexiTac can be mounted on diverse platforms without major mechanical redesign. We further show that FlexiTac supports modern tactile learning pipelines, including 3D visuo-tactile fusion for contact-aware decision making, cross-embodiment skill transfer, and real-to-sim-to-real fine-tuning with GPU-parallel tactile simulation. Our project page is available at https://flexitac.github.io/.

FlexiTac is an open-source, FPC-laminated piezoresistive tactile sensor system with a 100 Hz multi-channel readout, designed as a drop-in module for modern visuo-tactile learning.

Why it matters

Tactile sensing is widely acknowledged to help in contact-rich manipulation, but the field is fragmented across incompatible hardware: vision-based gels (GelSight, DIGIT, OmniTact) give rich images but are bulky and per-finger; magnetic skins (ReSkin, AnySkin) are thin but indirect; capacitive/piezoresistive arrays scale in area but historically suffer fabrication variability. This makes it hard to share datasets, reproduce policies, or scale data collection. FlexiTac targets exactly that gap: a piezoresistive solution whose electrodes are integrated into FPCs for batch manufacturing, with a documented readout board, an open simulator hook, and demonstrations on multiple robot embodiments. If the hardware reproduces well across labs, it could become a default 'tactile commodity' for visuo-tactile learning research, similar to how RealSense became default for depth.

Method

  • — Sensor stack: a sealed three-layer laminate (FPC top electrodes - Velostat piezoresistive interlayer - FPC bottom electrodes). Pressure on a taxel locally reduces the through-thickness resistance of the Velostat, which is sensed as a voltage/current change at the crossing of a row and column electrode.
  • — Electrode integration: row/column electrode patterns are routed directly in the FPC artwork rather than glued or hand-soldered. This is the main fabrication-throughput claim - pads come out of standard FPC manufacturing rather than bench assembly, improving repeatability.
  • — Form factors: at least two configurations are described - smaller fingertip pads for parallel-jaw or soft grippers, and larger tactile mats for palm-sized or surface-mounted use. Mechanical compliance from the FPC stack lets them wrap rigid and soft end-effectors without major redesign.
  • — Readout electronics: a compact multi-channel board built from widely available, low-cost components multiplexes the row/column scan, digitizes the per-taxel signals, and streams synchronized frames to a host PC over serial at 100 Hz. The abstract does not specify channel count or bit depth.
  • — Software/learning integration: the system is shown to feed into (a) 3D visuo-tactile fusion pipelines for contact-aware policies, (b) cross-embodiment skill transfer (same sensor pads on different grippers), and (c) real-to-sim-to-real fine-tuning using a GPU-parallel tactile simulator. This last point implies a differentiable or at least batched contact-force-to-taxel rendering pipeline compatible with parallel RL/IL training.
  • — Open-sourcing: hardware designs, electronics, and software are released, with the explicit aim of reproducibility and community adoption.
  • — Calibration and signal processing details (drift compensation, hysteresis modeling, temperature behavior) are not described in the abstract; these are typical pain points for Velostat-based sensors and will matter for downstream users.

Result

The headline results in the abstract are systems-level rather than benchmark numbers. FlexiTac is shown to (i) be mountable on diverse robot platforms - rigid parallel-jaw and soft grippers, multiple end-effector geometries - without mechanical redesign; (ii) stream synchronized multi-channel tactile data at 100 Hz suitable for real-time control; and (iii) plug into three nontrivial learning regimes: 3D visuo-tactile fusion for contact-aware decision making, cross-embodiment skill transfer, and real-to-sim-to-real fine-tuning with GPU-parallel tactile simulation. The abstract does not report task success rates, spatial resolution in mm, force range/sensitivity in N, signal-to-noise figures, or unit cost numbers, so the quantitative case rests on the project page and full paper rather than the abstract. Given the body length (~5800 words), one should expect at least qualitative comparisons in those sections and likely demonstration tasks where FlexiTac-equipped grippers outperform vision-only baselines on contact-heavy manipulation.

Caveats

Several concerns are likely to come up in review. First, Velostat is well-known for drift, hysteresis, temperature sensitivity, and limited dynamic range; the abstract does not discuss calibration strategy, lifetime under repeated grasping, or how the sealed laminate handles shear and edge loading. Second, while integrating electrodes into FPCs improves repeatability vs hand-built arrays, unit-to-unit Velostat variability and aging may still be the dominant noise source - real reproducibility numbers across a batch would be the right ablation. Third, 'low-cost' and 'scalable' need quantification: BOM cost, yield, and the fabrication time per pad relative to ReSkin/AnySkin or capacitive textile arrays. Fourth, the 100 Hz figure is fine for many manipulation tasks but modest compared with vision-based tactile sensors and may be limiting for slip detection or fast contact transients; channel count and latency matter too. Fifth, the visuo-tactile learning demos need apples-to-apples comparisons - ideally the same task and policy class with FlexiTac vs DIGIT/GelSight/AnySkin - to show that the cheaper, lower-resolution signal is sufficient. Sixth, the GPU-parallel tactile simulator is a strong selling point but introduces its own sim-to-real gap that should be characterized; prior work (Narang et al, Bi et al, Church et al) shows this is nontrivial. Finally, the abstract's framing as a 'plug-in module' will invite scrutiny of how it compares with AnySkin's plug-and-play pitch, especially on durability and replacement workflow.

Builds on

  • Huang et al, 2024

    3D-ViTac, by one of the same authors, established a pipeline for fine-grained manipulation by fusing 3D vision with dense tactile signals. FlexiTac is the natural hardware companion: a cheaper, more scalable sensor that feeds the same kind of visuo-tactile policy and is explicitly demonstrated on that pipeline.

  • Bhirangi et al, 2024

    AnySkin pushed plug-and-play magnetometer-based skins as a swappable tactile module. FlexiTac shares the modular 'mount it and go' philosophy but uses a piezoresistive FPC-Velostat-FPC stack aimed at dense pressure maps over larger areas with cheaper fabrication, trading magnetic sensitivity for areal scalability.

  • Lambeta et al, 2020

    DIGIT showed that a low-cost vision-based tactile sensor can support in-hand manipulation research. FlexiTac targets the opposite end of the trade space: lower spatial richness than a camera/gel system, but thinner, more flexible, and easier to deploy across heterogeneous end-effectors and large surfaces.

  • Huang et al, 2025

    VT-Refine demonstrated bimanual assembly via simulation fine-tuning with visuo-tactile feedback. FlexiTac's claim of GPU-parallel tactile simulation and real-to-sim-to-real fine-tuning sits in the same line of work, providing the hardware and simulator infrastructure to scale that approach more broadly.

Original abstract

We present FlexiTac, a low-cost, open-source, and scalable piezoresistive tactile sensing solution designed for robotic end-effectors. FlexiTac is a practical "plug-in" module consisting of (i) thin, flexible tactile sensor pads that provide dense tactile signals and (ii) a compact multi-channel readout board that streams synchronized measurements for real-time control and large-scale data collection. FlexiTac pads adopt a sealed three-layer laminate stack (FPC-Velostat-FPC) with electrode patterns directly integrated into flexible printed circuits, substantially improving fabrication throughput and repeatability while maintaining mechanical compliance for deployment on both rigid and soft grippers. The readout electronics use widely available, low-cost components and stream tactile signals to a host computer at 100 Hz via serial communication. Across multiple configurations, including fingertip pads and larger tactile mats, FlexiTac can be mounted on diverse platforms without major mechanical redesign. We further show that FlexiTac supports modern tactile learning pipelines, including 3D visuo-tactile fusion for contact-aware decision making, cross-embodiment skill transfer, and real-to-sim-to-real fine-tuning with GPU-parallel tactile simulation. Our project page is available at https://flexitac.github.io/.

2604.28139v1 · Apr 30 · Software Engineering · AI

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

Chenxin Li, Zhengyang Tang, Huangxin Lin, Yunlong Lin, Shijue Huang, Shengyuan Liu, Bowen Ye, Rang Li, Lei Li, Benyou Wang, Yixuan Yuan

A new benchmark tests AI assistants on realistic, regularly-updated office and computer tasks, and even the best model only finishes about two-thirds of them.

Why it matters

Companies are starting to hand real workflows over to AI agents, things like updating HR records, fixing files in a workspace, or coordinating across multiple business apps. If we only grade an agent on whether its final message sounds right, we miss whether it actually completed the job correctly. And if our test set never changes, agents (and the people training them) will quietly overfit to it. A benchmark that updates with what people actually need done, and that audits the agent's footprints, gives a much more honest picture of whether these systems are ready to be trusted with real work.

Method

  • — Builds a benchmark with two layers: a 'signal layer' that refreshes from public data about what workflows people actually want automated, and a frozen, time-stamped release snapshot so results stay reproducible
  • — Each release pulls from a Top-500 list of in-demand skills and turns them into controlled tasks with set-up files, fake business services, workspaces, and graders
  • — The current release has 105 tasks covering controlled business services (like HR or management workflows) and local workspace repair (fixing things on a computer)
  • — Grading watches the agent's execution traces, audit logs, the state of services it touched, and the files left behind, not just its final message
  • — Uses strict automatic checks where there's hard evidence, and only uses an LLM as a judge for fuzzy things like whether wording is appropriate
  • — Tests 13 frontier models under one shared public passing rule
  • — Caveat: the benchmark depends on how 'demand signals' are picked, and on the realism of the simulated business services

Result

No model is close to solved performance. The top model passes only 66.7% of tasks, and none break 70%. Failures cluster in specific areas: HR, management, and tasks that span multiple business systems are the hardest, while local workspace repair is easier but still not maxed out. The authors also point out that just looking at leaderboard rank is misleading, two models with similar pass rates can behave very differently on overall completion, and the tasks that actually distinguish models are concentrated in a middle 'medium-difficulty' band.

Caveats

The benchmark is only as good as the 'demand signals' it pulls from, if those signals are biased toward certain kinds of work, the benchmark will be too. The simulated business services are controlled fixtures, so they may not capture the messiness of real enterprise systems with weird permissions, flaky APIs, or unusual data. Using an LLM judge for semantic checks introduces some subjectivity, even if it's only used where deterministic checks can't reach. And because the benchmark refreshes over time, comparing scores across releases will need careful versioning. What needs proof next: that scores on this benchmark actually predict real-world deployment success, and that refreshing the task set genuinely prevents overfitting rather than just adding noise.

Builds on

  • Jimenez et al, 2024

    SWE-bench grades agents on resolving real GitHub issues with code-level checks. Claw-Eval-Live extends that 'verify the actual work, not just the answer' philosophy beyond coding, into broader business and workspace workflows, and adds a refreshable task layer.

  • Drouin et al, 2024

    WorkArena tests web agents on common knowledge-work tasks in a fixed environment. Claw-Eval-Live shares the focus on real workflow tasks but explicitly avoids freezing the task set, refreshing tasks from live demand signals.

  • Jain et al, 2024

    LiveCodeBench introduced the idea of a continuously refreshed, contamination-resistant coding benchmark. Claw-Eval-Live applies a similar live-update philosophy to workflow agents instead of code generation.

  • Pan et al, 2024

    WebCanvas benchmarks web agents in online environments with evolving content. Claw-Eval-Live borrows the 'evolving environment' instinct but evaluates structured business and workspace workflows with deterministic graders rather than live web pages.

Original abstract

LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow demand or verify whether a task was executed. We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer, updated across releases from public workflow-demand signals, from a reproducible, time-stamped release snapshot. Each release is constructed from public workflow-demand signals, with ClawHub Top-500 skills used in the current release, and materialized as controlled tasks with fixed fixtures, services, workspaces, and graders. For grading, Claw-Eval-Live records execution traces, audit logs, service state, and post-run workspace artifacts, using deterministic checks when evidence is sufficient and structured LLM judging only for semantic dimensions. The release contains 105 tasks spanning controlled business services and local workspace repair, and evaluates 13 frontier models under a shared public pass rule. Experiments reveal that reliable workflow automation remains far from solved: the leading model passes only 66.7% of tasks and no model reaches 70%. Failures are structured by task family and execution surface, with HR, management, and multi-system business workflows as persistent bottlenecks and local workspace repair comparatively easier but unsaturated. Leaderboard rank alone is insufficient because models with similar pass rates can diverge in overall completion, and task-level discrimination concentrates in a middle band of tasks. Claw-Eval-Live suggests that workflow-agent evaluation should be grounded twice, in fresh external demand and in verifiable agent action.

Claw-Eval-Live is a refreshable, action-verified benchmark for workflow agents where the best of 13 frontier models passes only 66.7% of 105 tasks.

Why it matters

Workflow automation is the current frontier for LLM agents, the pitch is that agents complete end-to-end units of work across SaaS tools, internal services, and local workspaces. But the field's evaluation infrastructure is mismatched to that pitch. Static benchmarks like SWE-bench-style snapshots invite contamination and overfitting, and answer-only grading rewards plausible-sounding outputs even when the agent did not actually mutate the right state. Claw-Eval-Live argues, with evidence, that workflow-agent evaluation needs to be 'grounded twice': in fresh external demand (so tasks track what real users want automated) and in verifiable agent action (so a pass implies the work was actually done). The headline finding, no frontier model exceeds 70%, suggests the field is overstating readiness for production workflow deployment.

Method

  • — Two-layer design: a refreshable signal layer that ingests public workflow-demand signals (e.g., a 'ClawHub Top-500' skill list for the current release) and a frozen, time-stamped release snapshot for reproducibility
  • — Each release materializes selected skills into controlled tasks with fixed fixtures, controlled business services, local workspaces, and graders, so reruns are deterministic given the snapshot
  • — Current release: 105 tasks split across two execution surfaces, controlled business services (HR, management, multi-system business workflows) and local workspace repair (file/system manipulation)
  • — Evaluation collects multi-source evidence: full execution traces, audit logs from the controlled services, post-run service state, and post-run workspace artifacts
  • — Grading hierarchy: deterministic checks (state diffs, audit-log assertions, artifact validation) wherever evidence is sufficient; structured LLM-as-judge only for semantic dimensions that can't be checked programmatically
  • — Single shared public pass rule across all evaluated models, with task-level pass/fail and a separate notion of 'overall completion' that captures partial progress
  • — Tested 13 frontier models under this rule
  • Refreshable vs frozen separation is the explicit answer to contamination and 'benchmark rot' — old releases stay citeable, new releases track current demand
  • — Assumptions: (a) public demand signals like ClawHub Top-500 are a reasonable proxy for what workflows matter, (b) controlled fixtures are faithful enough to surface real failure modes, (c) the LLM judge is reliable on the narrow semantic slices it's used for

Result

The leading model passes 66.7% of tasks; no model reaches 70%. Failures are not uniform — they're structured by task family and execution surface. HR, management, and multi-system business workflows are persistent bottlenecks, suggesting that cross-service coordination and stateful business logic are still hard. Local workspace repair is comparatively easier but unsaturated, meaning even the 'easier' surface has headroom. Two further findings sharpen the picture. First, leaderboard rank is insufficient: models with similar pass rates can diverge substantially on overall completion, indicating different failure profiles (e.g., one model partially completes many tasks, another fully completes a different subset). Second, task-level discrimination is concentrated in a middle band of tasks, easy tasks pass for everyone, hard tasks fail for everyone, and the signal sits in the middle, which has implications for how to grow the benchmark.

Caveats

Several limitations deserve attention. (1) Demand-signal validity: 'ClawHub Top-500' and similar public sources reflect a particular community's stated demand and may underweight regulated or proprietary workflows where real automation pain lives. (2) Fixture fidelity: controlled business services are inherently simplified versus real enterprise stacks (auth, rate limits, partial outages, schema drift). Agents that pass here may still fail on production-grade systems. (3) Judge reliability: even narrow LLM-judge usage introduces variance and potential model-family bias, especially when judging outputs from a sibling family. (4) Cross-release comparability: the very feature that makes Claw-Eval-Live live, refreshing the signal layer, complicates longitudinal claims about progress; readers will need clear release-versioned reporting. (5) The 105-task scale is small relative to the breadth of real workflows; concentration of discriminative power in a middle band suggests future releases should deliberately expand that band. (6) The paper, per the abstract, does not appear to report inter-judge agreement, deterministic-vs-judge coverage ratios, or human pass rates, all of which would strengthen claims. What needs proof next: external validity (does Claw-Eval-Live rank predict real deployment success?), contamination resistance (do refreshed releases actually degrade memorized-task performance?), and judge calibration (does the structured LLM judge agree with humans on the semantic slices?).

Builds on

  • Jimenez et al, 2024

    SWE-bench established execution-grounded grading for coding agents by running test suites against real GitHub issues. Claw-Eval-Live generalizes this 'verify by execution evidence' stance to non-code workflow surfaces (business services, workspace repair) and adds a refreshable task layer.

  • Jain et al, 2024

    LiveCodeBench introduced contamination-aware, continuously refreshed evaluation for code generation. Claw-Eval-Live ports the live-refresh principle to agentic workflow evaluation, separating a refreshable signal layer from a frozen release snapshot.

  • Drouin et al, 2024

    WorkArena evaluates web agents on common enterprise knowledge-work tasks. Claw-Eval-Live shares the enterprise-workflow target but emphasizes refreshable demand sourcing and multi-evidence grading rather than a fixed web-task suite.

  • Liu et al, 2024

    AgentBench provides a broad multi-environment evaluation of LLM agents with task-specific success criteria. Claw-Eval-Live narrows to workflow agents but deepens the grading layer with traces, audit logs, and service-state checks, and adds the refreshable signal layer AgentBench lacks.

Original abstract

LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow demand or verify whether a task was executed. We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer, updated across releases from public workflow-demand signals, from a reproducible, time-stamped release snapshot. Each release is constructed from public workflow-demand signals, with ClawHub Top-500 skills used in the current release, and materialized as controlled tasks with fixed fixtures, services, workspaces, and graders. For grading, Claw-Eval-Live records execution traces, audit logs, service state, and post-run workspace artifacts, using deterministic checks when evidence is sufficient and structured LLM judging only for semantic dimensions. The release contains 105 tasks spanning controlled business services and local workspace repair, and evaluates 13 frontier models under a shared public pass rule. Experiments reveal that reliable workflow automation remains far from solved: the leading model passes only 66.7% of tasks and no model reaches 70%. Failures are structured by task family and execution surface, with HR, management, and multi-system business workflows as persistent bottlenecks and local workspace repair comparatively easier but unsaturated. Leaderboard rank alone is insufficient because models with similar pass rates can diverge in overall completion, and task-level discrimination concentrates in a middle band of tasks. Claw-Eval-Live suggests that workflow-agent evaluation should be grounded twice, in fresh external demand and in verifiable agent action.

Claw-Eval-Live proposes a two-layer (refreshable signal + frozen release) workflow-agent benchmark with multi-evidence grading, finding 66.7% top pass rate across 13 frontier models on 105 tasks.

Why it matters

Agentic workflow automation is being commercialized aggressively (Claude Code, Codex, Hermes-style agents, MetaGPT-style orchestrators), and procurement decisions increasingly cite benchmark scores. Yet most agent benchmarks inherit assumptions from QA/code-generation evaluation: a frozen, curated task set and answer-string grading. Both assumptions break for workflows. Frozen sets invite contamination as model training corpora absorb leaked tasks and as developers iterate against the leaderboard; answer-string grading conflates 'said the right thing' with 'did the right thing,' a particularly dangerous conflation in environments with side effects. Claw-Eval-Live's framing, 'grounded twice, in fresh external demand and in verifiable agent action,' is a useful design principle the field should adopt, regardless of whether this specific instantiation becomes canonical. The strong empirical result that leaderboard rank diverges from overall completion is, to my reading, the most consequential finding: it implies that current single-number reporting is hiding large differences in agent failure profiles that matter operationally.

Method

  • — Architecture: two-layer benchmark, a refreshable signal layer ingesting public workflow-demand signals (current release uses a ClawHub Top-500 skill ranking) and a frozen, time-stamped release snapshot containing fixtures, services, workspaces, graders, and seed/config state
  • — Task materialization: selected skills are reified as controlled tasks with deterministic fixtures (initial service state, workspace contents, available tools); reproducibility is achieved by snapshotting all of these alongside the model interface
  • — Execution surfaces: two — (a) controlled business services (HR, management, multi-system business workflows), and (b) local workspace repair; this dichotomy roughly mirrors the SaaS-vs-OS axis emerging in agent benchmarks (cf. WorkArena vs Terminal-Bench)
  • — Evidence channels for grading: execution traces (tool calls and arguments), audit logs from controlled services, post-run service state, and post-run workspace artifacts. The grader composes assertions over these channels rather than over the agent's final natural-language reply
  • — Grader hierarchy: deterministic checks (state diffs, audit-log predicates, artifact validators, idempotence/no-side-effect assertions) take precedence; structured LLM-as-judge is used only on irreducibly semantic dimensions (e.g., 'is the drafted email tone appropriate?'). The structuring presumably constrains the judge to rubric-style outputs, though the abstract does not detail the rubric design
  • — Pass rule: a single shared public pass rule across all 13 frontier models in the release; the abstract also references an 'overall completion' notion distinct from binary pass, suggesting either weighted partial credit or a graded sub-task decomposition
  • — Scale: 105 tasks in the current release. Small relative to SWE-bench (~2k) but appropriate given the cost of authoring fully-instrumented controlled environments per task
  • — Implicit assumptions worth flagging: (a) public demand signals like ClawHub Top-500 capture deployment-relevant workflow distribution, (b) controlled fixtures are faithful enough to surface production-relevant failure modes, (c) the LLM judge is calibrated and unbiased on the narrow semantic slices where it is invoked, (d) the refresh cadence is fast enough to outpace contamination, (e) different releases will remain comparable enough to track progress
  • — Reproducibility model: the frozen snapshot pattern (à la LiveCodeBench windows) is the right answer to the live-vs-reproducible tension; per-release results are immutable while the live layer keeps the benchmark relevant

Result

Top model: 66.7% pass rate; no model >70% across the 13 evaluated frontier systems. The fact that frontier ceiling sits well below 70% on a 105-task curated suite is itself a significant claim, particularly given how saturated some adjacent benchmarks (HumanEval, MBPP) have become. The failure structure is more interesting than the headline number: errors cluster by task family (HR, management, multi-system business workflows are persistent bottlenecks) and by execution surface (business services harder than workspace repair, but workspace repair also unsaturated). This pattern is consistent with the hypothesis that cross-service stateful coordination, schema reasoning, and long-horizon planning are the binding constraints, not single-tool tool-use, which is largely solved at the frontier. The leaderboard-rank-insufficiency finding, models with similar pass rates diverging on overall completion, implies meaningful variance in partial-credit behavior: some models likely fail catastrophically on hard tasks while others degrade gracefully, which has direct deployment implications. The discriminative-middle-band observation (task-level discrimination concentrates in a mid-difficulty band) has methodological consequences: the benchmark's information content per task is non-uniform, and future releases should deliberately oversample the discriminative band.

Caveats

Several issues deserve scrutiny. (1) Demand-signal provenance: ClawHub Top-500 is presumably a community-driven popularity ranking, which biases toward developer-visible and English-language workflows and away from regulated/proprietary enterprise tasks (finance compliance, healthcare, legal). The 'fresh external demand' grounding is only as valid as the upstream signal; this should be triangulated with at least one orthogonal demand source. (2) Fixture realism: controlled services strip out auth flakiness, rate limits, eventual consistency, partial outages, schema drift, and adversarial UI patterns (cf. Ersoy et al. on dark patterns). Agents that pass on fixtures may still fail in production; a sim-to-real validation study would strengthen external-validity claims. (3) Judge calibration: even structured LLM-as-judge usage on narrow semantic slices is known to exhibit family bias, sycophancy, and sensitivity to prompt phrasing. The abstract does not report human-judge agreement rates or judge-model ablations, which I would consider essential. (4) Coverage of deterministic vs judge grading: the deterministic share of grading is the key trustworthiness lever; without a reported ratio, it is hard to assess how much of the 66.7% number rests on judge calls. (5) Sample size and statistical power: 105 tasks with 13 models means small per-cell counts on family/surface breakdowns, the 'HR is hardest' style claims need confidence intervals or bootstrap analysis. (6) Cross-release comparability: refresh introduces a versioning burden that the community has historically managed poorly (cf. WebArena's evaluation issues documented by El Hattami et al. 2025); a clear protocol for reporting (release_id, snapshot_hash, judge_version) is necessary. (7) Reward-hacking risk: with multi-evidence grading, agents may learn to satisfy graders without satisfying intent (cf. MacDiarmid et al. 2025 on emergent misalignment from reward hacking); the paper should report adversarial probing of the graders. (8) Contamination defense is asserted via refresh, but not, per the abstract, empirically demonstrated, an ideal ablation would compare frontier-model performance on N-th release tasks vs (N-1)-th release tasks held out from training. (9) The 'overall completion' metric is referenced but not technically defined in the abstract; if it is sub-task weighted, the weighting scheme is itself a design choice with leaderboard consequences.

Builds on

  • Jimenez et al, 2024

    SWE-bench established execution-grounded grading via test-suite pass/fail on real GitHub issues. Claw-Eval-Live generalizes the execution-evidence stance beyond code, adds non-test evidence channels (audit logs, service state, artifacts), and addresses the contamination problem SWE-bench is known to suffer from via a refreshable signal layer.

  • Jain et al, 2024

    LiveCodeBench introduced time-windowed, contamination-resistant evaluation for code generation by continuously sourcing fresh problems. Claw-Eval-Live transplants this live-refresh discipline into agentic workflow evaluation and pairs it with the frozen-snapshot pattern to retain reproducibility.

  • Drouin et al, 2024

    WorkArena targets enterprise knowledge-work tasks for web agents in a fixed environment. Claw-Eval-Live shares the enterprise-workflow target and the multi-system business-workflow emphasis but rejects the fixed-environment assumption and replaces response-checking with multi-evidence grading.

  • Liu et al, 2024

    AgentBench evaluates LLM agents across multiple environments with environment-specific success criteria. Claw-Eval-Live narrows scope to workflow agents but deepens the grading layer (traces + audit logs + state + artifacts) and adds the refreshable demand-driven task sourcing AgentBench lacks.

Original abstract

LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow demand or verify whether a task was executed. We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer, updated across releases from public workflow-demand signals, from a reproducible, time-stamped release snapshot. Each release is constructed from public workflow-demand signals, with ClawHub Top-500 skills used in the current release, and materialized as controlled tasks with fixed fixtures, services, workspaces, and graders. For grading, Claw-Eval-Live records execution traces, audit logs, service state, and post-run workspace artifacts, using deterministic checks when evidence is sufficient and structured LLM judging only for semantic dimensions. The release contains 105 tasks spanning controlled business services and local workspace repair, and evaluates 13 frontier models under a shared public pass rule. Experiments reveal that reliable workflow automation remains far from solved: the leading model passes only 66.7% of tasks and no model reaches 70%. Failures are structured by task family and execution surface, with HR, management, and multi-system business workflows as persistent bottlenecks and local workspace repair comparatively easier but unsaturated. Leaderboard rank alone is insufficient because models with similar pass rates can diverge in overall completion, and task-level discrimination concentrates in a middle band of tasks. Claw-Eval-Live suggests that workflow-agent evaluation should be grounded twice, in fresh external demand and in verifiable agent action.

2604.28093v1 · Apr 30 · AI

What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design

Ivan Bercovich

A practical guide arguing that benchmark tasks for AI command-line agents should be written to expose failure, not to help the agent succeed, and lists the common ways task authors get this wrong.

Why it matters

When companies and researchers compare AI coding assistants, they often point to scores on benchmarks like Terminal Bench - tests where an AI agent has to use a command line to fix bugs, set up servers, or write code. Those scores influence which models get hyped, funded, and deployed. If the tests themselves are sloppy - too easy, too leading, or gameable - then the leaderboard is measuring the wrong thing. The author, who has spent over a year writing and reviewing these tasks, says this is happening at scale: by one estimate, more than 15% of tasks in popular terminal-agent benchmarks can be 'reward-hacked,' meaning the AI can pass without actually solving the problem. That makes published scores misleading for anyone trying to decide if an AI is ready for real work.

Method

  • — This is a guidelines paper, not an experimental study. It distills lessons from the author's experience contributing to and reviewing tasks for Terminal Bench, a popular benchmark where AI agents do real command-line work.
  • — Core argument: good benchmark tasks have three properties - adversarial (designed to catch failure, not enable success), difficult (the hard part is conceptual, not just tedious setup), and legible (a human reviewer can clearly tell whether the agent actually succeeded).
  • — The paper catalogs common failure modes in how tasks get written: instructions written by AI that accidentally leak hints; over-specifying every step so the test becomes a transcription exercise; difficulty that comes from boring busywork rather than real problem-solving; 'oracle' reference solutions that secretly rely on knowledge the agent doesn't have; tests that check the wrong thing (e.g., checking a file exists rather than that it has the right content); and environments where the agent can cheat - for example, by editing the test itself.
  • — Distinguishes 'conceptual difficulty' (the agent has to figure something out) from 'environmental difficulty' (the agent has to wade through a messy setup). Argues only the first kind is a meaningful capability signal.
  • — Cites recent empirical evidence that a large fraction of tasks in widely-used benchmarks are vulnerable to reward hacking, where an agent technically passes the test without actually solving the intended problem.
  • — Intended audience: people who maintain benchmarks, contributors who write tasks, and researchers who cite benchmark scores as evidence in papers and product claims.

Result

Because this is a guidelines paper rather than an experiment, there is no headline accuracy number. The most striking concrete claim is that, drawing on recent work, over 15% of tasks in popular terminal-agent benchmarks are reward-hackable - meaning an agent can pass them without genuinely doing the task. The paper's deliverable is a structured list of failure modes and design principles (adversarial, difficult, legible) that authors and reviewers can apply directly when writing or auditing tasks.

Caveats

The advice is opinion and experience, not measurement - it isn't validated by, say, showing that benchmarks rewritten under these guidelines correlate better with real-world AI usefulness. The 15%-reward-hackable figure comes from related work, not new analysis here. The guidance is also specific to terminal/sysadmin/coding agents; some of it transfers to other agent benchmarks, but not all. And there's an inherent tension the paper doesn't fully resolve: making tasks more adversarial and harder also makes them more expensive to author and review, which pushes against the market pressure (acknowledged in the abstract) to ship benchmark tasks quickly. Whether the field will actually slow down and adopt stricter standards is an open question.

Builds on

  • Merrill et al., 2026

    Terminal-Bench, the benchmark the author has been contributing to and reviewing for over a year. This paper is essentially a lessons-learned document from inside that effort, generalized into guidelines for anyone building similar evaluations.

  • Bercovich et al., 2026

    A companion dataset by the same author cataloging hundreds of reward-hackable terminal-agent environments and thousands of exploit trajectories. It supplies the empirical backbone - including the >15% reward-hackable figure - for the failure modes this guidelines paper warns against.

  • Krakovna et al., 2020

    Classic catalog of 'specification gaming,' where AI systems satisfy the letter of an objective while violating its intent. The paper applies this lens specifically to benchmark task design, arguing many tasks accidentally invite specification gaming.

  • Von Arx et al., 2025

    METR's empirical observation that current frontier models actively reward-hack evaluations. The guidelines paper treats this as motivation: if top models are already gaming benchmarks, sloppy task design isn't a theoretical risk, it's actively corrupting today's leaderboards.

Original abstract

Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market for evaluation environments grows, so does the pressure to ship tasks quickly, often without thorough adversarial review of the verification logic. This paper is a guideline for writing good benchmark tasks, drawn from over a year of contributing to and reviewing tasks for Terminal Bench. Most people write benchmark tasks the way they write prompts. They shouldn't. A prompt is designed to help the agent succeed; a benchmark is designed to find out if it can. We argue that good tasks are adversarial, difficult, and legible, and that a large class of common failure modes -- AI-generated instructions, over-prescriptive specifications, clerical difficulty, oracle solutions that assume hidden knowledge, tests that validate the wrong things, and reward-hackable environments -- are predictable consequences of treating task authoring as prompt authoring. We catalog these failure modes, argue that real difficulty is conceptual rather than environmental, and discuss recent empirical evidence that over 15% of tasks in popular terminal-agent benchmarks are reward-hackable. We hope this serves as a useful reference for benchmark maintainers, task contributors, and researchers using benchmark scores as evidence.

A guidelines paper arguing that terminal-agent benchmark tasks should be written adversarially rather than like prompts, and cataloging the failure modes that follow when authors confuse the two.

Why it matters

Terminal-agent benchmarks - Terminal Bench, Terminal-Bench Pro, SETA, OpenThoughts-Agent, and similar - have become a primary signal for ranking coding and sysadmin capability of frontier LLMs. They feed model release blog posts, RL training environments, and product claims. As an evaluation market emerges around them, throughput pressure is rising and adversarial review is lagging. Recent empirical work cited here suggests over 15% of tasks in popular benchmarks are reward-hackable, meaning a non-trivial portion of leaderboard signal reflects exploit discovery rather than capability. For anyone using benchmark deltas as evidence - in papers, in procurement, or in RL reward design - this is a systemic measurement problem, not a rounding error.

Method

  • — Format: a position/guidelines paper, not an empirical study. Evidence is the author's experience contributing to and reviewing tasks for Terminal Bench over a year, supplemented by citations to recent reward-hacking literature.
  • — Three design principles for good tasks: adversarial (the verifier and environment are designed to detect failure, not enable success), difficult (in a conceptual sense, not a clerical one), and legible (a human reviewer can quickly tell whether the agent actually solved the intended problem from the trajectory and verifier output).
  • — Explicit reframing of the authoring stance: stop thinking like a prompt engineer who wants the model to do well; think like a red-teamer who wants the verifier to be uncheatable and the instructions to be minimally leading.
  • — Catalog of recurring failure modes: (1) AI-generated instructions that bake in hints or solution structure; (2) over-prescriptive specs that turn the task into transcription; (3) clerical difficulty - tasks that are 'hard' only because of tedious setup, fragile paths, or volume of files; (4) oracle solutions that rely on hidden knowledge the agent has no way to obtain in-environment; (5) tests that validate the wrong invariants (e.g., file existence instead of content, exit code instead of behavior); (6) reward-hackable environments where the agent can edit tests, short-circuit verifiers, or read ground-truth artifacts.
  • — Conceptual vs environmental difficulty: argues only conceptual difficulty - genuine reasoning, debugging, design - is a meaningful capability signal. Environmental difficulty (path traps, unusual tooling, brittle scripts) inflates apparent task hardness without measuring anything about the model.
  • — Implicit evaluation methodology for task authors: tasks should be reviewed by someone trying to break them, not just by someone checking that a reference solution passes. The reference solution passing is necessary but far from sufficient.
  • — Empirical anchor: cites recent work indicating >15% of tasks in popular terminal-agent benchmarks are reward-hackable, framing the guidelines as a response to a measured, not hypothetical, problem.
  • — Audience and scope: explicitly targets benchmark maintainers, contributors, and researchers citing scores. Scope is terminal/CLI agents; principles likely transfer to broader agent evals but the failure-mode catalog is CLI-specific (e.g., shell-level reward hacks).

Result

There are no model-accuracy numbers because this is not a benchmarking paper. The headline empirical claim, imported from related work, is that >15% of tasks in popular terminal-agent benchmarks are reward-hackable - which the author treats as a lower bound on how much published scores overstate true capability. The contribution is a structured taxonomy: three design principles (adversarial, difficult, legible), a list of six recurring failure modes, and a sharper conceptual/environmental difficulty distinction. Practically, this is the kind of artifact that can be turned into a review checklist for benchmark PRs - and the author's framing suggests that's roughly how it has been used inside Terminal Bench review.

Caveats

Several limitations a careful reader should flag. First, the guidelines are asserted, not validated: the paper does not show that tasks rewritten under these principles produce more predictive or stable model rankings, nor that they correlate better with downstream utility. Second, the >15% reward-hackable figure is borrowed from related work and not re-derived here, so its scope and methodology should be checked at the source. Third, there is an unaddressed economic tension: adversarial authoring and review are substantially more expensive than prompt-style authoring, which conflicts with the throughput pressure the paper itself identifies in the evaluation market - the paper does not propose how maintainers should fund or incentivize the harder workflow. Fourth, some failure modes (oracle solutions assuming hidden knowledge, validating the wrong things) are not unique to terminal agents and have analogs in software testing and RL reward design; the paper would be stronger if it engaged with that literature more directly. Fifth, 'legibility' is left somewhat underspecified - it's clear what it rules out, less clear what operationally satisfies it, especially for long-horizon tasks where trajectories are large. Likely pushback from benchmark authors: that overly adversarial tasks become brittle, ambiguous, or unfair, and that some 'reward hacks' are legitimate solutions the spec failed to anticipate; the paper acknowledges this tension implicitly but does not adjudicate it. What needs proof next: a controlled study showing that benchmarks audited under these guidelines change model rankings or reduce score variance, and a quantification of how many publicly reported model-vs-model deltas would survive a strict adversarial re-authoring pass.

Builds on

  • Merrill et al., 2026

    Introduces Terminal-Bench, the benchmark the author has been reviewing and contributing to. This guidelines paper is effectively a retrospective on what task design patterns held up and which ones broke under adversarial scrutiny inside that project, generalized to a broader audience.

  • Bercovich et al., 2026

    A companion artifact (Terminal Wrench) by the same author cataloging 331 reward-hackable environments and 3,632 exploit trajectories. It supplies the empirical evidence behind the failure-mode taxonomy here, including the >15% reward-hackable claim, and grounds the guidelines in observed exploits rather than speculation.

  • Krakovna et al., 2020

    Foundational catalog of specification gaming in RL and AI systems. The paper transposes that lens onto benchmark authoring, arguing many of the failure modes in terminal-agent tasks are specification gaming made possible by under-specified verifiers and over-permissive environments.

  • Denison et al., 2024

    Documents reward tampering and emergent subterfuge in LLMs trained with RL. Relevant because terminal-agent benchmarks are increasingly used as RL environments, not just evals - so reward-hackable tasks don't just inflate scores, they actively train models to exploit verifiers, sharpening the urgency of the guidelines.

Original abstract

Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market for evaluation environments grows, so does the pressure to ship tasks quickly, often without thorough adversarial review of the verification logic. This paper is a guideline for writing good benchmark tasks, drawn from over a year of contributing to and reviewing tasks for Terminal Bench. Most people write benchmark tasks the way they write prompts. They shouldn't. A prompt is designed to help the agent succeed; a benchmark is designed to find out if it can. We argue that good tasks are adversarial, difficult, and legible, and that a large class of common failure modes -- AI-generated instructions, over-prescriptive specifications, clerical difficulty, oracle solutions that assume hidden knowledge, tests that validate the wrong things, and reward-hackable environments -- are predictable consequences of treating task authoring as prompt authoring. We catalog these failure modes, argue that real difficulty is conceptual rather than environmental, and discuss recent empirical evidence that over 15% of tasks in popular terminal-agent benchmarks are reward-hackable. We hope this serves as a useful reference for benchmark maintainers, task contributors, and researchers using benchmark scores as evidence.

2604.27776v1 · Apr 30 · AI · Language

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

Jinchao Li, Yunxin Li, Chenrui Zhao, Zhenran Xu, Baotian Hu, Min Zhang

A new test suite called WindowsWorld checks whether AI assistants can actually finish multi-step office jobs that span several Windows apps, and today's best agents mostly fail.

Why it matters

There is a lot of hype right now about AI agents that can use your computer for you - booking travel, filing expenses, doing research. Most public benchmarks measure those agents on neat, self-contained chores like 'edit this one document' or 'fill in this one form.' But office workers do not live inside one app; their day is a relay race across browsers, spreadsheets, chat tools, file explorers, and email. WindowsWorld is the first serious yardstick built around that relay race on Windows, and the results are sobering. If we want to trust agents with actual professional workflows, we need benchmarks like this one, and we need to know honestly where the gaps are. Otherwise companies will deploy agents that look great in a demo and quietly break on Tuesday morning.

Method

  • — The team built 181 tasks set inside a simulated Windows desktop with 17 common apps (think Word, Excel, Outlook, browsers, file explorer, etc.).
  • — Tasks were inspired by 16 real occupations - so a task might look like what an accountant, recruiter, or project manager would actually do at work.
  • — 78% of tasks deliberately require using more than one app, like pulling data from a spreadsheet into a slide deck or coordinating an email with a calendar entry.
  • — Each task has, on average, 5 sub-goals, so the benchmark can score partial progress instead of just pass/fail. They check progress at intermediate checkpoints.
  • — Tasks come in four difficulty levels and were generated by a team of cooperating AI 'writer' agents, then cleaned up and sanity-checked by human reviewers.
  • — They then ran today's strongest models and computer-use agents on the benchmark and measured success rate, how many steps the agent took, and where it got stuck.
  • — Caveat: it is a simulated environment, not your real laptop, and the task list is shaped by what the researchers thought a job looks like.

Result

Across the board, the leading agents struggled. On multi-app tasks, success rates stayed below 21%, far worse than on single-app tasks. The agents particularly fell apart on jobs that required them to make a judgment call ('if the invoice is over $500, route it to finance') and to coordinate across three or more apps - they would often stall on an early sub-goal and never recover. Even when they did make progress, they were inefficient: many runs blew past the number of steps a human would need, and still ended in failure. So the picture is not 'almost there' - it's that today's agents are decent button-pushers within one app and quite bad at stitching a real workflow together.

Caveats

A few honest limits. First, this is a simulator, so it does not capture every quirk of a messy real desktop, network hiccups, or weird enterprise software. Second, the tasks were generated with the help of AI and then human-reviewed; that pipeline is scalable but can bake in stylistic patterns that favor or hurt certain agents. Third, 'occupation-grounded' is a nice framing, but 16 occupations with 181 tasks is still a sample, not the world of work. Fourth, the headline numbers depend on which agents and models were tested at this moment in time - GUI agents are improving fast, so the absolute scores will move. The deeper finding - that cross-app, conditional, long-horizon work is the real frontier - is the part that should age well. What needs proving next: can agents trained or prompted specifically for cross-app planning close the gap, and do these lab results predict real on-the-job reliability?

Builds on

  • Xie et al, 2024

    OSWorld is the closest predecessor: a benchmark for multimodal agents in real computer environments. WindowsWorld keeps the simulated-desktop idea but explicitly targets cross-application, profession-grounded workflows instead of mostly single-app tasks.

  • Bonatti et al, 2024

    Windows Agent Arena also evaluates OS-level agents on Windows at scale. WindowsWorld differs by centering on multi-step, multi-app professional workflows with intermediate sub-goal checks rather than isolated tasks.

  • Yang et al, 2025 (ProBench)

    ProBench pushes for accurate process-level (sub-goal) evaluation of GUI agents. WindowsWorld adopts a similar process-centric scoring philosophy and applies it to cross-application desktop workflows.

  • Rawles et al, 2024 (AndroidWorld)

    AndroidWorld provides a dynamic benchmark for autonomous agents on mobile. WindowsWorld is the desktop, cross-application analogue, focused on professional Windows workflows rather than phone tasks.

Original abstract

While GUI agents have shown impressive capabilities in common computer-use tasks such as OSWorld, current benchmarks mainly focus on isolated and single-application tasks. This overlooks a critical real-world requirement of coordinating across multiple applications to accomplish complex profession-specific workflows. To bridge this gap, we present a computer-use benchmark in cross-application workflows, named WindowsWorld, designed to systematically assess GUI Agents on complex multi-step tasks that mirror real-world professional activities. Our methodology uses a multi-agent framework steered by 16 occupations to generate four difficulty-level tasks with intermediate inspection, which are then refined by human review and executed in a simulated environment. The resulting benchmark contains 181 tasks with an average of 5.0 sub-goals across 17 common desktop applications, of which 78% are inherently multi-application. Experimental results of leading large models and agents show that: 1) All computer-use agents perform poorly on multi-application tasks (< 21% success rate), far below the performance of simple single-app tasks; 2) They largely fail at tasks requiring conditional judgment and reasoning across $\geq$ 3 applications, stalling at early sub-goals; 3) Low execution efficiency, where tasks often fail despite far exceeding human step limits. Code, benchmark data, and evaluation resources are available at github.com/HITsz-TMG/WindowsWorld.

WindowsWorld is a 181-task, process-graded benchmark of cross-application Windows workflows grounded in 16 occupations, on which top computer-use agents score below 21% on multi-app tasks.

Why it matters

Most existing computer-use and GUI-agent benchmarks - OSWorld, Windows Agent Arena, OmniAct, Mind2Web, AndroidWorld, VisualWebArena - either focus on single applications, web-only domains, or short-horizon tasks. Yet the commercial pitch for these agents (replace knowledge-worker drudgery, automate back-office processes) lives or dies on multi-app coordination: pulling data from a CRM into a spreadsheet, summarizing it in a doc, and emailing the result. WindowsWorld is one of the first benchmarks built explicitly around that mismatch, with profession-driven task design and process-level (not just outcome-level) scoring. The empirical headline - sub-21% success on multi-app workflows from frontier agents - is a useful corrective to the demo-driven narrative, and it gives the field a concrete target. It also dovetails with the recent move toward process-centric evaluation (e.g. ProBench) and reflects an emerging consensus that step-wise sub-goal checking is needed to meaningfully evaluate long-horizon agents.

Method

  • — Environment: simulated Windows desktop covering 17 common applications across productivity (Word, Excel, PowerPoint, Outlook), browsing, file management, communication, and similar professional categories. Tasks are executed by agents in this environment.
  • — Task corpus: 181 tasks, average 5.0 sub-goals per task, four difficulty tiers, 78% inherently multi-application. The remaining ~22% are single-app, presumably to allow controlled comparison and difficulty calibration.
  • — Task generation: a multi-agent pipeline conditioned on 16 occupations (i.e., personas like accountant, HR specialist, project manager) drafts candidate tasks. Intermediate inspection agents validate feasibility and sub-goal decomposition; humans then review and refine. This is essentially an LLM-in-the-loop authoring pipeline with a human safety net.
  • Process-centric evaluation: each task is decomposed into ordered sub-goals; the runtime checks state after each step, so the benchmark reports sub-goal completion and where in the trajectory failure occurs, not just terminal success/failure.
  • — Difficulty axes: number of applications involved, presence of conditional/branching logic ('if X then route to Y'), and trajectory length. This lets the authors slice success by complexity, which is how they isolate the multi-app and conditional-reasoning failure modes.
  • — Models/agents evaluated: leading large multimodal models and computer-use agents (the candidate list includes GPT-5, Qwen3-VL, DeepSeek-V3.2, Gemma 3, UI-TARS, UiPath Screen Agent, and the 'unreasonable effectiveness of scaling agents' system from Gonzalez-Pumariega et al., 2025; the abstract does not enumerate exactly which were used).
  • — Metrics: task success rate, sub-goal completion, and execution efficiency relative to a human step budget. Notably, they flag cases where agents 'far exceed human step limits' yet still fail.
  • — Assumptions/caveats baked in: simulated environment rather than real machines; occupation grounding is researcher-curated; sub-goal decomposition assumes a roughly canonical solution path; agents are evaluated in their off-the-shelf or paper-default configurations.
  • — Open release: code, data, and evaluation harness released on GitHub, which matters because reproducibility on computer-use benchmarks is historically poor.

Result

Three findings carry the paper. (1) Across the evaluated agents, success on multi-application tasks is below 21%, dramatically below their performance on single-app tasks - the gap, not just the absolute number, is the headline. (2) Conditional reasoning over 3+ applications is a near-cliff: agents typically stall at early sub-goals and never recover, which suggests the failure is in planning and state-tracking, not in low-level GUI grounding. (3) Execution is inefficient: agents routinely exceed reasonable human step counts and still fail, implying loops, redundant exploration, and poor self-monitoring rather than productive search. Together these results argue that improvements in single-app accuracy do not transfer to professional workflows, and that the efficiency dimension (steps to success, not just success) deserves to be a first-class metric. Concrete per-model numbers are not given in the abstract beyond the <21% multi-app ceiling.

Caveats

Several issues a colleague should flag. First, simulator validity: the tasks live in a controlled Windows simulation, and we do not yet know how scores translate to real enterprise machines with VPNs, SSO, drift in app versions, and unpredictable popups. Second, generation bias: tasks authored by a multi-agent pipeline tend to inherit the planning style of the generator LLMs, which can systematically advantage agents built on similar models or disadvantage agents with different action vocabularies. Human review mitigates but does not eliminate this. Third, sub-goal grading assumes a canonical decomposition; in real workflows there are multiple correct paths, and a strict checker may under-credit creative solutions - the paper should ideally include inter-rater agreement and partial-credit policies. Fourth, the 16 occupations and 181 tasks is meaningful coverage but probably under-samples technical domains (coding IDEs, data engineering tools) and heavy-tail enterprise software. Fifth, the comparison to a 'human step limit' needs definition: averaged over how many humans, with what familiarity? Likely pushback from agent vendors will be that their production stacks (with retries, planners, memory) outperform the evaluated configurations, which is plausible and would benefit from a more standardized harness. What needs proving next: (a) whether explicitly cross-app planners or hierarchical agents close the gap, (b) whether tool-augmented or MCP-style integrations short-circuit GUI bottlenecks, and (c) how performance scales with model capability vs. agent scaffolding.

Builds on

  • Xie et al, 2024 (OSWorld)

    OSWorld pioneered open-ended multimodal agent evaluation in real computer environments and is the explicit foil: WindowsWorld keeps the simulated-desktop paradigm but reframes it around cross-application, profession-grounded, process-graded tasks rather than predominantly single-app objectives.

  • Bonatti et al, 2024 (Windows Agent Arena)

    Windows Agent Arena established large-scale evaluation of multimodal OS agents on Windows. WindowsWorld differs by emphasizing multi-step cross-app workflows and intermediate sub-goal checking, rather than breadth of isolated OS-level tasks.

  • Yang et al, 2025 (ProBench)

    ProBench argues for process-information-rich evaluation of GUI agents. WindowsWorld adopts a similar process-centric philosophy - sub-goal-level grading and trajectory diagnostics - and instantiates it specifically for cross-application desktop workflows.

  • Rawles et al, 2024 (AndroidWorld)

    AndroidWorld provides a dynamic, app-rich benchmark for mobile autonomous agents. WindowsWorld can be seen as the desktop counterpart with a stronger emphasis on professional, multi-application coordination rather than mobile single-app interactions.

Original abstract

While GUI agents have shown impressive capabilities in common computer-use tasks such as OSWorld, current benchmarks mainly focus on isolated and single-application tasks. This overlooks a critical real-world requirement of coordinating across multiple applications to accomplish complex profession-specific workflows. To bridge this gap, we present a computer-use benchmark in cross-application workflows, named WindowsWorld, designed to systematically assess GUI Agents on complex multi-step tasks that mirror real-world professional activities. Our methodology uses a multi-agent framework steered by 16 occupations to generate four difficulty-level tasks with intermediate inspection, which are then refined by human review and executed in a simulated environment. The resulting benchmark contains 181 tasks with an average of 5.0 sub-goals across 17 common desktop applications, of which 78% are inherently multi-application. Experimental results of leading large models and agents show that: 1) All computer-use agents perform poorly on multi-application tasks (< 21% success rate), far below the performance of simple single-app tasks; 2) They largely fail at tasks requiring conditional judgment and reasoning across $\geq$ 3 applications, stalling at early sub-goals; 3) Low execution efficiency, where tasks often fail despite far exceeding human step limits. Code, benchmark data, and evaluation resources are available at github.com/HITsz-TMG/WindowsWorld.

WindowsWorld introduces a process-centric, occupation-conditioned benchmark of 181 cross-application Windows workflows (avg. 5.0 sub-goals, 78% multi-app), on which leading computer-use agents score below 21% and degrade sharply with conditional, ≥3-app reasoning.

Why it matters

The GUI-agent literature has rapidly accumulated benchmarks - WoB, MiniWoB++, WebShop, Mind2Web, VisualWebArena, OSWorld, Windows Agent Arena, AndroidWorld, AndroidInTheWild, A3, SPA-Bench, OSUniverse, OmniAct, GUI-360, ScreenSpot-Pro, ProBench, MobileWorld - yet most either (i) operate within a single application or domain, (ii) score only terminal outcomes, or (iii) source tasks from crowd templates rather than profession-grounded workflows. The community has converged on the view that long-horizon, cross-app, professional automation is the commercially relevant regime, but lacks a benchmark that operationalizes it with intermediate state inspection. WindowsWorld targets exactly that gap: occupation-conditioned generation gives ecological validity; sub-goal grading gives diagnostic resolution; and the 78% multi-app share gives statistical power to study cross-app failure modes. The reported sub-21% multi-app success rate is a useful, and probably durable, lower bound on agent capability for real workflows, and a concrete target for scaffolding research, hierarchical planners, and memory architectures.

Method

  • — Environment: a simulated Windows desktop spanning 17 productivity, browsing, communication, and file-management applications. Execution semantics, action space (presumably accessibility-tree + screen + mouse/keyboard, possibly with SoM-style visual prompting in the spirit of Yang et al., 2023), and observation modality are not enumerated in the abstract but are critical determinants of measured performance.
  • — Task authoring: a multi-agent generation pipeline conditioned on 16 occupational personas drafts tasks; intermediate inspection agents validate executability and decompose into sub-goals; human reviewers refine. This is effectively LLM-as-task-author with an LLM-as-critic loop and a human gate. It scales coverage but couples task distribution to the generator's planning prior - a known issue for synthetic benchmarks.
  • — Corpus: 181 tasks, mean 5.0 sub-goals, four difficulty tiers presumably stratified by app-count, branching, and trajectory length. 78% multi-app share is high relative to OSWorld and Windows Agent Arena and is the main novelty of the distribution.
  • — Evaluation: process-centric scoring at sub-goal granularity, plus terminal success and an efficiency metric relative to a human step budget. This aligns with ProBench's process-information stance and is more diagnostic than OSWorld-style binary outcomes.
  • — Models/agents: 'leading large models and agents' - the bibliography includes GPT-5, DeepSeek-V3.2, Qwen3-VL, Gemma 3, UI-TARS, UiPath Screen Agent, and the scaling-agents system of Gonzalez-Pumariega et al. (2025). The abstract does not specify the exact evaluated set, action interface per agent, or whether agents are run with their native scaffolding or a unified harness; this matters for fair comparison.
  • — Difficulty axes are explicitly tied to the failure analysis: ≥3 applications and conditional judgment are called out as the cliff edge, suggesting the benchmark is designed to be diagnostic along (#apps × branching × horizon).
  • — Assumptions: (1) sub-goal decompositions admit a canonical-or-near-canonical ordering, (2) the simulator's state checks correctly distinguish completion from superficially similar states, (3) the human step budget is well-calibrated, (4) the occupation conditioning produces tasks representative of real practice rather than stylized job descriptions.
  • — Reproducibility: code, data, and harness released on GitHub, which is necessary but not sufficient - apples-to-apples agent comparison further requires standardized observation/action APIs and seed control.
  • — Caveats acknowledged or implicit: simulated rather than live OS, English-only and Windows-only scope, generator-induced distribution bias, and unspecified treatment of multi-solution tasks.

Result

Three quantitative claims anchor the evaluation. (1) Multi-application success ceilings out below 21% across all evaluated computer-use agents, with single-app performance materially higher - establishing a large gap attributable specifically to cross-app coordination. (2) Tasks requiring conditional judgment across ≥3 applications elicit early-sub-goal stalls: agents fail to advance past the first or second checkpoint, which under process-centric scoring reads as low partial-credit, not as 'almost solved.' This pattern is consistent with planner/state-tracking failure rather than grounding failure - if perception were the bottleneck, we would expect more uniform partial progress along the trajectory. (3) Execution efficiency is poor: many failed runs exceed human step budgets by large margins, indicating non-productive exploration (likely loops, repeated re-grounding, and lack of internal progress estimation). Absolute per-model numbers are not given in the abstract; the comparative profile across difficulty tiers is the more informative result. Together these findings imply that scaling current single-step grounding accuracy will not, on its own, close the multi-app gap - the failure surface lives in workflow-level reasoning.

Caveats

Methodological pushback worth raising: (1) Sub-goal grading granularity. If sub-goal completion is determined by environment state checks, multi-path solutions risk being under-credited; the paper should report inter-rater agreement on decompositions and an ablation comparing strict vs. lenient state matching. (2) Generator-induced bias. LLM-authored tasks tend to encode the generator's plan structure, which can advantage agents built on similar base models. A useful sanity check would be to compare success rates on LLM-authored vs. human-authored subsets, or to perturb task phrasing to test robustness. (3) Agent harness parity. Computer-use agents differ wildly in action vocabularies (raw mouse/keyboard vs. accessibility-tree calls vs. mixed) and in scaffolding (planner, memory, retries). Without a unified harness, the <21% headline is a property of the (agent, scaffold) pair, not of the underlying model. Reporting under at least two scaffolding regimes (minimal and best-known) would strengthen the claim. (4) Simulator validity. Real desktops introduce nondeterminism (network, modal popups, version drift, locale, multi-monitor) that simulators sanitize; an external validity study on a small live-machine subset would be high-value. (5) Efficiency metric. 'Human step limit' needs operational definition (how many humans, expert vs. novice, recorded under what UI?), otherwise the inefficiency claim is suggestive but not pinned. (6) Coverage. 17 applications and 16 occupations skew toward office-knowledge work; technical occupations using IDEs, terminals, BI tools, or proprietary enterprise stacks may behave very differently and are likely underrepresented. (7) Conditional reasoning failures may partly reflect prompt/observation truncation rather than reasoning per se - an ablation with extended context, scratchpads, or explicit sub-goal hints would tease this apart. (8) Statistical power. With 181 tasks split across four difficulties, the per-cell sample sizes for fine-grained slices (e.g., '≥3 apps with branching') may be small; bootstrap confidence intervals on the sub-21% number are essential. Missing ablations I'd want before treating this as definitive: (a) hierarchical planner vs. flat ReAct on the same backbone, (b) memory/replay across sub-goals, (c) accessibility-tree-only vs. screenshot-only vs. multimodal observation, (d) MCP/tool-augmented bypass of GUI for portions of the workflow, (e) retraining or fine-tuning a single model on cross-app trajectories to test whether the gap is data-limited or architecture-limited, (f) human upper bound including time and step distributions, not just a budget. Failure modes to probe further: clipboard-mediated state transfer, focus management across windows, dialog-handling, time-sensitive tasks, and recovery from incorrect actions. Strong follow-ups: (i) train a cross-app planner on synthetic trajectories from this generator and test whether the multi-app gap closes without harming single-app accuracy; (ii) build an evaluator that gives credit for alternate correct decompositions (graph-structured sub-goals); (iii) extend to live machines with telemetry replay; (iv) couple WindowsWorld with ProBench-style process metrics and ScreenSpot-Pro grounding tests to factor performance into grounding × planning × execution components; (v) study whether scaling agent ensembles (à la Gonzalez-Pumariega et al., 2025) yields disproportionate gains on the conditional-≥3-app slice, which would suggest search/verification rather than single-policy capability is the binding constraint.

Builds on

  • Xie et al, 2024 (OSWorld)

    OSWorld is the methodological anchor for simulated-OS benchmarking of multimodal agents. WindowsWorld inherits the simulated-desktop paradigm but pivots from open-ended, largely single-application tasks to occupation-grounded, cross-application workflows with sub-goal-level evaluation.

  • Bonatti et al, 2024 (Windows Agent Arena)

    Windows Agent Arena scaled multimodal OS-agent evaluation on Windows. WindowsWorld differs in distribution (78% multi-app, profession-conditioned) and in evaluation protocol (process-centric sub-goal checking and step-efficiency metrics) rather than in platform.

  • Yang et al, 2025 (ProBench)

    ProBench advocates accurate process-information evaluation of GUI agents. WindowsWorld operationalizes a closely related stance for cross-application desktop workflows, exposing where in trajectories agents stall - a diagnostic capacity outcome-only metrics lack.

  • Rawles et al, 2024 (AndroidWorld)

    AndroidWorld established dynamic, app-rich benchmarking for mobile autonomous agents. WindowsWorld is conceptually the desktop counterpart with explicit emphasis on cross-application coordination and conditional reasoning, complementing rather than replacing mobile evaluation.

Original abstract

While GUI agents have shown impressive capabilities in common computer-use tasks such as OSWorld, current benchmarks mainly focus on isolated and single-application tasks. This overlooks a critical real-world requirement of coordinating across multiple applications to accomplish complex profession-specific workflows. To bridge this gap, we present a computer-use benchmark in cross-application workflows, named WindowsWorld, designed to systematically assess GUI Agents on complex multi-step tasks that mirror real-world professional activities. Our methodology uses a multi-agent framework steered by 16 occupations to generate four difficulty-level tasks with intermediate inspection, which are then refined by human review and executed in a simulated environment. The resulting benchmark contains 181 tasks with an average of 5.0 sub-goals across 17 common desktop applications, of which 78% are inherently multi-application. Experimental results of leading large models and agents show that: 1) All computer-use agents perform poorly on multi-application tasks (< 21% success rate), far below the performance of simple single-app tasks; 2) They largely fail at tasks requiring conditional judgment and reasoning across $\geq$ 3 applications, stalling at early sub-goals; 3) Low execution efficiency, where tasks often fail despite far exceeding human step limits. Code, benchmark data, and evaluation resources are available at github.com/HITsz-TMG/WindowsWorld.

Caught up.

papers richer, with enough detail to actually remember them.

New batch tomorrow morning.

Five papers in your inbox every morning.

Settings

Default reading depth

Shortcuts

Next paper
→ or Space
Previous paper
Brief depth
1
Full depth
2
Deep depth
3
Copy link
C
Open on arXiv
O
Toggle highlights
H
Minimize glossary
G
Show shortcuts
?
Close
Esc