Human-Guided Harm Recovery for Computer Use Agents

Christy Li¹, Sky CH-Wang², Andi Peng³, Andreea Bobu¹,

¹MIT CSAIL, ²Abridge, ³humans&

ICLR 2026 Workshop on Principled Design for Trustworthy AI (Oral)

TL;DR: We introduce harm recovery—a post-execution safety method that uses human preferences to guide computer-use agents in optimally recovering from harmful scenarios.

Harm Recovery as a Post-Execution Safeguard

As LM agents gain the ability to execute real actions on real computer systems, ensuring their safety becomes increasingly critical. Most existing approaches focus on preventing harmful actions before they occur, but prevention alone is often insufficient. We introduce a framework for post-execution safeguards: guiding agents to remediate harm in ways that align with human preferences once prevention fails.

Language-model agents have rapidly moved from chat interfaces into open-ended software environments with real-world consequences (Xie et al., 2024; Zhou et al., 2023). As these agents take on more autonomous roles in computer use, downloading files, editing documents, managing accounts, and issuing system commands, the safety of their actions becomes a first-order concern. Most current approaches rely on pre-execution safeguards that attempt to block or steer the agent away from harmful actions before they occur, via refusal-trained policies (Bai et al., 2022), constitutional principles (Bai et al., 2022), or rejection classifiers (Sharma et al., 2025; Chi et al., 2024).

Yet prevention alone is often insufficient in practice. Consider an agent instructed to install a routine software update from a vendor's official server. Unbeknownst to the agent, the server has been compromised and is serving a malicious update signed with a stolen certificate. From the agent's perspective, every action looks safe—yet it culminates in real harm by enrolling the host machine into a botnet. Once such failures occur, someone or something must take corrective action, and relying on human operators is neither scalable nor practical as agents are granted increasing autonomy.

We formalize this neglected challenge as harm recovery: the problem of navigating from a harmful system state back to a safe one in alignment with human preferences. Effective recovery requires more than simply reaching any safe state—it requires doing so efficiently and in ways that reflect how people actually judge what it means to "recover well." Our central contribution is in characterizing these human preferences, building a benchmark to evaluate recovery, and operationalizing a scaffold that uses learned preferences to rerank candidate recovery plans at test time.

Formalizing Harm Recovery

We define harm recovery as the problem of navigating from an initial harmful state s_h to a safe state s_T ∈ S_safe through a sequence of actions forming a recovery trajectory τ. Rather than treating any safe-ending trajectory as acceptable, we frame recovery as an optimization problem: different recovery paths trade off multiple attributes people care about—efficiency, comprehensiveness, avoidance of side harms, long-term prevention—and the optimal strategy is the one that best reconciles practical constraints with human-centered notions of what it means to recover well. Formally, we seek a policy π* that maximizes E_τ~π[R(τ)] subject to the terminal state being safe, where R(τ) scores the alignment of the recovery trajectory with human preferences.

Human Preferences for Harm Recovery

What does it mean to recover from harm well? To answer this, we conducted a formative user study with 20 annotators, presenting them with realistic scenarios in which a computer-use agent had caused harm along with pairs of candidate recovery plans. Annotators described what they liked and disliked about each plan, gave an overall A/B judgment, and explained their reasoning. Through Braun and Clarke's thematic analysis methodology (Braun & Clarke, 2006), we distilled these responses into a principled evaluation rubric defining eight core dimensions of recovery quality.

The Eight-Dimensional Rubric

Each dimension is rated on a 5-point Likert scale: Comprehensiveness (how thoroughly the plan addresses all aspects of the harm), Focus (how well the plan targets the core problem without overreach), Likelihood of Success, Speed of Implementation, Long-Term Resolution (prevention of recurrence), Side Harms (avoidance of new harms), Communication (quality of communication to affected stakeholders), and Autonomy (respect for user choice and consultation).

Attribute Weighting and Context Dependence

Using this rubric, we collected a dataset of 1,130 annotated plan pairs from 230 annotators recruited through Prolific. Through logistic regression, we found that human evaluators consistently prioritized speed and focus when selecting recovery plans, while comprehensiveness was negatively associated with plan preference—suggesting that more thorough responses were perceived as less desirable, potentially due to complexity or slower execution. Overall, decision-makers favored pragmatic strategies that are fast and tightly targeted, even at the expense of thoroughness or longer-term considerations.

Importantly, the relative weight of these attributes is not constant. We trained a 10-topic Latent Dirichlet Allocation model on scenario texts and fit moderation analyses to test how attribute importance shifts with context. We found systematic, context-dependent effects: contexts involving high technical complexity (e.g., AI platforms, cloud energy systems) amplified the importance of focus and likelihood of success; contexts involving sensitive users (e.g., mental health support, access provisioning) heightened the salience of autonomy and communication; and urgency-related settings (e.g., urban routing) brought speed to the forefront. Comprehensiveness and side harms, by contrast, showed no reliable moderation, indicating their influence on plan choice was relatively stable across contexts.

Generate-and-Verify Scaffold

In principle, one could learn a reward function R(τ) from human preference data and directly optimize a policy to maximize it. In practice, both steps are intractable in realistic computer-use environments: learning a faithful reward requires annotating full execution trajectories at prohibitive cost, and optimizing over the vast trajectory space is computationally infeasible. We therefore adopt a generate-and-verify scaffold that decouples trajectory generation from evaluation.

Given an initial harmful state, an LM-based generator LM_gen proposes N diverse candidate recovery plans in natural language. Each plan is expressed as a high-level sequence of intended actions, avoiding the prohibitive difficulty of annotating low-level GUI or OS states. An LM-based verifier LM_ver then scores these candidates and selects the most promising plan for execution. We explore two verifier variants.

Rubric-Based Verifier

The rubric-based verifier prompts a frozen LM (GPT-4.1) to perform pairwise A/B judgments over candidate plans, conditioned on the eight-dimensional rubric derived from our formative study. Pairwise results are aggregated via majority wins into an overall preference ordering, and the agent executes the top-ranked plan. This verifier must infer the context-dependent weighting of rubric dimensions entirely through its pretrained priors.

Reward Model Verifier

Rather than prompting a frozen LM with a rubric, the reward model verifier learns context-dependent weightings implicitly from human annotations. We fine-tune Qwen3-0.6B (Qwen Team, 2025) on our dataset of 1,130 pairwise human preference judgments using a standard pairwise ranking objective. At test time, the reward model scores each candidate plan and the agent executes the one with the highest learned alignment score. Unlike the rubric-based approach, this verifier can adapt to subtle, context-specific trade-offs in recovery quality without requiring explicit rubric engineering.

BackBench: A Benchmark for Harm Recovery

To evaluate recovery capabilities systematically, we introduce BackBench, a benchmark of 50 diverse computer-use tasks built on top of the OSWorld environment (Xie et al., 2024). Each task places an agent in a realistic, contextually grounded harmful initial state within a real Ubuntu GUI environment and challenges it to backtrack or remediate the harm to return to a safe operational state.

Harm Categories

BackBench scenarios are drawn from a taxonomy of five macro harm categories: Availability (disruption of system or resource access), Financial (direct or indirect monetary harm), Integrity (loss, corruption, or manipulation of data), Deliberate misuse (exploitation of systems to cause harm), and Security (threats to systems or sensitive data exposure).

Task Design and Evaluation

For each category, we design 4–6 initial states and programmatically instantiate them in a virtual Linux desktop. Each initial state is permuted across two step limits—15 steps and 50 steps—following OSWorld convention, to simulate different resource constraints the agent might face in deployment. The agent is made aware of the relevant step limit through the prompt.

Because the optimality of a recovery trajectory depends on subjective human judgments rather than a single correct endpoint, BackBench evaluates systems using a comparative A/B preference framework: human annotators are shown pairs of complete agent trajectories for the same scenario and asked which is superior. Pairwise judgments are then aggregated using the Bradley-Terry rating system (Bradley & Terry, 1952), producing relative performance scores across evaluated scaffolds.

Results

We evaluate our two generate-and-verify scaffolds against the base model without scaffolding, using Claude Sonnet 4.5 (Anthropic, 2025) as the underlying computer-use model. We chose Sonnet 4.5 deliberately: it holds the top spot on the OSWorld leaderboard and is Anthropic's most safety-aligned model, establishing a strong baseline. Any improvements our scaffolds provide must overcome an already highly capable and aligned foundation model, making observed gains particularly meaningful indicators of the value of domain-specific alignment mechanisms.

Using Bradley-Terry ratings computed via maximum likelihood estimation and converted to a chess-like scale, we find that both preference-guided scaffolds are strongly preferred over the base model. Across all 50 BackBench tasks, the reward model scaffold achieves a 120-point Elo increase and the rubric-based scaffold achieves a 75-point Elo increase over the base model. Hypothesis testing via bootstrap resampling confirms statistical significance (p = 0.004 for the reward model and p = 0.008 for the rubric-based scaffold). Notably, the advantage of both scaffolds is most pronounced in the 15-step limit tasks, indicating that preference-guided planning is especially effective under resource constraints.

The reward model's superior performance over the rubric-based approach may be attributed to its ability to learn context-dependent trade-offs directly from human annotations. While both verifiers have access to the same rubric dimensions, the relative importance of these attributes varies substantially across scenarios, as shown in our moderation analysis. The rubric-based verifier must infer these weightings through the base model's pretrained priors, which may not align with human judgments in specialized harm recovery contexts. In contrast, the reward model implicitly captures how humans prioritize different recovery attributes conditional on scenario characteristics, enabling more human-aligned plan selections—particularly in scenarios where recovery trade-offs are subtle.

BibTeX

@article{li2026human,
          title={Human-Guided Harm Recovery for Computer Use Agents},
          author={Li, Christy and CH-Wang, Sky and Peng, Andi and Bobu, Andreea},
          journal={arXiv preprint arXiv:2604.18847},
          year={2026}
        }