When AI Agents Rewrite Their Own Rules: Self-Improving Harnesses for Real Estate

The Part of Your AI Agent That Actually Breaks

When an enterprise AI agent fails in production, the instinct is to blame the model. Usually that is the wrong place to look. An agent's behaviour is governed as much by its harness — the system prompt, the tools it can call, its memory, its verification rules, its runtime policies, and its failure-recovery logic — as by the language model underneath. Popular agent harnesses like SWE-agent, Claude Code, Codex, and OpenHands wrap the same frontier models; what separates a reliable agent from a flaky one is mostly that surrounding layer.

The catch: that layer is almost always tuned by hand. An engineer watches a few failures, forms a hunch, edits a prompt or a rule, and hopes. As lead author Hangfan Zhang of the Shanghai AI Laboratory put it, the deeper problem is that this paradigm "often lacks a systematic feedback loop" — edits are made on intuition and ad hoc debugging rather than evidence. With new models shipping every few weeks, hand-tuning a model-specific harness becomes a treadmill nobody can keep up with.

A 2026 paper covered by VentureBeat — "Self-Harness" (arXiv:2606.09498) — proposes a fix: let the agent improve its own harness, from the evidence of its own runs.

From Guesswork to a Feedback Loop

This is a close cousin of the pattern we covered in our earlier piece on self-improving AI agents, but aimed one layer down. Where that work evolves the high-level orchestration, Self-Harness evolves the concrete operating rules an agent runs under. It does so without touching the model's weights and without leaning on a bigger, more expensive model to supervise it: the improvement comes from the agent's own execution traces.

It runs as a three-stage loop:

1 · Weakness mining → 2 · Harness proposal → 3 · Proposal validation

↻ edits that pass regression tests merge into the next harness version

1. Weakness mining. Starting from a minimal harness, the agent runs a batch of tasks with verifiable outcomes, then categorizes the failed traces to find model-specific failure patterns — the mistakes this particular model keeps making.

2. Harness proposal. A "proposer" role turns each failure pattern into a small, targeted edit tied to that specific mechanism — not a vague "try harder" instruction, and deliberately minimal to avoid over-correcting.

3. Proposal validation. Each candidate edit is run through regression tests. It is promoted only if it improves performance without measurable degradation on held-out tasks. Multiple passing edits merge into the next harness version, which becomes the starting point for the next round. The acceptance gate is the whole game: improvement without regression, proven on data.
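
The three stages above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the names Harness, Trace, evolve and toy_run are hypothetical, the "proposer" is a plain lookup where a real system would use the model itself, and the gate is simplified to a single held-out pass rate.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Harness:
    version: int
    rules: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Trace:
    task_id: str
    passed: bool
    failure_tag: Optional[str] = None  # e.g. "duplicate_retry"; None when passed


# Hypothetical signature: run one task under a harness, return a verifiable trace.
RunTask = Callable[[Harness, str], Trace]


def pass_rate(run: RunTask, harness: Harness, tasks: List[str]) -> float:
    return sum(run(harness, t).passed for t in tasks) / len(tasks)


def mine_weaknesses(traces: List[Trace]) -> List[str]:
    """Stage 1: group failed traces by failure pattern, most frequent first."""
    counts = Counter(t.failure_tag for t in traces if not t.passed and t.failure_tag)
    return [tag for tag, _ in counts.most_common()]


def propose_edit(failure_tag: str) -> str:
    """Stage 2: one small rule per failure pattern (a lookup here, for illustration)."""
    return f"guard:{failure_tag}"


def evolve(run: RunTask, harness: Harness,
           train_tasks: List[str], heldout_tasks: List[str]) -> Harness:
    traces = [run(harness, t) for t in train_tasks]
    baseline = pass_rate(run, harness, heldout_tasks)
    accepted = []
    for tag in mine_weaknesses(traces):
        rule = propose_edit(tag)
        candidate = Harness(harness.version, harness.rules + (rule,))
        # Stage 3: promote only edits that beat the current harness on held-out tasks.
        if pass_rate(run, candidate, heldout_tasks) > baseline:
            accepted.append(rule)
    if not accepted:
        return harness  # nothing passed the gate, keep the current version
    return Harness(harness.version + 1, harness.rules + tuple(accepted))


def toy_run(harness: Harness, task_id: str) -> Trace:
    # Toy stand-in for a real agent: "dup-*" tasks fail unless the guard rule exists.
    needs = "guard:duplicate_retry" if task_id.startswith("dup") else None
    ok = needs is None or needs in harness.rules
    return Trace(task_id, ok, None if ok else "duplicate_retry")


v0 = Harness(version=0)
v1 = evolve(toy_run, v0, ["dup-1", "ok-1", "dup-2"], ["dup-3", "ok-2"])
```

In the toy run, v1 carries one new rule and a bumped version number. A second call to evolve on v1 finds no failures and returns the harness unchanged.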

The Results — and Why They Are Credible

The researchers tested Self-Harness on Terminal-Bench-2.0, a benchmark for general tool use (managing artifacts, running commands, verifying results, and recovering from errors), across three models: MiniMax M2.5, Qwen3.5-35B-A3B, and GLM-5. They froze everything except the harness: same model, same tools, same evaluator. On held-out tasks, performance climbed by 33% to 60% in relative terms; Qwen3.5-35B-A3B, for example, went from 23.8% to 38.1%, a gain of 14.3 percentage points.
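
Relative and absolute gains are easy to conflate, so here is the arithmetic behind the Qwen3.5-35B-A3B figures as a small Python sketch. Only the two scores come from the reported results; the function name is illustrative.

```python
def relative_lift(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


before, after = 23.8, 38.1                 # held-out scores, in percent
points = after - before                    # absolute gain in percentage points
lift = relative_lift(before, after)        # relative gain as a fraction

print(f"absolute: {points:.1f} points, relative: {lift:.1%}")
```

The absolute gain is about 14 percentage points, and the relative gain is roughly 60%, which is the top of the reported 33% to 60% range.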

What makes the result more than a benchmark number is what the system changed. The edits are specific and legible, not "make the prompt longer":

MiniMax M2.5 kept exploring dataset configurations until it timed out and shipped nothing. Self-Harness wrote a "loop breaker" into its runtime policy — stop and redirect after 50 tool calls — plus a rule to produce an initial version of any required artifact early.
Qwen3.5 would hit a file-overwrite error, blindly retry the same command, and eventually delete files in confusion. The fix was a strict command-retry discipline: no exact-duplicate commands.

Those are exactly the kinds of rules a seasoned engineer would add — discovered and validated automatically, from the model's own behaviour.
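
To make those two rules concrete, here is a minimal sketch of how a runtime policy could enforce them. It is an assumption-laden illustration, not the harness the paper generated. The class and method names are hypothetical. The 50-call budget mirrors the loop-breaker example. "No exact-duplicate commands" is interpreted here as "do not re-issue a command that already failed".

```python
from dataclasses import dataclass, field


def _normalize(command: str) -> str:
    # Collapse whitespace so "cp  a b" and "cp a b" count as the same command.
    return " ".join(command.split())


@dataclass
class RuntimePolicy:
    max_tool_calls: int = 50                      # loop-breaker budget
    calls: int = 0
    failed: set = field(default_factory=set)      # normalized commands that failed

    def check(self, command: str) -> str:
        """Return 'allow', 'redirect' (budget spent) or 'reject_duplicate'."""
        if self.calls >= self.max_tool_calls:
            return "redirect"                     # stop exploring, ship a first artifact
        if _normalize(command) in self.failed:
            return "reject_duplicate"             # force a different approach
        self.calls += 1
        return "allow"

    def record_failure(self, command: str) -> None:
        self.failed.add(_normalize(command))


policy = RuntimePolicy(max_tool_calls=3)          # tiny budget, for demonstration only
first = policy.check("cp report.csv out/")        # allowed
policy.record_failure("cp report.csv out/")       # the tool reported an overwrite error
retry = policy.check("cp  report.csv out/")       # same command again, so it is rejected
```

The agent loop would call check before every tool invocation and record_failure whenever a tool reports an error. What "redirect" triggers (a summary step, a forced draft of the artifact) is left to the surrounding harness.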

What This Means for Real Estate and PropTech

Real estate AI lives in messy, high-variability workflows where harness failures — not model failures — quietly erode trust. Self-improving harnesses turn those ambiguous breakages into solvable, testable problems.

Document and lease pipelines that adapt. When a counterparty changes a contract template or a portfolio adds a new lease format, an agent often "looks broken." A harness loop surfaces exactly where it misreads the new format and proposes a targeted verification rule — instead of a human re-debugging from scratch.
Maintenance and dispatch that stop repeating mistakes. The retry-discipline and loop-breaker patterns map directly onto agents that re-route the same ticket forever or stall on a flaky vendor API. These are the operational failures that decide whether maintenance AI delivers real wins.
Model upgrades without a rewrite. Swapping in next quarter's cheaper, faster model usually means re-tuning the harness by hand. A self-harnessing loop re-discovers the new model's specific failure modes automatically — making model migration a routine, low-risk event.
Cheaper models, raised to reliable. A relative lift of up to 60% on hard tasks can move a smaller, cheaper model across the threshold of "good enough for production", which matters when you are running agents at portfolio scale.

How to Adopt It Without Creating New Risk

An agent that edits its own rules is also one that can regress. The same disciplines that make any agentic system trustworthy make this safe:

The acceptance gate is non-negotiable. No proposed edit ships unless it beats the current harness on a held-out set with no regressions. This single rule is what separates self-improvement from self-sabotage.
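
As a sketch of what such a gate might look like, the function below compares per-task results for the current and the candidate harness on the same held-out set. The function name and the dictionary format (task id mapped to pass/fail) are assumptions made for illustration.

```python
from typing import Dict


def accept_edit(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> bool:
    """Accept only if at least one task is newly solved and none regresses."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both harnesses must be scored on the same held-out tasks")
    regressions = [t for t in baseline if baseline[t] and not candidate[t]]
    gains = [t for t in baseline if candidate[t] and not baseline[t]]
    return not regressions and len(gains) > 0


current = {"lease-01": True, "lease-02": False, "ticket-07": True}
better = {"lease-01": True, "lease-02": True, "ticket-07": True}
trade_off = {"lease-01": False, "lease-02": True, "ticket-07": True}

ship_better = accept_edit(current, better)        # one gain, no regressions
ship_trade_off = accept_edit(current, trade_off)  # one gain, but one regression
```

Checking per task rather than comparing averages is a deliberate choice in this sketch: an edit that fixes one task while breaking another leaves the average flat or better, yet still counts as a regression. Because agent runs can vary between attempts, a production gate would likely repeat each task several times before trusting a single pass or fail.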

Verifiable outcomes first. The loop only works on tasks where success is machine-checkable. Before automating harness edits, invest in evaluation: a benchmark of real cases with objective pass/fail signals.
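
A machine-checkable benchmark can start very small. The sketch below scores a lease-extraction step against hand-labelled cases. Everything in it is hypothetical: the Case structure, the field name monthly_rent_eur, the sample texts, and the toy regex extractor that stands in for a real agent.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Case:
    case_id: str
    document: str
    expected: dict          # ground-truth fields labelled by a human


def score(extract: Callable[[str], dict], cases: List[Case]) -> Dict[str, bool]:
    """A case passes only if every expected field is extracted exactly."""
    results = {}
    for case in cases:
        got = extract(case.document)
        results[case.case_id] = all(got.get(k) == v for k, v in case.expected.items())
    return results


def toy_extract(document: str) -> dict:
    # Toy stand-in for the agent: only understands one phrasing of the rent clause.
    match = re.search(r"rent of EUR (\d+)", document)
    return {"monthly_rent_eur": int(match.group(1))} if match else {}


cases = [
    Case("lease-001", "Monthly rent of EUR 1450, due on the 1st.", {"monthly_rent_eur": 1450}),
    Case("lease-002", "The tenant pays 980 euros per month.", {"monthly_rent_eur": 980}),
]
results = score(toy_extract, cases)
```

Here the first case passes and the second fails because the extractor does not recognise the new phrasing. That objective failure signal is the kind of evidence a harness loop needs.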

Versioned, reversible harnesses. Treat each harness version like code — a diff, an owner, instant rollback. Improvement you cannot reverse is a liability, not a feature.
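
One way to get a diff, an owner and rollback without extra infrastructure is an append-only version history. The sketch below is illustrative only. HarnessStore and its methods are hypothetical names, and a real deployment would more likely keep harness files in an existing version control system.

```python
import difflib
from dataclasses import dataclass, field


@dataclass
class HarnessStore:
    history: list = field(default_factory=list)   # (owner, text) pairs, oldest first

    def commit(self, owner: str, text: str) -> int:
        self.history.append((owner, text))
        return len(self.history) - 1              # version number of the new entry

    def current(self) -> str:
        return self.history[-1][1]

    def diff(self, old: int, new: int) -> str:
        a = self.history[old][1].splitlines()
        b = self.history[new][1].splitlines()
        return "\n".join(difflib.unified_diff(a, b, f"v{old}", f"v{new}", lineterm=""))

    def rollback(self, to_version: int) -> int:
        # Re-commit the old text instead of deleting history, so the audit trail survives.
        owner, text = self.history[to_version]
        return self.commit(f"rollback-of:{owner}", text)


store = HarnessStore()
v0 = store.commit("alice", "Produce a first draft of required artifacts early.")
v1 = store.commit("self-harness", store.current() + "\nNever retry a failed command verbatim.")
change = store.diff(v0, v1)     # unified diff showing the added rule
v2 = store.rollback(v0)         # restores the v0 text as a new version
```

Rolling back by appending rather than truncating keeps every harness that ever ran in production available for later review.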

Humans still own irreversible actions. A self-tuning harness optimizes how the agent works; it must never remove the approval gate on sending a payment, signing a document, or emailing a tenant.
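
A simple way to keep that guarantee is to place the approval check in the dispatch code rather than in the editable harness, so no self-proposed edit can reach it. The sketch below assumes hypothetical action names and a hypothetical guarded_dispatch function. It illustrates the separation and is not a complete authorization system.

```python
from typing import Callable, Dict, Optional

# Hard-coded outside the harness: self-edits to prompts or rules cannot change this set.
IRREVERSIBLE_ACTIONS = frozenset({"send_payment", "sign_document", "email_tenant"})


class ApprovalRequired(Exception):
    """Raised when an irreversible action is attempted without a named approver."""


def guarded_dispatch(action: str, payload: dict,
                     handlers: Dict[str, Callable[[dict], str]],
                     approved_by: Optional[str] = None) -> str:
    if action in IRREVERSIBLE_ACTIONS and not approved_by:
        raise ApprovalRequired(f"{action} needs human approval")
    return handlers[action](payload)


handlers = {
    "draft_reply": lambda p: "drafted",
    "send_payment": lambda p: f"paid {p['amount_eur']}",
}

drafted = guarded_dispatch("draft_reply", {}, handlers)               # no approval needed
paid = guarded_dispatch("send_payment", {"amount_eur": 1200}, handlers,
                        approved_by="j.doe")                          # approved by a human
```

Calling guarded_dispatch for send_payment without approved_by raises ApprovalRequired, whatever the current harness version says.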

Frequently Asked Questions

What is an AI agent "harness"? It is the system around the model that makes it act: the system prompt, available tools, memory, verification rules, runtime policies, orchestration logic, and failure-recovery procedures. Many agent failures originate here rather than in the model itself.

Does Self-Harness retrain the model? No. The model's weights stay frozen. Only the harness — editable text and rules — changes, which makes it cheap, fast to iterate, and auditable, and it works even with frontier models you cannot fine-tune.

Is it safe to let an agent change its own rules in production? Yes, when every edit must pass a held-out regression test before shipping, harness versions are diffed and instantly reversible, and irreversible business actions still require human approval.

Where to Start

You do not need a self-evolving platform to benefit from this idea today. Pick one repetitive, high-volume workflow, and instrument it so success is measurable on real cases. Once you can score quality objectively, two things become possible: you can debug the harness with evidence instead of intuition, and you can eventually let the loop propose and validate its own edits. Without that measurement foundation, "self-improving" is just marketing.

VSBD designs and ships agentic AI systems for PropTech platforms across Europe and the USA, including the evaluation, observability, and harness foundations that make self-improvement safe rather than risky. It is the work behind our PropTech 2026 Awards nomination for agentic AI orchestration. If you want your real estate AI agents to get more reliable over time instead of silently breaking, we can help you build it.