Harness Engineering for AI Coding Agents from Argentina


Your developers switched on AI coding agents and throughput jumped. Then the second-order problems surfaced: architecture drift across repositories, inconsistent patterns that pass tests but misbehave in production, security controls that are enforced in one module and silently bypassed in another. Agents ship code faster than any reasonable review cadence can absorb, and the failures cluster in ways nobody designed for.

Harness engineering is the emerging practice that contains this. Not by restricting agents, but by wrapping them in constraints, verification loops, and feedback systems that make their output predictable and auditable. We build those harnesses out of our Córdoba delivery center for engineering organizations in North America and Europe.

Siblings Software is a software outsourcing company headquartered in Córdoba, Argentina, with daily overlap with US Eastern time. We have been building outsourced engineering squads since 2014, and our AI practice now treats harness engineering as a first-class capability alongside platform engineering.

Architecture diagram showing AI agent output flowing through constraint, feedback, and observability harness layers on its way to production code

Our Services Contact Us

Why AI Agents Need Harnesses, Not More Prompts

The term picked up momentum in early 2026 after Mitchell Hashimoto (co-founder of HashiCorp) put a name to something many engineering leaders had already concluded the hard way: when an agent misbehaves, the right fix is usually a change to the system around it, not another sentence added to a prompt. OpenAI echoed the point shortly after, describing how they shipped an internal product with effectively no hand-written code by investing in the scaffolding that surrounded their Codex agents rather than the prompts driving them.

The Thoughtworks Technology Radar (Volume 34, April 2026) flagged harness engineering as one of the defining macro trends in the industry. Their read: the space is shifting from experimentation toward repeatability and stability. Agent reliability has moved from "one concern among many" to a board-level risk at several of the companies we talk to.

The core argument is simple. Telling an agent "follow our coding standards" in a system prompt is like telling a new hire to "write good code." It works some of the time. LLM compliance with instructions is probabilistic, not deterministic. Your agent can honor every architectural convention on Monday and introduce a direct database call that skips your ORM layer on Tuesday. Prompts suggest; they do not enforce.

A harness adds the enforcement layer. AGENTS.md files encode architectural boundaries in a format agents can reason about. CI gates reject code that violates those boundaries before it reaches the main branch. Plan-Execute-Verify loops push agents through explicit checkpoints. Observability dashboards surface whether the harness is working or whether new failure patterns are quietly forming. Each layer reinforces the others: constraints reduce the volume of failures, verification catches what slips through, and observability exposes the patterns both layers missed.

Comparison illustration of prompt engineering limits such as probabilistic compliance and context window pressure against harness engineering strengths such as deterministic enforcement and layered constraints

What We Build

Our harness engineering practice covers five service areas. Most engagements begin with constraint design and verification, then expand into orchestration and observability as your team's agent usage matures.

Five harness engineering service areas from Siblings Software: agent constraint design with AGENTS.md and rules files, verification pipelines with PEV loops and CI gates, agent orchestration infrastructure, observability with DORA integration, and team enablement with workshops and playbooks

Agent Constraint Design

We author the AGENTS.md, rules files, and constraint specifications that tell agents what they can and cannot do inside your codebase. This goes well beyond project context. We encode architectural boundaries (which modules are off-limits for modification), testing requirements (what coverage is mandatory), security constraints (how credentials must be handled), and dependency policy (which packages are approved). The constraint layer is version-controlled and evolves with your codebase.
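As a flavor of what "well beyond project context" means in practice, here is a condensed, hypothetical excerpt of the constraint sections such a file might contain. The paths, module names, and thresholds are illustrative, not a prescribed schema:

```markdown
## Architectural Boundaries
- All database access goes through the ORM layer (`app/db/`); direct SQL is prohibited.
- Modules under `core/billing/` are off-limits for agent modification; propose changes as design notes instead.

## Testing Requirements
- New API routes require integration tests before merge.
- Minimum 80% line coverage on files changed in the PR.

## Security Constraints
- Credentials come from the secrets manager only; never hard-code or log them.

## Dependency Policy
- Only packages listed in `approved-deps.txt` may be added.
```

Each of these lines is backed by a CI check, which is what separates a constraint file from a wish list.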

Verification Pipeline Build

We implement Plan-Execute-Verify (PEV) loops that push agents through explicit checkpoints. The agent plans, executes inside a sandbox, then the output is verified against your test suite, static analysis, architectural compliance checks, and mutation testing. Failed verification returns to the agent with specific remediation instructions. We wire this into whichever CI/CD you already run: GitHub Actions, GitLab CI, Jenkins, Azure DevOps, or CircleCI.

Agent Orchestration Infrastructure

For teams running more than one agent (Claude Code for complex refactors, Copilot for routine tasks, Codex for batch operations), we build the coordination layer. Task decomposition so agents work on independent slices and stop colliding in merges. Sandbox isolation per agent run. Rollback so failed runs do not contaminate main. This is the "agent orchestration as the new CI/CD" pattern the industry is converging on.
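The task-decomposition idea can be sketched in a few lines. Below is a minimal, hypothetical scheduler (not our production orchestrator) that only lets agent runs proceed in parallel when their file slices are disjoint:

```python
def independent(slice_a: set[str], slice_b: set[str]) -> bool:
    """Two task slices can run concurrently if they touch no common files."""
    return slice_a.isdisjoint(slice_b)

def schedule(tasks: dict[str, set[str]]) -> list[list[str]]:
    """Greedily group tasks into waves of mutually independent slices."""
    waves: list[list[str]] = []
    claimed: list[set[str]] = []  # union of files claimed by each wave
    for name, files in tasks.items():
        for i, used in enumerate(claimed):
            if files.isdisjoint(used):
                waves[i].append(name)  # joins an existing wave
                used |= files
                break
        else:
            waves.append([name])       # opens a new wave
            claimed.append(set(files))
    return waves

waves = schedule({
    "refactor-auth": {"auth/session.py", "auth/tokens.py"},
    "add-endpoint":  {"api/routes.py"},
    "fix-tokens":    {"auth/tokens.py"},  # collides with refactor-auth
})
# refactor-auth and add-endpoint share a wave; fix-tokens waits for the next one.
```

The real coordination layer adds sandbox isolation and rollback around each wave, but the scheduling invariant is exactly this: no two concurrent runs may claim the same files.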

Observability and Metrics

You cannot improve what you do not measure. We deploy dashboards that track agent task resolution rate, pass@1 rate (whether the agent gets it right on the first attempt), rework rate, and architecture drift. These integrate with DORA metrics so you can see how agent-generated code affects broader delivery performance. The dashboards surface when a harness is too loose (too many failures) or too tight (blocking valid work).

Team Enablement and Training

The hardest part of harness engineering is human, not technical. Developers shift from writing code to writing specs, reviewing agent output, and maintaining constraint files. We run workshops on spec-driven development, build agent workflow playbooks tailored to each squad, and help manage the cognitive debt that builds up when teams delegate code generation to agents without preserving understanding of what was built.

Our harness work connects naturally with our AI code security practice for the security constraint layer, our AI-powered testing team for verification pipelines, and our AI DevOps practice for the CI/CD surface area.

The Plan-Execute-Verify Loop

PEV is the pattern we implement in almost every engagement. It turns agent coding from a "generate and hope" workflow into a structured engineering process with quality gates that are actually enforced.

Plan-Execute-Verify loop diagram with three phases: plan phase where the agent reads specs and checks constraints, execute phase where the agent generates code in a sandbox, and verify phase that runs tests, static analysis and architecture compliance

Plan

The agent reads the specification, loads project context from AGENTS.md and rules files, decomposes the task, checks constraints, and proposes an implementation approach. For high-risk changes we insert a human approval gate between plan and execution. The agent does not move forward until the approach is explicitly signed off.

Execute

The agent generates code and tests inside a sandboxed environment. Execution is bounded by the constraints defined in the plan. Changes land as incremental commits, not one massive diff. If the agent tries to touch a protected module or pull in a disallowed dependency, the constraint layer blocks the action before it reaches the repository.

Verify

The test suite runs at every level: unit, integration, end-to-end. Static analysis flags code quality and security issues. Architecture compliance verifiers confirm the change respected module boundaries. Mutation testing checks that the new tests actually catch bugs and are not just touching code paths. Pass/fail feedback, with specific context, goes back to the agent for retry.

The part that makes PEV different from "run tests after the agent writes code" is the planning phase. Without it, agents generate plausible-looking code that compiles and ships but quietly introduces architectural problems. A constraint-informed plan cuts those issues by roughly 60 to 70 percent in the engagements we have measured. The remainder gets caught in verification.
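The three phases can be summarized as a bounded retry cycle. This is a structural sketch with hypothetical function names, not a specific product API; a real verify stage runs your test suite, static analysis, and compliance checks rather than the toy check shown here:

```python
from dataclasses import dataclass, field

@dataclass
class VerifyResult:
    passed: bool
    feedback: list = field(default_factory=list)

def verify(change):
    """Toy stand-in for the real gate (tests, static analysis, arch checks)."""
    problems = []
    if "raw_sql" in change:
        problems.append("direct SQL bypasses the ORM layer")
    return VerifyResult(passed=not problems, feedback=problems)

def pev_loop(task, generate, max_retries=3):
    """Plan is folded into the first generate call; execute/verify with bounded retries."""
    feedback = []
    for _ in range(max_retries):
        change = generate(task, feedback)  # executes inside a sandbox
        result = verify(change)            # enforced gate, not a suggestion
        if result.passed:
            return change                  # ready for human review / merge
        feedback = result.feedback         # specific remediation context for the retry
    return None                            # retries exhausted: escalate to a human

# Hypothetical agent stub that fixes the violation once it receives feedback.
attempts = []
def fake_generate(task, feedback):
    attempts.append(list(feedback))
    return "raw_sql query" if feedback == [] else "orm query"

result = pev_loop("migrate signup flow", fake_generate)
```

The key design choice is that failed verification feeds *specific* context back into the next attempt instead of a bare rejection, which is what makes retries converge.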

AGENTS.md: Where Constraints Actually Live

The AGENTS.md open standard, released in 2025, gives agents structured project context, but context alone is not a harness. Most teams we walk into already have some version of AGENTS.md or CLAUDE.md. The file typically lists the tech stack, sketches a few conventions, and includes a handful of "don't do this" notes.

That is a starting point, not a constraint system. A well-engineered AGENTS.md includes architectural boundaries that map to enforceable CI checks, testing requirements tied to specific verification gates, security rules that reference actual scanning configurations, and dependency policy that mirrors your package manager setup. The file stops being a suggestion list and becomes a contract between humans and agents.

AGENTS.md file structure with a project context block listing the tech stack, an architectural boundaries section defining guardrails, and a testing requirements section with verification rules

We write AGENTS.md files that behave like executable specifications. Every constraint in the file maps to a CI check that enforces it. Every testing requirement maps to a verification gate. When the file says "all database queries go through the ORM," there is a static analysis rule that catches direct SQL. When it says "integration tests required for API routes," the pipeline blocks merges without them.
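As a simplified stand-in for the ORM rule (a real deployment would typically use a Semgrep rule; the pattern below is illustrative and deliberately narrow), the static check might look like:

```python
import re

# Flags cursor.execute(...) and cursor.executemany(...) calls in a diff;
# a production rule would be AST-based and cover many more SQL entry points.
DIRECT_SQL = re.compile(r"\bcursor\.execute(?:many)?\s*\(")

def check_orm_only(diff: str) -> list[str]:
    """Return a violation per line that issues SQL outside the ORM layer."""
    violations = []
    for lineno, line in enumerate(diff.splitlines(), start=1):
        if DIRECT_SQL.search(line):
            violations.append(f"line {lineno}: direct SQL call bypasses the ORM")
    return violations

# A CI gate fails the build whenever check_orm_only returns anything.
```

The point is the pairing: the sentence in AGENTS.md and the check in CI describe the same rule, so the file stays honest.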

How an Engagement Actually Runs

Most engagements follow four phases across 10 to 14 weeks. A focused rollout for a single team or a small repository set typically reaches full deployment in 4 to 6 weeks.

Four-phase harness engineering engagement timeline: agent audit in weeks 1 to 2, harness design in weeks 3 to 5, build and integrate in weeks 5 to 10, and handoff and training in weeks 10 to 14

Phase 1: Agent Audit (Weeks 1-2)

We map current agent usage across teams and repositories. Which tools are people actually using (not which were approved)? What share of recent commits involved AI assistance? Where are the failure clusters? We measure baseline metrics: task resolution rate, rework rate, architecture drift. The output is a prioritized inventory of harness requirements plus an implementation plan you can socialize with leadership.

Phase 2: Harness Design (Weeks 3-5)

Using the audit, we design the constraint architecture. Which boundaries need hard enforcement? Where do verification gates fit in the pipeline? What level of human oversight is appropriate for different risk classes? How do multi-agent workflows get coordinated? We make these decisions with your team in the room. A harness imposed from outside does not survive the first week of real use.

Phase 3: Build and Integrate (Weeks 5-10)

The implementation phase. AGENTS.md and rules files per repository. PEV loop infrastructure. CI gate configuration. Agent sandboxing. Observability dashboards. Threshold calibration. Every component is exercised against real agent workflows, not hypothetical ones. Thresholds get tuned to avoid over-constraining, because a harness that blocks too much valid work is worse than no harness at all.

Phase 4: Handoff and Training (Weeks 10-14)

Documentation, runbooks, spec-writing workshops, and hands-on training for maintaining the harness as your codebase and agent stack evolve. We teach your team to update constraints, adjust verification gates, and read the observability data. The default outcome is full ownership transfer. Ongoing support is optional, not assumed.

The harness hooks into your existing DevOps pipelines and plays well with our AI agents development practice if you are also building agent-powered features into your product.

Case Study: Taming Agent-Generated Code in a Cross-Border SaaS

The Situation

A Series B SaaS platform with roughly 70 engineers across Miami, Buenos Aires, and Mexico City came to us in early 2026 after hitting a wall with their AI coding tool adoption. The team had rolled out Claude Code and Cursor Agent six months earlier. Per-developer throughput rose around 40 percent on their internal tracking. Leadership was happy with the numbers.

Then three things landed in the same month. A production incident traced back to an AI-generated database migration that bypassed their ORM and broke data integrity for roughly 12,000 customer records. A security audit flagged credential handling inconsistencies across 8 of their 14 repositories, all in agent-generated modules. And their lead architect realized their monorepo had developed two competing API patterns because different agents had introduced different conventions in different services, and nobody noticed until integration failures surfaced.

They tried fixing this with better prompts and more elaborate system instructions. It worked for roughly two weeks. Then fresh instances of the same patterns showed up elsewhere. Their VP of Engineering put it plainly: "We cannot give back the speed, but we cannot keep shipping at this defect rate either."

What We Built

We stood up a five-person squad from Córdoba over 12 weeks: two platform engineers, two DevOps specialists, and a harness architect leading the engagement. Our Argentine team overlapped four hours daily with their Miami leads, which mattered more than anyone expected when constraint thresholds needed tuning in real time.

The core decisions were:

  • AGENTS.md files per repository with enforceable architectural boundaries. The ORM-only constraint was backed by a Semgrep rule that blocked direct SQL in CI. The duplicate API pattern was resolved with a shared API style guide agents had to consume before planning.
  • PEV loops in GitHub Actions that required agent-generated PRs to pass plan validation, automated testing at three levels, and architecture compliance checks before merge was allowed.
  • A multi-agent coordination layer that assigned Claude Code to complex refactors and Cursor to routine feature work, with task decomposition to stop concurrent agent sessions from colliding.
  • Observability dashboards tracking agent success rate, rework rate, and drift detection across all 14 repositories, plugged into their existing Datadog setup.

Case study results chart: agent task resolution rate lifted from 31 percent to 73 percent, rework dropped 82 percent, feature delivery throughput rose 2.4x, and the client avoided roughly 1.6 million dollars in annual rework cost

Six months after deployment the team held the 40 percent throughput gain from AI tools while taking defect rates below their pre-agent baseline. Their VP of Engineering summarized it as "we stopped fighting the agents and started working with them." For more proof points, see our case studies page.

Cognitive Debt: Where Most Clients Get It Wrong

Most companies treat harness engineering as a purely technical problem. Build the CI gates, write the rules files, stand up dashboards. Done. The harder failure mode is not technical. It is cognitive debt: the widening gap between what your codebase actually does and what your team understands about it.

When agents generate 60 or 70 percent of your code, developers lose touch with implementation details they used to know cold. A senior engineer who used to understand every quirk of the authentication module now reviews agent-generated changes to it without the same mental model. The code works. The tests pass. Understanding erodes quietly. Six months later, when something breaks in a way the tests did not cover, nobody on the team has the context to debug it quickly.

Cognitive debt spectrum showing risk levels from low where a human writes the spec and reviews the output, to moderate where the agent generates and the human spot-checks, to high with full autonomy and no human in the loop, with harness engineering keeping teams in the productive middle zone

This is why our engagements always include the team enablement component. We help engineering managers decide which parts of the codebase demand deep human understanding (security-critical paths, data integrity logic, customer-facing payment flows) and which can safely become "agent-maintained territory" with lighter oversight. The target is not maximum agent autonomy. It is the right balance between velocity and comprehension for each team.

When Outsourcing Harness Engineering to Argentina Makes Sense

Not always. An honest breakdown:

Outsourcing Is a Good Fit When

  • You have no platform engineers with agent reliability experience on staff (most teams do not; the discipline barely existed 18 months ago).
  • Agent-generated code is already causing production issues and you need harnesses operational in weeks, not quarters.
  • Your team uses multiple AI coding tools and you need harnesses that stay coherent across all of them.
  • You want the harness built once, tuned to your codebase, and handed off to your internal team for ongoing maintenance.
  • You have more than 30 engineers using agents and the coordination overhead has become unmanageable through ad-hoc review.
  • You want nearshore delivery with time-zone overlap with US Eastern time, not a 12-hour async relay across APAC.

Building In-House Makes More Sense When

  • You already have a strong platform engineering team that just needs a reference architecture and some sparring partners.
  • Your team is small enough (under 15 engineers) that informal review still catches most issues.
  • You use a single AI coding tool and your constraint needs are narrow and well-scoped.
  • You have 6 to 9 months of runway to iterate on the harness without urgent production pressure.

Comparison table showing outsourcing strengths in time to first harness, cost profile, and cross-repository experience versus in-house advantages in long-term maintenance and deep institutional knowledge

The Cost Reality

Hiring a platform engineer with agent harness experience, a DevOps specialist who can build PEV infrastructure, and an architect fluent in multi-agent coordination in major US markets typically runs upward of USD 650,000 per year in fully loaded compensation. You are also hiring into a specialty with a painfully thin talent pool, since the discipline barely existed 18 months ago. Our nearshore model from Córdoba delivers the same skill set at roughly 40 to 55 percent of that cost, with engineers who overlap with US East Coast hours and document in English by default.

For project-based engagements, typical harness engineering builds land between USD 80,000 and USD 240,000 depending on scope, repository count, agent tools in use, and the complexity of your CI/CD. Focused single-team engagements start around USD 45,000. Dedicated teams run from roughly USD 24,000 per month.

Discuss Your Project

Measuring Harness Effectiveness

Harness engineering without measurement is guesswork. We deploy a compact set of metrics that show whether the harness is working and where it needs adjustment.

Bar chart comparing agent reliability metrics before and after harness engineering: task resolution rate improving from 38 to 84 percent, pass at 1 rate from 25 to 67 percent, rework rate dropping from 58 to 15 percent, and architecture drift from 42 to 8 percent

Task Resolution Rate tracks the percentage of tasks an agent resolves correctly, verified against automated tests. DORA-style research suggests top agents in well-harnessed environments hit 65 to 77 percent. Without harnesses, most teams sit between 25 and 40 percent.

Pass@1 Rate measures whether the agent gets it right on the first attempt, with no retries. This matters because retries burn compute budget and developer attention. A well-tuned harness pushes pass@1 above 60 percent on routine tasks.

Rework Rate shows what share of agent-generated PRs require human intervention after the first verification pass. Pre-harness rework rates of 50 to 60 percent are common. Post-deployment, we target below 15 percent.

Architecture Drift tracks how often agent-generated code violates established architectural patterns. This surfaces the slow degradation that per-PR review misses. We integrate drift detection into your existing observability stack.
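As a sketch of how these four numbers can be computed from a run log (the record fields here are assumptions for illustration, not a specific telemetry schema):

```python
def harness_metrics(runs: list[dict]) -> dict[str, float]:
    """Compute the four harness metrics as fractions of all agent runs."""
    total = len(runs)
    resolved = sum(r["resolved"] for r in runs)                      # task resolution rate
    first_try = sum(r["resolved"] and r["attempts"] == 1 for r in runs)  # pass@1
    reworked = sum(r["human_rework"] for r in runs)                  # rework rate
    drifted = sum(r["arch_violation"] for r in runs)                 # architecture drift
    return {
        "task_resolution_rate": resolved / total,
        "pass_at_1": first_try / total,
        "rework_rate": reworked / total,
        "architecture_drift": drifted / total,
    }

runs = [
    {"resolved": True,  "attempts": 1, "human_rework": False, "arch_violation": False},
    {"resolved": True,  "attempts": 2, "human_rework": True,  "arch_violation": False},
    {"resolved": False, "attempts": 3, "human_rework": True,  "arch_violation": True},
    {"resolved": True,  "attempts": 1, "human_rework": False, "arch_violation": False},
]
m = harness_metrics(runs)
```

In practice these records come from CI events and PR metadata rather than a hand-built list, and the dashboards trend the ratios over time.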

How to Work With Us

Three ways to work with us, depending on where you are with agent adoption.

Project-Based Outsourcing

We design and build the complete harness engineering infrastructure and hand it over. Best for companies that want a production-ready harness without managing the build internally. Typical duration: 10-14 weeks. Includes AGENTS.md authoring, PEV loop deployment, CI gate configuration, observability setup, and team training.

Learn More

Dedicated Engineering Team

An ongoing team embedded in your organization: harness architects, platform engineers, and DevOps specialists. They maintain and evolve the harness as your agent usage grows, onboard new repositories, tune constraints, and monitor effectiveness. Works as an extension of your platform team, with US East overlap.

Hire a Team

Staff Augmentation

Embed individual harness engineers into your existing platform or DevOps team. Best when you already have the strategy defined but need hands-on expertise to implement PEV loops, configure agent gates, write AGENTS.md files, or build observability dashboards.

Hire Engineers

Frequently Asked Questions

What is harness engineering?

It is the discipline of designing the environments, constraints, and feedback loops that make AI coding agents reliable at scale. Instead of relying on prompts, you wrap agents in deterministic guardrails: AGENTS.md files that encode architectural boundaries, CI gates that block non-compliant code, and Plan-Execute-Verify loops that catch failures before they reach production.

Which AI coding agents do you build harnesses for?

We build harnesses across Claude Code, Cursor (Agent mode), GitHub Copilot Workspace, OpenAI Codex, JetBrains Junie, Augment Code, and Amazon Q Developer. Each has different constraint mechanisms and failure modes. We design harnesses that stay coherent across whichever combination your team uses, including multi-agent orchestration.

How much does a harness engineering engagement cost?

Project-based engagements typically run USD 80,000 to USD 240,000 depending on repository count, agent tools in use, and team size. Focused single-team engagements start around USD 45,000. Dedicated teams run from roughly USD 24,000 per month. We scope every engagement after a discovery call; you will not get a blind quote.

How long does it take to see results?

Basic constraint harnesses and AGENTS.md files are usually operational within two weeks, with a measurable drop in agent rework by week four to six. Full enterprise deployments including PEV loops, multi-agent orchestration, and observability take 10 to 14 weeks. Single-team rollouts can reach full deployment in 4 to 6 weeks.

Do you work in US time zones?

Yes. Córdoba overlaps with US Eastern time most of the year. Stand-ups happen at reasonable hours for both sides, design reviews are synchronous, and feedback on harness changes is same-day. For harness engineering that matters disproportionately, because constraint tuning requires tight iteration with your own developers, not an overnight handoff.

Do we own the harness after the engagement ends?

Full ownership transfer is the default. We deliver runbooks, internal documentation, spec-writing workshops, and hands-on training so your team can maintain, extend, and tune the harness without us. Ongoing support is available but is not baked into the project cost. Several clients keep a part-time harness architect with us for 6 to 12 months post-handoff, mostly for quarterly reviews as their agent stack evolves.


CONTACT US