¿En qué se diferencia la ingeniería de evaluación LLM del QA tradicional?

El QA tradicional prueba código determinista: misma entrada, misma salida, asserts exactos. Los LLM y agentes son probabilísticos; el mismo prompt puede dar respuestas distintas con drift sutil tras upgrades. La ingeniería de eval reemplaza asserts booleanos por distribuciones de score: groundedness, faithfulness, cumplimiento de formato, latencia y violaciones de política. Trata prompts, retrieval y agentes como sistema bajo prueba, con datasets versionados, jueces calibrados y pipelines de CI que bloquean releases malos como los unit tests bloquean builds rotos.

¿Construyen evaluaciones para sistemas RAG y agentes IA, o solo para asistentes de chat?

Diseñamos evals para los tres. Asistentes de chat usan rúbricas por tarea y probes de política. RAG recibe scoring de retrieval, auditoría de citas, faithfulness contra documentos fuente y groundedness end-to-end. Los agentes requieren evaluación de trayectoria: herramienta correcta, tarea completada, recuperación ante fallos, alcance respetado. Cada superficie tiene modos de falla distintos; adaptamos el harness a la modalidad en lugar de reutilizar una plantilla.

¿Con qué herramientas y frameworks de evaluación trabajan?

Stack open source por defecto: Promptfoo para CI de prompts, DeepEval para asserts estilo unit, RAGAS para retrieval y faithfulness, y tracing con LangChain o LlamaIndex según el repo. En plataformas comerciales integramos Braintrust, LangSmith, Arize Phoenix, Humanloop y las features de eval en Azure AI Foundry, Vertex AI y AWS Bedrock. Usamos el stack que ya pagan; evitamos migraciones forzadas salvo que una herramienta bloquee de verdad al equipo.

How long does an evaluation engineering engagement take?

Primeros scores en dos o tres semanas: dataset dorado inicial, un LLM-as-judge para el modo de falla más doloroso y un gate de CI que bloquea regresiones. Despliegue empresarial completo con multi-rúbrica, red-team, replay y dashboards suele llevar 8 a 14 semanas. Proyectos enfocados en una superficie y un ciclo de cambio de modelo cierran en 4 a 6 semanas.

¿Cuánto cuesta la ingeniería de evaluación LLM tercerizada?

Proyectos por alcance suelen ubicarse entre USD 80.000 y USD 240.000 según superficies, profundidad del dataset y cobertura red-team. Equipos dedicados desde unos USD 24.000 mensuales por un squad de tres personas. Refuerzo de un eval engineer: USD 7.000 a USD 11.000 por mes. Contratar in-house en EE. UU. hoy cuesta USD 160.000 a USD 240.000 base más beneficios, con talento escaso.

¿Quién es dueño de los datasets, jueces y pipelines después del proyecto?

Usted. Todo el artefacto va a sus repos bajo su IP. Datasets versionados en formatos legibles sin nuestra herramienta. Prompts de juez en el repo junto a criterios y muestras de calibración humana. Definiciones de CI en sus pipelines, no en un portal de vendor. Diseñamos cada proyecto para que, si nos reemplaza por un equipo interno, nada se rompa.

Ingeniería de Evaluación LLM desde Argentina

La demo del modelo impresionó a todos. Tres semanas después, con tráfico real, la bandeja de soporte se llena de respuestas sutilmente incorrectas, el equipo legal marcó dos citas a documentos que no existen y nadie puede decir si el cambio de prompt del martes mejoró o empeoró las cosas. No es un problema del modelo. Es un problema de evaluación, y para eso existe nuestra práctica de ingeniería de eval.

Diseñamos y construimos el harness alrededor de sus productos de IA: datasets dorados, jueces, gates de CI, suites red-team y replay de producción. El mismo tipo de infraestructura que su equipo de backend da por sentada, aplicada a los sistemas probabilísticos que sus ingenieros deben shippear cada dos semanas.

Siblings Software es una empresa de outsourcing de software con base en Córdoba, Argentina, con traslape diario con el huso horario EE. UU. Este. Construimos squads de ingeniería tercerizados desde 2014. La ingeniería de evaluación se formalizó como práctica dedicada dentro de nuestro cluster de IA en 2025, junto con harness engineering para agentes de código y observabilidad de agentes.

Flujo de alto nivel de ingeniería de eval: el output de un LLM o agente pasa por un golden dataset, jueces, gates de CI y replay de producción antes de llegar a un modelo medido y entregable

Nuestros servicios Contáctenos

Por qué la calidad LLM no se prueba como código

Los equipos de software llevan dos décadas perfeccionando pruebas deterministas: mismas entradas, misma salida, tests que pasan o fallan, cobertura que mide cuánto código tocan. Ese modelo mental se quiebra en el momento en que un modelo generativo entra en el request path.

Un LLM puede responder la misma pregunta de dos maneras distintas, ambas técnicamente correctas, una sutilmente fuera de política. Un pipeline RAG puede devolver el documento correcto con la sección equivocada. Un agente puede completar un flujo y filtrar un email de cliente. Ninguna de esas fallas lanza excepción: parecen software que funciona.

Según la actualización 2026 de Gartner sobre mercados de ingeniería IA, evaluación y observabilidad ya son categoría propia, y la calidad es la barrera principal para pasar de prototipo a producción. Un dato útil: cerca de cuatro de cada diez equipos que shippean una feature LLM regresan en calidad en noventa días, muchas veces sin darse cuenta.

Comparación lado a lado: testing determinístico vs evaluación probabilística, contrastando mismo-input-mismo-output, pasa/falla y código-bajo-prueba a la izquierda con mismo-input-output-variable, distribuciones de score y comportamiento-bajo-prueba a la derecha

La ingeniería de evaluación cierra esa brecha. No es un framework de tests bonito: es un modelo operativo pequeño — dataset curado, jueces calibrados (automáticos y humanos), pipeline de CI que puntúa cada cambio y loop de feedback desde producción al dataset para que el harness mejore con su tráfico.

Qué construimos para usted

Seis áreas de servicio en secuencia. La mayoría arranca con diseño de eval y un primer dataset dorado, suma gates de CI y jueces, y luego expande a red-team y replay de producción cuando lo básico está estable.

Seis áreas de servicio de ingeniería de eval en una grilla de 3 por 2: diseño de eval, golden datasets, pipelines de eval en CI, replay de producción, red-team y seguridad, y capacitación del equipo

Diseño de evaluación

We start with the failures, not the metrics. We sit with your support tickets, your incident logs, and the screenshots your sales engineers send you in disgust. From there we build a task taxonomy, write rubrics that map to real customer pain, and pick scoring policies that correlate with the business outcome you actually care about. Vague metrics like “quality score” are where most internal eval projects die.

Datasets dorados

The dataset is the contract. We curate test cases from production traces, write synthetic adversarial prompts, label expected behavior with domain experts, and version everything so you can answer the question “what changed” six months later. The dataset evolves with your product, but it never silently mutates. RAG-specific datasets include source documents, expected citations, and traps for known retrieval failures.

Pipelines de eval en CI

We wire your evals into the same CI/CD that runs your unit tests. Every prompt change, model upgrade, retrieval index rebuild, or agent tool addition triggers a scored run. Bad changes block the merge with a diff that explains which rubric regressed and on which examples. We work with Promptfoo, DeepEval, RAGAS, Braintrust, LangSmith, and the eval features in your cloud's AI platform.

Replay de producción

The dataset you hand-curate is never the dataset your users actually use. We sample real production traces, hash-redact PII, score them against the same rubrics, and feed the interesting ones back into the golden set. The replay loop is what makes the harness adaptive. It is also the layer that catches drift after a silent provider model upgrade.

Red-team y seguridad

Adversarial suites built around your specific threat model: jailbreak attempts, prompt-injection vectors, PII leakage probes, regulated-content traps, and policy violations defined in your terms of service. We map each probe to an enforceable rubric so coverage gaps become visible. Pairs naturally with our AI code security practice when the same engineering team owns both surfaces.

Enablement del equipo

Once the harness is running, your team needs to own it. We run rubric calibration workshops, build playbooks for adding new tasks, and pair with your engineers on the first two or three model changes after handoff. The objective is your in-house team running the eval program independently within a quarter, with us available for occasional rubric audits and red-team refreshes.

Las cuatro capas de un stack de evaluación

Un programa de eval sólido corre cuatro tipos de chequeos a distintos ritmos. Cada capa atrapa un modo de falla distinto. Saltarse una es cómo terminan con buenos scores en CI e incidentes vergonzosos en producción.

Pirámide de capas de evaluación: checks programáticos en la base, LLM-as-judge en el medio, review con humano en el loop por encima y sondas adversariales de red-team en la cima

Chequeos programáticos

Schema validation, latency budgets, cost ceilings, exact-match for closed-vocabulary tasks, and regex assertions for the things that simply must be present. Cheap, deterministic, and run on every request.

LLM-as-judge a escala

A second model scores outputs against a structured rubric and returns reasons. Not because it is perfect, but because it lets you score thousands of cases per change. We treat the judge as code: versioned prompts, agreement metrics with humans, and re-tests when the underlying judge model is upgraded.

Revisión humana en el loop

Domain experts spot-check the judge, calibrate the rubric, and label the hard examples that automated scorers cannot resolve. The volume is small. The value is keeping the rest of the stack honest.

Adversarial y red-team

A continuously refreshed set of probes that try to break the system: jailbreaks, injections, off-topic baits, PII fishing, policy stress-tests. This is where most real-world incidents originate, and the layer most internal eval programs forget to maintain after launch.

LLM-as-judge sin que el juez se desvíe

Using a model to score another model's output is the technique that makes eval work tractable above a few hundred test cases. It is also the technique that fails silently when the judge itself drifts, mis-reads the rubric, or starts agreeing with itself. We have seen teams celebrate a 15-point quality jump that turned out to be the judge becoming more lenient after a provider update.

A few practical rules we apply on every engagement. The judge prompt is checked into the repo and reviewed like code. Every batch of judge scores ships with an agreement metric against a small human-labeled sample. When the judge model upgrades, we re-run the calibration set before trusting any new numbers. Ties between judges are broken by humans, never by majority vote of cousin models from the same family.

Loop de feedback de LLM-as-judge que muestra el output de producción fluyendo hacia un modelo juez, scores almacenados, calibración humana sobre una muestra y actualizaciones de rúbrica que retroalimentan al juez

Construir un dataset dorado que refleje a sus usuarios

Most teams start with a dataset of forty examples one engineer wrote on a Friday afternoon. It is a fine starting point. It is a terrible long-term truth. The cases are too clean, the phrasing is too uniform, and the edge cases that ruin Saturday mornings are not in there because nobody has been paged for them yet.

Our dataset work blends three sources. Real production traces, sampled and stripped of PII, give the actual distribution of user behavior. Synthetic generation, run with a different model family from the one under test, fills in adversarial gaps. Domain experts label expected behavior, including the rationale, so the next person who looks at the case understands why the answer is what it is. The result is a versioned, reviewable artifact your future engineers will thank you for.

Diagrama que muestra cómo las trazas de producción, la generación sintética y los expertos de dominio contribuyen a un golden dataset versionado que alimenta las corridas de eval en CI ante cada cambio de prompt o modelo

A note on size. There is no magic number. We typically deliver a first dataset of 250 to 600 cases for a single product surface and grow it deliberately. Bigger is not better past a point. Coverage of failure modes is what matters, not raw count.

Cómo corre un proyecto

Ocho a catorce semanas para un build empresarial. Cuatro a seis para un proyecto enfocado en una superficie de producto. La forma es consistente en ambos casos.

Cronograma del engagement en cuatro fases: discovery en las semanas 1 a 2, diseño en las semanas 2 a 4, build y conexión en las semanas 4 a 10, y handoff en las semanas 10 a 14

Fase 1: Discovery (semanas 1–2)

We map the product surfaces, the model providers, the existing test infrastructure, and the failure inventory. The deliverable is a prioritized list of evaluation targets and a baseline measurement: how the current system actually performs against the rubrics we are about to build. Teams are usually surprised. Sometimes pleasantly.

Fase 2: Diseño (semanas 2–4)

Rubrics get drafted, calibrated against a small expert sample, and frozen as versioned artifacts. The first golden dataset takes shape. We also pick the toolchain: most clients keep what they already pay for, but if you are running raw notebooks today we usually settle on Promptfoo plus RAGAS for retrieval-heavy systems and Braintrust or LangSmith for richer dashboarding.

Fase 3: Build e integración (semanas 4–10)

The pipelines go in. CI eval gates fire on prompt changes, model upgrades, and retrieval index rebuilds. Production replay is wired up, with sampling rates and PII redaction tuned to your data residency rules. Red-team suites are seeded with the threats your security team is willing to write down. Everything ships behind feature flags so we never block your team mid-flight.

Fase 4: Handoff (semanas 10–14)

Documentation, runbooks, calibration workshops, and pairing on the first two or three model changes after we are gone. We default to full ownership transfer. If you want us to stay on for ongoing rubric audits or red-team refreshes, we offer that as a small monthly engagement, not as a hostage situation.

Eval pipelines tend to bolt cleanly onto your existing AI DevOps pipelines. For teams shipping autonomous agents alongside the LLM features, our AI agents development practice covers the agent-runtime side of the same problem.

Mini caso: Meridian Labs reduce alucinaciones 71%

La situación

Meridian Labs is a US-based legal-research SaaS that ships a RAG assistant on top of a 2.4 million document corpus. Their team had been live for nine months when they came to us. Internal usage was strong, but two enterprise pilots had stalled because the assistant was citing case law that did not exist, occasionally inventing statute numbers that looked plausible to junior associates and embarrassing to senior partners.

They had unit tests on the application code, a small notebook with thirty examples a product manager had written, and a Slack channel where engineers posted screenshots of bad answers. No one could tell whether last sprint's prompt change had improved things or made them worse. Every model swap was a coin flip.

Their engineering director told us the line we hear a lot: “We do not have a model problem. We have a measurement problem, and the measurement problem is so bad we cannot tell whether it is a model problem.”

Qué construimos

An eleven-week engagement with a four-person squad: an eval lead, two eval engineers, and a part-time legal-domain reviewer we sourced through our network for the rubric calibration work.

The high-impact pieces:

A 480-case golden dataset assembled from sampled production traces and synthetic adversarial prompts, with citations labeled by the legal reviewer. Hash-versioned, IP-clean.
Three RAGAS-based judges for groundedness, citation correctness, and answer faithfulness, each calibrated against a 60-case human-labeled subset.
A CI eval gate in their GitHub Actions that blocked merges when groundedness dropped below the agreed threshold or citation correctness regressed by more than two points.
A production replay loop that sampled five percent of real traffic, scored it nightly, and surfaced new failure clusters in their existing Datadog stack.

Resultados del caso de estudio: 71 por ciento de reducción en citas alucinadas, ciclos de cambio de prompt y modelo 5x más rápidos, y 1,2 millones de dólares evitados en costos de soporte y rework a lo largo de doce meses

Six months post-handoff Meridian had landed both stalled enterprise pilots, the legal review board signed off on the assistant for paid tiers, and the engineering team was running rubric updates without us. The honest caveat: their first attempt to extend the dataset on their own missed a hallucination cluster around state-level statutes. We caught it in a quarterly audit. They have run the audits themselves since.

More on how we run engagements like this on our case studies page.

Los números que valen la pena seguir

Ingeniería de eval sin medición es teatro. Cuatro números concentran la señal en los proyectos que llevamos. Cada uno necesita objetivo, tendencia y alerta.

Gráfico de barras que compara cuatro métricas antes y después de un harness de eval real: la groundedness sube de 46 a 86 por ciento, la tasa de alucinación baja de 18 a 3 por ciento, el tiempo al release cae de 21 a 4 días y la tasa de escape de regresiones baja de 31 a 4 por ciento

Groundedness — the share of answers whose claims are supported by the retrieved context or a recognized source. This is the metric that maps most directly to user trust on RAG products.

Hallucination rate — the inverse of groundedness, expressed at the response level. We track it separately because it is the number that ends up in board decks.

Time-to-release — the number of days between a prompt or model change being committed and being shipped to all users. A working eval harness compresses this dramatically because the human review step shrinks.

Regression escape rate — how often a quality regression slips past the harness and is caught in production. This is the metric that tells you whether the harness is actually load-bearing or just decorative.

Evaluación tercerizada vs. las alternativas

Tercerizar evaluación no siempre conviene. La respuesta honesta depende de la madurez de su programa de IA, el presupuesto para contratar in-house y la presión de producción actual.

Tabla comparativa de evaluador freelance, equipo de eval interno y harness tercerizado según tiempo a la primera señal, ownership del pipeline, acceso a expertos de dominio, expertise en tooling y costo

Tercerizar tiene sentido cuando

You have shipped an LLM feature, traffic is real, and you can already feel the quality drift you cannot measure.
Your engineering team has not hired an eval specialist before, and you do not want to spend two quarters trying.
You operate in a regulated industry and need defensible evidence that the model behaves the way your compliance team is being asked to sign off.
You ship more than one AI surface and the duplication of judges, rubrics, and datasets across them is starting to hurt.

Conviene in-house cuando

You have a strong applied research team that just needs a reference architecture and the time to build it.
Your AI surface is single-purpose, low-traffic, and the cost of a bad answer is reputational, not legal.
You can afford the 6–9 months it usually takes to recruit, onboard, and ramp an eval engineer in the current US market.

Una nota sobre costos

Hiring a senior LLM evaluation engineer in the US currently lands between $160,000 and $240,000 base, before benefits, equity, and the recruiter fee. Add a domain expert and a part-time data engineer for the dataset work and you are at roughly $650,000 fully loaded for a year, assuming you can find the people. Our nearshore model delivers the same skill set at 40–50% of that, in your time zone, with a published replacement guarantee.

For project-based work, typical eval engineering builds run $80,000 to $240,000 depending on scope. Dedicated three-person squads start around $24,000 per month. Single-engineer AI staff augmentation placements range from $7,000 to $11,000 per month.

Hablemos de su proyecto

Riesgos que anticipamos

Los programas de eval fallan de cinco maneras reconocibles. Las anticipamos en el diseño en lugar de sorprendernos después.

Tabla de riesgos y mitigaciones que cubre el sesgo del juez, la contaminación del set de prueba, la fuga de PII en datasets, el sobreajuste del eval y el lock-in de proveedor

Judge bias and drift. Judges are LLMs, and they shift quietly when the underlying model is upgraded by the provider. We track inter-judge agreement and human-judge agreement on a fixed calibration set, and re-baseline whenever the judge model version changes.

Test set contamination. Public benchmarks leak into training data. We hash-version every dataset, keep held-out splits the model has never seen, and refresh the most-used cases on each significant model upgrade.

PII leakage in datasets. Production traces are toxic if mishandled. We tokenize and redact at sample time, store only what we need, and keep EU and US residency boundaries in the pipeline. The NIST AI Risk Management Framework is the policy spine we map to when a client asks for one.

Eval over-fitting. A team that runs the same fifty cases for nine months ends up with a model perfectly tuned to those fifty cases and nothing else. We rotate adversarial probes, add new failure clusters from production replay, and run blind expert audits to keep the harness honest.

Vendor lock-in. Every artifact we produce is portable. Datasets ship as JSON Lines you can parse without our tools, judge prompts live in your repo, CI definitions live in your pipelines. If you replace us with an in-house team next quarter, nothing breaks. We have walked clients through that exact transition before.

Dónde la evaluación aporta más valor

Algunas superficies se benefician más que otras. El patrón: alto tráfico, salidas reguladas, respuestas con citas o agentes con efectos colaterales. Ahí es donde más nos buscan.

RAG assistants

Legal research, medical literature, internal knowledge bases. Citation correctness and groundedness are the headline rubrics. RAG-specific eval work overlaps heavily with retrieval debugging.

Customer-facing chat

Support copilots, sales assistants, eCommerce concierges. Tone, policy compliance, and refusal-handling matter as much as accuracy. We work with these alongside our AI eCommerce practice.

Code generation

Internal coding agents, code review bots, doc generators. Eval focuses on pass@1, regression risk, and security drift. Pairs naturally with agent harness engineering.

Autonomous agents

Workflows that take real-world actions: ticket triage, financial ops, scheduling. Trajectory evaluation, recovery from failed tool calls, and scope adherence drive the rubrics.

Regulated industries

Healthcare, banking, insurance. Defensibility matters: every score must be reproducible, every dataset auditable, every judge prompt versioned. The harness becomes a compliance artifact.

Internal LLM platforms

Companies running their own gateway in front of multiple providers. Eval gates protect the platform from silent vendor model changes that would otherwise propagate downstream.

Tres modalidades de contratación, según cuánto del programa de eval quiera operar internamente y cuánto delegar.

Cómo trabajar con nosotros

Outsourcing
por proyecto

We deliver a full eval harness for one or more product surfaces and hand over the whole thing. Includes rubric design, golden dataset, judges, CI gates, production replay, and runbooks. Typical duration 8–14 weeks. Best for teams that want a production-ready harness without spending six months building one in-house.

Más información

Equipo dedicado
de eval

An ongoing three- to six-person squad embedded in your AI org. The team owns the eval program: new product surfaces, rubric refreshes, red-team rotations, and the production replay backlog. Works as an extension of your platform team, not a vendor at arm's length.

Contratar un equipo

Refuerzo de
equipo

Embed individual eval engineers, judge specialists, or red-team operators into your existing AI team. Best when you have the strategy defined and need hands-on expertise to build pipelines, write rubrics, or chase down a specific failure cluster.

Contratar ingenieros

Por qué los equipos nos eligen para esto

Llevamos construyendo equipos de ingeniería tercerizados desde 2014, con más de 250 proyectos en salud, banca, eCommerce, logística y SaaS regulado. Nuestro cluster de IA creció de ese trabajo en los últimos tres años, y la evaluación se formalizó en 2025 porque siempre volvían las mismas preguntas: ¿este modelo es mejor? ¿cómo evitamos regresiones después del deploy?

A few specifics worth saying out loud. About 40% of our clients are venture-backed startups, the rest are mid-market and enterprise. We staff in two-week sprints with shared velocity dashboards, demos, and retros each cycle. Engineers are based in Córdoba with overlap on US Eastern time most of the year, so rubric calibration and incident triage do not wait for an overnight handoff. We publish a 2-week satisfaction guarantee and a 30-day notice to scale down. Our founder, Javier Uanini, still reads every weekly status report.

The opinions we have formed running these engagements are not negotiable parts of the offering, but they show up in every contract.

First, evaluation is a product surface, not a side project. It deserves the same code review, observability, and on-call rotation as the rest of the system. Teams that treat it as an internal tool always end up with stale data and quiet drift.

Second, the judge is never the answer by itself. Every program we ship has humans in the loop on a small calibration sample. The teams that try to run fully automated eval pipelines without a human anchor are the ones that come to us six months later with a quality crisis they cannot diagnose.

Third, your dataset is your moat. Tools come and go, model providers change pricing every quarter, the leaderboard you cared about last year is irrelevant now. The labeled, versioned, domain-specific dataset is the artifact that compounds. We treat it accordingly.

Preguntas Frecuentes

El QA tradicional prueba código determinista. Los LLM y agentes son probabilísticos: el mismo prompt puede dar respuestas distintas con drift sutil. La ingeniería de eval usa distribuciones de score (groundedness, faithfulness, formato, latencia, políticas) y trata prompts, retrieval y agentes como sistema bajo prueba, con datasets versionados, jueces calibrados y CI que bloquea releases malos.

Los tres. Chat con rúbricas por tarea y probes de política. RAG con scoring de retrieval, auditoría de citas y faithfulness contra fuentes. Agentes con evaluación de trayectoria: herramienta, tarea, recuperación y alcance. Adaptamos el harness a la modalidad.

Open source: Promptfoo, DeepEval, RAGAS, LangChain/LlamaIndex según el repo. Comercial: Braintrust, LangSmith, Arize Phoenix, Humanloop y eval en Azure AI Foundry, Vertex AI y Bedrock. Usamos lo que ya pagan.

Primeros scores en 2–3 semanas. Despliegue completo 8–14 semanas. Proyecto enfocado en una superficie: 4–6 semanas.

Proyectos: USD 80.000–240.000. Equipos dedicados desde USD 24.000/mes. Refuerzo: USD 7.000–11.000/mes. In-house en EE. UU.: USD 160.000–240.000 base más beneficios.

Usted. Todo va a sus repos bajo su IP, sin lock-in en portales de vendor.

Servicios Relacionados