Agent EngineerInstrument / v1

I build agentic
systems. I evaluate
them honestly.

Graphic designer turned AI-native builder. Deterministic scoring, adversarial verification, cost-gated reproducible runs, and I report honest nulls. The four artifacts below share a kernel. That's the proof.

0.0%False positives (Quorum K=3)
0.548F1 (FieldAgent CUAD)
576Council tests

00

Build the substrate. Prove it on multiple problems.

Every artifact below vendors a shared kernel from Quorum's core/: task-aware routing, multi-agent verification, deterministic evaluation. The story is not "four demos" but one architecture, stress-tested across three domains.

01

Deterministic scoring

No LLM-judge in the success path. Exact match, span-IoU, statistical tests. If I cannot measure it, I do not claim it.

02

Adversarial verification

K skeptic agents challenge every finding before it surfaces. False positives collapse under pressure.

03

Honest nulls

Truncation artifacts, model-specific noise, non-significant adaptations. I find them and say so.

04

Cost-gated runs

~$0.25 per Quorum scan. Reproducible offline with make eval-dry. Engineering discipline, not vibes.

01

Artifacts

FlagshipShared kernel originartifact / 01

Quorum

Task-aware agent orchestrator with adversarial verification and full tracing.

Primary finding

K=3 adversarial verification cut false positives from 27.8% to 0.0% (95% CI: [11.1, 50.0] to [0, 0]; recall 100% to 77.8%). Tested on a 36-snippet labeled set including prompt-injection traps, using DeepSeek-v4-pro. Held-out real target: 3/3 genuine bugs found, 0 surviving false positives.

27.8% 0.0%False positive rate (K=3)
3 / 3Held-out bugs found
~$0.25Per run (total)
58Tests, CI green
Cost-routing note. The multi-tier routing harness (DeepSeek to Haiku to Sonnet to Opus) is committed and gated on an Anthropic key. A live cost-per-tier number is not fabricated here.

Adaptive red-teamVendors Quorum core/artifact / 02

Aegis

Adaptive attacker agent red-teams a target on two harmless proxies. Deterministic scoring (exact match, no LLM judge).

Primary finding (read this first)

A reasoning model is significantly more robust: injection ASR 49.3% vs 68.1% (p=0.0012), canary 10.4% vs 21.5% (p=0.010), overall p=0.0002. But the full defense stack erases that gap entirely: 1.7% vs 2.8% (p=0.40, not significant). The honest conclusion is that layered defenses matter more than model choice.

1.7% vs 2.8%ASR under full defense (p=0.40, ns)
-25%Defense reduction (input classifier)
78Tests, CI green
Adaptation lift. 24.0% to 29.9%: became significant only after scaling the benchmark (McNemar b=17/c=0, p approx 0). Was a null at small n. Scaling is the legit power lever, not p-hacking.

CUAD contract riskVendors Quorum core/artifact / 03

FieldAgent

An agent reads a real commercial contract and flags risk-bearing clauses. Graded span-IoU against CUAD gold, no LLM judge.

Primary finding (read this first)

Detection F1 = 0.548 (P = 0.741 / R = 0.435), 95% CI [0.460, 0.637] on 20 held-out CUAD contracts. +0.21 F1 over a keyword floor, robust and baseline-independent. The honest finding: the "agentic chunking lift" looked like +0.45 on DeepSeek due to a truncation artifact. A fair rerun collapses it to +0.07 (CIs overlap), and it ties on Claude Sonnet. This honesty is the point. Party names and figures are redacted in the demo.

0.548F1 (span-IoU, CUAD held-out)
+0.21F1 over keyword floor
+0.07Fair agentic lift (not +0.45)
47Tests, CI green
02

Skill-Tuning Council

Methodology / systems designInternal infra, no public URL

A 4-proxy council votes on every self-improvement before it ships. Proxies: taste, pragmatism, intent, anti-drift. Pipeline: adversary generates a candidate change, editors refine it, a merger integrates it, the council votes, and disagreement triggers escalation. 576 tests.

The insight is that self-modifying systems need adversarial checks on every loop iteration, not just at the end. Drift is cheap; catching it early is cheaper.

576Council tests
01
AdversaryGenerates candidate change
02
EditorsRefine and constrain
03
MergerIntegrate candidates
04
Council4-proxy vote
05
EscalateDisagreement triggers review
03

How I build / eval discipline

M-01

No LLM-judge in the success path

Exact match, span-IoU, statistical hypothesis tests. When the measurement is a language model, the measurement is wrong. I use language models as tools, not oracles.

M-02

Adversarial verification before surfacing

K skeptic agents challenge every finding. If it doesn't survive adversarial review, it doesn't ship. Quorum cut false positives from 27.8% to 0.0% this way.

M-03

Cost-gated reproducible runs

~$0.25 per Quorum scan. make eval-dry reproduces offline. Engineering work is reproducible or it is not engineering work.

M-04

Report the honest nulls

FieldAgent's +0.45 lift was a truncation artifact. Aegis's adaptation was a null at small n. I found these and said so. That is the differentiator.

M-05

Shared kernel across artifacts

One core/, three domains. Quorum, Aegis, FieldAgent, and the Council all vendor the same substrate. That is the architecture claim.

M-06

Statistical tests with CIs

95% confidence intervals on every key claim. p-values reported with sample sizes. McNemar for before/after. No cherry-picking the significant result.

04

Contact

Frontier-lab Applied AI, Forward-Deployed Engineer, Agent Engineer, Design Engineer roles. Serious agentic systems work with research discipline.