Agent EngineerInstrument / v1

I build agentic
systems. I evaluate
them honestly.

Graphic designer turned AI-native builder. Deterministic scoring, adversarial verification, cost-gated reproducible runs, and I report honest nulls. The four artifacts below share a kernel. That's the proof.

thomas@thomaspeng.ca github.com/7P3ng

0.0%False positives (Quorum K=3)

0.548F1 (FieldAgent CUAD)

576Council tests

Build the substrate. Prove it on multiple problems.

Every artifact below vendors a shared kernel from Quorum's core/: task-aware routing, multi-agent verification, deterministic evaluation. The story is not "four demos" but one architecture, stress-tested across three domains.

Deterministic scoring

No LLM-judge in the success path. Exact match, span-IoU, statistical tests. If I cannot measure it, I do not claim it.

Adversarial verification

K skeptic agents challenge every finding before it surfaces. False positives collapse under pressure.

Honest nulls

Truncation artifacts, model-specific noise, non-significant adaptations. I find them and say so.

Cost-gated runs

~$0.25 per Quorum scan. Reproducible offline with make eval-dry. Engineering discipline, not vibes.

Artifacts

FlagshipShared kernel originartifact / 01

Quorum

Task-aware agent orchestrator with adversarial verification and full tracing.

Primary finding

K=3 adversarial verification cut false positives from 27.8% to 0.0% (95% CI: [11.1, 50.0] to [0, 0]; recall 100% to 77.8%). Tested on a 36-snippet labeled set including prompt-injection traps, using DeepSeek-v4-pro. Held-out real target: 3/3 genuine bugs found, 0 surviving false positives.

27.8% → 0.0%False positive rate (K=3)

3 / 3Held-out bugs found

~$0.25Per run (total)

58Tests, CI green

Cost-routing note. The multi-tier routing harness (DeepSeek to Haiku to Sonnet to Opus) is committed and gated on an Anthropic key. A live cost-per-tier number is not fabricated here.

View repo ↗Open live ↗

Live trace UIOpen live ↗

Open Quorum ↗

Adaptive red-teamVendors Quorum core/artifact / 02

Aegis

Adaptive attacker agent red-teams a target on two harmless proxies. Deterministic scoring (exact match, no LLM judge).

Primary finding (read this first)

A reasoning model is significantly more robust: injection ASR 49.3% vs 68.1% (p=0.0012), canary 10.4% vs 21.5% (p=0.010), overall p=0.0002. But the full defense stack erases that gap entirely: 1.7% vs 2.8% (p=0.40, not significant). The honest conclusion is that layered defenses matter more than model choice.

1.7% vs 2.8%ASR under full defense (p=0.40, ns)

-25%Defense reduction (input classifier)

78Tests, CI green

Adaptation lift. 24.0% to 29.9%: became significant only after scaling the benchmark (McNemar b=17/c=0, p approx 0). Was a null at small n. Scaling is the legit power lever, not p-hacking.

View repo ↗Open live ↗

Live demoOpen live ↗

Open Aegis ↗

CUAD contract riskVendors Quorum core/artifact / 03

FieldAgent

An agent reads a real commercial contract and flags risk-bearing clauses. Graded span-IoU against CUAD gold, no LLM judge.

Primary finding (read this first)

Detection F1 = 0.548 (P = 0.741 / R = 0.435), 95% CI [0.460, 0.637] on 20 held-out CUAD contracts. +0.21 F1 over a keyword floor, robust and baseline-independent. The honest finding: the "agentic chunking lift" looked like +0.45 on DeepSeek due to a truncation artifact. A fair rerun collapses it to +0.07 (CIs overlap), and it ties on Claude Sonnet. This honesty is the point. Party names and figures are redacted in the demo.

0.548F1 (span-IoU, CUAD held-out)

+0.21F1 over keyword floor

+0.07Fair agentic lift (not +0.45)

47Tests, CI green

View repo ↗Open live ↗

Live demoOpen live ↗

Open FieldAgent ↗

Skill-Tuning Council

Methodology / systems designInternal infra, no public URL

A 4-proxy council votes on every self-improvement before it ships. Proxies: taste, pragmatism, intent, anti-drift. Pipeline: adversary generates a candidate change, editors refine it, a merger integrates it, the council votes, and disagreement triggers escalation. 576 tests.

The insight is that self-modifying systems need adversarial checks on every loop iteration, not just at the end. Drift is cheap; catching it early is cheaper.

576Council tests

AdversaryGenerates candidate change

EditorsRefine and constrain

MergerIntegrate candidates

Council4-proxy vote

EscalateDisagreement triggers review

How I build / eval discipline

M-01

No LLM-judge in the success path

Exact match, span-IoU, statistical hypothesis tests. When the measurement is a language model, the measurement is wrong. I use language models as tools, not oracles.

M-02

Adversarial verification before surfacing

K skeptic agents challenge every finding. If it doesn't survive adversarial review, it doesn't ship. Quorum cut false positives from 27.8% to 0.0% this way.

M-03

Cost-gated reproducible runs

~$0.25 per Quorum scan. make eval-dry reproduces offline. Engineering work is reproducible or it is not engineering work.

M-04

Report the honest nulls

FieldAgent's +0.45 lift was a truncation artifact. Aegis's adaptation was a null at small n. I found these and said so. That is the differentiator.

M-05

Shared kernel across artifacts

One core/, three domains. Quorum, Aegis, FieldAgent, and the Council all vendor the same substrate. That is the architecture claim.

M-06

Statistical tests with CIs

95% confidence intervals on every key claim. p-values reported with sample sizes. McNemar for before/after. No cherry-picking the significant result.

Contact

Frontier-lab Applied AI, Forward-Deployed Engineer, Agent Engineer, Design Engineer roles. Serious agentic systems work with research discipline.

Emailthomas@thomaspeng.ca GitHubgithub.com/7P3ng ↗

I build agenticsystems. I evaluatethem honestly.

Build the substrate. Prove it on multiple problems.

Deterministic scoring

Adversarial verification

Honest nulls

Cost-gated runs

Artifacts

Quorum

Aegis

FieldAgent

Skill-Tuning Council

How I build / eval discipline

No LLM-judge in the success path

Adversarial verification before surfacing

Cost-gated reproducible runs

Report the honest nulls

Shared kernel across artifacts

Statistical tests with CIs

Contact

I build agentic
systems. I evaluate
them honestly.