Method

Evaluation Methodology

The evaluator turns contractual duties into observable scenario-level checks across consumer and business fiduciary frames.

Two frames

Frame A covers 40 consumer fiduciary scenarios: product search, purchases, vendor conflicts, LLMS.txt restrictions, confirmation gates, and data minimization.

Frame B covers 7 business fiduciary scenarios: compliance-over-policy, tax and legal requirements, antitrust-sensitive conduct, and dual-fiduciary negotiation.

Two-step flow

In the first step, each scenario prompt is presented to the agent-under-test, which is configured with the exemplar contract and authorization file. In the second step, the agent's output moves through deterministic scorers and then LLM judge stages.

The final headline stage is Semantic Alignment for Frame A and Business Compliance Judge for Frame B.
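The flow above can be sketched roughly as follows. This is a minimal illustration, not the evaluator's actual code: `run_scenario`, `ScenarioRun`, and the callable signatures are hypothetical names introduced here, and verdicts are assumed to be plain strings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical callable shapes; the real evaluator's interfaces may differ.
Agent = Callable[[str], str]             # prompt -> agent output
Scorer = Callable[[str], str]            # agent output -> verdict string
Judge = Callable[[str, str, str], str]   # (stage name, prompt, output) -> verdict

# Headline stage per frame, as named in the text above.
HEADLINE_STAGE = {"A": "Semantic Alignment", "B": "Business Compliance Judge"}


@dataclass
class ScenarioRun:
    scenario_id: str
    frame: str
    output: str
    deterministic: Dict[str, str] = field(default_factory=dict)
    headline_stage: str = ""
    headline_verdict: str = ""


def run_scenario(scenario_id: str, frame: str, prompt: str,
                 agent: Agent, scorers: Dict[str, Scorer], judge: Judge) -> ScenarioRun:
    # Step 1: present the scenario prompt to the agent-under-test.
    output = agent(prompt)
    run = ScenarioRun(scenario_id=scenario_id, frame=frame, output=output)

    # Step 2a: deterministic scorers run on the agent output.
    for name, scorer in scorers.items():
        run.deterministic[name] = scorer(output)

    # Step 2b: the LLM judge stage produces the headline verdict for the frame.
    run.headline_stage = HEADLINE_STAGE[frame]
    run.headline_verdict = judge(run.headline_stage, prompt, output)
    return run
```

Passing the agent, scorers, and judge in as plain callables keeps the sketch testable with stubs; a real run would supply the configured model and scorer implementations instead.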

Scorer design

Deterministic scorers cover Conflict Immunity, UETA Compliance, LLMS.txt Respect, Compliance First, Dual Fiduciary handling, Disclosure, and FDL Alignment. LLM judge stages handle holistic semantic fit.

Scorers prefer observable behavior: disclosure made, confirmation offered, legal requirement honored, objective criteria proposed.
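As an illustration of what checking observable behavior can look like, here is a minimal sketch of a deterministic check for "confirmation offered". The function name `confirmation_offered` and the phrase patterns are assumptions made for this example; the actual scorers may rely on different signals.

```python
import re

# Hypothetical phrases that signal the agent paused for user confirmation.
CONFIRMATION_PATTERNS = [
    r"\bplease confirm\b",
    r"\bdo you (want|wish) (me )?to proceed\b",
    r"\bshall i (proceed|continue)\b",
    r"\bbefore i (place|complete|submit)\b",
]


def confirmation_offered(agent_output: str) -> bool:
    """Return True if the output contains an explicit confirmation request."""
    text = agent_output.lower()
    return any(re.search(pattern, text) for pattern in CONFIRMATION_PATTERNS)
```

A check of this kind is deterministic and easy to audit, but it only sees surface wording, which is one reason holistic fit is left to the LLM judge stages.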

N/A semantics

N/A means the scenario did not include the signal that a specialized scorer needs in order to reach a substantive verdict. It is neither a hidden pass nor a failure.

The dashboard uses `status`, `applicable`, and `substantive` fields rather than legacy aggregate pass counts.
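One way to read these fields is sketched below. Only the field names `status`, `applicable`, and `substantive` come from the dashboard; their types, the status values `"pass"`, `"fail"`, and `"n/a"`, and the helper `substantive_pass_rate` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScorerResult:
    scorer: str
    status: str        # assumed values: "pass", "fail", "n/a"
    applicable: bool   # assumed: the scenario carried the signal this scorer needs
    substantive: bool  # assumed: the scorer reached a real verdict


def substantive_pass_rate(results: List[ScorerResult]) -> Optional[float]:
    """Pass rate over substantive verdicts only; N/A results count neither way."""
    counted = [r for r in results if r.applicable and r.substantive]
    if not counted:
        return None  # nothing to report, rather than a misleading 0% or 100%
    passed = sum(1 for r in counted if r.status == "pass")
    return passed / len(counted)


results = [
    ScorerResult("Disclosure", "pass", applicable=True, substantive=True),
    ScorerResult("UETA Compliance", "fail", applicable=True, substantive=True),
    ScorerResult("LLMS.txt Respect", "n/a", applicable=False, substantive=False),
]
rate = substantive_pass_rate(results)  # 0.5: one pass out of two substantive verdicts
```

Excluding N/A results from both the numerator and the denominator is what keeps N/A from acting as a hidden pass or a failure.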

Reproducibility

Artifacts from the April 2026 fixed rerun live under `reports/final-rerun-20260419T065448Z/`. The public site reads sanitized projections and links to the raw per-frame evaluation results that pass the leak sweep.

Run configuration, dependency notes, and current limitations are documented in sections 4.4 and 4.5 of the report.

Scope

The run evaluates a reference LLM prompted with `CONTRACT.md` and `AUTH_PREFS.md`, not a named production prototype and not a legal certification.

Pass rates describe only this 47-scenario dataset (40 in Frame A, 7 in Frame B) and should not be read as broad empirical validation.