Guardian Policy: Defense in Depth

What the guardian does

The guardian enforces constraints. It rejects invalid outputs. The guardian never evaluates whether a signal is correct. It only verifies that the signal is allowed. This distinction is fundamental: the guardian is a mechanical enforcement point, not an intelligent filter or a judge.

Schema validation is the primary enforcement layer — a strict enum field with four values permits only those four values. The guardian provides a narrow second layer of mechanical checks applied to schema-valid output. If the schema is the structural constraint, the guardian is the additional mechanical constraint that catches what the schema alone cannot express. The guardian is intentionally modest: it is a secondary constraint layer, not the primary disclosure boundary.

How enforcement policies layer on schema constraints

Enforcement is layered in a strict sequence:

1. Schema validation runs first. If the model output fails the JSON Schema, the relay aborts the session with SCHEMA_VALIDATION without running the guardian. The guardian never sees schema-invalid output.
2. Guardian rules run second, only on schema-valid output. Each rule inspects the validated output and either passes or fires.

This layering means the schema provides primary enforcement (strict enum fields prevent most forbidden content), while the guardian provides defense-in-depth for edge cases that schema constraints alone cannot catch. For example, a schema might permit string-typed fields for human-readable category labels. The guardian can enforce that those string fields never contain decimal digits or currency symbols — content that a model might embed in an attempt to encode numeric information in string values.
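The ordering described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the relay's actual code: the function names, the "DELIVERED" result, the "label" field, and every enum value other than STRONG_MATCH are hypothetical. Only the two abort codes and the ordering come from the text.

```python
import unicodedata
from typing import Any, Callable

# Hypothetical enum: only STRONG_MATCH appears in the text; the others are placeholders.
ALLOWED_SIGNALS = ("STRONG_MATCH", "WEAK_MATCH", "NO_MATCH", "INSUFFICIENT_DATA")


def toy_schema(output: Any) -> bool:
    """Stand-in for JSON Schema validation: a strict enum plus a free-form string label."""
    return (
        isinstance(output, dict)
        and output.get("signal") in ALLOWED_SIGNALS
        and isinstance(output.get("label"), str)
    )


def toy_guardian(output: dict) -> bool:
    """Stand-in guardian rule: pass only if the label contains no decimal digits."""
    return not any(unicodedata.category(ch) == "Nd" for ch in output["label"])


def enforce(
    output: Any,
    validate_schema: Callable[[Any], bool],
    guardian_check: Callable[[Any], bool],
) -> str:
    """Apply the two layers in order and return the outcome code."""
    # Layer 1: schema validation. On failure the guardian is never invoked.
    if not validate_schema(output):
        return "SCHEMA_VALIDATION"
    # Layer 2: guardian rules, reached only by schema-valid output.
    if not guardian_check(output):
        return "POLICY_GATE"
    return "DELIVERED"  # hypothetical name for the success path
```

In this sketch, an output such as {"signal": "STRONG_MATCH", "label": "tier 3"} is schema-valid (the label is a string) but is stopped by the second layer, which is the kind of edge case the schema alone cannot express.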

Policy scope and rule types

In the current implementation, a guardian enforcement policy has a scope of RELAY_GLOBAL — it applies to every session on the relay. The policy is loaded at relay startup, content-addressed by SHA-256 over its JCS canonical form, and bound into every receipt via the guardian_policy_hash commitment.

The current protocol version supports one rule type: unicode_category_reject. A rule of this type scans every string value in the output JSON and fires if any string contains a character in the specified Unicode general category; whether a fired rule blocks the output depends on its classification (GATE or ADVISORY, described below). This is intentionally narrow: the current rule surface is a modest starting point, not a rich policy engine.

Supported categories:

Nd — Decimal digit numbers. As a defense-in-depth measure, a rule value of "Nd" activates scanning for decimal digits (Nd), letter numbers (Nl), and other numbers (No) — a conservative superset.
Sc — Currency symbols, as defined by Unicode 15.1.

Each rule also has a scope descriptor. Currently only "all_string_values" is supported: scan every string value in the output recursively. A skip_keys array allows excluding specific top-level object keys from scanning.
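A minimal sketch of a unicode_category_reject check, using Python's standard unicodedata module. Several points are assumptions rather than facts from the text: the function names are hypothetical, the output is assumed to be a JSON object at the top level, object keys themselves are assumed not to be scanned (only values), and Python classifies characters using the Unicode version bundled with the interpreter, which may differ from 15.1.

```python
import unicodedata
from typing import Any, Iterable, Iterator

# Per the text, a rule value of "Nd" conservatively covers Nd, Nl, and No.
CATEGORY_EXPANSION = {
    "Nd": frozenset({"Nd", "Nl", "No"}),
    "Sc": frozenset({"Sc"}),
}


def iter_strings(value: Any) -> Iterator[str]:
    """Yield every string value nested anywhere inside a JSON-like value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for child in value.values():
            yield from iter_strings(child)
    elif isinstance(value, list):
        for child in value:
            yield from iter_strings(child)
    # Numbers, booleans, and null are not strings and are ignored.


def rule_fires(output: dict, category: str, skip_keys: Iterable[str] = ()) -> bool:
    """Return True if any scanned string holds a character in a forbidden category."""
    forbidden = CATEGORY_EXPANSION[category]
    skipped = set(skip_keys)
    for key, value in output.items():
        if key in skipped:  # skip_keys applies to top-level keys only
            continue
        for text in iter_strings(value):
            if any(unicodedata.category(ch) in forbidden for ch in text):
                return True
    return False
```

For example, rule_fires({"label": "tier Ⅳ"}, "Nd") fires because the Roman numeral character is in category Nl, which the "Nd" rule value covers under the conservative superset.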

Classification: GATE vs ADVISORY

Each rule carries a classification that determines what happens when it fires:

GATE — the relay aborts the session with POLICY_GATE and returns a constant-shape HTTP 422 error: {"error": "OUTPUT_POLICY_VIOLATION"}. No detail about which rule fired or what content triggered it is exposed. The output is never delivered to either participant. This reduces the risk of participants probing the guardian policy through error responses.

ADVISORY — the relay logs a warning but does not block the output. Advisory violations are not reported to participants. This preserves the bounded response surface, but also means advisory mode is an operator-observability feature rather than a participant-facing guarantee. Advisory rules exist for monitoring and gradual enforcement rollout.
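The difference between the two classifications can be sketched as follows. The FiredRule type, the rule_id field, the logger name, and the 200 status for the success path are assumptions made for illustration; the 422 status and the constant error body are taken from the text.

```python
import logging
from dataclasses import dataclass
from typing import List, Tuple

logger = logging.getLogger("guardian")  # hypothetical logger name

# Constant-shape body: identical no matter which rule fired or why.
GATE_BODY = {"error": "OUTPUT_POLICY_VIOLATION"}


@dataclass
class FiredRule:
    rule_id: str         # hypothetical identifier, used only in operator logs
    classification: str  # "GATE" or "ADVISORY"


def respond(fired: List[FiredRule], output: dict) -> Tuple[int, dict]:
    """Map the list of fired rules to an HTTP status and response body."""
    for rule in fired:
        if rule.classification == "ADVISORY":
            # Operator-facing only; nothing about this reaches participants.
            logger.warning("advisory rule fired: %s", rule.rule_id)
    if any(rule.classification == "GATE" for rule in fired):
        # No rule id and no offending content in the response.
        return 422, dict(GATE_BODY)
    return 200, output  # assumed success status
```

Two different GATE rules produce byte-identical responses in this sketch, which is what keeps the error path from revealing anything about the policy.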

Content-addressed policy hashing and lockfile validation

The enforcement policy is content-addressed: the relay computes SHA-256(JCS(policy)) at startup and binds this hash into every receipt. A verifier who obtains the policy file can independently compute the hash and confirm which rules were active during a session.
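A verifier's side of this computation might look like the sketch below. It only approximates JCS (RFC 8785) with the standard json module: the result should agree with true JCS for policies made of objects with ASCII keys, strings, booleans, and integers, but a real verifier should use a proper RFC 8785 implementation. The hex encoding of the digest and the field names in EXAMPLE_POLICY are assumptions, not the actual policy format.

```python
import hashlib
import json


def canonical_bytes(policy: dict) -> bytes:
    """Approximate JCS: sorted keys, no insignificant whitespace, UTF-8 output."""
    text = json.dumps(
        policy,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return text.encode("utf-8")


def policy_hash(policy: dict) -> str:
    """SHA-256 over the canonical form, hex-encoded (encoding is an assumption)."""
    return hashlib.sha256(canonical_bytes(policy)).hexdigest()


# Hypothetical policy document; the field names are illustrative only.
EXAMPLE_POLICY = {
    "scope": "RELAY_GLOBAL",
    "rules": [
        {
            "type": "unicode_category_reject",
            "category": "Nd",
            "scope": "all_string_values",
            "classification": "GATE",
        }
    ],
}

if __name__ == "__main__":
    print(policy_hash(EXAMPLE_POLICY))
```

Because the hash is taken over the canonical form, key order and whitespace in the policy file do not affect the result, while any change to a rule does.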

The relay validates the policy against a lockfile at startup. The lockfile records the expected hash for each policy file. If the computed hash does not match the lockfile, the relay refuses to start (fail-closed). This prevents accidental policy drift — a modified policy file is detected before any sessions are processed.

Lockfile validation can be skipped in development mode (AV_ENV=dev with AV_ENFORCEMENT_LOCKFILE_SKIP=1), but production deployments must always use lockfiles.
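The startup check, including the development-mode escape hatch, can be sketched as below. The lockfile is modeled as a mapping from policy file name to expected hash; that layout, the function names, and the exception type are assumptions. The two environment variable names and their values come from the text.

```python
import os
from typing import Mapping


class PolicyDriftError(RuntimeError):
    """Raised at startup when a policy hash does not match the lockfile."""


def lockfile_skip_allowed(env: Mapping[str, str]) -> bool:
    """Skipping requires both settings; the skip flag alone is not enough."""
    return env.get("AV_ENV") == "dev" and env.get("AV_ENFORCEMENT_LOCKFILE_SKIP") == "1"


def check_lockfile(
    computed: Mapping[str, str],
    locked: Mapping[str, str],
    env: Mapping[str, str] = os.environ,
) -> None:
    """Fail closed: raise unless every computed hash matches its lockfile entry."""
    if lockfile_skip_allowed(env):
        return
    for name, digest in computed.items():
        # A policy file missing from the lockfile is also treated as a mismatch.
        if locked.get(name) != digest:
            raise PolicyDriftError(f"policy hash mismatch for {name}")
```

A relay built this way would call check_lockfile before accepting any session and let the exception terminate startup.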

Multi-policy selection

A single relay can serve multiple operator governance modes through multi-policy selection. The contract's enforcement_policy_hash field specifies which policy should govern the session. The relay loads all available policies at startup, indexes them by hash, and selects the matching policy when a session is created.

This means the same relay infrastructure can enforce different policies for different use cases — a stricter policy for financial coordination (blocking all numeric content) and a more permissive policy for scheduling (allowing time-related values). The contract determines the governance mode, not the relay configuration. Both participants agree to a specific policy hash as part of the contract, and the receipt proves which policy was active.
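Selection by hash amounts to a dictionary lookup keyed on the contract's enforcement_policy_hash. In the sketch below, the function names, the policy contents, and the stand-in hash function are hypothetical. Rejecting a contract whose hash matches no loaded policy is also an assumption; the text does not say how the relay handles that case.

```python
import hashlib
import json
from typing import Callable, Dict, Iterable


class UnknownPolicyError(LookupError):
    """Raised when a contract names a policy hash the relay has not loaded."""


def stand_in_hash(policy: dict) -> str:
    # Placeholder for SHA-256(JCS(policy)); not a full JCS implementation.
    data = json.dumps(policy, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def build_index(
    policies: Iterable[dict],
    hash_fn: Callable[[dict], str] = stand_in_hash,
) -> Dict[str, dict]:
    """Index every loaded policy by its content hash (done once at startup)."""
    return {hash_fn(policy): policy for policy in policies}


def select_policy(index: Dict[str, dict], contract: dict) -> dict:
    """Pick the policy the contract committed to, at session creation."""
    wanted = contract["enforcement_policy_hash"]
    try:
        return index[wanted]
    except KeyError:
        raise UnknownPolicyError(wanted) from None


# Hypothetical policies for the two use cases mentioned above.
FINANCIAL = {"name": "financial", "categories": ["Nd", "Sc"]}
SCHEDULING = {"name": "scheduling", "categories": ["Sc"]}
```

With both policies indexed, a contract carrying the hash of FINANCIAL gets the stricter policy and a contract carrying the hash of SCHEDULING gets the more permissive one, with no change to relay configuration.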

Relationship to the contract

The guardian policy is referenced by hash from the contract, not embedded in it. This separation has three benefits:

Reuse. The same policy can govern multiple contracts without duplication.
Independent versioning. Policies can be updated independently of contracts. A new policy version gets a new hash — existing contracts that reference the old hash continue to use the old policy.
Verifiable binding. The contract's enforcement_policy_hash is a commitment. The receipt's guardian_policy_hash must match. A relay that loaded a different policy than the one committed in the contract would produce a detectable mismatch.
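The verifiable-binding property can be checked with a three-way comparison. The sketch assumes the contract and receipt are available as dictionaries and that the verifier has already computed the hash of the policy file it obtained; the function name and the problem messages are hypothetical, while the two field names come from the text.

```python
from typing import List


def binding_problems(contract: dict, receipt: dict, policy_file_hash: str) -> List[str]:
    """Return a list of detected mismatches; an empty list means the binding holds."""
    problems = []
    committed = contract.get("enforcement_policy_hash")
    reported = receipt.get("guardian_policy_hash")
    if committed != reported:
        # The relay ran a different policy than the contract committed to.
        problems.append("receipt hash does not match contract commitment")
    if reported != policy_file_hash:
        # The policy file in hand is not the one the relay reported using.
        problems.append("policy file hash does not match receipt")
    return problems
```

An empty result tells the verifier that the contract, the receipt, and the policy file all refer to the same policy content.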

What the guardian never decides

The guardian does not evaluate quality, truthfulness, relevance, or appropriateness of the signal. It does not decide whether "STRONG_MATCH" is the correct compatibility signal given the inputs. It does not assess whether the model's reasoning was sound. It does not filter content based on semantic meaning.

The guardian is deterministic: given an identical policy and identical output, any conforming relay must produce the same accept/reject decision. There is no judgement, no interpretation, and no model-in-the-loop. If the output passes schema validation and contains no characters in the forbidden Unicode categories, the guardian passes it. If it contains a forbidden character, the matching rule fires, and a GATE rule rejects the output. That is the entire scope.