Policy / Capability
who may do what, to which resource, under which context?
policy = rule for deciding authority
capability = credential that carries authority
enforcement = the place where the decision is applied
Role in the catalog: this block is boundary.md’s crossing discipline, promoted to its own module. boundary.md owns where the lines are; this file owns what happens at the gate.
Central tension (this is axis 1’s tradeoff, stated up front):
local, fast, available decisions
vs
fresh, revocable, globally consistent decisions
Design Axes (the core module)
Axis 1: Where Authority Lives (the structural cleave)
lookup: authority is a fact in a store, evaluated at decision time
(RBAC bindings, IAM policies, Zanzibar tuples)
token: authority is carried in the credential, verified locally
(presigned URL, bearer token, macaroon, x.509 SAN)
This changes the interface shape, not just policy:
lookup -> decision needs the store: fresh, revocable, centrally auditable,
but a runtime dependency on every request path
token -> verification needs only a key: fast, partition-tolerant, offline,
but revocation becomes the hard problem and audit moves to issuance
Delegation/attenuation is not a type — it is the signature operation on token-based authority (macaroons: add caveats, never remove; each hop can only shrink the grant).
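The "add caveats, never remove" property can be illustrated with a chained HMAC, which is the general idea behind macaroons. This is a minimal sketch, not the macaroon wire format: the function names (`mint`, `attenuate`, `verify`), the dict layout, and the caveat strings are all hypothetical, and third-party caveats are left out.

```python
import hashlib
import hmac


def _mac(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()


def mint(root_key: bytes, identifier: str) -> dict:
    """Issuer creates the base token; only the issuer holds root_key."""
    return {
        "id": identifier,
        "caveats": [],
        "sig": _mac(root_key, identifier.encode()),
    }


def attenuate(token: dict, caveat: str) -> dict:
    """Any holder can add a caveat: the old signature keys the new one."""
    return {
        "id": token["id"],
        "caveats": token["caveats"] + [caveat],
        "sig": _mac(token["sig"], caveat.encode()),
    }


def verify(root_key: bytes, token: dict, satisfied) -> bool:
    """Recompute the chain from the root key, then check every caveat.

    `satisfied` is a caller-supplied predicate: caveat string -> bool.
    """
    sig = _mac(root_key, token["id"].encode())
    for caveat in token["caveats"]:
        sig = _mac(sig, caveat.encode())
    if not hmac.compare_digest(sig, token["sig"]):
        return False
    return all(satisfied(c) for c in token["caveats"])
```

A holder of the narrowed token cannot drop the caveat, because doing so would require the previous signature, which is not recoverable from the current one. Verification needs only the root key, which matches the "verification needs only a key" row above.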
Interrogation:
Is the decision a lookup or a verification?
What breaks when the authority store is unreachable?
Who was the PDP for a token? (answer: the issuer, at issuance time)
Can a holder attenuate before passing on, or only forward the full grant?
Axis 2: Decision Model (how the decision is computed)
Substitutable evaluation languages, freely composed in practice:
role lookup (RBAC): principal -> role -> permitted actions
cheap, legible; pays: role explosion, scope reuse
attribute predicate (ABAC): f(principal attrs, resource attrs, action, context)
expressive; pays: missing/spoofed attributes,
rule-interaction opacity
graph reachability (ReBAC): does a path exist in the relationship graph?
natural for sharing/hierarchy; pays: traversal cost,
inherited access that no one can explain
Compositions are the norm: IAM = ABAC in role clothing; Zanzibar caveats = ABAC embedded in ReBAC.
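A composed decision model can be sketched as a role lookup whose rules carry attribute conditions, with explicit deny overriding allow and default deny when nothing matches. Everything here is illustrative: the bindings, the rule tuples, and the attribute names (`env`, `mfa`, `frozen`) are assumptions, not the rule format of any real system.

```python
# principal -> roles (the RBAC half)
ROLE_BINDINGS = {"alice": ["editor"], "bob": ["viewer"]}

# (effect, role, action, condition over request context) (the ABAC half)
RULES = [
    ("allow", "editor", "write",
     lambda ctx: ctx.get("env") != "prod" or ctx.get("mfa") is True),
    ("allow", "editor", "read", lambda ctx: True),
    ("allow", "viewer", "read", lambda ctx: True),
    ("deny", "editor", "write", lambda ctx: ctx.get("frozen") is True),
]


def decide(principal, action, ctx):
    """Return (decision, reason). Explicit deny wins; no match means deny."""
    roles = set(ROLE_BINDINGS.get(principal, ()))
    matched = [
        (effect, index)
        for index, (effect, role, act, cond) in enumerate(RULES)
        if role in roles and act == action and cond(ctx)
    ]
    for effect, index in matched:
        if effect == "deny":
            return ("deny", f"rule {index}")
    if matched:
        return ("allow", f"rule {matched[0][1]}")
    return ("deny", "default")
```

Returning the matching rule index alongside the decision is one cheap way to keep the result explainable. Note that a missing attribute (no `mfa` key on a prod write) fails the condition rather than passing it, which is the safer default for the "missing attribute" failure mode.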
Interrogation:
Can the decision be explained? (who granted this, via what path/rule?)
Who asserts each attribute, and can the requester influence it?
Is deny explicit or default? Do rules combine as first-deny-wins or any-allow?
For graphs: is depth bounded? Are cycles legal?
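The graph questions above (explainability, depth bound, cycles) can be made concrete with a breadth-first search that returns the path it found. This is a sketch under assumptions: the adjacency-dict representation, the node naming (`user:`, `group:`, `folder:`, `doc:`), and the function name `find_grant_path` are hypothetical, and real ReBAC systems evaluate typed relations rather than untyped edges.

```python
from collections import deque


def find_grant_path(edges, start, target, max_depth=4):
    """Breadth-first search from principal to resource.

    Returns the shortest path as a list of nodes (the explanation),
    or None if the target is not reachable within max_depth hops.
    """
    queue = deque([(start, [start])])
    seen = {start}  # visited set makes cycles harmless
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        if len(path) - 1 >= max_depth:
            continue  # depth bound reached; do not expand further
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None


EDGES = {
    "user:alice": ["group:eng"],
    "group:eng": ["folder:specs", "group:all"],
    "group:all": ["group:eng"],  # a cycle between groups
    "folder:specs": ["doc:design"],
}
```

Calling `find_grant_path(EDGES, "user:alice", "doc:design")` returns the chain of group and folder memberships that produced the access, which is the answer to "via what path?". Lowering `max_depth` to 2 makes the same document unreachable, showing that the depth bound is itself a policy parameter.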
Axis 3: Decision Topology (where evaluation runs)
in-process, local data: evaluator inside the server (k8s RBAC in apiserver)
local engine, pushed data: sidecar/library + policy bundles (OPA, Istio AuthorizationPolicy)
remote central PDP: Check() per request (ext_authz, custom authz service)
precomputed into token: issuer was the PDP; runtime is
verification only (presigned URL, JWT scopes)
The pushed-bundle case is control-plane/data-plane (boundary.md #13) wearing policy clothes — same machinery and same failure modes as xDS:
bundle version skew across enforcers = xDS staleness
last-good-policy on distribution failure = last-good-config
Central PDP adds the gate’s own availability question:
PDP outage: fail open (availability, security hole)
or fail closed (secure, everything stops)?
decision caches trade the same coin as tokens: speed for staleness
Interrogation:
Where is policy stored / evaluated / enforced? (three different places, usually)
Latency budget: can the request path afford a remote Check()?
What version of policy did this enforcer use, and how would you know?
Fail-open or fail-closed — decided explicitly, per enforcement point?
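The fail-open / fail-closed choice and the cache staleness bound can both be made explicit constructor parameters of the enforcement point. The sketch below is hypothetical throughout: `Enforcer`, `PDPUnavailable`, and the `check` callable stand in for whatever remote Check() a real deployment uses.

```python
import time


class PDPUnavailable(Exception):
    """Raised by the check callable when the remote PDP cannot be reached."""


class Enforcer:
    def __init__(self, check, fail_open, max_staleness_s, clock=time.monotonic):
        self.check = check                      # callable(principal, action, resource) -> bool
        self.fail_open = fail_open              # explicit, per enforcement point
        self.max_staleness_s = max_staleness_s  # named staleness bound
        self.clock = clock
        self.cache = {}                         # key -> (allowed, decided_at)

    def allowed(self, principal, action, resource):
        """Return (allowed, source) so the caller can audit where the answer came from."""
        key = (principal, action, resource)
        now = self.clock()
        hit = self.cache.get(key)
        if hit is not None and now - hit[1] <= self.max_staleness_s:
            return hit[0], "cache"
        try:
            decision = self.check(principal, action, resource)
        except PDPUnavailable:
            # An expired cache entry is NOT reused here: the outage policy decides.
            source = "fail-open" if self.fail_open else "fail-closed"
            return self.fail_open, source
        self.cache[key] = (decision, now)
        return decision, "pdp"
```

The design choice worth noticing is that an entry older than the bound is never served during an outage. Serving it would silently extend the staleness bound, which is the trade this axis asks to be made on purpose.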
Axis 4: Enforcement Position (when in the lifecycle the gate sits)
admission-time: before state enters the system (k8s webhooks, PSA, quota admission)
connection-time: before bytes flow (mTLS, NetworkPolicy, security groups)
request-time: per call (ext_authz, API authz)
data-access-time: per row/column/field (row-level security, column masking)
consumption-time: metered as capacity is used (ResourceQuota, rate limit descriptors)
Positions differ in what they can see and how long the decision stays true:
admission-time validates a mutation once while the world keeps moving —
"policy race with later controller action" is scheduler.md's
view-vs-reality* at the policy layer.
connection-time decisions outlive the connection's context
(long-lived mTLS conn survives a policy change — revocation again).
earlier positions are cheaper and coarser; later positions see more and cost more.
Interrogation:
What can this position actually observe? (admission can't see runtime behavior)
How long does a decision made here remain in force?
What later actor can invalidate the assumption this gate checked?
Defense in depth: which positions back this one up?
Axis 5: Principal Substrate (who is asking, the foundation)
Workload identity is not a policy type; it is what every other axis stands on. You cannot authorize what you cannot name.
human principals: OIDC subject, group membership
workload principals: SPIFFE ID, mTLS SAN, ServiceAccount token, IAM role
attestation: how the credential got bound to the right workload
(SPIRE node+workload attestation; projected, audience-bound SA tokens)
Failure here poisons everything above it:
wrong workload obtains identity -> every policy correctly authorizes the wrong party
credential theft -> identity and authority conflated (deep lesson row 1)
trust bundle skew / rotation break -> valid peers rejected, or dead roots trusted
NAT/proxy strips identity -> policy evaluates against the wrong principal
Interrogation:
Who issued the principal's credential, rooted in what trust bundle?
Is the credential bound (audience, proof-of-possession) or bearer?
How was the workload attested — could another pod obtain this identity?
Does identity survive every hop of the actual traffic path?
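One way to keep "a string is not an attested party" visible in code is to resolve the principal only from a verified peer identity and to return nothing when it is absent or outside the trusted domains. The sketch assumes the caller has already validated the peer certificate chain and is passing in the URI SAN; `principal_for` and the example trust domain are hypothetical. The only format fact relied on is that a SPIFFE ID is a URI of the form `spiffe://<trust-domain>/<path>`.

```python
from urllib.parse import urlsplit


def parse_spiffe_id(uri):
    """Split a SPIFFE ID into (trust_domain, path); raise ValueError if malformed."""
    parts = urlsplit(uri)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError("not a SPIFFE ID")
    if parts.query or parts.fragment or not parts.path.strip("/"):
        raise ValueError("workload SPIFFE ID needs a path and no query/fragment")
    return parts.netloc, parts.path


def principal_for(peer_san, trusted_domains):
    """Return the principal name, or None if there is no usable identity.

    peer_san is the URI SAN from an already-verified peer certificate,
    or None when a proxy or NAT hop stripped the identity.
    """
    if peer_san is None:
        return None
    try:
        domain, path = parse_spiffe_id(peer_san)
    except ValueError:
        return None
    if domain not in trusted_domains:
        return None
    return f"spiffe://{domain}{path}"
```

Returning `None` rather than falling back to a source address forces the policy layer to treat "identity lost at a hop" as unauthenticated instead of evaluating against the wrong principal.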
Technical Bottleneck: Revocation, the Freshness of Granted Authority*
the further authority travels from its source —
into a cache, a bundle, a bearer token, a live connection —
the faster checking becomes, and the harder taking it back becomes.
The problem is essential and has no general solution: every point on axes 1 and 3 is a stance toward it. Count the doc’s failure modes that are this one problem:
token leaked, revocation hard
stale decision cache
old bundle still enforcing
stale group membership
stale relationship cache
long-lived connection outlives policy change
Known recipes (bounded, composable, none universal):
short expiry: rent authority instead of granting it; this is the lease (queue/scheduler/state-machine blocks) applied to permission
introspection / CRL: reintroduce the lookup you tried to escape (hybrid)
push invalidation: control-plane distribution, with its own skew window
zookie (Zanzibar): consistency token; "how stale may this decision be" becomes an explicit per-request parameter, not an ambient property (the flagship recipe)
The canonical statement of the bottleneck is Zanzibar’s “new enemy” problem:
1. remove viewer from ACL
2. add secret content
A stale decision that reorders these shows the revoked viewer the new secret: revocation freshness and content freshness must be causally linked.
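The causal link can be sketched with a versioned ACL and a lagging replica. This is a toy model under stated assumptions: `AclStore`, the integer versions, and the `at_least` parameter are hypothetical stand-ins for a zookie, and "catching up" is simulated by reading a newer snapshot directly rather than by waiting on replication.

```python
class AclStore:
    """Versioned ACL: every write bumps a logical version."""

    def __init__(self):
        self.version = 0
        self.history = {0: frozenset()}  # version -> set of viewers

    def write(self, viewers):
        self.version += 1
        self.history[self.version] = frozenset(viewers)
        return self.version  # handed back to the writer as the consistency token

    def check(self, user, replica_version, at_least=0):
        """Evaluate at the replica's snapshot, but never older than `at_least`."""
        snapshot = max(replica_version, at_least)
        return user in self.history[snapshot]


acl = AclStore()
v1 = acl.write({"alice", "bob"})
v2 = acl.write({"alice"})   # step 1: remove bob from the ACL
content_token = v2          # step 2: new content is stored with the ACL version it was written under

# A replica still at v1 and asked with no freshness requirement lets bob in.
stale_answer = acl.check("bob", replica_version=v1)
# The same replica, asked to be at least as fresh as the content, does not.
linked_answer = acl.check("bob", replica_version=v1, at_least=content_token)
```

Storing the token with the content is what ties the two freshness requirements together: any read of that content version carries the minimum ACL snapshot it may be checked against.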
A strong design says explicitly:
who grants authority,
what authority is represented as,
where the decision is made and enforced,
how stale a decision may be (named, bounded, per path),
how authority is revoked within that bound,
and how every decision is audited.
Gate Protocol (the crossing-point spec; keep)
General:
authenticate principal
collect request context
resolve policy data
evaluate decision
enforce allow/deny
record audit event
cache decision, if allowed (with named staleness bound)
refresh / revoke / expire authority
Capability lifecycle:
issuer creates scoped credential (issuance = the decision)
holder presents; verifier checks signature, audience, expiry, caveats
resource server enforces
credential expires or is revoked (see bottleneck*)
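The capability lifecycle can be sketched with a signed claims blob checked for signature, audience, expiry, and scope. This is not JWT or any standard token format: `issue_token`, `verify_token`, the claim names, and the shared-key HMAC scheme are assumptions chosen to keep the example in the standard library.

```python
import base64
import hashlib
import hmac
import json


def _sign(key: bytes, payload: bytes) -> bytes:
    return base64.urlsafe_b64encode(hmac.new(key, payload, hashlib.sha256).digest())


def issue_token(key, subject, audience, scopes, expires_at):
    """Issuance is the decision: the scopes are fixed here, not at use time."""
    claims = {"sub": subject, "aud": audience, "scopes": sorted(scopes), "exp": expires_at}
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    return (payload + b"." + _sign(key, payload)).decode()


def verify_token(key, token, audience, scope, now):
    """Return (ok, reason). Signature first, then audience, expiry, scope."""
    try:
        payload, sig = token.encode().split(b".")
    except ValueError:
        return False, "malformed"
    if not hmac.compare_digest(sig, _sign(key, payload)):
        return False, "bad signature"
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if claims["aud"] != audience:
        return False, "wrong audience"
    if now >= claims["exp"]:
        return False, "expired"
    if scope not in claims["scopes"]:
        return False, "scope not granted"
    return True, "ok"
```

Nothing in `verify_token` consults a store, so a leaked token stays valid until `exp`. The only revocation lever in this sketch is short expiry; adding a revocation-list lookup would reintroduce the store dependency.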
Central PDP:
PEP receives request
PEP calls Check(principal, action, resource, context [, zookie])
PDP evaluates policy + data
PDP returns allow/deny + reason
PEP enforces and audits
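The central-PDP exchange can be sketched as two small classes, with the PEP recording an audit event for every decision, allowed or not. The names (`Decision`, `PDP`, `PEP`, `handle`), the grant-set policy, and the 200/403 status codes are illustrative assumptions; a real PDP would evaluate policy plus data and accept the optional zookie.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


class PDP:
    """Decides only; never touches the protected resource."""

    def __init__(self, grants):
        self.grants = set(grants)  # {(principal, action, resource)}

    def check(self, principal, action, resource, context):
        if (principal, action, resource) in self.grants:
            return Decision(True, "explicit grant")
        return Decision(False, "no matching grant (default deny)")


class PEP:
    """Enforces and audits; never decides on its own."""

    def __init__(self, pdp):
        self.pdp = pdp
        self.audit_log = []

    def handle(self, principal, action, resource, context, handler):
        decision = self.pdp.check(principal, action, resource, context)
        self.audit_log.append({
            "principal": principal, "action": action, "resource": resource,
            "allowed": decision.allowed, "reason": decision.reason,
        })
        if not decision.allowed:
            return 403, decision.reason
        return 200, handler()
```

Auditing before the allow/deny branch means denied requests are recorded too, which is usually where the interesting events are.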
Named Configurations (lookup table)
Vector = {authority home, decision model, topology, position, principal}.
| Name | Vector | Canonical study object | Signature failure |
|---|---|---|---|
| RBAC | lookup, roles, in-process, request-time, human/SA | k8s RBAC + SubjectAccessReview | role explosion; overbroad admin; stale membership |
| ABAC | lookup, attributes, in-process or PDP, request-time, any | IAM condition evaluation | missing/spoofed attribute; rule-interaction opacity |
| ReBAC | lookup, graph, central + caches + zookies, request-time, human | Zanzibar | traversal cost; stale cache; unexplainable inherited access |
| Bearer capability | token, precomputed scopes, verification-only, request-time, holder | presigned URL; OAuth2 bearer | leak = access; revocation hard; wrong audience accepts |
| Attenuated capability | token + caveat ops, precomputed, verification-only, request-time, delegation chain | Macaroons | unchecked caveat; over-delegation; confused deputy |
| Workload identity | (substrate for all), —, —, —, attested workload | SPIFFE/SPIRE + Envoy SDS | wrong workload attested; rotation failure; bundle skew |
| Central PDP | lookup, any model, remote Check, request-time, any | Envoy ext_authz | PDP outage; fail-open leak; per-request latency; stale cache |
| Policy-as-code bundle | lookup, attributes/rules, local engine + pushed data, request-time, workload | OPA bundles; Cedar | version skew across enforcers; rollout breaks clients |
| Admission policy | lookup, rules, in-process webhook, admission-time, user/SA | k8s validating/mutating webhooks | webhook outage blocks cluster; race with later controllers; fail-open |
| Network/service policy | lookup, identity+L4/L7 rules, pushed to proxies, connection-time, workload | Istio AuthorizationPolicy + mTLS | default-allow surprise; identity lost at proxy/NAT; policy ≠ traffic path |
| Data governance | lookup, classification attrs, engine-embedded, data-access-time, human/service | row/column policy; catalog governance | PII in logs; forbidden-region replica; backup ignores deletion |
| Quota/resource | lookup, counters, in-process admission, consumption-time, tenant | ResourceQuota; rate-limit service | undercount; wrong key; burst bypass; unmetered shared resource |
Vocabulary
principal subject resource action context
policy role attribute relationship tuple
capability credential token claim scope audience expiry
caveat attenuation delegation confused deputy
trust root attestation rotation
PDP PEP decision reason audit
revocation introspection zookie new enemy
fail-open fail-closed default deny
Deep Lesson
Policy bugs come from confusing pairs on different axes:
identity vs authority (axis 5 vs axes 1–4: naming ≠ permitting)
name vs principal (axis 5: a string is not an attested party)
authentication vs authorization (substrate vs decision)
token possession vs valid permission (axis 1: bearer ≠ still-authorized — revocation*)
role vs resource-specific access (axis 2: grant scope ≠ evaluation scope)
cache hit vs fresh decision (revocation*: staleness must be named)
network boundary vs trust boundary (boundary.md: mechanism ≠ concern)
fail-open vs availability (axis 3: the gate's own outage is a decision)
Design procedure: attest the principal, choose where authority lives, pick the decision model, place evaluation and enforcement, then name the staleness bound and the revocation path for every cached grant. The named types are recognition shortcuts, not the design space.