Policy / Capability

who may do what, to which resource, under which context?
policy      = rule for deciding authority
capability  = credential that carries authority
enforcement = the place where the decision is applied

Role in the catalog: this block is boundary.md’s crossing discipline, promoted to its own module. boundary.md owns where the lines are; this file owns what happens at the gate.

Central tension (this is axis 1’s tradeoff, stated up front):

local, fast, available decisions
        vs
fresh, revocable, globally consistent decisions

Design Axes (the core module)

Axis 1: Where Authority Lives (the structural cleave)

lookup:  authority is a fact in a store, evaluated at decision time
         (RBAC bindings, IAM policies, Zanzibar tuples)
token:   authority is carried in the credential, verified locally
         (presigned URL, bearer token, macaroon, X.509 SAN)

This changes the interface shape, not just policy:

lookup -> decision needs the store: fresh, revocable, centrally auditable,
          but a runtime dependency on every request path
token  -> verification needs only a key: fast, partition-tolerant, offline,
          but revocation becomes the hard problem and audit moves to issuance
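A minimal Python sketch of the two authority homes, assuming a toy in-memory grant store and a toy HMAC-signed token with `|`-delimited claims; `AclStore`, `issue_token`, and `verify_token` are hypothetical names, not any library's API. Deleting the grant revokes the lookup path at once, while the already-issued token keeps verifying until it expires; a store outage surfaces as an exception on the request path.

```python
import hashlib
import hmac

class AclStore:
    """Lookup: authority is a fact in a store, read at decision time."""
    def __init__(self):
        self.grants = set()      # (principal, action, resource) tuples
        self.reachable = True    # toggle to simulate a store outage

    def check(self, principal, action, resource):
        if not self.reachable:
            # Runtime dependency on the request path: no store, no decision.
            raise ConnectionError("authority store unreachable")
        return (principal, action, resource) in self.grants

def issue_token(key, principal, action, resource, ttl_s, now):
    """Token: the issuer decides once and signs the result.
    Assumes no claim contains '|'."""
    claims = f"{principal}|{action}|{resource}|{int(now + ttl_s)}"
    sig = hmac.new(key, claims.encode(), hashlib.sha256).hexdigest()
    return f"{claims}|{sig}"

def verify_token(key, token, action, resource, now):
    """Verification needs only the key: no store, no network call."""
    claims, _, sig = token.rpartition("|")
    expected = hmac.new(key, claims.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    _principal, tok_action, tok_resource, exp = claims.split("|")
    return tok_action == action and tok_resource == resource and now < int(exp)

key = b"issuer-secret"
store = AclStore()
store.grants.add(("alice", "read", "doc1"))
tok = issue_token(key, "alice", "read", "doc1", ttl_s=300, now=1000)

store.grants.discard(("alice", "read", "doc1"))          # revoke at the source
print(store.check("alice", "read", "doc1"))              # False: lookup sees it at once
print(verify_token(key, tok, "read", "doc1", now=1100))  # True: token still verifies
print(verify_token(key, tok, "read", "doc1", now=1400))  # False: expiry is the only brake
```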

Delegation/attenuation is not a type — it is the signature operation on token-based authority (macaroons: add caveats, never remove; each hop can only shrink the grant).
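The attenuation property can be sketched with the HMAC chaining that macaroons use: the root signature is `HMAC(root_key, identifier)`, and each added caveat re-keys it as `HMAC(previous_sig, caveat)`. This is a minimal Python illustration that assumes first-party caveats of a hypothetical `field = value` form; it omits third-party caveats, serialization, and everything else a real macaroon library provides.

```python
import hashlib
import hmac

def _mac(key, msg):
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()

class Macaroon:
    """Toy macaroon: identifier, ordered first-party caveats, chained signature."""
    def __init__(self, identifier, caveats, sig):
        self.identifier = identifier
        self.caveats = list(caveats)
        self.sig = sig

    @classmethod
    def mint(cls, root_key, identifier):
        # Only the issuer, who holds root_key, can mint.
        return cls(identifier, [], _mac(root_key, identifier))

    def attenuate(self, caveat):
        # Any holder can append: new_sig = HMAC(old_sig, caveat). Dropping a
        # caveat needs the earlier signature, which a downstream holder never gets.
        return Macaroon(self.identifier, self.caveats + [caveat], _mac(self.sig, caveat))

def verify_macaroon(root_key, m, request):
    """Recompute the chain from the root key; every caveat must hold."""
    sig = _mac(root_key, m.identifier)
    for caveat in m.caveats:
        field, _, value = caveat.partition(" = ")
        if request.get(field) != value:
            return False   # skipping this check is the "unchecked caveat" failure
        sig = _mac(sig, caveat)
    return hmac.compare_digest(sig, m.sig)

root = b"root-key"
m0 = Macaroon.mint(root, "grant-42")   # issued grant
m1 = m0.attenuate("bucket = logs")     # first hop narrows to one bucket
m2 = m1.attenuate("op = read")         # second hop narrows to read-only
print(verify_macaroon(root, m2, {"bucket": "logs", "op": "read"}))         # True
print(verify_macaroon(root, m2, {"bucket": "logs", "op": "write"}))        # False
stripped = Macaroon(m2.identifier, m2.caveats[:1], m2.sig)                 # try to drop "op = read"
print(verify_macaroon(root, stripped, {"bucket": "logs", "op": "write"}))  # False
```

Because each hop only passes on the newest signature, a recipient can narrow the grant further but has nothing to recompute a wider one from.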

Interrogation:

Is the decision a lookup or a verification?
What breaks when the authority store is unreachable?
Who was the PDP for a token? (answer: the issuer, at issuance time)
Can a holder attenuate before passing on, or only forward the full grant?

Axis 2: Decision Model (how the decision is computed)

Substitutable evaluation languages, freely composed in practice:

role lookup (RBAC):        principal -> role -> permitted actions
                           cheap, legible; pays: role explosion, scope reuse
attribute predicate (ABAC): f(principal attrs, resource attrs, action, context)
                           expressive; pays: missing/spoofed attributes,
                           rule-interaction opacity
graph reachability (ReBAC): does a path exist in the relationship graph?
                           natural for sharing/hierarchy; pays: traversal cost,
                           inherited access that no one can explain

Compositions are the norm: IAM = ABAC in role clothing; Zanzibar-style caveats (e.g. SpiceDB) = ABAC embedded in ReBAC.
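A toy Python sketch of the three evaluation styles and one composition, assuming hard-coded in-memory data: role tables, attribute dicts, and relationship tuples written loosely in the spirit of Zanzibar's `object#relation@user` notation. None of this is the k8s, IAM, or Zanzibar API. The ReBAC walk bounds depth and skips revisited nodes, which answers the depth and cycle questions in the interrogation list for this toy.

```python
ROLE_BINDINGS = {"alice": {"editor"}, "bob": {"viewer"}}
ROLE_PERMS = {"editor": {"read", "write"}, "viewer": {"read"}}

def rbac(principal, action):
    """principal -> role -> permitted actions"""
    roles = ROLE_BINDINGS.get(principal, set())
    return any(action in ROLE_PERMS.get(role, set()) for role in roles)

def abac(principal_attrs, resource_attrs, action, context):
    """f(principal attrs, resource attrs, action, context)"""
    same_dept = principal_attrs.get("dept") == resource_attrs.get("dept")
    return same_dept and (action == "read" or context.get("mfa", False))

# Tuples are (object, relation, subject); a subject like "group:eng#member"
# is itself a set of users, which is what makes this a graph.
TUPLES = {
    ("doc:plan", "viewer", "group:eng#member"),
    ("group:eng", "member", "user:carol"),
}

def rebac(obj, relation, user, max_depth=8):
    """Is there a path from obj#relation to user?"""
    frontier, seen = [(obj, relation, 0)], set()
    while frontier:
        o, rel, depth = frontier.pop()
        if (o, rel) in seen or depth > max_depth:
            continue                      # cycle or depth bound hit
        seen.add((o, rel))
        for t_obj, t_rel, subject in TUPLES:
            if (t_obj, t_rel) != (o, rel):
                continue
            if subject == user:
                return True
            if "#" in subject:            # follow the userset edge
                s_obj, s_rel = subject.split("#", 1)
                frontier.append((s_obj, s_rel, depth + 1))
    return False

def rebac_with_caveat(obj, relation, user, principal_attrs, resource_attrs, action, context):
    """Composition: a graph path must exist AND an attribute predicate must hold."""
    return rebac(obj, relation, user) and abac(principal_attrs, resource_attrs, action, context)

print(rbac("alice", "write"), rbac("bob", "write"))   # True False
print(rebac("doc:plan", "viewer", "user:carol"))      # True
print(rebac("doc:plan", "viewer", "user:dave"))       # False
print(rebac_with_caveat("doc:plan", "viewer", "user:carol",
                        {"dept": "eng"}, {"dept": "eng"}, "write", {"mfa": False}))  # False
```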

Interrogation:

Can the decision be explained? (who granted this, via what path/rule?)
Who asserts each attribute, and can the requester influence it?
Is deny explicit or default? Do rules combine as first-deny-wins or any-allow?
For graphs: is depth bounded? Are cycles legal?
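A small Python sketch of the combining question, assuming rules are (name, effect, predicate) triples. `deny_overrides` and `permit_overrides` borrow XACML's combining-algorithm names but are toy reimplementations; both fall through to default deny and return a reason string, so every decision can say which rule produced it.

```python
from typing import Callable, NamedTuple

class Rule(NamedTuple):
    name: str
    effect: str                       # "allow" or "deny"
    matches: Callable[[dict], bool]   # predicate over the request

def deny_overrides(rules, request):
    """Any matching deny wins; else any matching allow; else default deny."""
    allowed_by = None
    for rule in rules:
        if rule.matches(request):
            if rule.effect == "deny":
                return False, f"explicit deny: {rule.name}"
            allowed_by = allowed_by or rule.name
    if allowed_by:
        return True, f"allowed by {allowed_by}"
    return False, "default deny: no rule matched"

def permit_overrides(rules, request):
    """Any matching allow wins; else any matching deny; else default deny."""
    denied_by = None
    for rule in rules:
        if rule.matches(request):
            if rule.effect == "allow":
                return True, f"allowed by {rule.name}"
            denied_by = denied_by or rule.name
    if denied_by:
        return False, f"explicit deny: {denied_by}"
    return False, "default deny: no rule matched"

rules = [
    Rule("staff-read", "allow", lambda r: r["action"] == "read"),
    Rule("no-contractor-pii", "deny", lambda r: r["role"] == "contractor" and r["pii"]),
]
req = {"action": "read", "role": "contractor", "pii": True}
print(deny_overrides(rules, req))    # (False, 'explicit deny: no-contractor-pii')
print(permit_overrides(rules, req))  # (True, 'allowed by staff-read')
print(deny_overrides(rules, {"action": "write", "role": "staff", "pii": False}))
# (False, 'default deny: no rule matched')
```

The same two rules give opposite answers under the two combiners, which is why the combining rule belongs in the policy's stated contract rather than in an engine default.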

Axis 3: Decision Topology (where evaluation runs)

in-process, local data:     evaluator inside the server        (k8s RBAC in apiserver)
local engine, pushed data:  sidecar/library + policy bundles   (OPA, Istio AuthorizationPolicy)
remote central PDP:         Check() per request                (ext_authz, custom authz service)
precomputed into token:     issuer was the PDP; runtime is
                            verification only                  (presigned URL, JWT scopes)

The pushed-bundle case is control-plane/data-plane (boundary.md #13) wearing policy clothes — same machinery and same failure modes as xDS:

bundle version skew across enforcers = xDS staleness
last-good-policy on distribution failure = last-good-config
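A minimal Python sketch of both rows, assuming each enforcer pulls a versioned bundle through a hypothetical `fetch` callable that returns `(revision, allowed_pairs)` or raises `OSError`. Two enforcers end up on different revisions, and the one whose fetch failed keeps allowing a principal that the newer bundle revoked; returning the revision with each decision is what makes the skew visible.

```python
class Enforcer:
    """Hypothetical local enforcer holding a pushed policy bundle."""
    def __init__(self, name):
        self.name = name
        self.revision = None             # no bundle yet: nothing is allowed
        self.allowed = frozenset()       # (principal, action) pairs

    def apply_bundle(self, fetch):
        try:
            revision, allowed = fetch()
        except OSError:
            return False                 # keep last-good policy, now aging
        self.revision, self.allowed = revision, frozenset(allowed)
        return True

    def check(self, principal, action):
        # Return the revision used, so "which policy decided?" has an answer.
        return (principal, action) in self.allowed, self.revision

def bundle_v1():
    return 1, {("alice", "read"), ("bob", "read")}

def bundle_v2():
    return 2, {("alice", "read")}        # bob revoked in revision 2

def distribution_down():
    raise OSError("bundle server unreachable")

east, west = Enforcer("east"), Enforcer("west")
east.apply_bundle(bundle_v1)
west.apply_bundle(bundle_v1)
east.apply_bundle(bundle_v2)
west.apply_bundle(distribution_down)

print(east.check("bob", "read"))  # (False, 2)
print(west.check("bob", "read"))  # (True, 1): revoked principal still allowed
```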

Central PDP adds the gate’s own availability question:

PDP outage: fail open (availability, security hole)
            or fail closed (secure, everything stops)?
decision caches trade the same coin as tokens: speed for staleness
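A Python sketch of a policy enforcement point in front of a remote PDP, assuming the PDP is any callable that returns a bool or raises `ConnectionError`. The class name `Pep`, the injected clock, and the TTL cache are illustrative choices, not a specific product's API; the point is that the staleness bound and the outage behavior are both explicit constructor arguments.

```python
import time

class Pep:
    """Hypothetical PEP: decision cache with a named staleness bound and an
    explicit fail-open / fail-closed choice for PDP outages."""
    def __init__(self, pdp_check, fail_open, cache_ttl_s, clock=time.monotonic):
        self.pdp_check = pdp_check
        self.fail_open = fail_open        # decided per enforcement point
        self.cache_ttl_s = cache_ttl_s    # how stale an answer may be
        self.clock = clock
        self.cache = {}                   # key -> (decision, cached_at)

    def authorize(self, principal, action, resource):
        key = (principal, action, resource)
        now = self.clock()
        hit = self.cache.get(key)
        if hit is not None and now - hit[1] < self.cache_ttl_s:
            return hit[0], "cache"        # fast, possibly stale
        try:
            decision = self.pdp_check(principal, action, resource)
        except ConnectionError:
            return self.fail_open, "pdp-unavailable"
        self.cache[key] = (decision, now)
        return decision, "pdp"

class FakeClock:
    def __init__(self):
        self.t = 0.0
    def __call__(self):
        return self.t

pdp_state = {"up": True, "allow": True}

def fake_pdp(principal, action, resource):
    if not pdp_state["up"]:
        raise ConnectionError("PDP unreachable")
    return pdp_state["allow"]

clock = FakeClock()
pep = Pep(fake_pdp, fail_open=False, cache_ttl_s=30, clock=clock)
print(pep.authorize("alice", "read", "doc"))  # (True, 'pdp')
pdp_state["allow"] = False                    # revoked at the PDP
clock.t = 10
print(pep.authorize("alice", "read", "doc"))  # (True, 'cache'): stale allow, within bound
clock.t = 40
pdp_state["up"] = False
print(pep.authorize("alice", "read", "doc"))  # (False, 'pdp-unavailable'): fails closed
```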

Interrogation:

Where is policy stored / evaluated / enforced? (three different places, usually)
Latency budget: can the request path afford a remote Check()?
What version of policy did this enforcer use, and how would you know?
Fail-open or fail-closed — decided explicitly, per enforcement point?

Axis 4: Enforcement Position (when in the lifecycle the gate sits)

admission-time:    before state enters the system   (k8s webhooks, PSA, quota admission)
connection-time:   before bytes flow                (mTLS, NetworkPolicy, security groups)
request-time:      per call                         (ext_authz, API authz)
data-access-time:  per row/column/field             (row-level security, column masking)
consumption-time:  metered as capacity is used      (ResourceQuota, rate limit descriptors)

Positions differ in what they can see and how long the decision stays true:

admission-time validates a mutation once while the world keeps moving —
"policy race with later controller action" is scheduler.md's
view-vs-reality* at the policy layer.
connection-time decisions outlive the connection's context
(long-lived mTLS conn survives a policy change — revocation again).
earlier positions are cheaper and coarser; later positions see more and cost more.

Interrogation:

What can this position actually observe? (admission can't see runtime behavior)
How long does a decision made here remain in force?
What later actor can invalidate the assumption this gate checked?
Defense in depth: which positions back this one up?

Axis 5: Principal Substrate (who is asking; the foundation)

Workload identity is not a policy type; it is what every other axis stands on. You cannot authorize what you cannot name.

human principals:    OIDC subject, group membership
workload principals: SPIFFE ID, mTLS SAN, ServiceAccount token, IAM role
attestation:         how the credential got bound to the right workload
                     (SPIRE node+workload attestation; projected, audience-bound SA tokens)

Failure here poisons everything above it:

wrong workload obtains identity    -> every policy correctly authorizes the wrong party
credential theft                   -> identity and authority conflated (deep lesson row 1)
trust bundle skew / rotation break -> valid peers rejected, or dead roots trusted
NAT/proxy strips identity          -> policy evaluates against the wrong principal
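A Python sketch of keying policy on an attested identity instead of a network address, assuming the peer's SPIFFE ID has already been extracted from a verified mTLS certificate's URI SAN. `parse_spiffe_id` checks only the basic `spiffe://<trust-domain>/<path>` shape, not every rule in the SPIFFE ID specification, and the trust domain `prod.example` and the paths are made up.

```python
from urllib.parse import urlsplit

def parse_spiffe_id(uri):
    """Split spiffe://<trust-domain>/<path>; minimal checks only."""
    parts = urlsplit(uri)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError(f"not a SPIFFE ID: {uri!r}")
    if parts.query or parts.fragment or "@" in parts.netloc or ":" in parts.netloc:
        raise ValueError("SPIFFE IDs carry no query, fragment, userinfo, or port")
    return parts.netloc, parts.path

def authorize_peer(peer_spiffe_id, trusted_domains, allowed_paths):
    """Authorize the named workload, never the source IP (which NAT may rewrite)."""
    domain, path = parse_spiffe_id(peer_spiffe_id)
    if domain not in trusted_domains:
        return False   # not rooted in a trust bundle we accept
    return path in allowed_paths

trusted = {"prod.example"}
allowed = {"/ns/payments/sa/api"}
print(authorize_peer("spiffe://prod.example/ns/payments/sa/api", trusted, allowed))  # True
print(authorize_peer("spiffe://dev.example/ns/payments/sa/api", trusted, allowed))   # False
print(authorize_peer("spiffe://prod.example/ns/batch/sa/job", trusted, allowed))     # False
```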

Interrogation:

Who issued the principal's credential, rooted in what trust bundle?
Is the credential bound (audience, proof-of-possession) or bearer?
How was the workload attested — could another pod obtain this identity?
Does identity survive every hop of the actual traffic path?

Technical Bottleneck: Revocation (the Freshness of Granted Authority)*

the further authority travels from its source —
into a cache, a bundle, a bearer token, a live connection —
the faster checking becomes, and the harder taking it back becomes.

This bottleneck is essential and has no general solution: every point on axes 1 and 3 is a stance toward it. Count how many of this doc’s failure modes are this one problem:

token leaked, revocation hard        stale decision cache
old bundle still enforcing           stale group membership
stale relationship cache             long-lived connection outlives policy change

Known recipes (bounded, composable, none universal):

short expiry            rent authority instead of granting it —
                        the lease (queue/scheduler/state-machine blocks),
                        applied to permission
introspection / CRL     reintroduce the lookup you tried to escape (hybrid)
push invalidation       control-plane distribution, with its own skew window
zookie (Zanzibar)       consistency token: "how stale may this decision be"
                        becomes an explicit per-request parameter, not an
                        ambient property — the flagship recipe

The canonical statement of the bottleneck is Zanzibar’s “new enemy” problem:

1. remove viewer from ACL     2. add secret content
a check evaluated at an ACL snapshot from before step 1, but serving content from after step 2,
shows the revoked viewer the new secret: revocation freshness and content freshness must be causally linked
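A toy Python model of the new-enemy ordering and the zookie fix, assuming a single global revision counter and check replicas that lag by a fixed number of revisions. `VersionedAcl` and `Replica` are hypothetical; Zanzibar's real zookies are opaque tokens encoding a timestamp. The content write records the ACL revision current at write time, and a check that carries that zookie may not be answered from an older snapshot.

```python
class VersionedAcl:
    """Toy ACL with a global revision; checks can read any past snapshot."""
    def __init__(self):
        self.revision = 0
        self.history = []                # (revision, user, may_view), in order

    def write(self, user, may_view):
        self.revision += 1
        self.history.append((self.revision, user, may_view))
        return self.revision             # usable as a zookie: "at least this fresh"

    def check_at(self, user, snapshot):
        may_view = False
        for rev, u, value in self.history:
            if rev <= snapshot and u == user:
                may_view = value         # latest write at or before snapshot wins
        return may_view

class Replica:
    """Check server whose snapshot trails the latest revision by `lag`."""
    def __init__(self, acl, lag):
        self.acl, self.lag = acl, lag

    def check(self, user, zookie=None):
        snapshot = max(0, self.acl.revision - self.lag)
        if zookie is not None and snapshot < zookie:
            # Too stale for this request. A real system would wait or route to a
            # fresher replica; the toy just reads at the zookie's revision.
            snapshot = zookie
        return self.acl.check_at(user, snapshot)

acl = VersionedAcl()
acl.write("mallory", True)            # rev 1: mallory may view
acl.write("mallory", False)           # rev 2: step 1, remove viewer
secret_zookie = acl.revision          # step 2: new content stored with zookie 2

replica = Replica(acl, lag=1)         # this replica still serves revision 1
print(replica.check("mallory"))                        # True: the new enemy sees the secret
print(replica.check("mallory", zookie=secret_zookie))  # False: at least as fresh as the content
```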

A strong design says explicitly:

who grants authority,
what authority is represented as,
where the decision is made and enforced,
how stale a decision may be (named, bounded, per path),
how authority is revoked within that bound,
and how every decision is audited.

Gate Protocol (the crossing-point spec)

General:

authenticate principal
collect request context
resolve policy data
evaluate decision
enforce allow/deny
record audit event
cache decision, if allowed (with named staleness bound)
refresh / revoke / expire authority
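The general steps can be read as one function. This Python sketch assumes hypothetical `authenticate` and `evaluate` hooks, an in-memory dict as the decision cache, and a list as the audit sink; it keeps the step order of the list above, audits denied and cached decisions as well as fresh ones, and represents the refresh/expire step only through the staleness bound.

```python
import time
from typing import NamedTuple

class Decision(NamedTuple):
    allow: bool
    reason: str

AUDIT_LOG = []   # hypothetical audit sink

def gate(request, authenticate, evaluate, cache, max_staleness_s, now=None):
    now = time.monotonic() if now is None else now
    principal = authenticate(request.get("credential"))    # authenticate principal
    if principal is None:
        decision = Decision(False, "unauthenticated")
    else:
        context = {"action": request["action"],            # collect request context
                   "resource": request["resource"]}
        key = (principal, context["action"], context["resource"])
        cached = cache.get(key)
        if cached is not None and now - cached[1] <= max_staleness_s:
            decision = cached[0]                           # reuse within the bound
        else:
            decision = evaluate(principal, context)        # resolve data + evaluate
            cache[key] = (decision, now)                   # cache with named bound
    AUDIT_LOG.append({"t": now, "principal": principal,   # record audit event
                      "action": request["action"], "resource": request["resource"],
                      "allow": decision.allow, "reason": decision.reason})
    return decision.allow                                  # the PEP enforces this bit

def authenticate(credential):
    return {"tok-alice": "alice"}.get(credential)          # toy credential table

def evaluate(principal, context):
    if principal == "alice" and context["action"] == "read":
        return Decision(True, "rule: alice may read")
    return Decision(False, "no matching rule")

cache = {}
print(gate({"credential": "tok-alice", "action": "read", "resource": "doc"},
           authenticate, evaluate, cache, max_staleness_s=60, now=0))   # True
print(gate({"credential": "forged", "action": "read", "resource": "doc"},
           authenticate, evaluate, cache, max_staleness_s=60, now=1))   # False
print(len(AUDIT_LOG))                                                   # 2
```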

Capability lifecycle:

issuer creates scoped credential (issuance = the decision)
holder presents; verifier checks signature, audience, expiry, caveats
resource server enforces
credential expires or is revoked (see bottleneck*)
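A Python sketch of the verifier's checks in order, assuming an HMAC-signed JSON credential with JWT-style field names (`sub`, `aud`, `exp`, `scope`, `jti`). It is not JWT or any real token format; `issue_credential` and `verify_credential` are hypothetical, and `revoked_jtis` stands in for the introspection or CRL lookup that the revocation recipes reintroduce.

```python
import base64
import hashlib
import hmac
import json

def _b64(data):
    return base64.urlsafe_b64encode(data).decode()

def issue_credential(key, sub, aud, scope, exp, jti):
    """Issuance is the decision: the issuer chooses sub, aud, scope, exp."""
    body = json.dumps({"sub": sub, "aud": aud, "scope": scope, "exp": exp, "jti": jti},
                      sort_keys=True).encode()
    sig = hmac.new(key, body, hashlib.sha256).digest()
    return _b64(body) + "." + _b64(sig)

def verify_credential(key, token, my_audience, needed_scope, now, revoked_jtis=frozenset()):
    """Verifier: signature, audience, expiry, revocation, then scope."""
    body_b64, _, sig_b64 = token.partition(".")
    body = base64.urlsafe_b64decode(body_b64)
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, base64.urlsafe_b64decode(sig_b64)):
        return False, "bad signature"
    claims = json.loads(body)
    if claims["aud"] != my_audience:
        return False, "wrong audience"       # minted for a different service
    if now >= claims["exp"]:
        return False, "expired"
    if claims["jti"] in revoked_jtis:
        return False, "revoked"              # the hybrid: a lookup again
    if needed_scope not in claims["scope"]:
        return False, "scope not granted"
    return True, "ok"

key = b"issuer-key"
tok = issue_credential(key, "svc-a", "storage", ["read"], exp=2000, jti="t1")
print(verify_credential(key, tok, "storage", "read", now=1000))   # (True, 'ok')
print(verify_credential(key, tok, "billing", "read", now=1000))   # (False, 'wrong audience')
print(verify_credential(key, tok, "storage", "read", now=1000,
                        revoked_jtis={"t1"}))                     # (False, 'revoked')
```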

Central PDP:

PEP receives request
PEP calls Check(principal, action, resource, context [, zookie])
PDP evaluates policy + data
PDP returns allow/deny + reason
PEP enforces and audits

Named Configurations (lookup table)

Vector = {authority home, decision model, topology, position, principal}.

| Name | Vector | Canonical study object | Signature failure |
|---|---|---|---|
| RBAC | lookup, roles, in-process, request-time, human/SA | k8s RBAC + SubjectAccessReview | role explosion; overbroad admin; stale membership |
| ABAC | lookup, attributes, in-process or PDP, request-time, any | IAM condition evaluation | missing/spoofed attribute; rule-interaction opacity |
| ReBAC | lookup, graph, central + caches + zookies, request-time, human | Zanzibar | traversal cost; stale cache; unexplainable inherited access |
| Bearer capability | token, precomputed scopes, verification-only, request-time, holder | presigned URL; OAuth2 bearer | leak = access; revocation hard; wrong-audience token accepted |
| Attenuated capability | token + caveat ops, precomputed, verification-only, request-time, delegation chain | Macaroons | unchecked caveat; over-delegation; confused deputy |
| Workload identity | (substrate for all), n/a, n/a, n/a, attested workload | SPIFFE/SPIRE + Envoy SDS | wrong workload attested; rotation failure; bundle skew |
| Central PDP | lookup, any model, remote Check, request-time, any | Envoy ext_authz | PDP outage; fail-open leak; per-request latency; stale cache |
| Policy-as-code bundle | lookup, attributes/rules, local engine + pushed data, request-time, workload | OPA bundles; Cedar | version skew across enforcers; rollout breaks clients |
| Admission policy | lookup, rules, remote webhook called by the apiserver, admission-time, user/SA | k8s validating/mutating webhooks | webhook outage blocks cluster; race with later controllers; fail-open |
| Network/service policy | lookup, identity + L4/L7 rules, pushed to proxies, connection-time, workload | Istio AuthorizationPolicy + mTLS | default-allow surprise; identity lost at proxy/NAT; policy ≠ traffic path |
| Data governance | lookup, classification attrs, engine-embedded, data-access-time, human/service | row/column policy; catalog governance | PII in logs; forbidden-region replica; backup ignores deletion |
| Quota/resource | lookup, counters, in-process admission, consumption-time, tenant | ResourceQuota; rate-limit service | undercount; wrong key; burst bypass; unmetered shared resource |

Vocabulary

principal  subject  resource  action  context
policy  role  attribute  relationship  tuple
capability  credential  token  claim  scope  audience  expiry
caveat  attenuation  delegation  confused deputy
trust root  attestation  rotation
PDP  PEP  decision  reason  audit
revocation  introspection  zookie  new enemy
fail-open  fail-closed  default deny

Deep Lesson

Policy bugs come from confusing pairs on different axes:

identity            vs  authority              (axis 5 vs axes 1–4: naming ≠ permitting)
name                vs  principal              (axis 5: a string is not an attested party)
authentication      vs  authorization          (substrate vs decision)
token possession    vs  valid permission       (axis 1: bearer ≠ still-authorized — revocation*)
role                vs  resource-specific access (axis 2: grant scope ≠ evaluation scope)
cache hit           vs  fresh decision         (revocation*: staleness must be named)
network boundary    vs  trust boundary         (boundary.md: mechanism ≠ concern)
fail-open           vs  availability           (axis 3: the gate's own outage is a decision)

Design procedure: attest the principal, choose where authority lives, pick the decision model, place evaluation and enforcement, then name the staleness bound and the revocation path for every cached grant. The named types are recognition shortcuts, not the design space.