Secret Management Service (Vault-class) #
This note models a Vault-class secret management service where clients authenticate, read or generate secrets, receive leased dynamic credentials, renew or revoke them, and the system rotates underlying secret material safely at scale.
Step 1 - Normalize #
Assume the baseline prompt is:
- design a secret management service
- clients authenticate and request secrets
- static secrets are stored securely and versioned
- dynamic credentials can be generated with leases
- secrets and credentials can be rotated, renewed, revoked, and audited
- system scales across many tenants, apps, and secret scopes
Normalize into state-affecting paths.
| Requirement | Actor | Operation | State touched | Priority |
|---|---|---|---|---|
| Client authenticates to secret service | Client | state transition | S1 update target AuthSessionState | C1 |
| Admin writes or updates static secret | Admin | overwrite state | S1 update target SecretVersionState | C1 |
| Client reads secret | Client | read source | S1 read source target SecretVersionState | C1 |
| Service issues dynamic leased credential | System | append event | S1 create target LeasedCredential | C1 |
| Client renews leased credential | Client | state transition | S1 update target LeaseState | C1 |
| Client or system revokes leased credential | Client/System | state transition | S1 update target LeaseState | C1 |
| Admin rotates secret or backend root credential | Admin | state transition | S1 update target SecretRotationState | C1 |
| Admin updates access policy | Admin | overwrite state | S1 update target SecretAccessPolicy | C1 |
| System emits audit log for secret access or mutation | System | append event | S1 create target SecretAuditEvent | C1 |
| Client reads secret inventory / lease status | Client | read projection | S1 read projection target SecretStatusView | R2 |
| System routes tenant/path shard to current owner | System | read source | S1 read source target PartitionMap | C1 |
| System reassigns shard ownership after node failure | System | state transition | S1 update target PartitionOwnership | C1 |
Notes on normalization #
Important choices:
- auth establishes a current session/token lifecycle
- static secret writes are current-value state with versioning
- reads are security-critical source reads
- dynamic credential issuance is an immutable issuance fact
- lease renew/revoke are lifecycle transitions
- rotation is explicit because secret version and backend validity change over time
- audit is append-only and part of the core product
This system is a hybrid of:
- secure current-value secret storage
- lease-backed dynamic credentials
- policy-gated access
Step 2 - Critical Path Selection #
| Requirement | Priority class | Why |
|---|---|---|
| Authenticate to service | C1 | auth establishes who may read or mutate secrets |
| Write/update static secret | C1 | current secret truth changes future reads |
| Read secret | C1 | wrong allow/deny or wrong secret value is core correctness/security failure |
| Issue dynamic credential | C1 | leased credentials are core product output |
| Renew lease | C1 | stale or invalid renewals affect credential validity |
| Revoke lease | C1 | revocation must affect future usage |
| Rotate secret/backend credential | C1 | rotation correctness affects secrecy and downstream clients |
| Update access policy | C1 | future access and issuance depend on current policy |
| Emit audit log | C1 | auditable access is often a product requirement |
| Read inventory/status | R2 | operational only |
| Route to shard owner | C1 | wrong routing can split secret truth |
| Reassign shard ownership | C1 | failover must preserve secret/lease correctness |
Baseline critical paths #
Main C1 paths:
- P1 authenticate
- P2 write/update static secret
- P3 read secret
- P4 issue dynamic credential
- P5 renew lease
- P6 revoke lease
- P7 rotate secret/backend credential
- P8 update access policy
- P9 emit audit event
- P10 route to shard owner
- P11 reassign shard ownership
This design is driven by:
- authoritative current secret version
- lease lifecycle for dynamic credentials
- strict policy-gated access
- durable audit trail
Step 3 - Primary State Extraction #
For a Vault-class system, the minimal primary state is the current auth session, current secret version, dynamic credential issuance and lease lifecycle, rotation state, access policy, audit trail, and routing/ownership state.
| Candidate object label | Candidate source | Candidate needed for C1/R1? | Candidate decomposition action | Class | Primary? | Owner | Evolution | Scope kind | Scope value |
|---|---|---|---|---|---|---|---|---|---|
| AuthSessionState | direct noun | Yes | keep as candidate | process | Yes | service | state machine | instance | session_id |
| SecretVersionState | direct noun | Yes | keep as candidate | entity | Yes | service | overwrite | instance | secret_path |
| LeasedCredential | direct noun | Yes | keep as candidate | event | Yes | service | append-only | instance | credential_id |
| LeaseState | lifecycle object | Yes | keep as candidate | process | Yes | service | state machine | instance | lease_id |
| SecretRotationState | lifecycle object | Yes | keep as candidate | process | Yes | service | state machine | instance | secret_path or backend_id |
| SecretAccessPolicy | direct noun | Yes | keep as candidate | entity | Yes | service | overwrite | instance | policy_scope |
| SecretAuditEvent | direct noun | Yes | keep as candidate | event | Yes | service | append-only | instance | audit_id |
| PartitionOwnership | hidden write target | Yes | keep as candidate | process | Yes | service | state machine | instance | shard_id |
| PartitionMap | hidden write target | Yes | keep as candidate | entity | Yes | service | overwrite | collection | tenant/path shards |
| SecretStatusView | derived read model | No | reject as UI artifact | projection | No | derived | overwrite | collection | tenant or path |
Important modeling choices #
SecretVersionState #
Primary because:
- current secret truth is the core product for static secrets
- versioning matters for rollback, history, and rotation
LeasedCredential #
Primary because:
- dynamic credentials are an issuance fact and often need auditability
LeaseState #
Primary because:
- current validity of a dynamic credential is lifecycle state
- states like ACTIVE, EXPIRED, REVOKED, RENEWED
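A minimal sketch of how these lease states might constrain each other. The transition table is an assumption made for illustration (the note only names the states): it treats `RENEWED` as a live state that can be renewed again and treats `EXPIRED` and `REVOKED` as terminal. The names `LeaseStatus`, `ALLOWED`, and `can_transition` are hypothetical.

```python
from enum import Enum


class LeaseStatus(Enum):
    ACTIVE = "ACTIVE"
    RENEWED = "RENEWED"
    EXPIRED = "EXPIRED"
    REVOKED = "REVOKED"


# Assumed transition table: live states may be renewed, expire, or be revoked;
# EXPIRED and REVOKED are terminal.
ALLOWED = {
    LeaseStatus.ACTIVE: {LeaseStatus.RENEWED, LeaseStatus.EXPIRED, LeaseStatus.REVOKED},
    LeaseStatus.RENEWED: {LeaseStatus.RENEWED, LeaseStatus.EXPIRED, LeaseStatus.REVOKED},
    LeaseStatus.EXPIRED: set(),
    LeaseStatus.REVOKED: set(),
}


def can_transition(current: LeaseStatus, target: LeaseStatus) -> bool:
    """Return True if the assumed lifecycle allows moving from current to target."""
    return target in ALLOWED[current]
```

Under this table a revoked or expired lease can never become valid again, which is the property the lifecycle state is meant to protect.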
SecretRotationState #
Primary because:
- rotations often span multiple steps:
  - next version created
  - propagated
  - old version retired
SecretAuditEvent #
Primary because:
- access/mutation audit is usually first-class product behavior
Minimal strict primary set #
- AuthSessionState
- SecretVersionState
- LeasedCredential
- LeaseState
- SecretRotationState
- SecretAccessPolicy
- SecretAuditEvent
- PartitionOwnership
- PartitionMap
Step 4 - Hard Invariants #
For a Vault-class secret management service, the hard invariants are about correct policy-gated secret access, one authoritative current secret version, valid lease lifecycle, and safe rotation/revocation.
| Path | Tier | Type | Invariant statement |
|---|---|---|---|
| P1 authenticate | HARD | eligibility | Action create_auth_session is valid only if presented auth method, identity, and current SecretAccessPolicy allow session issuance at decision time. |
| P2 write/update static secret | HARD | ordering | Secret-version revisions are ordered by monotonic version within secret path scope. |
| P3 read secret | HARD | eligibility | Decision read_secret(path) is valid only if current AuthSessionState is active and current SecretAccessPolicy allows access to the current SecretVersionState at decision time. |
| P4 issue dynamic credential | HARD | eligibility | Action issue_dynamic_credential is valid only if current session/policy allow issuance and backing secret/backend state is active at decision time. |
| P4 issue dynamic credential | HARD | uniqueness | Key credential_id maps to at most one logical outcome (one issued leased credential) within credential scope. |
| P5 renew lease | HARD | eligibility | Action renew_lease is valid only if current LeaseState is active and renewable under current policy and backend state at decision time. |
| P6 revoke lease | HARD | eligibility | Action revoke_lease is valid only if current LeaseState is revocable at decision time. |
| P7 rotate secret/backend credential | HARD | eligibility | Action advance_rotation_state is valid only if current SecretRotationState lifecycle allows the transition at decision time. |
| P8 update access policy | HARD | ordering | Access-policy revisions are ordered by monotonic version within policy scope. |
| P9 emit audit event | HARD | accounting | SecretAuditEvent corresponds to an actual secret read, write, issuance, renewal, revoke, or rotation action committed by the service. |
| P10 route to shard owner | HARD | uniqueness | Key shard_id maps to at most one logical outcome (the current authoritative owner) within shard_id scope. |
| P11 reassign shard ownership | HARD | eligibility | Action reassign_shard is valid only if current owner is failed or relinquished and candidate owner is eligible and sufficiently current on shard_id at decision time. |
What matters most #
1. Secret reads are policy-gated current-state reads #
Wrong allow/deny or stale policy is a security failure.
2. Lease lifecycle is authoritative for dynamic credentials #
Renew/revoke/expire must reflect current state.
3. Rotation is a managed lifecycle #
New secret material must become current without losing auditability or exposing partial state.
4. Audit must reflect committed actions #
Audit records should correspond to actual secret access or mutation events, not attempted but uncommitted actions.
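The first and fourth points can be sketched together: a read is served only when the session is active and the current policy allows it, and an audit record is appended only for a read that is actually served. This is a minimal in-memory illustration, not a real implementation; `ShardState`, `read_secret`, and the dictionary layouts are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ShardState:
    """Hypothetical in-memory stand-in for one shard owner's authoritative state."""
    sessions: dict = field(default_factory=dict)  # session_id -> (principal, active)
    policy: dict = field(default_factory=dict)    # (principal, path) -> allowed actions
    secrets: dict = field(default_factory=dict)   # path -> (version, value)
    audit: list = field(default_factory=list)     # append-only committed actions


def read_secret(state: ShardState, session_id: str, path: str):
    """Return (value, version) if the read is allowed, else None."""
    principal, active = state.sessions.get(session_id, (None, False))
    if not active:
        return None  # no active session: deny
    if "read" not in state.policy.get((principal, path), set()):
        return None  # current policy does not allow the read
    if path not in state.secrets:
        return None  # nothing to disclose
    version, value = state.secrets[path]
    # The audit record is appended only for a read that is actually served.
    state.audit.append(("read", principal, path, version))
    return value, version
```

In this sketch denied attempts leave no audit record, matching the invariant as stated here; a deployment that also wants to log denials would record them as a separate event type.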
Step 5 - Execution Context #
For the baseline secret-management service:
| Field | Value | Why |
|---|---|---|
| Topology | single service distributed | one logical secret-management system spread across auth, secret, and lease nodes |
| Write coordination scope | per object scope | correctness is per secret path, lease, policy, and shard ownership scope |
| Read consistency target | strong only | stale secret or policy reads are security-critical |
| Holder model | client | clients temporarily hold auth sessions and leased credentials |
| Compensation acceptable? | No | wrong secret disclosure or stale credential validity cannot be safely repaired afterward |
Derived implications #
- holder_may_crash = true: clients can disappear while holding leases or sessions
- cross_service_write = false: baseline keeps auth, secret, lease, and policy state in one logical service
- bounded_staleness_allowed = false: secret access and lease validation should use authoritative state
- cross_service_atomicity_required = false: no multi-service transaction in baseline
- exclusive_claim_required = true: shard ownership must be exclusive
- guarded_by_current_state = true: auth, lease, revoke, and rotation all depend on current state
What this implies #
This pushes us toward:
- one authoritative owner per tenant/path shard
- current-value secret and policy state
- lease-backed dynamic credentials
- append-only audit records
Step 6 - Deterministic Mechanism Selection #
| Path | Write shape | Base mechanism | Required companions |
|---|---|---|---|
| P1 authenticate | guarded state transition | CAS on (state, version) or single writer per shard | auth token/session version |
| P2 write/update static secret | overwrite current value | CAS on version | secret version |
| P3 read secret | read source | direct source read | active session, policy version |
| P4 issue dynamic credential | append-only event guarded by current state | leased issuance under active backend state | lease id, expiry |
| P5 renew lease | guarded state transition | CAS on (state, version) | lease version |
| P6 revoke lease | guarded state transition | CAS on (state, version) | lease version |
| P7 rotate secret/backend credential | guarded state transition | lifecycle transition | rotation version/state |
| P8 update access policy | overwrite current value | CAS on version | policy version |
| P9 emit audit event | append-only event | append log | audit correlation id |
| P10 route to shard owner | exclusive claim | lease | fencing token, heartbeat |
| P11 reassign shard ownership | guarded state transition | CAS on (state, version) | fencing token, shard catch-up check |
Why these fit #
Static secrets and policies #
These are current-value state, so overwrite fits.
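As a rough illustration of overwrite guarded by CAS on version, the sketch below keeps one current value per path and bumps a monotonic version on every accepted write. It assumes a single writer per shard, so no locking is shown; `SecretStore`, `put_secret`, and `get_secret` are hypothetical names.

```python
class SecretStore:
    """Hypothetical single-writer store for one shard's static secrets."""

    def __init__(self):
        self._current = {}  # path -> (version, value)

    def put_secret(self, path, value, expected_version=None):
        """Overwrite the current value; return the new version, or None on conflict."""
        current_version = self._current.get(path, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            return None  # stale writer loses the compare-and-set
        new_version = current_version + 1  # versions are monotonic per path
        self._current[path] = (new_version, value)
        return new_version

    def get_secret(self, path):
        """Return (version, value) for the current secret, or None if absent."""
        return self._current.get(path)
```

A caller that passes `expected_version` gets optimistic concurrency; a caller that omits it performs an unconditional overwrite, which still advances the version.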
Dynamic credentials #
Issuance is an immutable event, but current validity is captured by LeaseState, so issue plus guarded lease lifecycle fits.
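One way to picture the split between the immutable issuance fact and the mutable lease lifecycle is sketched below. It is an in-memory illustration under assumptions: the class names mirror the state objects in this note, while `LeaseManager`, its methods, the id format, and the choice to mark a renewed lease as `RENEWED` are hypothetical.

```python
import itertools
from dataclasses import dataclass

LIVE = {"ACTIVE", "RENEWED"}  # assumed set of states in which a lease is still valid


@dataclass(frozen=True)
class LeasedCredential:
    """Immutable issuance fact."""
    credential_id: str
    backend: str
    role: str
    issued_at: float


@dataclass
class LeaseState:
    """Mutable lifecycle state for one issued credential."""
    lease_id: str
    credential_id: str
    status: str
    expiry: float
    version: int = 1


class LeaseManager:
    def __init__(self):
        self.issued = []  # append-only list of LeasedCredential
        self.leases = {}  # lease_id -> LeaseState
        self._ids = itertools.count(1)

    def issue(self, backend, role, ttl, now):
        n = next(self._ids)
        cred = LeasedCredential(f"cred-{n}", backend, role, now)
        lease = LeaseState(f"lease-{n}", cred.credential_id, "ACTIVE", now + ttl)
        self.issued.append(cred)
        self.leases[lease.lease_id] = lease
        return cred, lease

    def renew(self, lease_id, ttl, expected_version, now):
        lease = self.leases[lease_id]
        # Guarded transition: CAS on (state, version); a lapsed lease cannot be renewed.
        if lease.status not in LIVE or lease.version != expected_version or now >= lease.expiry:
            return False
        lease.status, lease.expiry = "RENEWED", now + ttl
        lease.version += 1
        return True

    def revoke(self, lease_id, expected_version):
        lease = self.leases[lease_id]
        if lease.status not in LIVE or lease.version != expected_version:
            return False
        lease.status = "REVOKED"
        lease.version += 1
        return True
```

The issuance record never changes after it is appended; only the `LeaseState` entry moves, and every move is rejected if the caller holds an old version.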
Rotation #
Rotation is lifecycle-managed current state, so guarded transition fits.
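A staged rotation could be modeled as a strictly ordered state machine, sketched below. The stage names are assumptions derived from the steps listed earlier in this note (next version created, propagated, old version retired), with a hypothetical `IDLE` starting stage; `RotationState.advance` is likewise illustrative.

```python
# Assumed stage order: next version created -> propagated -> old version retired.
ROTATION_ORDER = ["IDLE", "NEXT_CREATED", "PROPAGATED", "OLD_RETIRED"]


class RotationState:
    def __init__(self):
        self.stage = "IDLE"
        self.version = 0

    def advance(self, target, expected_version):
        """Move to the immediately following stage only; reject skips and stale callers."""
        idx = ROTATION_ORDER.index(self.stage)
        is_next = idx + 1 < len(ROTATION_ORDER) and ROTATION_ORDER[idx + 1] == target
        if not is_next or expected_version != self.version:
            return False
        self.stage = target
        self.version += 1
        return True
```

Because each step must name both the next stage and the version it observed, two operators racing on the same rotation cannot both advance it.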
Audit #
Audit records are immutable facts, so append-only fits.
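An append-only audit log with retry-safe appends might look like the sketch below, which uses the audit correlation id as a dedupe key. `AuditLog` and its field names are hypothetical, and a real log would be durable rather than in memory.

```python
class AuditLog:
    """Hypothetical in-memory append-only audit log with correlation-id dedupe."""

    def __init__(self):
        self._events = []   # append-only; entries are never updated or removed
        self._seen = set()  # correlation ids already recorded

    def append(self, correlation_id, action, target):
        """Record a committed action once; return False for a duplicate retry."""
        if correlation_id in self._seen:
            return False
        self._seen.add(correlation_id)
        self._events.append({
            "seq": len(self._events) + 1,
            "correlation_id": correlation_id,
            "action": action,
            "target": target,
        })
        return True

    def events(self):
        """Return a read-only snapshot of the log."""
        return tuple(self._events)
```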
Canonical substrate implied #
The baseline now points to:
- sharded secret-management service
- one owner per tenant/path shard
- current secret and policy state
- leased dynamic credentials
- append-only audit trail
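The "one owner per shard" property relies on the fencing token listed as a companion for shard ownership. A minimal sketch of the check, assuming each new owner is handed a strictly larger integer token, is shown below; `FencedStore` is a hypothetical name.

```python
class FencedStore:
    """Hypothetical shard store that rejects writes from stale owners."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, fencing_token: int, key: str, value: str) -> bool:
        # Reject any writer whose token is older than one already seen.
        if fencing_token < self.highest_token:
            return False
        self.highest_token = fencing_token
        self.data[key] = value
        return True
```

An old owner that wakes up after failover still holds its earlier token, so its writes are refused even if it has not yet noticed that it lost the shard.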
Step 7 - Read Model / Source of Truth #
For a Vault-class system, truth is mostly direct source state. Inventory/status views are derived.
| Concept | Truth | Read path | Rebuild path |
|---|---|---|---|
| C1 auth/session validity | AuthSessionState | read source directly | authoritative session store |
| C2 current static secret version | SecretVersionState | read source directly | authoritative secret store |
| C3 issued dynamic credential history | LeasedCredential | read source directly | authoritative issuance store |
| C4 current lease lifecycle | LeaseState | read source directly | authoritative lease store |
| C5 current rotation lifecycle | SecretRotationState | read source directly | authoritative rotation store |
| C6 current access policy | SecretAccessPolicy | read source directly | authoritative policy store |
| C7 audit history | SecretAuditEvent | read source directly | authoritative audit store |
| C8 shard ownership | PartitionOwnership | read source directly | authoritative ownership store |
| C9 shard routing map | PartitionMap | read source directly | authoritative routing metadata |
| C10 inventory / lease dashboards | derived from secret, lease, and audit state | materialized view | recompute from authoritative state |
Important point #
For the core semantics:
- secret reads and lease actions use authoritative source state
- audit is source truth, not a derived UI-only artifact
- dashboards are projections
Step 8 - Failure Handling #
| Path | Retry | Competing writers | Crash after commit | Publish failure | Stale holder |
|---|---|---|---|---|---|
| P1 authenticate | retry with auth flow/session version | stale auth transition loses guarded update | committed session survives crash if persisted | client may not receive token after creation | stale shard owner blocked by fencing token |
| P2 write secret | retry with secret version | stale update loses CAS | committed secret version survives crash if persisted | audit emission may lag but must reflect committed write | stale shard owner blocked by fencing token |
| P3 read secret | read retry safe | many readers coexist | node crash drops request only | audit append may retry | stale reads forbidden beyond consistency bound |
| P4 issue dynamic credential | retry may issue multiple credentials unless request correlation or UX contract handles it | issuance must be fenced by current session/policy/backend state | committed issuance survives crash if persisted | client may fail to receive returned credential | stale issuer blocked by ownership/version discipline |
| P5 renew lease | retry with lease version | stale renew loses guarded transition | committed renewal survives crash if persisted | audit emission may lag | old lease version rejected |
| P6 revoke lease | retry with lease version | stale revoke loses guarded transition | committed revoke survives crash if persisted | backend revocation side effect may retry | old lease version rejected |
| P7 rotate secret/backend credential | retry with rotation version | stale transition loses guarded update | committed rotation survives crash if persisted | downstream consumers may lag in pickup | n/a |
| P8 update policy | retry with policy version | stale update loses CAS | committed policy survives crash if persisted | evaluator caches may lag | n/a |
| P9 emit audit event | retry with audit correlation id | duplicates should be deduped or tolerated by audit sink | committed audit survives crash if durable | external sink lag acceptable if source audit persists | n/a |
| P10 route to shard owner | retry after refreshing shard map | only one valid owner should exist | if owner changed, refreshed map points to new owner | n/a | stale owner rejected by fencing token |
| P11 reassign shard ownership | retry failover transition safely | only one reassignment wins current ownership state | promoted owner crash triggers later reassignment | n/a | old owner fenced and must not continue serving |
What matters most #
1. Secret disclosure is the irreversible action #
Wrong disclosure cannot be compensated later.
2. Dynamic credential issuance and delivery are separate #
The service may issue a leased credential even if the client never receives it.
3. Rotation needs staged lifecycle #
New material must be activated and old material retired safely.
4. Revocation side effects may involve external systems #
But the secret service’s source truth is still the authoritative LeaseState.
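Because issuance and delivery are separate, a client that never received its credential will usually retry. One possible way to keep that retry from minting a second credential is request correlation, sketched below. The `request_id` parameter is an assumption (it is not part of the API listed in this note), and `Issuer` is a hypothetical name. Returning the stored credential on retry is itself a trade-off, since the service must retain the credential material for the dedupe window.

```python
import secrets


class Issuer:
    """Hypothetical issuer that deduplicates retries by client-supplied request id."""

    def __init__(self):
        self._by_request = {}  # request_id -> previously issued result
        self.issued_count = 0

    def issue(self, request_id, backend, role):
        # A retry with the same request id returns the original result
        # instead of issuing a second credential.
        if request_id in self._by_request:
            return self._by_request[request_id]
        self.issued_count += 1
        result = {
            "lease_id": f"lease-{self.issued_count}",
            "backend": backend,
            "role": role,
            "credential": secrets.token_urlsafe(16),  # placeholder credential material
        }
        self._by_request[request_id] = result
        return result
```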
Step 9 - Scale Adjustments #
| Hotspot | Type | First response |
|---|---|---|
| high secret-read QPS | read hotspot | shard by tenant/path and cache only with strict TTL/version bounds where allowed |
| dynamic lease churn | write throughput hotspot | batch renewals where possible and tune lease durations |
| audit volume | write throughput hotspot | separate append-only audit pipeline from hot secret path |
| mass rotation events | contention hotspot | rate-limit rotation rollout and stage by scope |
| policy churn | control-plane hotspot | scope policy changes narrowly and propagate incrementally |
| inventory/status queries | read hotspot | serve from derived views, not hot secret stores |
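For the first row, "strict TTL/version bounds" could mean that a cached value is served only while it is both young enough and still at the authoritative version. The sketch below assumes the caller can cheaply obtain the current version from the shard owner; `BoundedCache` and its parameters are hypothetical.

```python
class BoundedCache:
    """Hypothetical read cache bounded by both age and secret version."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._entries = {}  # path -> (version, value, cached_at)

    def put(self, path, version, value, now):
        self._entries[path] = (version, value, now)

    def get(self, path, current_version, now):
        """Return the cached value, or None if missing, too old, or superseded."""
        entry = self._entries.get(path)
        if entry is None:
            return None
        version, value, cached_at = entry
        if now - cached_at > self.ttl or version != current_version:
            del self._entries[path]  # evict rather than risk serving stale material
            return None
        return value
```

The version check still costs a round trip to authoritative state, so this only helps when comparing versions is much cheaper than fetching and decrypting the secret itself.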
What scales well #
This system scales by:
- sharding secrets by tenant/path
- keeping lease records compact
- separating audit ingestion from secret read latency
- using dynamic credentials to reduce long-lived static secret exposure
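Sharding by tenant/path can be sketched as a stable hash of the pair followed by a lookup in the partition map. The hashing scheme, the function names `shard_for` and `owner_for`, and the map layout are assumptions for illustration; plain modulo hashing reshuffles keys when the shard count changes, so a fixed slot count or consistent hashing would be needed in practice.

```python
import hashlib


def shard_for(tenant: str, path: str, num_shards: int) -> int:
    """Map a tenant/path pair to a shard id with a stable hash."""
    key = f"{tenant}/{path}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def owner_for(tenant: str, path: str, partition_map: dict) -> str:
    """Look up the current owner node for the shard that holds this path."""
    return partition_map[shard_for(tenant, path, len(partition_map))]
```

Here `partition_map` is assumed to map every shard id in `range(num_shards)` to the node that currently owns it.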
What fails first #
Usually:
- very hot secret paths
- lease-renewal storms
- audit write amplification
- poorly staged bulk rotations
Canonical design conclusion #
The mechanical outcome is:
- primary state:
  - AuthSessionState
  - SecretVersionState
  - LeasedCredential
  - LeaseState
  - SecretRotationState
  - SecretAccessPolicy
  - SecretAuditEvent
  - PartitionOwnership
  - PartitionMap
- critical invariants:
  - guarded auth/session access
  - one authoritative current secret version per path
  - dynamic credential issuance valid only under current policy/backend state
  - renew/revoke valid only for current lease state
  - rotation is a first-class lifecycle transition
  - audit corresponds to committed actions
- mechanisms:
  - overwrite current value for static secrets and policy
  - append issuance and audit records
  - guarded lease and rotation transitions
  - fenced shard ownership
- reads:
  - direct authoritative reads for secret and lease truth
  - derived views for inventory and status
Polished interview answer #
I’d build the secret-management system as a sharded strongly consistent service with one authoritative owner per tenant or secret-path shard. Static secrets are stored as current versioned values behind strict policy checks, while dynamic credentials are issued as leased immutable credentials backed by authoritative lease state. Reads, renewals, revocations, and rotations all operate on current policy, session, secret, and lease truth, and every committed action emits an append-only audit record. The main scaling levers are more shards, compact lease records, a separate audit pipeline, staged rotation workflows, and careful limits on caching for security-critical reads.
Concrete Substrate #
I’ll choose a sharded strongly consistent secret-management service with current secret state, lease-backed dynamic credentials, and append-only audit records as the concrete baseline, because it matches the mechanics we derived:
- current-value secret and policy state
- lease-backed dynamic credentials
- guarded rotation and revocation
- append-only audit
- one owner per shard
Concrete tech family:
- secret service in Go or Rust
- authoritative state in a replicated metadata store or service-owned Raft state machine
- secure storage/encryption:
  - envelope encryption with KMS/HSM-backed master keys
- metadata/control:
  - built-in Raft consensus per shard or a small etcd-like control layer
Each shard leader stores:
- current AuthSessionState
- current SecretVersionState
- LeasedCredential issuance references
- current LeaseState
- current SecretRotationState
- current SecretAccessPolicy
- append-only SecretAuditEvent
Operation Layer #
1. Write static secret #
API
PutSecret(path, secret_value, expected_version?)
Initiator
- admin
Entry point
- secret API
Authoritative decider
- shard owner for path
Precondition
- caller session active
- policy allows write
- version matches if optimistic concurrency used
Transition
- overwrite SecretVersionState(path)
- append audit event
Response
{version}
2. Read secret #
API
GetSecret(path)
Initiator
- client
Entry point
- secret API
Authoritative decider
- shard owner for path
Precondition
- caller session active
- policy allows read
Transition
- none on source truth
- append audit event
Response
{secret_value, version}
3. Issue dynamic credential #
API
IssueDynamicCredential(backend, role, ttl)
Initiator
- client
Entry point
- dynamic-secret API
Authoritative decider
- shard owner for backend/path plus backend plugin state
Precondition
- caller session active
- policy allows issuance
- backend active
Transition
- append LeasedCredential
- create LeaseState = ACTIVE(expiry)
- append audit event
Response
{credential, lease_id, expiry}
4. Renew or revoke lease #
API
RenewLease(lease_id, ttl) / RevokeLease(lease_id)
Initiator
- client or system
Entry point
- lease API
Authoritative decider
- shard owner for lease
Precondition
- current lease state allows transition
Transition
- guarded update of LeaseState
- append audit event
Response
{expiry} or {revoked: true}
5. Rotate secret or backend root #
API
RotateSecret(path_or_backend, action, expected_version?)
Initiator
- admin/system
Entry point
- rotation API
Authoritative decider
- shard owner for path/backend
Precondition
- current rotation state allows transition
Transition
- guarded update of SecretRotationState
- update current SecretVersionState or backend credential linkage
- append audit event
Entry Point vs Decider vs Responder #
| Path | Entry point | Authoritative decider | Physical responder | Logical responder |
|---|---|---|---|---|
| authenticate / secret read/write | secret API | shard owner | API node | secret service |
| dynamic credential issuance | dynamic-secret API | shard owner + backend state | API node | secret service |
| renew / revoke lease | lease API | shard owner | API node | secret service |
| rotate secret | rotation API | shard owner | control/API node | secret service |
| shard failover | follower / coordination layer | shard quorum / lease store | new leader / control plane | secret service |
Concrete HLD #
Main components:
- auth/session frontend
  - authenticates callers and creates sessions/tokens
- secret shard owners
  - authoritative owners for secret, lease, and policy truth
- dynamic-secret backend plugins
  - generate leased credentials for databases/cloud systems/etc.
- audit pipeline
  - stores append-only committed audit events
- metadata/control service
  - tracks shard ownership and routing
- encryption/KMS layer
  - protects stored secret material