
Platform Regeneration Analysis #

An Artificial Chemistry Toolkit for Platform Engineers #


Why This Document Exists #

Every platform team eventually hits the same wall: the architecture diagram says one thing, production behaves differently, and nobody can explain why adding a simple new service caused a cascade of failures last Tuesday.

Traditional architecture frameworks describe what a system is. This toolkit describes how a system lives — how it sustains itself, where it’s fragile, how it evolves, and why some changes are safe while others trigger instability.

The framework comes from Artificial Chemistry (AChem), a field that studies how autonomous components with local interaction rules self-organize into persistent structures. It turns out that modern platforms — microservice ecosystems, Kubernetes clusters, internal developer platforms — behave like chemical systems whether we design them that way or not.

This is not a theoretical exercise. Every concept maps to a concrete diagnostic question, and every diagnosis leads to a specific architectural intervention.

The document builds from first principles. Part 1 establishes the chemical vocabulary. Part 2 introduces the three properties that determine system behavior. Part 3 provides the core diagnostic tool. Parts 4 and 5 catalogue failure modes and evolutionary dynamics. If you’re already familiar with AChem concepts, skip to Part 3.


Part 1: Seeing Your Platform as a Chemistry #

Before we can diagnose anything, we need to learn to see a platform the way a chemist sees a reactor. This requires identifying four things: the molecules, the reactions, the reaction rules, and the reactor itself.

Molecules: The Autonomous Units #

In chemistry, a molecule is a self-contained entity with defined properties and interaction surfaces. In a platform, a molecule is any component that:

  • Has its own identity and lifecycle
  • Exposes interfaces that other components can interact with
  • Can exist independently (even if it’s useless alone)

What counts as a molecule depends on the level of analysis. At the infrastructure level, Pods, Services, Deployments, ConfigMaps, and CRDs are molecules. At the application level, microservices, databases, message queues, and caches are molecules. At the platform level, CI pipelines, operators, monitoring systems, and even human roles are molecules.

The first step in any analysis is to ask: What are the molecules in this system?

This is not the same as listing components on an architecture diagram. Architecture diagrams often conflate molecules with their containers (a “service” box might contain three distinct molecules: the application, its sidecar, and its config), or omit molecules that aren’t part of the designed system but participate in its behavior (the on-call engineer, the shared CI runner, the DNS service).

A good molecule inventory is exhaustive and honest. Include everything that participates in the system’s behavior, whether it was designed to or not.

Practical exercise: For a subsystem you own, list every component that participates in its lifecycle — from code commit to production operation to failure recovery. Include infrastructure, tooling, and human roles. You’ll typically find 10-15 molecules where your architecture diagram shows 3-5.

Reactions: What Happens When Molecules Meet #

In chemistry, a reaction occurs when molecules interact and produce something — a new molecule, a change of state, or a release of energy. In a platform, a reaction occurs whenever one component’s behavior causes a change in another.

Examples of platform reactions:

  • A Git push triggers a CI pipeline run → reaction between Git repo and CI pipeline
  • A CI pipeline creates a container image → reaction producing a new molecule
  • A Deployment controller detects fewer Pods than desired and creates new ones → reaction restoring equilibrium
  • An HPA reads metrics and scales a Deployment → reaction modifying an existing molecule
  • A monitoring system detects a failure and alerts a human → reaction crossing the automation boundary
  • An on-call engineer diagnoses a bug and commits a fix → reaction feeding back into the regeneration cycle

The crucial insight: nobody orchestrates these reactions centrally. Each reaction is a local interaction between molecules following local rules. The CI pipeline doesn’t know about the HPA. The monitoring system doesn’t know about the CI pipeline. Global system behavior emerges from local reactions.

This is exactly how chemistry works. No central brain tells molecules when to react. The reactor provides conditions, molecules collide, and reactions happen according to local rules.

Practical exercise: For the same subsystem, list every interaction between the molecules you identified. For each, note: what triggers the reaction, what it produces, and whether it’s automated or requires human intervention.

Reaction Rules: What Governs Interactions #

A reaction rule defines when a reaction occurs and what it produces. In chemistry, reaction rules are governed by physics — molecular compatibility, energy levels, catalysts. In a platform, reaction rules are encoded in:

  • Controller logic: The reconciliation loop that watches desired state vs. actual state and acts on the difference
  • Pipeline definitions: The steps that transform code into running services
  • Operator code: The custom logic that manages complex stateful systems
  • Event handlers: The code that responds to messages and triggers downstream actions
  • Human procedures: The runbooks, escalation policies, and institutional knowledge that govern human reactions

A critical observation: in an ideal system, reaction rules would be inspectable, composable, and analyzable. You could look at the complete set of rules and predict system behavior. In practice, platform reaction rules are scattered across controller binaries, pipeline YAML, operator code, and human heads. They’re written in different languages, operate at different abstraction levels, and interact in ways nobody fully understands.

This fragmentation of reaction rules is one of the deepest structural problems in platform engineering. It means that no one can predict the global behavior of the system from its parts — not because the system is inherently unpredictable, but because the reaction rules are opaque and dispersed.

Practical exercise: For each reaction you identified, locate where the reaction rule lives. Is it in a controller? A pipeline? A runbook? A person’s head? How many different “languages” are your reaction rules written in?

The Reactor: Where Chemistry Happens #

The reactor is the environment that hosts molecules and enables reactions. In chemistry, it’s the vessel — a test tube, a cell membrane, a planet’s atmosphere. In a platform, the reactor is:

  • The Kubernetes control plane and scheduler
  • The event bus or message broker
  • The CI/CD infrastructure
  • The cloud provider’s compute and networking layer

The reactor’s properties profoundly affect system behavior. The most important property is mixing:

Well-mixed reactor: Every molecule is visible to every other molecule. Kubernetes with flat namespaces is a well-mixed reactor — any controller can watch any resource, any service can call any other service. Well-mixed reactors enable serendipitous reactions (useful combinations that nobody planned) but also unintended reactions (failures that nobody predicted).

Compartmentalized reactor: Molecules are grouped into compartments. Interactions within a compartment are free; interactions across compartments are controlled and explicit. VPC-isolated environments, strongly enforced namespace boundaries, and cell-based architectures are compartmentalized reactors. They prevent unintended reactions but make intended cross-boundary interactions harder.

Flow reactor: Molecules enter, products leave, nothing recycles. A stateless CI/CD pipeline is a flow reactor — code goes in, artifacts come out, no state persists. Flow reactors are predictable but can’t self-organize.

Most real platforms are hybrids. The Kubernetes control plane is well-mixed. The network layer might be compartmentalized. The CI/CD path is a flow reactor. Understanding which parts of your platform use which reactor type is essential for predicting where emergent behavior will appear (well-mixed regions) and where rigidity will frustrate change (compartmentalized regions).

Practical exercise: For your platform, identify which regions are well-mixed, which are compartmentalized, and which are flow reactors. Are the reactor types appropriate for each region’s risk tolerance?

Catalysts: Enablers That Aren’t Consumed #

In chemistry, a catalyst enables a reaction without being consumed by it. It lowers the energy barrier, making reactions happen that wouldn’t otherwise occur, and it’s still there afterward to enable the next reaction.

In a platform, catalysts are:

  • Controllers that watch for state changes and create or modify resources
  • Operators that manage complex lifecycle operations
  • Sidecars that provide cross-cutting capabilities (logging, security, networking)
  • Admission controllers that validate or mutate resources as they’re created

A pure catalyst is stateless and local — it enables a specific reaction between specific molecules and doesn’t accumulate global knowledge or control. This is the ideal. In practice, many platform catalysts violate this ideal by growing into centralized decision engines (see Catalyst Brain Formation in Part 5).

The diagnostic question for catalysts: Does this component enable reactions, or does it control them? Enablers are healthy catalysts. Controllers-that-control-everything are brains disguised as catalysts.

Putting It Together: The AChem Mapping #

| AChem Concept | Platform Equivalent | Key Question |
|---|---|---|
| Molecule | Service, pod, CRD, pipeline, database, human role | What are the autonomous units? |
| Reaction | State change caused by one component affecting another | What happens when components interact? |
| Reaction rule | Controller logic, pipeline definition, operator code, runbook | What governs when and how reactions occur? |
| Reactor | Control plane, event bus, scheduler, cloud substrate | What environment hosts the molecules? |
| Catalyst | Controller, operator, sidecar, admission webhook | What enables reactions without being consumed? |
| Reactor mixing | Well-mixed / compartmentalized / flow | Which components can see and interact with which others? |
```mermaid
graph TB
  subgraph Reactor["🧪 Reactor (Control Plane, Event Bus)"]
    subgraph Molecules["Molecules (Autonomous Components)"]
      M1[Service]
      M2[Pod]
      M3[CRD]
      M4[Pipeline]
    end
    subgraph Catalysts["⚡ Catalysts (Enable Reactions)"]
      C1[Controller]
      C2[Operator]
      C3[Sidecar]
    end
    M1 -->|Reaction<br/>governed by| RR1[Reaction Rule:<br/>Controller Logic]
    M2 -->|Reaction<br/>governed by| RR2[Reaction Rule:<br/>Pipeline Def]
    C1 -.->|enables| M1
    C2 -.->|enables| M2
  end
```

Once you can see your platform through this mapping, the diagnostic tools in the rest of this document become natural rather than abstract.


Part 2: Three Properties That Determine System Behavior #

With the chemical vocabulary established, we can now describe three properties that determine whether a platform subsystem is stable, fragile, evolvable, or chaotic. These three properties are the foundation for every diagnostic and intervention in this toolkit.

Closure #

A set of components is closed if every interaction between them produces something that stays within the set. Nothing leaks out. Nothing unexpected escapes the boundary.

Most platform teams think about encapsulation in terms of static boundaries — API contracts, service meshes, namespace isolation. Closure is the dynamic version: when these components interact at runtime, do the effects stay within the boundary?

Consider a seemingly well-encapsulated microservice. It has clean APIs, its own database, and well-defined event contracts. But at runtime: an error condition triggers a log entry, the log entry matches a monitoring alert rule, the alert pages a human, and the human modifies a different service’s configuration to work around the issue. The reaction products (the alert, the page, the manual config change) escaped the boundary. The subsystem is not closed.

This matters because systems that aren’t closed require external energy — human intervention, cross-team coordination, other systems stepping in — to remain stable. The more organizational leakage, the more external energy required, and the less resilient the system becomes.

Diagnostic question: For a subsystem, trace every output it produces at runtime — not just designed outputs, but error events, logs, metrics, alerts, manual interventions triggered. If any of those outputs cause changes outside the subsystem boundary, the system is not closed. Map where the leaks are.
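
The closure test above can be mechanized. The sketch below is a minimal illustration, not a real tool: component and reaction names are hypothetical, and "reactions" here are just (source, effect target, description) triples you would collect by tracing runtime outputs.

```python
# Closure check: do all runtime effects of reactions originating inside
# the subsystem boundary stay inside it? All names are hypothetical.

def closure_leaks(members, reactions):
    """reactions: iterable of (source, effect_target, description).
    Returns reactions whose source is inside the boundary but whose
    effect lands outside it — each one is a hole in the membrane."""
    members = set(members)
    return [(src, dst, desc) for src, dst, desc in reactions
            if src in members and dst not in members]

subsystem = {"payment-service", "payment-db", "payment-events"}
observed = [
    ("payment-service", "payment-db", "writes transaction rows"),
    ("payment-service", "payment-events", "publishes PaymentCompleted"),
    ("payment-service", "on-call-engineer", "error log matches alert rule"),
]

for src, dst, desc in closure_leaks(subsystem, observed):
    print(f"LEAK: {src} -> {dst} ({desc})")
# A non-empty result means the subsystem is not closed.
```

The value is not the code itself but the discipline it forces: you cannot run the check without first writing down every runtime output, which is exactly the tracing exercise described above.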

Self-Maintenance #

A self-maintaining system continuously produces the components it needs to survive. If a component degrades, other components in the set recreate it.

Kubernetes reconciliation is the canonical example: a ReplicaSet detects a missing Pod and creates a new one. But self-maintenance applies far beyond Kubernetes controllers.

The diagnostic question is:

If I walk away from this subsystem for six months, will it still be running?

If yes, it’s self-maintaining. The components produce and regenerate each other in a closed cycle.

If no, identify what decays. That component isn’t being regenerated by anything else in the set. It requires external energy — typically a human — to persist.

A strict definition matters here. A system with humans in its regeneration cycle is not self-maintaining — it’s an externally maintained subsystem. This is not a failure; many systems legitimately require human involvement. But calling it “self-maintaining” when it isn’t masks operational risk and prevents clear thinking about automation priorities.

Diagnostic question: For each component in a subsystem, ask: “What produces this? What maintains this? If it disappears, does anything in the set recreate it?” Components with no answers to these questions are fragility points.
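
The "what produces this?" question reduces to a graph query: any component with no incoming regeneration edge decays. A minimal sketch, with hypothetical component names:

```python
# Self-maintenance check: which components have nothing inside the set
# that creates or maintains them? Edges are (producer, product) pairs.

def unregenerated(components, regen_edges):
    """Return components with no incoming creates/maintains edge."""
    regenerated = {product for _, product in regen_edges}
    return sorted(set(components) - regenerated)

components = ["deployment", "pods", "database", "ci-pipeline"]
regen = [
    ("ci-pipeline", "deployment"),  # CD path creates the Deployment
    ("deployment", "pods"),         # controller recreates Pods
]

print(unregenerated(components, regen))  # → ['ci-pipeline', 'database']
```

Everything in the output list persists only because of external energy — usually a human.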

Constructive Dynamics #

In a conservative system, the set of component types is fixed. Services interact in known ways, and nothing fundamentally new appears. You deploy a set of services, they run, they interact within the designed space. Most systems are designed to be conservative.

In a constructive system, interactions produce new types — things that didn’t exist before with their own novel behaviors. Kubernetes with CRDs and operators is constructive: teams can define entirely new resource types and new controllers that manage them. The npm ecosystem is constructive: new packages introduce new abstractions that change how other packages are used.

Constructive dynamics are powerful — they allow a platform to evolve beyond its original design. They’re also dangerous — the number of possible interactions grows combinatorially with the number of component types. At some point, no human can predict system behavior.

The diagnostic question: Can this platform produce new component types that nobody originally designed? If yes, the follow-up is: Are those constructive dynamics constrained or unbounded?

Unconstrained constructive dynamics guarantee eventual chaos. Constrained constructive dynamics enable evolution. The difference is governance — not bureaucracy, but explicit rules about what new types can be introduced, what they can interact with, and how they must declare their interfaces.

Diagnostic question: In the last year, what new component types appeared in your platform that weren’t in the original design? Were those introductions reviewed? Did they declare their interaction interfaces? How many of them introduced unexpected interactions with existing components?


Part 3: The Regeneration Graph #

With the chemical vocabulary (Part 1) and the three key properties (Part 2) established, we can now introduce the core diagnostic tool: the regeneration graph.

What It Is #

A regeneration graph is a directed graph where nodes are molecules (components) and edges represent generative relationships — which components create and maintain which others.

This is fundamentally different from a dependency graph, which shows what a component needs to function. A regeneration graph shows what produces and sustains a component. The distinction is critical: a component can appear in a dependency graph (the system needs it at runtime) but have no incoming edges in the regeneration graph (nothing recreates it if it fails). That component is your highest-risk element.

Edge Types #

A regeneration graph uses two edge types:

  • Creates: Molecule A produces Molecule B (e.g., a Deployment creates Pods, a CI pipeline creates a container image)
  • Maintains: Molecule A keeps Molecule B operational (e.g., HPA maintains Pod count, a database operator maintains replication)

A separate dependency graph uses one edge type:

  • Depends on: Molecule A requires Molecule B at runtime (e.g., Pods depend on Database)

Keep these on separate graphs. Mixing generative and consumption relationships muddies the analysis. The power comes from comparing the two: a component that appears in the dependency graph but has no incoming edges in the regeneration graph is your highest-risk element — the system needs it but can’t recreate it.
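
The comparison between the two graphs is mechanical once both edge lists exist. A minimal sketch (edge data is illustrative, not from a real system):

```python
# Cross-graph comparison: a component that is depended on at runtime
# but never regenerated is the highest-risk element.

def fragility_points(dep_edges, regen_edges):
    """dep_edges: (consumer, dependency) pairs from the dependency graph.
    regen_edges: (producer, product) pairs from the regeneration graph.
    Returns dependencies with no incoming regeneration edge."""
    needed = {dep for _, dep in dep_edges}
    regenerated = {prod for _, prod in regen_edges}
    return sorted(needed - regenerated)

deps = [("pods", "database"), ("deployment", "image")]
regen = [("deployment", "pods"), ("ci-pipeline", "image"), ("hpa", "pods")]

print(fragility_points(deps, regen))  # → ['database']
```

Note that this only works because the two graphs were kept separate: a merged graph would hide exactly the mismatch this query surfaces.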

How to Draw One #

Step 1: Define the subsystem boundary. Choose a domain service, a platform subsystem, or an infrastructure layer. Don’t try to graph the entire platform at once.

Step 2: Perform a molecule inventory. List every component that participates in the subsystem’s lifecycle — from code commit to production operation to failure recovery. Include infrastructure, tooling, pipelines, controllers, and human roles. Be exhaustive. You’ll typically find 10-15 molecules where your architecture diagram shows 3-5.

Step 3: Identify reactions. For each pair of molecules, ask: does one create, maintain, or trigger the other? Document every generative interaction. Note what triggers each reaction and what it produces.

Step 4: Classify reaction rules. For each reaction, identify where the rule governing it lives. Is it a Kubernetes controller? A pipeline definition? Operator code? A human runbook? Note whether the rule is inspectable, versioned, and composable — or opaque and scattered.

Step 5: Draw regeneration edges. Using only Creates and Maintains relationships, draw the directed graph. This is the regeneration graph.

Step 6: Draw the dependency graph separately. Using only Depends On relationships, draw a second directed graph.

Step 7: Analyze the topology. Look for the patterns described below.

Regeneration Graph Patterns #

Seed Molecule. A component with high generative fan-out — it creates or maintains many others. If it dies, the subsystem cannot regenerate. Seed molecules must be highly reliable, simple, and ideally redundant. In many platforms, the CD pipeline is the seed molecule, which is structurally risky if it lives outside the subsystem boundary.

Fragility Point. A component that appears in the dependency graph but has no incoming edges in the regeneration graph. The system needs it but cannot recreate it. Databases without operators are the classic example.

Inert Intermediate. A component that lies on a regeneration path but doesn’t create or maintain anything itself — it’s a pass-through. Container registries, DNS services, certificate authorities, and secret stores are typical inert intermediates. They don’t participate in reactions directly, but their absence breaks the regeneration chain. They are structurally invisible until they fail, at which point everything downstream stops. They are a specific and underappreciated class of fragility.

Self-Maintaining Cycle. A closed loop of regeneration edges where components produce and maintain each other without external input. This is a true organization in AChem terms — a self-sustaining set of molecules. Self-maintaining cycles are natural candidates for domain boundaries because they represent structurally coherent units that can operate independently.

Human-Regenerated Node. Any component whose only incoming regeneration edge comes from a human role. These are operational burdens and automation targets. A system with humans in its critical regeneration cycle is not self-maintaining — it’s an externally maintained subsystem, regardless of how automated the rest looks.
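
Several of these patterns can be scanned for automatically once the regeneration edges are recorded. The sketch below detects two of them — seed molecules (high generative fan-out) and human-regenerated nodes — with hypothetical data and an assumed fan-out threshold:

```python
# Pattern scan over a regeneration graph: seed molecules and
# human-regenerated nodes. Threshold and names are illustrative.
from collections import defaultdict

def scan_patterns(regen_edges, human_roles, fanout_threshold=3):
    fanout = defaultdict(set)    # producer -> products
    incoming = defaultdict(set)  # product -> producers
    for producer, product in regen_edges:
        fanout[producer].add(product)
        incoming[product].add(producer)
    seeds = {n for n, outs in fanout.items() if len(outs) >= fanout_threshold}
    human_only = {n for n, ins in incoming.items()
                  if ins and ins <= set(human_roles)}
    return seeds, human_only

regen = [
    ("cd-pipeline", "deployment"), ("cd-pipeline", "configmap"),
    ("cd-pipeline", "hpa"), ("deployment", "pods"),
    ("on-call", "alert-rules"),
]
seeds, human_only = scan_patterns(regen, human_roles={"on-call"})
print(seeds)       # → {'cd-pipeline'}
print(human_only)  # → {'alert-rules'}
```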

Regeneration Latency #

Not all regeneration edges are equal. Each edge has a characteristic latency — how long it takes for the regeneration to occur:

| Latency Class | Mechanism | Typical Time |
|---|---|---|
| Controller-based | Kubernetes reconciliation, operator loops | Seconds |
| Operator-based | Database operators, cert-manager | Minutes |
| Pipeline-based | CI/CD pipelines | Minutes to hours |
| Human-based | On-call investigation and fix | Hours to days |

A system’s recovery time cannot be faster than the slowest regeneration edge in its critical path. If your regeneration cycle includes a human node with hours-to-days latency, your actual recovery bound is hours to days — even if every automated component acts in seconds.

This connects regeneration analysis directly to SLO thinking: your recovery SLO is constrained by regeneration latency on the critical path. Any SLO target faster than the slowest critical edge is a fiction.
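
The bound itself is a one-liner: recovery cannot beat the slowest mechanism on the critical path. A sketch with representative (assumed, not measured) per-mechanism latencies:

```python
# Recovery bound: the slowest regeneration edge on the critical path
# dominates recovery time. Latencies in seconds; values illustrative.

LATENCY = {
    "controller": 10,
    "operator": 120,
    "pipeline": 1800,
    "human": 6 * 3600,
}

def recovery_bound(critical_path):
    """critical_path: list of (edge_name, mechanism) tuples."""
    return max(LATENCY[mech] for _, mech in critical_path)

path = [
    ("monitoring detects", "controller"),
    ("on-call diagnoses", "human"),
    ("ci rebuilds", "pipeline"),
    ("deployment reconciles", "controller"),
]
print(recovery_bound(path) / 3600, "hours")  # → 6.0 hours
```

Swapping the human edge for an automated one changes the bound by three orders of magnitude, which is the entire argument for automating the critical path first.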

Regeneration Redundancy #

For each critical component, count the number of distinct regeneration paths:

  • Multiple automated paths: Resilient. Component can be recreated even if one path fails.
  • One automated path + one manual backup: Adequate. Manual backup covers automated failure but at higher latency.
  • Single path only: Fragile. If that path breaks, the component cannot be regenerated.
  • No path: Critical fragility. The component exists but nothing in the system can recreate it.

Resilient platforms have redundant regeneration edges for their most critical components. This gives platform teams a concrete metric: regeneration redundancy per component.
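
The four-level scale above can be applied directly once you have the path inventory. A minimal sketch (component names and paths are hypothetical):

```python
# Regeneration redundancy: classify each component by its distinct
# regeneration paths, per the four-level scale above.

def classify(paths):
    """paths: list of path kinds, each 'automated' or 'manual'."""
    automated = [p for p in paths if p == "automated"]
    if len(automated) >= 2:
        return "resilient"            # multiple automated paths
    if len(automated) == 1 and len(paths) >= 2:
        return "adequate"             # automated + manual backup
    if len(paths) >= 1:
        return "fragile"              # single path only
    return "critical fragility"       # nothing can recreate it

inventory = {
    "pods": ["automated", "manual"],  # Deployment + manual kubectl
    "deployment": ["manual"],         # manual kubectl only
    "database": [],                   # nothing recreates it
}
for component, paths in inventory.items():
    print(component, "→", classify(paths))
# pods → adequate, deployment → fragile, database → critical fragility
```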


Part 4: Worked Example #

System: Kubernetes Microservice with CI/CD #

A typical production service: code in Git, CI pipeline builds containers, CD pipeline deploys to a Kubernetes cluster, controllers keep it running, humans intervene on failures.

Step 1: Subsystem Boundary #

Subsystem: payment-service

Step 2: Molecule Inventory #

| Molecule | Type | Role |
|---|---|---|
| Git repository | Source control | Source of truth for service code |
| CI pipeline | Automation | Builds container images from code |
| Container image | Artifact | Immutable build output |
| Container registry | Infrastructure | Stores and serves container images |
| CD pipeline | Automation | Deploys manifests to cluster |
| Kubernetes Deployment | Resource | Declares desired Pod state |
| Pods | Compute | Running service instances |
| HPA | Controller | Scales Pods based on metrics |
| Database | Stateful service | Persistent data store |
| Monitoring | Observability | Detects failures and anomalies |
| On-call engineer | Human role | Diagnoses and fixes issues |

Note: the architecture diagram for this service typically shows 3-4 boxes. The molecule inventory reveals 11 participants in the system’s lifecycle.

Step 3: Identify Reactions #

| Reaction | Trigger | Product |
|---|---|---|
| Git push → CI run | Code change | CI pipeline execution |
| CI execution → image build | Pipeline logic | Container image |
| Image → Registry storage | CI pipeline push | Stored artifact |
| Registry → CD pull | CD pipeline trigger | Image available for deployment |
| CD execution → Deployment update | Pipeline logic | Updated Kubernetes resource |
| Deployment → Pod creation | Controller reconciliation | Running Pod instances |
| HPA → Pod scaling | Metrics threshold | Adjusted Pod count |
| Pod failure → Monitoring alert | Metric anomaly | Alert notification |
| Alert → Human investigation | Alerting rule | Human awareness of issue |
| Human diagnosis → Git fix | Human judgment | Code or config change |

Step 4: Classify Reaction Rules #

| Reaction | Rule Location | Inspectable? | Composable? |
|---|---|---|---|
| Git → CI | Webhook + pipeline YAML | Yes | Partially |
| CI → Image | Pipeline definition | Yes | Partially |
| CD → Deployment | Pipeline or GitOps config | Yes | Partially |
| Deployment → Pods | Kubernetes controller (compiled binary) | No | No |
| HPA → Pods | HPA controller + metrics config | Partially | No |
| Monitoring → Alert | Alert rules (Prometheus, etc.) | Yes | Yes |
| Human → Git fix | Runbook + tribal knowledge | Partially | No |

Observation: reaction rules are scattered across at least six different systems (Git webhooks, CI YAML, Kubernetes controllers, HPA config, alerting rules, human runbooks). No single place describes the complete reaction behavior of this subsystem. This is the “reaction rules written in different dialects” problem.

Step 5: Regeneration Graph (Creates and Maintains Only) #

```mermaid
graph TD
  OnCall[On-call engineer] -->|fixes| Git[Git repo]
  Git -->|triggers| CI[CI pipeline]
  CI -->|creates| Image[Container image]
  Image -->|stored in| Registry[Registry<br/>inert intermediate]
  Registry -->|feeds| CD[CD pipeline]
  CD -->|creates| Deployment[Deployment]
  Deployment -->|creates| Pods[Pods]
  HPA[HPA] -->|maintains| Pods
  Pods -->|metrics| Monitoring[Monitoring]
  Monitoring -->|alerts| OnCall
  style Registry stroke:#ff4444,stroke-width:3px
  style Pods stroke:#44ff44,stroke-width:3px
```

Step 6: Dependency Graph (Separate) #

```mermaid
graph LR
  Pods -->|depends on| Database[(Database)]
  Pods -->|depends on| Monitoring[Monitoring]
  Deployment -->|depends on| Image[Container image<br/>in Registry]
  style Database stroke:#ff4444,stroke-width:3px
  style Pods stroke:#44ff44,stroke-width:3px
```

Step 7: Cross-Graph Diagnosis #

Closure test: The regeneration cycle is: Pods → Monitoring → On-call → Git → CI → Registry → CD → Deployment → Pods. This cycle includes a human node (on-call engineer). Classification: externally maintained subsystem, not a closed organization.

Seed molecule: The CD pipeline. If it stops, no new deployments occur and certain failure classes cannot be recovered. The seed molecule sits outside the cluster boundary — structurally risky because it’s not regenerated by anything inside the subsystem.

Fragility point: The database appears in the dependency graph (Pods depend on it) but has no incoming edges in the regeneration graph. Nothing in the system creates or maintains it. If it fails catastrophically, recovery requires manual intervention outside the subsystem’s chemistry. Highest-risk component.

Inert intermediate: The container registry lies on the critical regeneration path (CI → Registry → CD) but creates and maintains nothing. If the registry goes down, the entire regeneration chain breaks silently. This class includes DNS, certificate authorities, and secret stores — structurally invisible until they fail.

Regeneration redundancy:

| Component | Primary Path | Backup Path | Redundancy |
|---|---|---|---|
| Pods | Deployment creates | Manual kubectl | Yes (automated + manual) |
| Deployment | CD pipeline creates | Manual kubectl | Minimal (manual backup only) |
| Database | None | Manual restore from backup | None — critical risk |
| Registry | External managed service | None in-system | None |

Regeneration latency on critical path:

```mermaid
graph LR
  PF[Pod failure] -->|seconds| M[Monitoring]
  M -->|minutes| OCA[On-call alert]
  OCA -->|hours| HD[Human diagnosis<br/>+ fix]
  HD -->|minutes| CI[CI build]
  CI -->|seconds| R[Registry]
  R -->|minutes| CD[CD deploy]
  CD -->|seconds| DU[Deployment<br/>update]
  DU -->|seconds| PR[Pods<br/>restored]
  style OCA stroke:#ff8800,stroke-width:3px
  style HD stroke:#ff0000,stroke-width:4px
  style PR stroke:#44ff44,stroke-width:3px
```

Recovery bound: hours to days. The human edge dominates despite all automated components acting in seconds to minutes. Any SLO target faster than this is unrealistic for failures requiring code changes.
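
Putting rough numbers on that path makes the dominance concrete. The per-edge figures below are assumptions for illustration (the human edge is set at one working day), not measurements:

```python
# The critical path above with assumed per-edge latencies (seconds).
# The human diagnosis edge dominates all automated edges combined.

path = [
    ("pod failure → monitoring", 5),
    ("monitoring → on-call alert", 120),
    ("on-call → diagnosis + fix", 8 * 3600),  # assumed: one working day
    ("fix → CI build", 600),
    ("CI → registry", 30),
    ("registry → CD deploy", 300),
    ("CD → deployment update", 10),
    ("deployment → pods restored", 15),
]
bottleneck = max(path, key=lambda edge: edge[1])
print(bottleneck[0])                      # → on-call → diagnosis + fix
total = sum(latency for _, latency in path)
print(f"{total / 3600:.1f} hours total")  # → 8.3 hours total
```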

Interventions #

1. Move the seed molecule inside the boundary (GitOps).

Replace the external CD pipeline with a GitOps controller (ArgoCD, Flux) running inside the cluster. The cluster itself maintains the controller via its own reconciliation. The seed molecule moves from an external, unregeneratable dependency to an internal, self-maintaining component. This is why GitOps is structurally superior to push-based CD — it’s not a workflow preference, it’s an organizational closure improvement.

2. Bring the database into the regeneration graph.

Add a database operator (CloudNativePG, Zalando Postgres Operator). The operator creates and maintains the database instance. Pods produce metrics the operator consumes for scaling and failover decisions. A new self-maintaining cycle forms: Operator → Database → Pods → Metrics → Operator. The database transforms from a fragility point to a member of a self-maintaining organization.

```mermaid
graph TD
  Operator[Database<br/>Operator] -->|creates & maintains| DB[(Database)]
  DB -->|serves| Pods[Pods]
  Pods -->|produce| Metrics[Metrics]
  Metrics -->|inform| Operator
  style DB stroke:#44ff44,stroke-width:3px
  style Operator stroke:#4444ff,stroke-width:3px
  style Pods stroke:#44ff44,stroke-width:3px
```

3. Remove the human from the critical regeneration path.

Replace Monitoring → On-call → Git fix → CI → CD → Deployment with Monitoring → auto-rollback controller → Deployment. For the class of failures addressable by rollback, recovery latency drops from hours-to-days to seconds-to-minutes. The human remains available for novel failures but is no longer on the critical path for known failure modes.

4. Add regeneration redundancy for inert intermediates.

Add registry mirroring or caching within the cluster boundary. If the external registry fails, the local cache serves images for redeployment. The inert intermediate gains a backup regeneration path.

After all interventions:

| Property | Before | After |
|---|---|---|
| Organization type | Externally maintained | Closed, self-maintaining |
| Seed molecule | External (CD pipeline) | Internal (GitOps controller) |
| Database | Fragility point | In regeneration cycle |
| Human in critical path | Yes | No (available for novel failures) |
| Registry | Inert, no redundancy | Cached, redundant |
| Recovery latency | Hours to days | Seconds to minutes |

Part 5: Failure Archetypes #

These are named evolutionary failure modes specific to platform systems. They emerge naturally from the chemical dynamics described in Parts 1-2. When you identify one, check for the others — they frequently compound.

1. Parasitic Catalysis #

What it is: A catalyst (support component) that consumes more resources than the reactions it enables. In chemical terms, the catalyst’s presence inhibits the system rather than enabling it.

Examples: An observability sidecar that saturates CPU. A security agent that adds 200ms latency to every request. A logging pipeline that generates more data than the application itself.

Detection: Resource usage dominated by platform components rather than application workloads. Performance degrades when platform features are enabled. The ratio of “platform molecules” to “application molecules” in resource consumption is inverted.

Intervention: Isolate the catalyst — reduce its scope of reaction. Rate-limit its activity. Replace with a lighter-weight mechanism. Catalysts should be nearly invisible in resource consumption; when they’re not, they’ve become parasites.

Compounds with: Constructive Avalanche — each new platform feature adds another potentially parasitic catalyst.
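
The detection signal — platform components dominating resource consumption — can be checked per Pod. A sketch with hypothetical container names and millicore figures:

```python
# Parasitic-catalysis check: what fraction of a Pod's CPU goes to
# catalysts (sidecars, agents) rather than the application container?

def catalyst_share(containers, catalysts):
    """containers: {name: cpu_millicores}. Returns the catalyst fraction."""
    total = sum(containers.values())
    overhead = sum(cpu for name, cpu in containers.items()
                   if name in catalysts)
    return overhead / total

pod = {"app": 200, "envoy-sidecar": 180, "log-agent": 140, "security-agent": 280}
share = catalyst_share(pod, {"envoy-sidecar", "log-agent", "security-agent"})
print(f"{share:.0%} of CPU consumed by catalysts")  # → 75% of CPU...
```

A healthy catalyst is nearly invisible in this ratio; anything approaching or exceeding the application's own share is a parasite candidate.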

2. Reaction Domain Collision #

What it is: Two catalysts (controllers, operators, automation systems) that watch and modify the same molecule. In chemistry, this produces unpredictable products or oscillation. In platforms, it produces flapping states and race conditions.

Examples: HPA and a custom autoscaler both scaling a Deployment. ArgoCD and Flux both reconciling the same manifests. Platform team’s Terraform and app team’s Pulumi both modifying the same infrastructure.

Detection: Oscillating resource states (scaling up and down repeatedly). Conflicting entries in controller logs. Resources that drift from desired state despite active reconciliation. The phrase “it keeps reverting” in incident channels.

Intervention: Assign single reaction authority per resource type. Every molecule should have exactly one catalyst that creates it and one that maintains it. If two catalysts must coexist on the same resource, define explicit priority ordering and ensure the lower-priority catalyst defers.

Compounds with: Organizational Leakage — hidden dependencies create reaction surfaces where external catalysts interfere.

3. Organizational Leakage #

What it is: A subsystem that appears closed (self-contained) but has hidden reaction pathways crossing its boundary. In chemical terms, the organization’s membrane has holes.

Examples: A “self-contained” microservice that silently depends on a shared library’s runtime behavior. An “independent” domain that requires a central config service to start. A service that works in staging but fails in production because staging has different DNS.

Detection: The subsystem fails when an apparently unrelated service is modified or removed. Post-mortems contain the phrase “we didn’t know it depended on that.” The dependency graph reveals connections that the regeneration graph and architecture diagram don’t show.

Intervention: Make all cross-boundary reactions explicit. Move shared components either fully inside the organization boundary (internalize them) or fully outside with explicit, versioned interfaces (externalize them). The worst state is implicit dependence — it prevents closure without being visible.
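Leakage shows up as the set difference between observed and declared cross-boundary reactions. A small illustrative sketch, with invented service names standing in for edges pulled from runtime traces and the architecture diagram:

```python
# Declared (architecture diagram) vs observed (runtime traces) edges,
# each a (caller, callee) pair. All names are hypothetical.
DECLARED = {
    ("orders", "inventory"),
    ("orders", "payments"),
}
OBSERVED = {
    ("orders", "inventory"),
    ("orders", "payments"),
    ("orders", "central-config"),   # hidden reaction pathway
    ("inventory", "shared-cache"),  # another hole in the membrane
}

def leaked_edges(declared, observed):
    """Cross-boundary reactions that exist at runtime but not on paper."""
    return sorted(observed - declared)

for caller, callee in leaked_edges(DECLARED, OBSERVED):
    print(f"LEAK: {caller} -> {callee} (undeclared)")
```

Every edge this reports is a candidate for the internalize-or-externalize decision above.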

Compounds with: Reaction Domain Collision — leaked dependencies create surfaces where external catalysts interfere with internal molecules.

4. Constructive Avalanche #

What it is: One new molecule type triggers a cascade of additional molecule types, rapidly expanding the reactor’s chemical vocabulary. In a conservative system, adding a new service is routine. In a constructive avalanche, adding one new abstraction requires inventing three more.

Examples: A new CRD requires an operator, which requires a webhook, which requires cert-manager, which requires a secret store. A new microservice framework requires its own sidecar, log format, metrics pipeline, and deployment template.

Detection: A single feature request introduces three or more new platform components. The count of distinct resource types or operators increases faster than the count of application services. Teams can no longer enumerate all the molecule types in the reactor.

Intervention: Collapse abstractions — question whether each intermediate molecule is truly necessary. Introduce platform-level primitives that absorb common functionality. Constrain the extension model so new types must compose from existing primitives rather than introducing entirely new ones.
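The blast radius of one new molecule type can be estimated by walking a "requires" graph before approving it. A sketch under the assumption that such a graph is maintained — the edges below mirror the CRD example above and are hypothetical:

```python
from collections import deque

# Hypothetical "introducing X requires Y" edges.
REQUIRES = {
    "new-crd":      ["operator"],
    "operator":     ["webhook"],
    "webhook":      ["cert-manager"],
    "cert-manager": ["secret-store"],
    "secret-store": [],
}

def avalanche_size(new_type, requires, existing=frozenset()):
    """Count the new molecule types one addition pulls into the reactor."""
    seen, queue = set(), deque([new_type])
    while queue:
        t = queue.popleft()
        if t in seen or t in existing:
            continue  # already counted, or already in the reactor
        seen.add(t)
        queue.extend(requires.get(t, []))
    return len(seen)

print(avalanche_size("new-crd", REQUIRES))  # 5 new types from one request
print(avalanche_size("new-crd", REQUIRES,
                     existing=frozenset({"cert-manager", "secret-store"})))  # 3
```

The second call shows why platform-level primitives help: types already in the reactor absorb part of the cascade.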

Compounds with: Catalyst Brain Formation — someone builds a mega-controller to manage the avalanche, which becomes a centralized bottleneck.

5. Catalyst Brain Formation #

What it is: A catalyst (controller, operator, CI/CD system) accumulates reaction rules until it becomes a centralized decision engine that controls the entire platform. In AChem terms, a local catalyst has evolved into a global reactor controller — violating the principle that system behavior should emerge from local reactions.

Examples: A CI/CD system that encodes all deployment logic, environment rules, and rollback policies. A platform operator that manages all service types with a single complex reconciliation loop. An admission controller with hundreds of policy rules covering every resource type.

Detection: All platform changes require modifying one central component. New service types or deployment patterns are blocked until the platform team updates the “brain.” Teams queue behind a single bottleneck for platform changes. The component’s codebase grows monotonically while no other component’s does.

Intervention: Split the catalyst into domain-local catalysts. Remove global assumptions. Introduce extension points so teams can customize behavior without modifying the central component. The architectural goal is many small catalysts with local scope, not one large catalyst with global scope.

Compounds with: Attractor Ossification — the brain becomes the attractor that resists all architectural change.

6. Attractor Ossification #

What it is: A self-maintaining organization that works so well it becomes impossible to change. The reconciliation loops that make it self-healing also make it change-resistant — every modification gets “healed” back to the current state. The system is trapped in an attractor basin.

Examples: A deployment pipeline so embedded that migrating to a new approach requires rewriting everything simultaneously. A service mesh configuration that automatically reverts manual changes, even intentional ones. A Kubernetes cluster where the GitOps controller immediately reverts any architectural experiment.

Detection: Attempts to introduce new patterns are automatically reversed by existing controllers. Platform migrations require “big bang” cutovers because gradual change is prevented by self-healing. The phrase “we can’t change X without changing everything” appears in architectural discussions.

Intervention: Attractor deformation. Do not attempt abrupt replacement. Instead:

  1. Parallel attractor formation. Introduce the new pattern alongside the existing one for a single service. Both attractors coexist. Neither threatens the other.
  2. Prove self-maintenance. Verify the new-pattern service recovers from failures independently, without the old system’s intervention.
  3. Gradual migration. Move services one by one. Each migration shrinks the old attractor basin and grows the new one. The system is never in an unstable state because at least one self-maintaining organization is always running.
  4. Decommission. Remove the old pattern only after no services remain in its basin.

This is the strangler fig pattern understood chemically. The AChem framing explains why it works: you never cross an instability barrier because both attractors are self-maintaining throughout the transition.

Compounds with: Catalyst Brain Formation — the ossified attractor is often maintained by a catalyst brain, and both must be addressed together.

Archetype Interaction Map #

Failure archetypes compound. When you identify one, check for its common companions:

```mermaid
graph TD
    CA[Constructive Avalanche] -->|triggers| CBF[Catalyst Brain Formation]
    CBF -->|triggers| AO[Attractor Ossification]
    OL[Organizational Leakage] -->|enables| RDC[Reaction Domain Collision]
    PC[Parasitic Catalysis] -->|amplified by| CA
    AO -->|blocks fix of| All[All other archetypes]
    style AO stroke:#ff0000,stroke-width:4px
    style CA stroke:#ff8800,stroke-width:3px
    style CBF stroke:#ff8800,stroke-width:3px
    style OL stroke:#4444ff,stroke-width:3px
    style RDC stroke:#4444ff,stroke-width:3px
    style PC stroke:#44ff44,stroke-width:3px
```

Attractor Ossification is the terminal archetype. Once present, it prevents remediation of everything else. Address it first, or at least in parallel with other interventions.


Part 6: Platform Phase Diagram #

Platforms don’t mature along a linear ladder. They move between states, can regress, and often exist in multiple states simultaneously across different subsystems. The AChem framing gives us a phase diagram — a map of possible states and the transitions between them.

The Four States #

Conservative–Closed. Few molecule types. Strong constraints. Hard to extend. The system is stable and predictable but rigid. A legacy monolith platform or a tightly locked-down Kubernetes cluster lives here. Risk: inability to adapt to new requirements. Teams start working around the platform rather than with it.

Conservative–Open. Standard reaction protocols (interaction patterns) but weak organizational boundaries. The system has stable reactions but hidden cross-boundary dependencies — organizational leakage is the dominant pathology. Microservices with shared databases are the classic example. Risk: cascading failures from leaked dependencies.

Constructive–Unconstrained. New molecule types appear constantly without review. CRD and operator proliferation. Multiple competing abstractions for the same function. This is where constructive avalanches happen. A large Kubernetes platform with uncontrolled extensions lives here. Risk: platform entropy, debugging nightmares, unpredictable emergent behavior.

Constructive–Constrained. New types emerge within guardrails. A controlled extension model exists, domain boundaries are enforced, and reaction rules are standardized. A mature internal developer platform with a reviewed plugin model lives here. This is the target state for most evolving platforms. Risk: governance overhead, but manageable.

Diagnostic Indicators #

| Signal | Conservative–Closed | Conservative–Open | Constructive–Unconstrained | Constructive–Constrained |
|---|---|---|---|---|
| New component types per quarter | ~0 | ~0 | Many, unreviewed | Steady, governed |
| Extension requests | Rejected/backlogged | N/A | Anyone can add anything | Reviewed and approved |
| Surprise dependencies in post-mortems | Rare | Common | Very common | Rare |
| Competing abstractions for same function | None | None | Multiple | Consolidated |
| Onboarding time trend | Stable | Slowly growing | Growing each quarter | Stable |
| Incident blast radius | Contained | Unpredictable cascades | Unpredictable | Contained |

Transition Triggers #

Conservative–Closed → Constructive–Unconstrained. Triggered by: new team with different requirements, acquisition integrating a different tech stack, major feature expansion. This is the most common and dangerous transition — it often happens without anyone noticing until entropy is high.

Constructive–Unconstrained → Constructive–Constrained. Triggered by: deliberate platform team intervention. This is the highest-value intervention zone and where most platform engineering effort should focus.

Constructive–Constrained → Conservative–Closed. Triggered by: overly rigid governance, catalyst brain formation, loss of extension capability. The platform becomes stable but stagnant.

Any state → Constructive–Unconstrained. This is the default entropy direction. Without active constraint, platforms drift toward unconstrained constructive dynamics. This is not a failure of discipline — it’s a thermodynamic tendency. Maintaining the Constructive–Constrained state requires ongoing energy expenditure (governance, review, consolidation).

```mermaid
stateDiagram-v2
    [*] --> CC: Initial Platform
    CC: Conservative-Closed<br/>(Stable but rigid)
    CO: Conservative-Open<br/>(Stable reactions,<br/>weak boundaries)
    CU: Constructive-Unconstrained<br/>(High entropy,<br/>uncontrolled growth)
    CCon: Constructive-Constrained<br/>(Controlled evolution)
    CC --> CU: New requirements<br/>Different tech stack<br/>Feature expansion
    CU --> CCon: Platform team<br/>intervention<br/>(Highest value)
    CCon --> CC: Rigid governance<br/>Brain formation<br/>Loss of extension
    CC --> CO: Weak boundaries
    CO --> CU: Uncontrolled<br/>extensions
    CCon --> CU: Loss of<br/>constraints<br/>(Default entropy)
    CO --> CU: Default<br/>entropy
    style CU stroke:#ff4444,stroke-width:4px
    style CCon stroke:#44ff44,stroke-width:4px
    style CC stroke:#ffaa00,stroke-width:3px
    style CO stroke:#4444ff,stroke-width:3px
```

The Platform Evolution Cycle #

Most platforms cycle through a repeating pattern:

  1. New needs appear that the current platform can’t serve
  2. New molecule types are introduced to address them
  3. The system enters a Constructive–Unconstrained phase
  4. Platform team introduces constraints and consolidates
  5. New stable organizations form
  6. The system settles into Conservative–Open or Constructive–Constrained
  7. Cycle repeats

Understanding this cycle prevents two common mistakes: treating the Constructive–Unconstrained phase as a crisis (it’s a natural phase, not a failure) and treating the Conservative–Closed phase as success (it’s stability at the cost of adaptability).

Intervention Playbook: Unconstrained → Constrained #

This is the most valuable transition to execute deliberately:

Step 1: Census the reaction space. Count molecule types (services, CRDs, operators, event types), reaction rules (controllers, pipelines, webhooks), and reaction protocols (HTTP, gRPC, events, shared DB). Automate this census and run it weekly. The count is a leading indicator of platform entropy.
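A weekly census can be as simple as summing inventories pulled from the control plane. A minimal sketch in which hypothetical hard-coded counts stand in for real discovery queries (listing API resources, pipelines, event schemas):

```python
from datetime import date

# Hypothetical inventories; in practice these come from discovery queries.
MOLECULE_TYPES = {"services": 42, "crds": 17, "operators": 9, "event-types": 23}
REACTION_RULES = {"controllers": 11, "pipelines": 28, "webhooks": 6}
PROTOCOLS = {"http", "grpc", "events", "shared-db"}

def census():
    """One row of the weekly reaction-space census."""
    return {
        "date": date.today().isoformat(),
        "molecule_types": sum(MOLECULE_TYPES.values()),
        "reaction_rules": sum(REACTION_RULES.values()),
        "protocols": len(PROTOCOLS),
    }

print(census())
```

Store one row per week; a rising `molecule_types` slope is the entropy signal the step describes.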

Step 2: Identify competing molecules. Find multiple molecule types serving the same function — two logging sidecars, three deployment mechanisms, four secret management approaches. These are consolidation candidates.

Step 3: Enforce reaction interface declarations. Each molecule type must declare what it reacts with and what it produces. This is the organizational equivalent of type-checking — it makes hidden interactions explicit and prevents unintended cross-boundary reactions.
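The type-checking analogy can be made literal. An illustrative sketch in which each molecule declares what it reacts with, and a validator rejects undeclared reactions — all names here are invented:

```python
# Hypothetical interface declarations: what each molecule type may react
# with and what it produces. A reaction is legal only if declared.
DECLARATIONS = {
    "order-service":  {"reacts_with": {"inventory-api", "payment-events"},
                       "produces": {"order-events"}},
    "billing-worker": {"reacts_with": {"order-events"},
                       "produces": {"invoices"}},
}

def check_reaction(consumer, surface, declarations):
    """Type-check one reaction: may `consumer` react with `surface`?"""
    decl = declarations.get(consumer)
    if decl is None:
        return f"ERROR: {consumer} has no interface declaration"
    if surface not in decl["reacts_with"]:
        return f"ERROR: {consumer} -> {surface} is undeclared"
    return "ok"

print(check_reaction("billing-worker", "order-events", DECLARATIONS))
print(check_reaction("billing-worker", "inventory-api", DECLARATIONS))
```

Run the same check against observed traffic in CI or admission control, and hidden interactions stop being hidden.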

Step 4: Introduce compartments. Create boundaries within which molecules interact freely, with controlled and explicit transport across boundaries. Not advisory boundaries (naming conventions, team agreements) but enforced ones (network policy, admission control, RBAC).

Step 5: Establish a constructive dynamics budget. Set an acceptable rate for new molecule type introduction — not zero (that’s Conservative–Closed) but bounded and reviewed. This prevents drift back to Unconstrained while preserving the ability to evolve.
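The budget itself can be a small automated check rather than a policy document. A sketch assuming a hypothetical quarterly limit of four reviewed types:

```python
BUDGET_PER_QUARTER = 4  # assumed limit; tune per platform

def budget_status(introduced_this_quarter, budget=BUDGET_PER_QUARTER):
    """Bounded, reviewed constructive dynamics: not zero, not unconstrained."""
    unreviewed = [t["name"] for t in introduced_this_quarter if not t["reviewed"]]
    over = max(0, len(introduced_this_quarter) - budget)
    return {"over_budget_by": over, "unreviewed": unreviewed}

# Hypothetical new molecule types introduced this quarter.
intro = [
    {"name": "feature-flag-crd", "reviewed": True},
    {"name": "canary-operator",  "reviewed": True},
    {"name": "team-x-sidecar",   "reviewed": False},
]
print(budget_status(intro))
```

Anything over budget, or introduced without review, is the early warning of drift back toward Unconstrained.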


Appendix A: Complete AChem-to-Platform Mapping #

| AChem Concept | Platform Equivalent | Diagnostic Question |
|---|---|---|
| Molecule | Service, pod, CRD, operator, pipeline, database, human role | What are the autonomous units? |
| Reaction | State change where one component affects another | What happens when components interact? |
| Reaction rule | Controller logic, pipeline def, operator code, runbook | What governs when and how reactions occur? |
| Reactor | Control plane, event bus, scheduler, cloud substrate | What environment hosts the molecules? |
| Catalyst | Controller, operator, sidecar, admission webhook | What enables reactions without being consumed? |
| Reactor mixing | Well-mixed / compartmentalized / flow | Which components can see and interact with which? |
| Organization | Self-maintaining set of components in a closed cycle | Which subsystems sustain themselves? |
| Closure | No reaction products escape the boundary at runtime | Do interactions stay within the boundary? |
| Self-maintenance | Components regenerate each other in a cycle | Will this survive without human intervention? |
| Constructive dynamics | New CRDs, operators, resource types appearing | Is the platform producing new component types? |
| Seed molecule | Component with high generative fan-out | What, if lost, prevents system regeneration? |
| Fragility point | Runtime dependency with no regeneration edge | What does the system need but can't recreate? |
| Inert intermediate | Pass-through on regeneration path | What's invisible until it fails? |
| Attractor basin | Stable config maintained by reconciliation loops | What state does the system "heal" back to? |
| Attractor deformation | Strangler fig, gradual migration | How do we change the stable state without instability? |
| Regeneration latency | Time for a regeneration edge to take effect | How fast can each component be recreated? |
| Regeneration redundancy | Count of distinct regeneration paths per component | How many ways can each component be recreated? |

Appendix B: Quick-Start Checklist #

For your next architecture review, work through these five steps for one subsystem:

1. Perform the molecule inventory and reaction mapping. List every component that participates in the subsystem’s lifecycle — services, controllers, pipelines, databases, registries, monitoring, human roles. Identify the reactions between them: what triggers each interaction, what it produces, and where the reaction rule governing it lives.

2. Draw the regeneration graph. Using only Creates and Maintains edges, draw the directed graph. Separately draw the dependency graph using Depends On edges. Compare them. Components in the dependency graph with no incoming regeneration edges are your fragility points. Components with high generative fan-out are your seed molecules. Pass-throughs with no regeneration of their own are your inert intermediates.

3. Test for closure and self-maintenance. Does the regeneration cycle include humans or components outside the subsystem boundary? If yes, it’s not self-maintaining. Identify what would need to change to close the organization — which human-regenerated nodes could be automated, which external dependencies could be internalized.

4. Measure regeneration latency and redundancy. For each edge in the critical regeneration path, assign a latency class (seconds, minutes, hours, days). The slowest edge is your recovery bound — any SLO faster than this is a fiction. For each critical component, count regeneration paths. Single-path and no-path components are your highest resilience risks.

5. Scan for failure archetypes. Check each of the six archetypes: parasitic catalysis, reaction domain collision, organizational leakage, constructive avalanche, catalyst brain formation, attractor ossification. When you find one, check for its companions using the interaction map. Remember that attractor ossification is the terminal archetype — it blocks remediation of everything else.

These five steps, applied to one subsystem at a time, will surface structural risks that traditional architecture diagrams and dependency graphs miss.
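Step 2's comparison of the regeneration and dependency graphs is mechanical enough to script. A sketch with hypothetical component names and edges:

```python
# Regeneration graph (Creates/Maintains edges) vs dependency graph
# (Depends On edges). All names are hypothetical.
REGENERATES = {                      # who recreates whom
    "gitops-controller": {"deployment", "service", "configmap"},
    "deployment":        {"pod"},
}
DEPENDS_ON = {                       # runtime needs
    "pod":     {"configmap", "license-server"},
    "service": {"pod"},
}

def analyze(regenerates, depends_on):
    regenerated = {t for targets in regenerates.values() for t in targets}
    needed = {d for deps in depends_on.values() for d in deps}
    return {
        # needed at runtime, but nothing recreates them
        "fragility_points": sorted(needed - regenerated),
        # high generative fan-out: losing these blocks regeneration
        "seed_molecules": sorted(
            src for src, targets in regenerates.items() if len(targets) >= 2
        ),
    }

print(analyze(REGENERATES, DEPENDS_ON))
```

Here `license-server` is needed by pods but appears in no regeneration edge — the classic fragility point — while the GitOps controller's high fan-out marks it as a seed molecule.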
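Step 4's recovery bound and redundancy count are similarly simple to compute once the graph exists. A sketch with illustrative latency classes, edges, and path counts:

```python
# Worst-case seconds per latency class (assumed bucket sizes).
LATENCY_CLASS = {"seconds": 60, "minutes": 3600, "hours": 86400, "days": 604800}

# (regenerator, regenerated, latency class) along the critical path.
CRITICAL_PATH = [
    ("gitops-controller", "deployment", "seconds"),
    ("deployment", "pod", "seconds"),
    ("ops-oncall", "tls-cert", "hours"),   # manual renewal dominates
]

# Distinct regeneration paths per component (hypothetical counts).
REGEN_PATHS = {"pod": 2, "deployment": 1, "tls-cert": 1, "license-server": 0}

def recovery_bound(path):
    """Slowest edge on the critical regeneration path, in seconds."""
    return max(LATENCY_CLASS[cls] for _, _, cls in path)

def resilience_risks(paths):
    """Components with one or zero regeneration paths."""
    return sorted(c for c, n in paths.items() if n <= 1)

print(f"recovery bound: {recovery_bound(CRITICAL_PATH)}s")
print("risks:", resilience_risks(REGEN_PATHS))
```

The output makes the document's claim concrete: with a manual certificate-renewal edge on the path, any recovery SLO tighter than hours is fiction.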

