# Security and Observability
## Security: Listeners, Identity, and Access Control
Security in a streaming cluster is not a single mechanism — it is a layered architecture. Transport security (TLS) establishes encrypted channels. Authentication (SASL/OIDC) establishes identity. Authorization (ACLs) enforces what each identity is permitted to do. Misconfiguring any layer invalidates the ones above it.
### TLS and mTLS: Listener Isolation
Redpanda exposes multiple listeners — network endpoints that can be independently configured for different security profiles. The canonical pattern is two listeners:
- Internal listener — used by brokers to communicate with each other (Raft replication, controller RPCs). This traffic never leaves the private network. mTLS is appropriate here: each broker presents a certificate, and only brokers with valid certificates are admitted.
- External listener — used by clients (producers, consumers, admin tools). TLS without mutual authentication is the common default: the broker presents a certificate, the client validates it, but the client does not present one. Authentication is handled at the SASL layer.
Listener isolation matters because mixing internal and external traffic on a single listener makes it impossible to apply different security policies to each. A misconfigured external client that somehow presents a valid internal certificate would be admitted to the internal listener — a privilege escalation that cannot happen if the listeners are physically separate.
### SASL and OIDC: Authentication and Principal Mapping
SASL is the authentication layer. Redpanda supports several SASL mechanisms:
- SASL/PLAIN — username and password transmitted in the clear over TLS. Simple to configure, adequate for internal or low-risk environments.
- SASL/SCRAM — challenge-response protocol that never transmits the password. The standard production choice for password-based authentication.
- SASL/OAUTHBEARER with OIDC — the client obtains a JWT from an identity provider (Auth0, Okta, Keycloak) and presents it to the broker. The broker validates the token against the provider’s JWKS endpoint. This integrates streaming authentication with the organization’s existing identity infrastructure.
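To make the client-side differences concrete, the settings each mechanism requires can be sketched as a configuration map. The key names below follow the kafka-python client's conventions, and the credentials are placeholders; other client libraries name these options differently.

```python
def sasl_config(mechanism: str) -> dict:
    """Client settings per SASL mechanism (kafka-python-style key names)."""
    base = {"security_protocol": "SASL_SSL"}  # SASL credentials always ride on TLS
    if mechanism == "PLAIN":
        return {**base, "sasl_mechanism": "PLAIN",
                "sasl_plain_username": "svc-user",       # placeholder credentials
                "sasl_plain_password": "svc-password"}
    if mechanism == "SCRAM":
        # Challenge-response: the password itself is never transmitted.
        return {**base, "sasl_mechanism": "SCRAM-SHA-256",
                "sasl_plain_username": "svc-user",
                "sasl_plain_password": "svc-password"}
    if mechanism == "OAUTHBEARER":
        # The JWT is obtained from the identity provider out of band
        # and presented via the client's token-provider callback.
        return {**base, "sasl_mechanism": "OAUTHBEARER"}
    raise ValueError(f"unknown SASL mechanism: {mechanism}")

print(sasl_config("SCRAM")["sasl_mechanism"])  # SCRAM-SHA-256
```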
After authentication, the broker maps the authenticated identity to a principal — the name used in ACL rules. For OIDC, the mapping is configurable: you can extract the principal from the `sub` claim, the `email` claim, or a custom claim. The mapping must be consistent across all brokers; a mismatch causes authentication to succeed but authorization to fail, producing confusing Topic Authorization Failed errors.
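A minimal sketch of the mapping step, assuming the token has already been validated; the function name and defaults here are hypothetical, not broker configuration:

```python
def map_principal(claims: dict, principal_claim: str = "sub") -> str:
    """Derive the ACL principal from a validated token's claims."""
    value = claims.get(principal_claim)
    if value is None:
        # Authentication succeeded, but no principal can be derived:
        # authorization will fail downstream with a confusing error.
        raise KeyError(f"token missing configured claim {principal_claim!r}")
    return f"User:{value}"

claims = {"sub": "payments-service", "email": "payments@example.com"}
print(map_principal(claims))           # User:payments-service
print(map_principal(claims, "email"))  # User:payments@example.com
```

If one broker extracts from `sub` and another from `email`, the same client authenticates successfully everywhere but matches its ACLs on only some brokers.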
### ACL Design: Least-Privilege and Prefixed Patterns
ACLs define what each principal can do on which resources. The resource types are: Topic, Group, TransactionalId, and Cluster. The operations are: READ, WRITE, CREATE, DELETE, ALTER, DESCRIBE, ALTER_CONFIGS, DESCRIBE_CONFIGS.
Two ACL patterns that scale:
Prefixed resource patterns allow a single ACL to cover all resources matching a prefix. Instead of creating one ACL per topic, a service owning all topics under `payments-` gets one prefixed ACL:

```bash
rpk security acl create \
  --allow-principal User:payments-service \
  --operation READ,WRITE,DESCRIBE \
  --topic payments- \
  --resource-pattern-type PREFIXED
```
This scales with topic proliferation and aligns with team-based topic ownership.
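The difference between the two pattern types can be sketched as a matching function; the function name is illustrative, but the matching rules mirror Kafka ACL semantics:

```python
def acl_matches(pattern: str, pattern_type: str, topic: str) -> bool:
    """Does an ACL with this resource pattern cover the given topic?"""
    if pattern_type == "LITERAL":
        return topic == pattern
    if pattern_type == "PREFIXED":
        return topic.startswith(pattern)
    raise ValueError(f"unknown pattern type: {pattern_type}")

# One PREFIXED ACL covers every current and future payments topic:
assert acl_matches("payments-", "PREFIXED", "payments-authorized")
assert acl_matches("payments-", "PREFIXED", "payments-refunds-v2")
# A LITERAL ACL covers exactly one topic name and nothing else:
assert not acl_matches("topic-a", "LITERAL", "topic-a-v2")
```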
Separation of Data Plane and Control Plane permissions prevents application bugs from modifying topology:
- Data Plane — `READ`, `WRITE` on Topics; `READ` on Groups. Given to application principals.
- Control Plane — `CREATE`, `DELETE`, `ALTER_CONFIGS` on Topics. Given to CI/CD principals only.
The net effect: an application bug cannot delete its own topic, and a rogue CI/CD pipeline cannot read or exfiltrate the data in the topics it manages.
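The two permission sets can be written down as data, with the invariants checked explicitly; the structure below is an illustrative sketch, not a broker configuration format:

```python
# The two planes as permission sets, keyed by resource type.
DATA_PLANE = {"Topic": {"READ", "WRITE", "DESCRIBE"}, "Group": {"READ"}}
CONTROL_PLANE = {"Topic": {"CREATE", "DELETE", "ALTER_CONFIGS"}}

# An application bug cannot destroy topology...
assert "DELETE" not in DATA_PLANE["Topic"]
# ...and a rogue pipeline cannot read message data.
assert "READ" not in CONTROL_PLANE["Topic"]
```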
### Debugging Authorization Failures
The most common authorization failure pattern: a client reports Topic Authorization Failed or times out on metadata requests. The root cause is usually one of three things:
- Missing `DESCRIBE` permission — clients need `DESCRIBE` on a topic to fetch metadata (partition count, leader location). Without it, the client cannot discover where to send produce or fetch requests. The error surfaces as a timeout or `Unknown Topic`, not as an explicit auth error.
- LITERAL vs PREFIXED mismatch — an ACL created as `LITERAL` for `topic-a` does not match `topic-a-v2`. Check the resource pattern type when ACLs appear correct but access is still denied.
- Principal mapping inconsistency — the authenticated principal does not match the principal in the ACL. For OIDC, log the extracted principal at DEBUG level to verify the claim mapping produces the expected value.
## Observability: SLIs, Golden Signals, and Burn-Rate Alerting
Observability in a streaming cluster is not measured by the number of dashboards — it is the speed at which you can navigate from a user-facing symptom to a root cause. True observability is defined by one question: can you answer "is the cluster healthy and are users harmed?" within minutes?
### The Observability Hierarchy
Effective observability is structured as a hierarchy, not a flat collection of metrics:
1. User Impact (Symptom) — defined by SLOs. Is availability below 99.9%? Is P99 latency above 500ms?
2. SLI Breach (Alert) — a quantitative signal tied directly to user impact. Fires when the SLO is threatened.
3. Golden Signals (Dashboard) — latency, traffic, errors, saturation. Triage starts here.
4. Subsystem Suspects — correlate golden signals to narrow the hypothesis: disk I/O, CPU, network, replication lag.
5. Deep Dive (Trace/Log) — investigate the specific component the evidence points to.
Alerting on anything below level 2 generates noise. A disk I/O spike is a cause, not a symptom. Alert on symptoms. Investigate causes.
### Availability SLI: Server-Side Errors Only
The availability SLI measures the fraction of produce requests that succeed. A naive query counting all non-200 responses creates false alarms — a client requesting a non-existent topic generates a client-side error (4xx equivalent) that is not a cluster health issue.
The correct PromQL query isolates server-side errors (5xx status codes):
```promql
(
  sum(rate(redpanda_kafka_request_total{request="produce",status=~"5.*"}[5m]))
  /
  sum(rate(redpanda_kafka_request_total{request="produce"}[5m]))
) > 0.01
```
An error rate above 1% over 5 minutes indicates a severe degradation in service quality. Latency SLIs should use P99, not average. Average latency hides the “hiccups” caused by leader elections, disk contention, or GC pauses that only affect a fraction of requests but are exactly what users experience as intermittent failures.
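A small stdlib-only demonstration of why the average misleads, using synthetic latencies in which 1 request in 100 stalls:

```python
import statistics

# 99 requests at 10 ms, one stalled at 900 ms (e.g. during a leader election)
latencies = [10.0] * 99 + [900.0]

mean = statistics.mean(latencies)                  # 18.9 ms: looks healthy
p99 = statistics.quantiles(latencies, n=100)[-1]   # ~891 ms: what unlucky users see

print(f"mean={mean:.1f}ms p99={p99:.1f}ms")
```

The mean barely registers the stall; the P99 is dominated by it, which is exactly the intermittent failure users report.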
### Saturation: Reactor Utilization, Not CPU Load
In a thread-per-core architecture, traditional CPU load averages are misleading. The system is designed to pin threads to cores — a busy reactor at 100% utilization looks like 100% CPU on that core, but the cluster may have 15 other idle cores. The relevant metric is reactor utilization: the percentage of time the reactor’s event loop is busy processing events versus sleeping.
`redpanda_cpu_busy_seconds_total`, normalized by the window duration, provides this signal. Visualized as a heatmap across all cores, it reveals hot shard problems: one core saturated at 100% while others are at 20%. Scaling horizontally does not fix a hot shard — the bottleneck is a single partition bound to a single core. The solution is partition rebalancing or changing the partition key strategy.
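A sketch of the per-core computation: utilization as the delta of the busy-seconds counter over the scrape window. The counter samples are invented, and the helper and threshold are illustrative:

```python
def utilization(busy_start: float, busy_end: float, window_s: float) -> float:
    """Fraction of the window the reactor's event loop spent busy (0.0-1.0)."""
    return (busy_end - busy_start) / window_s

# redpanda_cpu_busy_seconds_total sampled 60 s apart on a 4-core broker
# (values invented for illustration):
samples = {0: (100.0, 159.0), 1: (80.0, 92.0), 2: (75.0, 87.0), 3: (90.0, 102.0)}
util = {core: utilization(a, b, 60.0) for core, (a, b) in samples.items()}

hot_shards = [core for core, u in util.items() if u > 0.9]
print({core: round(u, 2) for core, u in util.items()})  # {0: 0.98, 1: 0.2, 2: 0.2, 3: 0.2}
print(hot_shards)  # [0]: one saturated core; adding brokers will not help
```

The host-level load average for this broker would read roughly 40% and look unremarkable, while core 0 is at its ceiling.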
### Consumer Lag: Correlate with Throughput
Lag alone is not an incident — growing lag is. The three lag patterns from Chapter 4 map to three Prometheus query patterns:
- High lag + high throughput — `kafka_consumer_group_lag` high, `redpanda_kafka_request_total{request="fetch"}` high. Consumer under-provisioned.
- High lag + zero throughput — lag high, fetch rate zero. Consumer availability incident.
- Growing lag + low broker throughput — lag growing, fetch rate low despite the consumer being active. Broker-side bottleneck.
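The three patterns collapse into a small classifier; the thresholds below are illustrative placeholders, not recommended values:

```python
def diagnose_lag(lag: int, lag_growing: bool, fetch_rate: float) -> str:
    """Map observed lag and fetch throughput to one of the three patterns."""
    if lag > 0 and fetch_rate == 0:
        return "consumer availability incident"
    if lag_growing and fetch_rate < 100:       # consumer active but starved
        return "broker-side bottleneck"
    if lag > 10_000 and fetch_rate >= 100:
        return "consumer under-provisioned"
    return "healthy"

print(diagnose_lag(lag=50_000, lag_growing=True, fetch_rate=5_000))  # consumer under-provisioned
print(diagnose_lag(lag=50_000, lag_growing=False, fetch_rate=0))     # consumer availability incident
print(diagnose_lag(lag=5_000, lag_growing=True, fetch_rate=10))      # broker-side bottleneck
```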
### Burn-Rate Alerting: Error Budget Over Time
Threshold alerts (latency > 500ms) are noisy. They fire on transient spikes that recover before the on-call engineer can respond. A more precise approach is burn-rate alerting, derived from SRE error budget methodology.
If the SLO is 99.9% availability over 30 days, the error budget is 43.2 minutes. A burn rate of 1 consumes the budget at exactly the rate it replenishes. A burn rate of 14.4 exhausts the entire monthly budget in about 2 days (30 / 14.4 ≈ 2.08).
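The arithmetic, checked numerically:

```python
slo = 0.999
window_days = 30

# Error budget: the downtime the SLO tolerates over the window.
budget_minutes = (1 - slo) * window_days * 24 * 60
print(round(budget_minutes, 1))   # 43.2

# A burn rate of B consumes the budget B times faster than it replenishes.
burn_rate = 14.4
days_to_exhaust = window_days / burn_rate
print(round(days_to_exhaust, 2))  # 2.08
```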
Alerting on high burn rates with dual windows prevents both false positives and missed incidents:
```yaml
- alert: HighErrorBudgetBurn
  expr: |
    (
      rate(redpanda_kafka_request_errors_total[1h]) / rate(redpanda_kafka_request_total[1h])
      > 14.4 * (1 - 0.999)
    ) and (
      rate(redpanda_kafka_request_errors_total[5m]) / rate(redpanda_kafka_request_total[5m])
      > 14.4 * (1 - 0.999)
    )
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Error budget burning too fast (Burn Rate > 14.4)"
```
The 1-hour window ensures the condition is historically significant. The 5-minute window ensures it is currently active. Both must be true before the alert fires. This eliminates brief spikes from triggering pages while catching sustained degradation within minutes.
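The dual-window condition reduces to a simple predicate over the two error ratios, which makes the spike-suppression behavior easy to see; the sample ratios are invented:

```python
# The alert's threshold: burn rate 14.4 against a 99.9% SLO.
THRESHOLD = 14.4 * (1 - 0.999)    # error ratio of ~1.44%

def should_page(ratio_1h: float, ratio_5m: float) -> bool:
    """Page only when both the 1-hour and 5-minute windows breach."""
    return ratio_1h > THRESHOLD and ratio_5m > THRESHOLD

print(should_page(ratio_1h=0.002, ratio_5m=0.05))  # False: a brief spike, already diluted over 1h
print(should_page(ratio_1h=0.02, ratio_5m=0.03))   # True: sustained and still burning
```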
### Under-Replicated Partitions: The Leading Indicator
`redpanda_cluster_partition_under_replicated_replicas` is often the first signal of a node failure or network partition. An under-replicated partition means at least one ISR member has fallen behind. If this metric is non-zero for longer than the standard recovery time, the cluster is running in a degraded state: new writes are being accepted with a reduced durability guarantee.
Alert on this metric with a tight threshold:
```promql
redpanda_cluster_partition_under_replicated_replicas > 0
```
A brief spike during a rolling restart is expected. A sustained non-zero value is a durability incident: if min.insync.replicas is set below the replication factor, writes with acks=all keep succeeding against a smaller ISR than configured, silently weakening the durability guarantee until the metric returns to zero.