Failure Modes and What They Mean
AMQP systems fail in specific, well-defined ways. Each failure mode has a different implication for message delivery — some lose messages, some cause duplicates, some cause re-ordering. Understanding them precisely is the difference between a messaging system that works and one that silently loses data.
Failure 1: Connection Drop
What happens: the TCP connection between client and broker is lost — network failure, OS kill, broker restart, or heartbeat timeout.
From the broker’s perspective: all unacknowledged messages on all channels of that connection are returned to their queues (ready state). All pending publisher confirms for that connection are cancelled — the producer does not know if its last published messages were committed.
From the client’s perspective: the client’s channels become invalid. Any in-flight method responses are lost.
Message fate:
- Messages published but not yet received by broker: lost (the TCP buffer may have been dropped).
- Messages received by broker but not yet confirmed: unknown — may or may not have been persisted.
- Messages delivered to consumer but not acked: re-queued, re-delivered to another consumer.
- Messages delivered and acked before the drop: safely delivered, ack was processed.
Recovery action:
- Reconnect with exponential backoff.
- Re-open channels.
- Re-declare consumers (basic.consume).
- Re-publish any unconfirmed messages (at-least-once — may cause duplicates; consumers must be idempotent).
- Re-declare topology if using ephemeral queues.
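The reconnect step is commonly implemented as an exponential-backoff loop. A minimal sketch; the `connect` callable, delay values, and `max_delay` default are illustrative, and in practice `connect` would wrap something like `lambda: pika.BlockingConnection(params)`:

```python
import time

def connect_with_backoff(connect, max_delay=30.0, sleep=time.sleep):
    """Retry connect() until it succeeds, doubling the delay up to max_delay.

    connect: a zero-argument callable, e.g.
             lambda: pika.BlockingConnection(params)
    """
    delay = 1.0
    while True:
        try:
            return connect()
        except Exception:
            # Connection failed: wait, then retry with a longer delay
            sleep(delay)
            delay = min(delay * 2.0, max_delay)
```

After the connection is re-established, the remaining recovery steps still apply: re-open channels, re-declare consumers, and re-publish unconfirmed messages.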
Heartbeat configuration matters: without heartbeats (heartbeat=0), connection drops are detected only when the next TCP write fails — which could be seconds or minutes later. With heartbeat=60, drops are detected within ~120 seconds (2× the heartbeat interval).
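In pika, the heartbeat interval is set on the connection parameters. A configuration sketch; the host and timeout values are illustrative:

```python
import pika

# heartbeat=60: a dead peer is detected within ~120 s (2x the interval).
# blocked_connection_timeout aborts if the broker keeps the connection
# blocked (memory pressure, see Failure 6) for too long.
params = pika.ConnectionParameters(
    host='localhost',
    heartbeat=60,
    blocked_connection_timeout=300,
)
```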
Failure 2: Channel Error (406, 404, 403, etc.)
What happens: the broker closes a channel due to a protocol error.
Common triggers:
- 406 PRECONDITION_FAILED: redeclaring a queue with different properties.
- 404 NOT_FOUND: publishing to a non-existent exchange, or consuming from a non-existent queue.
- 403 ACCESS_REFUSED: user lacks permission.
- 405 RESOURCE_LOCKED: accessing another connection's exclusive queue.
Message fate: messages in flight on the closed channel at the time of error are returned to their queues (unacked) or lost (unconfirmed publishes).
Recovery action: the channel error carries a clear error message. Fix the underlying cause (use the correct exchange name, match the queue declaration parameters, fix permissions) and open a new channel. The connection remains open — only the channel is closed.
Detecting and reacting:
def on_channel_closed(channel, reason):
    # In pika 1.x, reason is an exception carrying reply_code/reply_text:
    #   406: fix declaration parameters
    #   404: queue/exchange does not exist — create it
    #   403: fix user permissions
    print(f"Channel closed: {reason}")

channel.add_on_close_callback(on_channel_closed)
Failure 3: Unroutable Messages (Discarded Silently)
What happens: a message is published to an exchange that has no binding matching the routing key. The message is silently discarded.
Why this is dangerous: the producer gets no error. The broker accepts the publish and confirms it (if confirms are enabled). The message is simply gone.
Example: publishing to exchange events with routing key order.shipped, but no queue is bound with that key. The broker confirms the publish. The message is lost.
Mitigation 1: mandatory=True and basic.return callback:
def on_return(channel, method, properties, body):
    # method.reply_code == 312, method.reply_text == 'NO_ROUTE'
    log.error(f"Unroutable: {method.routing_key}")
    # Publish to a fallback queue or alert

channel.add_on_return_callback(on_return)
channel.basic_publish(
    exchange='events',
    routing_key='order.shipped',
    body=body,
    mandatory=True
)
Mitigation 2: Alternate exchange:
channel.exchange_declare(
    exchange='events',
    exchange_type='topic',
    arguments={'alternate-exchange': 'unrouted'}
)
Unroutable messages go to the unrouted exchange instead of being discarded.
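Declaring the fallback side of this setup might look as follows; the `unrouted` fanout exchange and the `unrouted-messages` queue name are illustrative choices, not prescribed names:

```python
# A fanout exchange delivers every unroutable message to its bound
# queues regardless of routing key.
channel.exchange_declare('unrouted', 'fanout', durable=True)
channel.queue_declare('unrouted-messages', durable=True)
channel.queue_bind('unrouted-messages', 'unrouted')
```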
When confirms mislead: publisher confirms ack after the broker accepts the message — not after routing succeeds. A confirmed message can still be unrouted if mandatory=False. Do not interpret a confirm as “message will be consumed.”
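With pika's BlockingConnection, confirms and mandatory can be combined so that an unroutable publish surfaces as an exception. A sketch assuming an open channel and the events exchange from above:

```python
channel.confirm_delivery()  # enable publisher confirms on this channel
try:
    channel.basic_publish(
        exchange='events',
        routing_key='order.shipped',
        body=body,
        mandatory=True,
    )
except pika.exceptions.UnroutableError:
    # Broker accepted the message, but no queue was bound for the key
    log.error("Unroutable: order.shipped")
except pika.exceptions.NackError:
    # Broker refused the message; re-publish or alert
    log.error("Publish nacked by broker")
```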
Failure 4: Consumer Crash (Unacknowledged Messages)
What happens: a consumer crashes after receiving messages but before acknowledging them.
Message fate: all unacked messages for the crashed consumer’s channel are returned to the queue in ready state. They are re-delivered to another consumer (or the same consumer after reconnect) with redelivered=True.
Risk: the consumer may have partially processed a message. On re-delivery, the consumer must either:
- Re-execute (if the operation is idempotent).
- Check if already processed (using message_id or application state).
def on_message(channel, method, properties, body):
    if method.redelivered and is_already_processed(properties.message_id):
        channel.basic_ack(delivery_tag=method.delivery_tag)
        return
    process(body)
    record_as_processed(properties.message_id)
    channel.basic_ack(delivery_tag=method.delivery_tag)
Prefetch and crash amplification: with prefetch_count=100, a consumer crash returns 100 messages to the queue simultaneously. If multiple consumers crash (thundering herd restart), the queue receives a burst of re-queued messages. This is usually fine — the re-queued messages are just processed by surviving consumers.
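Prefetch is set per channel with basic_qos; a one-line configuration sketch assuming an open channel:

```python
# At most 100 unacked messages in flight per consumer on this channel;
# a consumer crash therefore requeues at most 100 messages.
channel.basic_qos(prefetch_count=100)
```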
Failure 5: Poison Messages (Requeue Loop)
What happens: a message causes the consumer to crash or nack every time it is processed. With requeue=True, the message is immediately re-delivered to the same or another consumer, creating an infinite loop.
Symptoms:
- Queue depth stays constant (message is delivered and re-queued immediately).
- Consumer logs show repeated failures on the same message.
- High CPU on broker and consumer (loop overhead).
- Other messages in the queue are starved (especially with prefetch_count=1).
Mitigation 1: Quorum queue delivery limit:
channel.queue_declare(
    'tasks',
    durable=True,
    arguments={
        'x-queue-type': 'quorum',
        'x-delivery-limit': 5,
        'x-dead-letter-exchange': 'dlx',
    }
)
After 5 delivery attempts, the message is dead-lettered. The consumer can still nack with requeue=True — the broker counts deliveries.
Mitigation 2: Consumer-side retry counting:
def on_message(channel, method, properties, body):
    headers = properties.headers or {}
    retry_count = headers.get('x-retry-count', 0)
    if retry_count >= 5:
        # Give up: dead-letter via the queue's DLX
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
        return
    try:
        process(body)
        channel.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # Ack the original (a nack would dead-letter it as well), then
        # re-publish a copy with an incremented retry count. Assumes the
        # message was published to the default exchange, where the
        # routing key is the queue name.
        channel.basic_ack(delivery_tag=method.delivery_tag)
        headers['x-retry-count'] = retry_count + 1
        channel.basic_publish(
            exchange='',
            routing_key=method.routing_key,
            body=body,
            properties=pika.BasicProperties(headers=headers)
        )
Mitigation 3: Always nack with requeue=False for schema/logic errors: if the message body is malformed or the consumer cannot understand the schema, there is no point in requeueing — it will fail every time. Dead-letter immediately.
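A fail-fast check along these lines can run before any processing. The JSON payload format and the `order_id` field are illustrative assumptions, not part of the protocol:

```python
import json

def requeue_policy(body):
    """Return 'dead-letter' for permanent failures, 'process' otherwise.

    Malformed or schema-violating messages fail identically on every
    delivery, so requeueing them only creates a poison loop.
    """
    try:
        msg = json.loads(body)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return 'dead-letter'  # malformed body: permanent failure
    if not isinstance(msg, dict) or 'order_id' not in msg:
        return 'dead-letter'  # schema violation: permanent failure
    return 'process'
```

In the consumer, a 'dead-letter' verdict maps to basic_nack(requeue=False) so the DLX receives the message immediately.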
Failure 6: Broker Memory Pressure
What happens: the broker’s memory usage exceeds the configured high-watermark (vm_memory_high_watermark, default 40% of RAM).
Phase 1 — blocked: the broker sends connection.blocked to all producer connections. Producers that handle this signal pause publishing.
Phase 2 — throttling: producers that do not handle connection.blocked are TCP-flow-controlled — the broker stops reading from their TCP socket. This causes the TCP send buffer on the producer to fill up, which blocks further writes from the producer’s write() call.
Phase 3 — message paging: the broker begins paging queued messages (including transient ones) from memory to disk to reduce memory pressure. This causes a significant throughput drop.
Message fate during memory pressure: messages already queued are not lost — they may be paged to disk. New publishes from blocked connections are queued in the TCP buffer on the client side. When the broker recovers (memory drops below low-watermark), connection.unblocked is sent and publishing resumes.
Mitigation:
- Monitor queue depth and consumer lag — growing queues indicate producers outpacing consumers.
- Set x-max-length on queues to prevent unbounded growth.
- Use quorum queues (spooled to disk by default, less memory pressure).
- Scale consumers to drain queues faster.
- Configure vm_memory_high_watermark appropriately for the available RAM.
connection.add_on_connection_blocked_callback(
    lambda connection, method: log.warning(f"Connection blocked: {method}")
)
connection.add_on_connection_unblocked_callback(
    lambda connection, method: log.info("Connection unblocked")
)
Failure 7: Consumer Cancellation (Queue Deleted)
What happens: a queue is deleted while one or more consumers are active on it. The broker sends basic.cancel to all consumers on that queue.
Message fate: all messages in the deleted queue are lost. Queue deletion is destructive regardless of queue type — quorum queues do not protect against explicit deletion.
Recovery: the consumer’s basic.cancel callback should handle re-subscription if appropriate. If the queue was intentionally deleted (topology change), the consumer should react by reconnecting to the new queue.
def on_consumer_cancelled(method):
    log.warning(f"Consumer {method.consumer_tag} cancelled by broker")
    # Re-declare queue and re-subscribe if the queue should exist

channel.add_on_cancel_callback(on_consumer_cancelled)
Failure 8: Split Brain in RabbitMQ Clusters
What happens: a network partition separates a RabbitMQ cluster into two groups that cannot communicate. Each partition continues operating independently.
Classic queue behavior during a partition depends on the configured cluster_partition_handling strategy:
- ignore (default): both partitions continue operating. Queues on each side are independent. Messages published to one side are not visible from the other. After healing, the cluster detects divergence and requires manual operator intervention.
- pause_minority: the smaller partition (fewer than half the nodes) pauses and refuses connections. The larger partition continues. No split brain, but reduced availability.
- autoheal: after the partition heals, the cluster automatically picks a winner and discards the other partition's messages.
Quorum queue behavior during partition: quorum queues require a majority of replicas to be available (Chapter 6 of this series). If a partition isolates a minority of replicas, writes to the majority-side queue succeed; writes to the minority-side queue fail (no quorum). On partition heal, Raft reconciles — no messages are lost from the majority side.
Recommendation: use quorum queues for durable messaging in clustered RabbitMQ. Classic mirrored queues have complex split-brain behaviors that require careful operator intervention.
Failure Summary Table
| Failure | Messages affected | Recovery |
|---|---|---|
| Connection drop (producer) | Unconfirmed publishes | Re-publish with confirms; consumers deduplicate |
| Connection drop (consumer) | Unacked messages | Auto re-queued; re-delivered with redelivered=True |
| Channel error (404, 406) | None lost | Fix cause; open new channel |
| Unroutable message | Lost silently | Use mandatory=True or alternate exchange |
| Consumer crash | Unacked → re-queued | Re-delivered; consumer must be idempotent |
| Poison message loop | Blocked in loop | Delivery limit (quorum) or consumer-side retry limit |
| Broker memory pressure | None lost (blocked) | Monitor queue depth; set x-max-length |
| Queue deleted with consumers | All messages in queue | Topology management; consumer cancel callback |
| Cluster partition (classic) | May diverge | Quorum queues; partition handling strategy |
| Cluster partition (quorum) | Minority writes fail | Majority continues; no data loss |
End-to-End Crash Safety Checklist
For a message to be crash-safe across all failure scenarios:
- Queue declared as durable=True
- Message published with delivery_mode=2 (persistent)
- Publisher confirms enabled on producer channel
- Producer re-publishes unconfirmed messages on reconnect
- Consumer uses manual ack (auto_ack=False)
- Consumer checks redelivered flag and is idempotent (message_id deduplication)
- Dead-letter exchange configured for rejected/expired messages
- x-delivery-limit set on quorum queues to handle poison messages
- heartbeat configured (60s recommended) to detect dead connections
- Connection and channel close callbacks implemented for recovery
With all of the above, the system provides at-least-once delivery with effective exactly-once processing through idempotency.
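Putting the checklist together, one possible shape of a crash-safe setup with pika. The broker address, queue name, message body, and the `process`/`is_already_processed`/`record_as_processed` helpers are placeholders, not a prescribed implementation:

```python
import pika

params = pika.ConnectionParameters('localhost', heartbeat=60)
connection = pika.BlockingConnection(params)
channel = connection.channel()

# Durable quorum queue with poison-message and dead-letter handling
channel.queue_declare(
    'tasks',
    durable=True,
    arguments={
        'x-queue-type': 'quorum',
        'x-delivery-limit': 5,
        'x-dead-letter-exchange': 'dlx',
    },
)

# Producer side: confirms + persistent messages
channel.confirm_delivery()
channel.basic_publish(
    exchange='',
    routing_key='tasks',
    body=b'{"order_id": 42}',
    properties=pika.BasicProperties(delivery_mode=2, message_id='order-42'),
)

# Consumer side: manual ack + redelivery deduplication
def on_message(ch, method, properties, body):
    if method.redelivered and is_already_processed(properties.message_id):
        ch.basic_ack(delivery_tag=method.delivery_tag)
        return
    process(body)
    record_as_processed(properties.message_id)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume('tasks', on_message, auto_ack=False)
channel.start_consuming()
```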