Failure Modes and What They Mean
AMQP systems fail in specific, well-defined ways. Each failure mode has a different implication for message delivery — some lose messages, some cause duplicates, some cause re-ordering. Understanding them precisely is the difference between a messaging system that works and one that silently loses data.
Failure 1: Connection Drop
What happens: the TCP connection between client and broker is lost — network failure, OS kill, broker restart, or heartbeat timeout.
From the broker’s perspective: all unacknowledged messages on all channels of that connection are returned to their queues (ready state). All pending publisher confirms for that connection are cancelled — the producer does not know if its last published messages were committed.
From the client’s perspective: the client’s channels become invalid. Any in-flight method responses are lost.
Message fate:
- Messages published but not yet received by broker: lost (the TCP buffer may have been dropped).
- Messages received by broker but not yet confirmed: unknown — may or may not have been persisted.
- Messages delivered to consumer but not acked: re-queued, re-delivered to another consumer.
- Messages delivered and acked before the drop: safely delivered, ack was processed.
Recovery action:
- Reconnect with exponential backoff.
- Re-open channels.
- Re-declare consumers (basic.consume).
- Re-publish any unconfirmed messages (at-least-once — may cause duplicates; consumers must be idempotent).
- Re-declare topology if using ephemeral queues.
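The reconnect step is commonly implemented as an exponential-backoff loop. A minimal sketch; the `connect` callable, delay values, and `max_delay` default are illustrative, and in practice `connect` would wrap something like `lambda: pika.BlockingConnection(params)`:

```python
import time

def connect_with_backoff(connect, max_delay=30.0, sleep=time.sleep):
    """Retry connect() until it succeeds, doubling the delay up to max_delay.

    connect: a zero-argument callable, e.g.
             lambda: pika.BlockingConnection(params)
    """
    delay = 1.0
    while True:
        try:
            return connect()
        except Exception:
            # Connection failed: wait, then retry with a longer delay
            sleep(delay)
            delay = min(delay * 2.0, max_delay)
```

After the connection is re-established, the remaining recovery steps still apply: re-open channels, re-declare consumers, and re-publish unconfirmed messages.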
Heartbeat configuration matters: without heartbeats (heartbeat=0), connection drops are detected only when the next TCP write fails — which could be seconds or minutes later. With heartbeat=60, drops are detected within ~120 seconds (2× the heartbeat interval).
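In pika, the heartbeat interval is set on the connection parameters. A configuration sketch; the host and timeout values are illustrative:

```python
import pika

# heartbeat=60: a dead peer is detected within ~120 s (2x the interval).
# blocked_connection_timeout aborts if the broker keeps the connection
# blocked (memory pressure, see Failure 6) for too long.
params = pika.ConnectionParameters(
    host='localhost',
    heartbeat=60,
    blocked_connection_timeout=300,
)
```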
Failure 2: Channel Error (406, 404, 403, etc.)
What happens: the broker closes a channel due to a protocol error.
Common triggers:
- 406 PRECONDITION_FAILED: redeclaring a queue with different properties.
- 404 NOT_FOUND: publishing to a non-existent exchange, or consuming from a non-existent queue.
- 403 ACCESS_REFUSED: user lacks permission.
- 405 RESOURCE_LOCKED: accessing another connection's exclusive queue.
Message fate: messages in flight on the closed channel at the time of error are returned to their queues (unacked) or lost (unconfirmed publishes).
Recovery action: the channel error carries a clear error message. Fix the underlying cause (use the correct exchange name, match the queue declaration parameters, fix permissions) and open a new channel. The connection remains open — only the channel is closed.
Detecting and reacting:
def on_channel_closed(channel, reason):
    # In pika 1.x, reason is an exception carrying reply_code/reply_text:
    #   406: fix declaration parameters
    #   404: queue/exchange does not exist — create it
    #   403: fix user permissions
    print(f"Channel closed: {reason}")

channel.add_on_close_callback(on_channel_closed)
Failure 3: Unroutable Messages (Discarded Silently)
What happens: a message is published to an exchange that has no binding matching the routing key. The message is silently discarded.
Why this is dangerous: the producer gets no error. The broker accepts the publish and confirms it (if confirms are enabled). The message is simply gone.
Example: publishing to exchange events with routing key order.shipped, but no queue is bound with that key. The broker confirms the publish. The message is lost.
Mitigation 1: mandatory=True and basic.return callback:
def on_return(channel, method, properties, body):
    # method.reply_code == 312, method.reply_text == 'NO_ROUTE'
    log.error(f"Unroutable: {method.routing_key}")
    # Publish to a fallback queue or alert

channel.add_on_return_callback(on_return)
channel.basic_publish(
    exchange='events',
    routing_key='order.shipped',
    body=body,
    mandatory=True
)
Mitigation 2: Alternate exchange:
channel.exchange_declare(
    exchange='events',
    exchange_type='topic',
    arguments={'alternate-exchange': 'unrouted'}
)
Unroutable messages go to the unrouted exchange instead of being discarded.
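Declaring the fallback side of this setup might look as follows; the `unrouted` fanout exchange and the `unrouted-messages` queue name are illustrative choices, not prescribed names:

```python
# A fanout exchange delivers every unroutable message to its bound
# queues regardless of routing key.
channel.exchange_declare('unrouted', 'fanout', durable=True)
channel.queue_declare('unrouted-messages', durable=True)
channel.queue_bind('unrouted-messages', 'unrouted')
```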
When confirms mislead: publisher confirms ack after the broker accepts the message — not after routing succeeds. A confirmed message can still be unrouted if mandatory=False. Do not interpret a confirm as “message will be consumed.”
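With pika's BlockingConnection, confirms and mandatory can be combined so that an unroutable publish surfaces as an exception. A sketch assuming an open channel and the events exchange from above:

```python
channel.confirm_delivery()  # enable publisher confirms on this channel
try:
    channel.basic_publish(
        exchange='events',
        routing_key='order.shipped',
        body=body,
        mandatory=True,
    )
except pika.exceptions.UnroutableError:
    # Broker accepted the message, but no queue was bound for the key
    log.error("Unroutable: order.shipped")
except pika.exceptions.NackError:
    # Broker refused the message; re-publish or alert
    log.error("Publish nacked by broker")
```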
Failure 4: Consumer Crash (Unacknowledged Messages)
What happens: a consumer crashes after receiving messages but before acknowledging them.
Message fate: all unacked messages for the crashed consumer’s channel are returned to the queue in ready state. They are re-delivered to another consumer (or the same consumer after reconnect) with redelivered=True.
Risk: the consumer may have partially processed a message. On re-delivery, the consumer must either:
- Re-execute (if the operation is idempotent).
- Check if already processed (using message_id or application state).
def on_message(channel, method, properties, body):
    if method.redelivered and is_already_processed(properties.message_id):
        channel.basic_ack(delivery_tag=method.delivery_tag)
        return
    process(body)
    record_as_processed(properties.message_id)
    channel.basic_ack(delivery_tag=method.delivery_tag)
Prefetch and crash amplification: with prefetch_count=100, a consumer crash returns 100 messages to the queue simultaneously. If multiple consumers crash (thundering herd restart), the queue receives a burst of re-queued messages. This is usually fine — the re-queued messages are just processed by surviving consumers.
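Prefetch is set per channel with basic_qos; a one-line configuration sketch assuming an open channel:

```python
# At most 100 unacked messages in flight per consumer on this channel;
# a consumer crash therefore requeues at most 100 messages.
channel.basic_qos(prefetch_count=100)
```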
Failure 5: Poison Messages (Requeue Loop)
What happens: a message causes the consumer to crash or nack every time it is processed. With requeue=True, the message is immediately re-delivered to the same or another consumer, creating an infinite loop.
Symptoms:
- Queue depth stays constant (message is delivered and re-queued immediately).
- Consumer logs show repeated failures on the same message.
- High CPU on broker and consumer (loop overhead).
- Other messages in the queue are starved (especially with prefetch_count=1).
Mitigation 1: Quorum queue delivery limit:
channel.queue_declare(
    'tasks',
    durable=True,
    arguments={
        'x-queue-type': 'quorum',
        'x-delivery-limit': 5,
        'x-dead-letter-exchange': 'dlx',
    }
)
After 5 delivery attempts, the message is dead-lettered. The consumer can still nack with requeue=True — the broker counts deliveries.
Mitigation 2: Consumer-side retry counting:
def on_message(channel, method, properties, body):
    headers = properties.headers or {}
    retry_count = headers.get('x-retry-count', 0)
    if retry_count >= 5:
        # Give up: dead-letter via the queue's DLX
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
        return
    try:
        process(body)
        channel.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # Ack the original (a nack would dead-letter it as well), then
        # re-publish a copy with an incremented retry count. Assumes the
        # message was published to the default exchange, where the
        # routing key is the queue name.
        channel.basic_ack(delivery_tag=method.delivery_tag)
        headers['x-retry-count'] = retry_count + 1
        channel.basic_publish(
            exchange='',
            routing_key=method.routing_key,
            body=body,
            properties=pika.BasicProperties(headers=headers)
        )
Mitigation 3: Always nack with requeue=False for schema/logic errors: if the message body is malformed or the consumer cannot understand the schema, there is no point in requeueing — it will fail every time. Dead-letter immediately.
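A fail-fast check along these lines can run before any processing. The JSON payload format and the `order_id` field are illustrative assumptions, not part of the protocol:

```python
import json

def requeue_policy(body):
    """Return 'dead-letter' for permanent failures, 'process' otherwise.

    Malformed or schema-violating messages fail identically on every
    delivery, so requeueing them only creates a poison loop.
    """
    try:
        msg = json.loads(body)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return 'dead-letter'  # malformed body: permanent failure
    if not isinstance(msg, dict) or 'order_id' not in msg:
        return 'dead-letter'  # schema violation: permanent failure
    return 'process'
```

In the consumer, a 'dead-letter' verdict maps to basic_nack(requeue=False) so the DLX receives the message immediately.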
Failure 6: Broker Memory Pressure
What happens: the broker’s memory usage exceeds the configured high-watermark (vm_memory_high_watermark, default 40% of RAM).
Phase 1 — blocked: the broker sends connection.blocked to all producer connections. Producers that handle this signal pause publishing.
Phase 2 — throttling: producers that do not handle connection.blocked are TCP-flow-controlled — the broker stops reading from their TCP socket. This causes the TCP send buffer on the producer to fill up, which blocks further writes from the producer’s write() call.
Phase 3 — message paging: the broker begins paging queued messages (including transient ones) from memory to disk to reduce memory pressure. This causes a significant throughput drop.
Message fate during memory pressure: messages already queued are not lost — they may be paged to disk. New publishes from blocked connections are queued in the TCP buffer on the client side. When the broker recovers (memory drops below low-watermark), connection.unblocked is sent and publishing resumes.
Mitigation:
- Monitor queue depth and consumer lag — growing queues indicate producers outpacing consumers.
- Set x-max-length on queues to prevent unbounded growth.
- Use quorum queues (spooled to disk by default, less memory pressure).
- Scale consumers to drain queues faster.
- Configure vm_memory_high_watermark appropriately for the available RAM.
connection.add_on_connection_blocked_callback(
    lambda connection, method: log.warning(f"Connection blocked: {method}")
)
connection.add_on_connection_unblocked_callback(
    lambda connection, method: log.info("Connection unblocked")
)
Failure 7: Consumer Cancellation (Queue Deleted)
What happens: a queue is deleted while one or more consumers are active on it. The broker sends basic.cancel to all consumers on that queue.
Message fate: all messages in the deleted queue are lost. Queue deletion is destructive regardless of queue type — quorum queues do not protect against explicit deletion.
Recovery: the consumer’s basic.cancel callback should handle re-subscription if appropriate. If the queue was intentionally deleted (topology change), the consumer should react by reconnecting to the new queue.
def on_consumer_cancelled(method):
    log.warning(f"Consumer {method.consumer_tag} cancelled by broker")
    # Re-declare queue and re-subscribe if the queue should exist

channel.add_on_cancel_callback(on_consumer_cancelled)
Failure 8: Split Brain in RabbitMQ Clusters
What happens: a network partition separates a RabbitMQ cluster into two groups that cannot communicate. Each partition continues operating independently.
Classic queue behavior during a partition depends on the configured cluster_partition_handling strategy:
- ignore (default): both partitions continue operating. Queues on each side are independent. Messages published to one side are not visible from the other. After healing, the cluster detects divergence and requires manual operator intervention.
- pause_minority: the smaller partition (fewer than half the nodes) pauses and refuses connections. The larger partition continues. No split brain, but reduced availability.
- autoheal: after the partition heals, the cluster automatically picks a winner and discards the other partition's messages.
Quorum queue behavior during partition: quorum queues require a majority of replicas to be available (Chapter 6 of this series). If a partition isolates a minority of replicas, writes to the majority-side queue succeed; writes to the minority-side queue fail (no quorum). On partition heal, Raft reconciles — no messages are lost from the majority side.
Recommendation: use quorum queues for durable messaging in clustered RabbitMQ. Classic mirrored queues have complex split-brain behaviors that require careful operator intervention.
Failure Summary Table
| Failure | Messages affected | Recovery |
|---|---|---|
| Connection drop (producer) | Unconfirmed publishes | Re-publish with confirms; consumers deduplicate |
| Connection drop (consumer) | Unacked messages | Auto re-queued; re-delivered with redelivered=True |
| Channel error (404, 406) | None lost | Fix cause; open new channel |
| Unroutable message | Lost silently | Use mandatory=True or alternate exchange |
| Consumer crash | Unacked → re-queued | Re-delivered; consumer must be idempotent |
| Poison message loop | Blocked in loop | Delivery limit (quorum) or consumer-side retry limit |
| Broker memory pressure | None lost (blocked) | Monitor queue depth; set x-max-length |
| Queue deleted with consumers | All messages in queue | Topology management; consumer cancel callback |
| Cluster partition (classic) | May diverge | Quorum queues; partition handling strategy |
| Cluster partition (quorum) | Minority writes fail | Majority continues; no data loss |
End-to-End Crash Safety Checklist
For a message to be crash-safe across all failure scenarios:
- Queue declared as durable=True
- Message published with delivery_mode=2 (persistent)
- Publisher confirms enabled on producer channel
- Producer re-publishes unconfirmed messages on reconnect
- Consumer uses manual ack (auto_ack=False)
- Consumer checks redelivered flag and is idempotent (message_id deduplication)
- Dead-letter exchange configured for rejected/expired messages
- x-delivery-limit set on quorum queues to handle poison messages
- heartbeat configured (60s recommended) to detect dead connections
- Connection and channel close callbacks implemented for recovery
With all of the above, the system provides at-least-once delivery with effective exactly-once processing through idempotency.
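Putting the checklist together, one possible shape of a crash-safe setup with pika. The broker address, queue name, message body, and the `process`/`is_already_processed`/`record_as_processed` helpers are placeholders, not a prescribed implementation:

```python
import pika

params = pika.ConnectionParameters('localhost', heartbeat=60)
connection = pika.BlockingConnection(params)
channel = connection.channel()

# Durable quorum queue with poison-message and dead-letter handling
channel.queue_declare(
    'tasks',
    durable=True,
    arguments={
        'x-queue-type': 'quorum',
        'x-delivery-limit': 5,
        'x-dead-letter-exchange': 'dlx',
    },
)

# Producer side: confirms + persistent messages
channel.confirm_delivery()
channel.basic_publish(
    exchange='',
    routing_key='tasks',
    body=b'{"order_id": 42}',
    properties=pika.BasicProperties(delivery_mode=2, message_id='order-42'),
)

# Consumer side: manual ack + redelivery deduplication
def on_message(ch, method, properties, body):
    if method.redelivered and is_already_processed(properties.message_id):
        ch.basic_ack(delivery_tag=method.delivery_tag)
        return
    process(body)
    record_as_processed(properties.message_id)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume('tasks', on_message, auto_ack=False)
channel.start_consuming()
```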