101 ways to break your RabbitMQ

~$ whoami

Denis GERMAIN

Senior SRE

Tech blogger blog.zwindler.fr*

@zwindler / @zwindler_rflx

#geek #SF #running


*Slides are available on the blog

A little bit of context

A little bit of context (2)

  • Historically on-prem monoliths

  • Start again, from scratch

    • Migrate to cloud SaaS
    • Switch to microservices
  • How can we secure the messages these µ-services (and on-prem equipment) exchange?

    • RabbitMQ

Why RabbitMQ?

  • Broad range of supported protocols
  • Extra layer of abstraction with « Exchanges »
  • Clustering, high availability and replication features
  • Management UI and REST API
  • On-the-fly configuration

What could possibly go wrong?

Let's break RabbitMQ!

Installation & design

OS parameters

I want 0 downtime, let's put RabbitMQ in cluster!

  • Nodes share all logical objects except queues
  • Only one node owns a queue
    • Nodes know which queue is owned by which node
    • Messages are transparently redirected to the right node

RabbitMQ cluster

Queues are not magically replicated

  • Nodes are still SPOF 😱
  • Queues owned by a downed node become unavailable
    • Non-durable queues are destroyed
    • Durable queues are blocked
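
A toy model of the behaviour above, as a sketch only: it talks to no broker, and the `Cluster` and `Queue` classes and the node names are made up for illustration. It shows that any node can accept a publish for a queue it does not own, and what happens to durable and non-durable queues when the owner goes down.

```python
from dataclasses import dataclass, field


@dataclass
class Queue:
    name: str
    owner: str      # the single node that hosts this queue
    durable: bool


@dataclass
class Cluster:
    nodes: set = field(default_factory=set)
    queues: dict = field(default_factory=dict)

    def declare(self, name, owner, durable):
        self.queues[name] = Queue(name, owner, durable)

    def route(self, connected_node, queue_name):
        """Any node accepts the publish, then forwards it to the owner."""
        queue = self.queues[queue_name]
        if queue.owner not in self.nodes:
            raise RuntimeError(f"{queue_name}: owner {queue.owner} is down")
        return queue.owner

    def node_down(self, node):
        self.nodes.discard(node)
        # Non-durable queues owned by the dead node are destroyed.
        # Durable ones stay declared but unusable until the node is back.
        self.queues = {
            name: q for name, q in self.queues.items()
            if q.owner != node or q.durable
        }
```

With three nodes and two queues owned by the same node, stopping that node leaves the durable queue declared but unreachable, and the non-durable one gone.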

RabbitMQ cluster

What you really want when thinking clusters

2 strategies:

  • Don't configure queues as durable, retry and redeclare
  • Flag some queues as « highly available »
    • "classic" HA queues (1 leader + promotable mirrors)
    • Quorum queues (3.8+, Raft consensus)
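
For the quorum queue option, a minimal sketch of the declaration. It assumes `channel` behaves like a pika `BlockingChannel` (any object with a compatible `queue_declare` works); the helper name `declare_quorum_queue` is invented for this example.

```python
def declare_quorum_queue(channel, name):
    """Declare a replicated quorum queue (RabbitMQ 3.8+).

    `channel` is assumed to expose pika's
    queue_declare(queue=..., durable=..., arguments=...).
    """
    return channel.queue_declare(
        queue=name,
        durable=True,  # quorum queues must be durable
        arguments={"x-queue-type": "quorum"},
    )
```

The queue type is chosen at declaration time through the `x-queue-type` argument, so consumers and publishers do not need to change.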

Beware of the HA queues and parameters

HA queues:

  • need more RAM
  • latency ++ & throughput --
  • multiple parameters (each with implications)
    • ha-mode (how many mirrors)
    • ha-sync-mode (may block queues)
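
To make the two parameters concrete, here is a sketch that builds a `rabbitmqctl set_policy` command for classic mirrored queues. The policy name, the pattern and the helper `ha_policy_command` are illustrative assumptions; the command is only built, not executed.

```python
import json


def ha_policy_command(name, pattern, replicas=2, sync_mode="manual"):
    """Build the argv for a classic mirrored queue policy.

    replicas:  total copies of the queue (leader included), used with
               ha-mode "exactly".
    sync_mode: "manual" (default) or "automatic". Automatic sync makes
               the queue unresponsive while a new mirror catches up.
    """
    definition = {
        "ha-mode": "exactly",
        "ha-params": replicas,
        "ha-sync-mode": sync_mode,
    }
    return [
        "rabbitmqctl", "set_policy", "--apply-to", "queues",
        name, pattern, json.dumps(definition),
    ]
```

For example, `ha_policy_command("ha-two", r"^ha\.", replicas=2, sync_mode="automatic")` targets every queue whose name starts with `ha.`.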

Use an odd number of nodes
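
The arithmetic behind this advice, as a small sketch (the function names are made up for the example): a majority-based cluster needs more than half of its nodes alive, so an even node count adds no extra failure tolerance.

```python
def majority(nodes):
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes):
    """How many nodes can be lost while a majority still remains."""
    return nodes - majority(nodes)
```

`tolerated_failures(3)` and `tolerated_failures(4)` both return 1: the fourth node costs resources without allowing one more failure.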

RabbitMQ usage

Security

  • Contexts can be isolated with "vhosts"
  • ACLs (vhosts & queues permissions) can be assigned to users
    • configure / read / write
    • can be applied with regex
  • Useful tip for my past self
    • Put ACLs in place from the beginning!
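
A sketch of what regex-based permissions look like in practice. The vhost, user and patterns are invented for the example, and `is_allowed` is only an approximation of the broker's check (an unanchored regex search on the resource name, with an empty pattern granting nothing on named resources), not its actual implementation.

```python
import re


def set_permissions_command(vhost, user, configure, write, read):
    """Build the rabbitmqctl argv; the regex order is configure, write, read."""
    return ["rabbitmqctl", "set_permissions", "-p", vhost,
            user, configure, write, read]


def is_allowed(pattern, resource):
    """Approximate the broker-side check for one permission regex."""
    # An empty pattern is treated like '^$': it matches no named resource.
    return re.search(pattern or "^$", resource) is not None
```

For instance, a user with the write pattern `^orders\.` can publish to `orders.created` but not to `billing.paid`.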

New connections love TCP packets

Connections / channels

A bit less obvious:

  • Don't open a channel for each publish
  • Don't forget to close channels 😱
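
Both points can be sketched in a few lines. This assumes `connection` behaves like a pika `BlockingConnection` (`channel()`, then `basic_publish(...)` and `close()` on the channel); the helper name `publish_all` is hypothetical.

```python
def publish_all(connection, exchange, routing_key, bodies):
    """Publish every body over ONE channel, and always close that channel."""
    channel = connection.channel()  # opened once, not once per message
    try:
        for body in bodies:
            channel.basic_publish(
                exchange=exchange,
                routing_key=routing_key,
                body=body,
            )
    finally:
        # Runs even if a publish raises, so the channel never leaks.
        channel.close()
```

The same idea applies to connections: keep them long-lived and share them, rather than reconnecting for every message.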

Prefetch

  • Send messages to consumers even if ack hasn't yet been received

  • Useful if your app can process messages really fast or in parallel

  • More messages in the wild when something goes wrong 😈
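
A toy model of the prefetch window, to show why a larger value means more messages at risk. It does not talk to a broker; the functions are invented for illustration. With pika, the real setting would be `channel.basic_qos(prefetch_count=...)`.

```python
from collections import deque


def deliver(queue, prefetch, unacked):
    """Push messages to the consumer until `prefetch` of them await an ack."""
    while queue and len(unacked) < prefetch:
        unacked.append(queue.popleft())


def consumer_crashed(queue, unacked):
    """Put every unacked message back at the head of the queue.

    Returns how many messages have to be redelivered.
    """
    requeued = len(unacked)
    while unacked:
        queue.appendleft(unacked.pop())
    return requeued
```

With `prefetch=3`, a consumer that dies before acking sends 3 messages back for redelivery; with `prefetch=300` it would be up to 300.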

Examples of issues with prefetch

  • Don't assume your app/framework correctly handles parallel processing of messages
  • Don't forget to catch errors (especially in threads)
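
One possible way to avoid silent failures in worker threads, as a sketch using only the standard library. The names `safe_handle` and `process_batch` are hypothetical, and the actual ack/nack calls on the channel are left out.

```python
from concurrent.futures import ThreadPoolExecutor


def safe_handle(handler, message):
    """Run the handler and turn any exception into an explicit decision."""
    try:
        handler(message)
        return (message, "ack")
    except Exception:
        # Without this, the exception would stay inside the worker thread
        # and the message would remain unacked indefinitely.
        return (message, "nack")


def process_batch(handler, messages, workers=4):
    """Process prefetched messages in parallel; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda m: safe_handle(handler, m), messages))
```

Every message ends up with an explicit outcome, so the consumer can ack or reject it instead of leaving it stuck.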

Leverage the power of TTLs and DLQs

  • Keep your queues short and empty them quickly
    • If possible, add TTLs, a max queue length and a max message size
    • Be careful with reject + requeue: a message that always fails will loop forever
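
A sketch of these limits expressed as queue arguments, plus one simple guard against requeue loops. The helper names and the values are assumptions; the `x-` argument keys are the ones RabbitMQ reads at queue declaration. The max message size is a broker-level setting and is not covered here.

```python
def bounded_queue_arguments(ttl_ms, max_length, dead_letter_exchange):
    """Arguments to pass when declaring a short-lived, bounded queue."""
    return {
        "x-message-ttl": ttl_ms,          # messages expire after this delay
        "x-max-length": max_length,       # oldest messages dropped beyond this
        "x-dead-letter-exchange": dead_letter_exchange,  # where they go next
    }


def should_requeue(redelivered):
    """Requeue a failed message only the first time it fails.

    `redelivered` is the flag the broker sets on a message that was
    already delivered once. Rejecting without requeue sends the message
    to the dead letter exchange instead of looping.
    """
    return not redelivered
```

Expired, dropped and rejected messages then land in the dead letter queue, where they can be inspected without blocking live traffic.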

Observe

  • RabbitMQ management plugin is cool
    • put it on more than one node
  • Grafana dashboards are better
    • Use prometheus to scrape your cluster/queues
    • 3.8+: rabbitmq-prometheus plugin rather than external exporters
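
A sketch of what can be done with the scraped data: parse Prometheus text-format samples and flag queues with a backlog. It assumes per-queue metrics are exposed, that sample lines carry no trailing timestamp, and that the text was fetched from the plugin's `/metrics` endpoint (port 15692 by default). The helper names and the threshold are illustrative.

```python
def parse_samples(text):
    """Map each 'name{labels}' series to its value, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is whatever follows the last space on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def backlog_alerts(samples, threshold):
    """Return the series whose ready-message count exceeds the threshold."""
    return sorted(
        series for series, value in samples.items()
        if series.split("{", 1)[0] == "rabbitmq_queue_messages_ready"
        and value > threshold
    )
```

In a real setup this logic would live in a Prometheus alerting rule or a Grafana panel rather than in application code.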

“Now, I know what to fix!”

Conclusion

  • RabbitMQ clusters aren't what you think they are
    • What you want are quorum queues / HA queues
  • RabbitMQ ain't Kafka
    • don't store millions of messages for a long time
  • Use connections / channels wisely
  • BONUS:
    • Secure from the start (ACLs)
    • observe, observe, observe!!

That's all folks

Do you have any questions?

Bibliography

Best practices and advice

Deploying highly available message brokers in a cloud-based environment with no prior experience has brought its share of surprises.