You should have a good understanding of message queues. Some examples are Kafka, RabbitMQ, JMS, and ZeroMQ.
Generally speaking a message queue or message broker is a dedicated system to allow distributed systems to “produce”, or publish, to a centralized service and for “consumers” to read those messages without having to directly talk to each other. It enables the producers and consumers to be decoupled. Producers can continue writing their data to the broker (almost always a highly available cluster in production environments) even if there are no consumers reading that data. It enables consumers to stay up and connected to their upstream data source, the broker, even if all of the consumers are down either for maintenance or because of an outage.
Following are the basic asynchronous message patterns:
- Message Queues: Producers generate messages and write them to the queue. Consumers asynchronously connect to and consume the messages, removing the messages from the queue. Each message (typically) is consumed by only one consumer.
- Publish/Subscribe: Producers generate messages and write them to the queue. Multiple Consumers asynchronously connect to and consume each message. This pattern typically includes message ordering and retains the messages for some time such that Consumers can re-read messages if required.
The primary use case for RabbitMQ is when there is a need for more complicated filtering and routing of messages and this is where RabbitMQ excels; the ability to define specific routing needs and per message guarantees. Rules can be chained together to enable RabbitMQ to filter and route messages and subsets of messages to different queues to enable multiple Consumers to read different groups of messages. Messages are written to Exchanges instead of directly to queues.
- Intelligent broker/Dumb Consumer: Details on message routing is encapsulated in the broker and not consumer.
- General purpose Broker
- Push Based: Messages are pushed to consumers
- Exchanges: Producers write messages to exchanges which are then routed to queues or other exchanges based on header attributes, bindings and routing keys. Consumers then read from queues. A binding defines a link between an exchange and a queue. A routing key is an attribute of the message that the exchange uses to determine how to route the message. Unlike Kafka, clients can create their own exchanges. RabbitMQ includes the following exchange types:
- Direct: A queue is bound to an exchange based on a binding key which matches a routing key that is contained in the message. Any message that arrives at a Direct exchange is filtered based on the defined routing key and then delivered to the configured queue that matches the binding/routing key. Multiple bindings can be defined for a Direct exchange.
- Fanout: Routes copies of messages to all of the queues that are bound to the exchange regardless of the routing key. This is used when the same message needs to be processed by multiple consumers that each act on the data in different ways.
- Topic: Messages are routed to multiple queues based on a wildcard matches between routing keys and a defined routing pattern specified in the queue binding.
- Headers: Messages are routed based on message header values instead of routing keys. Messages are routed to a specific queue if the header matches a valued specified in the binding.
- Dead Letter Exchange: An exchange that will capture and process messages for which no matching queue can be found. Without a DLE defined messages will otherwise be silently dropped.
- Queues: FIFO based queues.
- Complex Routing: As mentioned before, this is the key discriminator for RabbitMQ. It enables the centralized processing and routing of messages such that consumers do not need to filter the messages and then re-publish them back to the broker for other consumers to read.
- Where message ordering is not required
- Where message priorities are required
The primary use case for Kafka is as a data bus and it is optimized for raw throughput. The basic concept is a distributed append only log that multiple consumers can read from based on a given offset in that log. Kafka uses a basic pub/sub architecture and does not include the routing features of RabbitMQ.
- Scalability: Kafka’s architecture enables significant horizontal scalability. Additionally, it can be distributed across data centers to provide for DR solutions.
- Fast: Kafka is super fast and can handle enormous throughput.
- Message Durability: Messages are stored and available to “re-read”
- Availability: With a multiple node cluster Kafka can provide a HA solution for messaging with very little effort
- Dumb Broker/Intelligent Consumer: The onus is on the consumer to determine details on how messages are to be processed and/or re-routed.
- Pull Based: Consumers pull messages from the broker.
- Guaranteed Message Ordering
Kafka uses topics instead of queues. There are no exchanges in Kafka and message “routing” is more simplified. Topics are divided into partitions which enable multiple consumers to read a subset of the data partitioned by a predefined key. For example: we have a retail store that streams sales events from each store to Kafka. Each message includes the store id in the message. The store id is used as the source data for a hashing algorithm so that all of the messages from “store1” always end up on the same partitions. This enables a large data set to be divided up such that all of the messages on a given partition for a given key can be read by a single consumer, while enabling horizontal scalability as the number of messages and/or keys increase.
Messages are written to a log and the consumer’s offset in that log is maintained. This enables consumers to pick up where they left off, and/or “rewind” and re-read messages if necessary as long as that data has not been aged out of the log.
- High Throughput Data Ingest
- Stream Processing: There are a number of libraries and APIs that enable stream processing, the generation of running counts, averages, and other computations on data in-flight.
- Event Reconciliation: Because Kafka can retain messages for some amount of time (limited based on the disk size on which the partition resides), it enables the ability to rewind and re-calculate or re-play a series of events to either validate other processes or re-process events in the event of data corruption of already consumed and processed output data.
- Log Aggregatons
- Where message ordering is required