Kafka Architecture - Draft
What is Kafka?
- Stream-processing engine
- Written in Scala and Java
- Aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds
- Used for real-time data pipelines and streaming apps
- Built on top of the ZooKeeper synchronization service
Popular messaging queues in the market
- Kafka - Scalable, fault tolerant, high R/W throughput, no single point of failure, durable (messages are persisted on disk and replicated within the cluster to prevent data loss), supports compression and data retention
- Handles hundreds of megabytes of reads and writes per second from thousands of clients
- No master/slave architecture - no single point of failure
- Guarantees ordering of messages - provides a total order of messages within a partition
- The same role is assigned to all nodes in the cluster
- Compression saves storage and improves processing performance
- Benchmarks with records of 2-3 KB each suggest Kafka's throughput is several times that of comparable brokers:
- Kafka - ~90K messages per second
- RabbitMQ - ~2K messages per second
- ActiveMQ -
- HornetQ - ~20K messages per second
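The compression point above can be illustrated with a small sketch. This uses Python's `gzip` as a stand-in for Kafka's built-in codecs (gzip/snappy/lz4/zstd); the record contents are made up for the demo:

```python
import gzip

# A batch of repetitive, log-like records, similar to what a producer batches up.
batch = "\n".join(f"user-{i % 10} clicked item-{i % 7}" for i in range(1000)).encode()

# Compressing the batch before it is stored/transferred saves disk and network.
compressed = gzip.compress(batch)

ratio = len(batch) / len(compressed)
print(f"raw={len(batch)} bytes, compressed={len(compressed)} bytes, ratio={ratio:.1f}x")
```

Because log-style data is highly repetitive, the compressed batch is a small fraction of the raw size, which is where the storage and throughput savings come from.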
What will happen if the topic size reaches the broker disk size?
A partition is a logical division of a topic.
A broker can contain multiple partitions of a topic.
Partitions are replicated, not topics; i.e., the partitions of a topic are replicated.
One replica of a partition is the leader (master partition); the other replicas of the same partition are treated as followers.
Reads and writes happen on the leader only, i.e., producers and consumers operate on the leader/primary/master partition.
Each partition replica can be a leader or a follower.
The leader is responsible for all writes and reads of the partition.
Replication provides fault tolerance.
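The leader/follower behavior above can be sketched in a few lines. This is an illustrative model (class and broker names are hypothetical), not Kafka internals:

```python
# Sketch of leader-based reads/writes with follower replication.
class Partition:
    def __init__(self, replicas):
        self.leader = replicas[0]           # one replica acts as the leader
        self.followers = replicas[1:]       # the rest are followers
        self.logs = {r: [] for r in replicas}

    def write(self, record):
        # All writes go to the leader; followers replicate the leader's log.
        self.logs[self.leader].append(record)
        for f in self.followers:
            self.logs[f].append(record)

    def read(self, offset):
        # Reads are also served by the leader.
        return self.logs[self.leader][offset]

p = Partition(["broker-1", "broker-2", "broker-3"])
p.write("m0")
p.write("m1")
print(p.read(1))           # served from the leader's log
print(p.logs["broker-2"])  # followers hold identical copies, giving fault tolerance
```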
Mirroring is used for geographical replication.
=> Replication is within a geographical location; mirroring is across geographical locations.
The broker is completely independent of the producer and consumer - e.g., it does not track how much data has been consumed by a consumer.
Reading by the consumer: the recommended or ideal scenario is that one partition is read by one consumer instance from a consumer group.
Within a consumer group, data read by one instance won't be read by the other instances.
The same record can be read by instances in different consumer groups, e.g., record 1 is read by C2 in consumer group A, and the same record can also be read in consumer group B.
Fault tolerance is achieved within the consumer group.
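The consumer-group semantics above can be sketched as a partition assignment: within one group, each partition goes to exactly one consumer; a separate group gets its own full assignment, so it sees every record again. The `assign` helper and names are hypothetical, not the Kafka API:

```python
from collections import defaultdict

def assign(partitions, consumers):
    """Round-robin the partitions of a topic over one group's consumers."""
    assignment = defaultdict(list)
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return dict(assignment)

partitions = ["p0", "p1", "p2", "p3"]
group_a = assign(partitions, ["A-c1", "A-c2"])  # group A: each consumer gets 2 partitions
group_b = assign(partitions, ["B-c1"])          # group B: its lone consumer gets all 4

print(group_a)  # {'A-c1': ['p0', 'p2'], 'A-c2': ['p1', 'p3']}
print(group_b)  # {'B-c1': ['p0', 'p1', 'p2', 'p3']}
```

Note that no partition appears under two consumers of the same group, which is why a record is consumed at most once per group, while both groups independently cover all partitions.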
Why Kafka?
- We can decouple the rate at which the messages are produced and transferred to Kafka server from the rate at which the messages are consumed.
- High throughput, scalable, fault-tolerance, built-in partition and replication.
- Unified platform for handling all real-time data feeds
Customers who use it:
- Loggly - Popular cloud-based log management and analytics service provider
- RichRelevance - Personalized customer experience for shopping
- 100,000+ events per second
- Messages need to be preserved for several hours
- Guaranteed logging; no record loss is acceptable
- No single point of failure; age-based retention to purge old data on disks
- Moving TBs of data every day without a single record failure
- Low latency - 99.99999% of data is served from the disk cache + RAM and rarely hits the disk
- Performance - One consumer group (eight threads) can process about 200,000 events per second, draining from 192 partitions spread across three brokers
- Scalability - The partition count per topic can be increased, and new nodes can be added at runtime
- Can we increase the number of topics at runtime?
- LinkedIn - For activity stream data and operational metrics
- Powers the newsfeed and LinkedIn Today, in addition to offline analytics in systems like Hadoop
- Twitter - As part of its Storm stream-processing engine
- Netflix - Real-time monitoring and event-processing pipeline
- Sematext, Foursquare, Cisco, LinkSmart, Ooyala
Definition: A publish-subscribe messaging system which is distributed, scalable, fault-tolerant, partitioned, replicated, and a highly available log service.
ISR - In-Sync Replicas
ZooKeeper
- High-performance coordination service for distributed applications
- Cluster coordinator, which stores the topic offset IDs
- Used for communication between nodes
If the preferred replica is not an in-sync replica, the controller will fail to move leadership to the preferred replica.
The maximum size of a message that Kafka can receive is 1,000,000 bytes (about 1 MB) by default.
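This limit is configurable rather than fixed. A minimal sketch of the relevant settings (values shown match the ~1 MB default mentioned above; check your Kafka version's documentation for exact defaults):

```properties
# server.properties (broker) - maximum record batch size the broker accepts
message.max.bytes=1000000

# producer config - the producer's request size must stay within the broker limit
max.request.size=1000000
```

Raising the broker limit without also raising the consumer's fetch size can leave consumers unable to fetch the larger messages, so these settings are usually changed together.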
Types of message transfer methods: queuing and publish-subscribe.
Types of APIs: Producer, Consumer, Connector, Stream Processors
- Producer API - Permits an application to publish a stream of records to one or more Kafka topics.
- Consumer API - Allows an application to subscribe to one or more topics and process the stream of records produced to them.
- Streams API - Allows an application to transform input streams into output streams.
- Connector API - Allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems, e.g., a connector to an RDBMS might capture every change to a table.
A topic is a category or feed name to which messages are published.
Preferred replica, in-sync replica.
The Kafka cluster retains all published messages, whether or not they have been consumed, for a configured period of time.
Communication between the clients and servers is done with a simple, high-performance, language-agnostic TCP protocol.
ISR - In-Sync Replicas: the subset of replicas that are currently alive and caught up to the leader.
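The ISR definition can be sketched as a filter over replica log-end offsets. The function and names here are illustrative, not Kafka internals:

```python
# A replica is "in sync" if it is caught up to the leader's log end offset
# (within an allowed lag).
def in_sync_replicas(leader_end_offset, replica_offsets, max_lag=0):
    return [r for r, end in replica_offsets.items()
            if leader_end_offset - end <= max_lag]

offsets = {"r1": 100, "r2": 100, "r3": 42}  # r3 has fallen behind
print(in_sync_replicas(100, offsets))       # ['r1', 'r2']
```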
The offset is controlled by the consumer; normally the consumer will advance the offset linearly as it consumes.
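Because the consumer owns its offset, it can also rewind and replay. A minimal sketch of that control loop (plain Python lists standing in for a partition log):

```python
# A partition log and a consumer-controlled position into it.
log = ["m0", "m1", "m2", "m3"]
offset = 0

consumed = []
while offset < len(log):
    consumed.append(log[offset])
    offset += 1        # the consumer advances its own offset linearly

offset = 2             # rewinding the offset replays records from that point
print(consumed, log[offset])
```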
Each Kafka partition has one server which acts as the leader.
Per-partition ordering combined with the ability to partition data by key is sufficient for most applications.
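Partitioning by key works because the same key always hashes to the same partition, so all records for a key stay ordered within one partition. Kafka's default partitioner uses murmur2; a simple CRC stands in for it in this sketch:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Deterministic hash of the key, mapped onto the partition count.
    return zlib.crc32(key) % num_partitions

p1 = partition_for(b"user-42", 6)
p2 = partition_for(b"user-42", 6)
assert p1 == p2   # same key -> same partition -> per-key ordering is preserved
print(p1)
```

Note the flip side: increasing the partition count changes where keys land, which is one reason repartitioning an existing keyed topic needs care.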
A topic is nothing but a category.
Allows a large number of permanent or ad-hoc consumers.
Kafka vs Flume:
- Flume - does not buffer/persist the data
- Kafka - persists the data
Possible configurations are:
- Broker coordination, for master/slave configurations
- log.retention (e.g., log.retention.hours) is the amount of time to keep a log segment before it is deleted
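The age-based retention behind `log.retention` can be sketched as a purge over segment timestamps. The function and segment names are illustrative, not Kafka's implementation:

```python
import time

def purge_old_segments(segments, retention_seconds, now=None):
    """segments: list of (created_at_epoch, name); keeps only segments
    younger than the retention window, like age-based log cleanup."""
    now = time.time() if now is None else now
    return [(ts, name) for ts, name in segments if now - ts <= retention_seconds]

now = 1_000_000
segments = [(now - 90_000, "seg0"),  # ~25 hours old -> purged at 24h retention
            (now - 3_600, "seg1"),   # 1 hour old    -> kept
            (now - 10, "seg2")]      # 10 seconds old -> kept
kept = purge_old_segments(segments, retention_seconds=86_400, now=now)
print([name for _, name in kept])   # ['seg1', 'seg2']
```

This is also how Kafka answers the disk-size question raised earlier: old segments are deleted on schedule, so consumed-or-not, data older than the retention window is reclaimed.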
Architecture:
Components of Kafka:
Topic:
- A collection of messages is a topic
- Replica refers to a copy (of a partition)
- Replication and partitioning enable fault tolerance and scalability
Producer:
- Publishes the message(s) to topics
Consumer:
- Subscribes to topics
- Reads the messages from the topic
- Processes the messages
Broker:
- Manages storage of messages in the topic(s)
If Kafka has more than one broker, it is called a Kafka cluster.
Zookeeper:
- Provides the brokers with metadata about the processes running in the system
- Facilitates health checking
- Broker leadership election
Partition:
- A partition is a logical division of a topic
Kafka Use cases:
Kafka Broker: A node in the Kafka cluster. Its use is to persist and replicate the data.
Kafka Cluster: A group of one or more Kafka brokers.
Kafka Consumer: Consumes messages from the topic.
Kafka Topic: Container into which messages are pushed by the producer.
Kafka Partition: A logical division of a topic.
Messaging System
Usage: Transfers data from one application to another.
Types of messaging systems: 1. Point-to-point 2. Publish-subscribe
Point-to-Point Messaging System:
- Messages are persisted in a queue
- A particular message can be consumed by at most one consumer, even if one or more consumers read from the queue
- As soon as a message is read, it disappears from the queue
Publish-Subscribe Messaging System:
- Messages persist in a topic
- A Kafka consumer can subscribe to one or more topics and consume all the messages in those topics
- Producers are referred to as publishers and consumers as subscribers
Benefits of using Kafka:
- Tracking website activities by storing/sending the events for real-time processing
- Alerting and reporting on operational metrics
- Transforming data into a standard format
- Continuous processing of streaming data published to topics