Kafka Architecture - Draft


What is Kafka
  • Stream-processing engine
  • Written in Scala and Java
  • Aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds
  • Used for real-time data pipelines and streaming applications
  • Built on top of the ZooKeeper synchronization service
  

Popular messaging queues in the market
  1. Kafka - Scalable, fault tolerant, high R/W throughput, no single point of failure, durable (messages are persisted on disk and replicated within the cluster to prevent data loss), supports compression, supports data retention
    1. Handles hundreds of megabytes of reads/writes per second from thousands of clients
    2. No master/slave architecture - no single point of failure
    3. Guarantees ordering of messages - provides a total order of messages within a partition
    4. The same role is assigned to all nodes in the cluster
    5. Compression saves storage and improves processing performance
    6. In published benchmarks, Kafka delivered roughly 5x the throughput of RabbitMQ
    7. ~90K messages per second, with records of 2-3 KB each
  2. RabbitMQ - ~2K messages per second, with records of 2-3 KB each
  3. ActiveMQ -
  4. HornetQ - ~20K messages per second, with records of 2-3 KB each

What will happen if the topic size reaches the broker's disk size?

A partition is a logical division of a topic.
A broker can contain multiple partitions of a topic.

Partitions are replicated, not the topic as a whole; i.e., each partition of the topic is replicated.
One replica of a partition is the leader (master partition); the other replicas of the same partition are followers.
R/W happens on the leader only, i.e., producers and consumers operate on the leader/primary/master replica.
Each broker can hold the leader replica for some partitions and follower replicas for others.
The leader is responsible for all writes and reads of its partition.
Replication provides fault tolerance.
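The leader/follower flow above can be sketched as a toy model (all names are hypothetical; real Kafka replication is pull-based and far more involved, but the invariant is the same - writes go to the leader, and the ISR is whoever has caught up):

```python
class Partition:
    """Toy model of one replicated partition: writes hit the leader,
    followers copy the leader's log, ISR = replicas caught up to it."""

    def __init__(self, replicas):
        self.leader = replicas[0]              # first replica acts as leader
        self.logs = {r: [] for r in replicas}  # each replica keeps its own log

    def append(self, record):
        # Producers write to the leader only.
        self.logs[self.leader].append(record)

    def fetch(self, follower):
        # A follower catches up by copying the leader's log.
        self.logs[follower] = list(self.logs[self.leader])

    def isr(self):
        # In-sync replicas: those whose log length matches the leader's.
        leader_len = len(self.logs[self.leader])
        return sorted(r for r, log in self.logs.items() if len(log) == leader_len)

p = Partition(["broker-1", "broker-2", "broker-3"])
p.append("event-1")
print(p.isr())          # only the leader is in sync right now
p.fetch("broker-2")
p.fetch("broker-3")
print(p.isr())          # all three replicas are back in the ISR
```

If the leader fails, a new leader is chosen from the ISR, which is why replication gives fault tolerance without losing acknowledged writes.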

Mirroring is used for geographical replication.

=> Replication stays within a geographical location; mirroring works across geographical locations.

The broker is completely decoupled from producers and consumers; e.g., it does not track how much data a given consumer has consumed.

How consumers read: the recommended (ideal) scenario is that each partition is read by exactly one consumer instance from a consumer group.
Within a consumer group, data read by one instance will not be re-read by the other instances.


The same record can be read by instances in different consumer groups; e.g., record 1 can be read by C2 in consumer group A, and the same record can be read again by a consumer in group B.
Fault tolerance is achieved within the consumer group.
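A minimal sketch of the rule above, assuming a round-robin-style assignment (the real assignors - range, round-robin, sticky - live in the Kafka client, so this is only an illustration of the invariant: one consumer per partition within a group, and each group reads everything independently):

```python
def assign(partitions, consumers):
    """Round-robin sketch: each partition goes to exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = [0, 1, 2, 3, 4, 5]
# Within group A, each partition is read by exactly one instance.
group_a = assign(partitions, ["A-c1", "A-c2"])
print(group_a)  # {'A-c1': [0, 2, 4], 'A-c2': [1, 3, 5]}
# Group B gets its own independent assignment over the same partitions.
group_b = assign(partitions, ["B-c1", "B-c2", "B-c3"])
print(group_b)
```

Note that if a group has more instances than partitions, the extras sit idle, which is why one partition per instance is called the ideal scenario.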

Why Kafka?
  • We can decouple the rate at which messages are produced and transferred to the Kafka server from the rate at which they are consumed.
  • High throughput, scalability, fault tolerance, built-in partitioning and replication.
  • A unified platform for handling all real-time data feeds.


Customers who use it:
  1. Loggly - popular cloud-based log management and analytics service provider
  2. RichRelevance - personalized customer experience for shopping
    1. 100,000+ events per second
    2. Data must be retained for several hours
    3. Guaranteed logging; no record loss is acceptable
    4. No single point of failure; age-based retention purges old data from disk
    5. Moves TBs of data every day without a single record failure
    6. Low latency - 99.99999% of data is served from the disk cache + RAM, rarely hitting disk
    7. Performance - one consumer group (eight threads) can process about 200,000 events per second, draining from 192 partitions spread across three brokers
    8. Scalability - the partition count per topic can be increased, and new nodes can be added at runtime
    9. Can topics be added at runtime?
  3. LinkedIn - for activity-stream data and operational metrics
    1. Newsfeed, LinkedIn Today, in addition to offline analytics from systems like Hadoop
  4. Twitter - as part of the Storm stream-processing engine
  5. Netflix - real-time monitoring and event-processing pipeline
  6. Sematext, Foursquare, Cisco, LinkSmart, Ooyala

Definition: a publish-subscribe messaging system that is distributed, scalable, fault-tolerant, partitioned, and replicated - a highly available log service.

ISR - In-Sync Replicas
ZooKeeper
  • High-performance coordination service for distributed applications.
  • Cluster coordinator, which stores topic offset metadata.
  • Used for communication between nodes.

If the preferred replica is not in the ISR, the controller will fail to move leadership to the preferred replica.
The maximum size of a message that Kafka can receive is 1,000,000 bytes (1 million bytes) by default.
Types of message transfer methods: queuing and publish-subscribe.

Types of APIs: Producer, Consumer, Connector, Streams
  1. Producer API - permits an application to publish a stream of records to one or more Kafka topics.
  2. Consumer API - lets an application subscribe to one or more topics and process the stream of records produced to them.
  3. Streams API - lets an application transform input streams into output streams.
  4. Connector API - allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems; e.g., a connector to an RDBMS might capture every change to a table.


A topic is a category or feed name to which messages are published.

Related terms: preferred replica, in-sync replica.

The Kafka cluster retains all published messages, whether or not they have been consumed, for a configured period of time.
Communication between clients and servers uses a simple, high-performance, language-agnostic TCP protocol.

ISR (In-Sync Replicas) is the subset of replicas that are currently alive and caught up to the leader.

The offset is controlled by the consumer; normally the consumer advances the offset linearly as it consumes.
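A toy illustration of consumer-controlled offsets (class and method names are hypothetical; the real consumer exposes the same idea through poll and seek):

```python
class OffsetConsumer:
    """Toy consumer over an in-memory partition log; it owns its own offset."""

    def __init__(self, log):
        self.log = log
        self.offset = 0          # position of the next record to read

    def poll(self, max_records=1):
        batch = self.log[self.offset:self.offset + max_records]
        self.offset += len(batch)   # advance linearly as records are consumed
        return batch

    def seek(self, offset):
        self.offset = offset        # rewinding and re-consuming is the consumer's choice

log = ["m0", "m1", "m2", "m3"]
c = OffsetConsumer(log)
print(c.poll(2))   # ['m0', 'm1']
print(c.poll(2))   # ['m2', 'm3']
c.seek(1)          # rewind to offset 1
print(c.poll(2))   # ['m1', 'm2']
```

Because the broker does not track consumption, reprocessing old records costs the broker nothing; the consumer simply moves its offset back.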

Each Kafka partition has one server that acts as its leader.

Per-partition ordering, combined with the ability to partition data by key, is sufficient for most applications.
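Partitioning by key can be sketched as follows. Kafka's default partitioner hashes the key with murmur2; a stdlib hash stands in here, but the property it demonstrates is the same - equal keys always map to the same partition, so per-key ordering is preserved:

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in for Kafka's murmur2-based default partitioner:
    # a stable hash of the key, modulo the partition count.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All records with the same key land in the same partition...
assert partition_for(b"user-42", 6) == partition_for(b"user-42", 6)
# ...and every result is a valid partition index.
for key in (b"a", b"b", b"c"):
    assert 0 <= partition_for(key, 6) < 6
```

One caveat worth noting: changing the partition count changes the key-to-partition mapping, which is why partition counts are usually chosen generously up front.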


A topic is nothing but a category.

Kafka allows a large number of permanent or ad-hoc consumers.


Kafka vs Flume:
Flume does not buffer/persist the data, whereas Kafka persists it on disk.

Possible configurations are:
  1. Broker coordination - for master/slave configurations
  2. The log.retention.* settings - the amount of time to keep a log segment before it is deleted
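For example, retention is typically tuned per broker in `server.properties` (the values below are illustrative defaults, not recommendations):

```properties
# How long to keep a log segment before it is eligible for deletion
log.retention.hours=168
# Optional size-based cap per partition; -1 disables it
log.retention.bytes=-1
# Segment roll size; retention is applied segment by segment
log.segment.bytes=1073741824
```

Retention is enforced per segment, so a record only disappears when its whole segment ages (or sizes) out.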


Architecture:

Components of Kafka:
Topic:
A topic is a collection of messages.
Replication refers to keeping copies of a partition.

Replication and partitioning enable fault tolerance and scalability.
Producer:
     - Publishes the message(s) to topics

Consumer:
  • Subscribes to topics
  • Reads the messages from the topic
  • Processes the messages

Broker:
  • Manages Storage of messages in the topic(s)

If Kafka has more than one broker, it is called a Kafka cluster.

Zookeeper:
  • Provides the brokers with metadata about the processes running in the system
  • Facilitates health checking
  • Handles broker leadership election
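Broker leadership (controller) election can be sketched with a mock of ZooKeeper's ephemeral nodes (class and path names here are hypothetical stand-ins): the first broker to create the `/controller` node wins, and the node vanishes when its session ends, letting another broker take over.

```python
class MockZooKeeper:
    """Toy stand-in for ZooKeeper's ephemeral znodes."""

    def __init__(self):
        self.nodes = {}  # path -> owning session

    def create_ephemeral(self, path, session):
        if path in self.nodes:
            return False         # someone already holds the node
        self.nodes[path] = session
        return True

    def end_session(self, session):
        # Ephemeral nodes disappear when their session dies.
        self.nodes = {p: s for p, s in self.nodes.items() if s != session}

zk = MockZooKeeper()
print(zk.create_ephemeral("/controller", "broker-1"))  # True: broker-1 is controller
print(zk.create_ephemeral("/controller", "broker-2"))  # False: election already decided
zk.end_session("broker-1")                             # broker-1 crashes
print(zk.create_ephemeral("/controller", "broker-2"))  # True: broker-2 takes over
```

The same ephemeral-node mechanism also backs the health checking above: a broker that stops heartbeating loses its session, and its registration node disappears.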


Partition:
  • A partition is a logical division of a topic

Kafka Use cases:

Kafka Broker: a node in the Kafka cluster; its role is to persist and replicate the data.
Kafka Cluster: a group of Kafka brokers working together.
Kafka Consumer: consumes messages from a topic.
Kafka Topic: the container into which messages are pushed by the producer.
Kafka Partition: a logical division of a topic.

Messaging system usage: transferring data from one application to another.
Types of messaging systems: 1. Point-to-point 2. Publish-subscribe

Point-to-Point Messaging System:

  1. Messages are persisted in a queue
  2. A particular message can be consumed by at most one consumer, even if multiple consumers read from the queue
  3. As soon as a consumer reads a message, it disappears from the queue

Publish-Subscribe Messaging System:
  1. Messages persist in a topic
  2. A Kafka consumer can subscribe to one or more topics and consume all the messages in those topics
  3. Producers are the publishers and consumers are the subscribers
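The two delivery models above can be contrasted in a small sketch (in-memory toys, not Kafka APIs): the queue removes a message once one consumer takes it, while the topic keeps a log and a per-subscriber offset so every subscriber sees every message.

```python
from collections import deque

class PointToPointQueue:
    """Each message is delivered to at most one consumer, then removed."""
    def __init__(self):
        self._q = deque()
    def send(self, msg):
        self._q.append(msg)
    def receive(self):
        return self._q.popleft() if self._q else None

class PubSubTopic:
    """Messages persist in a log; every subscriber reads all of them."""
    def __init__(self):
        self._log = []
        self._offsets = {}
    def publish(self, msg):
        self._log.append(msg)
    def subscribe(self, name):
        self._offsets[name] = 0
    def consume(self, name):
        i = self._offsets[name]
        if i >= len(self._log):
            return None
        self._offsets[name] = i + 1
        return self._log[i]

q = PointToPointQueue()
q.send("order-1")
print(q.receive())   # 'order-1' - consumed once...
print(q.receive())   # None - ...then gone

t = PubSubTopic()
t.subscribe("billing")
t.subscribe("audit")
t.publish("order-1")
print(t.consume("billing"))  # 'order-1'
print(t.consume("audit"))    # 'order-1' - every subscriber sees it
```

Kafka's consumer groups blend the two: within one group the topic behaves like a queue, while across groups it behaves like publish-subscribe.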

Benefits of using Kafka:
  1. Tracking website activities by storing/sending events in real time
  2. Alerting and reporting on operational metrics
  3. Transforming data into a standard format
  4. Continuous processing of streaming data in topics


