Kafka Architecture - Draft


What is Kafka
  • Stream-processing engine
  • Written in Scala and Java
  • Aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds
  • Used for real-time data pipelines and streaming applications
  • Built on top of the ZooKeeper synchronization service
  

Popular messaging queues in the market
  1. Kafka - Scalable, fault tolerant, high R/W throughput, no single point of failure, durable (messages are persisted on disk and replicated within the cluster to prevent data loss), supports compression, supports data retention
    1. Handles hundreds of megabytes of reads/writes per second from thousands of clients
    2. No master/slave architecture - no single point of failure
    3. Guarantees ordering of messages - provides a total order of messages within a partition
    4. The same role is assigned to all nodes in the cluster
    5. Compression saves storage and improves processing performance
    6. In published benchmarks, Kafka delivered roughly 5x the throughput of RabbitMQ
    7. ~90K messages per second, with records of 2-3 KB each
  2. RabbitMQ - ~2K messages per second, with records of 2-3 KB each
  3. ActiveMQ -
  4. HornetQ - ~20K messages per second, with records of 2-3 KB each

What will happen if the topic size reaches the broker's disk size?

A partition is a logical division of a topic.
A broker can contain multiple partitions of a topic.

Partitions are replicated, not the topic as a whole; i.e., each partition of the topic is replicated.
One replica of a partition is the leader (master partition); the other replicas of the same partition are followers.
R/W happens on the leader only, i.e., producers and consumers operate on the leader/primary/master replica.
Each broker can hold the leader replica for some partitions and follower replicas for others.
The leader is responsible for all writes and reads of its partition.
Replication provides fault tolerance.
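The leader/follower flow above can be sketched as a toy model (all names are hypothetical; real Kafka replication is pull-based and far more involved, but the invariant is the same - writes go to the leader, and the ISR is whoever has caught up):

```python
class Partition:
    """Toy model of one replicated partition: writes hit the leader,
    followers copy the leader's log, ISR = replicas caught up to it."""

    def __init__(self, replicas):
        self.leader = replicas[0]              # first replica acts as leader
        self.logs = {r: [] for r in replicas}  # each replica keeps its own log

    def append(self, record):
        # Producers write to the leader only.
        self.logs[self.leader].append(record)

    def fetch(self, follower):
        # A follower catches up by copying the leader's log.
        self.logs[follower] = list(self.logs[self.leader])

    def isr(self):
        # In-sync replicas: those whose log length matches the leader's.
        leader_len = len(self.logs[self.leader])
        return sorted(r for r, log in self.logs.items() if len(log) == leader_len)

p = Partition(["broker-1", "broker-2", "broker-3"])
p.append("event-1")
print(p.isr())          # only the leader is in sync right now
p.fetch("broker-2")
p.fetch("broker-3")
print(p.isr())          # all three replicas are back in the ISR
```

If the leader fails, a new leader is chosen from the ISR, which is why replication gives fault tolerance without losing acknowledged writes.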

Mirroring is used for geographical replication.

=> Replication stays within a geographical location; mirroring works across geographical locations.

The broker is completely decoupled from producers and consumers; e.g., it does not track how much data a given consumer has consumed.

How consumers read: the recommended (ideal) scenario is that each partition is read by exactly one consumer instance from a consumer group.
Within a consumer group, data read by one instance will not be re-read by the other instances.


The same record can be read by instances in different consumer groups; e.g., record 1 can be read by C2 in consumer group A, and the same record can be read again by a consumer in group B.
Fault tolerance is achieved within the consumer group.
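A minimal sketch of the rule above, assuming a round-robin-style assignment (the real assignors - range, round-robin, sticky - live in the Kafka client, so this is only an illustration of the invariant: one consumer per partition within a group, and each group reads everything independently):

```python
def assign(partitions, consumers):
    """Round-robin sketch: each partition goes to exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = [0, 1, 2, 3, 4, 5]
# Within group A, each partition is read by exactly one instance.
group_a = assign(partitions, ["A-c1", "A-c2"])
print(group_a)  # {'A-c1': [0, 2, 4], 'A-c2': [1, 3, 5]}
# Group B gets its own independent assignment over the same partitions.
group_b = assign(partitions, ["B-c1", "B-c2", "B-c3"])
print(group_b)
```

Note that if a group has more instances than partitions, the extras sit idle, which is why one partition per instance is called the ideal scenario.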

Why Kafka?
  • We can decouple the rate at which messages are produced and transferred to the Kafka server from the rate at which they are consumed.
  • High throughput, scalability, fault tolerance, built-in partitioning and replication.
  • A unified platform for handling all real-time data feeds.


Customers who use it:
  1. Loggly - popular cloud-based log management and analytics service provider
  2. RichRelevance - personalized customer experience for shopping
    1. 100,000+ events per second
    2. Data must be retained for several hours
    3. Guaranteed logging; no record loss is acceptable
    4. No single point of failure; age-based retention purges old data from disk
    5. Moves TBs of data every day without a single record failure
    6. Low latency - 99.99999% of data is served from the disk cache + RAM, rarely hitting disk
    7. Performance - one consumer group (eight threads) can process about 200,000 events per second, draining from 192 partitions spread across three brokers
    8. Scalability - the partition count per topic can be increased, and new nodes can be added at runtime
    9. Can topics be added at runtime?
  3. LinkedIn - for activity-stream data and operational metrics
    1. Newsfeed, LinkedIn Today, in addition to offline analytics from systems like Hadoop
  4. Twitter - as part of the Storm stream-processing engine
  5. Netflix - real-time monitoring and event-processing pipeline
  6. Sematext, Foursquare, Cisco, LinkSmart, Ooyala

Definition: a publish-subscribe messaging system that is distributed, scalable, fault-tolerant, partitioned, and replicated - a highly available log service.

ISR - In-Sync Replicas
ZooKeeper
  • High-performance coordination service for distributed applications.
  • Cluster coordinator, which stores topic offset metadata.
  • Used for communication between nodes.

If the preferred replica is not in the ISR, the controller will fail to move leadership to the preferred replica.
The maximum size of a message that Kafka can receive is 1,000,000 bytes (1 million bytes) by default.
Types of message transfer methods: queuing and publish-subscribe.

Types of APIs: Producer, Consumer, Connector, Streams
  1. Producer API - permits an application to publish a stream of records to one or more Kafka topics.
  2. Consumer API - lets an application subscribe to one or more topics and process the stream of records produced to them.
  3. Streams API - lets an application transform input streams into output streams.
  4. Connector API - allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems; e.g., a connector to an RDBMS might capture every change to a table.


A topic is a category or feed name to which messages are published.

Related terms: preferred replica, in-sync replica.

The Kafka cluster retains all published messages, whether or not they have been consumed, for a configured period of time.
Communication between clients and servers uses a simple, high-performance, language-agnostic TCP protocol.

ISR (In-Sync Replicas) is the subset of replicas that are currently alive and caught up to the leader.

The offset is controlled by the consumer; normally the consumer advances the offset linearly as it consumes.
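A toy illustration of consumer-controlled offsets (class and method names are hypothetical; the real consumer exposes the same idea through poll and seek):

```python
class OffsetConsumer:
    """Toy consumer over an in-memory partition log; it owns its own offset."""

    def __init__(self, log):
        self.log = log
        self.offset = 0          # position of the next record to read

    def poll(self, max_records=1):
        batch = self.log[self.offset:self.offset + max_records]
        self.offset += len(batch)   # advance linearly as records are consumed
        return batch

    def seek(self, offset):
        self.offset = offset        # rewinding and re-consuming is the consumer's choice

log = ["m0", "m1", "m2", "m3"]
c = OffsetConsumer(log)
print(c.poll(2))   # ['m0', 'm1']
print(c.poll(2))   # ['m2', 'm3']
c.seek(1)          # rewind to offset 1
print(c.poll(2))   # ['m1', 'm2']
```

Because the broker does not track consumption, reprocessing old records costs the broker nothing; the consumer simply moves its offset back.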

Each Kafka partition has one server that acts as its leader.

Per-partition ordering, combined with the ability to partition data by key, is sufficient for most applications.
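Partitioning by key can be sketched as follows. Kafka's default partitioner hashes the key with murmur2; a stdlib hash stands in here, but the property it demonstrates is the same - equal keys always map to the same partition, so per-key ordering is preserved:

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in for Kafka's murmur2-based default partitioner:
    # a stable hash of the key, modulo the partition count.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All records with the same key land in the same partition...
assert partition_for(b"user-42", 6) == partition_for(b"user-42", 6)
# ...and every result is a valid partition index.
for key in (b"a", b"b", b"c"):
    assert 0 <= partition_for(key, 6) < 6
```

One caveat worth noting: changing the partition count changes the key-to-partition mapping, which is why partition counts are usually chosen generously up front.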


A topic is nothing but a category.

Kafka allows a large number of permanent or ad-hoc consumers.


Kafka vs Flume:
Flume does not buffer/persist the data, whereas Kafka persists it on disk.

Possible configurations are:
  1. Broker coordination - for master/slave configurations
  2. The log.retention.* settings - the amount of time to keep a log segment before it is deleted
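For example, retention is typically tuned per broker in `server.properties` (the values below are illustrative defaults, not recommendations):

```properties
# How long to keep a log segment before it is eligible for deletion
log.retention.hours=168
# Optional size-based cap per partition; -1 disables it
log.retention.bytes=-1
# Segment roll size; retention is applied segment by segment
log.segment.bytes=1073741824
```

Retention is enforced per segment, so a record only disappears when its whole segment ages (or sizes) out.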


Architecture:

Components of Kafka:
Topic:
A topic is a collection of messages.
Replication refers to keeping copies of a partition.

Replication and partitioning enable fault tolerance and scalability.
Producer:
     - Publishes the message(s) to topics

Consumer:
  • Subscribes to topics
  • Reads the messages from the topic
  • Processes the messages

Broker:
  • Manages Storage of messages in the topic(s)

If Kafka has more than one broker, it is called a Kafka cluster.

Zookeeper:
  • Provides the brokers with metadata about the processes running in the system
  • Facilitates health checking
  • Handles broker leadership election
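Broker leadership (controller) election can be sketched with a mock of ZooKeeper's ephemeral nodes (class and path names here are hypothetical stand-ins): the first broker to create the `/controller` node wins, and the node vanishes when its session ends, letting another broker take over.

```python
class MockZooKeeper:
    """Toy stand-in for ZooKeeper's ephemeral znodes."""

    def __init__(self):
        self.nodes = {}  # path -> owning session

    def create_ephemeral(self, path, session):
        if path in self.nodes:
            return False         # someone already holds the node
        self.nodes[path] = session
        return True

    def end_session(self, session):
        # Ephemeral nodes disappear when their session dies.
        self.nodes = {p: s for p, s in self.nodes.items() if s != session}

zk = MockZooKeeper()
print(zk.create_ephemeral("/controller", "broker-1"))  # True: broker-1 is controller
print(zk.create_ephemeral("/controller", "broker-2"))  # False: election already decided
zk.end_session("broker-1")                             # broker-1 crashes
print(zk.create_ephemeral("/controller", "broker-2"))  # True: broker-2 takes over
```

The same ephemeral-node mechanism also backs the health checking above: a broker that stops heartbeating loses its session, and its registration node disappears.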


Partition:
  • A partition is a logical division of a topic

Kafka Use cases:

Kafka Broker: a node in the Kafka cluster; its role is to persist and replicate the data.
Kafka Cluster: a group of Kafka brokers working together.
Kafka Consumer: consumes messages from a topic.
Kafka Topic: the container into which messages are pushed by the producer.
Kafka Partition: a logical division of a topic.

Messaging system usage: transferring data from one application to another.
Types of messaging systems: 1. Point-to-point 2. Publish-subscribe

Point-to-Point Messaging System:

  1. Messages are persisted in a queue
  2. A particular message can be consumed by at most one consumer, even if multiple consumers read from the queue
  3. As soon as a consumer reads a message, it disappears from the queue

Publish-Subscribe Messaging System:
  1. Messages persist in a topic
  2. A Kafka consumer can subscribe to one or more topics and consume all the messages in those topics
  3. Producers are the publishers and consumers are the subscribers
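The two delivery models above can be contrasted in a small sketch (in-memory toys, not Kafka APIs): the queue removes a message once one consumer takes it, while the topic keeps a log and a per-subscriber offset so every subscriber sees every message.

```python
from collections import deque

class PointToPointQueue:
    """Each message is delivered to at most one consumer, then removed."""
    def __init__(self):
        self._q = deque()
    def send(self, msg):
        self._q.append(msg)
    def receive(self):
        return self._q.popleft() if self._q else None

class PubSubTopic:
    """Messages persist in a log; every subscriber reads all of them."""
    def __init__(self):
        self._log = []
        self._offsets = {}
    def publish(self, msg):
        self._log.append(msg)
    def subscribe(self, name):
        self._offsets[name] = 0
    def consume(self, name):
        i = self._offsets[name]
        if i >= len(self._log):
            return None
        self._offsets[name] = i + 1
        return self._log[i]

q = PointToPointQueue()
q.send("order-1")
print(q.receive())   # 'order-1' - consumed once...
print(q.receive())   # None - ...then gone

t = PubSubTopic()
t.subscribe("billing")
t.subscribe("audit")
t.publish("order-1")
print(t.consume("billing"))  # 'order-1'
print(t.consume("audit"))    # 'order-1' - every subscriber sees it
```

Kafka's consumer groups blend the two: within one group the topic behaves like a queue, while across groups it behaves like publish-subscribe.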

Benefits of using Kafka:
  1. Tracking website activities by storing/sending events in real time
  2. Alerting and reporting on operational metrics
  3. Transforming data into a standard format
  4. Continuous processing of streaming data in topics


