
Showing posts from August, 2018

Kafka Architecture - Draft

What is Kafka
- Stream processing engine written in Scala and Java.
- Aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
- Used for real-time data pipelines and streaming apps.
- Built on top of the ZooKeeper synchronization service.

Kafka among the popular messaging queues in the market
- Scalable, fault tolerant, high read/write throughput, no single point of failure.
- Durable: messages are persisted on disk and replicated within the cluster to prevent data loss.
- Supports compression and data retention. Compression saves storage and improves processing performance.
- Handles hundreds of megabytes of reads and writes per second from thousands of clients.
- No master/slave architecture, so there is no single point of failure; the same role is assigned to all nodes in the cluster.
- Guarantees ordering of messages: provides a total order of messages within a partition.
- Kafka is 5 times better th…
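The per-partition ordering guarantee mentioned above can be sketched with a toy in-memory log. This is only a conceptual model with hypothetical names (`ToyLog`, `partition_for`); real Kafka hashes keys with murmur2, persists partitions to disk, and replicates them across brokers.

```python
from collections import defaultdict

def partition_for(key: str, num_partitions: int) -> int:
    """Pick a partition from the message key (illustrative; Kafka uses murmur2)."""
    return hash(key) % num_partitions

class ToyLog:
    """Toy stand-in for a Kafka topic: one append-only list per partition."""
    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self.partitions = defaultdict(list)

    def produce(self, key: str, value: str) -> int:
        p = partition_for(key, self.num_partitions)
        # Append-only: order of messages within a single partition is fixed,
        # but there is no global order across partitions.
        self.partitions[p].append(value)
        return p
```

Because all messages with the same key hash to the same partition, a consumer of that partition sees them in the order they were produced, even though no ordering holds across the topic as a whole.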

repartition vs coalesce

Repartition vs Coalesce

- Number of partitions: repartition can both increase and decrease the number of partitions; coalesce can only decrease it.
- Data movement: repartition moves more data, since it performs a full shuffle and creates new partitions; coalesce moves less, avoiding a full shuffle by combining existing partitions.
- Example: an RDD holds data on 4 nodes, where each node's data is one partition:
  Node 1: a,b,c,d
  Node 2: e,f,g,h
  Node 3: i,j,k,l
  Node 4: m,n
  After running repartition to make two partitions, the result could be:
  Node 5: a,b,f,h,i,j,n
  Node 6: c,d,e,g,k,l,m
  The new distribution is random (or evenly moved).
  After running coalesce to make two partitions, the result could be:
  Node 1: a,b,c,d,i,j,k,l
  Node 2: e,f,g,h,m,n
  Here whole existing partitions are combined, so only the data from Nodes 3 and 4 moves; a full shuffle is avoided.
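The difference in data movement can be sketched in plain Python. This is a conceptual model, not Spark's actual implementation; `coalesce` and `repartition` here are hypothetical stand-ins that only mimic how the two operations assign data to output partitions.

```python
import itertools
import random

def coalesce(partitions, n):
    """Merge existing partitions into n groups without a full shuffle:
    each original partition is assigned wholesale to a target partition."""
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i % n].extend(part)   # whole partition moves (or stays) as one unit
    return out

def repartition(partitions, n, seed=0):
    """Full shuffle: every element may move, redistributed element by element."""
    rng = random.Random(seed)
    flat = list(itertools.chain.from_iterable(partitions))
    rng.shuffle(flat)             # random redistribution, like a shuffle
    out = [[] for _ in range(n)]
    for i, x in enumerate(flat):
        out[i % n].append(x)
    return out
```

With the four-node example above, `coalesce(nodes, 2)` keeps each original partition's elements together (e.g. Node 1's and Node 3's data end up in one output partition), while `repartition(nodes, 2)` scatters individual elements across the new partitions.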

DataFrame vs Dataset

DataFrame vs Dataset (Dataset = RDD + DataFrame)

- Definition: a DataFrame is a distributed collection of rows (data) organized into named columns; a Dataset is an extension of the DataFrame API, a distributed collection of strongly typed domain-specific objects.
- Regenerating the domain object: a DataFrame can't regenerate the original domain-specific object; a Dataset can regenerate the original domain-specific object from the JVM object.
- Static typing / compile-time safety: with the DataFrame API, accessing a column that doesn't exist in the table doesn't give a compile-time error; the attribute error is only detected at runtime. A Dataset gives the error at compile time.
- Accessing an individual attribute of an object: with a DataFrame, we need to serialize the entire object to access an individual attribute; with a Dataset, there is no need to serialize the entire object to access an individual element.
- Programming language support: DataFrame - Python, R, Scala and Java; Dataset - Scala and Java.
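Python cannot reproduce Scala's compile-time check, but the compile-time vs runtime distinction can be sketched by analogy, with all names here hypothetical: untyped dict rows (like DataFrame rows addressed by column name) fail only when the misspelled access actually executes, while typed objects (a `NamedTuple`, loosely analogous to a Dataset's case class) let a static checker such as mypy flag the typo before the program runs.

```python
from typing import NamedTuple

class Person(NamedTuple):
    """Typed domain object, loosely analogous to a Dataset[Person] case class."""
    name: str
    age: int

untyped_rows = [{"name": "Ada", "age": 36}]   # analogous to DataFrame rows
typed_rows = [Person(name="Ada", age=36)]     # analogous to a typed Dataset

def read_untyped():
    # Misspelled "column" name: nothing complains until this line executes,
    # then it raises KeyError at runtime.
    return untyped_rows[0]["agee"]

def read_typed():
    # The same kind of typo on a typed attribute would be caught by a static
    # checker (e.g. mypy) before the program ever runs.
    return typed_rows[0].age
```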

DAG in Spark and Usage

DAG - Directed Acyclic Graph

- A DAG is a set of vertices and edges: vertices represent RDDs, and edges represent the operations to be applied to those RDDs.
- A DAG is a finite graph with no directed cycles, which means it has finitely many vertices and edges.
- Each Spark application is converted into a DAG. The DAG shows the complete job, i.e. the transformation(s) and the action, and the different stages of the Spark job.
- The logical DAG of operations is created implicitly by the Spark program; in other words, the user code defines a DAG of RDDs.
- For each action, Spark creates a DAG and submits it to the DAG scheduler. The DAG scheduler divides the operators into stages of tasks (this is the final result of the DAG scheduler).
- The DAG scheduler pipelines operators together; e.g. many map operations can be scheduled into a single stage.
- The DAG helps achieve fault tolerance: a lost RDD can be recovered.
- MapReduce has only two levels, map and reduce, while a DAG can have multiple levels; because of this, executing SQL queries is more flexible. …
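The stage-splitting behavior described above can be sketched as a toy function. This is a simplification with a hypothetical `split_into_stages` helper: real Spark walks the lineage graph and cuts stages at wide (shuffle) dependencies, but the pipelining idea is the same.

```python
def split_into_stages(ops):
    """Pipeline consecutive narrow (map-like) ops into one stage; cut a new
    stage at every shuffle boundary (wide dependency)."""
    stages, current = [], []
    for kind, name in ops:
        current.append((kind, name))
        if kind == "shuffle":       # wide dependency: stage boundary
            stages.append(current)
            current = []
    if current:                     # trailing narrow ops form the final stage
        stages.append(current)
    return stages
```

For example, a lineage of map, map, groupByKey (shuffle), map yields two stages: the two maps are pipelined together up to the shuffle, and the post-shuffle map forms the second stage.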

Advantages of Immutability in distributed systems and in programming

Specific to distributed systems:
1. Performance - simple, and easy to share the RDD with multiple processing elements.
2. Fault tolerance - re-creatable on failure.
3. Caching and in-memory processing - references can be cached, as they are not going to change.
4. Concurrency - multiple threads can access a partition safely.
5. Sharing - safe to share across processes.
6. Easy processing - makes it easy to parallelize, as there are no conflicts.
7. Replication - internal state will be consistent even if you have an exception.
8. Rules out the potential problems caused by updates from multiple threads.
9. Data can live as easily in memory as on disk. This makes it reasonable to move operations that hit disk to instead use in-memory data, and adding memory is easier than adding I/O bandwidth.

These are significant wins for RDDs, at the cost of having to copy data rather than mutate it in place.

General list of reasons to favor immutability in programming: …
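Point 2 above (fault tolerance) can be sketched in a few lines: because the parent data is immutable and the transformation that produced a partition is recorded, a lost partition is simply recomputed rather than restored from mutated state. This is a conceptual sketch, not Spark's actual lineage machinery.

```python
parent = (1, 2, 3, 4)  # immutable source partition (a tuple cannot be mutated)

def transform(part):
    """The recorded lineage step that produced the child partition."""
    return tuple(x * x for x in part)

child = transform(parent)
# ... suppose the node holding `child` dies ...
recovered = transform(parent)  # deterministic recompute from the immutable parent
```

Since `parent` can never have been modified in place, re-running the recorded transformation is guaranteed to reproduce exactly the partition that was lost.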