Posts

GroupByKey vs ReduceByKey

GroupByKey and ReduceByKey: both are transformations.

groupByKey: groups the dataset by key.
reduceByKey: grouping plus aggregation.

Shuffling
  groupByKey: more shuffling.
  reduceByKey: less shuffling.

Combiner
  groupByKey: no combiner.
  reduceByKey: aggregates the data before shuffling.

Which one is better?
  groupByKey: all the key-value pairs are shuffled around, which is a lot of unnecessary data transfer over the network.
  reduceByKey: works better on larger datasets, because Spark knows it can combine the output with a common key on each partition before shuffling the data. The function passed to reduceByKey is called again to reduce all the values from each partition and produce the final result. Only one output per key per partition is sent over the network.

Partitioner method
  groupByKey: called on each and every key.
  reduceByKey: called once per key in the partition.

Disk spill
  If there is more data to b...
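The difference in shuffle volume can be sketched without a cluster. The snippet below is only a simulation using plain Scala collections, not Spark's implementation: the object name `ShuffleSketch`, its method names, and the idea of modelling partitions as a `Seq[Seq[(String, Int)]]` are all assumptions made for illustration.

```scala
object ShuffleSketch {
  type Pair = (String, Int)

  // groupByKey-style: every (key, value) pair leaves its partition unchanged.
  def shuffledByGroup(partitions: Seq[Seq[Pair]]): Seq[Pair] =
    partitions.flatten

  // reduceByKey-style: combine values per key inside each partition first,
  // so each partition emits at most one record per key.
  def shuffledByReduce(partitions: Seq[Seq[Pair]], f: (Int, Int) => Int): Seq[Pair] =
    partitions.flatMap { part =>
      part.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(f) }.toSeq
    }

  // After the (simulated) shuffle, the same function is applied again
  // to merge the per-partition results into one value per key.
  def finalReduce(records: Seq[Pair], f: (Int, Int) => Int): Map[String, Int] =
    records.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(f) }
}
```

With two partitions, `Seq("a" -> 1, "b" -> 1, "a" -> 1)` and `Seq("a" -> 1, "b" -> 1, "b" -> 1)`, the groupByKey-style path moves six records while the reduceByKey-style path moves four, and both end at the same per-key sums.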

Checklist/Questionnaire for performance improvements in Spark Application

The performance tuning areas fall into four broad categories: cluster/hardware provisioning, environment, memory management, and code level.

@ Cluster level:
1. Number of resources allocated for the Spark application
     i. Number of cores - Are the cores sufficient to process the application? Will it help to increase the number of cores?
    ii. RAM - Is it sufficient? Will it help to increase the RAM size?
2. Priority of the job submitted - Are we getting the chance to run the job in the shared cluster?
3. Speculative execution - Is it enabled? It helps when there are straggler tasks.
4. Network speed - Do we have good network speed? Is it the bottleneck?
5. Data locality - Is the data within the cluster, or is it fetched from some other cluster?

@ Environment level:
Level of parallelism/Number of partitions - The block size may have an impact on the application.
Serialization format - Are we depending on the default configura...
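Several of the checklist items above map to Spark configuration properties. The following is a minimal sketch, assuming Spark is on the classpath; the object name, the application name, and every value are placeholders chosen for illustration, not recommendations.

```scala
import org.apache.spark.SparkConf

object TuningConfSketch {
  // All values below are illustrative placeholders; tune them per workload.
  val conf: SparkConf = new SparkConf()
    .setAppName("tuning-sketch")                 // hypothetical application name
    .set("spark.executor.cores", "4")            // cores per executor
    .set("spark.executor.memory", "8g")          // RAM per executor
    .set("spark.speculation", "true")            // re-launch straggler tasks
    .set("spark.default.parallelism", "200")     // default number of partitions for RDD shuffles
    .set("spark.serializer",
      "org.apache.spark.serializer.KryoSerializer") // instead of default Java serialization
}
```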

Concurrency vs Parallel

Concurrency: Multiple things are in progress.
Parallelism: Multiple things are being processed at the same time.

Concurrency: E.g. a juggler juggling many balls, throwing one ball per hand at a time.
Parallelism: E.g. multiple jugglers juggling multiple balls simultaneously.

Concurrency: Multiple execution flows with the potential for a shared resource. E.g. two threads competing for a single IO port.
Parallelism: Splitting a problem into multiple chunks. E.g. parsing a big file by splitting it and processing the splits.

Concurrency: Two queues accessing one ATM machine.
Parallelism: Two queues accessing two ATM machines.

Concurrency: Multiple tasks can be performed in overlapping time periods with shared resources.
Parallelism: A task is divided into multiple sub-tasks, which can be run independently.

Concurrency: E.g. multitasking on a single-core machine.
Parallelism: Running the tasks at the same time on a multi-core machine.

Context swit...
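The "split a big input and process the splits" idea can be sketched with Scala's standard `Future` API. This is only an illustration: `SplitAndProcess`, `countWords`, `parallelWordCount`, and the 10-second timeout are hypothetical choices. The global execution context sizes its thread pool from the available processors, so on a single-core machine the same code would run concurrently rather than in parallel.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object SplitAndProcess {
  // Count the words in one split; each call is independent of the others.
  def countWords(lines: Seq[String]): Int =
    lines.map(_.split("\\s+").count(_.nonEmpty)).sum

  // Divide the input into splits of `splitSize` lines (must be > 0),
  // process each split in its own Future, then combine the partial counts.
  def parallelWordCount(lines: Seq[String], splitSize: Int): Int = {
    val splits = lines.grouped(splitSize).toSeq
    val partials = Future.sequence(splits.map(s => Future(countWords(s))))
    Await.result(partials, 10.seconds).sum
  }
}
```

Because the sub-tasks share no state, the combined result matches a sequential `countWords` over the whole input.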

map vs flatMap in Spark

map: Returns a single element per input, based on the function/custom business logic/algorithm.
flatMap: Returns zero or more elements per input, based on the function/custom business logic/algorithm.

map: The function's return type is a single element.
flatMap: The function returns an iterator with our return values; the result is not an RDD of iterators but an RDD consisting of the elements from all the iterators.

map, e.g.
    val list = (1 to 5).toList
    list: List[Int] = List(1, 2, 3, 4, 5)
    list.map(_.to(3))
    List[scala.collection.immutable.Range.Inclusive] = List(Range(1, 2, 3), Range(2, 3), Range(3), Range(), Range())
flatMap, e.g.
    val list = (1 to 5).toList
    list: List[Int] = List(1, 2, 3, 4, 5)
    list.flatMap(_.to(3))
    List[Int] = List(1, 2, 3, 2, 3, 3)

map: Returns an RDD of lists; refer to the example.
flatMap: Returns an RDD of elements.

map: map operation only.
flatMap: Equivalent to map followed by flatten.

map: Returns the elements of the RDD
flatMap: Returns the ele...
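The "map followed by flatten" equivalence can be checked on a plain Scala list, using the same range-up-to-3 function as the example above. The object name `MapVsFlatMap` and the helper `upTo3` are hypothetical, and this sketch uses local collections rather than RDDs.

```scala
object MapVsFlatMap {
  val list: List[Int] = (1 to 5).toList

  // Same function as in the example above: the range from n up to 3,
  // which is empty when n is greater than 3.
  def upTo3(n: Int): List[Int] = n.to(3).toList

  val mapped: List[List[Int]] = list.map(upTo3)       // one list per input element
  val flattened: List[Int] = list.map(upTo3).flatten  // map followed by flatten
  val flatMapped: List[Int] = list.flatMap(upTo3)     // same result in one step
}
```

Inputs 4 and 5 produce empty lists under `map` but contribute nothing under `flatMap`, which is the "zero or more elements" case.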