GroupByKey vs ReduceByKey
GroupByKey and ReduceByKey are both transformations.

GroupByKey: groups the dataset based on the key.
ReduceByKey: grouping + aggregation.

Shuffling
GroupByKey: more shuffling.
ReduceByKey: less shuffling.

Combiner
GroupByKey: no combiner.
ReduceByKey: aggregates the data before shuffling.

Which one is better?
GroupByKey: all the key-value pairs are shuffled around. This is a lot of unnecessary data transfer on the network.
ReduceByKey: works better on larger datasets, because Spark knows that it can combine the output with a common key on each partition before shuffling the data. The function that's passed to reduceByKey will be called again to reduce all the values from each partition to produce the final result. Only one output for each key at each partition is sent over the network.

Partitioner method
GroupByKey: will be called on each and every key.
ReduceByKey: will be called once per key in the partition.

Disk spill
If there is more data to b...
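
The shuffling difference described above can be illustrated with a small plain-Python simulation. This is only a sketch and not Spark itself: the helper names group_by_key_sim and reduce_by_key_sim, the two sample partitions, and the idea of counting "shuffled records" are assumptions made for illustration. In a real Spark job the corresponding calls would be rdd.groupByKey() and rdd.reduceByKey(func), and details such as the partitioner, serialization, and disk spill are not modeled here.

```python
from collections import defaultdict


def group_by_key_sim(partitions):
    """Hypothetical stand-in for groupByKey: no map-side combine,
    so every (key, value) pair is sent across the network."""
    shuffled = [pair for part in partitions for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return dict(grouped), len(shuffled)


def reduce_by_key_sim(partitions, func):
    """Hypothetical stand-in for reduceByKey: combine values per key
    inside each partition first, then shuffle the partial results."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        # Only one record per key per partition leaves this partition.
        shuffled.extend(local.items())

    # The same function is applied again to merge the partial results.
    result = {}
    for key, value in shuffled:
        result[key] = func(result[key], value) if key in result else value
    return result, len(shuffled)


# Two made-up partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1), ("c", 1)],
]

grouped, grouped_shuffled = group_by_key_sim(partitions)
counts_from_groups = {key: sum(values) for key, values in grouped.items()}

counts, reduced_shuffled = reduce_by_key_sim(partitions, lambda x, y: x + y)

print("groupByKey-style shuffle size:", grouped_shuffled)
print("reduceByKey-style shuffle size:", reduced_shuffled)
print("counts:", counts)
```

Both approaches produce the same per-key counts, but in this sample the grouping path moves all 8 pairs while the reducing path moves only 5 partial results, one per key per partition. The gap widens as the number of values per key grows.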