GroupByKey vs ReduceByKey

GroupByKey and ReduceByKey are both transformations on key-value (pair) RDDs.


What it does
GroupByKey: groups the dataset based on the key.
ReduceByKey: grouping + aggregation.

Shuffling
GroupByKey: more shuffling.
ReduceByKey: less shuffling.

Combiner
GroupByKey: no combiner.
ReduceByKey: uses a combiner to aggregate the data before shuffling.

Which one is better?
GroupByKey: all the key-value pairs are shuffled around, which is a lot of unnecessary data transfer over the network.
ReduceByKey: works better on larger datasets, because Spark knows it can combine output with a common key on each partition before shuffling the data. The function passed to reduceByKey is then called again to reduce the values from each partition into the final result, so only one output per key per partition is sent over the network.

Partitioner method
GroupByKey: called on each and every key-value pair being shuffled.
ReduceByKey: called once per key in each partition.

Disk spill
GroupByKey: happens when more data is shuffled onto a single executor machine than can fit in memory; Spark flushes the data to disk one key at a time.
ReduceByKey: no spill caused by the operation.

OutOfMemory exception
GroupByKey: because spilling happens one key at a time, if a single key has more key-value pairs than can fit in memory, an OutOfMemory exception can occur.
ReduceByKey: no OutOfMemory exception from the operation.

Performance impact
GroupByKey: when disk spill happens, performance is severely impacted.
ReduceByKey: no performance impact from the operation itself.

Preferred functions to be used
GroupByKey: NA
ReduceByKey: combineByKey can be used when combining the results, but the return type can differ from the input value type (see the sketch after this table); foldByKey merges the values for each key using an associative function and a zero value (see the example after the word count below).
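To make the combineByKey point concrete, here is a minimal sketch of the "return type will differ" case: a per-key average, where the input values are Ints but the combined type is a (sum, count) pair, which reduceByKey's (V, V) => V signature cannot express. The scores data is made up for illustration.

// Per-key average with combineByKey: input values are Int,
// but the combiner type is (Int, Int) = (running sum, running count)
val scores = sc.parallelize(Array(("a", 10), ("a", 20), ("b", 30)))
val sumCount = scores.combineByKey(
  (v: Int) => (v, 1),                                           // createCombiner: first value seen for a key in a partition
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue: fold another value into the per-partition combiner
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners: merge combiners from different partitions
)
val averages = sumCount.mapValues { case (sum, count) => sum.toDouble / count }.collect
// averages: Array((a,15.0), (b,30.0)) -- ordering may vary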

Example word count:

// seven words: "a" and "b" appear twice each, "c" three times
val input = Array("a", "a", "b", "b", "c", "c", "c")
val inputRDD = sc.parallelize(input).map(element => (element, 1))

// groupByKey shuffles every (word, 1) pair, then sums the values per key
val wordcountUsingGroupByKey = inputRDD.groupByKey.map(words => (words._1, words._2.sum)).collect

// reduceByKey sums within each partition first, so only one (word, count)
// per key per partition is shuffled
val wordcountUsingReduceByKey = inputRDD.reduceByKey(_ + _).collect

// both return Array((a,2), (b,2), (c,2)) -- ordering may vary
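The same count can also be written with foldByKey, mentioned in the table above. A minimal sketch: it behaves like reduceByKey but takes an explicit zero value for each key.

// foldByKey with a zero value of 0 and an associative sum; same result as reduceByKey
val wordcountUsingFoldByKey = inputRDD.foldByKey(0)(_ + _).collect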

