repartition vs coalesce

repartition vs coalesce

August 20, 2018

	Repartition	Coalesce
Increasing and decreasing the number of partitions	Can increase and decrease the number of partitions	Can only decrease the number of partitions
Data movement	More, as it does full shuffle and creates new partitions	Less; Avoids full shuffle, by combining the existing partitions.
Example	RDD is representing the data on 4 nodes: Node 1: a,b,c,d Node 2: e,f,g,h Node 3: i,j,k,l Node 4: m,n Each node’s data represents a partition. After running the repartition to make two partitions of the resultant data, it could become: Node 5: a,b,f,h,i,j,n Node 6: c,d,e,g,k,l,m The new distribution is random(or evenly moved)	RDD is representing the data on 4 nodes: Node 1: a,b,c,d Node 2: e,f,g,h Node 3: i,j,k,l Node 4: m,n Each node’s data represents a partition. After running coalesce to make into two partitions, it could become Node 1: a,b,c,d,i,j,k,l Node 2: e,f,g,h,m,n Node 1 and node 2’s data didn’t go through the data movement
Resultant	New partition	Uses existing partition
Partition sizes after repartitions	Roughly equal sized partitions,	Different amounts of data
Speed	Overall repartition is faster because spark is built into work with equal sized partitions.	May run faster than repartition, but unequal sized are generally slower than equal sized partitions
Original partition	Will not be modified	Will not be modified
Internal calls	Repartition calls coalesce with shuffling set to true	This is a direct call
Data distribution	Equal distribution in the resultant partitions	Doesn’t follow the equal distribution in the resultant partitions.
What if we try to increase with coalesce?		No error will be thrown, it just ignores. If we call coalesce on the repartitioned data it works.- why??
Table partition	val colorsRDD = List( (10, "green"), (13, "red"), (15, "blue"), (99, "red"), (67, "Yellow") ) //Converting to DF val convertedDF = colorsRDD.toDF("ID", "color") //Repartitioning resultantRepartition = convertedDF.repartition($"color") ð Creates partitions for each color.
When works well?		RDD with lot of partitions and combining the partitions on a single worker node to produce the final resultant partition.

Note: Repartition is always a costly operation.

Table partitions, spark by default creates 200 partitions.

Partitioning table by a column is similar to indexing a column in RDBMS

Comments