repartition vs coalesce
Repartition
|
Coalesce
|
|
Increasing and decreasing the number of partitions
|
Can increase and decrease the number of partitions
|
Can only decrease the number of partitions
|
Data movement
|
More, as it does full shuffle and creates new partitions
|
Less; Avoids full shuffle, by combining the existing partitions.
|
Example
|
RDD is representing the data on 4 nodes:
Node 1: a,b,c,d
Node 2: e,f,g,h
Node 3: i,j,k,l
Node 4: m,n
Each node’s data represents a partition.
After running the repartition to make two partitions of
the resultant data, it could become:
Node 5: a,b,f,h,i,j,n
Node 6: c,d,e,g,k,l,m
The new distribution is random(or evenly moved)
|
RDD is representing the data on 4 nodes:
Node 1: a,b,c,d
Node 2: e,f,g,h
Node 3: i,j,k,l
Node 4: m,n
Each node’s data represents a partition.
After running coalesce to make into two partitions, it could
become
Node 1: a,b,c,d,i,j,k,l
Node 2: e,f,g,h,m,n
Node 1 and node 2’s
data didn’t go through the data movement
|
Resultant
|
New partition
|
Uses existing partition
|
Partition sizes after repartitions
|
Roughly equal sized partitions,
|
Different amounts of data
|
Speed
|
Overall repartition is faster because spark is built into
work with equal sized partitions.
|
May run faster than repartition, but unequal sized are
generally slower than equal sized partitions
|
Original partition
|
Will not be modified
|
Will not be modified
|
Internal calls
|
Repartition calls coalesce with shuffling set to true
|
This is a direct call
|
Data distribution
|
Equal distribution in the resultant partitions
|
Doesn’t follow the equal distribution in
the resultant partitions.
|
What if we try to increase with coalesce?
|
No error will be thrown, it just ignores.
If we call coalesce on the repartitioned data it works.-
why??
|
|
Table partition
|
val colorsRDD = List(
(10, "green"), (13, "red"), (15, "blue"), (99, "red"), (67, "Yellow") ) //Converting to DF
val convertedDF = colorsRDD.toDF("ID",
"color")
//Repartitioning
resultantRepartition = convertedDF.repartition($"color")
ð Creates partitions for each color.
|
|
When works well?
|
RDD with lot of partitions and combining the partitions on
a single worker node to produce the final resultant partition.
|
Note: Repartition is always a costly operation.
Table
partitions, spark by default creates 200 partitions.
Partitioning table by a column is similar to indexing
a column in RDBMS
Comments
Post a Comment