Repartition vs coalesce in Spark



Increasing/decreasing the number of partitions
Repartition: can both increase and decrease the number of partitions.
Coalesce: can only decrease the number of partitions.

Data movement
Repartition: more, as it does a full shuffle and creates new partitions.
Coalesce: less; it avoids a full shuffle by combining existing partitions.

Example
Repartition: suppose an RDD holds data on 4 nodes, and each node's data forms one partition:

Node 1: a,b,c,d
Node 2: e,f,g,h
Node 3: i,j,k,l
Node 4: m,n

After repartitioning into two partitions, the result could become:

Node 5: a,b,f,h,i,j,n
Node 6: c,d,e,g,k,l,m

The new distribution is random (the data is evenly redistributed across the new partitions).

Coalesce: starting from the same 4-node RDD, coalescing into two partitions could give:

Node 1: a,b,c,d,i,j,k,l
Node 2: e,f,g,h,m,n

Node 1's and Node 2's data did not move; only the partitions on Node 3 and Node 4 were merged into them.

Resultant
Repartition: creates new partitions.
Coalesce: uses the existing partitions.

Partition sizes after the operation
Repartition: roughly equal-sized partitions.
Coalesce: partitions with differing amounts of data.

Speed
Repartition: overall faster for subsequent processing, because Spark is built to work with equal-sized partitions.
Coalesce: the operation itself may run faster than repartition, but unequal-sized partitions are generally slower to process than equal-sized ones.

Original partitions
Repartition: will not be modified.
Coalesce: will not be modified.

Internal calls
Repartition: calls coalesce with shuffle set to true.
Coalesce: is a direct call.

Data distribution
Repartition: equal distribution across the resultant partitions.
Coalesce: does not guarantee equal distribution across the resultant partitions.
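
The whole contrast fits in a few lines of shell code. A minimal sketch, assuming a spark-shell session (so sc is already defined), with the letters a..n standing in for the node example above:

val rdd = sc.parallelize('a' to 'n', numSlices = 4)

rdd.getNumPartitions                              // 4
rdd.repartition(2).getNumPartitions               // 2 - full shuffle, data evenly mixed
rdd.coalesce(2).getNumPartitions                  // 2 - existing partitions merged, no shuffle

// The "Internal calls" row in code: repartition(n) is coalesce(n, shuffle = true)
rdd.coalesce(2, shuffle = true).getNumPartitions  // 2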
What if we try to increase the number of partitions with coalesce?

No error is thrown; Spark simply ignores the request and keeps the current number of partitions.

If we call coalesce on repartitioned data, it works. Why? Because repartition has already increased the number of partitions, so the coalesce that follows is only decreasing the count, which is exactly what coalesce supports.
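
Both behaviours are easy to verify. A small sketch, again assuming a spark-shell session:

val small = sc.parallelize(1 to 100, numSlices = 4)

// Asking coalesce for MORE partitions is silently ignored - no error thrown
small.coalesce(8).getNumPartitions                   // still 4

// On an RDD, coalesce can grow the count only if a shuffle is explicitly requested
small.coalesce(8, shuffle = true).getNumPartitions   // 8

// Coalesce after repartition works: repartition first grows the count to 8,
// so the following coalesce(2) is a plain decrease
small.repartition(8).coalesce(2).getNumPartitions    // 2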
Partitioning a table by a column
import spark.implicits._   // for toDF and the $"..." column syntax (already in scope in spark-shell)

// A plain Scala List of (ID, color) tuples (local data, not yet distributed)
val colorsRDD = List(
  (10, "green"),
  (13, "red"),
  (15, "blue"),
  (99, "red"),
  (67, "Yellow")
)

// Converting to a DataFrame
val convertedDF = colorsRDD.toDF("ID", "color")

// Repartitioning by the color column
val resultantRepartition = convertedDF.repartition($"color")

→ Hash-partitions the rows by color, so all rows with the same color land in the same partition. Note that this does not create exactly one partition per color: the resulting partition count comes from spark.sql.shuffle.partitions, and most of those partitions will be empty here.
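
A quick check of the result (continuing the snippet above, and assuming the default configuration):

resultantRepartition.rdd.getNumPartitions   // 200, from spark.sql.shuffle.partitions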

When does coalesce work well?

When an RDD has a large number of partitions and we want to combine the partitions already sitting on each worker node into a smaller final set, since that avoids moving data between nodes. A common case is sketched below.
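
For instance, coalescing before writing out results means a job that produced many small partitions does not also produce many small output files. A sketch, with a placeholder output path:

val manyParts = sc.parallelize(1 to 1000, numSlices = 100)

// Merge down to 4 partitions without a shuffle, producing 4 output files
manyParts.coalesce(4).saveAsTextFile("/tmp/coalesced-output")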

Note: repartition is always a costly operation, since it shuffles the entire dataset.
For DataFrame shuffles (including repartition by a column), Spark creates 200 partitions by default (spark.sql.shuffle.partitions).
Partitioning a table by a column is similar to indexing a column in an RDBMS: queries that filter on that column can skip the irrelevant partitions.
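
The 200 default can be tuned per session. A sketch of lowering it before repartitioning by a column (spark is the shell's SparkSession, and convertedDF is the DataFrame from the example above):

spark.conf.set("spark.sql.shuffle.partitions", "8")
convertedDF.repartition($"color").rdd.getNumPartitions   // now 8 instead of 200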
