In this article, we are going to discuss coalesce and repartition transformations.

Coalesce:

• Useful only to reduce the number of partitions.
• It avoids full data shuffle.
• It may have unequal partitions length.

Example: Let’s say we have four machine or nodes which contains equal number of partitions in each node like below:

Node 1 = 1,2,3
Node 2 = 4,5,6
Node 3 = 7,8,9
Node 4 = 10,11,12

And suppose we require only two partitions then coalesce will work here:
Node 1 = (1,2,3) and (4,5,6)
Node 3 = (7,8,9) and (10,11,12)

Notice that Node 1 and Node 3 did not require its original data to move.

Q-1 How does Coalesce reduce the number of partitions?

val initialRdd = spark.sparkContext.parallelize(1 to 15 ,5)
initialRdd.getNumPartitions //Output 5
initialRdd.saveAsTextFile(“/output_rdd/”) //hdfs path

//Now apply the coalesce method to reduce the number of partitions

val coalesceRDD =initialRdd.coalesce(3)
coalesceRDD.saveAsTextFile(“/coalesce-output/”) //hdfs path

As we can see,initially there were five partitions and then applied coalesce to make three partitions and also data is partially shuffled.

Repartition:

• The repartition method can be used to either increase or decrease the number of partitions in a RDD..
• Full data shuffle.
• It will have equal partitions length.

Example: Let’s say we have four machine or nodes which contains equal number of partitions in each node like below:

Node 1 = 1,2,3
Node 2 = 4,5,6
Node 3 = 7,8,9
Node 4 = 10,11,12

And repartition convert two partition here:
Node 1 = (7,8,9) and (4,5,6)
Node 3 = (1,2,3) and (10,11,12)

Notice that data is fully shuffled.

Q-2 How does Repartion reduce or increase the number of partitions?

val initialRdd = spark.sparkContext.parallelize(1 to 15 ,5)
initialRdd.getNumPartitions //Output 5
initialRdd.saveAsTextFile(“/output_rdd/”) //hdfs path

//Now apply the coalesce method to reduce the number of partitions

val repartitionRDD =initialRdd.repartition(3)
repartitionRDD.saveAsTextFile(“/repartition-output/”) //hdfs path

Q-3- What is the difference between Repartition and Coalesce:-

$${}$$