Spark Analytics on MovieLens Dataset

Tutorial-6 PySpark Coalesce and Repartition

In this article, we discuss the coalesce and repartition transformations.

Coalesce: Useful only for reducing the number of partitions. It avoids a full data shuffle by merging existing partitions instead of redistributing every record, so the resulting partitions may be of unequal size.

Example: Let's say we have four machines (nodes), each holding an equal number of partitions.
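To make the contrast concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's actual implementation: the functions `toy_coalesce` and `toy_repartition` and the sample data are hypothetical, chosen only to illustrate why merging whole partitions can leave them unequal while a full shuffle evens them out. In PySpark itself the corresponding calls are `df.coalesce(n)` and `df.repartition(n)`.

```python
from typing import List


def toy_coalesce(partitions: List[List[int]], n: int) -> List[List[int]]:
    """Toy model of coalesce: merge whole partitions, never split one."""
    if n >= len(partitions):
        # Coalesce cannot increase the partition count; return a copy as-is.
        return [list(p) for p in partitions]
    merged: List[List[int]] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Neighbouring partitions are concatenated into the same target.
        merged[i * n // len(partitions)].extend(part)
    return merged


def toy_repartition(partitions: List[List[int]], n: int) -> List[List[int]]:
    """Toy model of repartition: every record is redistributed (full shuffle)."""
    records = [r for p in partitions for r in p]
    out: List[List[int]] = [[] for _ in range(n)]
    for i, record in enumerate(records):
        # Round-robin assignment is an assumption made for this sketch.
        out[i % n].append(record)
    return out


# Hypothetical data: four partitions of uneven size (5, 3, 1 and 3 records).
partitions = [[1, 2, 3, 4, 5], [6, 7, 8], [9], [10, 11, 12]]

coalesced = toy_coalesce(partitions, 2)
repartitioned = toy_repartition(partitions, 2)

print([len(p) for p in coalesced])      # [8, 4]
print([len(p) for p in repartitioned])  # [6, 6]
```

In this sketch, coalescing keeps each original partition intact, so the two merged partitions end up with 8 and 4 records, while the shuffle-based version moves individual records and yields two partitions of 6 records each.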

