In this post we will discuss the mapPartitions, mapPartitionsWithIndex and filter transformations.

(a) mapPartitions: This transformation is similar to map, but runs once per partition (block) of the RDD, receiving an iterator over that partition's elements instead of one element at a time. mapPartitions() can be used as an alternative to map() and foreach(), especially when some setup work would be expensive to repeat for every single element.

Q-1: How to iterate over each partition using the mapPartitions transformation and convert each element of the RDD into a (value, 1) pair?

val flat_rdd = data_rdd.mapPartitions(partition =>
  partition.flatMap(x => x.split(" ")).map(y => (y, 1))
)
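The same per-partition logic can be sketched without a cluster using plain Scala collections. In this sketch each inner List stands for one RDD partition, and the sample sentences are made up purely for illustration:

```scala
// Plain-Scala sketch of mapPartitions: each inner List plays the role of one
// RDD partition; the sample sentences are illustrative only.
val partitions = List(List("Spark is fast"), List("Spark is simple"))

// For every partition, split each line into words and pair each word with 1,
// mirroring partition.flatMap(_.split(" ")).map(y => (y, 1)) inside mapPartitions.
val pairs = partitions.flatMap(p => p.iterator.flatMap(_.split(" ")).map(y => (y, 1)))

println(pairs) // List((Spark,1), (is,1), (fast,1), (Spark,1), (is,1), (simple,1))
```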

Q-2: How to sum the numbers in each partition using the mapPartitions transformation?

val data = spark.sparkContext.parallelize(List(10, 11, 12, 13, 14), 2) //Defines two partitions
def mapPartitionSum(numbers: Iterator[Int]): Iterator[Int] = {
  var sum = 0
  numbers.foreach(n => sum = sum + n)
  Iterator(sum)
}
val sum_rdd = data.mapPartitions(mapPartitionSum)
sum_rdd.collect()


//Output - 21,39
//One partition holds the numbers 10 and 11, the second holds 12, 13 and 14.
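These per-partition sums can be verified without Spark by simulating the two partitions with plain Scala collections (the List boundaries below mirror how parallelize splits the five numbers into two slices):

```scala
// Plain-Scala check of the mapPartitions sum: each inner List stands for one
// of the two partitions produced by parallelize(List(10,11,12,13,14), 2).
val partitions = List(List(10, 11), List(12, 13, 14))

// Same logic as mapPartitionSum: walk each partition's iterator and accumulate.
val sums = partitions.map { p =>
  var sum = 0
  p.iterator.foreach(n => sum = sum + n)
  sum
}

println(sums) // List(21, 39)
```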

(b) mapPartitionsWithIndex: This transformation is similar to mapPartitions, but also provides the function with an integer value representing the index of the partition.

Q-3: How to attach the partition index to each element of an RDD using the mapPartitionsWithIndex transformation?

val cars_rdd = spark.sparkContext.parallelize(List("Audi", "Verna", "BMW", "Baleno", "Mercedes-Benz", "Ferrari"), 4)
//There will be four partitions

val cars_list = cars_rdd.mapPartitionsWithIndex((index, iterator) =>
  iterator.toList.map(x => x + "-->" + index).iterator
)

//Output - here 0, 1, 2 and 3 represent the partition indexes
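The tagging step can be sketched in plain Scala, where zipWithIndex supplies the partition index that Spark would pass to the function. The split of the six cars across four partitions shown below is illustrative only; Spark decides the actual distribution:

```scala
// Plain-Scala sketch of mapPartitionsWithIndex: each inner List stands for one
// partition, and zipWithIndex plays the role of the partition index argument.
// This particular split of the six cars is an assumption for illustration.
val partitions = List(
  List("Audi"), List("Verna", "BMW"), List("Baleno"), List("Mercedes-Benz", "Ferrari")
)

val tagged = partitions.zipWithIndex.flatMap { case (p, index) =>
  p.map(x => x + "-->" + index) // same x + "-->" + index tagging as the RDD version
}

println(tagged)
```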

(c) filter: This transformation returns a new dataset formed by selecting those elements of the source for which the function returns true.

Q-4: How many times does the keyword "Sun" appear in the input dataset?

//Input data.txt contains
//The Sun rises in the East and sets in the West.
//The Sun sets in the West and rises in the East

val data_rdd = spark.sparkContext.textFile("/FileStore/tables/data.txt")
val flat_rdd = data_rdd.flatMap(x => x.split(" "))
val filter_rdd = flat_rdd.filter(x => x == "Sun")
filter_rdd.count()
//Output 2
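The whole flatMap + filter + count pipeline can be checked in plain Scala, using the two lines of data.txt quoted above in place of textFile:

```scala
// Plain-Scala sketch of the filter pipeline, with the file contents inlined.
val lines = List(
  "The Sun rises in the East and sets in the West.",
  "The Sun sets in the West and rises in the East"
)

val words = lines.flatMap(_.split(" "))  // flatMap step: one word per element
val sunCount = words.count(_ == "Sun")   // filter + count step in one call

println(sunCount) // 2
```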

