(a) GroupByKey

• On applying groupByKey, a dataset of (K, V) pairs is converted into a dataset of (K, Iterable&lt;V&gt;) pairs.
• No map-side combining happens, so every (K, V) pair is shuffled, causing a lot of unnecessary data transfer over the network.

In the above image, every key and value is being transferred over the network.
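The grouping behaviour can be sketched in plain Python (no Spark needed; the names below are illustrative, not part of any API). Every raw (key, value) pair is retained per key before any aggregation runs, which on a cluster means every pair crosses the network:

```python
from collections import defaultdict

# Minimal local sketch of groupByKey semantics: collect all raw values
# per key first; aggregation only happens after grouping is complete.
def group_by_key(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)  # every raw value is kept
    return dict(groups)

pairs = [("Joseph", 80), ("Joseph", 82), ("Bob", 45), ("Bob", 60)]
grouped = group_by_key(pairs)
# grouped == {"Joseph": [80, 82], "Bob": [45, 60]}
totals = {name: sum(marks) for name, marks in grouped.items()}
# totals == {"Joseph": 162, "Bob": 105}
```

Summing only after grouping is exactly why groupByKey is wasteful here: the per-key lists exist solely to be collapsed immediately afterwards.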

Q-1: Find the total marks of each student using the groupByKey transformation.

# Input
# Joseph,Physics,80,100
# Joseph,Chemistry,82,100
# Bob,Physics,45,100
# Bob,English,60,100
# ..,..,..,..
# ..,..,..,..

# Fetch StudentName and Marks only
students_name_marks = students_marks_rdd.map(lambda x: (x.split(",")[0], int(x.split(",")[2])))
students_name_marks.collect()

students_name_marks.groupByKey().map(lambda x:(x[0],sum(x[1]))).collect()

#Output
# [('Rocky', 591), ('Ron', 293), ('Tina', 600), ('Jimmy', 455), ('Harry', 237), ('Joseph', 526), ('Stephanie', 465), ('Williamson', 366), ('Bob', 238)]

(b) reduceByKey: This transformation is applied to a dataset of (K, V) pairs and returns a dataset of (K, V) pairs where the values for each key are first aggregated locally on each node (map-side combine) before the partial results are shuffled.

• Used when the combiner and reducer logic are the same.
• Used when the input and output datatypes are the same.
• Suitable for sum, min, and max operations.

Q-2: Find the total marks of each student using the reduceByKey transformation.

# Input
# Joseph,Physics,80,100
# Joseph,Chemistry,82,100
# Bob,Physics,45,100
# Bob,English,60,100
# ..,..,..,..
# ..,..,..,..

# Fetch StudentName and Marks only
students_name_marks = students_marks_rdd.map(lambda x: (x.split(",")[0], int(x.split(",")[2])))
students_name_marks.collect()

totalmarks_perStudent=students_name_marks.reduceByKey(lambda accum,marks:accum+marks)
totalmarks_perStudent.collect()

#Output
# [('Rocky', 591), ('Ron', 293), ('Tina', 600), ('Jimmy', 455), ('Harry', 237), ('Joseph', 526), ('Stephanie', 465), ('Williamson', 366), ('Bob', 238)]

(c) aggregateByKey: This transformation aggregates the values of each key using the given combine functions and a neutral zero value.

• Used when the combiner and reducer logic are different.
• Used when the input and output datatypes are different.
• zeroValue is the initial value of the accumulator (here (0, 0): a running sum and a subject count).
• seqOp: the sequential operation applied to each partition's data (here, adding a mark to the running sum and incrementing the count).
• combOp: the combiner operation applied to the aggregated results of all partitions (here, adding the partial (sum, count) pairs together).
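The interplay of zeroValue, seqOp, and combOp can be sketched in plain Python (illustrative names, not Spark's API): seqOp folds each raw value into a per-partition accumulator that starts at zeroValue, and combOp merges accumulators across partitions:

```python
# Local sketch of aggregateByKey with zeroValue = (0, 0).
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    # seqOp: fold each value into the per-partition accumulator
    partials = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = seq_op(local.get(key, zero), value)
        partials.append(local)
    # combOp: merge accumulators from different partitions
    merged = {}
    for local in partials:
        for key, acc in local.items():
            merged[key] = comb_op(merged[key], acc) if key in merged else acc
    return merged

partitions = [
    [("Joseph", 80), ("Bob", 45)],   # partition 0
    [("Joseph", 82), ("Bob", 60)],   # partition 1
]
result = aggregate_by_key(
    partitions,
    (0, 0),                                              # zeroValue
    lambda acc, mark: (acc[0] + mark, acc[1] + 1),       # seqOp: add mark, count it
    lambda a, b: (a[0] + b[0], a[1] + b[1]),             # combOp: merge (sum, count)
)
# result == {"Joseph": (162, 2), "Bob": (105, 2)}
```

Because seqOp takes an accumulator and a raw value while combOp takes two accumulators, the input type (int marks) and output type ((sum, count) tuples) are free to differ.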

Q-3: Find the total marks and the number of subjects of each student using the aggregateByKey transformation.

# Input
# Joseph,Physics,80,100
# Joseph,Chemistry,82,100
# Bob,Physics,45,100
# Bob,English,60,100
# ..,..,..,..
# ..,..,..,..

# Fetch StudentName and Marks only
students_name_marks = students_marks_rdd.map(lambda x: (x.split(",")[0], int(x.split(",")[2])))
students_name_marks.collect()

aggregate_sum=students_name_marks.aggregateByKey((0,0), \
lambda valueCounter, number : (valueCounter[0] + number, valueCounter[1] + 1), \
lambda valueCounter, nextValueCounter : (valueCounter[0] + nextValueCounter[0], valueCounter[1] + nextValueCounter[1]))
aggregate_sum.collect()
# Output: one (total_marks, subject_count) pair per student, e.g.
# [('Joseph', (526, <subject_count>)), ('Bob', (238, <subject_count>)), ...]

What is the difference between groupByKey, reduceByKey, and aggregateByKey?
