In this article, we discuss the groupByKey, reduceByKey and aggregateByKey transformations.

(a) GroupByKey

  • Applying groupByKey to a dataset of (K, V) pairs returns a dataset of (K, Iterable<V>) pairs.
  • All values are shuffled without any pre-aggregation, which causes a lot of unnecessary data transfer over the network.

In the above image, every key-value pair is transferred over the network.
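
To make the network cost concrete, here is a minimal plain-Python sketch of what groupByKey does conceptually. It is not Spark code: the partitions list, the sample marks and the simulate_group_by_key helper are hypothetical, and the counter only imitates records crossing the shuffle.

```python
from collections import defaultdict

# Hypothetical sample data: two partitions of (student, marks) pairs.
partitions = [
    [("Joseph", 80), ("Bob", 45), ("Joseph", 82)],
    [("Bob", 60), ("Joseph", 75)],
]


def simulate_group_by_key(partitions):
    """Group values per key; every pair crosses the simulated shuffle."""
    shuffled_records = 0
    grouped = defaultdict(list)
    for partition in partitions:
        for key, value in partition:
            shuffled_records += 1  # no pre-aggregation: each pair is sent as-is
            grouped[key].append(value)
    return dict(grouped), shuffled_records


grouped, shuffled = simulate_group_by_key(partitions)
totals = {name: sum(marks) for name, marks in grouped.items()}

print(grouped)   # {'Joseph': [80, 82, 75], 'Bob': [45, 60]}
print(shuffled)  # 5
print(totals)    # {'Joseph': 237, 'Bob': 105}
```

All five input pairs are moved before a single addition takes place, which is the overhead the bullet points above describe.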

Q-1: Find the total marks of each student using the groupByKey transformation.

# Input
# Joseph,Physics,80,100
# Joseph,Chemistry,82,100
# Bob,Physics,45,100
# Bob,English,60,100
# ..,..,..,..
# ..,..,..,..

# Fetch student name and marks only
# (students_marks_rdd is assumed to be an RDD of CSV lines like the input above)
students_name_marks = students_marks_rdd.map(lambda x: (x.split(",")[0], int(x.split(",")[2])))
students_name_marks.collect()

# Group all marks per student, then sum each group
students_name_marks.groupByKey().map(lambda x: (x[0], sum(x[1]))).collect()

# Output
# [('Rocky', 591), ('Ron', 293), ('Tina', 600), ('Jimmy', 455), ('Harry', 237), ('Joseph', 526), ('Stephanie', 465), ('Williamson', 366), ('Bob', 238)]

(b) ReduceByKey: This transformation is applied to a dataset of (K, V) pairs and returns a dataset of (K, V) pairs in which the values for each key are aggregated with the given reduce function. Values are first combined locally on each node, so less data is transferred over the network.

  • Used when the combiner and reducer logic are the same.
  • Used when the input and output data types are the same.
  • Suitable for operations such as sum, min and max.
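
The points above can be illustrated with a small plain-Python sketch. It is not Spark's implementation: the partitions list, the sample marks and the simulate_reduce_by_key helper are hypothetical. It only imitates the idea of reducing inside each partition first and then merging the partial results with the same function.

```python
from operator import add

# Hypothetical sample data: two partitions of (student, marks) pairs.
partitions = [
    [("Joseph", 80), ("Bob", 45), ("Joseph", 82)],
    [("Bob", 60), ("Joseph", 75)],
]


def simulate_reduce_by_key(partitions, func):
    """Reduce per key inside each partition, then merge the partial results."""
    shuffled_records = 0
    merged = {}
    for partition in partitions:
        # Local combine: one running value per key within this partition.
        local = {}
        for key, value in partition:
            local[key] = func(local[key], value) if key in local else value
        # Only one partial result per key leaves the partition.
        for key, partial in local.items():
            shuffled_records += 1
            merged[key] = func(merged[key], partial) if key in merged else partial
    return merged, shuffled_records


totals, shuffled = simulate_reduce_by_key(partitions, add)
highest, _ = simulate_reduce_by_key(partitions, max)

print(totals)    # {'Joseph': 237, 'Bob': 105}
print(shuffled)  # 4
print(highest)   # {'Joseph': 82, 'Bob': 60}
```

The same function serves as both the local combiner and the final reducer, and its input and output are both plain integers, which is why sum, min and max fit this transformation well.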

Q-2: Find the total marks of each student using the reduceByKey transformation.

# Input
# Joseph,Physics,80,100
# Joseph,Chemistry,82,100
# Bob,Physics,45,100
# Bob,English,60,100
# ..,..,..,..
# ..,..,..,..

# Fetch student name and marks only
students_name_marks = students_marks_rdd.map(lambda x: (x.split(",")[0], int(x.split(",")[2])))
students_name_marks.collect()

# Add up the marks per student; partial sums are computed on each partition first
totalmarks_perStudent = students_name_marks.reduceByKey(lambda accum, marks: accum + marks)
totalmarks_perStudent.collect()

# Output
# [('Rocky', 591), ('Ron', 293), ('Tina', 600), ('Jimmy', 455), ('Harry', 237), ('Joseph', 526), ('Stephanie', 465), ('Williamson', 366), ('Bob', 238)]

(c) AggregateByKey: The aggregateByKey function aggregates the values of each key using the given combine functions and a neutral zero value.

  • Used when the combiner logic and reducer logic are different.
  • Used when the input and output data types are different.
  • zeroValue is the initial value of the accumulator for each key.
  • seqOp: the sequential operation merges one value into the accumulator within a partition. In Q-3 below, it adds the marks to the running total and increments the subject count.
  • combOp: the combiner operation merges two accumulators coming from different partitions. In Q-3 below, it adds up the totals and the counts.
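
The roles of the zero value, the sequential operation and the combiner operation can be sketched in plain Python. This is only an illustration under assumptions, not Spark's implementation: the partitions list, the sample marks and the simulate_aggregate_by_key helper are hypothetical.

```python
# Hypothetical sample data: two partitions of (student, marks) pairs.
partitions = [
    [("Joseph", 80), ("Bob", 45), ("Joseph", 82)],
    [("Bob", 60), ("Joseph", 75)],
]


def simulate_aggregate_by_key(partitions, zero_value, seq_op, comb_op):
    """Fold values into per-partition accumulators, then merge the accumulators."""
    merged = {}
    for partition in partitions:
        local = {}
        for key, value in partition:
            # seq_op: start from zero_value and fold one value into the accumulator.
            local[key] = seq_op(local.get(key, zero_value), value)
        for key, partial in local.items():
            # comb_op: merge accumulators that come from different partitions.
            merged[key] = comb_op(merged[key], partial) if key in merged else partial
    return merged


result = simulate_aggregate_by_key(
    partitions,
    (0, 0),                                       # (total_marks, subject_count)
    lambda acc, marks: (acc[0] + marks, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
averages = {name: total / count for name, (total, count) in result.items()}

print(result)    # {'Joseph': (237, 3), 'Bob': (105, 2)}
print(averages)  # {'Joseph': 79.0, 'Bob': 52.5}
```

The input values are integers while the output is a (total, count) tuple, and the two functions differ because one consumes a raw value and the other consumes two accumulators. Carrying the count along with the total is what makes a per-student average possible afterwards.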

Q-3: Find the total marks and the number of subjects of each student using the aggregateByKey transformation.

# Input
# Joseph,Physics,80,100
# Joseph,Chemistry,82,100
# Bob,Physics,45,100
# Bob,English,60,100
# ..,..,..,..
# ..,..,..,..

# Fetch student name and marks only
students_name_marks = students_marks_rdd.map(lambda x: (x.split(",")[0], int(x.split(",")[2])))
students_name_marks.collect()

# zeroValue (0, 0) is the initial (total_marks, subject_count) accumulator
aggregate_sum = students_name_marks.aggregateByKey(
    (0, 0),
    # seqOp: add one mark to the accumulator within a partition
    lambda valueCounter, number: (valueCounter[0] + number, valueCounter[1] + 1),
    # combOp: merge two accumulators from different partitions
    lambda valueCounter, nextValueCounter: (valueCounter[0] + nextValueCounter[0], valueCounter[1] + nextValueCounter[1]))
aggregate_sum.collect()

# Output
# A list of (student_name, (total_marks, subject_count)) tuples, with the same totals as in Q-2.

What is the difference between GroupByKey, ReduceByKey and AggregateByKey?

