What if you need to find the name of the employee with the highest salary. Or get the names of the total employees in each department from the employee table. How would you apply operations on dataframes to get these results?
Now, here comes “Spark Aggregate Functions” into the picture. Well, it would be wonderful if you are known to SQL Aggregate functions. These are much similar in functionality. Aggregate functions are applied to a group of rows to form a single value for every group. So today, we’ll be checking out the below functions:
Before practicing the examples, let’s first create a dataframe to work with.
//Importing import spark.implicits._ //Create Sequence val item=Seq(("Chocolate",12,20,"Deck Store"),("Kurkure",20,25,"Mooney ShopStore"),("Biscuits",6,45,"Deck Store"),("Pen",2,5,"Shyam Store"),("Sheets",5,10,"Shyam Store"),("Novel",10,100,"Mooney ShopStore")) // Create dataframe directly from Sequence val df=item.toDF("Item_Name","Quantity","Price/item","Shop_Name") //View Dataframe df.show() //View Schema df.printSchema()
avg() returns the average of values in a given column. It takes one argument as a column name. We need to import SQL functions to use them.
Question: Create a new column “Total Cost” to find total price of each item. Also calculate the average of the amount spend.
// Importing all SQL functions import org.apache.spark.sql.functions._ // Creating new dataframe with Total Cost column val df_avg=df.withColumn("Total Cost",col("Quantity")*col("Price/item")) // View dataframe df_avg.show() // Applying average function df_avg.select(avg("Total Cost").as("Average Cost")).show()
sum() is used to find the total value in a given column.
Question:Calculate the total number of items purchased.
This aggregate function group rows that are associated with the same values.
Question:How many items are purchased from each shopstore?
Question:Calculate the average amount earn by each shopstore.
// Create a new column val df_avg=df.withColumn("Total Cost",col("Quantity")*col("Price/item")) // Average money earn df_avg.groupBy("Shop_Name").avg("Total Cost").show()
Question:Find the quantity of item purchased maximum from each shopstore?
max() returns the maximum value in a given column.
Question:Find the quantity of item purchased maximum.
min() returns the minimum value in a given column.
Question:Find the quantity of item which is least sold by each Shopstore.
count() returns the number of elements in a column. collect() returns elements of the dataset as a list.
Question:Count the total products purchased.
print("Number of items purchased are: " + df.select(count("Quantity")).collect()(0))
distinct() returns only unique values of a column.
Question:Name all the shopstores he purchased various items from.
So, we saw the following cases in the post:
- We can apply aggregate functions on the dataframe too.
- Multiple aggregate functions can be applied together.
- We can use groupBy along with other functions to calculate measures on the basis of some columns.
Hope this helps!!