What if you need to find the name of the employee with the highest salary? Or count the employees in each department of an employee table? How would you apply operations on DataFrames to get these results?

This is where Spark aggregate functions come into the picture. If you are already familiar with SQL aggregate functions, these will feel very similar. Aggregate functions are applied to a group of rows to produce a single value per group. In this post, we’ll look at the following functions:

• avg()
• sum()
• groupBy()
• max()
• min()
• count()
• distinct()

Before practicing the examples, let’s first create a dataframe to work with.
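The snippets below assume a SparkSession named `spark` is already in scope, as it is in spark-shell. In a standalone application you would create one first; a minimal sketch (the app name and master setting here are illustrative):

```scala
// Minimal sketch: build a local SparkSession for the examples that follow
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("AggregateFunctions")
  .master("local[*]")
  .getOrCreate()
```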

//Importing
import spark.implicits._

//Create Sequence
val item = Seq(
  ("Chocolate", 12, 20, "Deck Store"),
  ("Kurkure", 20, 25, "Mooney ShopStore"),
  ("Biscuits", 6, 45, "Deck Store"),
  ("Pen", 2, 5, "Shyam Store"),
  ("Sheets", 5, 10, "Shyam Store"),
  ("Novel", 10, 100, "Mooney ShopStore")
)

// Create dataframe directly from Sequence
val df = item.toDF("Item_Name", "Quantity", "Price/item", "Shop_Name")

//View Dataframe
df.show()

//View Schema
df.printSchema()

#### 1. avg()

avg() returns the average of the values in a given column. It takes a column name as its argument. We need to import the Spark SQL functions before using it.

Question: Create a new column “Total Cost” holding the total price of each item. Then calculate the average amount spent.

// Importing all SQL functions
import org.apache.spark.sql.functions._

// Creating new dataframe with Total Cost column
val df_avg = df.withColumn("Total Cost", col("Quantity") * col("Price/item"))

// View dataframe
df_avg.show()

// Applying average function
df_avg.select(avg("Total Cost").as("Average Cost")).show()

#### 2. sum()

sum() returns the total of the values in a given column.

Question: Calculate the total number of items purchased.

df.select(sum("Quantity").as("Total Items")).show()

#### 3. groupBy()

groupBy() groups rows that share the same values in the given column(s); an aggregate function can then be applied to each group.

Question: How many items were purchased from each shop?

df.groupBy("Shop_Name").sum("Quantity").show()

Question: Calculate the average amount earned by each shop.

// Create the Total Cost column (same as in the avg() section)
val df_avg = df.withColumn("Total Cost", col("Quantity") * col("Price/item"))
// Average amount earned per shop
df_avg.groupBy("Shop_Name").avg("Total Cost").show()

Question: Find the maximum quantity purchased of any item from each shop.

df.groupBy("Shop_Name").max("Quantity").show()

#### 4. max()

max() returns the maximum value in a given column.

Question: Find the maximum quantity purchased of any single item.

df.select(max("Quantity").as("Maximum Quantity")).show()

#### 5. min()

min() returns the minimum value in a given column.

Question: Find the smallest quantity of any item sold by each shop.

df.groupBy("Shop_Name").min("Quantity").show()

#### 6. count()

count() returns the number of non-null values in a column. collect() returns the rows of the dataset to the driver as an array, which lets us pull the result out of the DataFrame.

Question: Count the total number of products purchased.

println("Number of products purchased: " + df.select(count("Quantity")).first().getLong(0))

#### 7. distinct()

distinct() returns only the unique values of a column.

Question: Name all the shops that items were purchased from.

df.select($"Shop_Name").distinct().show()
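Relatedly, if you only need how many unique shops there are rather than their names, the `countDistinct` function from `org.apache.spark.sql.functions` combines both steps (a small sketch using the same df):

```scala
// Count unique shop names in a single expression
df.select(countDistinct("Shop_Name").as("Distinct Shops")).show()
```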

So, we covered the following in this post:

• Aggregate functions can be applied to DataFrames.
• Multiple aggregate functions can be applied together.
• groupBy can be combined with aggregate functions to calculate measures per group.
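As an example of applying several aggregates together, `agg` computes multiple aggregate functions in one pass over the grouped data (a sketch using the same df as above; the aliases are illustrative):

```scala
// Several aggregates in a single pass per shop
df.groupBy("Shop_Name")
  .agg(
    sum("Quantity").as("Total Quantity"),
    max("Price/item").as("Max Price"),
    count("Item_Name").as("Items")
  )
  .show()
```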

Hope this helps!!

-Gargi Gupta

Categories: Big Data, Scala, Spark
