Hola 😛 Let’s get started and dig into some essential PySpark functions.
PySpark contains loads of aggregate functions for extracting statistical information from DataFrames using group by, cube, and rollup.
Today, we’ll be checking out some aggregate functions that simplify operations on Spark DataFrames. Before moving ahead, let’s create a DataFrame to work with. I have created an Item DataFrame which records purchases of different items and their respective quantities.
# Creating a sequence of rows
row = [("Chocolate", 2), ("Kurkure", 5), ("Sheets", 20), ("Kurkure", 20), ("Chocolate", 10), ("Chocolate", 5)]

# Creating a DataFrame from the list
df = spark.createDataFrame(row, ["Item_Name", "Quantity"])

# View the DataFrame
df.show()

# View the schema
df.printSchema()
The cube function takes a list of columns and returns all possible combinations of the grouping columns. We can apply aggregate functions (sum, count, min, max, etc.) on these combinations to generate useful information.
In the statement below we apply cube, count, and sort together on the two columns, which generates subtotal and grand-total rows, marked by NULL values in the grouping columns.
# COUNT FUNCTION
df.cube(df["Item_Name"], df["Quantity"]).count().sort("Item_Name", "Quantity").show()
Let’s find out how we got this output. cube generates all possible combinations, taking one column at a time.
- Row 1: Total rows in the DataFrame, with both column values as NULL.
- Row 2: Count where Quantity is 2.
- Row 5: Count where Quantity is 20. (Item_Name is NULL here; cube walks through each Quantity value one by one.)
- Row 9: Count where Item_Name is Chocolate and Quantity is 10. (The Chocolate rows only pair with Quantity values actually present for Chocolate in the DataFrame, which is why 20 doesn’t appear as a Quantity for Chocolate.)
- Row 14: Count where Item_Name is Sheets and Quantity is 20. (We have only one entry for Sheets.)
In a similar fashion, we can obtain the remaining rows.
Here, the count function tells us how many items of a particular type are present under certain conditions.
# SUM FUNCTION
df.cube(df["Item_Name"], df["Quantity"]).sum().sort("Item_Name", "Quantity").show()
Using sum() with cube, we obtain the total quantity for each combination of the grouping columns, including the subtotals and the grand total.
Note: The order of arguments passed to cube doesn’t matter: df.cube(df["Item_Name"],df["Quantity"]).count().show() and df.cube(df["Quantity"],df["Item_Name"]).count().show() return the same rows.
rollup returns a subset of the rows returned by cube. It takes a list of columns as input and finds their hierarchical combinations. We can apply an aggregate function to extract the needed information. The extracted rows are fewer in number, but they are the ones actually worth using.
Let’s call the rollup function.
Since the first column passed is “Item_Name”, rollup doesn’t return counts for rows where only “Item_Name” is NULL; those rows are absent from the table. Let’s find out how we obtained this output:
- Row 1: The row with both columns NULL, which gives the total number of rows.
- Row 4: Count where Item_Name is Chocolate and Quantity is 5.
- Row 6: Count where Item_Name is Kurkure.
- Row 8: Count where Item_Name is Kurkure and Quantity is 20.
- Row 10: Count where Item_Name is Sheets and Quantity is 20. (There is no row where only “Item_Name” is NULL; rollup leaves those combinations out, dropping the less useful information.)
As explained above, we can retrieve the other rows as well.
The row with both columns NULL is still returned because it carries useful information: it gives the total number of rows in the DataFrame.
Let’s check what happens when we switch the order of arguments.
Here we see NULL values in the “Item_Name” column, but no rows where only “Quantity” is NULL.
Note: The experiment above shows that the order of arguments in rollup matters. The results obtained from df.rollup(df["Item_Name"],df["Quantity"]).count().show() and df.rollup(df["Quantity"],df["Item_Name"]).count().show() are different.
Difference between Group By, Cube and Rollup
The GROUP BY clause groups the results according to the specified columns, after which we can apply aggregate functions to obtain the desired output.
The cube function calculates subtotals and the grand total for every combination of the columns, including the NULL cases. cube is an extension of the GROUP BY clause.
rollup is also an extension of the GROUP BY clause. It calculates sub-totals for the hierarchical combinations of the columns, excluding rows where only the first column is NULL. It is used to extract summarized information. rollup creates the groupings and then applies an aggregate function to them.
I hope you got something valuable out of it. That’s all I have for today. Thank you so much for reading.
Are you curious to learn more about Spark? Don’t worry, I will keep you in the loop!! Stay tuned and check out the other blogs too. Comment below with any corrections/suggestions.