Hey!! Welcome. Let’s check out what we have today in PySpark.

Have you ever thought of using SQL-style statements on a PySpark DataFrame? Is it possible to apply conditions in PySpark to get the desired output in the DataFrame?

Most of us have studied case and switch statements in whichever programming language we practiced. And yes, Spark too provides us with “when otherwise” and “case when” statements to reshape a DataFrame from its existing columns according to your own conditions.

Let’s first do the needed imports and create a DataFrame. I have chosen a student-based DataFrame.
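One note before we start: the snippets below assume a SparkSession is already available as spark, as it is in the PySpark shell or a notebook. If you are running a standalone script, create one first; a minimal sketch (the app name is just an illustrative choice):

# Create a SparkSession when one is not already provided
# (in the PySpark shell or a notebook, `spark` already exists).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("when-otherwise-demo").getOrCreate()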

#Sequence
seqdata=[("Candis",27,20,89.2,"F"),("Robert",60,21,85.1,"Other"),("Nanki",46,21,80.1,"F"),("Deck",30,19,90.7,"M")]

#Imports
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

#Defining the schema
Schema=[StructField("Name",StringType(),False),
        StructField("RollNo",IntegerType(),True),
        StructField("Age",IntegerType(),False),
        StructField("Percentage",DoubleType(),False),
        StructField("Gender",StringType(),True)]

#Creating Dataframe
df=spark.createDataFrame(spark.sparkContext.parallelize(seqdata),StructType(Schema))

#View Dataframe
df.show()

#View Schema
df.printSchema()
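If everything is wired up correctly, the two calls above should print something like this:

+------+------+---+----------+------+
|  Name|RollNo|Age|Percentage|Gender|
+------+------+---+----------+------+
|Candis|    27| 20|      89.2|     F|
|Robert|    60| 21|      85.1| Other|
| Nanki|    46| 21|      80.1|     F|
|  Deck|    30| 19|      90.7|     M|
+------+------+---+----------+------+

root
 |-- Name: string (nullable = false)
 |-- RollNo: integer (nullable = true)
 |-- Age: integer (nullable = false)
 |-- Percentage: double (nullable = false)
 |-- Gender: string (nullable = true)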

1. when otherwise

Since col and when are Spark functions, we need to import them first. We’ll use the withColumn() function: for the first argument we pass the name of an existing or new column, and in the second argument we write the when otherwise condition.

If a condition is satisfied, the column gets that when value; otherwise it gets the otherwise value.

Question: Create a new column, or change the existing one, to convert the Gender column to its expanded form.

Hint: F denotes Female


a) Existing Column

#Importing
from pyspark.sql.functions import col, when

df_when = df.withColumn("Gender", when(col("Gender") == "M", "Male")
                                  .when(col("Gender") == "F", "Female")
                                  .otherwise("Other"))

#View Dataframe
df_when.show()

b) New Column

df_when2 = df.select(col("*"), when(col("Gender") == "M", "Male")
                               .when(col("Gender") == "F", "Female")
                               .otherwise("Unknown").alias("New_gender"))

#View Dataframe
df_when2.show()
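For reference, df_when2.show() should print something like this, with the original Gender column untouched and the mapped values in New_gender:

+------+------+---+----------+------+----------+
|  Name|RollNo|Age|Percentage|Gender|New_gender|
+------+------+---+----------+------+----------+
|Candis|    27| 20|      89.2|     F|    Female|
|Robert|    60| 21|      85.1| Other|   Unknown|
| Nanki|    46| 21|      80.1|     F|   Unknown|
|  Deck|    30| 19|      90.7|     M|      Male|
+------+------+---+----------+------+----------+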

2. case when

Similar to when otherwise, we can write a SQL-style case when inside the expression function expr().

Question: Create a column “Performance” and fill it in on the basis of Percentage.


# case when
from pyspark.sql.functions import expr

df_case = df.withColumn("Performance",
                        expr("case when Percentage > 88.0 then 'Excellent' " +
                             "when Percentage < 83.0 then 'Average' " +
                             "else 'Great' end"))
#View Dataframe
df_case.show()
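With our data, the output should look something like this:

+------+------+---+----------+------+-----------+
|  Name|RollNo|Age|Percentage|Gender|Performance|
+------+------+---+----------+------+-----------+
|Candis|    27| 20|      89.2|     F|  Excellent|
|Robert|    60| 21|      85.1| Other|      Great|
| Nanki|    46| 21|      80.1|     F|    Average|
|  Deck|    30| 19|      90.7|     M|  Excellent|
+------+------+---+----------+------+-----------+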

3. Using the & and | operators

To apply multiple conditions in a DataFrame using case when and when otherwise, we can use the AND (&) and OR (|) operators.

Question: Find the Results of the students according to multiple conditions of your choice.


data = df.withColumn("Results",
                     when((df["Age"] > 19) & (df["Percentage"] > 87.0), "Pass")
                     .when((df["Age"] == 19) | (df["Percentage"] == 87.0), "Fail")
                     .otherwise("Unknown"))
data.show()
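Here, the output should look something like this: only Candis clears both conditions, Deck matches the Age == 19 branch, and the rest fall through to “Unknown”.

+------+------+---+----------+------+-------+
|  Name|RollNo|Age|Percentage|Gender|Results|
+------+------+---+----------+------+-------+
|Candis|    27| 20|      89.2|     F|   Pass|
|Robert|    60| 21|      85.1| Other|Unknown|
| Nanki|    46| 21|      80.1|     F|Unknown|
|  Deck|    30| 19|      90.7|     M|   Fail|
+------+------+---+----------+------+-------+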

In this article, we have learned how to use Spark functions like when() and expr(), as well as how to combine their conditions with & and |. Leave a comment if you liked it or ran into any issues.
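One last option worth knowing: since case when is plain SQL, you can also run it through spark.sql() after registering the DataFrame as a temporary view. A minimal sketch (the view name students is an arbitrary choice):

# Register the DataFrame as a temp view so it can be queried with SQL.
# The view name "students" is just an illustrative choice.
df.createOrReplaceTempView("students")

spark.sql("""
    SELECT *,
           CASE WHEN Percentage > 88.0 THEN 'Excellent'
                WHEN Percentage < 83.0 THEN 'Average'
                ELSE 'Great'
           END AS Performance
    FROM students
""").show()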

Happy Sparking!!

-Gargi Gupta

