Hey!! Welcome. Let’s check out what we have today in PySpark.
Have you ever thought of using SQL-style statements on a PySpark DataFrame? Is it possible to provide conditions in PySpark to get the desired output in a DataFrame?
We have all studied case and switch statements in whichever programming language we practiced, and Spark too provides us with “when otherwise” and “case when” statements to reframe a DataFrame’s existing columns according to our own conditions.
Let’s first do the imports that are needed and create a DataFrame. I have chosen a student-based DataFrame.
#Sequence
seqdata = [("Candis",27,20,89.2,"F"), ("Robert",60,21,85.1,"Other"),
           ("Nanki",46,21,80.1,"F"), ("Deck",30,19,90.7,"M")]

#Imports
from pyspark.sql.types import *

#Defining schema
Schema = [StructField("Name", StringType(), False),
          StructField("RollNo", IntegerType(), True),
          StructField("Age", IntegerType(), False),
          StructField("Percentage", DoubleType(), False),
          StructField("Gender", StringType(), True)]

#Creating DataFrame
df = spark.createDataFrame(spark.sparkContext.parallelize(seqdata), StructType(Schema))

#View DataFrame
df.show()

#View schema
df.printSchema()
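Running this, df.show() should print something like the following (the exact layout can vary a little between Spark versions):

+------+------+---+----------+------+
|  Name|RollNo|Age|Percentage|Gender|
+------+------+---+----------+------+
|Candis|    27| 20|      89.2|     F|
|Robert|    60| 21|      85.1| Other|
| Nanki|    46| 21|      80.1|     F|
|  Deck|    30| 19|      90.7|     M|
+------+------+---+----------+------+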

1. when otherwise
Since col and when are Spark functions, we need to import them first. We’ll use the withColumn() function: the first argument is the name of an existing or new column, and the second argument holds the when otherwise condition.
If a condition is satisfied, the column takes the corresponding when value; otherwise it falls back to the otherwise value.
Question: Create a new column, or modify the existing one, to convert the Gender column to its expanded form.
Hint: F denotes Female
a) Existing Column
#Importing
from pyspark.sql.functions import col, when

df_when = df.withColumn("Gender",
    when(col("Gender") == "M", "Male")
    .when(col("Gender") == "F", "Female")
    .otherwise("Other"))

#View Dataframe
df_when.show()
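A small aside worth knowing (my own addition, not part of the original exercise): if you chain when() calls without a final .otherwise(), any row that matches none of the conditions comes out as null. A minimal sketch:

# Without .otherwise(), unmatched values fall back to null
df_nulls = df.withColumn("Gender",
    when(col("Gender") == "M", "Male")
    .when(col("Gender") == "F", "Female"))

df_nulls.show()  # Robert's "Other" shows up as null here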

b) New Column
df_when2 = df.select(col("*"),
    when(col("Gender") == "M", "Male")
    .when(col("Gender") == "F", "Female")
    .otherwise("Unknown").alias("New_gender"))

# View Dataframe
df_when2.show()
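As a variation (my sketch, not the article’s original code), withColumn() can also create the new column directly: passing a name that doesn’t exist yet adds a column instead of replacing one.

# withColumn() with a fresh name adds "New_gender" instead of replacing "Gender"
df_when3 = df.withColumn("New_gender",
    when(col("Gender") == "M", "Male")
    .when(col("Gender") == "F", "Female")
    .otherwise("Unknown"))

df_when3.show()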

2. case when
Similar to when otherwise, we can write SQL-style case when logic inside the expression function expr().
Question: Create a column “Performance” and derive it from the Percentage column.
# case when
from pyspark.sql.functions import expr

df_case = df.withColumn("Performance",
    expr("case when Percentage > 88.0 then 'Excellent' " +
         "when Percentage < 83.0 then 'Average' " +
         "else 'Great' end"))

#View Dataframe
df_case.show()
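If you prefer, the same case when string also works with selectExpr(), which skips the explicit expr() call. A minimal sketch with the same conditions:

# selectExpr(): "*" keeps every existing column, the case when adds Performance
df_case2 = df.selectExpr("*",
    "case when Percentage > 88.0 then 'Excellent' " +
    "when Percentage < 83.0 then 'Average' " +
    "else 'Great' end as Performance")

df_case2.show()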

3. Using the & and | operators
To apply multiple conditions with case when or when otherwise, we can combine them using the AND (&) and OR (|) operators. Note that each individual condition must be wrapped in parentheses, because & and | bind more tightly than comparison operators in Python.
Question: Find the results of students according to multiple conditions of your choice.
data = df.withColumn("Results",
    when((df["Age"] > 19) & (df["Percentage"] > 87.0), "Pass")
    .when((df["Age"] == 19) | (df["Percentage"] == 87.0), "Fail")
    .otherwise("Unknown"))

data.show()
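For completeness, here is one way (my sketch, not the article’s original code) to express the same multi-condition logic in SQL form inside expr(), where & becomes “and” and | becomes “or”:

# Same Results logic written as a SQL case when inside expr()
from pyspark.sql.functions import expr

data_sql = df.withColumn("Results", expr(
    "case when Age > 19 and Percentage > 87.0 then 'Pass' " +
    "when Age = 19 or Percentage = 87.0 then 'Fail' " +
    "else 'Unknown' end"))

data_sql.show()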

In this article, we have learned how to use Spark functions like when() and expr(), as well as how to combine their conditions with & and |. Leave your words if you liked it or ran into any issues.
Happy Sparking!!
-Gargi Gupta