The Spark filter() function is used to filter rows from a dataframe based on a given condition or expression. If you are familiar with SQL, it will feel natural to filter rows according to your requirements: for example, a list of students who scored more than a certain mark, or a list of employees in a particular department.
We can refer to a column in three ways:
- col("Column Name")
- $"Column Name"
- df("Column Name")
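As a quick sketch of the three styles side by side (assuming a hypothetical dataframe `df` with a `Salary` column and `spark.implicits._` in scope), all of these are equivalent:

```scala
// Assumes: a SparkSession named spark, a dataframe df with a "Salary" column,
// and `import spark.implicits._` for the $ interpolator (illustrative only)
import org.apache.spark.sql.functions.col

df.filter(col("Salary") > 8000)  // col() function from spark.sql.functions
df.filter($"Salary" > 8000)      // $ string interpolator
df.filter(df("Salary") > 8000)   // apply() on the dataframe itself
```

The $ form is the most concise, but df("Column Name") is handy when you join dataframes and need to disambiguate which dataframe a column comes from.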
Before we start with examples, first create a dataframe.
```scala
// Imports
import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType, IntegerType}

// Create a sequence of rows
val row = Seq(
  Row(101, "Beck", "F", 1, 8900),
  Row(102, "Joe", "M", 3, 7800),
  Row(103, "Peach", "F", 1, 10000),
  Row(104, "Bengie", "M", 2, 6000),
  Row(105, "Barko", "M", 2, 11000)
)

// Schema: EmpId, EmpName, Gender, Dept No, Salary
val schema = List(
  StructField("EmpId", IntegerType, true),
  StructField("EmpName", StringType, false),
  StructField("Gender", StringType, false),
  StructField("Dept No", IntegerType, false),
  StructField("Salary", IntegerType, false)
)

// Create the dataframe
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(row),
  StructType(schema)
)

// View the dataframe and its schema
df.show()
df.printSchema()
```

Below are some ways to apply filter operation:
a) Dataframe Filter() with column operation
You can pass your column condition directly to the filter function.
Note: use "===" (not "==") for equality comparison on columns.
Question: Find the names of employees who belong to department 2.
```scala
// Filter according to a column condition
val df_dept = df.filter($"Dept No" === 2)
df_dept.show()
```

b) Dataframe Filter() with SQL Expression
We can also use SQL expressions to filter dataframe rows.
Question: Filter the employees with a salary of more than 8000.
```scala
// Filter with a SQL expression string
df.filter("Salary > 8000").show()
```
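SQL expression strings can also carry multiple clauses, just like a WHERE clause. A small sketch on the same employee dataframe:

```scala
// SQL-style expressions combined with AND / BETWEEN (illustrative sketch,
// using the df created above)
df.filter("Salary > 8000 AND Gender = 'F'").show()
df.filter("Salary BETWEEN 7000 AND 10000").show()
```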

c) Filtering with multiple conditions
To filter rows based on multiple conditions, we can use the following conditional expressions to combine two or more statements:
- AND (&&)
- OR (||)
- NOT (!)
Question: Filter out male employees with a salary of more than 6500.
```scala
// Filter with multiple conditions
df.filter(df("Gender") === "M" && df("Salary") > 6500).show()
```
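The same pattern works with the other two operators; here is a quick sketch on the same dataframe:

```scala
// OR: employees in department 1 or earning more than 10000
df.filter(df("Dept No") === 1 || df("Salary") > 10000).show()

// NOT: everyone except department 2 (wrap the condition in parentheses)
df.filter(!(df("Dept No") === 2)).show()
```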

d) Filtering on an array column
What if you want to filter rows based on a condition applied to an ArrayType column? Plain column comparisons like the ones above won't work on an array.
To handle this, we use the array_contains() SQL function, which returns true or false depending on whether a given value is present in the array. Let's create a dataframe with an ArrayType column.
```scala
// Imports
import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StringType, ArrayType}

// Rows with a nested struct column (Name) and an array column (Languages)
val row = Seq(
  Row(Row("Rohit", "Gupta"), List("Python", "C"), "M"),
  Row(Row("Shanu", "Sisodiya"), List("C++", "Java", "Python"), "F"),
  Row(Row("Parth", "Sisodiya"), List("C++", "C#", "Java"), "M")
)

// Schema
val schema = new StructType()
  .add("Name", new StructType()
    .add("firstname", StringType)
    .add("lastname", StringType))
  .add("Languages", ArrayType(StringType))
  .add("Gender", StringType)

// Create the dataframe
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(row),
  schema
)

// View the dataframe and its schema
df.show()
df.printSchema()
```

Question: Filter out the names of students who study Java.
```scala
// Import SQL functions (for array_contains)
import org.apache.spark.sql.functions._

// Keep rows whose Languages array contains "Java"
df.filter(array_contains($"Languages", "Java")).show()
```

I've covered some common ways to filter rows from a dataframe. Practice them, frame your own questions, and here's one piece of homework for you all.
Comment down the expression for filtering out the names of students whose last name is "Sisodiya".
Hint: You need the nested struct column concept. Don't worry, it's similar to what you saw with the array type; use the "." operator to refer to child columns.
-Gargi Gupta