The Spark filter() function is used to filter rows from a DataFrame based on a given condition or expression. If you are familiar with SQL, filtering rows according to your requirements will feel natural: for example, listing the students who scored more than a certain mark, or the employees in a particular department.

We can refer to a column in three ways:

  • col("Column Name")
  • $"Column Name"
  • df("Column Name")
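
For instance, assuming a DataFrame `df` with a "Salary" column, all three forms refer to the same column (a sketch; the `$` syntax needs `import spark.implicits._`):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.col

// All three expressions refer to the same "Salary" column:
df.filter(col("Salary") > 8000)   // col() from org.apache.spark.sql.functions
df.filter($"Salary" > 8000)       // $ string interpolator from spark.implicits._
df.filter(df("Salary") > 8000)    // apply method on the DataFrame itself
```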

Before we start with the examples, let's first create a DataFrame.

import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType, IntegerType}

// Create Collection Sequence
val row = Seq(Row(101,"Beck","F",1,8900), Row(102,"Joe","M",3,7800), Row(103,"Peach","F",1,10000),
  Row(104,"Bengie","M",2,6000), Row(105,"Barko","M",2,11000))

val schema = List(StructField("EmpId", IntegerType, true), StructField("EmpName", StringType, false),
  StructField("Gender", StringType, false), StructField("Dept No", IntegerType, false),
  StructField("Salary", IntegerType, false))

// creating dataframe
val df = spark.createDataFrame(spark.sparkContext.parallelize(row), StructType(schema))
// view Dataframe
df.show()
// view Schema
df.printSchema()

Below are some ways to apply filter operation:

a) Dataframe Filter() with column operation

You can mention your column condition inside the filter function.

Note: Use “===” for comparison.

Question: Find the names of employees who belong to department 2.

// filter according to column conditions
val df_dept = df.filter($"Dept No" === 2)
df_dept.show()

b) Dataframe Filter() with SQL Expression

We can also use SQL expressions to filter dataframe rows.

Question: Filter the employees with a salary of more than 8000.

// SQL expression
df.filter("Salary > 8000").show()

c) Filtering with multiple conditions

To filter rows based on multiple conditions, we can combine two or more expressions with the following logical operators:

  • AND (&&)
  • NOT (!)
  • OR (||)

Question: Filter out male employees with salary more than 6500.

// Filter with Multiple condition
df.filter(df("Gender")==="M" && df("Salary")>6500).show()
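
The OR (||) and NOT (!) operators combine conditions in the same way; for example, against the employee DataFrame above (a sketch):

```scala
// OR: employees in department 1 OR earning more than 10000
df.filter(df("Dept No") === 1 || df("Salary") > 10000).show()

// NOT: employees who are not male
df.filter(!(df("Gender") === "M")).show()
```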

d) Filtering on an array column

What if you want to filter rows based on a condition applied to an Array-type column? The SQL comparison expressions above do not apply directly to arrays.

To resolve this, we use the array_contains() SQL function, which returns true/false depending on whether a particular value is present in the array. Let's create an Array-type DataFrame.

import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StringType, ArrayType}

// Sequence
val row = Seq(Row(Row("Rohit","Gupta"), List("Python","C"), "M"),
  Row(Row("Shanu","Sisodiya"), List("C++","Java","Python"), "F"),
  Row(Row("Parth","Sisodiya"), List("C++","C#","Java"), "M"))

val schema = new StructType()
    .add("Name", new StructType()
        .add("FirstName", StringType)
        .add("LastName", StringType))
    .add("Languages", ArrayType(StringType))
    .add("Gender", StringType)

// creating dataframe
val df = spark.createDataFrame(spark.sparkContext.parallelize(row), schema)
// view Dataframe
df.show()
// view Schema
df.printSchema()

Question: Filter out the names of students who study Java.

// Importing SQL functions
import org.apache.spark.sql.functions._
// Filtering condition: keep rows whose Languages array contains "Java"
df.filter(array_contains(df("Languages"), "Java")).select("Name").show()

I’ve covered some common ways to filter rows from a DataFrame. Practice them!! Frame your own questions, and here is one homework question for you all.

Comment down the expression for filtering out the names of students whose last name is “Sisodiya”.

Hint: You need to use the nested struct column concept. Don’t worry, it’s similar to what you saw with the Array type. Use the “.” operator to refer to child columns.
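
As a nudge (not the full answer), here is how the “.” operator reaches a child column, sketched against the FirstName field and assuming the Name struct holds FirstName and LastName:

```scala
// parent.child notation refers to a nested struct field
df.filter(df("Name.FirstName") === "Rohit").show()
```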

-Gargi Gupta

