Spark's filter() function filters rows of a DataFrame based on a given condition or expression. If you are familiar with SQL, filtering rows according to your requirements will feel very natural.

For example, you may wish to get a list of students who scored more than a certain mark, or a list of the employees in a particular department.

We can refer to a column in three ways, as the sketch below shows:

  • col("Column Name")
  • df["Column Name"]
  • df.ColumnName
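
For instance, all three of the following refer to the same Salary column of the df DataFrame we build below, and produce the same filtered result:

from pyspark.sql.functions import col

# Three equivalent ways to reference the Salary column in a condition
df.filter(col("Salary") > 9000)
df.filter(df["Salary"] > 9000)
df.filter(df.Salary > 9000)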

Before we start with the examples, let's first create a DataFrame.

from pyspark.sql import SparkSession
from pyspark.sql.types import *

# Get or create a SparkSession
spark = SparkSession.builder.getOrCreate()

data = [(101,"Beck","F",1,8900),(102,"Joe","M",3,7800),(103,"Peach","F",1,10000),(104,"Bengie","M",2,6000),(105,"Barko","M",2,11000)]

# Create a schema for the dataframe
# (Dept No and Salary hold integers in the data, so they must be IntegerType,
# not StringType, or createDataFrame will fail and numeric comparisons won't work)
schema = StructType([
    StructField('EmpId', IntegerType(), True),
    StructField('EmpName', StringType(), True),
    StructField('Gender', StringType(), True),
    StructField('Dept No', IntegerType(), True),
    StructField('Salary', IntegerType(), True),
])

# Convert the list to an RDD
rdd = spark.sparkContext.parallelize(data)

# Create the data frame
df = spark.createDataFrame(rdd, schema)

df.show()

df.printSchema()
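
Given the data above, df.show() and df.printSchema() should print something like this:

+-----+-------+------+-------+------+
|EmpId|EmpName|Gender|Dept No|Salary|
+-----+-------+------+-------+------+
|  101|   Beck|     F|      1|  8900|
|  102|    Joe|     M|      3|  7800|
|  103|  Peach|     F|      1| 10000|
|  104| Bengie|     M|      2|  6000|
|  105|  Barko|     M|      2| 11000|
+-----+-------+------+-------+------+

root
 |-- EmpId: integer (nullable = true)
 |-- EmpName: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Dept No: integer (nullable = true)
 |-- Salary: integer (nullable = true)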

Below are some ways to apply a filter operation:

a) DataFrame filter() with a column condition

Pass your column condition inside the filter() function.

Note: Use "==" for comparison, not "=".

Question: Find the names of employees who belong to department 1.


from pyspark.sql.functions import col

# Filter according to a column condition
df_dept = df.filter(col("Dept No") == 1)
df_dept.show()
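
Since the question asks only for the names, you can chain a select() to keep just the EmpName column:

# Keep only the employee names (Beck and Peach are in department 1)
df_dept.select("EmpName").show()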

b) DataFrame filter() with a SQL expression

We can also pass a SQL expression string to filter DataFrame rows.

Question: Filter the employees with a Salary of more than 9000.

# SQL expression passed as a string
df.filter("Salary > 9000").show()
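
In PySpark, filter() and where() are aliases, so the same condition can be written either way:

# where() behaves exactly like filter()
df.where("Salary > 9000").show()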

c) Filtering with multiple conditions

To filter rows based on multiple conditions, we can combine two or more column conditions with PySpark's logical operators (note that each condition must be wrapped in parentheses):

  • AND (&)
  • NOT (~)
  • OR (|)

Question: Filter out the male employees in department 2.


# Filter with multiple conditions
df.filter((df["Gender"] == "M") & (df["Dept No"] == 2)).show()
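
The other two operators work the same way on the same DataFrame; for example:

# OR: employees in department 1 or earning more than 10000
df.filter((df["Dept No"] == 1) | (df["Salary"] > 10000)).show()

# NOT: employees who are not male
df.filter(~(df["Gender"] == "M")).show()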

d) Filtering on an array column

What if you want to filter rows based on a condition applied to an array-type column? How would you apply the expressions above to an array?

To resolve this, we will use the array_contains() SQL function, which returns True or False depending on whether a particular value is present in the array. Let's create a DataFrame with an array column.

from pyspark.sql.types import *

# List of rows: a (firstname, lastname) struct, an array of languages, and a gender
data = [
    (("Rohit", "Gupta"), ["Python", "C"], "M"),
    (("Shanu", "Sisodiya"), ["C++", "Java", "Python"], "F"),
    (("Parth", "Sisodiya"), ["C++", "C#", "Java"], "M"),
]

# Schema with a nested struct column and an array column
schema = StructType([
    StructField("Name", StructType([
        StructField("firstname", StringType()),
        StructField("lastname", StringType()),
    ])),
    StructField("Languages", ArrayType(StringType())),
    StructField("Gender", StringType()),
])

# Create the dataframe
df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

# View the dataframe
df.show()

# View the schema
df.printSchema()
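
The printed schema should look like this, with Name as a nested struct and Languages as an array:

root
 |-- Name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- Languages: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- Gender: string (nullable = true)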

Question: Filter out the names of students who study Python.


from pyspark.sql.functions import array_contains

# Keep rows whose Languages array contains "Python"
df.filter(array_contains(df["Languages"], "Python")).show()
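
On Spark 3.x, the result should look roughly like this (struct values print in braces, arrays in brackets):

+-----------------+-------------------+------+
|             Name|          Languages|Gender|
+-----------------+-------------------+------+
|   {Rohit, Gupta}|        [Python, C]|     M|
|{Shanu, Sisodiya}|[C++, Java, Python]|     F|
+-----------------+-------------------+------+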

I've covered some common ways to filter rows from a DataFrame. Practice them!! Frame your own questions, and here's one piece of homework for you all.

Comment down the expression for filtering out the names of students whose last name is "Sisodiya".

Hint: You need the nested struct column concept. Don't worry, it's similar to what you saw with the array type. Use the "." operator to refer to child columns.
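
As a nudge (not the full answer), here is how the "." operator reaches a child column of the Name struct:

# Select just the first names from the nested Name struct
df.select("Name.firstname").show()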

-Gargi Gupta

