Spark filter() function is used to filter rows from the dataframe based on given condition or expression. If you are familiar with SQL, then it would be much simpler for you to filter out rows according to your requirements.
For example, if you wish to get a list of students who got marks more than a certain limit or list of the employee in a particular department.
We can refer to a column in three ways:
- col(“Column Name”)
- df[“Column Name”]
Before we start with examples, first create a Dataframe.
from pyspark.sql.types import * data = [(101,"Beck","F",1,8900),(102,"Joe","M",3,7800),(103,"Peach","F",1,10000),(104,"Bengie","M",2,6000),(105,"Barko","M",2,11000)] # Create a schema for the dataframe schema = StructType([ StructField('EmpId', IntegerType(), True), StructField('EmpName', StringType(), True), StructField('Gender', StringType(), True), StructField('Dept No', StringType(), True), StructField('Salary', StringType(), True), ]) # Convert list to RDD rdd = spark.sparkContext.parallelize(data) # Create data frame df = spark.createDataFrame(rdd,schema) df.show() df.printSchema()
Below are some ways to apply filter operation:
a) Dataframe Filter() with column operation
You can mention your column condition inside the filter function.
Note: Use “==” for comparison.
Question: Find the names of employees who belongs to department 1.
from pyspark.sql.functions import col #filter according to column conditions df_dept=df.filter(col("Dept No") == 1) df_dept.show()
b) Dataframe Filter() with SQL Expression
We can also use SQL expressions to filter dataframe rows.
Question: Filter the employess with Salary more than 9000.
c) Filtering with multiple conditions
To filter rows based on multiple conditions, we can use the following conditional expressions to combine two or more statements:
Question:Filter out male employees with Department 2.
#Filter with Multiple condition df.filter((df["Gender"]=="M") & (df["Dept No"]==2)).show()
d) Filtering on an array column
If you want to filter rows from dataframe based on condition applied on the Array type column. Then, how will you apply these SQL expressions on array?
To resolve this, we will use array_contains() SQL function which returns True/False whether a particular value is present in the array or not. Let’s create an Array Type Dataframe.
from pyspark.sql.types import * #list row=[[["Rohit","Gupta"],["Python","C"],"M"],[["Shanu","Sisodiya"],["C++","Java","Python"],"F"],[["Parth","Sisodiya"],["C++","C#","Java"],"M"]] #Schema Schema=StructType().add("Name",StructType().add("firstname",StringType()).add("lastname",StringType())).add("Languages", ArrayType(StringType())).add("Gender", StringType()) #creating dataframe df = spark.createDataFrame( spark.sparkContext.parallelize(row), Schema ) # view Dataframe df.show() # view Schema df.printSchema()
Question:Filter out the Name of Students who study Python.
from pyspark.sql.functions import * #Filtering conditions df.filter(array_contains(df["Languages"],"Python")).show()
I’ve covered some common operations or ways to filter out rows from the dataframe. Practice them!! Frame your own questions and yeah one homework for you all.
Comment down the expression for Filtering out Name of Student whose Last name is “Sisodiya”.
Hint: You need to use the Nested Struct Column concept. Don’t hustle, it’s similar to what you saw in Array Type. Use “.” operator to refer to child columns.