In this article, I will walk you through commonly used dataframe column operations. PySpark's withColumn() is used to change the value of an existing column or to create a new one; together with withColumnRenamed() and drop(), it covers the most common column operations.
Let’s create a dataframe first.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create a SparkSession (needed before using spark below)
spark = SparkSession.builder.appName("ColumnOperations").getOrCreate()

data = [("Nikita", 65, "Mr.Pradeep", 19, 890),
        ("Ayush", 22, "Mr.Gopal", 20, 780),
        ("Parth", 27, "Mr.Bharat", 21, 865),
        ("Ankit", 15, "Mr.Naresh", 20, 680)]

# Create a schema for the dataframe
schema = StructType([
    StructField('Name', StringType(), True),
    StructField('Roll No', IntegerType(), True),
    StructField('Fathers Name', StringType(), True),
    StructField('Age', IntegerType(), True),
    StructField('Marks', IntegerType(), True),
])

# Convert the list to an RDD
rdd = spark.sparkContext.parallelize(data)

# Create the dataframe
df = spark.createDataFrame(rdd, schema)
df.show()
df.printSchema()

Suppose you want to calculate each student's percentage from the existing "Marks" column, rename one of the columns, or create a new column. How would you do that? Let's check out some ways:
a) Change the value of an existing column
We can update the value of an existing column in the dataframe using withColumn(). We need to pass the column name as the first argument and the value to be assigned (which should be a Column expression) as the second argument.
Question: Multiply each row value of “Marks” column by 10.
from pyspark.sql.functions import col

# Change the value of an existing column
df_value = df.withColumn("Marks", col("Marks") * 10)

# View the dataframe
df_value.show()

b) Derive column from existing column
To create a new column from an existing one, pass the new column name as the first argument and a value computed from the existing column as the second argument.
Question: Add a new column “Percentage” to the dataframe by calculating the percentage of each student using “Marks” column.
# Create a new column from an existing column
df_new = df.withColumn("Percentage", (col("Marks") * 100) / 1000)

# View the dataframe
df_new.show()

c) Rename a Dataframe Column
To rename a column, withColumnRenamed() is used. Pass the current name of the column as the first argument and the new name as the second argument.
Question: Rename “Roll No” column to “Enrollment No”.
# Rename a column
re_df = df.withColumnRenamed("Roll No", "Enrollment No")

# View the dataframe
re_df.show()

d) Add a new column with constant value
To add a new column with a constant value, we use the lit() function. The first argument to withColumn() is your desired column name and the second is lit() with the value to be assigned. We need to import the function first.
Question: Add a column named “College” to the dataframe with the value “MITRC”.
from pyspark.sql.functions import lit

# Add a new column with a constant value
new_col = df.withColumn("College", lit("MITRC"))
new_col.show()

e) Drop a column
Use the drop() function to remove any specific column from the dataframe.
Question: Drop the column “Roll No”.
# Drop a column
drop_df = df.drop("Roll No")
drop_df.show()

NOTE: In all the operations performed above, the function returns a new dataframe instead of updating the existing dataframe.
Create your own dataframe and play with it! Apply various functions. If you encounter any issues, comment below and we'll get back to you. Till then, stay tuned and check out our other blogs too.
-Gargi Gupta