In this article, I will walk you through commonly used dataframe column operations. Spark’s withColumn() is used to change the value of an existing column or to create a new one; together with withColumnRenamed() and drop(), it covers most everyday column manipulation.
Let’s create a dataframe first.
import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType, IntegerType}

// Rows of data
val rows = Seq(
  Row("Nikita", 65, "Mr.Pradeep", 19, 890),
  Row("Ayush", 22, "Mr.Gopal", 20, 780),
  Row("Parth", 27, "Mr.Bharat", 21, 865),
  Row("Ankit", 15, "Mr.Naresh", 20, 680)
)

// Schema
val schema = List(
  StructField("Name", StringType, false),
  StructField("Roll No", IntegerType, true),
  StructField("Father's Name", StringType, false),
  StructField("Age", IntegerType, false),
  StructField("Marks", IntegerType, false)
)

// Create the dataframe
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(rows),
  StructType(schema)
)

// View dataframe
df.show()

// View schema
df.printSchema()

Suppose you want to calculate each student’s percentage from the existing “Marks” column, rename one of the columns, or create a new column altogether. How would you do that? Let’s check out some ways:
a) Change the value of an existing column
We can update the value of an existing column in the dataframe using withColumn(). We pass the column name as the first argument and the value to be assigned (it must be a Column expression) as the second argument.
Question: Multiply each row value of the “Marks” column by 10.
// Change the value of an existing column
import org.apache.spark.sql.functions.col
val df_value = df.withColumn("Marks", col("Marks") * 10)

// View dataframe
df_value.show()

b) Derive new column from existing column
To create a new column from an existing one, pass the new column name as the first argument and an expression built from the existing column as the second argument.
Question: Add a new column “Percentage” to the dataframe by calculating the percentage of each student using the “Marks” column.
// Create a new column from an existing one
import org.apache.spark.sql.functions.col
val df_new = df.withColumn("Percentage", (col("Marks") * 100) / 1000)

// View dataframe
df_new.show()

c) Rename a Dataframe Column
To rename a column, use withColumnRenamed(). The existing column name is the first argument and the new name is the second argument.
Question: Rename the “Roll No” column to “Enrollment No”.
// Rename a column
val re_df = df.withColumnRenamed("Roll No", "Enrollment No")

// View dataframe
re_df.show()
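One detail worth knowing: if the column you pass to withColumnRenamed() does not exist, Spark does not raise an error; it simply returns the dataframe with its columns unchanged. A small sketch against the df defined above:

```scala
// Renaming a column that exists works as expected
val renamed = df.withColumnRenamed("Roll No", "Enrollment No")

// Renaming a column that does not exist is a silent no-op:
// the returned dataframe has exactly the same columns as df
val unchanged = df.withColumnRenamed("No Such Column", "Whatever")
```

This makes typos in the old column name easy to miss, so it is worth checking the result with printSchema().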

d) Add a new column with constant value
To add a new column with a constant value, use the lit() function. The first argument to withColumn() is the desired column name and the second is lit() with the value to be assigned. We need to import the function first.
Question: Add a column named “College” to the dataframe with the value “MITRC”.
// Add a new column with a constant value
import org.apache.spark.sql.functions.lit
val new_col = df.withColumn("College", lit("MITRC"))

// View dataframe
new_col.show()

e) Drop a column
Use the drop() function to remove a specific column from the dataframe.
Question: Drop the column “Roll No”.
// Drop a column
val drop_df = df.drop("Roll No")

// View dataframe
drop_df.show()
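drop() also accepts several column names at once, so multiple columns can be removed in a single call. A quick sketch against the df above:

```scala
// Drop more than one column in a single call
val slim_df = df.drop("Age", "Marks")

// slim_df keeps only Name, Roll No and Father's Name
slim_df.show()
```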

NOTE: In all the operations performed above, the function returns a new dataframe instead of modifying the existing one; dataframes in Spark are immutable.
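Because each call returns a new dataframe, these operations chain naturally. A minimal sketch combining the steps above (the order matters: “Percentage” is derived before “Marks” is multiplied):

```scala
import org.apache.spark.sql.functions.{col, lit}

// Chain several column operations; df itself is left untouched
val result = df
  .withColumn("Percentage", (col("Marks") * 100) / 1000) // derive new column first
  .withColumn("Marks", col("Marks") * 10)                // then update existing column
  .withColumn("College", lit("MITRC"))                   // constant column
  .withColumnRenamed("Roll No", "Enrollment No")         // rename
  .drop("Age")                                           // drop

result.show()
df.show() // the original dataframe still has its original columns and values
```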
Create your own dataframe and play with it! Apply various functions. If you encounter any issues, comment below and we’ll get back to you. Till then, stay tuned and check out our other blogs too.
-Gargi Gupta