In this article, I will walk you through commonly used dataframe column operations. Spark's withColumn() is used to change the value of an existing column or to derive a new column, while withColumnRenamed() renames a column and drop() removes one.

Let’s create a dataframe first.

import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType, IntegerType}

// Column names (the schema below defines them for the dataframe)
val columns = Seq("Name", "Roll No", "Father's Name", "Age", "Marks")

// Create Collection Sequence of Rows
val rows = Seq(
  Row("Nikita", 65, "Mr.Pradeep", 19, 890),
  Row("Ayush", 22, "Mr.Gopal", 20, 780),
  Row("Parth", 27, "Mr.Bharat", 21, 865),
  Row("Ankit", 15, "Mr.Naresh", 20, 680)
)

// Schema
val schema = List(
  StructField("Name", StringType, false),
  StructField("Roll No", IntegerType, true),
  StructField("Father's Name", StringType, false),
  StructField("Age", IntegerType, false),
  StructField("Marks", IntegerType, false)
)

// Creating dataframe
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(rows),
  StructType(schema)
)

// View Dataframe
df.show()

// View Schema
df.printSchema()

Suppose you want to calculate each student's percentage using the existing "Marks" column, rename one of the columns, or create a new column. How would you do that? Let's check out some ways:

a) Change the value of an existing column

We can update the value of an existing column in the dataframe using withColumn(). We need to pass the column name as the first argument and the value to be assigned (which must be a Column expression) as the second argument.

Question: Multiply each row value of the "Marks" column by 10.

// change value of existing column
import org.apache.spark.sql.functions.{col}
val df_value = df.withColumn("Marks",col("Marks")*10)

//View Dataframe
df_value.show()
b) Derive new column from existing column

To create a new column from an existing one, pass the new column name as the first argument and an expression built from the existing column as the second argument.

Question: Add a new column "Percentage" to the dataframe by calculating each student's percentage from the "Marks" column (marks are out of 1000).

// create new column from existing column
import org.apache.spark.sql.functions.{col}
val df_new=df.withColumn("Percentage",(col("Marks")* 100)/1000)

// View Dataframe
df_new.show()
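
If you want the percentage as a neatly rounded number, the same withColumn() call can be combined with the built-in round() function. Below is a minimal sketch, again assuming the marks are out of 1000; the variable name and the two-decimal scale are just illustrative choices.

// round the percentage to two decimal places
import org.apache.spark.sql.functions.{col, round}
val df_round = df.withColumn("Percentage", round(col("Marks") * 100 / 1000, 2))

// View Dataframe
df_round.show()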
c) Rename a Dataframe Column

To rename a column, withColumnRenamed() is used. It takes the existing column name as the first argument and the new name as the second argument.

Question: Rename the "Roll No" column to "Enrollment No".

// rename a column
val re_df = df.withColumnRenamed("Roll No", "Enrollment No")
// View Dataframe
re_df.show()
d) Add a new column with constant value

To add a new column with a constant value, use withColumn() together with the lit() function. The first argument is the desired column name and the second is lit() with the value to be assigned. We need to import lit() first.

Question: Add a column named "College" to the dataframe with the value "MITRC".

// Add a new column
import org.apache.spark.sql.functions.{lit}
val new_col=df.withColumn("College", lit("MITRC"))
// View dataframe
new_col.show()
e) Drop a column

Use the drop() function to remove any specific column from the dataframe.

Question: Drop the column "Roll No".

// drop a column
val drop_df=df.drop("Roll No")
// View Dataframe
drop_df.show()
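
drop() also accepts several column names at once, so multiple columns can be removed in a single call. A small sketch (the choice of columns here is only for illustration):

// drop more than one column in a single call
val drop_many = df.drop("Roll No", "Age")

// View Dataframe
drop_many.show()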

NOTE: In all the operations performed above, the function returns a new dataframe instead of updating the existing one, so assign the result to a new value (as done above) to keep the changes.
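
As a quick illustration of that note, here is a minimal sketch: the original df keeps all its columns, and because every call returns a new dataframe, the operations can also be chained into a single expression (the variable name is just illustrative).

// df is untouched by the operations above
df.show()

// chain several operations; each call returns a new dataframe
import org.apache.spark.sql.functions.{col, lit}
val df_chain = df
  .withColumn("Percentage", col("Marks") * 100 / 1000)
  .withColumn("College", lit("MITRC"))
  .withColumnRenamed("Roll No", "Enrollment No")

// View Dataframe
df_chain.show()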

Create your own dataframe. Play with it!! Apply various functions. If you encounter any issues, leave a comment below. We'll get back to you. Till then, stay tuned and check out other blogs too.

-Gargi Gupta

