In this article, I will walk you through commonly used dataframe column operations in PySpark. withColumn() is used to change the value of an existing column or to create a new column, while withColumnRenamed() and drop() are used to rename and drop columns.

Let’s create a dataframe first.

from pyspark.sql.types import StructType,StructField,StringType,IntegerType

# Create a schema for the dataframe
schema = StructType([
StructField('Name', StringType(), True),
StructField('Roll No', IntegerType(), True),
StructField('Fathers Name', StringType(), True),
StructField('Age', IntegerType(), True),
StructField('Marks', IntegerType(), True),
])

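# Convert list to RDD
# NOTE: `spark` is assumed to be an active SparkSession, and `data` (not defined
# in this snippet) a list of tuples whose fields match the schema above
rdd = spark.sparkContext.parallelize(data)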

# Create data frame
df = spark.createDataFrame(rdd,schema)

df.show()

df.printSchema()

Suppose you want to calculate each student's percentage from the existing “Marks” column, rename one of the columns, or create a new column. How would you do that? Let’s check out some ways:

##### a) Change the value of an existing column

We can update the value of an existing column in the dataframe using withColumn(). Pass the column name as the first argument and the value to be assigned as the second argument. The second argument must be a Column expression, not a plain Python value.

Question: Multiply every value in the “Marks” column by 10.

from pyspark.sql.functions import col
# Change the value of an existing column
df_value = df.withColumn("Marks", col("Marks") * 10)
# View dataframe
df_value.show()

##### b) Derive column from existing column

To create a new column from an existing one, pass the new column name as the first argument to withColumn() and, as the second argument, an expression built from the existing column.

Question: Add a new column “Percentage” to the dataframe by calculating the percentage of each student from the “Marks” column.

# Create a new column from an existing column
df_new = df.withColumn("Percentage", (col("Marks") * 100) / 1000)
# View dataframe
df_new.show()

##### c) Rename a Dataframe Column

To rename a column, use withColumnRenamed(). The first argument is the current name of the column and the second argument is the new name.

Question: Rename the “Roll No” column to “Enrollment No”.

# Rename a column
re_df = df.withColumnRenamed("Roll No", "Enrollment No")
# View dataframe
re_df.show()

##### d) Add a new column with constant value

To add a new column with a constant value, we pass the lit() function as the second argument to withColumn(). lit() wraps a constant in a Column, so every row gets the same value. The first argument is your desired column name and the second is lit() with the value to be assigned. We need to import the function first.

Question: Add a column named “College” to the dataframe with the value “MITRC”.

from pyspark.sql.functions import lit
# Add a column with a constant value
new_col = df.withColumn("College", lit("MITRC"))
new_col.show()

##### e) Drop a column

Use the drop() method to remove a specific column from the dataframe.

Question: Drop the column “Roll No”.

# Drop a column
drop_df = df.drop("Roll No")
drop_df.show()

NOTE: In all the operations performed above, the function returns a new dataframe instead of updating the existing dataframe.

Create your own dataframe, play with it, and apply various functions. If you run into any issues, leave a comment and we’ll get back to you. Till then, stay tuned and check out the other blogs too.

-Gargi Gupta
