In this article, I will walk you through commonly used dataframe column operations in PySpark. withColumn() is used to change the value of an existing column or to create a new column, withColumnRenamed() renames a column, and drop() removes one.

Let’s create a dataframe first.

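from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,StringType,IntegerType

# Reuse the active session (e.g. in a notebook or the pyspark shell) or start a new one
spark = SparkSession.builder.getOrCreate()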

data = [("Nikita",65,"Mr.Pradeep",19,890),("Ayush",22,"Mr.Gopal",20,780),("Parth",27,"Mr.Bharat",21,865),("Ankit",15,"Mr.Naresh",20,680)]


# Create a schema for the dataframe
schema = StructType([
    StructField('Name', StringType(), True),
    StructField('Roll No', IntegerType(), True),
    StructField('Fathers Name', StringType(), True),
    StructField('Age', IntegerType(), True),
    StructField('Marks', IntegerType(), True),
])

# Convert list to RDD
rdd = spark.sparkContext.parallelize(data)

# Create data frame
df = spark.createDataFrame(rdd,schema)

df.show()

df.printSchema()

Suppose you want to calculate each student’s percentage from the existing “Marks” column, rename a column, or create a new one. How would you do that? Let’s check out some ways:

a) Change the value of an existing column

We can update the value of an existing column in the dataframe using withColumn(). Pass the column name as the first argument and the new value, which must be a Column expression, as the second argument.

Question: Multiply every value in the “Marks” column by 10.

from pyspark.sql.functions import col
# Change the value of an existing column
df_value = df.withColumn("Marks", col("Marks") * 10)
# View the dataframe
df_value.show()
b) Derive column from existing column

To create a new column from an existing one, pass the new column name as the first argument and an expression built from the existing column as the second argument.

Question: Add a new column “Percentage” to the dataframe by calculating the percentage of each student from the “Marks” column.

# Create a new column from an existing column
# Percentage = Marks * 100 / 1000, i.e. marks are treated as out of 1000
df_new = df.withColumn("Percentage", (col("Marks") * 100) / 1000)
# View the dataframe
df_new.show()
c) Rename a Dataframe Column

To rename a column, use withColumnRenamed(). The first argument is the current name of the column and the second argument is the new name.

Question: Rename the “Roll No” column to “Enrollment No”.

#rename a column
re_df=df.withColumnRenamed("Roll No","Enrollment No")
#View Dataframe
re_df.show()
d) Add a new column with constant value

To add a new column with a constant value, wrap the value in lit(), which turns it into a Column, and pass it to withColumn(). The first argument is your desired column name and the second is the lit() function with the value to be assigned. We need to import the function first.

Question: Add a column named “College” to the dataframe with the value “MITRC”.

from pyspark.sql.functions import lit
#Add a new column
new_col=df.withColumn("College", lit("MITRC"))
new_col.show()
e) Drop a column

Use the drop() function to remove a specific column from the dataframe.

Question: Drop the column “Roll No”.

#drop a column
drop_df=df.drop("Roll No")
drop_df.show()

NOTE: In all the operations performed above, the function returns a new dataframe instead of updating the existing dataframe, so assign the result to a variable if you want to keep it.

Create your own dataframe and play with it! Apply various functions. If you run into any issues, leave a comment and we’ll get back to you. Till then, stay tuned and check out the other blogs too.

-Gargi Gupta

