Hey there!!
In today’s article, we’ll learn how to cast DataFrame columns to the data types we need.
You may already know that data type conversion is an important step when transforming a DataFrame. Say we want to add a number to a DataFrame column, but the column’s data type is String.
Can we perform the addition now? The answer is no.
The column must have a numeric data type, such as Integer, before any mathematical operation. So we have to convert the column’s data type to Integer.
Now the question arises: how do we convert the data type of a column?
1. Change Column type using StructType
When an external file is read as a DataFrame, every column’s data type is String by default. We will create a list of StructField objects and use StructType to set the data type of each DataFrame column.
from pyspark.sql.types import *

# Define the schema as a list of StructField(name, type, nullable)
schema = [
    StructField("Name", StringType(), False),
    StructField("Roll No", IntegerType(), True),
    StructField("Father's Name", StringType(), False),
    StructField("Age", IntegerType(), False),
    StructField("Marks", IntegerType(), False),
]
StructType(schema)
Let’s write some code.
Question: Prepare the DataFrame and convert the data type of each column to a suitable type using StructField and StructType.
from pyspark.sql.types import *

# Create collection sequence
col = ["Name", "Roll No", "Father's Name", "Age", "Marks"]
row = [("Nikita", 65, "Mr.Pradeep", 19, 890),
       ("Ayush", 22, "Mr.Gopal", 20, 780),
       ("Parth", 27, "Mr.Bharat", 21, 865),
       ("Ankit", 15, "Mr.Naresh", 20, 680)]

# Create the DataFrame using the schema defined above
df = spark.createDataFrame(
    spark.sparkContext.parallelize(row),
    StructType(schema)
)

# View table
display(df)

# View schema
df.printSchema()


2. Change Column type using cast
By using withColumn on a DataFrame, we can convert the data type of any column. The function takes the column name together with a cast call that specifies the new type.
Question: Convert the data type of the “Age” column from Integer to String.
First, check the data type of the “Age” column.
df.select("Age").dtypes

Below is the code to change the datatype:
df_datatype = df.withColumn("Age", df["Age"].cast("String"))
df_datatype.printSchema()

3. Change Column type using selectExpr
Question: Convert the data types of the “Age” and “Marks” columns to Double and Float, respectively.
df_new = df.selectExpr("cast(Age as Double) Age", "cast(Marks as Float) Marks")
df_new.printSchema()

4. Change Column type using SQL Expression
We’ll create a temporary “view” of our DataFrame using createOrReplaceTempView. The view is scoped to the current Spark session and disappears when the session ends.
Let’s pick a couple of columns and convert their data types with a SQL expression.
df.createOrReplaceTempView("Table")
df_sql = spark.sql("SELECT STRING(Age), FLOAT(Marks) FROM Table")
df_sql.printSchema()

That’s it for today. I hope you learned something new. We would love to hear your answers or any questions, and feel free to share your own way of changing a column’s data type.
-Gargi Gupta