Hey there!!

As you may know, data type conversion is an important step when transforming a DataFrame. Let's say we would like to add a number to a DataFrame column whose data type is String.

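Can we rely on that addition now? The answer is no.

Mathematical operations need a numeric column type such as Integer. Depending on its version and settings, Spark may try to coerce the String values implicitly, but the result type may not be the one we want, so it is safer to convert the column to Integer explicitly.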

Now the question arises: how do we convert the data type of a column?

One Way: Using StructType

When an external file such as a CSV is read as a DataFrame without a schema, every column is of type String by default. We can create a list of StructField entries and wrap it in a StructType to set the data type of each DataFrame column.


val schema = List(
  StructField("Name", StringType, false), StructField("Roll No", IntegerType, true),
  StructField("Father's Name", StringType, false), StructField("Age", IntegerType, false),
  StructField("Marks", IntegerType, false))

StructType(schema)
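To see the String default and the effect of a schema side by side, here is a minimal sketch. It assumes a local SparkSession, and the CSV lines are hypothetical sample data held in memory in place of an external file; with a real file you would pass its path to csv(...) instead. The names csvLines, rawDf and typedDf are illustrative only.

```scala
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Local session for this sketch; in spark-shell a session named `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("schema-on-read").getOrCreate()

// Hypothetical CSV content, kept in memory to stand in for an external file
val csvLines = spark.createDataset(Seq(
  "Name,Age,Marks",
  "Nikita,19,890",
  "Ayush,20,780"
))(Encoders.STRING)

// Without a schema, every column is read as a string
val rawDf = spark.read.option("header", "true").csv(csvLines)
rawDf.printSchema()

// With a StructType, the columns get the declared types
val schema = StructType(List(
  StructField("Name", StringType, true),
  StructField("Age", IntegerType, true),
  StructField("Marks", IntegerType, true)
))
val typedDf = spark.read.option("header", "true").schema(schema).csv(csvLines)
typedDf.printSchema()
```

Comparing the two printSchema() outputs shows which columns changed type once the schema was supplied.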

Let’s create a dataframe to work with.

Question: Prepare a DataFrame and give each column a suitable data type using StructField and StructType.


// Imports
import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType, IntegerType, FloatType, DoubleType}

// Column names, listed for reference (the schema below is what names and types the columns)
val columns = Seq("Name", "Roll No", "Father's Name", "Age", "Marks")

// Row data
val row = Seq(
  Row("Nikita", 65, "Mr.Pradeep", 19, 890),
  Row("Ayush", 22, "Mr.Gopal", 20, 780),
  Row("Parth", 27, "Mr.Bharat", 21, 865),
  Row("Ankit", 15, "Mr.Naresh", 20, 680)
)

// Schema: column name, data type, nullable
val schema = List(
  StructField("Name", StringType, false),
  StructField("Roll No", IntegerType, true),
  StructField("Father's Name", StringType, false),
  StructField("Age", IntegerType, false),
  StructField("Marks", IntegerType, false)
)

// Creating dataframe
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(row),
  StructType(schema)
)

// View Dataframe
df.show()

// View Schema
df.printSchema()
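With the columns typed as Integer, the addition from the introduction becomes straightforward. The following is a small sketch under a few assumptions: it uses a local SparkSession, only two of the sample rows reduced to Name and Marks, and a bonus of 10 marks that is purely an illustrative value.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Local session for this sketch; in spark-shell a session named `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("typed-arithmetic").getOrCreate()

// Two of the sample rows, reduced to Name and Marks
val rows = Seq(Row("Nikita", 890), Row("Ayush", 780))
val schema = StructType(List(
  StructField("Name", StringType, false),
  StructField("Marks", IntegerType, false)
))
val marksDf = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

// Marks is an Integer column, so adding a number is plain integer arithmetic.
// The bonus of 10 is an illustrative value, not part of the original data.
val bonusDf = marksDf.withColumn("Marks", col("Marks") + 10)
bonusDf.printSchema()
bonusDf.show()
```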

Another Way: Column DataType Conversion

By using withColumn on a DataFrame, we can convert the data type of any column. The function takes a column name and a column expression, and applying cast to that expression changes the type. We need to import the "col" function to refer to the column. With spark.implicits._ imported, "$" can also be used to refer to a column of the DataFrame.

Question: Convert the data type of the "Age" column from Integer to String.


import org.apache.spark.sql.functions.col

// Change the data type of a column using col
val df_datatype = df.withColumn("Age", col("Age").cast("string"))

// Another way: refer to the column with $ (requires import spark.implicits._)
val df_datatype2 = df.withColumn("Age", $"Age".cast("string"))

// View Schema
df_datatype.printSchema()
df_datatype2.printSchema()

Question: Convert the datatype of the “Marks” column from Integer to Float.

// Change the data type of a column ($ requires import spark.implicits._)
val df_marks = df.withColumn("Marks", $"Marks".cast("float"))

// View Schema
df_marks.printSchema()
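The cast function also accepts a DataType object in place of a type name string, and withColumn calls can be chained to convert several columns in one go. Here is a minimal sketch, assuming a local SparkSession and two of the sample rows; the names sampleDf and convertedDf are illustrative only.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{FloatType, IntegerType, StringType, StructField, StructType}

// Local session for this sketch; in spark-shell a session named `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("cast-with-datatype").getOrCreate()

// Two of the sample rows, reduced to Name, Age and Marks
val rows = Seq(Row("Nikita", 19, 890), Row("Ayush", 20, 780))
val schema = StructType(List(
  StructField("Name", StringType, false),
  StructField("Age", IntegerType, false),
  StructField("Marks", IntegerType, false)
))
val sampleDf = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

// Pass DataType objects to cast and chain the conversions
val convertedDf = sampleDf
  .withColumn("Age", col("Age").cast(StringType))
  .withColumn("Marks", col("Marks").cast(FloatType))
convertedDf.printSchema()
```

Using the DataType objects lets the compiler catch a misspelled type name, which a string argument would only reveal at run time.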

I hope this makes it clear how to change the data type of a column. Here is a task for you: in the comments, tell us how you would convert the data type of Roll No from Integer to Double.

We would love to hear your answers and any questions. You can also share your own way of changing a column's data type.

-Gargi Gupta


1 Comment

Nikita Singhal · March 28, 2020 at 6:56 am

Good work. Really helpful😀
