Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.

-Databricks

Let’s look at some ways to create Spark DataFrames using Scala, split into Part 1 and Part 2.

Part1:

  • Create Spark DataFrame using an RDD
  • Create Spark DataFrame using a List/Seq
  • Create Spark DataFrame using a CSV file
  • Create Spark DataFrame using a TXT file
  • Create Spark DataFrame using a JSON file
  • Create Spark DataFrame using a Parquet file
  • Create Spark DataFrame using an XML file

Part2:

  • Create Spark DataFrame using a Hive table
  • Create Spark DataFrame using a MySQL table
  • Create Spark DataFrame using an S3 bucket
  • Create Spark DataFrame using HBase
  • Create Spark DataFrame using Redshift
  • Create Spark DataFrame using Ignite

1. CREATE SPARK DATAFRAME USING RDD

First, create an RDD. We define a sequence of rows and then create the RDD by calling the parallelize() function on sparkContext.

//Create Collection Sequence
val col = Seq("Name", "Roll No.", "Father's Name", "Age", "Marks")
val row = Seq(
  ("Nikita", 65, "Mr.Pradeep", 19, 890),
  ("Ayush", 22, "Mr.Gopal", 20, 780),
  ("Parth", 27, "Mr.Bharat", 21, 865),
  ("Nanki", 60, "Mr.Naresh", 20, 680))

// Creating RDD using Sequence
val rdd=spark.sparkContext.parallelize(row)

// Get number of partitions of RDD
rdd.getNumPartitions

In Scala, val declares an immutable value: once assigned, it cannot be reassigned.

The parallelize() method distributes a local Scala collection across the cluster to form an RDD.
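parallelize() also accepts an optional second argument, the number of partitions to split the data into. A small sketch (assuming the same spark session and row sequence as above):

```scala
// Distribute the rows across 4 partitions explicitly;
// the second argument to parallelize() is the number of slices
val rdd4 = spark.sparkContext.parallelize(row, 4)

// Should now report 4 rather than the cluster default
rdd4.getNumPartitions
```

Without the second argument, Spark picks a default based on the cluster configuration, which is what getNumPartitions above reports.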

a) toDF()

We have to import the spark implicits to access the toDF() method. It converts Scala objects (Seq, RDD) into DataFrames and Datasets.

i) Undefined columns

import spark.implicits._

// Creating DataFrame
val df=rdd.toDF()

//View DataFrame
df.show()

// View Schema
df.printSchema()

ii) Defined Columns

import spark.implicits._

// Creating DataFrame
val df=rdd.toDF(col:_*)

// View DataFrame
df.show()

// View Schema
df.printSchema()

iii) Spark CreateDataFrame

Using createDataFrame() from SparkSession is another way to create a DataFrame; it takes the RDD object as an argument. Chain it with toDF() to assign names to the columns.

// Creating DataFrame
val df=spark.createDataFrame(rdd).toDF(col:_*)

// View DataFrame
df.show()

toDF() is somewhat limited: it doesn’t let us define our own schema or nullable flags.

b) spark.createDataFrame()

While defining a schema, we provide each column’s name, its datatype, and a nullable flag value.
We can do this in two ways:

//importing 
import spark.implicits._
import org.apache.spark.sql.types.{StructField, StructType, StringType, IntegerType}

//sequence
val data=Seq(Row("Ankit",79),Row("Shanu",90),Row("Krishna",89))

//Schema
val schema= List(StructField("Your_Name", StringType, false),
  StructField("Your_Marks", IntegerType, true))

//Another way to define the schema, directly as a StructType
val schema2 = new StructType().add("Your_Name", StringType).add("Your_Marks", IntegerType)

//creating dataframe
val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)

// view Dataframe
someDF.show()

// view Schema
someDF.printSchema()

2. USING toDF() ON LIST/SEQ

To use toDF(), import spark implicits.

//create dataframe using Sequence directly
import spark.implicits._
val row=Seq(("Sakhi","HPS"),("Madhu","APS"))
val col=Seq("Name","School")

// create dataframe
val data=row.toDF(col:_*)

//Using createDataFrame
val data=spark.createDataFrame(row).toDF(col:_*) 

//view dataframe
data.show()
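A related convenience (a sketch, assuming the spark implicits are in scope as above): when the Seq holds case class instances, toDF() picks up the column names from the case class fields, so no column list is needed.

```scala
import spark.implicits._

// Field names become the column names automatically
case class Student(Name: String, School: String)

val students = Seq(Student("Sakhi", "HPS"), Student("Madhu", "APS")).toDF()

// printSchema() should show the columns Name and School
students.show()
students.printSchema()
```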

3. CREATE SPARK DATAFRAME FROM CSV

We can also create a DataFrame from a CSV file.

You can download the CSV file from the given link: drinks.csv

i) Reading the CSV File

val fromcsv=spark.read.csv("hdfs://localhost:9000/tables/drinks.csv")

//View dataframe
fromcsv.show()

This doesn’t treat the first row as a header and infers no schema: every column is read as StringType.

ii) Reading with Schema

val fromcsv = spark.read.options(Map("inferSchema"->"true","header"->"true"))
.csv("hdfs://localhost:9000/tables/drinks.csv")

//View dataframe
fromcsv.show()

When inferSchema is set to “true”, Spark scans the data and sets the datatype of each column automatically.
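If inferring the schema is too slow on a large file, or the inferred types are wrong, a schema can also be supplied explicitly with .schema(). A sketch (the column names here are illustrative assumptions, not necessarily the actual columns of drinks.csv):

```scala
import org.apache.spark.sql.types.{StructType, StringType, IntegerType}

// Hypothetical columns, for illustration only
val drinksSchema = new StructType()
  .add("country", StringType)
  .add("beer_servings", IntegerType)

val fromcsv = spark.read
  .option("header", "true")
  .schema(drinksSchema)  // skips the extra pass over the data that inferSchema needs
  .csv("hdfs://localhost:9000/tables/drinks.csv")
```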

4. CREATE SPARK DATAFRAME FROM TXT

You can download the TXT file from the given link: chipotle.txt

//create dataframe from text file
val fromtxt=spark.read.text("hdfs://localhost:9000/tables/chipotle.txt")

//View Dataframe
fromtxt.show()
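spark.read.text() loads every line into a single string column named value. If the file is actually delimited (chipotle.txt is commonly distributed as tab-separated, though that’s an assumption here), it can instead be read through the csv reader with a custom separator to get proper columns:

```scala
// A tab-separated file can be read directly into named columns
// by using the csv reader with the "sep" option
val fromtsv = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .csv("hdfs://localhost:9000/tables/chipotle.txt")

fromtsv.printSchema()
```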

5. CREATE SPARK DATAFRAME FROM JSON FILE

You can download the JSON file from the given link: employee.json

// Creating Dataframe from json file
val df=spark.read.json("hdfs://localhost:9000/tables/employee.json")

// View Dataframe
df.show()

// View Schema
df.printSchema()
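By default, spark.read.json() expects one JSON object per line (JSON Lines format). If employee.json were instead a single pretty-printed object or array spanning multiple lines, the multiLine option would be needed; a hedged sketch:

```scala
// Only needed when each JSON record spans multiple lines
val dfMulti = spark.read
  .option("multiLine", "true")
  .json("hdfs://localhost:9000/tables/employee.json")
```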

6. CREATE SPARK DATAFRAME FROM PARQUET FILE

Apache Parquet is a columnar file format that provides optimizations to speed up queries. It is compatible with most data processing frameworks in the Hadoop ecosystem, and it is more efficient in terms of storage and performance than CSV or JSON.

You can download the PARQUET file from the given link: users.parquet

// Creating dataframe from parquet file
val parqDF = spark.read.parquet("hdfs://localhost:9000/tables/users.parquet")

// View dataframe
parqDF.show()

//View Schema
parqDF.printSchema()
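The reverse direction works the same way: any DataFrame can be written back out as Parquet. A sketch (the output path is an assumption):

```scala
// Write the DataFrame as Parquet; "overwrite" replaces any existing output
parqDF.write
  .mode("overwrite")
  .parquet("hdfs://localhost:9000/tables/users_copy.parquet")
```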

7. CREATE SPARK DATAFRAME FROM XML FILE

We can process XML files in Apache Spark by adding the spark-xml dependency from Databricks.

You can download the XML file from the given link: food.xml

// Dependency
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.12</artifactId>
    <version>0.6.0</version>
</dependency>

//Code
import org.apache.spark.sql.{ DataFrame, SparkSession }

object SparkXML {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .master("local")
      .appName("Spark XML Example")
      .getOrCreate()

    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rootTag", "breakfast_menu")
      .option("rowTag", "food")
      .load("hdfs://localhost:9000/food.xml")

    df.show()
  }
}

-Gargi Gupta


