Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.

- Databricks

Let’s catch up on some ways to create Spark DataFrames using Python.



1. CREATE PYSPARK DATAFRAME USING RDD

The easiest way to create a PySpark DataFrame is to start from an RDD. First, let's create one: we define a sequence and then build an RDD by calling the parallelize() function on sparkContext.

#Create collection sequence
col = ["Name", "Roll No.", "Father's Name", "Age", "Marks"]
row = [("Nikita", 65, "Mr.Pradeep", 19, 890), ("Ayush", 22, "Mr.Gopal", 20, 780), ("Parth", 27, "Mr.Bharat", 21, 865), ("Nanki", 60, "Mr.Naresh", 20, 680)]

# Creating RDD using Sequence
rdd = spark.sparkContext.parallelize(row)

a) toDF()

After creating the RDD, let's use toDF() to form the PySpark DataFrame. By default, it names the columns "_1", "_2", and so on.

You can use display() (available in Databricks notebooks) to view the DataFrame as a table.

i) Undefined columns
#Creating DataFrame
df=rdd.toDF()

#View DataFrame
df.show()

#View Schema
df.printSchema()

display(df)

ii) Defined columns
# Creating DataFrame
df = rdd.toDF(col)

# View DataFrame
df.show()

iii) PySpark createDataFrame()

Using createDataFrame() from SparkSession is another way to create a DataFrame. It takes the RDD object as an argument and can be chained with toDF() to assign names to the columns.

#Creating DataFrame
df = spark.createDataFrame(rdd).toDF(*col)

# View DataFrame
df.show()
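Alternatively, you can pass the column names straight to createDataFrame() instead of chaining toDF(); a small sketch reusing the same rdd and col:

# Passing column names directly to createDataFrame
df = spark.createDataFrame(rdd, col)
df.show()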

toDF() is somewhat limited, as we can't define our own schema or nullable flags. spark.createDataFrame() fixes that.

b) spark.createDataFrame()

While defining a schema, we provide each column's name, its datatype, and a nullable flag value. Let's write some code:

#Importing the schema types
from pyspark.sql.types import *

#Sequence
data = [("Ankit", 79), ("Shanu", 90), ("Krishna", 89)]

#Schema: each field takes a name, a datatype, and a nullable flag
schema = [StructField("Your_Name", StringType(), False),
          StructField("Your_Marks", IntegerType(), True)]

#Creating dataframe
someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)

#View dataframe
someDF.show()

#View schema
someDF.printSchema()
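If spelling out StructType feels verbose, Spark (2.3+) also accepts a DDL-formatted schema string in its place. A minimal sketch reusing the data list above:

#Creating dataframe with a DDL-style schema string
ddlDF = spark.createDataFrame(data, "Your_Name string, Your_Marks int")

#View schema
ddlDF.printSchema()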

2. CREATE PYSPARK DATAFRAME USING LIST

#Create dataframe using a sequence directly
row = [("Sakhi", "HPS"), ("Madhu", "APS")]
col = ["Name", "School"]

#Using createDataFrame
data = spark.createDataFrame(row, col)

#View dataframe
data.show()
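A closely related option is to build the DataFrame from a list of Row objects, so each value is labelled with its column name; a small sketch:

from pyspark.sql import Row

#Each Row carries its own column names
rows = [Row(Name="Sakhi", School="HPS"), Row(Name="Madhu", School="APS")]
rowDF = spark.createDataFrame(rows)
rowDF.show()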

3. CREATE SPARK DATAFRAME FROM CSV

We can also create a DataFrame from a CSV file.

You can download the CSV file from the given link: drinks.csv

i) Reading the CSV File
fromcsv=spark.read.csv("hdfs://localhost:9000/tables/drinks.csv")

#View dataframe
fromcsv.show()

Read this way, the file's header row is not used for column names and no schema is inferred: every column is treated as StringType.
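You can confirm this by printing the schema; every column shows up as string:

#View schema - every column defaults to StringType
fromcsv.printSchema()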

ii) Reading with Schema
fromcsv = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("hdfs://localhost:9000/tables/drinks.csv")

#View dataframe
fromcsv.show()

When inferSchema is set to "true", Spark reads through the data and automatically infers the datatype of each column.
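That inference costs an extra pass over the data. If you already know the types, you can supply an explicit schema with schema() instead; in the sketch below, the column names and types are assumptions about drinks.csv, so adjust them to match your actual file:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

#Explicit schema (field names/types here are assumptions about drinks.csv)
drinksSchema = StructType([
    StructField("country", StringType(), True),
    StructField("beer_servings", IntegerType(), True)
])

fromcsv = spark.read.format("csv") \
    .option("header", "true") \
    .schema(drinksSchema) \
    .load("hdfs://localhost:9000/tables/drinks.csv")

fromcsv.printSchema()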

4. CREATE PYSPARK DATAFRAME FROM TXT

You can download the TXT file from the given link: chipotle.txt

#create dataframe from text file
fromtxt=spark.read.text("hdfs://localhost:9000/tables/chipotle.txt")

#View Dataframe
fromtxt.show()
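spark.read.text() always produces a single string column named value. If the file is delimited (the chipotle dataset is commonly tab-separated, but check your copy), you can split that column into fields yourself; a sketch:

from pyspark.sql.functions import split

#Split each line on tabs into an array column (assumes tab-delimited data)
parts = fromtxt.select(split(fromtxt.value, "\t").alias("cols"))
parts.show(truncate=False)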

5. CREATE SPARK DATAFRAME FROM JSON FILE

You can download the JSON file from the given link: employee.json

#Creating Dataframe from JSON file
df=spark.read.json("hdfs://localhost:9000/tables/employee.json")

# View Dataframe
df.show()

#View Schema
df.printSchema()
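By default, spark.read.json() expects one JSON object per line. If your file is instead a single pretty-printed object or array spanning multiple lines, add the multiLine option; a sketch:

#Reading a multi-line (pretty-printed) JSON file
multiDF = spark.read.option("multiLine", "true").json("hdfs://localhost:9000/tables/employee.json")
multiDF.show()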

6. CREATE SPARK DATAFRAME FROM PARQUET FILE

Apache Parquet is a columnar file format that provides optimizations to speed up queries. It is compatible with many data processing frameworks in the Hadoop ecosystem, and it is more efficient in terms of storage and performance than CSV and JSON.

You can download the PARQUET file from the given link: users.parquet

#Creating dataframe from parquet file
parqDF=spark.read.parquet("hdfs://localhost:9000/tables/users.parquet")

# View dataframe
parqDF.show()

#View Schema
parqDF.printSchema()
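Writing works the same way in reverse: any DataFrame can be saved as Parquet (the output path below is just an example):

#Writing the dataframe back out as parquet
parqDF.write.mode("overwrite").parquet("hdfs://localhost:9000/tables/users_copy.parquet")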

Now, write some code on your own 🙂

Thank you so much for reading. Let me know in the comment section what you think. Here is a task for you:

How will you create an empty DataFrame? Comment below and let us know what you got.

Happy Sparking!!

-Gargi Gupta

