In this post we will explore basic statistics of DataFrames and see how to convert an RDD to a DataFrame.

Q-1 How to read a CSV file, including headers, as a dataframe and check the schema of the dataframe?
Ans:

df_tips = spark.read.format("csv").option("header", True).load("/FileStore/tables/tips.csv")
df_tips.show()

# Print the schema (printSchema() prints directly, so no print() wrapper is needed)
df_tips.printSchema()

# Count the records
print(df_tips.count())

# Show the columns
print(df_tips.columns)

# Output

# Schema:
root
 |-- total_bill: string (nullable = true)
 |-- tip: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- smoker: string (nullable = true)
 |-- day: string (nullable = true)
 |-- time: string (nullable = true)
 |-- size: string (nullable = true)


# Number of records: 244

# Columns:
['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size']

We can see from the above output that, by default, all the columns are read as strings.
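Before defining the schema by hand (Q-2 below), note that Spark can also sample the file and guess the column types via the inferSchema read option. A minimal sketch, assuming the same file path as above (the variable name df_inferred is our own):

# Let Spark infer column types by scanning the data; this costs an extra
# pass over the file, so an explicit schema is usually faster on large inputs.
df_inferred = spark.read.format("csv").option("header", True).option("inferSchema", True).load("/FileStore/tables/tips.csv")
df_inferred.printSchema()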

Q-2 How to define the schema at the time of reading the dataframe and map the appropriate datatypes to the columns?

Ans:



from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType

data_schema = [
    StructField("total_bill", FloatType(), True),
    StructField("tip", FloatType(), True),
    StructField("sex", StringType(), True),
    StructField("smoker", StringType(), True),
    StructField("day", StringType(), True),
    StructField("time", StringType(), True),
    StructField("size", IntegerType(), True),
]

final_schema = StructType(fields=data_schema)

df_schema = spark.read.format("csv").schema(final_schema).option("header", True).load("/FileStore/tables/tips.csv")

# Show 5 records and check the schema now.
df_schema.show(5)
df_schema.printSchema()


#Output
+----------+----+------+------+---+------+----+
|total_bill| tip|   sex|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|
+----------+----+------+------+---+------+----+
only showing top 5 rows

# Schema
root
 |-- total_bill: float (nullable = true)
 |-- tip: float (nullable = true)
 |-- sex: string (nullable = true)
 |-- smoker: string (nullable = true)
 |-- day: string (nullable = true)
 |-- time: string (nullable = true)
 |-- size: integer (nullable = true)
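With proper numeric types in place, we can also look at the basic statistics mentioned in the introduction. A minimal sketch using the built-in describe() method (output omitted):

# Summary statistics (count, mean, stddev, min, max) for the numeric columns
df_schema.select('total_bill', 'tip', 'size').describe().show()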

Q-3 How to display a few columns from the dataframe and rearrange the order of the columns?

How to find the number of columns and rows in the dataframe?


Ans:


from pyspark.sql.functions import col

df_tips.select('total_bill', 'size').show(10)

# Or
df_tips.select(df_tips['total_bill'], df_tips['size']).show(10)

# Or
df_tips.select(col('total_bill'), col('size')).show(10)

# Rearrange the order of columns
df_tips.select(col('size'), col('total_bill'), col('tip'), col('sex'), col('smoker'), col('day'), col('time')).show(10)

# Number of columns and rows
print(len(df_tips.columns))  # 7
print(df_tips.count())  # 244
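When rearranging, you do not have to spell out every column: select() also accepts a plain Python list, so the new order can be built programmatically. A minimal sketch (the names cols and reordered are our own):

# Move 'size' to the front and keep the rest in their original order
cols = df_tips.columns
reordered = ['size'] + [c for c in cols if c != 'size']
df_tips.select(reordered).show(10)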

Q-4 How to convert Spark RDD to DataFrame?

Ans:


rdd = sc.parallelize([(10, 20, 30), (40, 50, 60), (70, 80, 90)])

# Method 1: toDF() with a list of column names
rdd.toDF(["col1", "col2", "col3"]).show()

# Method 2: an RDD of Row objects carries its own column names
from pyspark.sql import Row
rdd = sc.parallelize([Row(col1=10, col2=20, col3=30), Row(col1=40, col2=50, col3=60), Row(col1=70, col2=80, col3=90)])
rdd.toDF().show()

# Method 3: createDataFrame() with an explicit schema
from pyspark.sql.types import StructType, StructField, IntegerType
schema = StructType([StructField("col1", IntegerType()), StructField("col2", IntegerType()), StructField("col3", IntegerType())])
spark.createDataFrame(rdd, schema).show()
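For completeness, the reverse conversion is a one-liner: every DataFrame exposes its underlying RDD of Row objects through the .rdd attribute. A minimal sketch (df_from_rdd is our own name):

# DataFrame back to RDD: .rdd yields an RDD of Row objects
df_from_rdd = spark.createDataFrame(rdd, schema)
print(df_from_rdd.rdd.collect())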
