Tutorial-1 PySpark Understand the DataFrames

Here we discuss how to explore the statistics of DataFrames and how to convert an RDD to a DataFrame. Q-1 How do you read a CSV file, including its header row, as a DataFrame and check the schema of that DataFrame? Ans:
df_tips = spark.read.format("csv").option("header", True).load("/FileStore/tables/tips.csv")
df_tips.show()
# print the schema (printSchema() prints directly and returns None, so no print() is needed)
df_tips.printSchema()
#Count the Read more…

Tutorial-5 PySpark RDD Union,Intersect,Subtract

In this article, we discuss the union, distinct, intersect, and subtract transformations. Union: merges two or more RDDs. rdd1.union(rdd2) returns an RDD that contains the data from both sources. If the input RDDs contain duplicate elements, the RDD produced by the union also contains those duplicates. Q-1 We have one dataset students_marks.csv Read more…
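
The behaviour of these transformations can be sketched in plain Python without a Spark session. This is only a model of the semantics, not Spark code: the two lists are hypothetical sample data standing in for rdd1 and rdd2, and the corresponding RDD method is named in each comment. Spark does not guarantee element order, so the ordering shown here is an artifact of using lists.

```python
# Plain-Python model of the RDD set-like transformations (not Spark code).
# rdd1_data and rdd2_data are hypothetical sample data.
rdd1_data = [1, 2, 2, 3]
rdd2_data = [3, 4, 4]

# rdd1.union(rdd2): simple concatenation, duplicates are kept.
union = rdd1_data + rdd2_data

# rdd.distinct(): one copy of each element.
distinct = sorted(set(union))

# rdd1.intersection(rdd2): elements present in both inputs, without duplicates.
intersection = sorted(set(rdd1_data) & set(rdd2_data))

# rdd1.subtract(rdd2): elements of rdd1 that do not appear in rdd2.
rdd2_values = set(rdd2_data)
subtract = [x for x in rdd1_data if x not in rdd2_values]

print("union:       ", union)
print("distinct:    ", distinct)
print("intersection:", intersection)
print("subtract:    ", subtract)
```

Note that only union keeps duplicates coming from both sides; applying distinct to its result gives the usual set union.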

Tutorial-4 PySpark RDD Joins

In this article, we discuss the different RDD joins: inner, left, right, and cartesian. Inner Join: It returns the records whose keys match in both RDDs. If one RDD contains (K, V1) pairs and the other contains (K, V2) pairs, an inner join between the two RDDs returns (K, (V1, V2)). Q-1 We have one dataset Read more…
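
The shape of each join result can be modelled in plain Python. This is a sketch of the semantics only, not Spark code: the `marks` and `cities` lists and the helper function names are hypothetical, chosen to mirror the pair-RDD methods join, leftOuterJoin, rightOuterJoin, and cartesian.

```python
from itertools import product

# Hypothetical sample data: two lists of key-value pairs.
marks = [("alice", 80), ("bob", 65)]               # (K, V1)
cities = [("alice", "Pune"), ("carol", "Delhi")]   # (K, V2)

def inner_join(left, right):
    """Model of left.join(right): (K, (V1, V2)) for every matching key pair."""
    return [(k, (v1, v2)) for k, v1 in left for k2, v2 in right if k == k2]

def left_outer_join(left, right):
    """Model of left.leftOuterJoin(right): unmatched left keys get None."""
    right_keys = {k for k, _ in right}
    unmatched = [(k, (v1, None)) for k, v1 in left if k not in right_keys]
    return inner_join(left, right) + unmatched

def right_outer_join(left, right):
    """Model of left.rightOuterJoin(right): unmatched right keys get None."""
    left_keys = {k for k, _ in left}
    unmatched = [(k, (None, v2)) for k, v2 in right if k not in left_keys]
    return inner_join(left, right) + unmatched

def cartesian(left, right):
    """Model of left.cartesian(right): every element paired with every other."""
    return list(product(left, right))

print(inner_join(marks, cities))
print(left_outer_join(marks, cities))
print(right_outer_join(marks, cities))
print(len(cartesian(marks, cities)))
```

The cartesian product ignores keys entirely, so its size is the product of the two input sizes.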

Tutorial-3 PySpark RDD Aggregation

In this article, we discuss groupByKey, reduceByKey, and aggregateByKey. (a) groupByKey: Applying groupByKey converts a dataset of (K, V) pairs into a dataset of (K, Iterable) pairs. This causes a lot of unnecessary data transfer over the network. In the above image, every key and value is being transferred in Read more…
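
The difference in data transfer can be illustrated with a small plain-Python model. This is a sketch under simplifying assumptions, not Spark code: the nested lists stand in for two hypothetical partitions, and "shuffled" simply counts the records that would leave a partition. The helper names are made up to mirror groupByKey, reduceByKey, and aggregateByKey.

```python
from collections import defaultdict

# Two hypothetical partitions of (K, V) pairs.
partitions = [
    [("a", 1), ("a", 2), ("a", 3), ("b", 4)],
    [("a", 5), ("b", 6), ("b", 7)],
]

def group_by_key(parts):
    """Model of groupByKey: every record is shuffled, then grouped."""
    shuffled = [pair for part in parts for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return dict(grouped), len(shuffled)

def reduce_by_key(parts, func):
    """Model of reduceByKey: combine inside each partition, then shuffle."""
    shuffled = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        shuffled.extend(local.items())
    merged = {}
    for key, value in shuffled:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged, len(shuffled)

def aggregate_by_key(parts, zero, seq_func, comb_func):
    """Model of aggregateByKey: seq_func within a partition, comb_func across."""
    shuffled = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = seq_func(local.get(key, zero), value)
        shuffled.extend(local.items())
    merged = {}
    for key, acc in shuffled:
        merged[key] = comb_func(merged[key], acc) if key in merged else acc
    return merged, len(shuffled)

grouped, grouped_shuffled = group_by_key(partitions)
sums, sums_shuffled = reduce_by_key(partitions, lambda x, y: x + y)
# (sum, count) per key; the zero value is an immutable tuple.
stats, stats_shuffled = aggregate_by_key(
    partitions,
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda x, y: (x[0] + y[0], x[1] + y[1]),
)

print(grouped, grouped_shuffled)
print(sums, sums_shuffled)
print(stats, stats_shuffled)
```

In this model groupByKey moves all 7 records, while reduceByKey and aggregateByKey move only one record per key per partition (4 in total), which is the reason they are usually preferred when the goal is an aggregate rather than the full list of values.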

