We will discuss and practice each transformations and actions using in spark RDD.
Transformations: Map,FlatMap, MapPartition, Filter, Sample, Union, Intersection, Distinct, ReduceByKey, GroupByKey, AggregateByKey, Join, Repartition, Coalesce etc .
Actions: Reduce, Collect, Count, First, Take, Foreach, saveAsTextFile etc.
Q-1 What all different ways to create the RDD?
Ans:
- Parallelize the present collection in our dataset.
- Referencing a dataset in external storage system
- Convert a dataframe to RDD.
1-(a) Parallelize method
#Parallelize on List
list_data=[2,3,4,5,6]
#For Spark version 2
rdd_list_ver2=spark.sparkContext.parallelize(list_data)
#For Spark version 1
rdd_list_ver1=sc.parallelize(list_data)
1-(b)Referencing a dataset in external storage system
data_rdd=spark.sparkContext.textFile(“/FileStore/tables/data.txt”)
1-(c) Convert Dataframe to RDD.
dataRDDCsv = spark.read.csv(“path/of/csv/file”).rdd
dataRDDJson = spark.read.json(“path/of/json/file”).rdd
dataRDDParquet =spark.read.parquet(“path/of/parquet/file”).rdd
Now let us see other questions and practice each Spark transformations.
(a) Map: Map transformation applies to each element of RDD and it returns the result as new RDD.
If Input has N elements in RDD then after applying map, output will have N elements only in RDD.
Q-2: Prepare the RDD through referencing a external dataset and convert the RDD into tuple like below:
- (value,1)
- (Firstletter(value),value)
- (value,len(value))
#Implement for (value,1)
#Input data.txt contains
#The Sun rises in the East and sets in the West.
#The Sun sets in the West and rises in the East
data_rdd=spark.sparkContext.textFile(“/FileStore/tables/data.txt”)
map_rdd=data_rdd.map(lambda x:(x,1))
map_rdd.collect()
#Output
#(The Sun rises in the East and sets in the West.,1)
#(The Sun sets in the West and rises in the East.,1)
#Implement for (Firstletter(value),value)
map_rdd_firstLetter=data_rdd.map(lambda x:(x[0], x))
map_rdd_firstLetter.collect()
#Output
#(T,The Sun rises in the East and sets in the West.)
#(T,The Sun sets in the West and rises in the East.)
#Implement for (value,len(value))
map_rdd_length=data_rdd.map(lambda x:(x,len(x)))
map_rdd_length.collect()
#Output
#(The Sun rises in the East and sets in the West.,47)
#(The Sun sets in the West and rises in the East.,47)
Q-3- Append some string(“Sun”) to each line in the input dataset?
#Below is the sample example to append some keyword in each line of RDD.
def appendedFunc(s):
topic=”Sun”
appendedLine=s+topic
return appendedLine
appended_rdd=data_rdd.map(appendedFunc)
appended_rdd.collect()
#Output
#[‘The Sun rises in the East and sets in the West.Sun’,
# ‘The Sun sets in the West and rises in the East.Sun’]
(b) FlatMap: Similar to map, it returns a new RDD by applying a function to each element of the RDD, but output is flattened.
If Input has N elements in RDD then after applying map output can have M elements in RDD.
Q-4:How to flatten the input strings RDD into words and convert each word like (word,1)?
data_flat=data_rdd.flatMap(lambda x:(x.split(” “)))
data_flat=data_rdd.flatMap(x=>(x.split(” “)))
data_flat.collect()
#Output
[‘The’,’Sun’,’rises’,’in’,’the’,’East’,’and’,’sets’,’in’,’the’,’West.’,’The’,’Sun’,’sets’,’in’,’the’, ‘West’,’and’,’rises’,’in’,’the’,’East.’]
data_map=data_flat.map(lambda x:(x,1))
data_map.collect()
#Output
# [(‘The’, 1),(‘Sun’, 1),(‘rises’, 1),(‘in’, 1),(‘the’, 1),(‘East’, 1),(‘and’, 1),(‘sets’, 1),(‘in’, 1),(‘the’, 1),(‘West.’, 1),(‘The’, 1),(‘Sun’, 1),(‘sets’, 1),(‘in’, 1),(‘the’, 1), (‘West’, 1),(‘and’, 1),(‘rises’, 1),(‘in’, 1),(‘the’, 1),(‘East.’, 1)]
Q-5: What is the difference between map and flatMap transformations?
Map: If Input has N elements in RDD then after applying map output will have N elements only in RDD.
FlatMap:If Input has N elements in RDD then after applying map output can have M elements in RDD.
Let us apply the map and flatMap transformations and see what will happen?
#Map
data_map=data_rdd.map(lambda x:(x.split(” “)))
data_map.collect()
#output
# [[‘The’,’Sun’,’rises’,’in’,’the’,’East’,’and’,’sets’,’in’,’the’,’West.’],[‘The’,’Sun’,’sets’,’in’,’the’,’West’,’and’,’rises’,’in’,’the’,’East.’]]
data_rdd.count() #Output –>2
data_map.count() #Output–>2
#Input RDD has 2 lines and Output RDD has also 2 lines , Input(N elements)==>Output(N elements)
#flatMap
data_flat=data_rdd.flatMap(lambda x:(x.split(” “)))
data_flat.count() #Output-22
data_flat.collect()
# Output
# [‘The’,’Sun’,’rises’,’in’,’the’,’East’,’and’,’sets’,’in’,’the’,’West.’,’The’,’Sun’,’sets’,’in’, ‘the’,’West’,’and’,’rises’,’in’,’the’,’East.’]
0 Comments