We will discuss and practice each transformations and actions using in spark RDD.

Transformations: Map,FlatMap, MapPartition, Filter, Sample, Union, Intersection, Distinct, ReduceByKey, GroupByKey, AggregateByKey, Join, Repartition, Coalesce etc .

Actions: Reduce, Collect, Count, First, Take, Foreach, saveAsTextFile etc.

Q-1 What all different ways to create the RDD?

Ans:

  • Parallelize the present collection in our dataset.
  • Referencing a dataset in external storage system
  • Convert a dataframe to RDD.

1-(a) Parallelize method

#Parallelize on List
list_data=[2,3,4,5,6]
#For Spark version 2
rdd_list_ver2=spark.sparkContext.parallelize(list_data)
#For Spark version 1
rdd_list_ver1=sc.parallelize(list_data)

1-(b)Referencing a dataset in external storage system
data_rdd=spark.sparkContext.textFile(“/FileStore/tables/data.txt”)

1-(c) Convert Dataframe to RDD.
dataRDDCsv = spark.read.csv(“path/of/csv/file”).rdd
dataRDDJson = spark.read.json(“path/of/json/file”).rdd
dataRDDParquet =spark.read.parquet(“path/of/parquet/file”).rdd

Now let us see other questions and practice each Spark transformations.

(a) Map: Map transformation applies to each element of RDD and it returns the result as new RDD.
If Input has N elements in RDD then after applying map, output will have N elements only in RDD.

Q-2: Prepare the RDD through referencing a external dataset and convert the RDD into tuple like below:

  • (value,1)
  • (Firstletter(value),value)
  • (value,len(value))

  • #Implement for (value,1)
    #Input data.txt contains
    #The Sun rises in the East and sets in the West.
    #The Sun sets in the West and rises in the East


    data_rdd=spark.sparkContext.textFile(“/FileStore/tables/data.txt”)
    map_rdd=data_rdd.map(lambda x:(x,1))
    map_rdd.collect()


    #Output
    #(The Sun rises in the East and sets in the West.,1)
    #(The Sun sets in the West and rises in the East.,1)


    #Implement for (Firstletter(value),value)
    map_rdd_firstLetter=data_rdd.map(lambda x:(x[0], x))
    map_rdd_firstLetter.collect()

    #Output
    #(T,The Sun rises in the East and sets in the West.)
    #(T,The Sun sets in the West and rises in the East.)


    #Implement for (value,len(value))
    map_rdd_length=data_rdd.map(lambda x:(x,len(x)))
    map_rdd_length.collect()

    #Output
    #(The Sun rises in the East and sets in the West.,47)
    #(The Sun sets in the West and rises in the East.,47)

    Q-3- Append some string(“Sun”) to each line in the input dataset?

    #Below is the sample example to append some keyword in each line of RDD.

    def appendedFunc(s):
    topic=”Sun”
    appendedLine=s+topic
    return appendedLine
    appended_rdd=data_rdd.map(appendedFunc)
    appended_rdd.collect()

    #Output
    #[‘The Sun rises in the East and sets in the West.Sun’, # ‘The Sun sets in the West and rises in the East.Sun’]

    (b) FlatMap: Similar to map, it returns a new RDD by applying a function to each element of the RDD, but output is flattened.
    If Input has N elements in RDD then after applying map output can have M elements in RDD.

    Q-4:How to flatten the input strings RDD into words and convert each word like (word,1)?

    data_flat=data_rdd.flatMap(lambda x:(x.split(” “)))
    data_flat=data_rdd.flatMap(x=>(x.split(” “)))
    data_flat.collect()
    #Output
    [‘The’,’Sun’,’rises’,’in’,’the’,’East’,’and’,’sets’,’in’,’the’,’West.’,’The’,’Sun’,’sets’,’in’,’the’, ‘West’,’and’,’rises’,’in’,’the’,’East.’]

    data_map=data_flat.map(lambda x:(x,1))
    data_map.collect()
    #Output
    # [(‘The’, 1),(‘Sun’, 1),(‘rises’, 1),(‘in’, 1),(‘the’, 1),(‘East’, 1),(‘and’, 1),(‘sets’, 1),(‘in’, 1),(‘the’, 1),(‘West.’, 1),(‘The’, 1),(‘Sun’, 1),(‘sets’, 1),(‘in’, 1),(‘the’, 1), (‘West’, 1),(‘and’, 1),(‘rises’, 1),(‘in’, 1),(‘the’, 1),(‘East.’, 1)]

    Q-5: What is the difference between map and flatMap transformations?

    Map: If Input has N elements in RDD then after applying map output will have N elements only in RDD.

    FlatMap:If Input has N elements in RDD then after applying map output can have M elements in RDD.

    Let us apply the map and flatMap transformations and see what will happen?

    #Map
    data_map=data_rdd.map(lambda x:(x.split(” “)))
    data_map.collect()
    #output
    # [[‘The’,’Sun’,’rises’,’in’,’the’,’East’,’and’,’sets’,’in’,’the’,’West.’],[‘The’,’Sun’,’sets’,’in’,’the’,’West’,’and’,’rises’,’in’,’the’,’East.’]]

    data_rdd.count() #Output –>2
    data_map.count() #Output–>2

    #Input RDD has 2 lines and Output RDD has also 2 lines , Input(N elements)==>Output(N elements)


    #flatMap
    data_flat=data_rdd.flatMap(lambda x:(x.split(” “)))
    data_flat.count() #Output-22
    data_flat.collect()
    # Output
    # [‘The’,’Sun’,’rises’,’in’,’the’,’East’,’and’,’sets’,’in’,’the’,’West.’,’The’,’Sun’,’sets’,’in’, ‘the’,’West’,’and’,’rises’,’in’,’the’,’East.’]


0 Comments

Leave a Reply

Your email address will not be published. Required fields are marked *

Insert math as
Block
Inline
Additional settings
Formula color
Text color
#333333
Type math using LaTeX
Preview
\({}\)
Nothing to preview
Insert