Below, we will discuss and practice each of the transformations and actions on Spark RDDs.

Transformations: Map, FlatMap, MapPartitions, Filter, Sample, Union, Intersection, Distinct, ReduceByKey, GroupByKey, AggregateByKey, Join, Repartition, Coalesce, etc.

Actions: Reduce, Collect, Count, First, Take, Foreach, saveAsTextFile, etc.

Q-1: What are the different ways to create an RDD?

Ans:

  • Parallelizing an existing collection in the driver program
  • Referencing a dataset in an external storage system
  • Converting a DataFrame to an RDD

(a) Parallelize method

// Parallelize an Array
val array_data = Array(2, 3, 4, 5, 6)
// Spark version 2 onwards, via the SparkSession's SparkContext
val rdd_array_ver2 = spark.sparkContext.parallelize(array_data)
// Before Spark 2, we can create the RDD with the SparkContext (sc) directly:
val rdd_array_ver1 = sc.parallelize(array_data)

// Parallelize a Seq
val seq_data = Seq(2, 3, 4, 5, 6)
// For Spark version 2
val rdd_seq_ver2 = spark.sparkContext.parallelize(seq_data)
// For Spark version 1
val rdd_seq_ver1 = sc.parallelize(seq_data)

// Parallelize a List
val list_data = List(2, 3, 4, 5, 6)
// For Spark version 2
val rdd_list_ver2 = spark.sparkContext.parallelize(list_data)
// For Spark version 1
val rdd_list_ver1 = sc.parallelize(list_data)
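
parallelize also accepts an optional second argument that sets how many partitions the resulting RDD is split into. A minimal sketch (the partition count of 4 and the variable name are only illustrative choices):

// Split the collection into 4 partitions instead of the default
val rdd_list_parts = spark.sparkContext.parallelize(list_data, 4)
rdd_list_parts.getNumPartitions // 4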

(b) Referencing a dataset in an external storage system
val data_rdd = spark.sparkContext.textFile("/FileStore/tables/data.txt")

(c) Convert a DataFrame to an RDD
val dataRDDCsv = spark.read.csv("path/of/csv/file").rdd
val dataRDDJson = spark.read.json("path/of/json/file").rdd
val dataRDDParquet = spark.read.parquet("path/of/parquet/file").rdd
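
Note that calling .rdd on a DataFrame yields an RDD[Row], so individual columns are accessed by position. A small sketch, assuming the first CSV column is a string (the name firstColumnRDD is only for illustration):

// Each element is an org.apache.spark.sql.Row; pull out the first column by index
val firstColumnRDD = dataRDDCsv.map(row => row.getString(0))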

Let us discuss and practice each Spark transformation.

(a) Map: The map transformation applies a function to each element of the RDD and returns the result as a new RDD.
If the input RDD has N elements, then after applying map the output RDD will have exactly N elements.

Q-2: Create an RDD by referencing an external dataset and convert each element of the RDD into a tuple, as below:

  • (value,1)
  • (Firstletter(value),value)
  • (value,len(value))

// Implement for (value, 1)
// Input data.txt contains:
// The Sun rises in the East and sets in the West.
// The Sun sets in the West and rises in the East.

val data_rdd = spark.sparkContext.textFile("/FileStore/tables/data.txt")
val map_rdd = data_rdd.map(x => (x, 1))
map_rdd.collect().foreach(println)

// Output
// (The Sun rises in the East and sets in the West.,1)
// (The Sun sets in the West and rises in the East.,1)

// Implement for (Firstletter(value), value)
// x(0) is the first character (a Char) of each line
val map_rdd_firstLetter = data_rdd.map(x => (x(0), x))
map_rdd_firstLetter.collect().foreach(println)

// Output
// (T,The Sun rises in the East and sets in the West.)
// (T,The Sun sets in the West and rises in the East.)

// Implement for (value, len(value))
val map_rdd_length = data_rdd.map(x => (x, x.length))
map_rdd_length.collect().foreach(println)

// Output
// (The Sun rises in the East and sets in the West.,47)
// (The Sun sets in the West and rises in the East.,47)


Q-3: How to append a string ("Sun") to each line in the input dataset?

// We can also transform the dataset in the following way.
// This pattern is useful when we need to save each line of the text or DataFrame into an external table,
// or when we need to parse XML and apply some operation to each element.
// Below is a sample example that appends a keyword to each element of the RDD.

val appended_rdd = data_rdd.map { line =>
  val topic = "Sun"
  val append_line = line + topic
  append_line
}
appended_rdd.collect().foreach(println)

// Output
// The Sun rises in the East and sets in the West.Sun
// The Sun sets in the West and rises in the East.Sun



Q-4: How to add a line number to the dataset without using the row_number function?

// Input Employee_sample.csv
// [123,1,20190506]
// [124,0,20190403]
// [125,0,20190403]

import org.apache.spark.sql.Row

// spark.read.csv returns a DataFrame; .rdd below converts it to an RDD[Row]
val employee_rdd = spark.read.csv("/FileStore/tables/Employee_sample.csv")
var linenumber = 0
val employee_row = employee_rdd.rdd.map { row =>
  linenumber = linenumber + 1
  // Insert the running line number as the second field
  Row(row(0), linenumber, row(1), row(2))
}
employee_row.collect().foreach(println)

// Output
// [123,1,1,20190506]
// [124,2,0,20190403]
// [125,3,0,20190403]
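
The counter above is mutated inside map, which behaves as expected only when the data sits in a single partition; with multiple partitions each executor increments its own copy of linenumber. A safer alternative (a sketch, not part of the original example) is zipWithIndex, which assigns a stable, RDD-wide index:

// zipWithIndex pairs each Row with its 0-based position across the whole RDD
val employee_indexed = employee_rdd.rdd.zipWithIndex().map { case (row, idx) =>
  Row(row(0), idx + 1, row(1), row(2))
}
employee_indexed.collect().foreach(println)
// Expected output (same as above):
// [123,1,1,20190506]
// [124,2,0,20190403]
// [125,3,0,20190403]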

(b) FlatMap: Similar to map, it returns a new RDD by applying a function to each element of the RDD, but the output is flattened.
If the input RDD has N elements, then after applying flatMap the output RDD can have M elements (M may differ from N).

Q-5: How to flatten the input RDD of strings into words and convert each word into (word, 1)?

val data_rdd = spark.sparkContext.textFile("/FileStore/tables/data.txt")
val data_flat = data_rdd.flatMap(x => x.split(" "))
data_flat.collect().foreach(println)
// Output
// The
// Sun
// rises
// in
// the
// ..
// ..
// the
// East.

// Now let's pair each word with a count of 1.

val data_map = data_flat.map(x => (x, 1))
data_map.collect().foreach(println)
// Output
// (The,1)
// (Sun,1)
// (rises,1)
// (in,1)
// ..
// ..
// (the,1)
// (East.,1)
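
From these (word, 1) pairs, the count per word can be aggregated with reduceByKey (listed among the transformations above). A brief sketch; the exact pairs printed depend on the sample data.txt:

// Sum the 1s for each word to get a word count
val word_counts = data_map.reduceByKey(_ + _)
word_counts.collect().foreach(println)
// e.g. (Sun,2), (rises,2), ...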



Q-6: What is the difference between map and flatMap?

Map: If the input RDD has N elements, then after applying map the output RDD will have exactly N elements.

FlatMap: If the input RDD has N elements, then after applying flatMap the output RDD can have M elements (M may differ from N).

Let us apply the map and flatMap transformations and see what happens.

val data_map = data_rdd.map(x => x.split(" "))
data_map.collect()
// Output
// Array(Array(The, Sun, rises, in, the, East, and, sets, in, the, West.), Array(The, Sun, sets, in, the, West, and, rises, in, the, East.))
data_rdd.count() // Output: 2
data_map.count() // Output: 2

// The input RDD has 2 lines and the output RDD also has 2 lines: Input (N elements) ==> Output (N elements)

val data_flat = data_rdd.flatMap(x => x.split(" "))
data_flat.count() // Output: 22
data_flat.collect().foreach(println)
// Output
// The
// Sun
// rises
// in
// the
// ..
// ..
// the
// East.

