In this article, we discuss four RDD transformations: union, distinct, subtract, and intersection.
- Union merges two or more RDDs into one.
- rdd1.union(rdd2) returns an RDD that contains the data from both sources.
- If the input RDDs contain duplicate elements, the RDD returned by union contains those duplicates as well.
Q-1 We have one dataset, students_marks.csv, and a second student marks dataset. How can we union them?
How can we perform union operations for more than two RDDs simultaneously?
# We can also union several RDDs at once with SparkContext.union.
# Here sc is an existing SparkContext, as in the pyspark shell.
rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([4, 5, 6])
rdd3 = sc.parallelize([7, 8, 9])
rdd = sc.union([rdd1, rdd2, rdd3])
print(rdd.collect())
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
Distinct: If we need only the unique elements of an RDD, the distinct function returns a new RDD with the duplicates removed.
Q-2 Find the distinct student names in the students_marks.csv dataset.
Subtract: This transformation returns an RDD containing the elements that exist in the first RDD but not in the second.
Q-3 Find the student details that exist in the first RDD but not in the second RDD.
Intersection: This transformation returns the elements that exist in both RDDs. The result contains only unique elements: duplicates are removed even if they appear in both inputs.
Q-4 Find the elements that exist in both RDDs.