In this article, we discuss the union, distinct, subtract, and intersection transformations on Spark RDDs.
Union:
- Merges two or more RDDs.
- rdd1.union(rdd2) returns a new RDD that contains the data from both sources.
- union does not remove duplicates: if the input RDDs contain duplicate elements, the resulting RDD contains them too.
Q-1 We have two datasets of student marks, student_marks.csv and student_rest_marks.csv. How can we union these two datasets?
students_marks=spark.sparkContext.textFile("/FileStore/tables/student_marks.csv")
students_rest_marks=spark.sparkContext.textFile("/FileStore/tables/student_rest_marks.csv")
students_union=students_marks.union(students_rest_marks)
students_union.collect()
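To make the duplicate behaviour of union concrete without a Spark cluster, here is a minimal plain-Python sketch that models it with lists. The function `union_model` and the "Asha" row are hypothetical, introduced only for illustration; this is a model of the semantics, not how Spark implements union.

```python
def union_model(first, second):
    """Model of RDD.union: concatenate both inputs and keep every duplicate."""
    return list(first) + list(second)

# Sample rows (hypothetical data; "Asha" is made up for this sketch).
student_marks = ["Daniel,Maths,98,100", "Asha,Maths,91,100"]
student_rest_marks = ["Daniel,Maths,98,100", "Daniel,Physics,98,100"]

# PySpark equivalent: students_marks.union(students_rest_marks).collect()
students_union = union_model(student_marks, student_rest_marks)

print(len(students_union))                          # 4 rows, nothing dropped
print(students_union.count("Daniel,Maths,98,100"))  # the shared row appears twice
```

If duplicates are not wanted, calling distinct() on the result of the union removes them.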
Distinct: When we need only the unique elements of an RDD, distinct() returns a new RDD with the duplicates removed.
Q-2 Find the distinct student names in the student_marks.csv dataset.
students_name=students_marks.map(lambda x: x.split(",")[0])
student_distinct=students_name.distinct()
student_distinct.collect()
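The map-then-distinct step can be modelled in plain Python as follows. This is only a sketch of the semantics: `distinct_model` and the sample rows are hypothetical, and unlike this model, Spark does not guarantee the order of the elements that distinct() returns.

```python
def distinct_model(items):
    """Model of RDD.distinct: keep exactly one copy of each element."""
    # dict.fromkeys keeps first-seen order; Spark itself guarantees no order.
    return list(dict.fromkeys(items))

# Sample rows (hypothetical data for this sketch).
rows = [
    "Daniel,Maths,98,100",
    "Daniel,Physics,98,100",
    "Asha,Maths,91,100",
]

# Same idea as students_marks.map(lambda x: x.split(",")[0])
names = [row.split(",")[0] for row in rows]

# PySpark equivalent: students_name.distinct().collect()
unique_names = distinct_model(names)
print(sorted(unique_names))
```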
Subtract: This transformation returns an RDD containing the elements that exist in the first RDD but not in the second RDD.
Q-3 Find the student details that exist in the first RDD but not in the second RDD.
student_rest=students_union.subtract(students_marks)
student_rest.collect()
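One detail worth seeing on small data: subtract removes every occurrence of an element that appears in the second RDD, but it does not deduplicate what remains. The plain-Python sketch below models that behaviour; `subtract_model` and the sample rows are hypothetical and only illustrate the semantics.

```python
def subtract_model(first, second):
    """Model of RDD.subtract: keep elements of `first` that never occur in `second`."""
    exclude = set(second)
    return [item for item in first if item not in exclude]

# Sample rows (hypothetical data for this sketch).
students_marks = ["Daniel,Maths,98,100", "Asha,Maths,91,100"]
students_rest_marks = [
    "Daniel,Maths,98,100",    # also present in students_marks
    "Daniel,Physics,98,100",
    "Daniel,Physics,98,100",  # duplicate inside the second file
]
students_union = students_marks + students_rest_marks

# PySpark equivalent: students_union.subtract(students_marks).collect()
student_rest = subtract_model(students_union, students_marks)
print(student_rest)
```

The shared Maths row disappears entirely, while the duplicated Physics row is kept twice, so the result is not necessarily identical to the second file.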
Intersection: This transformation returns the elements that exist in both RDDs. The result contains only unique elements; duplicates are removed even if the input RDDs contain them.
Q-4 Find the elements that exist in both RDDs.
student_intersect=students_union.intersection(student_rest)
student_intersect.collect()
#Output
#['Daniel,Computer Science,99,100', 'Daniel,Physics,98,100', 'Daniel,English,50,100', 'Daniel,Hindi,99,100', 'Daniel,Chemistry,98,100', 'Daniel,Maths,98,100']
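The deduplicating behaviour of intersection can be checked with a small plain-Python model. As with any such sketch, `intersection_model` and the sample rows are hypothetical, and Spark makes no promise about the order of the returned elements.

```python
def intersection_model(first, second):
    """Model of RDD.intersection: elements present in both inputs, each kept once."""
    common = set(first) & set(second)
    # dict.fromkeys drops repeats; the order here is first-seen, Spark's is unspecified.
    return list(dict.fromkeys(item for item in first if item in common))

# Sample rows (hypothetical data for this sketch).
first_rdd = ["Daniel,Maths,98,100", "Daniel,Maths,98,100", "Daniel,Physics,98,100"]
second_rdd = ["Daniel,Maths,98,100", "Daniel,Maths,98,100", "Asha,Maths,91,100"]

# PySpark equivalent: first_rdd.intersection(second_rdd).collect()
student_intersect = intersection_model(first_rdd, second_rdd)
print(student_intersect)
```

Even though the Maths row occurs twice in each input, it appears only once in the result.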