Hey people!!

Do you know how Netflix recommends us movies? How it classifies things?

It predicts Movie Ratings according to user’s ratings and on other basic grounds. But, don’t you think we need to first analyze the data and get some insights from it.

Thus, we’ll perform Spark Analysis on Movie-lens dataset and try putting some queries together.


Photo by Jake Hills on Unsplash

QUESTION 1 : Read the Movie and Rating datasets.

We’ll read the CVS file by converting it into Data-frames. You can download the datasets from movie.csv rating.csv and start practicing.

// Movie Dataset
val df_mov=spark.read.options(Map("header"->"true")).csv("hdfs://localhost:9000/tables/movies.csv")
df_mov.show()

//Rating Dataset
val df_rat=spark.read.options(Map("header"->"true")).csv("hdfs://localhost:9000/tables/MovieRating.csv")
df_rat.show()

Movie

Rating

Movie Dataframe has three columns as:

  • movieId (Unique)
  • title
  • genres

Rating Dataframe has four columns as:

  • userId
  • movieId (Unique)
  • rating
  • timestamp

QUESTION 2: Check the datatype of dataframes column and change if it doesn’t go with the values?

df_mov.printSchema()
df_rat.printSchema()


In the movie dataset, movieId is of string datatype and for rating one, userId, movieId, and rating doesn’t fall in the proper datatype. We need to change it using withcolumn() and cast function.

import org.apache.spark.sql.functions._

// Movie
val movie=df_mov.withColumn("movieId", $"movieId".cast("Integer"))
movie.printSchema()

//Rating
val rating=df_rat.withColumn("UserId", $"userId".cast("Integer"))
            .withColumn("MovieId", $"movieId".cast("Integer"))
            .withColumn("Rating", $"rating".cast("Integer"))
rating.printSchema()


QUESTIONS 3: Check if there are null values in the rating dataframe and remove if any?

Let’s check out if there are null values in the rating dataframe.

rating.filter(($"UserId".isNull) or ($"MovieId".isNull) or ($"Rating".isNull) or ($"timestamp".isNull)).show()

To remove the null values,

val Rating=rating.na.drop()
Rating.show()


QUESTION 4: Find out the top 20 highest rating movies and worst 20 too?

We need to join both DataFrames, movie and Rating to find out top and worst rating movies.

TOP RATING MOVIES

val top=movie.join(Rating,movie("movieId")===Rating("MovieId"),"inner").sort(desc("Rating"))
top.select("title","Rating").show(20)

WORST RATING MOVIES

val worst=movie.join(Rating,movie("movieId")===Rating("MovieId"),"inner").sort("Rating")
worst.select("title","Rating").show(20)


QUESTION 5: Name top 10 most viewed movies?

  • Inner Join between movie and Rating Dataframe
  • groupBy using “title
  • count the number of users who watched a particular movie
  • Sorting in decreasing order
val view=movie.join(Rating,movie("movieId")===Rating("MovieId"),"inner").groupBy("title").agg(count("*").as("No_of_person_viewed"))
view.select("title","No_of_person_viewed").sort(desc("No_of_person_viewed")).show(10)

We found that Gattaca is one of the most viewed movie.


QUESTION 6: Name distinct list of genres available?

Since there are multiple genres in a single movie. We need to split the genre to start processing using ‘|’ operator and then applying explode function to split the array of genres and have a distinct genre in each row.

withColumn adds a new column to the Dataframe.

val df=movie.withColumn("Genres",explode(split($"genres","[|]")))
df.show()
df.select($"Genres").distinct.show()


QUESTION 7: How many movies are there in each genre?

We need to find the count of movies in each genre.

I am using the same Dataframe df, created in previous questions, and applying groupBy to Genre and then using count function.

df.groupBy("Genres").count().show()

So, here we have DRAMA which occupies most of the movies.


QUESTION 8: Convert exploded movie Dataframe Genres again into list with commas?

We’ll be using exploded movie Dataframe in this question that we obtained in question 6. collect_list() function is used to convert Genres into list.

df.groupBy("movieId", "title")
  .agg(collect_list($"Genres").as("Genre")).show()

Bingo!! We can see the list now.


QUESTION 9: Name the movies starting with number ‘3’?

Well, to find the movies starting with number ‘3’, let’s filter out the movies and then apply the startsWith() function to return True if the movie name(string) starts with the given prefix.

movie.filter(col("title").startsWith("3")).show()

Yeah!! We found so many movies starting with number 3 🙂


QUESTION 10: List out the userid and Genres where ratings of the movie is 5?

I wish now you have concrete knowledge to solve this. Let’s try:

  1. Inner Join between two Dataframes
  2. Filter movies with rating 5
  3. Select UserId, Genre and Rating
val df_new=movie.join(Rating,movie("movieId")===Rating("MovieId"),"inner").filter($"Rating"==='5')
df_new.select($"UserId",$"genres",$"Rating").show()


QUESTION 11: Check if we have duplicate rows with Userid and title and remove if any?

Let’s check if we have duplicates or not.

We inner joined the two Dataframes, performed groupBy on UserId and title and counted on them, to find for duplicates.

val dup=movie.join(Rating,movie("movieId")===Rating("MovieId"),"inner")
dup.groupBy($"UserId",$"title").count().filter($"count">1).show()

We have duplicates!!

Let’s remove them using dropDuplicates() function.

val drop = dup.dropDuplicates("UserId","title")
//Checking
drop.groupBy($"UserId",$"title").count().filter($"count">1).show(10000000)

After dropping duplicates, we again checked and found no entries.

Wohoo!! Duplicates no more exist 😛


Photo by Gwen Ong on Unsplash

The show is over. Here, the curtains falls!!

Thank you so much for reading this far. Try out some cranky questions and leave a comment down if you have any suggestions/doubts.

Have a break. Go and watch a movie 🙂

Happy Sparking!!


0 Comments

Leave a Reply

Your email address will not be published. Required fields are marked *

Insert math as
Block
Inline
Additional settings
Formula color
Text color
#333333
Type math using LaTeX
Preview
\({}\)
Nothing to preview
Insert