
Picking low-hanging fruit: cleaning and exploring Netflix data using R



What's up! It's your boy, back at it again with another article. Last time, I gave some ideas on how to build an unguided project, along with some advice. Now's a good time to put all of that into action by building three different projects of progressing complexity from a single data set. The only question is: which data set? Well, considering that I spent the last 14 months of my life during the global lockdown binging the various streaming services, I figured it might be a good idea to look at some data from the world's most prolific streaming service, Netflix.

Despite being a go-to streaming service that has made it into the ranks of the world's tech titans, its content as a whole is pretty hit-or-miss. It seems that for every A-list cast member with an award-winning performance, there are scores of no-name actors and actresses credited in some kind of hot garbage…and, like, two Adam Sandler movies.

Well, now's as good a time as any to see how varied their content is. We can also get an idea of why their algorithm works the way it does. Does it have something to do with actors or directors? Maybe it has to do with the content rating? Or possibly, the content is skewed to favor a particular cohort over others in terms of available quality content (i.e., those not borrowing a password vs. those that are).


[Image: When you've been putting this off until tomorrow for the past 5 years.]


Since I have no idea why this may be the case, I figured it would be a good idea to dig into what sort of content Netflix has to offer and find out. Specifically, it'll be great to answer the following questions:

  1. What's the breakdown between TV series and movies? How does the breakdown differ with regard to genre?
  2. What's the breakdown between English-speaking and non-English-speaking content?
  3. What's the distribution of Netflix content in terms of content ratings?
  4. Which actors/actresses/directors are credited with the most headlining English-speaking or non-English-speaking roles on Netflix?
  5. What are some of the most common terms used to describe English-speaking and non-English-speaking Netflix content?

THE DATA

So, in order to get a clear idea of what Netflix has to offer, I first need a data set to work off of. With a quick Google search, I was able to find this data set from Kaggle, which contains 7,787 different titles: a list of Netflix content dating back to 2010. Some of the variables in this data set include:

Variable        Description
show_id         Netflix identifier
type            Whether the title is a movie or a TV series
title           Content title
director ^      List of all directing credits
cast **         List of all cast members
country         List of countries where the content is distributed
date_added      When the content was added to Netflix
release_year    The year the content was released
rating          Content rating
duration        Length of the content
listed_in       List of all applicable genres
description     Description of the movie/TV series

^ The order of the names corresponds to the hierarchy of directing roles
** The order of the names corresponds to the casting hierarchy, starting with headlining roles


THE PROCESS

Since we're trying to gain some understanding of Netflix data that has been scraped from an API, we're probably going to have to do some data cleaning and wrangling before anything else. While this isn't the sexiest thing to do, it's nevertheless very important, as it serves as the foundation for the other projects to be built on. However, before doing any of this, we'll need to load some libraries that'll help with these projects:

  • SKIMR – Used for a quick glance at the data set
  • TIDYTEXT – I'll be working with text data, so this makes the process of introducing stop words and filtering them out easy.
  • TIDYVERSE – A collection of packages that makes the process of tidying data really easy
  • SHINY – I'll eventually be using this to make these projects interactive (HINT: check out my next two articles)
  • WORDCLOUD2 – We'll be making word clouds at some point. (Thanks to the idea from this article)
  • CLUSTER – Will be used for the final project involving this data set
library(tidyverse) # also loads dplyr, tidyr, and stringr, all used below
library(tidytext)  # for unnest_tokens() and the stop_words list
netflix = read.csv("netflix_title.csv")
View(netflix)


After reading in the data set, you'll notice that there's quite a bit of work to be done:

  1. There are blanks in the data that resulted from the scraping process
  2. The text for cast, director, genre, and description is crammed into single long strings that need to be separated out
  3. The duration variable distinguishes between TV series and movies: movies are recorded in minutes, series in number of seasons
  4. There are non-ASCII characters that we need to contend with
  5. The date_added column uses two different date formats

Essentially, there's a lot to be done before even getting into the analysis. But this is why reading in the data is so important: it gives us an idea of all the little things that need to be addressed so that we get the most accurate analysis possible. So, let's go through each of these step by step.

Step 1: Dealing with the blanks

Blank entries, more commonly known as missing values, are the most common thing you'll have to contend with in any data science or research project. While there are many ways to handle them, most of which are relatively easy to apply, the manner in which you handle them can make or break your analysis. Let me explain:

In a perfect world, we would have a completely detailed data set that clearly describes each subject according to some chosen descriptor. In real life, this rarely ever happens, and we often have no idea of the reason behind the missing entries. Could it be that the value truly doesn't exist, or was there an error in data entry? We're basically speculating at this point. What we do know is that these missing entries will prevent us from conducting key analyses such as statistical hypothesis tests (e.g., t-test, chi-square test, ANOVA) or most statistical modeling techniques (e.g., regression analysis). So, there's an obvious need to figure this out.


[Image: When it pays to do the little stuff … or if you say 'I know the promoter/manager']


Normally, the process is to either remove the segments of data with missing entries or impute them with some value. However, this can become quite problematic, as it hurts the integrity of the data by introducing bias. The data set as constructed is the most accurate available representation of the information on Netflix content, and every modification makes it a little less so. A small change here or there won't really affect the probability that your statistical analysis is correct (what we refer to as statistical power), but large-scale changes will, particularly when we need to make predictions using the available data.

There's no universally agreed-upon threshold at which missing data significantly impacts the integrity of the data and your analysis; quoted percentages range from 5% to 40%. A good rule of thumb that I follow (largely because of my epidemiology background) is that missingness of 10% or more is the cutoff at which we risk introducing bias into the data.

Representing missing entries with empty quotation marks (i.e., "") is NOT THE NORM. In fact, when you run a count of missing values, such entries show up as being present instead of missing. There is typically a reserved value used to flag a missing entry, like NA (used in R) or NaN (used in Python), or some absurd value with no real meaning: for example, if scores usually range from 1 to 100, a value like 999 could represent a null or missing value. In the case of R, I'll replace these blanks with NAs using the mutate_all() function found in the tidyverse/dplyr package.
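To see why those empty strings hide from a missing-value count, here's a quick check you can run before the replacement (director is just one example of a blank-heavy column):

sum(is.na(netflix$director))              # blanks aren't NA, so this undercounts
sum(netflix$director == "", na.rm = TRUE) # the "missing" entries show up here instead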

netflix = netflix %>% mutate_all(na_if, "")


With the blanks replaced by NA, we can get an accurate count of the missingness using the skim() function from the skimr package.


[Image: skim() output showing the missing-value counts per variable]


Here we can see that a number of variables of interest have missing entries. Considering that we'll be using this as the foundation for additional projects, we'll want to mitigate the degree of missingness. With that in mind, it's probably best to start with the variables that have the fewest missing values and work our way up.

Step 2: Imputing missing entries

Considering that the variables with the fewest missing entries (rating and date_added) are easy to research with the help of the almighty Google search, we can just manually impute the missing values in accordance with the formatting of the other entries. This is possible with the mutate() function.

For example, looking at the entries missing a rating (the content rating), we have the following:


[Image: the titles with missing rating entries]


netflix = netflix %>% 
  mutate(
    rating = ifelse(is.na(rating) & title == "13TH: A Conversation with Oprah Winfrey & Ava DuVernay", "TV-PG",
             ifelse(is.na(rating) & title == "My Honor Was Loyalty", "PG-13",
             ifelse(is.na(rating) & title == "Gargantia on the Verdurous Planet", "TV-14",
             ifelse(is.na(rating) & title == "Little Lunch", "TV-Y7",
             ifelse(is.na(rating) & title == "Louis C.K.: Live at the Comedy Store", "TV-MA",
             ifelse(is.na(rating) & title == "Louis C.K.: Hilarious", "TV-MA",
             ifelse(is.na(rating) & title == "Louis C.K. 2017", "TV-MA", rating)))))))
  )
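As an aside, if that nested ifelse() chain ever gets hard to track, the same lookups read a bit flatter with dplyr's case_when(); here's a sketch covering two of the titles above, with the rest following the same pattern:

netflix = netflix %>% 
  mutate(rating = case_when(
    is.na(rating) & title == "Little Lunch" ~ "TV-Y7",
    is.na(rating) & title == "Louis C.K. 2017" ~ "TV-MA",
    # ...the remaining titles follow the same pattern...
    TRUE ~ rating # leave every already-rated title untouched
  ))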

However, for cases where a variable has a few hundred missing entries, like country, this approach is highly inefficient. So, we'll need some big-brain strategizing to make this work. The strategy can vary quite a bit depending on your approach, which is fine so long as it can be rationalized (i.e., after enough mental gymnastics, I can lie to myself that it's valid).




In the case of the country variable, which has 507 missing entries, looking up each TV series/movie individually is not efficient. Instead, I'll be using other corresponding variables to help fill in these missing entries (a sketch of a few of these rules follows the list). These include:

  • GENRE:

    • "South Korea" for rows with "Korean TV Shows" listed as a genre, or
    • "United Kingdom" for rows with "British TV Shows" listed as a genre
  • TITLE:

    • imputing the country for well-known film or TV franchises: "Monty Python", for instance, is quintessentially English and would likely have "United Kingdom" listed as the country of origin
    • certain titles are duplicated because of language dubbing, which we can use to select the country of origin, as with titles containing "Tamil" or "Hindi" = "India"
  • CAST and DIRECTOR:

    • While not always the case, certain actors are known for their presence in a particular country's cinema. For instance, I'm not likely to find Salman Khan in any sort of project outside of Bollywood, so I can list that TV series or movie as having "India" as the country of origin
    • I've also used notable naming conventions from certain countries to help pin down the country of origin, such as the names "Aoi" or "Sasaki" for "Japan" or "Singh" for "India"
    • Similarly, certain directors are known for working within a particular country's cinema, like Quentin Tarantino being exclusively Hollywood; his movies should thus be listed as "United States"
  • DESCRIPTION:

    • Using keywords pertaining to notable locations, cohorts of people, or languages to determine the country of origin
  • As for anything that is left over, I'll just list it as "Unknown".
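Here's a minimal sketch of how a few of these rules could be encoded with case_when() and str_detect() from the tidyverse; the real rule set is much longer, and the patterns below are just the examples named in the list above:

netflix = netflix %>% 
  mutate(country = case_when(
    !is.na(country) ~ country,                                # keep anything already filled in
    str_detect(listed_in, "Korean TV Shows") ~ "South Korea", # genre-based rules
    str_detect(listed_in, "British TV Shows") ~ "United Kingdom",
    str_detect(title, "Tamil|Hindi") ~ "India",               # title-based rules
    str_detect(cast, "Salman Khan") ~ "India",                # cast-based rules
    # ...the description- and director-based rules would slot in here...
    TRUE ~ "Unknown"                                          # whatever is left over
  ))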

Handling unknown directors and cast members is not as convenient as the above process. It's entirely possible that there is no listed cast or director because of the nature of the content (e.g., a documentary or reality TV show), in which case the absence makes sense. As such, we'll just impute a term indicating that this is unknown or non-existent. Since this is less a matter of imputing data than of labeling missing entries, we're basically fine in terms of keeping the integrity of the data.

netflix = netflix %>% 
           mutate(
             director = ifelse(is.na(director), "Unknown/No Director(s)", director), 
             cast = ifelse(is.na(cast), "Unknown/No Cast", cast)
           )

Step 3: Separating out the text

This step actually isn't too bad. It's just a matter of appropriately using the piping capability afforded by the magrittr package found in the tidyverse, along with the separate() and pivot_longer() functions, which split the text apart and stack the separated values for a given row, respectively.

Now, as previously mentioned, the entries in these chained text variables follow the casting and directing hierarchy, where the lead role or lead director is the first name listed. As such, we can separate each name out and store the initial cast or director name as the "headlining actor/actress" or "lead director", respectively. Similarly, for the other chained text variables with multiple entries, like genre and country, the same hierarchy convention applies, with the principal through tertiary listings appearing in order.

NOTE: In the case of genre and country, splitting the text will leave some extra whitespace that needs to be trimmed off. This is accomplished using the str_trim() function.

# Process for splitting up the cast

netflix_cast_split = netflix %>% 
           separate(
             cast, 
             into = c("headliner", paste("cast member", 1:49)), # generates "cast member 1" through "cast member 49"
             sep = ", ", 
             fill = "right" # pads shorter cast lists with NA instead of raising warnings
           ) %>% 
           pivot_longer(headliner:`cast member 49`, names_to = "cast_type", values_to = "cast") %>% 
           filter(!is.na(cast)) %>% 
           mutate(cast_type = ifelse(cast_type == "headliner", "headliner", "supporting cast"))


# repeat for director, genre, and country
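To give an idea of what that repetition looks like, here's roughly what the genre version could be; netflix_genre_split is a name I'm making up for illustration, and genres hold far fewer entries per title than the cast does (three columns should cover this data set, but widen the into vector if separate() warns about discarded pieces):

netflix_genre_split = netflix %>% 
           separate(
             listed_in, 
             into = c("main genre", "genre 2", "genre 3"), # genres top out at a few per title
             sep = ",", 
             fill = "right" # pads titles with fewer genres using NA
           ) %>% 
           pivot_longer(`main genre`:`genre 3`, names_to = "genre_rank", values_to = "genre") %>% 
           filter(!is.na(genre)) %>% 
           mutate(genre = str_trim(genre, side = "both")) # trims the leftover whitespace mentioned in the note above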

A special scenario arises with the description column. The process here is a bit more nuanced, as we'll have to deal with a bunch of words that hold no real informational value, known as stop words. Additionally, I'll also need to contend with punctuation marks, symbols, and non-ASCII characters. While this seems like a lot, it's actually pretty easy if we use a few key functions, as shown below:

# Substitute symbols and marks for blanks
netflix_description = netflix %>% 
mutate(description = gsub('[\\,.;:!?"]', "", description)) 

# converting non-ASCII 

netflix_description = netflix_description %>%
	mutate(description = stringi::stri_trans_general(description, "latin-ascii"))

netflix_description$original_description = netflix_description$description # keep the original to compare to

netflix_description = netflix_description %>% 
	unnest_tokens(
		output = word,
		input = description # splits the description into one word per row
	) %>% 
	anti_join(
		stop_words,
		by = "word" # returns only the rows whose word isn't in the stop_words list
	)

# rename the columns (unnest_tokens replaced description with word, which becomes keywords)
colnames(netflix_description) = c("show_id", "type", "title", "director", "cast", "country", "date_added", "release_year", "rating", "duration", "listed_in", "original_description", "keywords")
         
# trim whitespace left over from splitting the description words
netflix_description$keywords = str_trim(netflix_description$keywords, side = "both")
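As a quick sanity check on the tokenization, counting the most frequent keywords will surface any stray symbols or stop words that slipped through:

# peek at the ten most frequent keywords
netflix_description %>% 
	count(keywords, sort = TRUE) %>% 
	head(10)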

Step 4: Creating the necessary data sets

As I've previously mentioned, this whole data cleaning and wrangling process will serve as the basis not only for this project's exploratory analysis but for the other, more complex projects as well. So it makes sense to get some of those preliminary steps out of the way now rather than later, when we'll need to devote our attention to more complex things. In this case, we'll need to create multiple individual data frames split up by groups within certain variables.

In my case, I’ll create data sets with the following separations:

  1. Keywords in description & Genre
  2. Keywords in description & Language
  3. Director & Cast
  4. Director & Genre
  5. Director & Language
  6. Cast & Genre
  7. Cast & Language
  8. Language & Genre
  9. Director, Cast & Language
  10. Director, Cast & Genre
  11. Director, Language & Genre
  12. Cast, Language & Genre
  13. Cast, Language, Director & Genre

I'm not going to show the entire process here, but essentially you'll reuse the same long-text splitting approach from above, applying the separate() and pivot_longer() functions to the existing data set. It'll look something along these lines:

#  KEYWORDS IN DESCRIPTION and LANGUAGE

netflix_descriptionxlanguage = netflix_description %>% 
  separate(
    country, c("main country", "secondary country", "tertiary country", "fourth country", "fifth country", "sixth country", 
    "seventh country", "eighth country", "ninth country", "tenth country", "eleventh country", "twelfth country"), sep = ",") %>% 
  pivot_longer(`main country`:`twelfth country`, names_to = "country_type", values_to = "country_name") %>% 
  filter(!is.na(country_name)) %>% 
  mutate(country_type = ifelse(country_type == "main country", "main country", "other country"), country_name = ifelse(country_name == "", "Unknown Country", country_name))
  
netflix_descriptionxlanguage$country_name = str_trim(netflix_descriptionxlanguage$country_name, side = "both")

netflix_descriptionxlanguage = netflix_descriptionxlanguage %>% 
  mutate(
    english_or_not = ifelse(
      country_type == "main country",
      ifelse(country_name %in% c("United States", "United Kingdom", "Canada", "New Zealand", "Australia",
                                 "Ireland", "Jamaica", "Barbados", "Belize"),
             "English Speaking", "Non-English Speaking"),
      NA # secondary countries don't determine the language classification
    )
  ) %>% 
  filter(!is.na(english_or_not)) # keep only the classified main-country rows

# Repeat this process for the other 12
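Since each title now repeats once per keyword and per country row, collapsing back to distinct titles gives a quick tally of the language split, which doubles as a sanity check for question 2 below:

netflix_descriptionxlanguage %>% 
  distinct(show_id, english_or_not) %>% # one row per title and language group
  count(english_or_not)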

Boom, the data is now ready for some exploration. If you want to check out the entire clean-up sequence, you can find it here.


EXPLORATORY ANALYSIS

Time to answer those questions. Since the code can get pretty wild here, you can check out the code used to make these visualizations here.


1A) What’s the breakdown between TV series and movies?

We can see that the majority of the content on Netflix consists of movies (~69%) rather than TV series.

[Chart: breakdown of Netflix content by type, movies vs. TV series]
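For reference, the proportions behind this chart boil down to a one-liner (the actual chart code is linked above):

netflix %>% 
  count(type) %>%           # rows per content type
  mutate(prop = n / sum(n)) # share of each type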


1B) How does the breakdown differ with regard to genre?

Digging a bit deeper by genre, the majority of the content appears to be international; however, much of that is actually a secondary listing. In terms of primary categorization, most of the movie content on Netflix consists principally of dramas followed by comedies, whilst for TV series it's mainly action and adventure.


2) What’s the breakdown between English-Speaking and non-English-speaking content?

Interestingly enough, there is about a 50:50 split in terms of English and non-English content.

[Chart: English-speaking vs. non-English-speaking content split]


3) What’s the distribution of Netflix content in terms of content ratings?

In terms of content rating, the majority of titles carry a Mature (17+) rating.


[Chart: distribution of Netflix content by content rating]


4) Which actors/actresses/directors are credited with the most headlining English-speaking or non-English-speaking roles on Netflix?

Amongst the cast, we see that Shah Rukh Khan had the most credits as a lead across Netflix content overall. This was also the case for non-English-speaking content. However, in terms of English-speaking content, Adam Sandler had the most leading credits.


[Chart: actors/actresses with the most headlining credits]


Looking at lead directors, Raul Campos had the most directing credits for Netflix content overall, as well as for non-English-speaking content. For English-speaking content, Marcus Raboy had the most lead directing credits.


[Chart: directors with the most lead directing credits]


5) What are some of the most common terms used to describe English-speaking and non-English-speaking Netflix content?

Lastly, here's a quick look at some of the top keywords found in the descriptions of English-speaking content, with the help of word clouds. It appears that "life", "family", "world", and "documentary" are the most common words.


[Word cloud: top keywords in English-speaking content descriptions]
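The full plotting code is linked above, but each cloud roughly boils down to filtering by language group, counting keywords, and handing the counts to wordcloud2(), something like:

netflix_descriptionxlanguage %>% 
  filter(english_or_not == "English Speaking") %>% 
  count(keywords, sort = TRUE) %>% # word frequencies, highest first
  wordcloud2()                     # wordcloud2 treats the first two columns as word and frequency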


As for non-English-speaking content on Netflix, the most common terms found in the descriptions are "life", "woman", "family", and "love".


[Word cloud: top keywords in non-English-speaking content descriptions]


RECAP

Overall, this was a fairly straightforward look into Netflix data, appropriate for a Tier-1 data project, as it only took a few hours to a day to complete (most of which went into the cleaning process). From data cleaning and wrangling to exploratory analysis, every aspect of the process relied on the same foundational skills we built up early on the data science path.

Sure, there may be some things you were unfamiliar with, particularly around text data, but those can easily be reviewed. Everything else should feel fairly familiar in one way or another. Obviously, we could dive a lot deeper into this exploratory analysis by examining the interrelationships of the above comparisons with an interaction factor, say, the breakdown of Netflix content by content type and content rating.

So, what’s the next step? Well, I’m going to step this up a bit by introducing some more advanced techniques to make better use of this data. How exactly will I be doing this? You’re just going to have to wait and see. So, keep an eye out for the next article.

If you're interested in checking out some of my other projects, head over to my GitHub. Alternatively, if you've got an idea for a collaborative project or just want to connect, hit me up on LinkedIn.

Thanks for the read.


Great job on your article, Michael! :star2: Detailed step-by-step data cleaning and analysis, and cool visualizations! I'm not familiar with R yet, but I'm planning to start learning it soon; it seems to have huge potential, just like Python. Also, I'm glad to see that you found word clouds useful; they're really amazing plots: simple and intuitively understandable to any audience. Thanks for sharing your work with us! :star_struck:


Thanks, @Elena_Kosourova. I was trying to figure out some new ways to show off keywords that weren't a pie chart or bar chart and remembered word clouds. A much better upgrade. There's definitely cooler stuff to come in a few weeks, so keep an eye out.

I'm also in the same boat, but going from R to Python, and honestly, it isn't too bad to switch between the two. The biggest things to get over were the syntax quirks that mess you up from time to time and the workflow for some of the machine learning stuff.


Well done, Michael! A very detailed project with nice visualizations. For me, it was a surprise to find out about the 50:50 breakdown between English and non-English content. I had expected it to be skewed towards non-English content (just think about the number of movies made by Bollywood).

However, I'd say you could have made the charts bigger (especially the axis labels), because they're too small to read on smaller screens (mine's 10 inches).

Happy coding and good luck with your next Netflix projects :grinning_face_with_smiling_eyes:


Thanks, @artur.sannikov96. Fair enough about those charts. I'm just so used to reading tiny text that they seemed fine to me, but I totally get your point. I'll definitely take it into consideration for the other projects using this data set.

As for the split between English and non-English, it could very well be more skewed towards non-English content if you used a different means of delineating language than I did. But considering that it's easier to get domestic distribution rights than international ones, I wasn't too surprised by the findings.
