Hiya my fellow Data Questers! If you have been in this community long enough, you may have noticed me popping up from time to time answering some posts with the occasional overboard response. This time I figured I would give a shot at writing an article.
No matter which milestone we are at in our learning journey, we all aim to improve on and outperform the previous one. Sometimes building on previous projects, or using someone else’s work as a foundation, can help us enhance our understanding. In fact, I have bookmarked a few projects from within the community to take inspiration from.
While it’s a joy to read up on the “how-to” of a project, it’s a real struggle to understand and appreciate it when it has been done in an unfamiliar language or with an approach that we haven’t come across before. One such project is a recent one that explored Netflix data using R. I figured I’d do an official “remake”, but with Python, using several familiar libraries: NumPy, pandas, Matplotlib, etc.
Without further ado, let’s get started.
Let’s do a table read
Okay, so this is a pretty standard step - open an IDE, read in the data set, store it in a data frame, apply the “.info()” and “.describe()” methods and take a look at what we have to work with. After that? Unsurprisingly, at this point, I would freeze - happens 99% of the time! So anywho, this is what the Netflix data set looks like.
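For anyone following along, that standard first step might look roughly like the sketch below. The tiny in-memory CSV is a made-up stand-in so the snippet runs on its own, and load_titles is just a hypothetical helper name; with the real data you would pass the path to the downloaded CSV instead.

```python
import io

import pandas as pd

# Made-up stand-in for the real file; the column names follow the article,
# the rows are invented purely for illustration.
SAMPLE_CSV = """show_id,type,title,director,release_year
s1,Movie,Example Movie,Jane Doe,2019
s2,TV Show,Example Show,,2021
"""


def load_titles(source):
    """Read the CSV into a data frame (source can be a path or a file-like object)."""
    return pd.read_csv(source)


df = load_titles(io.StringIO(SAMPLE_CSV))  # swap in the real CSV path here
df.info()             # column dtypes and non-null counts
print(df.describe())  # summary statistics for the numeric column(s)
```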
Looking at the dataset, we can see that there are some columns with missing values. Taking a closer look at the sole numerical column in the data set, “release_year”, we see the following:
We have titles as old as time to as new as 2021. Aside from missing values, we’d have to check if there are any duplicate entries. Fortunately, we don’t have to deal with any duplicate data.
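A quick way to run both checks (the range of release years and the duplicate count) is sketched here on a toy data frame; the rows are invented, so the numbers only demonstrate the calls.

```python
import pandas as pd

# Toy data: every row differs in at least one column, so no duplicates.
df = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "title": ["A", "B", "B"],
    "release_year": [1925, 2020, 2021],
})

# duplicated() flags rows that repeat an earlier row across ALL columns.
n_duplicates = int(df.duplicated().sum())

# Oldest and newest release year in the data.
year_range = (int(df["release_year"].min()), int(df["release_year"].max()))

print(n_duplicates, year_range)
```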
The screenplay needs some work
It’s cliché, I know, but as with any real-world data set, this one needs a serious makeover before it can work. Here’s what we’re working with.
As we can see, we’ll need to deal with the following:
- Non-English characters in the titles
- Dates in string format
- Many of the key variables are stored as long text fields holding comma-separated lists (such as director, cast, etc.)
- The release year appears to mean different things for TV series (the year of the latest season instead of the actual year of release) and movies (the actual release year)
- To top it off, this is the extent of the missing values we’d need to contend with:
Well, time to roll up our sleeves and get the hard work out of the way.
Rewrites, rewrites, some more rewrites
Since there’s quite a bit to do, it’s best to get some of the easy stuff out of the way. The simplest of all of the above plot holes is the conversion of the dates into datetime objects. In this case, a simple pd.to_datetime() call applied to the date_added column (which provides the date when the title was added to the Netflix database) will suffice.
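As a minimal sketch of that conversion: the sample strings and the "%B %d, %Y" format below are assumptions about how the dates are written, so adjust the format (or drop it and let pandas infer) to match the actual column.

```python
import pandas as pd

# Made-up sample values, including stray whitespace and a missing entry.
df = pd.DataFrame(
    {"date_added": ["September 25, 2021", " August 14, 2020", None]}
)

# Strip whitespace first; errors="coerce" turns anything unparseable into NaT
# instead of raising an exception.
df["date_added"] = pd.to_datetime(
    df["date_added"].str.strip(),
    format="%B %d, %Y",  # assumed format, e.g. "September 25, 2021"
    errors="coerce",
)

print(df["date_added"])
```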
Next up, we have non-English characters in the titles. While it’s great to see this kind of diversity and inclusion from Netflix, from a data analysis perspective, it’s a pain to deal with. As I don’t really have a solution here, I’ll leave them be.
For the release year column as well, we are moving ahead with the year specified in the dataset.
Now all that’s left is deciding what to do with the missing data. There are two major approaches we can take:
- Remove entries with missing values.
- Replace the missing values with the actual values or estimate/impute them.
In either approach, one factor plays a key role: the threshold. Whether we decide to remove the entries altogether or to impute them depends on exactly how much data we are missing. It may not be a problem with a few rows, but if it’s a big enough segment, we’ll be affecting our ability to generalize our findings, since the data set moves even further away from an actual representation of the “truth”. This will impact us later on, if and when we apply this data to more complex processes like generating predictions with machine learning techniques.
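To put a number on “how much is missing”, one option is to compute the share of missing values per column and compare it against a cut-off. The sketch below uses invented data, and the 5% threshold is purely a hypothetical value for illustration, not one used in this project.

```python
import pandas as pd


def missing_share(df):
    """Percentage of missing values per column, largest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)


# Toy data frame with gaps in two columns.
df = pd.DataFrame({
    "director": ["A", None, None, "B"],
    "cast": ["x", "y", None, "z"],
    "title": ["t1", "t2", "t3", "t4"],
})

shares = missing_share(df)

THRESHOLD = 5.0  # hypothetical cut-off, in percent
to_impute = shares[shares > THRESHOLD].index.tolist()

print(shares)
print(to_impute)
```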
Of the columns with missing data, the rating column (how suitable the content is for an audience) looks to have the least amount missing. Since it isn’t too much work, we can just do a quick search and fill in those missing entries.
For the date_added column, there’s a similar number of missing entries as with rating. However, we don’t have a reliable resource available to find this information, and imputing these values would do more harm than good. Additionally, the missing values only constitute ~0.01% of the total data set. Hence, removing these rows won’t have a significant impact on our analysis.
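Dropping only the rows where date_added is missing, while leaving gaps in other columns alone, could be done along these lines (toy data again):

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "date_added": ["2021-01-01", None, "2020-05-05"],
})

# subset= limits the check to date_added, so rows with gaps elsewhere survive.
cleaned = df.dropna(subset=["date_added"]).reset_index(drop=True)

print(cleaned)
```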
For the country and cast columns, the missing values make up less than 10% of the data set. But it’s the director column that has 30% missing entries. Removing these rows would significantly reduce the power of our analysis, eventually reducing the performance of any machine learning algorithm(s).
The other approach could be to ignore this column altogether if we assume that it would not be important enough for our purpose, which in our scenario isn’t the case. The bottom line is we’ll need to find a solution to fill in these missing entries.
FINDING NEMO, FINDING DORY
To find and impute these missing values, let’s get help from a reputable source: IMDB. It offers several datasets containing various information about a title, along with the data definitions. Below is a flow chart of the process used to perform imputation for the missing directors and cast members. For the entire code work, check out the IPython notebook here.
NOTE: The dark headers represent the data frame/dataset names and the coloured/ no-background filled cells represent the field/column names also represented as [field name].
This may look a little much, but the process essentially comes down to this:
Divide the Netflix data set into two separate data frames:
- One without missing values in the entire row (aka. non_null_data)
- One containing rows with missing values (aka. missing_data)
Utilize a specific “IMDB_work.py” script to extract, read and store the relevant IMDB data sets. This module just has one function that serves to return a Pandas data frame comprising records from the IMDB data sets.
For both the Netflix and IMDB data sets, the title column was converted to lower case and stripped of any non-ASCII characters, then combined with the corresponding year column.
Using the unique combinations of the content title and the start/release year, the Netflix data frame with the missing entries was compared with the IMDB data set.
The combinations that match allow us to use the unique identifier [tconst] in the IMDB dataset.
The unique identifier then helps us find the unique identifiers for the director(s) [directors] and cast members [nconst] associated with that title, keeping only the cast members listed as actresses or actors [category].
Once we have the IDs, we can look up the names [primaryName] of the actors/actresses and director(s) for a given title, which are then used to replace the missing entries in the Netflix data frame (missing_data).
Lastly, we just merge the two data frames back together.
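The steps above can be sketched end to end on toy data. This is only an illustration for the director column, not the notebook's actual code: the tiny frames stand in for the IMDB data sets, primaryTitle and startYear are assumed column names, and make_key is a hypothetical helper.

```python
import pandas as pd


def make_key(title, year):
    """Lower-case the title, drop non-ASCII characters, and append the year."""
    clean = title.lower().encode("ascii", "ignore").decode("ascii").strip()
    return f"{clean}_{int(year)}"


# Toy stand-ins for the Netflix data and the relevant IMDB tables.
netflix = pd.DataFrame({
    "title": ["Example Movie", "Café Story", "Other Film"],
    "release_year": [2019, 2020, 2018],
    "director": ["Jane Doe", None, None],
})
basics = pd.DataFrame(
    {"tconst": ["tt001"], "primaryTitle": ["Café Story"], "startYear": [2020]}
)
crew = pd.DataFrame({"tconst": ["tt001"], "directors": ["nm001,nm002"]})
names = pd.DataFrame(
    {"nconst": ["nm001", "nm002"], "primaryName": ["Ann Lee", "Bo Kim"]}
)

# 1) Split into complete rows and rows with at least one missing value.
missing_mask = netflix.isna().any(axis=1)
non_null_data = netflix[~missing_mask]
missing_data = netflix[missing_mask].copy()

# 2) Build the title + year key on both sides.
missing_data["key"] = [
    make_key(t, y)
    for t, y in zip(missing_data["title"], missing_data["release_year"])
]
basics["key"] = [
    make_key(t, y) for t, y in zip(basics["primaryTitle"], basics["startYear"])
]

# 3) Matching keys give us the IMDB identifier (tconst).
matched = missing_data.merge(basics[["key", "tconst"]], on="key", how="left")

# 4) tconst -> director ids -> director names, joined per title.
crew_long = crew.assign(nconst=crew["directors"].str.split(",")).explode("nconst")
director_names = (
    crew_long.merge(names, on="nconst")
    .groupby("tconst")["primaryName"]
    .agg(", ".join)
)

# 5) Fill the gaps; anything still unmatched becomes "Unknown".
looked_up = matched["tconst"].map(director_names)
matched["director"] = matched["director"].fillna(looked_up).fillna("Unknown")

# 6) Merge the two data frames back together.
result = pd.concat(
    [non_null_data, matched.drop(columns=["key", "tconst"])],
    ignore_index=True,
)
print(result)
```

The cast column would follow the same pattern, with an extra filter on [category] before looking up the names.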
It is important to mention that there are some cases where the same titles appear multiple times in the data set, but are listed under different years/countries (likely as a result of just sharing the same title or being remakes/reboots of a film or TV series). Since we may not reach an efficient solution to work with these titles with the data at hand (including IMDB), I chose to leave them as having an “Unknown” director or cast as applicable.
As for the country column, the IMDB datasets do not appear to be useful in addressing this. However, the IMDB webpage for any given title seems to have information on the country of origin listed. For example, looking at my favourite movie, The Dark Knight, we can see both the United States and the United Kingdom as the countries of origin.
It looks like we’ll have to change our approach by using another useful skill: web scraping! The key to extracting the above-highlighted data is again the unique ID for a title in the IMDB dataset, which also appears in the title’s IMDB webpage URL.
The workflow for this process can be summarized as:
And we’ll need to use this function to achieve this:
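The original function isn't reproduced here, so below is a rough stand-in using only the standard library. It builds the page URL from a title ID and pulls country names out of the HTML. The assumption that the countries appear as links whose href contains "country_of_origin=" is mine and may not match the live page markup; CountryParser, extract_countries and title_url are hypothetical names, and actually fetching the page is left out.

```python
from html.parser import HTMLParser

IMDB_TITLE_URL = "https://www.imdb.com/title/{tconst}/"


class CountryParser(HTMLParser):
    """Collect the text of links whose href mentions country_of_origin."""

    def __init__(self):
        super().__init__()
        self.countries = []
        self._in_country_link = False

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        # Assumed markup: country links carry this query parameter.
        if tag == "a" and "country_of_origin=" in href:
            self._in_country_link = True

    def handle_data(self, data):
        if self._in_country_link and data.strip():
            self.countries.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_country_link = False


def title_url(tconst):
    """Build the page URL for a given IMDB title identifier."""
    return IMDB_TITLE_URL.format(tconst=tconst)


def extract_countries(html):
    """Return the country names found in a page's HTML."""
    parser = CountryParser()
    parser.feed(html)
    return parser.countries


# Hand-written HTML snippet standing in for a downloaded page.
sample_html = """
<ul>
  <li><a href="/search/title/?country_of_origin=US">United States</a></li>
  <li><a href="/search/title/?country_of_origin=GB">United Kingdom</a></li>
  <li><a href="/title/tt0000001/">Unrelated link</a></li>
</ul>
"""
countries = extract_countries(sample_html)
print(title_url("tt0000001"), countries)
```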
For some leftover entries, I filled in the missing values by filtering on keywords in the description, title, and/or genre (i.e. the listed_in column) that indicate the country of origin (for example, “British” for the United Kingdom or “Korean” for South Korea). The only exception is “Spanish”, as too many countries use Spanish as a native language to tell them apart, so a bit of manual work was done for a limited number of titles. The rest of the entries have been marked as “Unknown”.
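A keyword-based fill like this can be sketched as follows. The two keywords come from the examples in the text; the mapping, the guess_country helper and the sample rows are all illustrative rather than the exact rules used.

```python
import pandas as pd

# Keyword -> country mapping; only the two examples from the text are included.
KEYWORD_TO_COUNTRY = {"british": "United Kingdom", "korean": "South Korea"}


def guess_country(row):
    """Return a country if a keyword appears in the title, description or genre."""
    text = " ".join(
        str(row[col]) for col in ("title", "description", "listed_in")
    ).lower()
    for keyword, country in KEYWORD_TO_COUNTRY.items():
        if keyword in text:
            return country
    return "Unknown"


# Made-up rows with a missing country.
df = pd.DataFrame({
    "title": ["Show A", "Show B", "Show C"],
    "description": [
        "A British detective returns home.",
        "Two friends open a cafe.",
        "A family drama.",
    ],
    "listed_in": ["Crime TV Shows", "Korean TV Shows", "Dramas"],
    "country": [None, None, None],
})

# Only touch the rows where the country is still missing.
needs_fill = df["country"].isna()
df.loc[needs_fill, "country"] = df[needs_fill].apply(guess_country, axis=1)

print(df[["title", "country"]])
```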
After all of that, our final tally shows that we have brought the missing values for the director column down to close to 20%. There’s not much of a change for the cast and country columns, but hey, at least we utilized web scraping!
We’re onto principal photography
Now that we’re as ready to go as we’ll ever be, it’s time to do some in-depth exploration.
Looking at the sort of content that Netflix has to offer, we see that Netflix offers twice as many movies as there are TV series. With Netflix already pushing towards its own produced content - both shows and movies, we might see a change in this ratio.
Looking broadly at the content type, we see that the majority is rated for either mature audiences or 14+.
NOTE: the percentages have been rounded to the nearest integer
Since Netflix is such a mixed bag in terms of content, we would expect to see a wide range in terms of the original release date for some of the content. However, a closer look seems to show that much of the content that has been played on Netflix is from the 2010s. Interestingly enough, we see substantial growth in terms of the content offered from its early days prior to 2016 until 2020 (however, this might be due to how the dataset was curated/collected and not necessarily how Netflix decided to add them).
Despite this, it seems that most of the TV series content available on Netflix appears to be short-lived series with the vast majority only lasting a single season. Not exactly binge-worthy, so much as being left with a ton of cliffhangers without a real end to the story.
Moving on from TV Shows, the movie content on Netflix seems to stick with the conventional length of approx. 1.5 hours.
And, the award goes to
Since we applied a different approach to handling missing data compared to the previous work, it would not be surprising to see some differences. Nevertheless, let’s see how our findings stack up in terms of the following questions:
1) How do TV series and movies break down in terms of genre?
Normally, a genre would explain the nature of the content (i.e. drama, comedy, etc.). So, for something like “International movie/TV”, it wouldn’t seem right to be labelled as a genre. Also, as “international” is more of a relative term, I decided to leave it out as a genre option. Instead, I split the breakdown further to differentiate the genres between the domestic and international markets. Here, international may correspond to titles that were simultaneously released in multiple countries and are not necessarily specific to the US.
For international TV content, dramas appear to be the most prevalent genre followed by romantic content. However, for domestic content, the vast majority appear to be kid-related content with comedies following right after.
As for movie-related content, both dramas and comedies make up the majority of the content for both domestic and international content. However, there appears to be more diversity on the domestic front with greater content for independent films as well as children/family-oriented movies.
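For reference, breaking the comma-separated listed_in column into one genre per row and leaving out the “International” labels might look like this. The rows are invented, and the sketch covers only the genre split, not the domestic/international grouping.

```python
import pandas as pd

df = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show"],
    "listed_in": [
        "Dramas, International Movies",
        "Comedies, Dramas",
        "International TV Shows, TV Dramas",
    ],
})

# One row per (title, genre) pair.
genres = df.assign(genre=df["listed_in"].str.split(", ")).explode("genre")

# Leave out the "International ..." labels, which aren't really genres.
genres = genres[~genres["genre"].str.contains("International")]

# Genre counts within each content type.
counts = genres.groupby("type")["genre"].value_counts()
print(counts)
```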
2) What’s the split between English and Non-English content on Netflix?
Despite using a different approach for imputing missing values for the country column, we still managed to get similar results.
3) Who are the top actors/actresses and directors for English and Non-English content?
It looks like all the top five artists for non-English content are from India, with Anupam Kher leading with the highest number of titles. However, in terms of English content, the great John Cleese appears to be the leading credited actor.
In terms of directors, Jan Suter and Raul Campos were the ones with the most directing credits for non-English content. However, for English-related content, it’s a tie between Marcus Raboy and Jay Karas.
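Counting credits per person comes down to splitting the comma-separated names and tallying them. Here is a small sketch with placeholder names; the same pattern would apply to the director column.

```python
import pandas as pd

# Placeholder names; None stands for a title with no cast listed.
df = pd.DataFrame({
    "cast": ["Actor A, Actor B", "Actor A", None, "Actor B, Actor C, Actor A"]
})

top_cast = (
    df["cast"]
    .dropna()          # skip titles without cast information
    .str.split(", ")   # one list of names per title
    .explode()         # one row per credited name
    .str.strip()
    .value_counts()    # number of titles per name, largest first
    .head(2)
)
print(top_cast)
```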
4) What are some of the most common terms used in the description between English and Non-English Content?
The differences aren’t that prominent when it comes to the most common descriptors of a title’s plot between English and Non-English content. They both seem to focus on life, family, love, friend and, for some reason, always trying to “find” (something). In order to find out what they are searching for, I guess we will have to Netflix & Chill…
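A bare-bones way to get at the most common terms is a word count that skips filler words. The stop-word list and sample descriptions below are made up for illustration, and top_terms is a hypothetical helper; a fuller analysis would likely use a proper stop-word list.

```python
import re
from collections import Counter

# Deliberately small, hand-picked stop-word list (illustrative only).
STOP_WORDS = {
    "a", "an", "the", "and", "to", "of", "in", "her", "his", "for", "with", "must",
}


def top_terms(descriptions, n=3):
    """Return the n most frequent non-stop-words across all descriptions."""
    counts = Counter()
    for text in descriptions:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts.most_common(n)


sample = [
    "A woman must find her family.",
    "Two friends find love in the city.",
    "A family learns to love life.",
]
print(top_terms(sample))
```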
All in all, this project wasn’t too daunting. It mainly highlighted the task of data cleaning and wrangling to implement it to the best extent possible. Can’t say it’s something new, but hopefully, you can see its importance for future works and projects.
Going through the code, you may come across various questions, doubts, mistakes and ideas. I did too, and perhaps still have some. Data science projects, especially unguided ones, often feel overwhelming at the start, especially when you see others’ work! There are plenty of resources out there to help overcome that initial barrier, like this blog here.
I’m sure you may find some more room for improvements or maybe a few modifications. Maybe you would like to check out different ways to connect with other data sources and not just IMDB. This is just one path amongst endless possibilities that you can take and I encourage you to try some of them.
That being said, this is not the end of the show! We have just started. So, what’s next? As of now even I don’t know…. Until then Data-Science and Code!
Thanks for reading.