
Introduction to Plotly Data Viz Library: Netflix Dataset

In this digital era, data is everywhere: every link you click and every video you watch creates a new record. But can you turn that data into useful insights? Let’s make some interactive charts with Plotly, an awesome Python library. Its modern aesthetics hook readers immediately, and its charts are quite easy to implement.

What is Plotly

Plotly is an open-source graphing library for Python and R, built on top of the plotly.js JavaScript library, which makes its graphs inherently interactive. It provides numerous chart types, and we can also build dashboards with it. Plotly’s popularity keeps growing as more and more people prefer it. It provides charts in 2D and 3D, and it also supports animated charts. Isn’t that cool!!
One additional feature is that we can embed Plotly charts in blogs, websites, and articles, which can help businesses improve engagement.

Examples of Plotly

Let’s see some of the mind-blowing charts created with Plotly. These are just a few images; later we will see how to make charts like these.


waterfall chart in Plotly


Infographics Plotly Source: Reality check for DS, ML, RS

Dashboards with Plotly

Dashboards built with Plotly have interactive features and callbacks, which help create an amazing dashboard.
I have made an interactive dashboard that gives a basic overview of which product categories customers prefer and what the satisfaction index is, along with a sentiment analysis compared across metrics like Age and Division.
Do take a look here

Let’s see how to make out-of-the-box visuals with Plotly.

Installing Plotly

Installing the Plotly library

# pip
!pip install plotly

Installing the Plotly library through conda

# anaconda
conda install -c anaconda plotly

Importing the library

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

Now let’s import the data. We are going to use the Netflix dataset for the exploration of data with Plotly.

Netflix, Inc. is an American technology and media services provider and production company headquartered in Los Gatos, California. It was founded in 1997 by Reed Hastings and Marc Randolph in Scotts Valley, California. The company’s primary business is its subscription-based streaming service, which offers online streaming of a library of films and television series, including those produced in-house.

Netflix is one of the most popular entertainment services, used by people of all ages around the world.
We use the dataset of TV Shows and Movies listed on Netflix, from Kaggle.

df = pd.read_csv(r'D:netflix_titles.csv')
df.head(3)

Description of the columns
This dataset contains data on the TV shows and movies on Netflix, covering the years 2008 to 2021.

  • type: Either TV Show or Movie (the only two unique values)
  • title: The title of the Movie or TV Show
  • director: The director of the Movie or TV Show
  • cast: The cast who play roles in the Movie or TV Show
  • release_year: The year when the Movie or TV Show was released
  • rating: The maturity rating of the Movie or TV Show (e.g. whether it is suitable for children, teens, or adults only)
  • duration: The duration of the Movie or TV Show
  • listed_in: The genre of the Movie or TV Show
  • description: The description of the Movie or TV Show

Data Cleaning

Let’s first check whether the data contains null values

df.isnull().sum()

Ohh… there are many columns that need to be cleaned before the visualization.

Let’s drop the rows that have no cast or director information, since we are not going to use them in the visualization process.

df = df.dropna(how='any',subset=['cast', 'director'])

There are still null values in other columns, so let’s drop the remaining rows that contain them

df = df.dropna()
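
To make the two cleaning calls above concrete, here is a minimal sketch on a made-up three-row frame. The `toy` data is hypothetical and only stands in for the Netflix file.

```python
import pandas as pd

# Hypothetical toy data, standing in for the Netflix file
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["X", None, "Z"],
    "cast": ["P", "Q", None],
})

# Count the missing values in each column
null_counts = toy.isnull().sum()

# Keep only the rows where both 'cast' and 'director' are present
cleaned = toy.dropna(how="any", subset=["cast", "director"])

print(null_counts)
print(cleaned)
```

Only the first row survives, because each of the other two rows is missing one of the two required columns.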

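Now, let’s convert some columns into proper datetime format

# strip stray whitespace first; values with leading spaces can fail to parse on newer pandas versions
df["date_added"] = pd.to_datetime(df['date_added'].str.strip())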
df['year_added'] = df['date_added'].dt.year
df['month_added'] = df['date_added'].dt.month

If we look at the duration column, many of its values are season counts rather than running times, so let’s split them into a separate column.

df['season_count'] = df.apply(lambda x : x['duration'].split(" ")[0] if "Season" in x['duration'] else "", axis = 1)
df['duration'] = df.apply(lambda x : x['duration'].split(" ")[0] if "Season" not in x['duration'] else "", axis = 1)
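
The same split can also be written with pandas string methods instead of a row-wise `apply`. This is only an alternative sketch on hypothetical sample values, not a change to the code above.

```python
import pandas as pd

# Hypothetical sample of the 'duration' column
toy = pd.DataFrame({"duration": ["90 min", "2 Seasons", "1 Season"]})

# True where the value describes seasons rather than minutes
is_season = toy["duration"].str.contains("Season")

# The leading number of each value, still as a string
number = toy["duration"].str.split(" ").str[0]

# Route the number into the matching column and leave the other one empty
toy["season_count"] = number.where(is_season, "")
toy["duration"] = number.where(~is_season, "")

print(toy)
```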

Everything looks nice, but let’s rename the listed_in column to genre, which makes more sense, and keep only the first genre listed for each title.

df = df.rename(columns={"listed_in":"genre"})
df['genre'] = df['genre'].apply(lambda x: x.split(",")[0])
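
Here is a small sketch of what the rename and split do, using two made-up `listed_in` values. The extra `.str.strip()` is an addition of mine to guard against stray spaces.

```python
import pandas as pd

# Hypothetical sample of the 'listed_in' column
toy = pd.DataFrame({"listed_in": ["Dramas, International Movies", "Comedies"]})

# Rename the column, then keep only the first genre of each title
toy = toy.rename(columns={"listed_in": "genre"})
toy["genre"] = toy["genre"].str.split(",").str[0].str.strip()

print(toy)
```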

Let’s see some common styling that we will need to make Plotly charts

Before going into the analysis, let’s see how to make easy and elegant charts with Plotly. First, we will look at Plotly Express. You can make charts like bar, histogram, pie, scatter, line, and area just by writing the chart name after px. Let’s see how it is done.

px.<chart_name>(data, x=<x-axis column>, y=<y-axis column>)

# for example (raw lists are passed without a data frame)
px.bar(x=[1, 2], y=[1, 1])

Following the above steps you will be able to make any chart, but to make charts elegant and presentable we need to add styling. Let’s take a look.

If we want to hide the whole x-axis or y-axis, we just need to add these lines of code.

fig.update_xaxes(visible=False)
fig.update_yaxes(visible=False)

If we want to hide the grid lines on the x-axis or y-axis, we just need to add these lines of code.

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)

If we want to change the x-axis or y-axis title, we just need to update the layout and set the xaxis_title or yaxis_title parameter.

fig.update_layout(xaxis_title='xaxis', yaxis_title='yaxis')

The height, width, title, and colors of the chart can be set with the height, width, title, and color_discrete_sequence parameters respectively.

# names= is a px.pie parameter, so a pie chart is used here
fig = px.pie(df, names='type', height=300, width=600,
             title='Most watched on Netflix',
             color_discrete_sequence=['#b20710', '#221f1f'])

To change the plot background, we need to set the plot_bgcolor and paper_bgcolor parameters in update_layout

fig.update_layout(plot_bgcolor='color name', paper_bgcolor='color name')

After doing everything, we can adjust the chart’s margins to make it look more presentable; it is okay even if you skip this. The margin parameters are

t=top, b=bottom, r=right, l=left.

fig.update_layout(margin=dict(t=100, b=30, l=0, r=0))

Let’s do exploratory data analysis of Netflix: with Netflix styles!

fig_donut = px.pie(df, names='type', height=300, width=600, hole=0.7,
                   title='Most watched on Netflix',
                   color_discrete_sequence=['#b20710', '#221f1f'])
fig_donut.update_traces(hovertemplate=None, textposition='outside',
                        textinfo='percent+label', rotation=90)
fig_donut.update_layout(margin=dict(t=100, b=30, l=0, r=0),
                        showlegend=False,
                        plot_bgcolor='#333', paper_bgcolor='#333',
                        title_font=dict(size=45, color='#8a8d93',
                                        family="Lato, sans-serif"),
                        font=dict(size=17, color='#8a8d93'),
                        hoverlabel=dict(bgcolor="#444", font_size=13,
                                        font_family="Lato, sans-serif"))


Donut chart in plotly

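So the ratio of Movies to TV Shows is 97%:2%, wow! It looks like Movies clearly outnumber TV Shows in our cleaned data. Keep in mind that we dropped rows without a director earlier, which likely removed many TV Shows, so this ratio describes the cleaned data rather than the full catalog. Now let’s see how the trend changes over time.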
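
The percentages a donut chart displays can also be checked numerically with `value_counts(normalize=True)`. The sketch below uses a made-up four-row `type` column, so its numbers are not the Netflix figures.

```python
import pandas as pd

# Hypothetical sample of the 'type' column
toy = pd.DataFrame({"type": ["Movie", "Movie", "Movie", "TV Show"]})

# Share of each type as a percentage, rounded to one decimal place
share = toy["type"].value_counts(normalize=True).mul(100).round(1)

print(share)
```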

TV Shows & Movies impact over the years

d1 = df[df["type"] == "TV Show"]
d2 = df[df["type"] == "Movie"]

col = "year_added"

# rename_axis + reset_index(name=...) yields the columns [col, "count"] on both older and newer pandas
vc1 = d1[col].value_counts().rename_axis(col).reset_index(name="count")
vc1['percent'] = vc1['count'].apply(lambda x : 100*x/sum(vc1['count']))
vc1 = vc1.sort_values(col)

vc2 = d2[col].value_counts().rename_axis(col).reset_index(name="count")
vc2['percent'] = vc2['count'].apply(lambda x : 100*x/sum(vc2['count']))
vc2 = vc2.sort_values(col)

trace1 = go.Scatter(x=vc1[col], y=vc1["count"], name="TV Shows", marker=dict(color="orange"), )
trace2 = go.Scatter(x=vc2[col], y=vc2["count"], name="Movies", marker=dict(color="#b20710"))
data = [trace1, trace2]
fig_line = go.Figure(data)

fig_line.update_traces(hovertemplate=None)
fig_line.update_xaxes(showgrid=False)
fig_line.update_yaxes(showgrid=False)

large_title_format = 'TV Shows and Movies impact over the years'
small_title_format = "<span style='font-size:13px; font-family:Tahoma'>Due to Covid, content updates slowed down.</span>"
fig_line.update_layout(title=large_title_format + "<br>" + small_title_format, height=400,
                  margin=dict(t=130, b=0, l=70, r=40),
                  hovermode="x unified", 
                  xaxis_title=' ', yaxis_title=" ",
                  plot_bgcolor='#333', paper_bgcolor='#333',
                  title_font=dict(size=25, color='#8a8d93', family="Lato, sans-serif"),
                  font=dict(color='#8a8d93'),
                  legend=dict(orientation="h", yanchor="bottom", y=1, xanchor="center", x=0.5))

fig_line.add_annotation(dict(x=0.8, y=0.3, ax=0, ay=0,
                    xref = "paper", yref = "paper", 
                    text= "Highest number of <b>Tv Shows</b><br> were released in <b>2019</b><br> followed by 2017."
                  ))
fig_line.add_annotation(dict(x=0.9, y=1, ax=0, ay=0,
                    xref = "paper", yref = "paper",
                    text= "Highest number of <b>Movies</b> were released<br> in <b>2019</b> followed by 2020"
                  ))
fig_line.show()


Line chart in plotly

From this chart, we can see that Movies dominate TV Shows on Netflix.

Interestingly, as we approach 2020, Movies drop and TV Shows spike. Why might this be? One plausible explanation is the pandemic that arrived at our doorsteps at the start of 2020. People had more free time than ever, so after finishing Movies they wanted to watch some TV Shows, which would explain the sudden rise in TV Shows. With Netflix producing more TV Shows, demand for Movies could be declining.

So if a producer wants to release a show, which month would be best? Let’s take a deep dive into months.

Best Month for Releasing Content

# rename_axis + reset_index(name=...) yields the columns ['month', 'count'] on both older and newer pandas
df_month = df['month_added'].value_counts().rename_axis('month').reset_index(name='count')
# converting month no to month name
df_month['month_final'] = df_month['month'].replace({1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'June', 7:'July', 8:'Aug', 9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'})

fig_month = px.funnel(df_month, x='count', y='month_final', title='Best month for releasing Content',
                      height=350, width=600, color_discrete_sequence=['#b20710'])
fig_month.update_xaxes(showgrid=False, ticksuffix=' ', showline=True)
fig_month.update_traces(hovertemplate=None, marker=dict(line=dict(width=0)))
fig_month.update_layout(margin=dict(t=60, b=20, l=70, r=40),
                        xaxis_title=' ', yaxis_title=" ",
                        plot_bgcolor='#333', paper_bgcolor='#333',
                        title_font=dict(size=25, color='#8a8d93', family="Lato, sans-serif"),
                        font=dict(color='#8a8d93'),
                        hoverlabel=dict(bgcolor="black", font_size=13, font_family="Lato, sans-serif"))


Funnel chart plotly
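
The month-number-to-name step can also lean on the standard library instead of a hand-written dictionary. This is an alternative sketch on made-up month numbers. Note that `calendar.month_abbr` gives "Jun" and "Jul" rather than "June" and "July", and that it follows the active locale (English is assumed here).

```python
import calendar

import pandas as pd

# Hypothetical sample of the 'month_added' column
toy = pd.DataFrame({"month_added": [1, 1, 12, 7]})

# Count the titles per month
df_month = (toy["month_added"].value_counts()
            .rename_axis("month")
            .reset_index(name="count"))

# Map month numbers to abbreviated names via the standard library
df_month["month_final"] = df_month["month"].map(lambda m: calendar.month_abbr[int(m)])

print(df_month)
```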

It looks like the starting and ending months of the year are the most preferred by the audience, so they are the best time to release a show for more profit.

Well, now we know Movies are preferred over TV Shows, and each year the number of TV Shows and Movies released is increasing. Most people are watching in December, January, and October. Now let’s see which country has the highest number of shows.

Highest number of shows watched in the country

df_country = df.groupby('year_added')['country'].value_counts().reset_index(name='counts')

fig = px.choropleth(df_country, locations="country", color="counts", 
                    locationmode='country names',
                    title='Country ',
                    range_color=[0,200],
                    color_continuous_scale=px.colors.sequential.OrRd
                   )
fig.show()


Map in plotly

Let’s add some spice and use animation to see how the number of Netflix shows per country changed over the years.

Country Vs Year

df_country = df.groupby('year_added')['country'].value_counts().reset_index(name='counts')

fig = px.choropleth(df_country, locations="country", color="counts", 
                    locationmode='country names',
                    animation_frame='year_added',
                    title='Country Vs Year',
                    range_color=[0,200],
                    color_continuous_scale=px.colors.sequential.OrRd
                   )
fig.show()


Animated Map in Plotly

The United States holds the top spot for the most content available on Netflix. Netflix was founded in America, so that is not surprising. Even so, it’s interesting to note that International Movies and TV Shows seem to dominate by genre. What you don’t see in these charts are the countries outside the top 15, which is a probable reason behind this.

Note: To add animation to the chart, we just need to pass the animation_frame parameter. It is that simple.

Let’s take a look at the ratings of shows on Netflix

# making a copy of df
dff = df.copy()

# making 2 df one for tv show and another for movie with rating 
df_tv_show = dff[dff['type']=='TV Show'][['rating', 'type']].rename(columns={'type':'tv_show'})
df_movie = dff[dff['type']=='Movie'][['rating', 'type']].rename(columns={'type':'movie'})
# rating labels go to 'movie' / 'tv_show', counts go to 'rating' (works on both older and newer pandas)
df_movie = df_movie.rating.value_counts().rename_axis('movie').reset_index(name='rating')

df_tv_show = df_tv_show.rating.value_counts().rename_axis('tv_show').reset_index(name='rating')
df_tv_show['rating_final'] = df_tv_show['rating'] 
# making rating column value negative
df_tv_show['rating'] *= -1

# chart
fig = make_subplots(rows=1, cols=2, specs=[[{}, {}]], shared_yaxes=True, horizontal_spacing=0)
# bar plot for tv shows
fig.append_trace(go.Bar(x=df_tv_show.rating, y=df_tv_show.tv_show, orientation='h', showlegend=True, 
                        text=df_tv_show.rating_final, name='TV Show', marker_color='#221f1f'), 1, 1)
# bar plot for movies
fig.append_trace(go.Bar(x=df_movie.rating, y=df_movie.movie, orientation='h', showlegend=True, text=df_movie.rating,
                        name='Movie', marker_color='#b20710'), 1, 2)
# styling the chart
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False, categoryorder='total ascending', ticksuffix=' ', showline=False)
fig.update_traces(hovertemplate=None, marker=dict(line=dict(width=0)))
fig.update_layout(title='Which has the highest rating TV shows or Movies?',
                  margin=dict(t=80, b=0, l=70, r=40),
                  hovermode="y unified", 
                  xaxis_title=' ', yaxis_title=" ",
                  plot_bgcolor='#333', paper_bgcolor='#333',
                  title_font=dict(size=25, color='#8a8d93', family="Lato, sans-serif"),
                  font=dict(color='#8a8d93'),
                  legend=dict(orientation="h", yanchor="bottom", y=1, xanchor="center", x=0.5),
                  hoverlabel=dict(bgcolor="black", font_size=13, font_family="Lato, sans-serif"))
fig.show()


Bi-directional chart in Plotly

Most Netflix content carries a TV-MA or TV-14 rating. TV-MA marks programs intended for mature, adult audiences only, and TV-14 marks content that parents or adult guardians may find unsuitable for children under the age of 14.

Which genre is preferred more for TV shows or Movies

df_m = df[df['type']=='Movie']
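# genre labels stay in 'genre', counts go to 'count' (works on both older and newer pandas)
df_m = df_m['genre'].value_counts().rename_axis('genre').reset_index(name='count')

fig_bars = px.bar(df_m[:5], x='count', y='genre', text='genre',
                        title='Most preferred Genre for Movies',
                        color_discrete_sequence=['#b20710'])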
fig_bars.update_traces(hovertemplate=None, marker=dict(line=dict(width=0)))
fig_bars.update_xaxes(visible=False)
fig_bars.update_yaxes(visible=False, categoryorder='total ascending')
fig_bars.update_layout(height=300,
                  margin=dict(t=100, b=20, l=70, r=40),
                  hovermode="y unified", 
                  plot_bgcolor='#333', paper_bgcolor='#333',
                  title_font=dict(size=40, color='#8a8d93', family="Lato, sans-serif"),
                  font=dict(color='#8a8d93', size=13))


bar chart plotly

For Movies, the Drama and Comedy genres are the most preferred.

df_tv = df[df['type']=='TV Show']
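# genre labels stay in 'genre', counts go to 'count' (works on both older and newer pandas)
df_tv = df_tv['genre'].value_counts().rename_axis('genre').reset_index(name='count')

fig_tv = px.bar(df_tv[:5], x='count', y='genre', text='genre',
                     title='Most preferred Genre for TV Shows',
                     color_discrete_sequence=['#b20710'])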
fig_tv.update_traces(hovertemplate=None, marker=dict(line=dict(width=0)))
fig_tv.update_xaxes(visible=False)
fig_tv.update_yaxes(visible=False, categoryorder='total ascending')
fig_tv.update_layout(height=300,
                  margin=dict(t=100, b=20, l=70, r=40),
                  hovermode="y unified", 
                  plot_bgcolor='#333', paper_bgcolor='#333',
                  title_font=dict(size=40, color='#8a8d93', family="Lato, sans-serif"),
                  font=dict(color='#8a8d93', size=13))

fig_tv.show()


bar chart in plotly

International TV Shows rule the TV Shows category, which is great news! Other genres preferred in TV Shows are comedy and British TV Shows. The anime craze never gets off the hook.

Let’s jump to waterfall charts. A waterfall chart is a two-dimensional chart used to understand the cumulative effect of incremental positive and negative changes over time, over multiple steps, or across a variable. Waterfall charts are also known as Floating Bricks Charts or Flying Bricks Charts.

If you want to learn more about waterfall charts, take a look here

Watching Movies over the Years

d2 = df[df["type"] == "Movie"]
col = "year_added"

vc2 = d2[col].value_counts().rename_axis(col).reset_index(name="count")
vc2['percent'] = vc2['count'].apply(lambda x : 100*x/sum(vc2['count']))
vc2 = vc2.sort_values(col)

fig2 = go.Figure(go.Waterfall(
    name = "Movie", orientation = "v", 
    x = ["2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019", "2020", "2021"],
    textposition = "auto",
    text = ["1", "2", "1", "13", "3", "6", "14", "48", "204", "743", "1121", "1366", "1228", "84"],
    y = [1, 2, -1, 13, -3, 6, 14, 48, 204, 743, 1121, 1366, -1228, -84],
    connector = {"line":{"color":"#b20710"}},
    increasing = {"marker":{"color":"#b20710"}},
    decreasing = {"marker":{"color":"orange"}},

))
fig2.update_xaxes(showgrid=False)
fig2.update_yaxes(showgrid=False, visible=False)
fig2.update_traces(hovertemplate=None)
fig2.update_layout(title='Watching Movies over the year', height=350,
                   margin=dict(t=80, b=20, l=50, r=50),
                   hovermode="x unified",
                   xaxis_title=' ', yaxis_title=" ",
                   plot_bgcolor='#333', paper_bgcolor='#333',
                   title_font=dict(size=25, color='#8a8d93', family="Lato, sans-serif"),
                   font=dict(color='#8a8d93'))


Waterfall chart in Plotly

The years 2010 and 2012 were not good for movie sales, and the Covid impact made it worse.

Conclusion

After going through this blog, you should have an idea of how Plotly works and what the benefits of using it are. We saw various charts: donut (pie) chart, line chart, funnel chart, map chart, animated map, bi-directional bar chart, bar chart, and waterfall chart.
Feel free to contact me on Kaggle | Linkedin
