Trouble When Creating Bars - Bar Plots and Scatter Plots

Screen Link: https://app.dataquest.io/m/144/bar-plots-and-scatter-plots/4/creating-bars

I was going through the lesson “Creating Bars”, where we begin to learn how to create bar plots, and I’m having a hard time understanding a particular line of code. If someone could explain it to me, that would be super helpful!!

import matplotlib.pyplot as plt
from numpy import arange
num_cols = ['RT_user_norm', 'Metacritic_user_nom', 'IMDB_norm', 'Fandango_Ratingvalue', 'Fandango_Stars']

bar_heights = norm_reviews[num_cols].iloc[0].values
bar_positions = arange(5) + 0.75

fig, ax = plt.subplots()

The code above is used to create a bar plot. I have trouble understanding this line:

bar_heights = norm_reviews[num_cols].iloc[0].values

Given that norm_reviews is the sliced dataframe and num_cols are the selected columns, what do the iloc[0] and values portions actually do for the bar heights?

Thank you so much!

The bar graph we’re creating on this screen is graphing the information in the first row (.iloc[0]). Using .values takes the information in the row and turns it into an array that matplotlib can use to plot the heights of each bar. (You can see the result in the variable inspector for bar_heights.) Just for fun, you can change to .iloc[1] (or whichever) and see it make a bar plot for the next row.
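
To make each step of that line concrete, here is a minimal sketch. The DataFrame below is a hypothetical stand-in for norm_reviews with made-up scores and fewer columns than the lesson's dataset, so only the types and shapes are meant to carry over.

```python
import pandas as pd

# Hypothetical stand-in for norm_reviews: two films, made-up scores.
norm_reviews = pd.DataFrame({
    "FILM": ["Film A", "Film B"],
    "RT_user_norm": [4.3, 4.0],
    "IMDB_norm": [3.9, 3.55],
    "Fandango_Stars": [5.0, 4.5],
})
num_cols = ["RT_user_norm", "IMDB_norm", "Fandango_Stars"]

subset = norm_reviews[num_cols]   # DataFrame with only the selected columns
first_row = subset.iloc[0]        # Series: the first row, indexed by column name
bar_heights = first_row.values    # NumPy array holding just that row's numbers

print(type(first_row).__name__)    # Series
print(type(bar_heights).__name__)  # ndarray
print(bar_heights)                 # [4.3 3.9 5. ]
```

Swapping .iloc[0] for .iloc[1] in this sketch would pick the second film's scores instead.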


Thanks again @april.g!!

Hey @april.g, I experimented without “.values” and it returned the same bar chart, working perfectly fine. I even submitted the result and it was accepted by the code checker. Any comment on why it would be good coding practice to use “.values” to return an array? Thanks!


Yes, correct, it returned the same bar plot. @Sahil @april.g, can you let us know whether this is the correct way to code it?
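
The observation above can be checked with a small sketch. It assumes a hypothetical row of made-up scores rather than the lesson's data, and it uses the non-interactive Agg backend only so that it runs without a display. It draws one bar plot from a Series and one from the array returned by .values, then compares the resulting bar heights.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical first row of scores, indexed by column name.
row = pd.Series({"RT_user_norm": 4.3, "IMDB_norm": 3.9, "Fandango_Stars": 5.0})
bar_positions = np.arange(len(row)) + 0.75

fig, (ax_series, ax_array) = plt.subplots(1, 2)
ax_series.bar(bar_positions, row, 0.5)        # heights passed as a Series
ax_array.bar(bar_positions, row.values, 0.5)  # heights passed as an ndarray

# Read the heights back from the drawn rectangles and compare them.
heights_series = [patch.get_height() for patch in ax_series.patches]
heights_array = [patch.get_height() for patch in ax_array.patches]
print(heights_series == heights_array)
```

In this sketch both calls produce the same bars, which matches what was seen in the lesson.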


Hi @renanfmoises,

Generally, using .values can be considered good practice, because it returns an ndarray, which is often faster to work with than a pandas data structure. My observations on the performance difference are below. When the data is large, pandas is more efficient at arithmetic operations on it; when the data is small, the ndarray is faster. For indexing, however, the ndarray is far faster than the pandas data structure regardless of data size.


Jupyter QtConsole 4.6.0
Python 3.7.6 (default, Jan 8 2020, 13:42:34)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.12.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np
   ...: import pandas as pd

In [2]: ndarray_100m = np.random.randint(0, 9, 100000000)
   ...: series_100m = pd.Series(ndarray_100m)
   ...: series_ndarray_100m = series_100m.values
   ...: ndarray_10k = np.random.randint(0, 9, 10000)
   ...: series_10k = pd.Series(ndarray_10k)
   ...: series_ndarray_10k = series_10k.values

In [3]: %timeit series_100m ** 1.61803398875
735 ms ± 9.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit series_ndarray_100m ** 1.61803398875
2.61 s ± 30.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit series_10k ** 1.61803398875
385 µs ± 10 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [6]: %timeit series_ndarray_10k ** 1.61803398875
216 µs ± 1.81 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: %timeit series_100m[1000]
46.9 µs ± 22.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %timeit series_ndarray_100m[1000]
168 ns ± 2.4 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [9]: %timeit series_10k[1000]
16.6 µs ± 3.15 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [10]: %timeit series_ndarray_10k[1000]
169 ns ± 2.11 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Here is a good article on performance comparison:

Best,
Sahil
