Bar Plot Pandas problem sorting bars numerical order high to low - Guided Project: Visualizing Earnings Based On College Majors

m = ['Metallurgical Engineering','Petroleum Engineering', 
            'Mining and Mineral Engineering']

mapping = {ShareWomen: i for i, ShareWomen in enumerate(m)}
key = recent_grads['ShareWomen'].map(mapping)
recent_grads.iloc[key.argsort()]


# pandas.DataFrame.plot.bar
recent_grads[:3].plot.bar(x='Major', y='ShareWomen',
                           title='ShareWomen in Highest 3 major by Median Salary'
                         , legend=False)

I am trying to sort the bars of the bar plot from highest to lowest to make a Pareto chart, but it is not working. Can someone please advise what is wrong with the above code?

I am trying to modify code from stackoverflow to implement this:
https://stackoverflow.com/questions/22635110/sorting-the-order-of-bars-in-pandas-matplotlib-bar-plots/22636132

Hi @jamesberentsen,

there are some issues, which secretly tell Python to do things we do not want it to do:

The Majors are all in uppercase

Python is very sensitive when it comes to names, number of spaces, etc. so in this case there will be no matches between Majors in m and Majors in recent_grads
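A minimal sketch of the case mismatch, assuming a small stand-in table named toy in place of recent_grads (the ShareWomen numbers are illustrative, not the real dataset values):

```python
import pandas as pd

# Toy stand-in for recent_grads; the ShareWomen values are illustrative only.
toy = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING",
              "METALLURGICAL ENGINEERING"],
    "ShareWomen": [0.12, 0.10, 0.15],
})

m = ["Metallurgical Engineering", "Petroleum Engineering",
     "Mining and Mineral Engineering"]

# Title-case names never equal the uppercase names stored in the column.
matches_as_written = toy["Major"].isin(m).sum()

# Normalising both sides to one case makes the comparison meaningful.
m_upper = [name.upper() for name in m]
matches_normalised = toy["Major"].str.upper().isin(m_upper).sum()

print(matches_as_written, matches_normalised)
```

Which case you normalise to does not matter, as long as both sides use the same one.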

column names switched

The mapping:

mapping = {ShareWomen: i for i, ShareWomen in enumerate(m)}

in reality refers to the Major column, as Petroleum Engineering is in the Major column. ShareWomen is a result, which we will try to order. This by itself is not invalid, but it may lead to problems during future debugging, like this one.

key returns NaNs

Because recent_grads['ShareWomen'] consists of floats only, this step will return NaNs only - there are no values that match the keys of the mapping.

key.argsort() returns -1 only

Because the key is constructed from NaNs, key.argsort() returns only -1.

.iloc not changing the DataFrame itself

.iloc returns the rows at the given integer positions (key.argsort() in this case), but it does not change the DataFrame itself. Here, because key.argsort() is -1 for every row regardless of the Major, calling:

recent_grads.iloc[key.argsort()]

selects the same position as:

recent_grads.iloc[-1]

once for every row, which means the result is just the last row of the recent_grads DataFrame, repeated.
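The chain described above can be reproduced on a toy table (a hypothetical stand-in for recent_grads with illustrative values). As far as I know, Series.argsort() has historically marked NaN positions with -1, and newer pandas releases deprecate that in favour of ordering NaN last, so this sketch passes the -1 positions to .iloc explicitly instead of relying on argsort():

```python
import pandas as pd

# Toy stand-in for recent_grads; the ShareWomen values are illustrative only.
toy = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING",
              "METALLURGICAL ENGINEERING"],
    "ShareWomen": [0.12, 0.10, 0.15],
})
mapping = {"METALLURGICAL ENGINEERING": 0, "PETROLEUM ENGINEERING": 1,
           "MINING AND MINERAL ENGINEERING": 2}

# Mapping a float column through a dict keyed by strings finds no match,
# so every entry of the key becomes NaN.
key = toy["ShareWomen"].map(mapping)
print(key.isna().all())

# Positional indexing with a repeated -1 returns the last row each time.
repeated = toy.iloc[[-1, -1, -1]]
print(repeated["Major"].tolist())
```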

How to resolve this?

1. Names should perfectly match - UPPERCASE, lowercase, Title

Only the result matters, so you can choose either, so long as lowercase is compared with lowercase, UPPERCASE with UPPERCASE, etc.

2. A subset of recent_grads is better

It’s not a must, but it removes the risk of plotting incorrect data. In this case, the last 3 will be ok, but the last 4 or more would not: the key will impose NaN on all rows that don’t match the mapping, meaning all of them will end up with a single value (in this case 0.877960, because the -1th row is selected).

3. Reassign values back or (better) work on a copy of the DataFrame

Changing the original DataFrame is straightforward, but when bugs appear, one has to run all the code again to recreate the DataFrame, over and over, until the issue is resolved. It pays to work with copies of the original. That way you’ll need to run only 1 line of code, instead of N (reading, cleaning, etc.).

At the end of the day, you’ll end up with this: [image: descending-order bar plot]

Only minor changes are needed in the code, but they make for a huge difference.

If you have more questions, I’ll do my best to help.

Good luck!


Hi kakoori,

Thanks for your explanation.
I have amended the case to uppercase and converted the dataframe copy to a dictionary with to_dict to get a key:value pair so hopefully the key.argsort() works.
If I understand correctly, Series.argsort works with index values, which need to be numbers, so it would not work with the code above because iloc returns numbers but the keys are strings? Is that why it would return -1? I am not sure why I do not get an error with the code above using iloc[key.argsort()], since I think it is working with an index of strings (the majors). I do not know what is wrong with the code below, as I still get the exact same output.

It is still not sorting bars from high to low:

# 2. A subset of `recent_grads` is better -- 
x = recent_grads[['Major','ShareWomen']]
# 3.work on a copy of the DataFrame
x_copy = x.copy()

# 1. Names should perfectly match - UPPERCASE, lowercase, Title
m = ['METALLURGICAL ENGINEERING','PETROLEUM ENGINEERING', 'MINING AND MINERAL ENGINEERING']

x_copy.to_dict('dict')

mapping = {major: i for i, major in enumerate(m)}
key = x_copy['ShareWomen'].map(mapping)
x_copy.iloc[key.argsort()]


# pandas.DataFrame.plot.bar
x_copy[:3].plot.bar(x='Major', y='ShareWomen',
                           title='ShareWomen in Highest 3 major by Median Salary'
                         , legend=False)

You don’t particularly need to overcomplicate this.

You can simply use sort_values() on your dataframe before plotting it -

# 2. A subset of `recent_grads` is better -- 
x = recent_grads[['Major','ShareWomen']]
# 3.work on a copy of the DataFrame
x_copy = x.copy()

# pandas.DataFrame.plot.bar
x_copy[:3].sort_values("ShareWomen", ascending=False).plot.bar(x='Major', y='ShareWomen',
                           title='ShareWomen in Highest 3 major by Median Salary'
                         , legend=False)

Notice that - x_copy[:3].sort_values("ShareWomen", ascending=False)

That will sort the values on ShareWomen, in descending order. And you will get the following -

[image: bar plot of the 3 majors, sorted by ShareWomen in descending order]

I think the above is what you are trying to do here? Or do you specifically want to know why the code you shared based on that stackoverflow answer didn’t work?
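For anyone who wants to try the sort_values() approach without the full dataset, here is a small sketch on a hypothetical stand-in table (toy and its values are made up for illustration). It also shows why the slice comes before the sort:

```python
import pandas as pd

# Toy stand-in for recent_grads; the ShareWomen values are illustrative only.
toy = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING",
              "METALLURGICAL ENGINEERING", "NAVAL ARCHITECTURE",
              "CHEMICAL ENGINEERING"],
    "ShareWomen": [0.12, 0.10, 0.15, 0.11, 0.34],
})

# Slice first, then sort: the first 3 rows, ordered high to low.
top3_sorted = toy[:3].sort_values("ShareWomen", ascending=False)
print(top3_sorted["Major"].tolist())

# Sort first, then slice: the 3 highest ShareWomen rows overall,
# which is a different question.
highest3 = toy.sort_values("ShareWomen", ascending=False)[:3]
print(highest3["Major"].tolist())
```

Chaining .plot.bar(x='Major', y='ShareWomen') onto top3_sorted should then draw the bars in that order.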


Many thanks the_doctor,

It worked now, that code is also easier to understand. Yes, I agree that simple is better, but since I spent time trying to figure it out it would be useful to know why the code did not work to understand the mechanics.
Regards,
JB

@jamesberentsen

You can leave the DataFrame as it is. The key will work just fine.

Series.argsort() works with strings too. The result is the integer positions that would sort the Series.

-1 is returned by Series.argsort() for every NaN value, as per the documentation for Series.argsort().

The plot does not change because .iloc does not modify the DataFrame it is called on. So, in the end, everything before the # pandas.DataFrame.plot.bar comment doesn’t influence the plot. This means that you plot the first 3 rows of the 'ShareWomen' column, in their original order.

To use this code, you only need to plot the modified DataFrame:

m = ['METALLURGICAL ENGINEERING','PETROLEUM ENGINEERING',
     'MINING AND MINERAL ENGINEERING']

# set a filtering condition for the DataFrame
# returns True if the value from 'Major' column is found in list 'm'
filtering_condition = recent_grads['Major'].isin(m)

# return a subset of the recent_grads
x = recent_grads[filtering_condition]

mapping = {major: i for i, major in enumerate(m)}
key = x['Major'].map(mapping)
modified_df = x.iloc[key.argsort()]

# pandas.DataFrame.plot.bar
modified_df.plot.bar(x='Major', y='ShareWomen',
                    title='ShareWomen in Highest 3 major by Median Salary' ,
                    legend=False)

Sure. Here is your code -


# 2. A subset of `recent_grads` is better -- 
x = recent_grads[['Major','ShareWomen']]
# 3.work on a copy of the DataFrame
x_copy = x.copy()

# 1. Names should perfectly match - UPPERCASE, lowercase, Title
m = ['METALLURGICAL ENGINEERING','PETROLEUM ENGINEERING', 'MINING AND MINERAL ENGINEERING']

x_copy.to_dict('dict')

mapping = {major: i for i, major in enumerate(m)}
key = x_copy['ShareWomen'].map(mapping)
x_copy.iloc[key.argsort()]


# pandas.DataFrame.plot.bar
x_copy[:3].plot.bar(x='Major', y='ShareWomen',
                           title='ShareWomen in Highest 3 major by Median Salary'
                         , legend=False)

If you print(key), you will notice that you get only NaNs as the output. Which doesn’t seem right. Why would it have only NaNs?

That’s because you are using map() on ShareWomen. ShareWomen has no value that corresponds to any of the strings in m. Those strings in m are only relevant to the Major column.

So, you need to modify it accordingly -

key = x_copy['Major'].map(mapping)

If you now print key you will see something like the following -

0 1.0
1 2.0
2 0.0
3 NaN
4 NaN
.
.
.

The first 3 values in Major are the ones you are trying to plot, which is why the first 3 values above come from your mapping dictionary. The rest of the values are NaN because those majors are not in your mapping dictionary.
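The partial mapping can be seen on a toy table (a hypothetical stand-in for x_copy with illustrative values). The last step is a swapped-in alternative rather than the argsort() approach from the thread: it drops the NaN entries from the key and reorders by label, which sidesteps the question of how NaN values are sorted.

```python
import pandas as pd

# Toy stand-in for x_copy; the ShareWomen values are illustrative only.
toy = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING",
              "METALLURGICAL ENGINEERING", "NAVAL ARCHITECTURE",
              "CHEMICAL ENGINEERING"],
    "ShareWomen": [0.12, 0.10, 0.15, 0.11, 0.34],
})
m = ["METALLURGICAL ENGINEERING", "PETROLEUM ENGINEERING",
     "MINING AND MINERAL ENGINEERING"]
mapping = {major: i for i, major in enumerate(m)}

# Majors listed in m get their position in m; every other major gets NaN.
key = toy["Major"].map(mapping)
print(key.tolist())

# Alternative to argsort(): keep only the mapped rows, sort the key,
# and use the resulting index labels to reorder the rows.
ordered = toy.loc[key.dropna().sort_values().index]
print(ordered["Major"].tolist())
```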

If you try to run your code now, you will still not get the right plot. That’s because of the following -

x_copy.iloc[key.argsort()]

You reorder x_copy using key.argsort(), but you are not saving the result to anything. If you don’t save it back to x_copy, then when you call plot you are working with the data as it was stored in x_copy before that operation, that is, with unsorted data.

So, you need to save it -

x_copy = x_copy.iloc[key.argsort()]

After this, when you run your code, you should get the correct plot because your x_copy is now sorted as per the key.
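The difference between discarding and keeping the result of .iloc can be checked on a small table. This is a sketch on a hypothetical 3-row stand-in named subset, where every major is in the mapping so the key contains no NaN; the .to_numpy() call is my own addition to hand .iloc a plain array of positions.

```python
import pandas as pd

# Toy stand-in for the first 3 rows of x_copy; values are illustrative only.
subset = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING",
              "METALLURGICAL ENGINEERING"],
    "ShareWomen": [0.12, 0.10, 0.15],
})
m = ["METALLURGICAL ENGINEERING", "PETROLEUM ENGINEERING",
     "MINING AND MINERAL ENGINEERING"]
mapping = {major: i for i, major in enumerate(m)}

key = subset["Major"].map(mapping)       # 1, 2, 0 (no NaN here)
positions = key.argsort().to_numpy()     # row positions that would sort key

subset.iloc[positions]                   # result discarded: subset is unchanged
before = subset["Major"].tolist()

subset = subset.iloc[positions]          # result kept: subset is now reordered
after = subset["Major"].tolist()

print(before)
print(after)
```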

You can avoid saving it by attaching the plotting code to that step as well -

x_copy.iloc[key.argsort()][:3].plot.bar(x='Major', y='ShareWomen',
                           title='ShareWomen in Highest 3 major by Median Salary'
                         , legend=False)


I understand now. Many thanks again for your explanation, the_doctor.


Thanks for your detailed explanation kakoori.


Is there any way to change the colors for the different categories? They’re all the same.

I tried Google but could not find anything :
color = (0.5,0.1,0.6)

I would suggest creating a separate question for this so that this post of yours only focuses on your original question. Easier for other students to find and go through them.
