Some issues with creating ordinal columns

Hi, I am trying to create an ordinal column in pandas based on value ranges, like this:

def create_ordinal(val):
    if np.isnan(val):
        return 'Rookie'
    if (1 >= val <=3): 
        return 'Little experience'
    if (4 >= val <= 5):
        return 'Experienced'
    if (5 >= val <=10):
        return 'Very experienced'
    if (val > 10):
        return 'Veteran'

wnba['Exp_ordinal'] = wnba['Experience'].apply(create_ordinal)

The logic is taken from the picture included in the upload.
Graphs_For_Frequency_Distributions___Dataquest

np.isnan() is used for the zero values.
But I am getting a different result from what is shown on Dataquest.

Dataquest graph -
dq_graph

My graph -
my_graph

You can see I am getting a different result. If you look at the ‘Little experience’ bars, the two graphs show different frequencies. The ‘Experienced’ bar is also wrong. I know there is a mistake somewhere in my function logic, but I couldn’t figure out what it is.

The data comes from this exercise on Dataquest - https://app.dataquest.io/m/286/visualizing-frequency-distributions/2/bar-plots

Any help would be appreciated. Thanks for taking the time to read and reply.

labels = ['Rookie','Little experience','Experienced','Very experienced','Veteran']
bin_df = pd.cut(wnba.query('Experience!="R"').Experience.astype(int),bins=[-1,1,4,5,10,100],labels=labels)
bin_df.value_counts()[labels].plot.bar()

I dropped ‘R’ because I don’t know what it means.
image
Three of my bars match the Dataquest graph. Maybe Dataquest placed all ‘R’ values under ‘Rookie’ and stacked my ‘Rookie’ bar on top of my ‘Experienced’ bar. I’m not sure why the Dataquest bars look like this.
You can sort your bars by indexing into the Series returned by value_counts() with the order you want. It is also worth getting familiar with df.reindex(), which is very useful in other applications.
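
To illustrate the ordering idea, here is a minimal sketch. The `exp` series is made-up data standing in for the Exp_ordinal column, not the real wnba values, and the plotting call is left as a comment so the snippet runs without matplotlib.

```python
import pandas as pd

# Made-up stand-in for wnba['Exp_ordinal']; the counts are illustrative only.
exp = pd.Series(['Veteran', 'Rookie', 'Experienced', 'Rookie',
                 'Little experience', 'Very experienced', 'Rookie'])

order = ['Rookie', 'Little experience', 'Experienced',
         'Very experienced', 'Veteran']

counts = exp.value_counts()      # sorted by frequency, not by meaning
ordered = counts.reindex(order)  # rows rearranged into the order we want

print(ordered)
# ordered.plot.bar() would then draw the bars from Rookie to Veteran
```

Indexing with `counts[order]` gives the same result here. The difference is that reindex() fills a label that is missing from the counts with NaN instead of raising an error.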

You can use pd.cut to replace that unscalable if-else chain. From https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html, try to learn how to control the left and right edges (4 combinations in total). Whether each interval’s edges are open or closed directly affects which bin a value is counted in, so that may be where your graph mismatches. Also, try labels=False, a very useful feature!
df.query is a handy substitute for the tedious df[df.col] style of filtering, especially when df has a long name once you start versioning DataFrames through variable names. The last number in bins was arbitrary; it just has to be at least as big as the max of the numbers you want to cut. Similarly, the first number in bins just has to be below the smallest value.
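
A small sketch of how the edges behave may help. The `years` series is made up, and the bin edges are illustrative: they assume whole-year ranges of 0, 1 to 3, 4 to 5, 6 to 10 and above 10, which may not be the ranges Dataquest actually used.

```python
import pandas as pd

# Made-up experience values, not the real wnba data.
years = pd.Series([0, 1, 3, 4, 5, 10, 11])

# right=True (the default): each bin is (left, right], so 3 lands in (0, 3].
right_closed = pd.cut(years, bins=[-1, 0, 3, 5, 10, 100])

# right=False: each bin is [left, right), so the edges shift up by one
# to describe the same whole-number groups; 3 lands in [1, 4).
left_closed = pd.cut(years, bins=[0, 1, 4, 6, 11, 101], right=False)

# labels=False returns the integer position of the bin instead of an interval.
codes_right = pd.cut(years, bins=[-1, 0, 3, 5, 10, 100], labels=False)
codes_left = pd.cut(years, bins=[0, 1, 4, 6, 11, 101], right=False, labels=False)

print(pd.DataFrame({'years': years, 'right_closed': right_closed,
                    'left_closed': left_closed, 'code': codes_right}))
```

Both sets of edges put every value in the same group; only the way the boundaries are written differs. Moving a single edge by one, or flipping right, is enough to shift a boundary value such as 1 or 5 into the neighbouring bin.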

I have changed all the ‘R’ values to 0 (numeric zero) in the data set.

def create_ordinal(val):
    if (val == 0):
        return 'Rookie'
    if (1 >= val <=3):
        return 'Little experience'
    if (4 >= val <= 5):
        return 'Experienced'
    if (5 >= val <=10):
        return 'Very experienced'
    if (val > 10):
        return 'Veteran'

wnba['Exp_ordinal'] = wnba['Experience'].apply(create_ordinal)

wnba['Exp_ordinal'].value_counts().iloc[[3,0,2,1,4]].plot.bar()
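
One thing that may explain the mismatch is the direction of the comparisons. Python reads a chained comparison such as `1 >= val <= 3` as `1 >= val and val <= 3`, which is true only when val is at most 1. A range check is normally written `1 <= val <= 3`. The sketch below compares the two, assuming the intended range for ‘Little experience’ is 1 to 3; the helper names `chained` and `in_range` are hypothetical and used only for illustration.

```python
def chained(val):
    # As written in the function above: (1 >= val) and (val <= 3)
    return 1 >= val <= 3

def in_range(val):
    # Usual range check: (1 <= val) and (val <= 3)
    return 1 <= val <= 3

for val in [0, 1, 2, 3]:
    print(val, chained(val), in_range(val))
# 0 True False
# 1 True True
# 2 False True
# 3 False True
```

With the original conditions, a value of 2 or 3 fails the first check and falls through to the next one, so it is counted under a later label. That would be consistent with the ‘Little experience’ and ‘Experienced’ bars being off.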