
Easily label your dataset

The mission on Frequency Distributions has an example of categorizing a numeric column by writing labels into a new column:

def make_pts_ordinal(row):
    if row['PTS'] <= 20:
        return 'very few points'
    if (20 < row['PTS'] <=  80):
        return 'few points'
    if (80 < row['PTS'] <=  150):
        return 'many, but below average'
    if (150 < row['PTS'] <= 300):
        return 'average number of points'
    if (300 < row['PTS'] <=  450):
        return 'more than average'
    else:
        return 'much more than average'

I was looking for something more Pythonic. You want to be able to say

df[ newColName ] = label( df[ colName ] , markers, labels ) # where, in this case,
# markers would be [20, 80, 150, 300, 450]  and 
# labels would be ['very few points', 'few points', ... etc ]

That’s what follows… First, a function that returns the interval number. (If you know of something in one of the standard libraries that does this, please share. I’d much prefer something that already exists.)

import math
import numpy as np

def interval( num_list, num ) :
    """ list of numbers, number --> int"""
    # [5,10,15], 5 --> 1
    # [5,10,15], 4 --> 0
    # [5,10,15], 15 --> 3
    # [], x --> ValueError
    if math.isnan( num ) or num == np.NaN :
        return None
    if len(num_list) == 0 :
        raise ValueError('Input list cannot be empty') 
    L = len( num_list )
    current = int( L / 2.0 )
    while True :
        if 0 == current : 
            if num < num_list[current] :
                return 0
            else :
                return 1
        elif L - 1 == current :
            if num >= num_list[current] :
                return L
            else :
                return L-1
        else :
            if num_list[current] <= num < num_list[current+1]  :
                return current
            elif num == num_list[current + 1] :
                return current + 1
            elif num > num_list[current + 1] :
                current = int( ( L + current)/2.0 )
            elif num < num_list[current] :
                current = int( current/2.0 )

And now a labeler that takes a pd.Series, marker list and label list and returns a series of labels :

import numpy as np
def label( series, markers, labels ) :
    """ pd.Series numeric, list of numbers of len N, list of len N+1 --> pd.Series """
    return series.apply( lambda x : labels[ interval( markers, x ) ] if not np.isnan(x) else 'invalid')

And then, for our specific case :

pts_order = ['very few points', 'few points', 'many, but below average',
             'average number of points', 'more than average', 'much more than average']
wnba['PTS_ordinal_scale'] = label( wnba['PTS'], [20,80,150,300,450], pts_order)
wnba['PTS_ordinal_scale'].value_counts().plot.bar( rot=30)
plt.xticks(ha='right')
plt.show()


https://www.kaggle.com/jinxbe/wnba-player-stats-2017

2 Likes

Pure Python/NumPy: https://johnlekberg.com/blog/2020-11-21-stdlib-bisect.html
Pandas: https://pbpython.com/pandas-qcut-cut.html

3 Likes

Thank you so much @hanqi. Following that lead, I realized I can’t even do a simple binary search :wink: The interval function I quoted before won’t work… The final else branch needs to be:

if num_list[current-1] <= num < num_list[current]  :
    return current
elif num == num_list[current ] :
    return current + 1
elif num > num_list[current] :
    left = current
elif num < num_list[current-1] :
    right = current
current = int( (left + right)/2.0 )
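
For reference, here is a minimal, self-contained sketch of a working binary search. It is not the exact branch structure of the fragment above: it swaps in the standard half-open `lo`/`hi` formulation, which should return the same index as `bisect.bisect`. The name `interval_bsearch` is made up for this illustration.

```python
import bisect

def interval_bsearch(num_list, num):
    """Return how many entries of the sorted num_list are <= num."""
    if len(num_list) == 0:
        raise ValueError('Input list cannot be empty')
    lo, hi = 0, len(num_list)      # search window is num_list[lo:hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if num < num_list[mid]:
            hi = mid               # answer is at or left of mid
        else:
            lo = mid + 1           # answer is right of mid
    return lo

markers = [5, 10, 15]
results = [interval_bsearch(markers, n) for n in (4, 5, 15)]
print(results)  # [0, 1, 3]
```

The three printed values match the examples in the comments of the original interval function.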

But it’s totally unnecessary since we have bisect, and, given the ultimate goal, even bisect is not needed. The interval function can be coded simply as:

import bisect
import math

def interval( num_list, num ) :
    # math.isnan alone detects NaN; an equality test against NaN is never True
    if math.isnan( num ) :
        return None
    if len(num_list) == 0 :
        raise ValueError('Input list cannot be empty') 
    return bisect.bisect(num_list, num )
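
One thing worth checking is what happens exactly on a marker. A small sketch, using the markers and labels from this thread: `bisect.bisect` is an alias for `bisect_right`, so a value equal to a marker falls into the interval to its right, while `bisect_left` keeps it on the left, which is what the `<= 20` test in make_pts_ordinal does.

```python
import bisect

markers = [20, 80, 150, 300, 450]
pts_order = ['very few points', 'few points', 'many, but below average',
             'average number of points', 'more than average', 'much more than average']

# bisect.bisect == bisect_right: 20 is pushed into the next interval
right_label = pts_order[bisect.bisect(markers, 20)]
# bisect_left: 20 stays in the first interval, like `row['PTS'] <= 20`
left_label = pts_order[bisect.bisect_left(markers, 20)]

print(right_label)  # few points
print(left_label)   # very few points
```

So if the labels must agree with the original function on boundary values, `bisect_left` looks like the closer match.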

And, for the labeling, we could just use (note the use of a very negative number and a very positive number to set the end points) :

wnba['new_PTS_lbl'] = pd.cut( wnba['PTS'], bins=[-10,20,80,150,300,450,1e6], labels=pts_order )

Using pd.cut as you pointed out.
Thanks!
Verified using value_counts that it works fine…
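
A minimal sketch of the same idea on a few made-up values (not the WNBA data), assuming `-np.inf` and `np.inf` as open-ended end points instead of -10 and 1e6. By default `pd.cut` uses right-closed bins, so 20 lands in the first bin, and NaN stays NaN.

```python
import numpy as np
import pandas as pd

pts_order = ['very few points', 'few points', 'many, but below average',
             'average number of points', 'more than average', 'much more than average']

# Hypothetical sample values, chosen to hit the bin edges
pts = pd.Series([20, 21, 450, 500, np.nan])
bins = [-np.inf, 20, 80, 150, 300, 450, np.inf]

# Default right=True means each bin is (left, right]
labelled = pd.cut(pts, bins=bins, labels=pts_order)
print(labelled.tolist())
```

The infinite end points are only a convenience so that no finite value can fall outside the bins.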

1 Like

num == np.NaN will never be True, so it contributes nothing to the condition.

Maybe you mean num is np.nan.

I’m still trying to find sources confirming np.nan == any object is false.
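
A quick way to see this without any library: `np.nan` is an ordinary Python float NaN, so the sketch below uses `float('nan')`. Equality with NaN is always False (even against itself), and `is` only checks object identity, so it misses NaN values created elsewhere.

```python
import math

nan = float('nan')    # same kind of value as np.nan
other = float('nan')  # a second, separate NaN object

print(nan == nan)          # False: NaN is never equal to anything
print(nan != nan)          # True
print(nan == 0)            # False
print(nan is nan)          # True: same object, identity not value
print(other is nan)        # False: a different NaN object
print(math.isnan(other))   # True: works for any NaN
```

That is why `math.isnan` (or `np.isnan` / `pd.isna`) is the safer test than either `==` or `is`.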


This shows how it’s important to be aware of implicit type conversions.

However, df[A] == df[A] can still be used as a trick to identify NaNs, because all the non-NaN rows will be True and only the NaN rows False. When working with a df, be extra careful about whether you’re comparing the df object (which includes columns, indexes, and values) or only the values. It can get really confusing when you think you’re comparing values and the result is True because it compared the columns/indexes, which were non-empty.

Rather than do int(x/2.0), you can use the integer (floor) division operator directly: x//2. In fact, it’s very important to use // in some online coding problems that accumulate large values and ask for the answer modulo 10**9 + 7, i.e. % (10**9 + 7). Keeping every intermediate result an integer avoids floats entirely; otherwise huge floats can accumulate rounding errors and you could waste time debugging the wrong places, a bad situation to be in for timed automated interviews.
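
A small sketch of the difference, with a value picked for illustration: once an integer is too large to be represented exactly as a float, `int(x/2.0)` silently loses the low bits, while `//` stays exact. The name `MOD` is just a label for this example.

```python
MOD = 10**9 + 7

x = 2**60 + 2              # too large to be stored exactly in a float

via_float = int(x / 2.0)   # x is rounded to 2**60 before dividing
via_floor = x // 2         # pure integer arithmetic, exact

print(via_float == 2**59)       # True: the "+ 1" was lost
print(via_floor == 2**59 + 1)   # True

# Reducing with % at each step keeps numbers small and still exact
print((x * x) % MOD == ((x % MOD) * (x % MOD)) % MOD)  # True
```

Note also that the two disagree for negative numbers: `int()` truncates toward zero, while `//` rounds down.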

You look like you go in-depth to understand things, so just for your solving fun, see how two dict_values objects compare as not equal even though they appear to contain the same values 1, 1, 1.
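
A minimal sketch of that puzzle, with made-up keys: `dict.values()` views do not implement value-based equality, so `==` falls back to identity and is False even when a view is compared with another view of the same dict. Converting to lists (or comparing the dicts or their keys) behaves as expected.

```python
a = {'x': 1, 'y': 1, 'z': 1}
b = {'x': 1, 'y': 1, 'z': 1}

print(a == b)                                # True: dicts compare by content
print(a.values() == b.values())              # False: views compare by identity
print(a.values() == a.values())              # False: each call builds a new view
print(list(a.values()) == list(b.values()))  # True: lists compare by content
print(a.keys() == b.keys())                  # True: key views are set-like
```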

1 Like