How to calculate % unique values in Guided Project - Practice Optimizing Dataframes

Dataframe Optimizing Guided Project in the Data Engineer Path
I’m working through step 2 of 5 (Exploring the Data in Chunks) of this project, trying to calculate how many of the string columns in the table from https://www.lendingclub.com/info/download-data.action contain fewer than 50% unique values. I thought I had it, but my answer doesn’t match the Dataquest answer key. I’ve explained my process and given the Dataquest solution below. Could someone explain what I’m missing? Thanks!

Dataquest Solution

I’m perplexed by the Dataquest solution (code shown below). The last two lines of code are where the < 50% selection takes place. However, I think what is actually happening is that columns with fewer than 50 unique values (not 50%) are being selected. Take the int_rate column, for example. I calculate 394 unique values (see my code below), which works out to roughly 0.9% of the table’s 42,535 rows. That seems low enough to be included among the less-than-50%-unique columns, but the Dataquest solution does not return this column. I feel like I must be missing something obvious, so if someone could point it out to me, that would be great :slight_smile:

I’ve also included my solution at the very end of this long question in case it is helpful.

import pandas as pd

loans_chunks = pd.read_csv('loans_2007.csv', chunksize=3000)

# Collect each chunk's value counts for every string column
uniques = {}
for lc in loans_chunks:
    strings_only = lc.select_dtypes(include=['object'])
    cols = strings_only.columns
    for c in cols:
        val_counts = strings_only[c].value_counts()
        if c in uniques:
            uniques[c].append(val_counts)
        else:
            uniques[c] = [val_counts]

uniques_combined = {}
unique_stats = {
    'column_name': [],
    'total_values': [],
    'unique_values': [],
}
# Combine the per-chunk counts, then flag columns with fewer than 50 unique values
for col in uniques:
    u_concat = pd.concat(uniques[col])
    u_group = u_concat.groupby(u_concat.index).sum()
    uniques_combined[col] = u_group
    if u_group.shape[0] < 50:
        print(col, u_group.shape[0])
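
For contrast, here is a minimal sketch of the check I expected, where the unique-value count is divided by a running row total rather than compared to a hard-coded 50 (same file and chunk size as the solution; the rounding is just for readability):

import pandas as pd

loans_chunks = pd.read_csv('loans_2007.csv', chunksize=3000)

total_rows = 0
uniques = {}
for lc in loans_chunks:
    total_rows += len(lc)  # accumulate the table's row count across chunks
    strings_only = lc.select_dtypes(include=['object'])
    for c in strings_only.columns:
        uniques.setdefault(c, []).append(strings_only[c].value_counts())

for col in uniques:
    u_concat = pd.concat(uniques[col])
    u_group = u_concat.groupby(u_concat.index).sum()
    # A true "< 50% unique" test divides by the number of rows
    if u_group.shape[0] / total_rows < 0.5:
        print(col, round(u_group.shape[0] / total_rows, 4))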

My Process

Reading the table in chunks makes things a little more complicated, but my plan was to read the relevant columns, count the unique values in each column, and divide that number by the total number of rows.

This shows how I get a count of unique values:

chunk_iter = pd.read_csv("LoanStats3a.csv", chunksize=rows_under_5Mb, header=1, usecols=dq_columns)
unique_values = {}
for chunk in chunk_iter:
    obj_cols = chunk.select_dtypes(include=['object'])
    for col in obj_cols.columns:
        values = obj_cols[col].value_counts()
        if col in unique_values:                      
             unique_values[col] = pd.concat([unique_values[col], values])
        else:
            unique_values[col] = values

unique_values_combined = {}
for key in unique_values:
    # Sum the per-chunk counts so each unique value appears only once per column
    unique_values_combined[key] = unique_values[key].groupby(level=0).sum()

unique_value_counts_chunks = {}
for key in unique_values_combined:
    # Number of distinct values in each string column
    unique_value_counts_chunks[key] = len(unique_values_combined[key])

The unique_value_counts_chunks dict produced by the code above looks like this:

{'term': 2,
 'int_rate': 394,
 'grade': 7,
 'sub_grade': 35,
 'emp_title': 30658,
 'emp_length': 11,
 'home_ownership': 5,
 'verification_status': 3,
 'issue_d': 55,
 'loan_status': 4,
 'pymnt_plan': 1,
 'purpose': 14,
 'title': 21264,
 'zip_code': 837,
 'addr_state': 50,
 'earliest_cr_line': 530,
 'revol_util': 1119,
 'initial_list_status': 1,
 'last_pymnt_d': 112,
 'last_credit_pull_d': 144,
 'application_type': 1,
 'id': 3}

So I simply divide the counts in the above dictionary by the total number of rows in the table (42,535). These are the columns that I calculate to have less than 50% unique values (the number is the unique-to-total ratio I calculate, rounded to two decimal places):

{'term': 0.0,
 'int_rate': 0.01,
 'grade': 0.0,
 'sub_grade': 0.0,
 'emp_length': 0.0,
 'home_ownership': 0.0,
 'verification_status': 0.0,
 'issue_d': 0.0,
 'loan_status': 0.0,
 'pymnt_plan': 0.0,
 'purpose': 0.0,
 'zip_code': 0.02,
 'addr_state': 0.0,
 'earliest_cr_line': 0.01,
 'revol_util': 0.03,
 'initial_list_status': 0.0,
 'last_pymnt_d': 0.0,
 'last_credit_pull_d': 0.0,
 'application_type': 0.0}
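
The division step itself is just this (a sketch with the row total hard-coded; it reuses the unique_value_counts_chunks dict from above):

total_rows = 42535
unique_value_ratios = {col: round(count / total_rows, 2)
                       for col, count in unique_value_counts_chunks.items()}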

The answer does look like it is calculating fewer than 50 unique values (an absolute count) rather than a percentage. However, if you do it your way, all the numbers are tiny, way below 0.5, so it could be a wrong instruction or bad question design. I also wonder what happens to the unique_stats dictionary later; that may be relevant here, since the solution you pasted seems incomplete and never uses unique_stats.

P.S. Your pd.concat on every loop iteration could be spending much more time copying data to build an ever-longer Series on each pass, compared to the solution’s approach of storing the pieces in a list first and calling pd.concat only once at the end.
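
Something like this sketch of your loop, restructured to concatenate once per column (it reuses your chunk_iter and assumes pandas is imported as pd):

unique_values = {}
for chunk in chunk_iter:
    obj_cols = chunk.select_dtypes(include=['object'])
    for col in obj_cols.columns:
        # Append each chunk's counts to a list instead of concatenating now
        unique_values.setdefault(col, []).append(obj_cols[col].value_counts())

# One concat per column instead of one per chunk
unique_values_combined = {col: pd.concat(pieces).groupby(level=0).sum()
                          for col, pieces in unique_values.items()}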

Thanks for the reply, hanqi, and thanks for the tip about the speed of my pd.concat! I’ll test the speed and amend my answer.

Good question on unique_stats… I’m not sure why it is defined; I don’t see it referenced anywhere else in the Dataquest solution (https://github.com/dataquestio/solutions/blob/master/Mission165Solutions.ipynb).
