I believe this question in the Guided Project has a typo in it

Guided Project: Practice Optimizing Dataframes and Processing in Chunks

Original Question:

  • How many unique values are there in each string column? How many of the string columns contain values that are less than 50% unique?

My initial answer was wrong, so after looking up the solution I believe the question should read:

  • How many unique values are there in each string column? How many of the string columns have less than 50 unique values?
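
For reference, the original "less than 50% unique" wording would mean comparing each column's unique count against its total non-null count, which is a different check. A minimal sketch of that reading (ignoring chunking for brevity):

import pandas as pd

loans = pd.read_csv('loans_2007.csv')
strings_only = loans.select_dtypes(include='object')

# "Less than 50% unique": the number of distinct values is
# under half the number of non-null values in the column
for col in strings_only:
    n_unique = strings_only[col].nunique()
    n_total = strings_only[col].count()
    if n_unique < 0.5 * n_total:
        print(col, n_unique, n_total)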

Besides, I believe the provided solution is more complicated than it needs to be:

import pandas as pd

loans_chunks = pd.read_csv('loans_2007.csv', chunksize=3000)

# Collect the value_counts of every string column, chunk by chunk
uniques = {}
for lc in loans_chunks:
    strings_only = lc.select_dtypes(include=['object'])
    for c in strings_only.columns:
        val_counts = strings_only[c].value_counts()
        if c in uniques:
            uniques[c].append(val_counts)
        else:
            uniques[c] = [val_counts]

# Combine the per-chunk counts, then report the columns with
# fewer than 50 unique values
uniques_combined = {}
unique_stats = {  # (defined in the solution but unused in this excerpt)
    'column_name': [],
    'total_values': [],
    'unique_values': [],
}
for col in uniques:
    u_concat = pd.concat(uniques[col])
    u_group = u_concat.groupby(u_concat.index).sum()
    uniques_combined[col] = u_group
    if u_group.shape[0] < 50:
        print(col, u_group.shape[0])

The same result can be obtained with much less code by accumulating each column's distinct values in a set:

import pandas as pd
from collections import defaultdict

# One set of distinct values per string column
uniques = defaultdict(set)
loans_chunks = pd.read_csv('loans_2007.csv', chunksize=3000)

for chunk in loans_chunks:
    for col in chunk.select_dtypes(include='object'):
        uniques[col].update(chunk[col].dropna().unique())

# Report the string columns with fewer than 50 unique values
for col in uniques:
    n_unique = len(uniques[col])
    if n_unique < 50:
        print(col, n_unique)

# output
term 2
grade 7
sub_grade 35
emp_length 11
home_ownership 5
verification_status 3
loan_status 9
pymnt_plan 2
purpose 14
initial_list_status 1
application_type 1
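
If you also want the counts for the remaining string columns, the same dictionary can be summarized in one line, for example:

unique_counts = pd.Series({col: len(vals) for col, vals in uniques.items()})
print(unique_counts.sort_values())

Note that the set-based version only tracks distinct values, so it can answer the "fewer than 50 unique values" reading but not the original "50% unique" one, which would also need the total value count per column.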

Correct me if I am wrong.


You’re correct. I’ll pass this information along to the team.

Thanks!

Not only is it a typo; it is perhaps a full-blown error in Dataquest's interpretation of what is being calculated.

https://app.dataquest.io/m/163/optimizing-dataframe-memory-footprint/13/converting-to-categorical-to-save-memory

https://app.dataquest.io/m/163/optimizing-dataframe-memory-footprint/14/converting-to-categorical-to-save-memory
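
For context, those linked steps cover converting the low-cardinality string columns found above to the category dtype to save memory. A minimal sketch of that conversion, reusing the 50-unique-value threshold from this discussion:

import pandas as pd

loans = pd.read_csv('loans_2007.csv')

# Columns with few distinct values are good candidates for the
# category dtype, which stores each string once plus integer codes
for col in loans.select_dtypes(include='object'):
    if loans[col].nunique() < 50:
        loans[col] = loans[col].astype('category')

loans.info(memory_usage='deep')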