How to proceed if all float columns have missing values? Also, qualms with the solution notebook - Guided Project: Practice Optimizing Dataframes and Processing in Chunks

https://app.dataquest.io/m/165/guided-project%3A-practice-optimizing-dataframes-and-processing-in-chunks/4/optimizing-numeric-columns

First,
the project asks us to calculate the memory usage across chunks three times.

  1. At the start on page 2
  2. After string column optimization
  3. After numeric column optimization

The solution notebook only does it the first time, so we have no memory savings to compare to.
You’d think the solution notebook would follow the instructions Dataquest gives its users…

Second,
Page 4 of the project instructs us to:

  • Identify float columns that contain missing values, and that we can convert to a more space-efficient subtype.
  • Identify float columns that don’t contain any missing values, and that we can convert to the integer type because they represent whole numbers.

In my opinion (perhaps because of the data itself), the solution notebook provides a very poor example for these bullet points.

The solution in this section seems half-hearted and even redundant, as both code blocks provided for this part produce the same output in the notebook.

From solution notebook:

chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000,
                         dtype=convert_col_dtypes,
                         parse_dates=["issue_d", "earliest_cr_line",
                                      "last_pymnt_d", "last_credit_pull_d"])
mv_counts = {}
for chunk in chunk_iter:
    term_cleaned = chunk['term'].str.lstrip(" ").str.rstrip(" months")
    revol_cleaned = chunk['revol_util'].str.rstrip("%")
    chunk['term'] = pd.to_numeric(term_cleaned)
    chunk['revol_util'] = pd.to_numeric(revol_cleaned)
    float_cols = chunk.select_dtypes(include=['float'])
    for col in float_cols.columns:
        missing_values = len(chunk) - chunk[col].count()
        if col in mv_counts:
            mv_counts[col] = mv_counts[col] + missing_values
        else:
            mv_counts[col] = missing_values
mv_counts

Output:

{'member_id': 3,
 'loan_amnt': 3,
 'funded_amnt': 3,
 'funded_amnt_inv': 3,
 'installment': 3,
 'annual_inc': 7,
 'dti': 3,
 'delinq_2yrs': 32,
 'inq_last_6mths': 32,
 'open_acc': 32,
 'pub_rec': 32,
 'revol_bal': 3,
 'revol_util': 93,
 'total_acc': 32,
 'out_prncp': 3,
 'out_prncp_inv': 3,
 'total_pymnt': 3,
 'total_pymnt_inv': 3,
 'total_rec_prncp': 3,
 'total_rec_int': 3,
 'total_rec_late_fee': 3,
 'recoveries': 3,
 'collection_recovery_fee': 3,
 'last_pymnt_amnt': 3,
 'collections_12_mths_ex_med': 148,
 'policy_code': 3,
 'acc_now_delinq': 32,
 'chargeoff_within_12_mths': 148,
 'delinq_amnt': 32,
 'pub_rec_bankruptcies': 1368,
 'tax_liens': 108,
 'term': 3}
Second code block from the solution notebook:

chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000,
                         dtype=convert_col_dtypes,
                         parse_dates=["issue_d", "earliest_cr_line",
                                      "last_pymnt_d", "last_credit_pull_d"])
mv_counts = {}
for chunk in chunk_iter:
    term_cleaned = chunk['term'].str.lstrip(" ").str.rstrip(" months")
    revol_cleaned = chunk['revol_util'].str.rstrip("%")
    chunk['term'] = pd.to_numeric(term_cleaned)
    chunk['revol_util'] = pd.to_numeric(revol_cleaned)
    chunk = chunk.dropna(how='all')
    float_cols = chunk.select_dtypes(include=['float'])
    for col in float_cols.columns:
        missing_values = len(chunk) - chunk[col].count()
        if col in mv_counts:
            mv_counts[col] = mv_counts[col] + missing_values
        else:
            mv_counts[col] = missing_values
mv_counts

Output (according to notebook):

{'member_id': 3,
 'loan_amnt': 3,
 'funded_amnt': 3,
 'funded_amnt_inv': 3,
 'installment': 3,
 'annual_inc': 7,
 'dti': 3,
 'delinq_2yrs': 32,
 'inq_last_6mths': 32,
 'open_acc': 32,
 'pub_rec': 32,
 'revol_bal': 3,
 'revol_util': 93,
 'total_acc': 32,
 'out_prncp': 3,
 'out_prncp_inv': 3,
 'total_pymnt': 3,
 'total_pymnt_inv': 3,
 'total_rec_prncp': 3,
 'total_rec_int': 3,
 'total_rec_late_fee': 3,
 'recoveries': 3,
 'collection_recovery_fee': 3,
 'last_pymnt_amnt': 3,
 'collections_12_mths_ex_med': 148,
 'policy_code': 3,
 'acc_now_delinq': 32,
 'chargeoff_within_12_mths': 148,
 'delinq_amnt': 32,
 'pub_rec_bankruptcies': 1368,
 'tax_liens': 108,
 'term': 3}

The only difference between the two code blocks is the line chunk = chunk.dropna(how='all').

What is the point of this redundancy, and what does chunk.dropna(how='all') accomplish if the output remains the same? As data engineers, are we meant to drop rows that are entirely null?
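For anyone else puzzled by this: here is a minimal sketch (my own toy data, not from the notebook) of how dropna(how='all') behaves. It only drops rows where every column is NaN, so partially-missing rows survive and the per-column missing-value counts barely change, which is presumably why the two outputs match.

```python
import numpy as np
import pandas as pd

# Toy frame: row 1 is partially missing, row 2 is entirely missing.
df = pd.DataFrame({"a": [1.0, np.nan, np.nan],
                   "b": [2.0, 5.0, np.nan]})

# how='all' drops only rows where EVERY column is NaN (row 2 here).
dropped = df.dropna(how="all")

print(len(df), len(dropped))      # the all-NaN row is gone
print(dropped["a"].isna().sum())  # the partial NaN in 'a' remains
```

So unless the CSV contains rows that are completely empty, the call is effectively a no-op for the missing-value counts.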

How should we proceed to identify float columns we can switch to the integer type if every float column contains missing values? The standard integer dtypes can’t hold missing values, so the conversion raises an error.
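One possible workaround (my own suggestion, not something the solution notebook does) is pandas’ nullable 'Int64' dtype (capital I), which can hold missing values, combined with a whole-number check on the non-null values:

```python
import numpy as np
import pandas as pd

# Hypothetical column standing in for one of the loan float columns.
s = pd.Series([3.0, 7.0, np.nan, 12.0])

# A float column qualifies for an integer type when every
# non-null value is a whole number.
is_whole = (s.dropna() % 1 == 0).all()

if is_whole:
    # Unlike the NumPy-backed 'int64', pandas' nullable 'Int64'
    # dtype represents missing values as pd.NA.
    converted = s.astype("Int64")
    print(converted.dtype)  # Int64
```

This sidesteps the "cannot convert NaN to integer" error entirely, at the cost of using an extension dtype rather than a plain NumPy one.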

Frankly, this is a really poor solution notebook. It’s not annotated or commented properly, as Dataquest suggests we should do, and it feels incomplete. Am I wrong for being critical? Maybe I’m just a poor student/learner.

Things like this annoy the ■■■■ out of me because it feels incomplete and I’m paying for it.


Same issue here.

I understand that we are supposed to solve some of these problems ourselves.
For example, for the memory consumption you can reuse the earlier code: insert it into the for loop that iterates over the chunks to measure the new memory usage after modifying the dataset.
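A rough sketch of what I mean. The StringIO stand-in replaces loans_2007.csv so the snippet runs on its own; in the project you would read the real file and apply the string/numeric optimizations where the comment sits, then compare the totals before and after:

```python
import io
import pandas as pd

# Stand-in CSV (the real project reads 'loans_2007.csv' from disk).
csv = ("loan_amnt,term\n"
       "5000, 36 months\n"
       "2400, 60 months\n"
       "10000, 36 months\n")

total_bytes = 0
for chunk in pd.read_csv(io.StringIO(csv), chunksize=2):
    # ... the project's column optimizations would go here ...
    # Accumulate the deep (string-aware) memory footprint per chunk.
    total_bytes += chunk.memory_usage(deep=True).sum()

print(total_bytes / (1024 ** 2))  # total footprint in megabytes
```

Running this once before the optimizations and once after gives the comparison the instructions ask for on each of the three passes.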

But the last part of the solution is indeed very vague and doesn’t explain what conclusions we can draw.

It’s a big disappointment because I really like the way Dataquest teaches, but since the beginning of this step of the Data Engineer path I’ve noticed and reported many inconsistencies and mistakes like this one.

It makes it really hard to learn when we are dealing with holes in the content like the ones mentioned above.

I hope someone addresses this issue.
