How to proceed if all float columns have missing values? Also, qualms with the solution notebook - Guided Project: Practice Optimizing Dataframes and Processing in Chunks

https://app.dataquest.io/m/165/guided-project%3A-practice-optimizing-dataframes-and-processing-in-chunks/4/optimizing-numeric-columns

First,
the project asks us to calculate the memory usage across chunks three times.

  1. At the start on page 2
  2. After string column optimization
  3. After numeric column optimization

The solution notebook only does it the first time, so there are no memory savings to compare against.
You'd think the solution notebook would follow the instructions Dataquest gives its users…
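For what it's worth, here's a minimal sketch of the per-chunk memory check the instructions ask for. I'm using a tiny in-memory frame in place of `loans_2007.csv`, and the helper name `total_memory_mb` is my own:

```python
import pandas as pd

# Toy stand-in for one chunk of loans_2007.csv
df = pd.DataFrame({
    'loan_amnt': [5000.0, 2400.0, 10000.0],
    'term': [' 36 months', ' 60 months', ' 36 months'],
})

def total_memory_mb(chunks):
    # memory_usage(deep=True) counts the actual bytes behind object columns,
    # not just the pointer size
    return sum(c.memory_usage(deep=True).sum() for c in chunks) / (1024 ** 2)

# With the real file you would pass pd.read_csv('loans_2007.csv', chunksize=3000)
mb = total_memory_mb([df])
print(f"Total memory across chunks: {mb:.4f} MB")
```

Running this once at the start, once after string optimization, and once after numeric optimization would give the three figures to compare.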

Second,
Page 4 of the project instructs us to:

  • Identify float columns that contain missing values, and that we can convert to a more space-efficient subtype.
  • Identify float columns that don’t contain any missing values, and that we can convert to the integer type because they represent whole numbers.
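For reference, here's how I'd expect those two bullet points to look in code, sketched on a toy frame standing in for one chunk (the column names are made up):

```python
import pandas as pd
import numpy as np

# Toy chunk: one clean whole-number column, one with a NaN, one fractional
chunk = pd.DataFrame({'whole': [1.0, 2.0, 3.0],
                      'with_na': [1.0, np.nan, 3.0],
                      'fractional': [0.5, 1.5, 2.5]})

float_cols = chunk.select_dtypes(include=['float'])
has_na = list(float_cols.columns[float_cols.isnull().any()])
no_na = list(float_cols.columns[~float_cols.isnull().any()])

# Of the NaN-free columns, only those holding whole numbers can become ints
int_candidates = [c for c in no_na if (float_cols[c] % 1 == 0).all()]

print(has_na)          # columns needing a float subtype
print(int_candidates)  # columns safe to convert to an integer type
```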

Perhaps it’s the data itself, but in my opinion the solution notebook provides a very poor example for these bullet points.

The solution in this section seems half-hearted, even redundant, since both code blocks provided for this part produce the same output in the notebook.

From solution notebook:

chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000, dtype=convert_col_dtypes, parse_dates=["issue_d", "earliest_cr_line", "last_pymnt_d", "last_credit_pull_d"])
mv_counts = {}
for chunk in chunk_iter:
    term_cleaned = chunk['term'].str.lstrip(" ").str.rstrip(" months")
    revol_cleaned = chunk['revol_util'].str.rstrip("%")
    chunk['term'] = pd.to_numeric(term_cleaned)
    chunk['revol_util'] = pd.to_numeric(revol_cleaned)
    float_cols = chunk.select_dtypes(include=['float'])
    for col in float_cols.columns:
        missing_values = len(chunk) - chunk[col].count()
        if col in mv_counts:
            mv_counts[col] = mv_counts[col] + missing_values
        else:
            mv_counts[col] = missing_values
mv_counts

Output:

{'member_id': 3,
 'loan_amnt': 3,
 'funded_amnt': 3,
 'funded_amnt_inv': 3,
 'installment': 3,
 'annual_inc': 7,
 'dti': 3,
 'delinq_2yrs': 32,
 'inq_last_6mths': 32,
 'open_acc': 32,
 'pub_rec': 32,
 'revol_bal': 3,
 'revol_util': 93,
 'total_acc': 32,
 'out_prncp': 3,
 'out_prncp_inv': 3,
 'total_pymnt': 3,
 'total_pymnt_inv': 3,
 'total_rec_prncp': 3,
 'total_rec_int': 3,
 'total_rec_late_fee': 3,
 'recoveries': 3,
 'collection_recovery_fee': 3,
 'last_pymnt_amnt': 3,
 'collections_12_mths_ex_med': 148,
 'policy_code': 3,
 'acc_now_delinq': 32,
 'chargeoff_within_12_mths': 148,
 'delinq_amnt': 32,
 'pub_rec_bankruptcies': 1368,
 'tax_liens': 108,
 'term': 3}
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000, dtype=convert_col_dtypes, parse_dates=["issue_d", "earliest_cr_line", "last_pymnt_d", "last_credit_pull_d"])
mv_counts = {}
for chunk in chunk_iter:
    term_cleaned = chunk['term'].str.lstrip(" ").str.rstrip(" months")
    revol_cleaned = chunk['revol_util'].str.rstrip("%")
    chunk['term'] = pd.to_numeric(term_cleaned)
    chunk['revol_util'] = pd.to_numeric(revol_cleaned)
    chunk = chunk.dropna(how='all')
    float_cols = chunk.select_dtypes(include=['float'])
    for col in float_cols.columns:
        missing_values = len(chunk) - chunk[col].count()
        if col in mv_counts:
            mv_counts[col] = mv_counts[col] + missing_values
        else:
            mv_counts[col] = missing_values
mv_counts

Output (according to notebook):

{'member_id': 3,
 'loan_amnt': 3,
 'funded_amnt': 3,
 'funded_amnt_inv': 3,
 'installment': 3,
 'annual_inc': 7,
 'dti': 3,
 'delinq_2yrs': 32,
 'inq_last_6mths': 32,
 'open_acc': 32,
 'pub_rec': 32,
 'revol_bal': 3,
 'revol_util': 93,
 'total_acc': 32,
 'out_prncp': 3,
 'out_prncp_inv': 3,
 'total_pymnt': 3,
 'total_pymnt_inv': 3,
 'total_rec_prncp': 3,
 'total_rec_int': 3,
 'total_rec_late_fee': 3,
 'recoveries': 3,
 'collection_recovery_fee': 3,
 'last_pymnt_amnt': 3,
 'collections_12_mths_ex_med': 148,
 'policy_code': 3,
 'acc_now_delinq': 32,
 'chargeoff_within_12_mths': 148,
 'delinq_amnt': 32,
 'pub_rec_bankruptcies': 1368,
 'tax_liens': 108,
 'term': 3}

The only difference between the two code blocks is chunk = chunk.dropna(how='all').

What is the point of this redundancy, and what is the point of chunk.dropna(how='all') if the output remains the same? Are we, as data engineers, meant to drop all null values?
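From what I understand, dropna(how='all') only removes rows where every value is missing, so unless the file contains fully blank lines it's essentially a no-op, which would explain the identical outputs. A quick demonstration:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [np.nan, 2.0, np.nan]})

# how='all' drops only the third row, where BOTH values are NaN
print(len(df.dropna(how='all')))  # 2

# The default (how='any') drops every row containing at least one NaN
print(len(df.dropna()))           # 0
```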

How should we proceed to identify float columns with/without missing data that we can switch to the integer type, when every float column has missing values and numpy integer types can’t hold them (they raise an error)?
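The workaround I've found, which the solution notebook doesn't mention, is pandas' nullable integer dtypes (Int8/Int16/…, note the capital I), which can hold missing values, unlike numpy's integer dtypes. Columns that must stay float can at least be downcast with pd.to_numeric. A sketch, assuming the column holds whole numbers plus NaNs:

```python
import pandas as pd
import numpy as np

s = pd.Series([3.0, np.nan, 32.0])

# Sanity check: every non-missing value is a whole number
assert (s.dropna() % 1 == 0).all()

# Nullable integer dtype: NaN becomes pd.NA instead of raising
converted = s.astype('Int16')
print(converted.dtype)         # Int16
print(converted.isna().sum())  # 1 — the missing value survives

# Fallback for columns that must remain float: downcast to the
# smallest float subtype that fits the values
smaller = pd.to_numeric(s, downcast='float')
print(smaller.dtype)           # float32
```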

Frankly, this is a really poor solution notebook. It’s not annotated or commented properly, as Dataquest suggests we should do, and it feels incomplete. Am I wrong for being critical? Maybe I’m just a poor student/learner.

Things like this annoy the ■■■■ out of me because it feels incomplete and I’m paying for it.