Optimizing Dataframe Memory Footprint, page 14

Converting to Categorical to Save Memory.

On the previous page (13), DQ says to “stick to using the category type primarily for object columns where less than 50% of the values are unique.”

On page 14, the mission wants us to

“convert all object columns where less than half of the column’s values are unique to the category dtype.”

My answer does that, yet it’s counted as wrong:

```python
object_cols = moma.select_dtypes(include=['object']).columns
for col in object_cols:
    # share of the most common value in the column
    first_unique = moma[col].value_counts(normalize=True).\
        sort_values(ascending=False).iloc[0]
    if first_unique < 0.5:
        moma[col] = moma[col].astype('category')
moma.info(memory_usage='deep')
```

`moma[col].value_counts(normalize=True).sort_values(ascending=False).iloc[0]`

gives me the proportion of the most common value in the column. If that proportion is less than 50%, then every other value’s proportion must be below 50% as well, since no value occurs more often than the most common one.
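As a sanity check on what that expression returns, here is a minimal sketch on a made-up Series (not the mission data):

```python
import pandas as pd

# Made-up column where the most common value covers 40% of rows
s = pd.Series(['a'] * 40 + ['b'] * 35 + ['c'] * 25)

# value_counts(normalize=True) already sorts descending,
# so .iloc[0] is the share of the most common value
top_share = s.value_counts(normalize=True).\
    sort_values(ascending=False).iloc[0]
print(top_share)  # 0.4, so this column would pass the < 0.5 check
```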

The answer given by Dataquest calculates something else:

```python
for col in moma.select_dtypes(include=['object']):
    num_unique_values = len(moma[col].unique())
    num_total_values = len(moma[col])
    if num_unique_values / num_total_values < 0.5:
        moma[col] = moma[col].astype('category')
print(moma.info(memory_usage='deep'))
```
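For what it’s worth, DQ’s two helper lines can be collapsed with `Series.nunique()`. This sketch runs the same check on a tiny stand-in DataFrame (the column names and data here are made up, not the real MoMA set):

```python
import pandas as pd

# Tiny stand-in for the mission's `moma` DataFrame
moma = pd.DataFrame({
    'Department': ['Painting'] * 4 + ['Drawing'],  # 2 distinct / 5 total = 0.4
    'Title': ['A', 'B', 'C', 'D', 'E'],            # 5 distinct / 5 total = 1.0
})

for col in moma.select_dtypes(include=['object']):
    # nunique() == len(unique()) for columns without NaN
    if moma[col].nunique() / len(moma[col]) < 0.5:
        moma[col] = moma[col].astype('category')

print(moma.dtypes)  # Department -> category, Title stays object
```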

This logic

```python
num_unique_values = len(moma[col].unique())
num_total_values = len(moma[col])
if num_unique_values / num_total_values < 0.5:
```

calculates the ratio of distinct values to total values, which is not the same quantity my code checks. **In this instance, the rule as I read it (“less than 50% of the values are unique”) and the sample solution measure different things.**

For example, take a column with 3 distinct values out of 100 total values. By DQ’s calculation, the ratio is 3 / 100 = 0.03, which is less than 0.5, so we convert it to the category dtype to save memory, as per the lesson.

But one of those values could appear 98 times while the other two appear just once. In that case, the quantity my code checks, the share of the most common value, is 98%, so my solution would not convert the column even though DQ’s would. The two checks can disagree about which columns to convert.
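To make the disagreement concrete, here is a sketch comparing the two metrics on a hypothetical 98/1/1 column (the data is made up, not from the MoMA set):

```python
import pandas as pd

# Hypothetical column: one value appears 98 times, two appear once
col = pd.Series(['a'] * 98 + ['b', 'c'])

top_share = col.value_counts(normalize=True).iloc[0]  # what my code checks
distinct_ratio = col.nunique() / len(col)             # what DQ's code checks

print(top_share)      # 0.98 -> fails my < 0.5 check, so I would not convert
print(distinct_ratio) # 0.03 -> passes DQ's < 0.5 check, so DQ would convert
```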

Is this clear?

Wouldn’t you expect a polished product from a subscription service that charges hundreds of dollars? Amateurs.