Convert all object columns where less than half of the column's values are unique to the category dtype...dataquest gives wrong answer

Optimizing Dataframe Memory Footprint, Step 14: Converting to Categorical to Save Memory

https://app.dataquest.io/m/163/optimizing-dataframe-memory-footprint/14/converting-to-categorical-to-save-memory

On the previous page, 13, DQ states to “stick to using the category type primarily for object columns where less than 50% of the values are unique.”

On page 14, the mission wants us to
“convert all object columns where less than half of the column’s values are unique to the category dtype.”

My answer does that and is counted as wrong:

object_cols = moma.select_dtypes(include=['object']).columns
for col in object_cols:
    # share of the most frequent value in the column
    first_unique = moma[col].value_counts(normalize=True).sort_values(ascending=False)[0]

    if first_unique < .5:
        moma[col] = moma[col].astype('category')
print(moma.info(memory_usage='deep'))

moma[col].value_counts(normalize=True).sort_values(ascending=False)[0] gives me the proportion of the most common value in the column. If that proportion is less than 50%, then I figured the proportions of all the other values must be below 50% as well.

The answer given by Dataquest calculates something else:

for col in moma.select_dtypes(include=['object']):
    num_unique_values = len(moma[col].unique())
    num_total_values = len(moma[col])
    if num_unique_values / num_total_values < 0.5:
        moma[col] = moma[col].astype('category')
        
print(moma.info(memory_usage='deep'))

This logic

    num_unique_values = len(moma[col].unique())
    num_total_values = len(moma[col])
    if num_unique_values / num_total_values < 0.5:

does not calculate the proportion of unique values in a column. In this instance, Dataquest is teaching the wrong way to calculate that proportion.

For example, suppose there's a column with 3 unique values but 100 total values. By DQ's calculation, that ratio would be 0.03, which is less than 0.5, so we convert it to the categorical dtype to save memory, as per the Dataquest lesson.

But in reality, one of those values could appear 98 times in the column while the other two appear just once each. This would make 98% of the values in the column unique, which, according to Dataquest, means that converting it to the categorical dtype would actually use more memory, because most of the values are unique.
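To make the two calculations concrete, here's a quick sketch on a made-up Series with that 98/1/1 distribution (hypothetical data, not the MoMA set):

import pandas as pd

# hypothetical column: one value appears 98 times, two values appear once each
s = pd.Series(['A'] * 98 + ['B'] + ['C'])

# my calculation: the share of the most common value
# (.iloc[0] is the positional version of the [0] in my code above)
top_share = s.value_counts(normalize=True).iloc[0]
print(top_share)       # 0.98

# Dataquest's calculation: number of unique values / total number of values
unique_ratio = len(s.unique()) / len(s)
print(unique_ratio)    # 3 / 100 = 0.03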

Is this clear?

Wouldn’t you expect a polished product from a subscription service that charges hundreds of dollars? Amateurs :stuck_out_tongue:

Hi @davidaguilaratx

Could you please elaborate on the above statement? Thanks.

@Rucha

If the frequency distribution of a column with 100 rows is:
A 98
B 1
C 1

What percentage of values in the column are unique?
How would you calculate that?

Hi @davidaguilaratx

There are 3 unique values in the column: A, B and C.

I haven’t completed this mission yet, so will just discuss a simple example.

A dataset consisting of employee information with 3 columns:

  • Employee Name: This would be more or less unique for almost every employee; only a few employees may share the same name.
  • Age: This would also vary among 100 employees. It could range from 19 (a college intern) to 65 (the CEO), so you still have variability in this column since the range is 46.
  • Gender: This column would typically have 3 values: Male, Female, Other. Even if we split Other further, we would still have at most 4 or 5 values in this column.

The Gender column will have few unique values, which is what makes it categorical. It doesn't matter whether the number of males or the number of females is higher.

And in the example you have cited (A is 98, B is 1, and C is 1), it is generally suggested to exclude the feature from the dataset altogether, as the value A will dominate any kind of learning.

@Rucha

There are obviously 3 unique values…

If you open the link, you'll see which mission and context I'm talking about in the Data Engineer pathway.

This mission is about optimizing the dataframe memory footprint. The whole pathway is about memory/time optimization of data pipelines. This is not about feature selection.

In this hypothetical example, the distribution could be:
A 60
B 20
C 20

Now again, what percentage of values in the column are unique?
How would you calculate that?

According to the mission answer, Dataquest says to calculate it by dividing the 3 unique values by the 100 total values (3/100 = 3%). Does that look correct?

Frankly, my hypothetical example is irrelevant if it only distracts from the point.

The point is, this module suggests that columns with less than 50% unique values will use less memory if converted to the category data type. However, the calculation to determine the percentage of unique values in the columns is wrong in the mission answer.

This is what my post is about.

Hi @davidaguilaratx

Okay, my bad to mix up this post with feature selection.

The 98% or 60% represents how many A's are present in the given column. That is not what Dataquest is asking you to calculate. It is asking you to count what other values are present in the column besides A.

By unique values, the content means you need to count how many different values are present in the column regardless of their frequencies.

And in the second distribution as well, the count of unique values is still 3.
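For instance, a quick toy check (made-up data, just to show the counting):

import pandas as pd

# the two distributions discussed above, as toy Series
s1 = pd.Series(['A'] * 98 + ['B'] + ['C'])              # 98 / 1 / 1
s2 = pd.Series(['A'] * 60 + ['B'] * 20 + ['C'] * 20)    # 60 / 20 / 20

# the count of unique values ignores the frequencies entirely
print(s1.nunique(), s2.nunique())   # 3 3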

Let's just start over: let me know what makes you think "unique values" means calculating their frequencies as well.

This part:

“Convert all object columns [to the category data type] where less than half of the column’s values are unique.”

Hi @davidaguilaratx

Well, the calculation shown by Dataquest is correct. Perhaps the wording used in the content is creating confusion for you.

The prior mission, Optimizing Memory Usage In Pandas | Dataquest, does have an example explaining the uniqueness of values in a column, but so be it. I am highlighting this post to @Sahil.

Thanks.

Yes, the wording is confusing.

Hi @davidaguilaratx, I can see you're not satisfied with the wording of the question and some of @Rucha's answers. I'm here to try to explain it another way.

First, let's discuss the meaning of these manipulations instead of diving into code or wordplay. Once the meaning is understood, how to implement it and what's right or wrong becomes obvious. I'll confirm again that the DQ answer is right.

The purpose of the categorical dtype is to store the unique values in another part of memory and refer to them repeatedly when populating a series. This is more memory efficient than copying the full value across the series.
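Under the hood it looks roughly like this (a toy series just to illustrate the idea):

import pandas as pd

s = pd.Series(['red', 'blue', 'red', 'red', 'blue'])
cat = s.astype('category')

# the "other part of memory": each unique value is stored exactly once
print(cat.cat.categories.tolist())   # ['blue', 'red']

# the series itself becomes small integer codes that point into that table
print(cat.cat.codes.tolist())        # [1, 0, 1, 1, 0]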

Then we think about the trade-offs. The categorical dtype keeps an extra table (one slot for each unique value in the series) to keep track of the unique values. So when there are too many unique values in the series, the extra table ends up with too many slots, resulting in almost a complete duplication of the original series (not an exact duplicate, because the extra table contains only the unique values of the series it derives from).

The goal is to make this extra table worth its space in memory. To be worth it, the table must be small enough, and to be small, the series it is derived from must have a small number of unique values, because the size of the extra table equals the number of unique values in the series (it has nothing to do with the number of times each unique value appears).

The number of unique values in the series is what DQ calculates: it is the number of index entries in series.value_counts() (or, equivalently, len(series.unique()), which the DQ answer uses), because each unique value occupies one index entry. The count stored against each entry, i.e. the number of times that value appears in the series, is irrelevant to the size of the extra table.
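In code, these ways of counting land on the same number (toy data again, not the mission's dataframe):

import pandas as pd

s = pd.Series(['A'] * 98 + ['B'] + ['C'])

print(len(s.value_counts()))   # 3 -> one index entry per unique value
print(len(s.unique()))         # 3 -> what the DQ answer uses
print(s.nunique())             # 3 -> pandas' built-in shortcut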

moma[col].value_counts(normalize=True).sort_values(ascending=False)[0]
This solution shows a lack of understanding of the reasoning above.
We can sanity-check our answers, and the fastest way to do that is to use extreme values (boundary value analysis and equivalence partitioning testing).

Assume there's a series of length 5 with 5 unique values (it doesn't matter what they are).
The code above will give frequencies of 0.2, 0.2, 0.2, 0.2, 0.2. Your code will select the top-occurring normalized frequency, 0.2, and because it's < 0.5, convert to categorical.

What happens then is exactly the situation we want to avoid: we don't want to convert when there are too many unique values in the series, because that pastes an exact copy of the series into the extra table and then builds the series by referring to that table, an unnecessary detour.
The best case for conversion is when there are only 2 unique values (because with 1 unique value you might as well drop the column).

Some sample distributions of 2 unique values in various columns are 0.5/0.5 or 0.3/0.7. According to your code, none of those columns would be converted, because their top values are all >= 0.5, yet these columns are exactly what the categorical dtype is designed for.
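A quick memory sanity check along those lines (made-up data; the exact byte counts depend on your pandas version, but the direction is what matters):

import pandas as pd

# extreme case 1: every value is unique -> the category table duplicates the data
all_unique = pd.Series([f'value_{i}' for i in range(10_000)])
print(all_unique.memory_usage(deep=True),
      all_unique.astype('category').memory_usage(deep=True))   # category is not smaller

# extreme case 2: only two unique values -> the category table is tiny
two_unique = pd.Series(['yes', 'no'] * 5_000)
print(two_unique.memory_usage(deep=True),
      two_unique.astype('category').memory_usage(deep=True))   # category is far smaller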

More details here under “Memory Usage”: Categorical data — pandas 1.3.0 documentation

This trick of using indices to refer to a table of values shows up in decision tree implementations too, when you don't want to copy the data while building the branches but just refer to a single source of data through indices. It also appears all over scikit-learn during sampling and train-test splitting, where a range() of numbers is generated and only those numbers are manipulated; later, the relevant numbers are used to index the rows of data.
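A rough sketch of that indexing idea (not sklearn's actual code, just the concept):

import numpy as np

data = np.arange(100).reshape(50, 2)     # pretend these are 50 data rows

# shuffle and split the lightweight index array instead of the rows themselves
indices = np.random.permutation(len(data))
train_idx, test_idx = indices[:40], indices[40:]

# only at the end are the chosen indices used to look up the actual rows
train_rows, test_rows = data[train_idx], data[test_idx]
print(train_rows.shape, test_rows.shape)   # (40, 2) (10, 2)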


Thank you for the clarification.

I now understand why DQ made the calculation in this manner.