Dummy-coding ValueError: columns overlap

Screen Link: Learn data science with Python and R projects

My Code:

dummy_cols = pd.DataFrame(index=train.index)

for col in text_cols:
    dummies = pd.get_dummies(train[col])
    dummy_cols = dummy_cols.join(dummies)

train = train.join(dummy_cols)
train = train.drop(columns=text_cols)

What I expected to happen:

Value of train is not what we expected.

What actually happened:

ValueError                                Traceback (most recent call last)
<ipython-input-1-3bada7baa9d2> in <module>()
      3 for col in text_cols:
      4     dummies = pd.get_dummies(train[col])
----> 5     dummy_cols = dummy_cols.join(dummies)
      6 
      7 # print(dummy_cols.index)
======== snip =============
ValueError: columns overlap but no suffix specified: Index(['Artery', 'Feedr', 'Norm', 'PosA', 'PosN', 'RRNn'], dtype='object')

I am getting the error because the categories in train[["Condition 1", "Condition 2"]] overlap:

> train[["Condition 1", "Condition 2"]].apply(pd.Series.value_counts)
Output
        Condition 1  Condition 2
Artery           55          3.0
Feedr            85          8.0
Norm           1240       1442.0
PosA              9          2.0
PosN             26          3.0
RRAe             13          NaN
RRAn             21          NaN
RRNe              5          NaN
RRNn              6          2.0
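For anyone who wants to reproduce this without the course dataset, here is a minimal sketch. The three-row frame is a made-up stand-in for train, not the real data; it only shares a few category names with the table above.

```python
import pandas as pd

# Hypothetical stand-in for the two columns from the question.
train = pd.DataFrame({
    "Condition 1": ["Norm", "Feedr", "Artery"],
    "Condition 2": ["Norm", "Norm", "Feedr"],
})

dummy_cols = pd.DataFrame(index=train.index)
error_message = None

for col in ["Condition 1", "Condition 2"]:
    dummies = pd.get_dummies(train[col])
    try:
        # The first join succeeds; the second one fails because
        # 'Feedr' and 'Norm' already exist in dummy_cols.
        dummy_cols = dummy_cols.join(dummies)
    except ValueError as exc:
        error_message = str(exc)
        break

print(error_message)
```

The join only fails on the second pass through the loop, once a category name that is already in dummy_cols shows up again.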

To prevent the column names from overlapping, I tried adding a prefix with pd.get_dummies(train[col], prefix=col), so a column name looks like Condition 1_Artery. That makes sense to me, but the grader expects the columns not to have the prefix.
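As a rough sketch of the prefix approach, again on a made-up three-row frame rather than the real train data (text_cols here is just the two condition columns):

```python
import pandas as pd

# Hypothetical stand-in for the two columns from the question.
train = pd.DataFrame({
    "Condition 1": ["Norm", "Feedr", "Artery"],
    "Condition 2": ["Norm", "Norm", "Feedr"],
})
text_cols = ["Condition 1", "Condition 2"]

dummy_cols = pd.DataFrame(index=train.index)
for col in text_cols:
    # prefix=col makes every dummy column name unique per source column.
    dummies = pd.get_dummies(train[col], prefix=col)
    dummy_cols = dummy_cols.join(dummies)

print(list(dummy_cols.columns))
```

With the prefix in place, join() no longer sees any overlapping names, so the loop runs to the end.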

Looking at the answer, it uses pd.concat(), which will overwrite the dummies from column Condition 1 with the dummies from Condition 2. I don’t think that is what is intended, but I could be wrong here and missing something.

Hmmm… So, I haven’t completed this particular section of the course at all.

But, I have been trying to figure this out for more than an hour and it’s a bit confusing.

Dataquest gives examples of dummy columns using the column Utilities and the table in the content shows that the names of the dummy columns are like Utilities_AllPub.

So, they specify that the dummy columns should be named as above. This means, in case another column has an AllPub value, there will be no clashes because we would have dummy columns as ColumnName_AllPub in that case.

However, in their solution, they don’t take the above into account.

They simply use pd.get_dummies() and then pd.concat(), which results in duplicate column names. So, if two columns both contain AllPub as a value, we will see two dummy columns after the concatenation, and both of them will be named AllPub rather than ColumnName_AllPub.

The above is easily cross-checked after running their code. I created a simple frequency table of the column names of train after their solution, and it contains 4 columns named Ex, which, as per their explanation in the content, should not happen. They should be named ColumnName_Ex instead.
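The same check can be sketched on a toy frame. This is not the course solution, only the get_dummies plus concat pattern described above applied to made-up data:

```python
import pandas as pd

# Hypothetical stand-in for two text columns that share categories.
train = pd.DataFrame({
    "Condition 1": ["Norm", "Feedr", "Artery"],
    "Condition 2": ["Norm", "Norm", "Feedr"],
})

parts = [pd.get_dummies(train[col]) for col in ["Condition 1", "Condition 2"]]

# concat keeps every column, so shared category names appear twice.
combined = pd.concat(parts, axis=1)
print(list(combined.columns))

# Names that occur more than once.
duplicated = combined.columns[combined.columns.duplicated()].tolist()
print(duplicated)

# Selecting a duplicated name returns a DataFrame with both columns.
print(combined["Norm"].shape)
```

On this data concat() does not overwrite anything: both sets of dummies survive, they just share names, which makes the columns ambiguous to select later.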

When you are trying to work with join() (which, I think, should be fine in this case instead of concatenation since you are joining by index), as per what I understand of Pandas, you can’t have duplicate column names like that. So, you try to fix it by adding prefixes. And that’s something the DQ Grader doesn’t accept.

In my view, your approach is the correct one, given the column names the content establishes for the dummy columns. I think this particular Mission/Mission Step might require corrections.

@Sahil could you please look into this? There does seem to be a discrepancy in terms of what the content establishes vs what’s in the DQ solution as I explain above. Either there should be additional clarifications or the solution might have to be changed.

Please do note, I have not gone through this entire Mission or related Missions yet. If this is something that was explained somewhere else and it applies here, then I would be wrong about the above. I would also be wrong about the use of join here too instead of concat. That’s been a difficult concept for me to grasp and I need more practice on that front.

It’s strange that this pattern is even used.
get_dummies can be applied to an entire DataFrame at once, rather than column by column.
Applied column by column (to a Series), the resulting columns do not get the column name prepended; applied to a DataFrame, the column name is prepended automatically, which is more convenient.

DataFrame

import pandas as pd

df = pd.DataFrame(['A','B'],columns=['letter'])
pd.get_dummies(df)

Output:

   letter_A  letter_B
0         1         0
1         0         1

Series

s = pd.Series(['A','B'],name='letter')
pd.get_dummies(s)

Output

   A  B
0  1  0
1  0  1
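Applied to the original question, the DataFrame form might look roughly like the sketch below. The frame and the price column are made up for illustration, and whether the grader accepts the prefixed names is a separate matter.

```python
import pandas as pd

# Hypothetical stand-in for train: two text columns plus one numeric column.
train = pd.DataFrame({
    "Condition 1": ["Norm", "Feedr", "Artery"],
    "Condition 2": ["Norm", "Norm", "Feedr"],
    "price": [100, 200, 300],  # hypothetical numeric column
})
text_cols = ["Condition 1", "Condition 2"]

# columns= limits the encoding to the listed columns, prefixes each dummy
# with its source column name, and drops the original text columns.
encoded = pd.get_dummies(train, columns=text_cols)
print(list(encoded.columns))
```

This replaces the loop, the join and the drop with a single call, and the prefixes keep the overlapping categories apart.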