Screen Link: Learn data science with Python and R projects
    dummy_cols = pd.DataFrame(index=train.index)
    for col in text_cols:
        dummies = pd.get_dummies(train[col])
        dummy_cols = dummy_cols.join(dummies)
    train = train.join(dummy_cols)
    train = train.drop(columns=dummy_cols)
What I expected to happen:
Value of train is not what we expected.
What actually happened:
    ValueError                                Traceback (most recent call last)
    <ipython-input-1-3bada7baa9d2> in <module>()
          3 for col in text_cols:
          4     dummies = pd.get_dummies(train[col])
    ----> 5     dummy_cols = dummy_cols.join(dummies)
          6
          7 # print(dummy_cols.index)

    ======== snip =============

    ValueError: columns overlap but no suffix specified: Index(['Artery', 'Feedr', 'Norm', 'PosA', 'PosN', 'RRNn'], dtype='object')
I am getting the error because the categories in train[["Condition 1", "Condition 2"]] overlap:
    > train[["Condition 1", "Condition 2"]].apply(pd.Series.value_counts)

    Output
            Condition 1  Condition 2
    Artery           55          3.0
    Feedr            85          8.0
    Norm           1240       1442.0
    PosA              9          2.0
    PosN             26          3.0
    RRAe             13          NaN
    RRAn             21          NaN
    RRNe              5          NaN
    RRNn              6          2.0
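To make the overlap easier to see, here is a minimal sketch that reproduces the same kind of error on a tiny, made-up frame. The values are placeholders, not the real dataset; only the column names and the join pattern match my code above.

```python
import pandas as pd

# Toy stand-in for the real `train` frame (values are made up for illustration).
train = pd.DataFrame({
    "Condition 1": ["Norm", "Feedr", "Artery", "Norm"],
    "Condition 2": ["Norm", "Norm", "Feedr", "Artery"],
})
text_cols = ["Condition 1", "Condition 2"]

dummy_cols = pd.DataFrame(index=train.index)
error_message = None
try:
    for col in text_cols:
        dummies = pd.get_dummies(train[col])
        # The first pass joins fine. The second pass fails because
        # "Artery", "Feedr" and "Norm" already exist in dummy_cols.
        dummy_cols = dummy_cols.join(dummies)
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The first column's dummies join without trouble; the failure only shows up once a second column shares category names with the first.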
To prevent the column names from overlapping, I tried adding a prefix with pd.get_dummies(train[col], prefix=col), so a column name looks like Condition 1_Artery. That makes sense to me, but the grader expects the columns not to have the prefix.
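For reference, this is roughly what the prefixed version looks like on a small, made-up frame (again, placeholder values rather than the real data). One assumption on my side: the last step drops the original text columns via text_cols, which is what I take the exercise to intend.

```python
import pandas as pd

# Toy stand-in for the real `train` frame (values are made up for illustration).
train = pd.DataFrame({
    "Condition 1": ["Norm", "Feedr", "Artery", "Norm"],
    "Condition 2": ["Norm", "Norm", "Feedr", "Artery"],
})
text_cols = ["Condition 1", "Condition 2"]

dummy_cols = pd.DataFrame(index=train.index)
for col in text_cols:
    # prefix=col yields names such as "Condition 1_Artery", so the dummies
    # from different source columns no longer share labels.
    dummies = pd.get_dummies(train[col], prefix=col)
    dummy_cols = dummy_cols.join(dummies)

# Assumption: drop the original text columns once they are encoded.
encoded = train.join(dummy_cols).drop(columns=text_cols)
print(list(encoded.columns))
```

With the prefix, every dummy column keeps a unique name and the join no longer raises, but the resulting names differ from the unprefixed ones the grader checks for.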
Looking at the answer, it uses pd.concat(), which, as far as I can tell, will overwrite the dummies from column Condition 1 with the dummies from Condition 2. I don’t think that is what is intended, but I could be wrong here and missing something.