Instructions are not very clear

Hello everyone!

Today I finished course 433-3, Guided Project: Building a Spam Filter with Naive Bayes.
In the part of the course where we are asked to build a training and a test set, the instructions do not advise using the pd.DataFrame.reset_index() method. The problem is that if you follow the instructions and try not to look at the solution notebook, the DataFrames will very probably fail to line up when you use pd.concat() later on. When calling pd.DataFrame.head() on the new DataFrame, the first 5 rows look fine and the values are all zeroes. But if you print the whole DataFrame, there are a lot of NaN values.

Personally, I think that transposing all unique strings as columns in a new DataFrame is not very relevant, since these columns are never going to be used in the actual algorithm. Correct me if I am wrong, please!

Does resetting the index on the initial test set actually make a difference when using pd.concat(axis=1)? Not resetting it led to my function classifying only 1 SMS as actual spam.

Thank you very much!! I hope somebody can help.

Hi @eliasalvarez96,

I have had exactly the same problem. It took me a long time gazing at the solutions to find out.

Funnily enough, I had dropped the NaN values later on, but still nothing was classified as spam.
So I am really curious about what happened as well. Can someone please tell us what the difference is?

Cheers!

Hi @DavidMiedema, @eliasalvarez96,

Sorry about that. I will get it logged for review by the content team. Thank you for bringing this to our attention.

Best,
Sahil

Hi @eliasalvarez96 and @DavidMiedema, we’ll add some clarifications soon. We don’t mention using reset_index() early on because we want to give students more freedom in how they split the dataset. We only ask for a split, not for a particular method of splitting:

Split the randomized dataset into a training and a test set. The training set should account for 80% of the dataset, and the remaining 20% of the data should be the test set.
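
For reference, here is one way the split could look with the index reset right away. This is only a minimal sketch, not the solution notebook's code: the toy data, the column names (Label, SMS) and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the SMS dataset; the column names are assumptions.
sms = pd.DataFrame({
    "Label": ["ham", "spam", "ham", "ham", "spam",
              "ham", "ham", "ham", "spam", "ham"],
    "SMS": [f"message {i}" for i in range(10)],
})

# Shuffle the rows; each row keeps its original index label.
randomized = sms.sample(frac=1, random_state=1)

# 80/20 split by position, then give each part a fresh 0..n-1 index.
split_at = round(len(randomized) * 0.8)
training_set = randomized.iloc[:split_at].reset_index(drop=True)
test_set = randomized.iloc[split_at:].reset_index(drop=True)

print(training_set.index.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test_set.index.tolist())      # [0, 1]
```

Passing drop=True discards the old shuffled labels instead of keeping them as an extra column.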

pd.concat(axis=1) places the DataFrames side by side and aligns their rows by matching index labels. If we don’t use reset_index() early on, the training set keeps the shuffled index labels from the randomization process, while a newly built DataFrame typically gets the default labels 0, 1, 2, and so on. Hence there’ll be many instances where pd.concat(axis=1) won’t find matching labels and will introduce NaN values instead.

This is how it works when all the index labels match:

And this is how it works when the labels don’t match:

image
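
The two cases can also be reproduced in a few lines. This is a small sketch with made-up data; the names messages and word_counts are hypothetical and not taken from the course.

```python
import pandas as pd

# Three rows taken from a shuffled dataset: the labels are not 0, 1, 2.
messages = pd.DataFrame({"SMS": ["hi", "win cash", "ok"]}, index=[7, 0, 4])

# Word counts built from a dict get a fresh default index: 0, 1, 2.
word_counts = pd.DataFrame({"win": [0, 1, 0], "cash": [0, 1, 0]})

# Labels differ, so concat keeps the union of labels and pads with NaN.
mismatched = pd.concat([messages, word_counts], axis=1)
print(mismatched.shape)                     # (5, 3)
print(int(mismatched.isna().sum().sum()))   # 6

# After resetting the index, both frames share labels 0, 1, 2.
aligned = pd.concat([messages.reset_index(drop=True), word_counts], axis=1)
print(aligned.shape)                        # (3, 3)
print(int(aligned.isna().sum().sum()))      # 0
```

Note that in the mismatched case the one label that does exist in both frames (0) pairs "win cash" with the counts computed for "hi", so rows can be combined incorrectly even where no NaN appears. That may be why dropping the NaN rows afterwards does not repair the classification.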

Thanks @alex,

I figured it out later on. One thing remains unanswered though.

"Transposing all unique strings as columns in a new DataFrame is not very relevant, since these columns are never going to be used in the actual algorithm. Correct me if I am wrong, please!"

Is this true?