# remove punctuation
print(training_set.head())
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)
# transform every letter in every word to lower case
training_set['SMS'] = training_set['SMS'].str.lower()
print(training_set.head())
What I expected to happen:
1 ham yes princess are you going to make me moan
What actually happened:
1 ham yes  princess  are you going to make me moan 
It would be great if someone could explain why the output of my code differs from the output of the code in the solution.
In my output, there is more than one blank space between some words.
In the solution, there is only one blank space between words.
Yes, sorry for the confusing description. There are two columns: ‘1’ is the index, ‘ham’ is the ‘Label’ column, and the rest is the ‘SMS’ column. The problem is with the SMS column.
# remove punctuation
training_set.head()
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)
# transform every letter in every word to lower case
training_set['SMS'] = training_set['SMS'].str.lower()
training_set.head()
Ah, I understand now what is happening. In your replace(), you are replacing every non-word character (\W, i.e., anything that is not a letter, digit, or underscore) with a white space. Each punctuation sign therefore became a white space, which was added to the white space that already followed it. So before it was:
‘Yes, princess. Are you going to make me moan?’
And after replacing, you got a double white space after the words ‘yes’ and ‘princess’ (i.e., the old white space + the new white space), and one white space (instead of ‘?’) after ‘moan’.
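To see the effect in isolation, here is a minimal sketch that applies the same pattern to the example sentence with Python's built-in `re` module instead of pandas. The variable names `original` and `cleaned` are only illustrative and do not come from the notebook.

```python
import re

original = "Yes, princess. Are you going to make me moan?"

# \W matches any single character that is not a letter, digit, or underscore.
# Every punctuation mark becomes one space, and each space that was already
# there stays a space, so ", " and ". " each turn into two spaces.
cleaned = re.sub(r"\W", " ", original).lower()

# repr() makes the doubled and trailing spaces visible
print(repr(cleaned))
```

Printing with `repr()` is handy here because it shows the surrounding quotes, so a trailing space cannot be missed.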
By the way, I just had a look, and the same thing happens in the solution notebook, so it doesn’t seem to be an issue. The only difference is that when you use training_set.head() without print, the extra white spaces are not visible (but they are still there).
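If you would rather not keep the extra white spaces at all, one possible approach (not what the solution notebook does) is to replace each run of non-word characters with a single space and then strip the ends. The sketch below assumes a one-row stand-in for `training_set`, which is made up for illustration; the real data set has many more rows.

```python
import pandas as pd

# Hypothetical one-row stand-in for the real training_set
training_set = pd.DataFrame(
    {"Label": ["ham"], "SMS": ["Yes, princess. Are you going to make me moan?"]}
)

training_set["SMS"] = (
    training_set["SMS"]
    .str.replace(r"\W+", " ", regex=True)  # a whole run of non-word characters -> one space
    .str.strip()                           # drop the space left where the final '?' was
    .str.lower()
)

print(repr(training_set.loc[0, "SMS"]))
```

The `+` after `\W` is what makes the difference: ", " is matched as one run and replaced once, instead of character by character. Note that the extra spaces are harmless if you later call `.str.split()` with no arguments, since that splits on any run of whitespace.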