Series.str.replace('\W', ' ') not working as expected

Screen Link:

My Code:

# remove punctuation
print(training_set.head())
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)

# transform every letter in every word to lower case
training_set['SMS'] = training_set['SMS'].str.lower()
print(training_set.head())

What I expected to happen:
1 ham yes princess are you going to make me moan

What actually happened:

1   ham      yes  princess  are you going to make me moan 

Could someone explain why the output of my code differs from the output of the code in the solution?
In my output, there is more than one blank space between some words.
In the solution, there is only one blank space between words.

Thank you in advance!

Basics.ipynb (6.4 KB)

Hi @lhc1412, please provide the screen link.

Hi @lhc1412,

But they are 2 different columns, right?
Try without print: training_set.head()

Hi, I am not sure how to share the screen link, which is why I uploaded the Jupyter notebook for your reference.

Yes, sorry for the confusing description. There are two columns. ‘1’ is the index and ‘ham’ is the ‘Label’ column. The rest is the ‘SMS’ column. The problem is with the SMS column.

I tried without print, but nothing changed.

Look, I have just tried it in your notebook:

# remove punctuation
training_set.head()
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)

# transform every letter in every word to lower case
training_set['SMS'] = training_set['SMS'].str.lower()
training_set.head()

The output was the DataFrame displayed as a table.

You can remove the extra spaces with the lines below:
import re
re.sub(" +", " ", "1 ham yes princess are you going")
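If single spaces are wanted in the column itself, one possible approach is to replace each run of non-word characters in one go with \W+ and then strip the ends. This is only a sketch, not the course's solution: the one-row DataFrame is a made-up stand-in for training_set, and the name cleaned is illustrative.

```python
import pandas as pd

# Hypothetical stand-in for training_set, using the message from this thread
training_set = pd.DataFrame({
    "Label": ["ham"],
    "SMS": ["Yes, princess. Are you going to make me moan?"],
})

# \W+ matches a whole run of non-word characters (for example ", "),
# so each run becomes a single space instead of one space per character.
# pandas 2.0 and later treat the pattern literally unless regex=True is passed.
cleaned = (
    training_set["SMS"]
    .str.replace(r"\W+", " ", regex=True)
    .str.strip()   # drop the space left behind by the trailing "?"
    .str.lower()
)
print(cleaned.iloc[0])
```

Assign cleaned back to training_set['SMS'] if the result is what you want.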

Ah, now I understand what is happening. In your replace(), you are replacing every non-word character (\W, i.e. anything that is not a letter, digit, or underscore) with a white space. In this case, each punctuation sign was replaced with a white space, which was added to the white space that already followed it in that column. So before it was:
‘Yes, princess. Are you going to make me moan?’
After replacing, you got a double white space after the words ‘yes’ and ‘princess’ (i.e., the new white space + the old white space), and one white space (instead of ‘?’) after ‘moan’.
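To see the doubling concretely, here is a minimal sketch that applies the same pattern to that message with Python's standard re module. The variable names are illustrative only, and repr() is used so the spaces are easy to count.

```python
import re

# Example message from this thread
sms = "Yes, princess. Are you going to make me moan?"

# Every single non-word character is replaced by one space. A space is itself
# a non-word character, so ", " turns into two spaces.
replaced = re.sub(r"\W", " ", sms).lower()
print(repr(replaced))
```

The printed string has two spaces after 'yes' and 'princess' and one trailing space after 'moan', which matches the output in the question.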

By the way, I have now had a look, and the same thing happens in the solution notebook, so it doesn’t seem to be an issue. The only difference is that when you use training_set.head() without print, these extra white spaces are not visible in the displayed table (but they are still there).
