Challenge: cleaning string column (is there a specific order?)

My Code:

laptops['weight'] = laptops['weight'].str.replace('kg','').str.replace('kgs','').astype(float)

laptops.rename({'weight':'weight_kg'}, axis = 1, inplace = True)

laptops.to_csv('laptops_cleaned.csv', index = False)

What actually happened:
I got an error message

ValueError: could not convert string to float: '4s'

I noticed that the proposed solution chains the str.replace() calls in a different order: it replaces ‘kgs’ first and then ‘kg’. I tried running the solution and it worked fine. Is the order what I got wrong, or is there something else that I did not notice?

laptops["weight"] = laptops["weight"].str.replace("kgs","").str.replace("kg","").astype(float)

thanks :+1:

Hi @boemer00,

Yes, the order from left to right matters here.

In your case, the first replace() removes every kg (it replaces it with an empty string), so each kgs in your column is left as just s. When your second replace() then looks for kgs, it can’t find any, because those values no longer exist. Later, the values that still contain the leftover s cannot be converted to float.

That’s why you have to use either the order suggested in the solution (first kgs, then kg) or, if you want to replace kg first, a second replace() that removes the leftover s:

laptops['weight'] = laptops['weight'].str.replace('kg','').str.replace('s','').astype(float)
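
To see the effect of the order in isolation, here is a minimal sketch on two made-up weight strings (not the real laptops data). It uses plain Python str.replace(), which I am assuming behaves the same as the .str accessor for literal replacements like these:

```python
raw = ["2.2kg", "4kgs"]  # made-up sample values, not from the dataset

# 'kg' first: "4kgs" becomes "4s", so the later 'kgs' replacement finds nothing
kg_first = [w.replace("kg", "").replace("kgs", "") for w in raw]

# 'kgs' first: the longer suffix is removed whole, then plain 'kg' is handled
kgs_first = [w.replace("kgs", "").replace("kg", "") for w in raw]

print(kg_first)   # ['2.2', '4s']
print(kgs_first)  # ['2.2', '4']

# The leftover 's' is what breaks the float conversion
try:
    float(kg_first[1])
except ValueError as err:
    print(err)  # could not convert string to float: '4s'
```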

Got it! Thank you so much :raised_hands:

is there a way to place two things in one replace statement?
for ex. (pseudocode):

laptops['weight'].str.replace('kg',''; "s","").str.replace('s','').astype(float)
or
laptops['weight'].str.replace('kg','' and "s","").str.replace('s','').astype(float)
or
laptops['weight'].str.replace('kg','' **something else here** "s","").str.replace('s','').astype(float)

Hi @drill_n_bass,

No, unfortunately the syntax of this method doesn’t have this option.

good to know. thank you for feedback ! :slight_smile:

I mean, either of the following lines, each chaining two str.replace() calls, works in this situation:

laptops['weight'] = laptops['weight'].str.replace('kg','').str.replace('s','').astype(float)

or

laptops['weight'] = laptops['weight'].str.replace('kgs', '').str.replace('kg', '').astype(float)

I know. I just wonder if there is a way to get rid of one of the str.replace() calls, so the code is simpler and shorter.

Looks like there’s a way to achieve it:

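s.str.replace('kgs|kg', '', regex=True)

It works because Series.str.replace can treat the pattern as a regular expression, in which | means OR. Pass regex=True explicitly: since pandas 2.0 the default is regex=False, which matches the pattern as a literal string.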


Test snippet:

import pandas as pd

test = pd.Series(['weight in kg', 'weight in kgs', 'qwertykgs'])

print(test.str.replace('kgs|kg', '', regex=True))
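
As a rough illustration of why the regex flag matters, the sketch below runs the same pattern with regex=True and regex=False on a few made-up weight strings (the values and variable names are only for this example):

```python
import pandas as pd

weights = pd.Series(["2.2kg", "4kgs", "1.37kg"])  # made-up sample values

# regex=True: the pattern is a regular expression, so | means "or"
as_regex = weights.str.replace("kgs|kg", "", regex=True)

# regex=False: the pattern is a literal string, and no value contains "kgs|kg"
as_literal = weights.str.replace("kgs|kg", "", regex=False)

print(as_regex.tolist())    # ['2.2', '4', '1.37']
print(as_literal.tolist())  # ['2.2kg', '4kgs', '1.37kg']

# Only the regex version leaves strings that convert cleanly
cleaned = as_regex.astype(float)
print(cleaned.tolist())     # [2.2, 4.0, 1.37]
```

Spelling out regex=True keeps the result the same whichever default the installed pandas version uses.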

It works!!! :slight_smile:

laptops['weight'] = (laptops['weight'].str.replace('kg|s', '', regex=True)
                     .astype(float)
                    )

I just wonder about one thing. I was certain that it would be better to use “&” (and) than “|” (or). But when I wrote:

laptops['weight'] = (laptops['weight'].str.replace('kg&s','')
                     .astype(float)
                    )

…the code was rejected. I’m not sure why.
With “|” (or), I thought the logic was that it would remove only one of ‘kg’ or ‘s’, so the cleaning would be incomplete: for a “kgs” string it would remove only “kg”, not both. On the other hand, “&” (and) probably has a similar problem, and that is likely my actual error: for a plain “kg” string nothing would match, because there is no “s” to find.

Great that you tried it out even after 2 weeks!

As you may already know, some characters have a special meaning in regex. One of them is |, known as alternation: it acts as a boolean OR and tries the alternatives from left to right.

To answer your question, here’s what I want you to try (and don’t miss the pattern 'kg[s]?' in the last print):

import pandas as pd

s = pd.Series(['weight in kg', 'weights in kgs', 'yesss kgs', 'did you miss kg&s?'])


print(':ROUND 1:', s.str.replace('kgs|kg', '', regex=True), sep='\n')  # Tries kgs first
print()
print(':ROUND 2:', s.str.replace('kg|kgs', '', regex=True), sep='\n')  # Tries kg first, so the s of kgs is left behind
print()
print(':ROUND 3:', s.str.replace('kg|s', '', regex=True), sep='\n')  # Removes every kg and every s
print()
print(':ROUND 4:', s.str.replace('kg&s', '', regex=True), sep='\n')  # & is a literal character in regex
print()
print(':ROUND 5:', s.str.replace('kg[s]?', '', regex=True), sep='\n')  # Matches kg with or without a single s

Both 'kgs|kg' and 'kg[s]?' work.


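With 'kg&s', I suppose you were looking for 'kg[s]?'? In regex, & has no special meaning, so 'kg&s' only matches the literal text kg&s.
([s]? matches 0 or 1 occurrence of s, so 'kg[s]?' is the same as 'kgs?')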
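
If you want to check these patterns without pandas, here is a small sketch using Python's re module, on the assumption that pandas applies the same matching rules when regex=True. The strip_all helper is just a name made up for this example:

```python
import re

samples = ["weight in kg", "weights in kgs", "did you miss kg&s?"]

def strip_all(pattern):
    """Remove every match of pattern from each sample string."""
    return [re.sub(pattern, "", text) for text in samples]

# Longer alternative first: 'kgs' is removed whole
print(strip_all("kgs|kg"))  # ['weight in ', 'weights in ', 'did you miss &s?']

# Shorter alternative first: 'kg' matches inside 'kgs' and the 's' stays
print(strip_all("kg|kgs"))  # ['weight in ', 'weights in s', 'did you miss &s?']

# '&' is a literal character, so only the exact text 'kg&s' is removed
print(strip_all("kg&s"))    # ['weight in kg', 'weights in kgs', 'did you miss ?']

# Optional 's': one pattern covers both 'kg' and 'kgs'
print(strip_all("kgs?"))    # ['weight in ', 'weights in ', 'did you miss &s?']
```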

I totally share this opinion too.