Calculate a mean of a range of values in each row, in a vectorized way


We need to do the following cleaning: data=[2-3, 4.0, 10-13, 40,…]. These are years, or ranges of years; in the first case, the person stayed in the enterprise between 2 and 3 years. I need to convert each of them to a single value. I could do it with a for loop, but I wonder if there is a vectorized way.

For example, if instead of the mean of each range I just took the lowest value (for data[0] that would be 2), it could be:
teste = combined_updated['institute_service'].astype('str').str.replace('Less than 1 year', '1').str.replace('More than 20 years', '20').str.split('-').str[0].astype('float')

So I use str[0] to get the first value, because for data[0] the split gives a list like ['2', '3']. Is there some way to convert the elements inside the list to floats and then take the mean, in a vectorized way? …

Not even sure I was clear, sorry.

@ncirauqu,

It’s me again :grinning:

You can use this:

combined_updated['institute_service'] = combined_updated['institute_service'].astype('str').str.extract(r'(\d+)', expand=True).astype('float')

Hi Elena,
Thanks again :slight_smile: :slight_smile: But, if I understood correctly, this would take the first number from the range, wouldn't it? For example, if we have '7-10', it will take 7. I wonder if there is a way to take the mean instead (8.5 in the example, or 8 if we keep it as int).

Hi @ncirauqu,

Ah, I thought you wanted exactly the lowest value :slightly_smiling_face:

Then try this:

combined_updated['institute_service'] = combined_updated['institute_service'].astype('str').str.findall(r'(\d+)')

Then, on the resulting column, you can find the mean of each list with sum(lst)/len(lst), after converting its elements from strings to numbers (findall returns lists of strings).
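For anyone who wants to avoid a per-row loop entirely, here is a minimal sketch of one possible vectorized variant. It swaps findall for str.extractall plus a groupby on the original row index, which is a different technique from the one suggested above. The sample values and the new column name institute_service_mean are made up for illustration, and the pattern is widened to \d+(?:\.\d+)? on the assumption that floats such as 4.0 become the string '4.0' after astype('str') (plain \d+ would split that into '4' and '0').

```python
import pandas as pd

# Hypothetical sample data mirroring the kinds of values described in the thread
combined_updated = pd.DataFrame({
    "institute_service": [
        "2-3", 4.0, "10-13", 40,
        "Less than 1 year", "More than 20 years", None,
    ]
})

# extractall returns one row per regex match, indexed by (original row, match number).
# The optional non-capturing group keeps decimals such as '4.0' as a single number.
numbers = (
    combined_updated["institute_service"]
    .astype("str")
    .str.extractall(r"(\d+(?:\.\d+)?)")[0]
    .astype("float")
)

# Average the matches belonging to the same original row (index level 0).
service_mean = numbers.groupby(level=0).mean()

# Assignment aligns on the index; rows without any number end up as NaN.
combined_updated["institute_service_mean"] = service_mean

print(combined_updated)
```

Here '2-3' becomes 2.5 and '10-13' becomes 11.5, while 'Less than 1 year' and 'More than 20 years' collapse to 1.0 and 20.0, the same simplification used in the str.replace version earlier in the thread.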

Hi Elena,
Could you please explain why we use the pattern r'(\d+)' here?
Maybe I missed something in the lessons, but I really don't understand this syntax in our case.

Hi Veniamin,

Here we're looking for all the occurrences of digits from 0 to 9 in a string, whether it's a one-digit number (like 8) or a multiple-digit number (like 23). To match a character representing any digit from 0 to 9, we use the regular expression \d (a regular expression, or regex, is a sequence of characters that defines a search pattern). If we're looking for one or more digits one after another (like 123 or 999), we have to add + to the regex above. As a result, our regex pattern for finding all the occurrences of one or more digits looks like this: \d+.
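To make the difference between \d and \d+ concrete, here is a small sketch using Python's built-in re module rather than pandas. The sample strings are only illustrative, and the variable names are made up for this example.

```python
import re

# \d+ matches runs of one or more consecutive digits
range_matches = re.findall(r"\d+", "7-10")
text_matches = re.findall(r"\d+", "More than 20 years")

# Without the +, every digit is matched on its own
single_digits = re.findall(r"\d", "7-10")

# Caveat: a decimal point is not a digit, so '4.0' yields two separate matches
decimal_matches = re.findall(r"\d+", "4.0")

print(range_matches)    # ['7', '10']
print(text_matches)     # ['20']
print(single_digits)    # ['7', '1', '0']
print(decimal_matches)  # ['4', '0']
```

Note that the matches are strings, not numbers, so they need to be converted (for example with float) before any arithmetic.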

You can find more examples of regex patterns in this article, and also in this and this DQ missions.


OK, great, I got it. Thanks a lot for the detailed response!
