Guided Project:Clean And Analyze Employee Exit Surveys - Why Str.extract(r'(\d+)') does not create a double count in Years?

https://app.dataquest.io/m/348/guided-project%3A-clean-and-analyze-employee-exit-surveys/9/clean-the-service-column

This is similar to another thread here Str.extract(r'(\d+)') - can someone explain how this code works?

However, I want to ask further is that for example below 1 appears twice, under “Less than 1 year” and “1-2”. Or 20 under “11-20” and “More than 20 years”.
How do we know when we run this code it would not double count?

NaN 164
Less than 1 year 77
3-4 72
1-2 68
11-20 49
More than 20 years 48
5-6 36
7-10 30
5.0 30
3.0 28

Hi @willx

If we use extract then it extracts data according to regular expressions which is in the bracket, i.e.
in this case (\d+).(it is also called as capture group)

Here, \d —> Any digit character([0-9])
and + —> one or more times

As a whole, we can say that extract data which has any digit character either one or more times.

Example:-
‘1-2’ ----> ‘1’
‘Less than 1 year 77’ —> ‘1’
‘11-20’ —> ‘11’

Hi i understand that. here is my problem - if we are going to put it in buckets, let’s say

  1. Less than 1 Year - X
  2. Between 1 - 2 Years - Y

If we extract it that way say getting the first number that comes along, wouldn’t that return
Year 1 - X + Y?
Shouldn’t we take only X, instead of X + Y?

or is my illustration a bit unclear here?

If you are extracting using the above regular expression then from string X it return ‘1’ and from string Y also it return ‘1’. Both will not match with each other(X + Y).

How do i explain this.

NaN 164
**Less than 1 year** 77
3-4 72
**1-2** 68
11-20 49
More than 20 years 48
5-6 36
7-10 30
5.0 30
3.0 28

Assuming , there is bin - something like Less than 1 Year. Another bin is more than 1 Year.
However, as the regex here is pulling the first digit right?
so the regex would pull all out 2 (1s) - with 1 value 77, and another 68.
Which i don’t think is what we wanted right?

NaN 164
**Less than 1 year** 77
3-4 72
**1-2** 68
11-20 49
More than 20 years 48
5-6 36
7-10 30
5.0 30
3.0 28

Here, numbers corresponding to the values are the frequency of the value. Or we can say that the value “Less than 1 year” is in 77 rows of the data. Similarly, the value “1-2” is in other 68 rows of the data.

Hi @willx. I think I get what you’re asking, but I’m not super great with regex so I’ve been experimenting with str.extract() to see what’s going on. Since your pattern only has one capture group, it’s going to extract the first instance of number. For 11-20, it will only extract the 11, not the 20. So the 49 entries for 11-20 will be classifed as 11, and won’t end up double counted with the more than 20 years.

For the less than 1 year and 1-2, those are going to end up lumped together because we’re extracting the 1 from each. This is okay though, because I think in that project we’re creating groups based on ranges of experience (like “less than 5 years” and “between 5-10 years”). It’s the best we can do when the data is inconsistent like this and we can’t really group by individual years (like “1 year”, “2 years”, etc).

Does that answer what you’re looking for?

yeah, that’s what i was thinking. so i think for that particular problem, it may not be an issue specifically.

however, let’s say for argument’s sake that if we are to bin it by 1 year, 2 years, any idea of how we would solve that problem?
am going through regex myself, now so also still experimenting on the idea

It would probably depend on how the data is entered whether or not we could get more granular with the groupings. In this dataset we have some rows that say 3-4 years, and others that say 7-10 or 11-20, which is very inconsistent. I think it would be pretty hard to get any more granular. If the data ranges are more consistent (or you just focus on years less than 7), you could probably make a case for distinguishing between “less than 1 year” and “1-2 years”.

I was trying to figure out how to create a capture group where it would optionally have words (Less than/more than) and I couldn’t figure it out. My next strategy might be to clean up the column and change “Less than 1 year” to 0 (so there isn’t a 1 anymore) to eliminate the confusion between the 2 columns. :thinking: Not sure if that makes sense, but that’s how I would do it if I couldn’t get a pattern to work.

1 Like

I got stuck on this page but some general questions and comments:

  • You could use e.g. str.replace(“Less than 1 year”, “0”)?
  • Any idea how to average e.g. 5-6? I though you could split it and then divide the list by 2 but I couldn’t work out the code.

The answer to this was quite impressive, it’s simplicity inversely matched by the long winded approach I took trying to do things like the above!

Didn’t see that regex’s were the next topic, so don’t stress if you didn’t get this first go