How greedy vs non-greedy regex behave?

I was comparing 2 regex patterns on a .

  1. a? (greedy)
  2. a?? (non-greedy)


Questions

Why does non-greedy have Match 2?
I expected non-greedy to be matching as little as possible looking rightwards from each starting position in the a string. I expect there are 2 starting positions, 1 before and 1 after the a.
I may have a wrong understanding of what “starting positions” are, and how regex engine does its recursion.

The quantifier ? means 0 or 1 matches. I expect greedy to mean at each starting position, match the longest possible (1 > 0 so 1 in the case of ?) string. The 2 results of a? makes sense to me. 1st match begins from a position before a and tried to go for the longer option (0 vs 1 from ? quantifier) of matching 1 a. 2nd match begins from after a and could only match zero-length because there are no more a behind the 1st a in string.

I also expect non-greedy to mean at each starting position, match the shortest possible (0) string, so i thought Match 2 of a?? regex should not appear.

Hello @hanqi
Different flavors react differently. Check for the python flavor, you’ll see that there are only 2 matches. All of which are null.
?? - always considered as lazy excludes an item, if possible. In simple terms, "match the rest of the word, except the immediate preceding letter of ??"

In this StackOverflow answer, it answered most of your doubts:

Thanks @info.victoromondi for the link. I went through that answer and it didn’t address my question. Instead it provided more evidence supporting the fact that Match 2 should not have appeared in non-greedy.
Could you address which part of my reasoning above is wrong directly?

What code led you to that? Below is my example using Python standard library re module.
It gives the same result as regex101 in my first post above, which i don’t understand.

list(re.finditer(r'a??','a'))
Output:

[<re.Match object; span=(0, 0), match=''>,
 <re.Match object; span=(0, 1), match='a'>,
 <re.Match object; span=(1, 1), match=''>]

I used regex101:

and selected Python in Flavor

This page solved my questions: Zero-Length Regex Matches

Important points

  1. PCRE behaves different from Python < 3.7, but same as Python >= 3.7 (regex101’s python is likely < 3.7, thus @info.victoromondi 's 2 matches only instead of 3).
  2. Different engine have different behaviour after a zero-length match
  • Stay at same place in string (PCRE, Python >= 3.7)
  • Move 1 step ahead in string (Python < 3.7, JGSoft has extra restriction on top of this)

If it was zero-length, the engine makes note of that, as it must not allow a zero-length match at the same position.

Above statement explains why when stepping through regex debugger, it looks like next match is repeating the previous match.


Match 2 repeats 6 steps from match 1, but match gets rejected due to above quoted statement, then does its own 6 steps to find it’s match from a new starting position in string
Match 3 repeats 6 steps from match 2, and 6 of its own from new starting position
Match 4 repeats 6 steps from Match 3, and 3 of its own from new starting position to fail.

Good to know: Each time the string moves to a new starting position, regex101 debugger will show a red arrow moving the regex all the way to the beginning for a retry.

list(re.finditer(r'a??','a')) gave 3 matches ('','a','') because it was Python 3.8.12 >= 3.7 required to match regex101 PCRE behaviour in my first post.

Understanding why '','a',''
After 1st zero-length match, position in string did not advance.
a?? attempted to non-greedy match 0 times, but this fails because of above quote, so it matched length 1 string since ? allows 0 or 1.
After this match, string advances to after a , and regex restarts from the left (red back arrow in regex101 debugger) to again non-greedy match zero-length to give the second ''.