Why does the wildcard [a-z] sometimes yield unintuitive results (like a<A<e)?

I couldn’t find any explanation of the relationship between the sort command and the wildcard [a-z].


Hi @tattunglam,

The explanation is similar to the one in lesson 386: the system’s locale affects the sorting order. People normally expect sorting to be based on ASCII/Unicode code point values, which orders the letters the way they expect and is how Python’s built-in sort function works. The shell’s sort command works differently.
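To make the difference concrete, here is a minimal Python sketch. The first sort is Python's real code-point ordering. The second uses `collation_like_key`, a hypothetical stand-in I made up to imitate a locale-style ordering (letters compared case-insensitively first, lowercase before uppercase on ties); it is an assumption for illustration, not the actual algorithm the shell uses.

```python
letters = ["e", "A", "a", "Z", "z", "B"]

# Code-point order: what Python's built-in sorted() uses for strings.
# Uppercase A-Z (65-90) all come before lowercase a-z (97-122).
by_code_point = sorted(letters)

def collation_like_key(s):
    # Hypothetical two-level key (an assumption, not the real locale rules):
    # level 1 compares the letter ignoring case,
    # level 2 breaks ties by placing lowercase before uppercase.
    return [(ch.lower(), ch.isupper()) for ch in s]

by_collation_like = sorted(letters, key=collation_like_key)

print(by_code_point)      # ['A', 'B', 'Z', 'a', 'e', 'z']
print(by_collation_like)  # ['a', 'A', 'B', 'e', 'z', 'Z']
```

In the second list, a comes before A, which comes before e, the same a<A<e pattern from the question.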

To be honest, it is an overwhelming topic and probably requires a very deep dive and a lot of time to understand. The closest answer I can find is this StackOverflow answer and it has a link to the document that describes the algorithm. Here’s the link to the document: UTS #10: Unicode Collation Algorithm.

The most relevant excerpt from the standard would be this (emphasis is not mine but the authors’):

1.1.1 Collation Order and Code Chart Order

Many people expect the characters in their language to be in the “correct” order in the Unicode code charts. Because collation varies by language and not just by script, it is not possible to arrange the encoding for characters so that simple binary string comparison produces the desired collation order for all languages. Because multi-level sorting is a requirement, it is not even possible to arrange the encoding for characters so that simple binary string comparison produces the desired collation order for any particular language. Separate data tables are required for correct sorting order. For more information on tailorings for different languages, see [CLDR].

The basic principle to remember is: The position of characters in the Unicode code charts does not specify their sort order.

My understanding is that, because the algorithm has to account for many locales and languages, it cannot rely on a simple rule such as comparing ASCII values. Instead, it uses a more powerful multi-level algorithm, which unfortunately produces unintuitive results for anyone who doesn’t know how it works (probably most people, myself included, judging by how many are confused about this).

There are ways to make the result more “natural” but they might cause issues if you’re not careful. The safest method is to be aware of how the sorting works for your computer’s locale (in this case, Dataquest’s platform) and figure out a pattern that can give you what you want.
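As a rough illustration of why it helps to know the ordering, the sketch below checks which ASCII letters fall between a and z under two orderings. `collation_like_key` and `in_range` are hypothetical helpers, and the a A b B ... z Z ordering is an assumption about what a locale might do, so check the behaviour on your own system rather than trusting this.

```python
import string

def collation_like_key(ch):
    # Hypothetical ordering a A b B ... z Z (an assumption for illustration).
    return (ch.lower(), ch.isupper())

def in_range(ch, lo, hi, key):
    # True if ch sorts between lo and hi (inclusive) under the given key.
    return key(lo) <= key(ch) <= key(hi)

letters = string.ascii_letters  # a-z followed by A-Z

# Code-point ordering: only the lowercase letters fall inside a..z.
code_point_matches = [c for c in letters if in_range(c, "a", "z", ord)]

# Collation-like ordering: uppercase A..Y also fall inside a..z, but Z does not,
# because Z sorts after z.
collation_matches = [c for c in letters if in_range(c, "a", "z", collation_like_key)]

print("".join(code_point_matches))  # abcdefghijklmnopqrstuvwxyz
print("".join(collation_matches))   # abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXY
```

Under this assumed ordering, a range written as a-z would pick up most uppercase letters too, which is the kind of surprise described above.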


The things I’ve read to (barely) understand why the result is unintuitive:
