Sorting Multiple Columns When Text Processing

Tags:390-7
Screen Link:

https://app.dataquest.io/m/390/text-processing/7/sorting-on-multiple-columns

In the example given for this section, why is the output
10,1,C,2,86
in the third line (as opposed to the first or second) after running the code

/home/learn$ sort -t"," -k4,5 example_data_no_header.csv

I expected it to be in the first or second line since, for the fourth column, 2<236 and 2<140.

2 Likes

As explained in the content -

For example, if we run sort -t"," -k4,5 example_data_no_header.csv , the command will look at the characters from the fourth and fifth columns as one field only.

So,

10,1,C,2,86

is seen as

10,1,C,286

by the sort.

Now, sort sorts that column in lexicographic order.

So, if you have values like

1, 2, 3, 4, 11, 12, 23

after sorting the above you would get -

1, 11, 12, 2, 23, 3, 4

The above was also explained in the previous Mission Step

So the first sorted value would be 14022 because 1 < 2.

Second value would be 236224 because out of all values that begin with 2, the second character 3 is the smallest. Then third would be 286 because out of all values that being with 2, the second character 8 is higher than the second character 3 in 236224 and so on.

3 Likes

I understand now, thank you.

Hi ,can you clarify please how the sort command orders the numbers ,I understand the -k4,5 but sorting in lexographic order seems complicated to me.thanks

Yeah, it can be confusing.

When working with strings - alphabets specifically, lexicographic order is just alphabetical order.

However, when working with numbers, it’s the numerical order when looking at the digits of the numbers from left to right.

So, if you have the following numbers -

1, 2, 11, 3, 23, 22, 15, 31

Lexicographic order will be -

1, 11, 15, 2

Notice that we look at the first smallest digit, which is 1. Then the second digit of numbers that have a second digit, and the smallest of the second digits is 1, that’s why the second number above is 11.

Once we have exhausted all numbers starting with 1, we move on to the next smallest digit, which is 2.

Based on that this is the final outcome -

1, 11, 15, 2, 22, 23, 3, 31

Now, if there was a 3-digit number like 112, then that would be taken into account as well. 112 would be β€œhigher” than 11 but would be smaller than 15. So the order of numbers in that case would be -

1, 11, 112, 15, 2, 22, 23, 3, 31

There is more to it but the above should suffice. In most cases, you will have functions that will sort it as per the order above for you and it’s not something to worry about. I have, honestly, not encountered this much myself to comment on its usefulness in data science/analysis.

But it should be noted, that when numbers are represented as strings, like -

[β€˜1’, β€˜11’, β€˜15’, β€˜2’, β€˜22’, β€˜23’, β€˜3’, β€˜31’]

Sorting a list of strings as above using those commonly available sorting functions, will sort them in lexicographic order. Sorting them as normal numerical values -

[1, 2, 11, 3, 23, 22, 15, 31]

will sort them in the numerical order you expect.

2 Likes

I also wonder about using sorting parameters.
Why do these commands produce different results?
/home/dq$ sort -t':' -k3,3 -k4,4 -g -r characters_no_header
/home/dq$ sort -t':' -k3,3 -k4,4gr characters_no_header