Does anyone know why this string column can be sorted? (Exploring Ebay Car Sale project)

Hello everyone,

The 5th slide of the Exploring Ebay Car Sale project has a small step that explores the distribution of values in the date_crawled, ad_created, and last_seen columns (all string columns) as percentages.

The code looks like this:

autos["date_crawled"].str[:10].value_counts(normalize = True).sort_index(ascending = True)

and the output is

2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64

I am curious how Python sorted this index when we have not changed the type. As far as I can tell, these dates are still strings.
Does anyone have any insight on this? :grin:


Are you assuming strings have to be turned into numbers to be sorted?
https://docs.python.org/3/reference/datamodel.html
In Python, objects can be compared for order as long as their class defines how. When you define your own class, you implement the rich comparison methods __lt__, __gt__, __le__, __eq__, etc. (or use functools.total_ordering to fill in the rest) to say which attributes decide whether one object is smaller than, equal to, or larger than another. Depending on the application, you may tighten or loosen the conditions for equality.
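Here is a minimal sketch of that idea. The Car class and its price attribute are made up for illustration (they are not from the project); the only assumption is that we want cars ordered by price.

```python
from functools import total_ordering


@total_ordering
class Car:
    """Hypothetical class whose instances are ordered by price only."""

    def __init__(self, name, price):
        self.name = name
        self.price = price

    def __eq__(self, other):
        if not isinstance(other, Car):
            return NotImplemented
        return self.price == other.price

    def __lt__(self, other):
        if not isinstance(other, Car):
            return NotImplemented
        return self.price < other.price


cars = [Car("golf", 3500), Car("polo", 1200), Car("passat", 5000)]

# sorted() only needs __lt__; total_ordering derives <=, > and >= from it.
sorted_names = [car.name for car in sorted(cars)]
print(sorted_names)  # ['polo', 'golf', 'passat']
```

Without __lt__ (or another ordering method), sorted(cars) would raise a TypeError, because Python would not know how to order two Car objects.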
Strings are objects too, and str already defines these methods. Strings are compared character by character from left to right until the first difference breaks the tie, so a longer string is not automatically bigger: a shorter string wins if it has the larger character at the first position where they differ. Each character is compared by its Unicode code point, which is the number that ord() returns. The docs at https://docs.python.org/3/tutorial/datastructures.html#comparing-sequences-and-other-types call this ‘lexicographical order’ (or, more intuitively, dictionary order, as in any real-life dictionary). Your dates sort chronologically as strings because they are all in the zero-padded YYYY-MM-DD format: the most significant part comes first and every part has the same width. Just for the perfectionist who craves symmetry, you can invert ord('2') with chr(50).
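A small sketch of that left-to-right comparison, using a few dates from your output. The unpadded dates at the end are made up to show where string sorting would stop matching date order.

```python
# Code points are what get compared under the hood.
code_point = ord("2")        # 50
back_again = chr(50)         # '2'

a, b = "2016-03-31", "2016-04-01"

# Find the first position where the two strings differ.
first_diff = next(i for i, (x, y) in enumerate(zip(a, b)) if x != y)
print(first_diff, a[first_diff], b[first_diff])  # 6 3 4
print(a < b)                                     # True, because '3' < '4'

# Zero-padded YYYY-MM-DD strings therefore sort chronologically.
sorted_dates = sorted(["2016-04-01", "2016-03-05", "2016-03-31"])
print(sorted_dates)  # ['2016-03-05', '2016-03-31', '2016-04-01']

# Hypothetical unpadded dates: string order no longer matches date order,
# because '1' < '3' at the first differing position.
unpadded = sorted(["2016-3-5", "2016-10-1"])
print(unpadded)  # ['2016-10-1', '2016-3-5']
```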

That’s why you must be careful with types. A number printed on the console may not actually be an int or float but a string. If one side of the comparison contains a letter, you obviously know it is a string, and the other side, which may look like a number, must be a string too; if it weren’t, you would get TypeError: '>' not supported between instances of 'str' and 'int'. Confusion is more likely when both operands are numbers encoded as strings with different lengths and you read them as integers: '222' > '1111' is True as strings but False as numbers.
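To see the difference for yourself, here is a short sketch. The list of prices is made up for illustration.

```python
as_strings = "222" > "1111"            # True: '2' > '1' at the first character
as_numbers = int("222") > int("1111")  # False: 222 < 1111

# Mixing str and int in an ordering comparison raises TypeError.
try:
    "222" > 1111
except TypeError as exc:
    message = str(exc)
print(message)  # '>' not supported between instances of 'str' and 'int'

# Hypothetical numeric strings: default sort is lexicographic,
# key=int sorts by numeric value while keeping the strings.
prices = ["10", "9", "100"]
lexicographic = sorted(prices)       # ['10', '100', '9']
numeric = sorted(prices, key=int)    # ['9', '10', '100']
print(lexicographic, numeric)
```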


Hello Hanqi,

Thank you so much for your informative answer. :+1:

To add to my answer: besides Python, SQL can also compare more than just numbers. Look at its BETWEEN operator, for one, which can compare text, numbers, dates, and probably more. (Most SQL variants have many more data types than Python.)
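A rough sketch of BETWEEN on text, run from Python with the built-in sqlite3 module. The table and its rows are made up for illustration, and SQLite is just a convenient choice here: it stores these dates as TEXT, whereas other databases may have dedicated date types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE autos (name TEXT, date_crawled TEXT)")
conn.executemany(
    "INSERT INTO autos VALUES (?, ?)",
    [
        ("golf", "2016-03-05"),
        ("polo", "2016-03-12"),
        ("passat", "2016-03-31"),
        ("corsa", "2016-04-02"),
    ],
)

# BETWEEN is inclusive on both ends and compares the TEXT values as strings.
rows = conn.execute(
    "SELECT name FROM autos "
    "WHERE date_crawled BETWEEN ? AND ? "
    "ORDER BY date_crawled",
    ("2016-03-10", "2016-03-31"),
).fetchall()
print(rows)  # [('polo',), ('passat',)]
conn.close()
```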