What are the advantages of the DQ solution over my solution?

Screen Link: https://app.dataquest.io/m/1000353/working-with-dates-and-times-in-python-practice-problems/13/maximum-number-of-simultaneous-sessions

First of all, I´m not questioning the DQ solution, the DQ content writers have much more experience in coding than I do. But I would love to know what advantages it has over my solution. Is it more scalable? more explicit?

I´ve checked the time necessary to finish both codes, mine seems to be somewhat faster.

My Code:

# Write your solution below
from csv import reader
f = open('session_log.csv')
read_f = reader(f)
session_log = list(read_f)[1:]

# Create all events
events=[]
for row in session_log:
    row = [int(x) for x in row]
    session_start = dt.datetime(row[1], row[2], row[3], row[4], row[5])
    session_end = dt.datetime(row[-5], row[-4], row[-3], row[-2], row[-1])
    events.append([1, session_start])
    events.append([-1, session_end])
 
 # Sort the events by time  
events.sort(key = lambda x: x[1])

# Calculate the maximum number of simultaneous sessions
session_count = 0
max_sessions = 0
for event in start_end:
    session_count += event[0]
    if session_count > max_sessions:
        max_sessions = session_count

The DQ solution:

import datetime as dt

class Event():
    
    def __init__(self, time, is_start):
        self.time = time
        self.is_start = is_start
    
    def __lt__(self, other):
        """
        Function used to sort the events
        """
        if self.time == other.time:
            return self.is_start
        return self.time < other.time

# Read the CSV file
import csv
f = open('session_log.csv')
reader = csv.reader(f)
rows = list(reader)[1:]

# Create all events
events = []
for row in rows:
    row = [int(x) for x in row]
    session_start = dt.datetime(row[1], row[2], row[3], row[4], row[5])
    session_end = dt.datetime(row[6], row[7], row[8], row[9], row[10])
    start_event = Event(session_start, True)
    end_event = Event(session_end, False)
    events.append(start_event)
    events.append(end_event)

# Sort the events by time, break ties by putting end events first
events.sort()

# Calculate the maximum number of simultaneous sessions
session_count = 0
max_sessions = None
for event in events:
    if event.is_start:
        session_count += 1
    else:
        session_count -= 1
    if max_sessions == None or session_count > max_sessions:
        max_sessions = session_count

Your solution is missing the check if times are equal.

if self.time == other.time:
            return self.is_start

This means it only works correctly because you appended the times in the order of

events.append([1, session_start])
events.append([-1, session_end])

and are depending on the fact that list.sort is stable: https://stackoverflow.com/questions/1915376/is-pythons-sorted-function-guaranteed-to-be-stable#:~:text=The%20built-in%20sorted(),%2C%20then%20by%20salary%20grade).

The DQ solution could have appended things in the reverse order with end going in first and still gave right answer. Your code will get it wrong if you appended end first because it has no logic of putting start before end when timings clash.

events.append(end_event)
events.append(start_event)

I would choose your solution still since it’s much less verbose, and since sorts are guaranteed to be stable, we should make use of that property.

Thank you for your reply.

I´ve got 2 doubts after thinking to over:

The first one is related to the fact that I barely understand this part of the DQ solution

def __lt__(self, other):
        """
        Function used to sort the events
        """
        if self.time == other.time:
            return self.is_start
        return self.time < other.time

I understand that they overload the comparison operator <. And I guess they assign some functionality to the case when times are equal. But I don´t understand what happens in case two times are equal. Could you, please, explain it to me?

I tried to see it with a sample list and I made sure that there were 2 events (session start and session end) with the same time. I modified the class by adding __str__ method in order to be able to print out the events.

Here´s the original list of 20 sessions, the events happened at the same time happened 2018/01/18 01:18:

session_log=[['user_id', 'session_start', 'session_end'], 
             ['1', '2019/02/28 00:25', '2019/02/28 05:09'], 
             ['2', '2018/02/27 17:14', '2018/02/28 16:03'], 
             ['3', '2018/01/17 21:03', '2018/01/18 01:18'], 
             ['4', '2019/09/27 13:13', '2019/09/28 07:19'], 
             ['5', '2018/07/26 10:12', '2018/07/26 16:34'], 
             ['6', '2019/11/02 07:13', '2019/11/03 06:27'], 
             ['7', '2018/01/27 05:22', '2018/01/27 09:36'], 
             ['8', '2018/01/18 01:18', '2018/01/18 19:43'], 
             ['9', '2018/04/17 13:46', '2018/04/18 00:59'], 
             ['10', '2018/09/04 01:16', '2018/09/04 11:35'], 
             ['11', '2019/09/04 20:41', '2019/09/05 06:55'], 
             ['12', '2019/04/01 17:39', '2019/04/02 11:33'], 
             ['13', '2019/02/17 05:12', '2019/02/17 06:11'], 
             ['14', '2019/12/23 03:42', '2019/12/23 16:01'], 
             ['15', '2018/02/28 14:12', '2018/03/01 00:41'], 
             ['16', '2018/10/15 22:07', '2018/10/16 06:14'], 
             ['17', '2018/04/10 06:21', '2018/04/10 12:07'], 
             ['18', '2019/01/30 15:39', '2019/01/31 07:57'], 
             ['19', '2019/11/22 18:02', '2019/11/22 21:34'], 
             ['20', '2019/01/13 08:31', '2019/01/14 09:22']]

And here is the representation of events list after sorting (only using the DQ solution code):

['2018/01/17 21:03', True]
['2018/01/18 01:18', True]
['2018/01/18 01:18', False]
['2018/01/18 19:43', False]
['2018/01/27 05:22', True]
['2018/01/27 09:36', False]
['2018/02/27 17:14', True]
['2018/02/28 14:12', True]
['2018/02/28 16:03', False]
['2018/03/01 00:41', False]
['2018/04/10 06:21', True]
['2018/04/10 12:07', False]
['2018/04/17 13:46', True]
['2018/04/18 00:59', False]
['2018/07/26 10:12', True]
['2018/07/26 16:34', False]
['2018/09/04 01:16', True]
['2018/09/04 11:35', False]
['2018/10/15 22:07', True]
['2018/10/16 06:14', False]
['2019/01/13 08:31', True]
['2019/01/14 09:22', False]
['2019/01/30 15:39', True]
['2019/01/31 07:57', False]
['2019/02/17 05:12', True]
['2019/02/17 06:11', False]
['2019/02/28 00:25', True]
['2019/02/28 05:09', False]
['2019/04/01 17:39', True]
['2019/04/02 11:33', False]
['2019/09/04 20:41', True]
['2019/09/05 06:55', False]
['2019/09/27 13:13', True]
['2019/09/28 07:19', False]
['2019/11/02 07:13', True]
['2019/11/03 06:27', False]
['2019/11/22 18:02', True]
['2019/11/22 21:34', False]
['2019/12/23 03:42', True]
['2019/12/23 16:01', False]

The DQ solution says:

# Sort the events by time, break ties by putting end events first

And in this example above the start event eventually goes first… (Although it´s true that sorting from the DQ code doesn´t depend on the order of appending elements to the list.)

Anyway, I don´t understand how these piece of code works in order to break ties:

 if self.time == other.time:
        return self.is_start

That was my first doubt.

And now the second one:
I have to admit that when coming out with my solution I barely thought about the possibility of events happening on same time. And I´ve got my code working correctly only by chance and due to stable sorting as you mentioned.
But after having thought about it for a while I think that when counting the number of simultanuos sessions and having 2 events of different nature (start of a session and end of another session) happen at the same time, ideally the counter value shouldn´t change 'cause +1 and -1 at the same time return 0. Neither solution has it implemented. What do you think about it?

The purpose of overriding these comparison operators is to help the python framework understand how to sort when you have non primitive types like classes that may contain many members. You just have to return a boolean defining which of 2 objects is smaller.

if self.time == other.time:
            return self.is_start

This means when times are equal, check the value of self.is_start, if it is true (meaning self is start), then this class instance must be ordered before the other one. If you are overriding __gt__, that means the object with member attribute is_start as true will be ordered after.

There are 4 permutations of time clashes. ss,se,es,ee, and ss and ee don’t matter to the result since they both +1 or -1, meaning they affect max_sessions in the same way no matter what order. If it happens that the code is comparing ee or ss, i’m not sure which instance will be placed first, my guess is python will only compare every pair once for a total ~n(n+1)/2 comparisons, and whichever object was placed into self will be ordered in front of other in cases of ss and ee.

I think the DQ comment is wrong, should be start events first, please submit a ticket for that.

You are thinking of shortcuts now, and maybe having a bias that less updated variable values are a better way of programming.
The problem is if you want to make the code not update when its es or se, you now have to write additional logic to check which of the 4 permutations mentioned above is happening, and include extra conditionals, some paths update, some don’t. This is way more trouble than it’s worth.
Sometimes it’s better to just let each class do it’s own job even if at a high level you know +1 followed by -1 or vice versa is 0. This makes the thinking and programming simpler, and less consideration of special cases is needed. Such thinking of breaking each piece down to lowest level and let each part do its own job ignoring others is especially important in recursions and dynamic programming in data structures and algorithms interviews.

This leetcode on calculating surface area demonstrates this concept too. You just have to care about what each cube is contributing to the total surface area and do additions and subtractions, even though you know 2 cubes which faces stuck to each other contribute nothing to the total surface area: https://leetcode.com/problems/surface-area-of-3d-shapes/

Besides the concept of “not simplifying” during problem solving, there is even the concept of reframing/making the problem more complex to solve something, such as representing the numbers {3,4} as {1,2,3,4} - {1,2}, just to make use of the property that starting from 1 is easier to handle than starting in middle. (Digit DP | Introduction - GeeksforGeeks). This concept is most commonly seen in calculating probability between a range of random variable values. That probability comes from CDF(higher bound) - CDF(lower bound) because CDF is a convenient function expressing sums all probabilities of the Random Variable from negative infinity to the bound. Observe the analogy here of “converting a range to a difference”.(Using the cumulative distribution function (CDF) - Minitab Express)

Here is another example of why shortcuts may not work in all cases: Understanding the point of the sigmoid function in logistic regression

Sorry, it took me some time to process everything that you´ve said.

It seems that, finally, I´ve got how the overloading of a comparison operator works. When you return a boolean, this boolean will be the answer to the question of whether self is smaller/bigger than other, won’t it?

I still have to learn what is a better way of programming. So, at the moment I choose to trust your expertise.

But by the way,

actually, we don´t need to know which of the 4 permutations is happening, the only thing we need is to update the counter with the sum of all events that happened at the same time, i.e. +2 in case of ss, +0 in case of see or es and -2 in case of ee. If I were to implement this logic, I would do something like this:

session_count = 0
max_sessions = 0
i = 0
while i < (len(events)-1):
    count = events[i][0]
    while events[i+1][1] == events[i][1]:
        count += events[i+1][0]
        i += 1
    session_count += count
    if session_count > max_sessions:
        max_sessions = session_count
    i += 1

session_count += events[i][0]
if session_count > max_sessions:
    max_sessions = session_count

One more time, thank you for your replies, this view from the outside from someone who is more experienced was extremely helpful.

This is what I mean by additional logic. Yes no need to check exactly which of 4, but need to check which of 2 (same/not same), my point being the fact that you need to check is already unnecessarily complex logic.

These are the extra mental burden to the developer and reader:

  1. Having to index into position 1 of each event to check whether is start
  2. Comparing current to next event (have to test for 4 transition permutations when testing code)
  3. Double while loop and adjusting outer loop endpoint to stop 1 before end
  4. Keeping track of/updating while loop i indices.

All these, together with intermediate tracking variables like session_count and count can be avoided with proper appending order during set up of events and using max(np.cumsum([event[0] for event in events])) to get the answer. The list comprehension materialization should be unnecessary so max(np.cumsum(event[0] for event in events)) should work too but not sure why DQ console returns generator_object at xxx rather than the integer answer.

At the beginning of learning, it may be tempting to throw all tools you know into a problem, or make use of all the information you have. Some people love to use map/filter just because they can while others argue that list comprehensions are better (i’m not yet clear why). In this problem, like your 1st solution, the solution does not need to know whether a timing is start or end, because that is kind of encoded in the {1,-1} already for the sake of the solution. Such eliminative thinking is particularly useful in SQL, where in the process of tweaking existing queries (more error prone than writing from scratch), you can throw away columns or even entire tables for performance gains if those information already come from another part of the query.

I forgot to add an advantage of DQ solution. That when it uses classes, it gets all the benefits of OOP. It can easily handle a future where new attributes are added to the class, such as “number of attendees”, “importance/priority of event”. These additional attributes could change the way sorting is implemented, and all that logic will fit nicely into def __lt__. A simple events.sort(key = lambda x: x[1]) may not be able to handle that kind of sorting then. The 1 liner lambda would become a full blown function floating around in codebase not encapsulated inside a class, and since the event is not represented as a class but a list, you would have to use indexes rather than names to get the relevant properties for comparison.

This event can also be subclassed into more particular types of events, where you can rank all events of a particular type in front of another type. The def __lt__ of the subclass will be inherited from the parent Event class.

class obj:
    def __init__(self,value):
        self.value=value

l = [obj(3),obj(2)]
l.sort()
#TypeError: '<' not supported between instances of 'obj' and 'obj'
class obj:
    def __init__(self,value):
        self.value=value
    def __lt__(self,other):
        return self.value<other.value

l = [obj(3),obj(2)]
l.sort()
l[0].value,l[1].value
#(2, 3)
# subclass does not need to have def __lt__
class subclass(obj):
    def __init__(self,value):
        self.value=value
l = [subclass(3),subclass(2)]
l.sort()
l[0].value,l[1].value
#(2, 3)
1 Like

I see what you mean. And I do realize that there´s still a lot to learn, especially in the way of thinking.

This one is a very elegant way to finish my solution.

I wish DQ covered more OOP-related topics and their application in the world of big data. Or, maybe, they are covered, but in the later courses.

1 Like

Here’s a long lesson I like. Even reading it takes a ton of patience and short term memory, so I bet creating such content is extremely difficult. The author has to know how to set up a situation, have multiple ways of organizing it, and explain refactoring. Learning refactoring is a good entry point to cleaner thinking.
https://martinfowler.com/tags/refactoring.html

Left out an important point in my __lt__ implementation demo.

You still need to use functools.total_ordering and define the 2 necessary methods (__eq__ + 1 of 4 other comparisons <=, <, >, >=) if want to use type comparison operators directly in the code, or else will be NotImplementedError. If you just want to use built-in sorted(), l.sort(), then implementing __lt__ alone is enough

Coming back to this thread, i see what a mess my previous answer was. There were important things i didn’t get clear and answered slightly off target.

DQ comment is correct. Even though DQ appended start_event first, after sort it will put end_event first in case of ties because the 2nd position in tuple for end_event is False, which translates to 0,(while True → 1), so False tuples (end_event) will be sorted before True tuples(start_event).

Differences in goals that I didn’t clarify:
DQ and my cumsum solution were both trying to treat each event (no matter start/end) as a step for updating statistics in the algorithm. You solution has less granular updating, and was trying to consider all events occuring at same time first, then summarizing the statistics before adding. All 3 are workable interpretations.

DQ’s way puts end_event before start, so won’t have much difference from yours.
My cumsum solution using your setup

events.append([1, session_start])
events.append([-1, session_end])

will put start_event first, allowing a chance to get a higher global max than DQ solution.
Eg. If current accumulated open sessions are 3, and there is an upcoming 1 start and 1 end, if start came first, it goes 3–>4–>3, so 4 is new max. If end came first, it goes 3–>2–>3, no chance to shoot past 3, even though both end up at 3.