Quick question about course answer vs my answer (set() + list() vs if not in)

Hi there,

Just a quick question about the provided solution vs my own. The mission suggests adding every word from each row in the SMS column to a list called vocabulary. When its done, it says to convert the vocabulary list to a set using the set() function, then convert it back to a list using the list() function. Apparently this allows you to remove duplicates.

Before reading the whole question (a mistake I often make!) I iterated over each word but instead of using set() and list() at the end, I used an if statement within the for loop to skip words that were already in the list.

I believe it resulted in the same list (once sorted), but is my way inefficient/slow in comparison? Is using an if statement in this manner frowned upon in some way?

Apologies if this is a dumb question or answered elsewhere in the course!

Code below.

Screen Link:

My Code:

trainingset['SMS'] = trainingset['SMS'].str.split()
vocabulary = []
for row in trainingset['SMS']:
    for word in row:
        if word not in vocabulary:
            vocabulary.append(word)

Solution Code (using my variable names):

trainingset['SMS'] = trainingset['SMS'].str.split()
vocabulary = []
for row in trainingset['SMS']:
    for word in row:
        vocabulary1.append(word)
        
vocabulary = list(set(vocabulary))

in membership checking (different from the for i in iterable which is a “hardcoded into language” in) is usually done with list, set, dict. in list is O(n) complexity while latter 2 are O(1), set uses the same hashtable structure as dict, just having no meaningful values.

Both your code and solution have 2 for loops, leading to at least O(r*n) row*longest_sentence_length complexity. However your code additionally did an in check in vocabulary every inner loop. This is extra O(n) work for a total of O(r*n*n). As vocabulary list grows, this check becomes slower and slower. The solution is to initiaize vocabulary = set() because set membership checks are O(1).

Also, it lets you remove the if word not in vocabulary: check because set().add() automatically handles the check. Usually in algorithm design people use the in check to control other elements so it must be there, but in your case you are controlling whether to add to set ornot, which is already done by set().add() so no need to manually check.

This is somewhat analogous to coding maximum = max(maximum, element) to replace

maximum = -math.inf
for element in [2,1,3]:
    if element > maximum:
        maximum = element        

max/min() at each stage is a commonly seen pattern in optimization problems solved with greedy/dynamic programming approaches. It’s less verbose than using if and achieves the same.

The solution could have similarly used set from the beginning.
It wasted another O(n) (n=length of list) complexity turning vocabulary to set, then another O(n) (n=length of set) from set to list.

The above definitions of n are not very clear. It generally means the size of input. Size can be interpreted as value of integer if input is integer (10>9), or the length of iterable, [1,2,3] > [20,12], or the length of iterable on each dimension if more than 1 dimension (usually 2D table in bottom-up dynamic programming), or V,E (counts of vertices, edges) to describe complexity of graph algorithms