Stuck at page 4 of Guided Project: Building a database for crime reports

Screen Link:

My Code:

def get_col_value_set(csv_filename,col_index):
    with open(csv_filename, "r") as file: 
        next(file)
        f=csv.reader(file)
        f=list(f)
        val=[]
        for row in f:
            if row[col_index] not in val:
                val.append(row[col_index])
        return (val)        
        
for col in range(len(col_headers)):
    val=get_col_value_set("boston.csv",col) 
    length=len(val)
    print(col_headers[col], length)

What I expected to happen:
I expected the print call in the for loop to output each column header with its number of distinct values.

What actually happened:

I get no output and no error at all.

@charulagarwal

An "if ... not in list" check has O(N) time complexity, because testing membership in a list may require comparing the value against every element already in it.

Because that check sits inside the for loop, your function has quadratic time complexity, O(N^2).
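To make the difference concrete, here is a minimal sketch that deduplicates the same values with a list and with a set. The function names and the 5,000-item input are made up for illustration, and the timings will vary by machine.

```python
import time

def unique_with_list(values):
    """Deduplicate using a list: each 'not in' scans the list (O(N))."""
    seen = []
    for v in values:
        if v not in seen:      # linear scan over everything seen so far
            seen.append(v)
    return seen

def unique_with_set(values):
    """Deduplicate using a set: each add is O(1) on average."""
    seen = set()
    for v in values:
        seen.add(v)
    return seen

# Hypothetical input: all values distinct, the worst case for the list version
data = list(range(5000))

start = time.perf_counter()
list_result = unique_with_list(data)
list_seconds = time.perf_counter() - start

start = time.perf_counter()
set_result = unique_with_set(data)
set_seconds = time.perf_counter() - start

print(f"list: {list_seconds:.4f}s, set: {set_seconds:.4f}s")
```

Both versions find the same distinct values; only the cost of each membership check differs.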

The dataset has roughly 298,000 rows. This means that, in the worst case, your function has to do on the order of 298,000 x 298,000 comparisons, or about 88.8 billion, which could take a very long time. I'm not sure exactly how long, but I ran your code and interrupted it after 5 minutes.
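As a quick check of the arithmetic, the N x N figure for roughly 298,000 rows can be computed directly. The second number is my own addition: it assumes every value in the column is distinct, in which case the i-th row is compared against the i - 1 values already in the list.

```python
n_rows = 298_000  # approximate row count quoted above

# Upper bound: every row compared against N values
upper_bound = n_rows * n_rows
print(f"{upper_bound:,}")       # 88,804,000,000

# All-distinct column: 0 + 1 + ... + (N - 1) comparisons
all_distinct = n_rows * (n_rows - 1) // 2
print(f"{all_distinct:,}")      # 44,401,851,000
```

Either way the count is in the tens of billions, which is why the loop appears to hang.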

If you want to write a function that will execute much faster, you can use a set as the instructions suggest:

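import csv

def get_col_set(csv_filename, col_index):
    with open(csv_filename) as file:
        next(file)  # skip the header row
        reader = csv.reader(file)
        values = set()  # a set keeps only distinct values
        for row in reader:
            values.add(row[col_index])  # O(1) on average, no list scan
    return values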

import time
values_per_col_index = {}

start = time.time()
for i in range(7):
    values_per_col_index[i] = len(get_col_set('boston.csv', i))
end = time.time()
runtime = end - start
7.533519983291626

This took only ~7.5 seconds on my laptop.
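The loop above reopens and rereads the file once per column. If that matters, one possible variation is to collect a set for every column in a single pass. This is only a sketch: get_all_col_sets is a hypothetical helper, not part of the lesson, and the tiny in-memory CSV below is made-up data standing in for boston.csv.

```python
import csv
import io

def get_all_col_sets(file_obj):
    """Return (headers, col_sets): one set of distinct values per column."""
    reader = csv.reader(file_obj)
    headers = next(reader)                 # first row holds the column names
    col_sets = [set() for _ in headers]    # one empty set per column
    for row in reader:
        for col_set, value in zip(col_sets, row):
            col_set.add(value)
    return headers, col_sets

# Made-up sample data (not the real boston.csv columns)
sample = io.StringIO(
    "id,code,day\n"
    "1,619,Friday\n"
    "2,619,Monday\n"
    "3,3115,Friday\n"
)

headers, col_sets = get_all_col_sets(sample)
for header, col_set in zip(headers, col_sets):
    print(header, len(col_set))
```

With a real file you would pass the object returned by open('boston.csv', newline='') instead of the StringIO sample.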

Or use pandas without the function (though the lesson asks us to use the function in the next section):

import pandas as pd

boston = pd.read_csv('boston.csv')

values_per_col = {}

start = time.time()
for col in boston.columns:
    values_per_col[col] = len(boston[col].unique())
end = time.time()
runtime = end - start

runtime
0.06252241134643555

That took only about 0.06 seconds on my laptop (the timer starts after pd.read_csv, so loading the file is not included).
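As a side note, pandas can also count distinct values for all columns at once with DataFrame.nunique. The sketch below uses a small made-up DataFrame rather than boston.csv. One difference to be aware of: nunique ignores missing values by default, while len(series.unique()) counts NaN as a value, so dropna=False is needed to get matching counts.

```python
import pandas as pd

# Made-up data standing in for the real dataset
df = pd.DataFrame({
    "a": [1, 1, 2, None],      # contains one missing value
    "b": ["x", "y", "x", "x"],
})

# Counts per column, with NaN counted as a distinct value
counts_with_nan = df.nunique(dropna=False).to_dict()

# Default behaviour: missing values are ignored
counts_default = df.nunique().to_dict()

print(counts_with_nan)
print(counts_default)
```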