Screen Link:
My Code:
def get_col_value_set(csv_filename, col_index):
    with open(csv_filename, "r") as file:
        next(file)
        f = csv.reader(file)
        f = list(f)
    val = []
    for row in f:
        if row[col_index] not in val:
            val.append(row[col_index])
    return val

for col in range(len(col_headers)):
    val = get_col_value_set("boston.csv", col)
    length = len(val)
    print(col_headers[col], length)
What I expected to happen:
I expected the print call in the for loop to produce output.
What actually happened:
I get no output and no error at all.
@charulagarwal
The if ... not in list check has O(N) time complexity, since you are checking for membership of a value in a list. By including that check within the for loop, your function has quadratic time complexity, O(N^2).
The dataset is roughly 298,000 rows. This means your function will, in the worst case, perform on the order of 298,000 x 298,000, or about 88.8 billion, comparisons, which could take a long while. I don't know exactly how long it takes, but I ran your code and interrupted it after 5 minutes.
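To make the quadratic growth concrete, here is a minimal sketch that spells out the scan a list membership test performs and counts the comparisons. The function name count_list_comparisons is hypothetical, and the input is synthetic all-unique data (the worst case), not the Boston file.

```python
def count_list_comparisons(values):
    """Deduplicate with a list and count the element comparisons made."""
    seen = []
    comparisons = 0
    for value in values:
        found = False
        # This inner scan is what `value in seen` does for a list.
        for existing in seen:
            comparisons += 1
            if existing == value:
                found = True
                break
        if not found:
            seen.append(value)
    return seen, comparisons


# Doubling the number of unique values roughly quadruples the work.
for n in (1_000, 2_000, 4_000):
    _, comparisons = count_list_comparisons(range(n))
    print(n, comparisons)
```

With all-unique input the count is exactly N(N-1)/2, so for N = 298,000 it comes to about 44.4 billion comparisons. Columns with only a handful of distinct values finish quickly, because the list being scanned stays short; it is the columns with many distinct values that make the loop appear to hang.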
If you want to write a function that will execute much faster, you can use a set as the instructions suggest:
import csv
import time

def get_col_set(csv_filename, col_index):
    with open(csv_filename) as file:
        next(file)  # skip the header row
        reader = csv.reader(file)
        values = set()
        for row in reader:
            values.add(row[col_index])
    return values

values_per_col_index = {}
start = time.time()
for i in range(7):
    values_per_col_index[i] = len(get_col_set('boston.csv', i))
end = time.time()
runtime = end - start
runtime
7.533519983291626
This took only ~7.5 seconds on my laptop.
Or use pandas without the function (though the lesson asks us to use the function in the next section):
import time
import pandas as pd

boston = pd.read_csv('boston.csv')
values_per_col = {}
start = time.time()
for col in boston.columns:
    values_per_col[col] = len(boston[col].unique())
end = time.time()
runtime = end - start
runtime
0.06252241134643555
This took only about 0.06 seconds on my laptop, not counting the time read_csv needs to load the file.
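If you want to stay with the csv module, one possible refinement of the set approach is to read the file once and keep one set per column, instead of reopening the file for every column. The sketch below is only an illustration: count_unique_per_column is a hypothetical name, and the sample rows are made up to stand in for boston.csv.

```python
import csv
import os
import tempfile


def count_unique_per_column(csv_filename):
    """Return {header: number of distinct values}, reading the file once."""
    with open(csv_filename, newline="") as file:
        reader = csv.reader(file)
        headers = next(reader)  # first row holds the column names
        seen = [set() for _ in headers]  # one set per column
        for row in reader:
            for column_set, value in zip(seen, row):
                column_set.add(value)
    return {header: len(values) for header, values in zip(headers, seen)}


# Made-up sample data standing in for boston.csv (an assumption).
sample_rows = [
    ["incident", "district", "day"],
    ["1", "A1", "Mon"],
    ["2", "A1", "Tue"],
    ["3", "B2", "Mon"],
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", newline="", delete=False
) as tmp:
    csv.writer(tmp).writerows(sample_rows)

counts = count_unique_per_column(tmp.name)
os.remove(tmp.name)  # clean up the temporary file

for header, count in counts.items():
    print(header, count)
```

Taking the headers from the first row also covers the col_headers lookup from the original question, so the loop can print each column name next to its count.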