from csv import reader

opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
apps_data = list(read_file)

content_ratings = {'4+': 0, '9+': 0, '12+': 0, '17+': 0}

# Count how many apps fall under each content rating (column index 10)
for row in apps_data[1:]:
    c_rating = row[10]
    if c_rating in content_ratings:
        content_ratings[c_rating] += 1

print(content_ratings)
This is how I solved it:
from csv import reader

opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
apps_data = list(read_file)

content_ratings = {'4+': 0, '9+': 0, '12+': 0, '17+': 0}

# For each rating key, scan every row and count the rows that contain it
for key in content_ratings.keys():
    for item in apps_data[1:]:
        if key in item:
            content_ratings[key] += 1

print(content_ratings)
It’s absolutely a valid solution, and a creative one at that, but in a technical sense it is somewhat suboptimal.
Do this:
import time
from csv import reader

opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
apps_data = list(read_file)

content_ratings = {'4+': 0, '9+': 0, '12+': 0, '17+': 0}

start = time.time()  # note the parentheses: time.time() returns the current time in seconds
for key in content_ratings.keys():
    for item in apps_data[1:]:
        if key in item:
            content_ratings[key] += 1
print(content_ratings)
end = time.time()
print(end - start)
Then compare that with the time taken by the suggested answer, which uses only a single for loop.
For me, the nested for loop version you used took 0.04123258590698242 seconds, while the single for loop, the way my original answer was written, took 0.005053997039794922 seconds. That obviously isn’t a difference a human can perceive, but it would become more apparent if the dataset were bigger.
The example above presents a terrific and very easy way of checking the efficiency of your code! Unsurprisingly, I observed that going the route without the nested for loop took about 1/8th of the time. You should try it out too!
For clarity, these are the only 4 lines of code you need, so it’s really easy to remember:
import time
start = time.time()
# Insert the for loop, or any other lines of code you want to time, here.
end = time.time()
print(end - start)
Also, keep in mind that in some of the later missions you’re going to have to construct the dictionary yourself, because you won’t know in advance which values you’ll be searching for, so you won’t be able to rely on dict.keys().
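As a rough sketch of what that looks like, you can start from an empty dictionary and add keys as you meet them (the file name and column index 10 are just carried over from the code above):

from csv import reader

opened_file = open('AppleStore.csv')
apps_data = list(reader(opened_file))

content_ratings = {}  # no keys hardcoded up front
for row in apps_data[1:]:
    c_rating = row[10]
    if c_rating in content_ratings:
        content_ratings[c_rating] += 1
    else:
        content_ratings[c_rating] = 1  # first time we see this rating

print(content_ratings)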
The answer uses in on the dictionary, which calls its __contains__ method and checks in O(1) time (constant time, independent of how many key:value pairs the dictionary holds) whether the key is present. Your solution instead iterates over the dictionary keys (which calls __iter__()), so walking through the dictionary is O(n), linear time, and that is why your solution takes longer overall. If you say "But I am using in in if key in item:", note that there you are testing membership in a list (the row), which scans the row element by element; it fills the same role as the answer's c_rating = row[10], an O(1) index lookup, but without the hashing speed-up, so it gives you no advantage.
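If you would like to see that difference in isolation, here is a small standalone illustration (the sizes and values are arbitrary, not taken from the dataset): membership testing against a dict hashes the key, while membership testing against a list scans it element by element.

import timeit

keys = [str(i) for i in range(10000)]
as_list = keys
as_dict = dict.fromkeys(keys, 0)

# 'in' on a dict: hash lookup, roughly constant time regardless of size
print(timeit.timeit("'9999' in as_dict", globals=globals(), number=1000))

# 'in' on a list: walks the elements one by one, so time grows with the length
print(timeit.timeit("'9999' in as_list", globals=globals(), number=1000))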
An important difference is that the answer hardcodes column index 10, while you are searching for the value anywhere in the entire row. In practice it is very common for information to end up in the wrong column because of erroneous human input; the answer's method won't count such a row, whereas your solution will report a match even when the value sits in the wrong column. Besides human error, \n newline characters combined with imperfect SQL queries can also leave values in the wrong columns, so always looking at the column where you expect the value to be is the more robust approach. Hardcoding 10 isn't ideal either, though: if a column is ever inserted before it, index 10 no longer refers to the column you want. Using named columns or named tuples is better here, because you can reference the correct item by name no matter how the columns shift around.
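With nothing but the standard library, one way to get named columns is csv.DictReader. This is only a sketch: I am assuming the content-rating column in AppleStore.csv is headed cont_rating, so check your file's header row and adjust the name if needed.

from csv import DictReader

content_ratings = {'4+': 0, '9+': 0, '12+': 0, '17+': 0}

with open('AppleStore.csv') as opened_file:
    for row in DictReader(opened_file):   # each row is a dict keyed by the header names
        c_rating = row['cont_rating']     # assumed column name; adjust to your header
        if c_rating in content_ratings:
            content_ratings[c_rating] += 1

print(content_ratings)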
Questions for you:

1. What if you keep your looping order (over content_ratings.keys() in the outer loop, then over apps_data inside), but fix the column you look at and replace the inner loop through apps_data with in instead? This requires you to first extract all the apps_data values in that specific column.
2. If you did 1, there is really no need to re-extract that same column on every pass of the outer loop over the dictionary. What if you produced that list of values once, then went through the dictionary? What savings does that offer? (A possible sketch follows below if you want to check your thinking.)
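If you want to compare notes afterwards, here is one possible sketch for 2, under one reading of the question, where list.count() stands in for the explicit inner loop:

# Extract the column of interest once, outside the dictionary loop
ratings_column = [row[10] for row in apps_data[1:]]

content_ratings = {'4+': 0, '9+': 0, '12+': 0, '17+': 0}
for key in content_ratings:
    # count() scans the pre-extracted column instead of every full row
    content_ratings[key] = ratings_column.count(key)

print(content_ratings)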
Everything you see in this problem is a looping approach. When you move on to the numpy and pandas libraries you will be exposed to vectorized methods, which perform the same operations much faster, without explicit looping.
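If you want a quick taste of that, here is a minimal sketch, assuming you have pandas installed and that the content-rating column in your file is named cont_rating:

import pandas as pd

apps = pd.read_csv('AppleStore.csv')

# value_counts() builds the whole frequency table in one vectorized call, no explicit loop
print(apps['cont_rating'].value_counts())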
As commented by @hanqi, I strongly recommend using Counter.

The code given in the original post shows how to approach the problem of counting values without Counter. A Counter object expresses exactly what the problem asks for, namely the count of each value, and using it significantly improves the code's readability.
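For reference, here is a minimal sketch with Counter, reusing apps_data from the code above; the tuple of ratings simply mirrors the keys of the original dictionary:

from collections import Counter

# Tally every value in column index 10 in a single pass
rating_counts = Counter(row[10] for row in apps_data[1:])

# Keep only the ratings the exercise asks about (missing keys count as 0)
content_ratings = {rating: rating_counts[rating] for rating in ('4+', '9+', '12+', '17+')}
print(content_ratings)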
If you choose not to use Counter, then I suggest defaultdict: you don’t have to initialize a count the first time you meet a new value.
Here is an example, building up to defaultdict:
colors = ['red', 'green', 'red', 'blue', 'green', 'red']

d = {}
# basic way to build a count dictionary
for color in colors:
    if color not in d:
        d[color] = 0
    d[color] += 1
If colors is a long list, the if color not in d check is extra work on every single iteration.
>>> d
{'blue': 1, 'green': 2, 'red': 3}
The next improvement is to use the get method:
d = {}
for color in colors:
    # get returns the current count for this color, or 0 if it isn't in d yet
    d[color] = d.get(color, 0) + 1

>>> d
{'blue': 1, 'green': 2, 'red': 3}
To further improve the code, use defaultdict.
from collections import defaultdict

# the default value produced by int() is 0, so missing keys start at zero automatically
d = defaultdict(int)
for color in colors:
    d[color] += 1
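As a quick check, the counts come out the same as before, and dict(d) converts the defaultdict back to a plain dictionary if you ever need one:

>>> d['red'], d['green'], d['blue']
(3, 2, 1)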