Counting with Dictionaries - Is my solution non-optimal?

I’m on 7. Counting with Dictionaries in the Dictionaries and Frequency Tables mission (https://app.dataquest.io/m/314/dictionaries-and-frequency-tables/7/counting-with-dictionaries) and I noticed that my solution was different than what is under “Get Help.” Is my solution non-optimal or just different?

This is the solution provided:

opened_file = open('AppleStore.csv')
from csv import reader
read_file = reader(opened_file)
apps_data = list(read_file)
content_ratings = {'4+': 0, '9+': 0, '12+': 0, '17+': 0}

for row in apps_data[1:]:
    c_rating = row[10]
    if c_rating in content_ratings:
        content_ratings[c_rating] += 1
        
print(content_ratings)

This is how I solved it:

opened_file = open('AppleStore.csv')
from csv import reader
read_file = reader(opened_file)
apps_data = list(read_file)

content_ratings = {'4+': 0, '9+':0,'12+':0,'17+':0}

for key in content_ratings.keys():
  for item in apps_data[1:]:
    if key in item:
      content_ratings[key] += 1
print(content_ratings)

Hi there smoralesjr!!

It’s absolutely a valid solution, and a creative one at that, but in a technical sense, it is somewhat inoptimal.

Do this:

import time 
opened_file = open('AppleStore.csv')
from csv import reader
read_file = reader(opened_file)
apps_data = list(read_file)

start = time.time

for key in content_ratings.keys():
  for item in apps_data[1:]:
    if key in item:
      content_ratings[key] += 1
print(content_ratings)

end = time.time
print(end-start)

And then, compare the time it took using the suggested answer that only used a single for loop.

For me, using the nested for loop that you used took 0.04123258590698242 seconds. Using a single for loop - the way my original answer was - took 0.005053997039794922 seconds. This obviously isn’t a difference that’s going to be perceptible to a human, but it would become more apparent if the data-set was bigger.

The example above presents a terrific and very easy way of checking the efficiency of your code! Unsurprisingly, I observed that going the route without the nested for loop took about 1/8th of the time. You should try it out too!

For clarity, these are the only 4 lines of code you need, so it’s really easy to remember:

import time

start = time.time
# Insert the for loop, or any other lines of code you want to time here.
end = time.time
print(start-end)

Also, keep in mind that in some of the later missions, you’re going to have to construct the dictionary yourself too, because you won’t know the values you’ll be searching for. So you wouldn’t be able to use dict.keys().

2 Likes

Thanks for this explanation! I’ve had similar questions in the past, but never new how to test it.

You may think doing the answer is doing 2 loops just like you but just switching the inside and outside loops which should end up with the same number of variable accessing steps and thus same time required.
This answer explains some difference between in and for: https://stackoverflow.com/questions/38204342/python-in-keyword-in-expression-vs-in-for-loop

The answer uses in which is calling the __contains__ method of dictionary to check in O(1) time complexity (meaning constant time, not related to how many key:value pairs the dictionary has), whether the dictionary has the key or not. Your solution enforced the usage of a for-loop (which calls __iter__()) over the dictionary keys which makes it O(n) time (linear time) to go through the dictionary, so overall your solution takes a longer time. If you say "But i am using in in if key in item:", that is analogous to the answer doing c_rating = row[10] (O(1) instant lookup) so you are getting no speed advantage.

An important difference here is the answer has hardcoded to look at the index 10 column while you are searching for something in any column/index position in the entire row. Practically, it is very common for information to appear in the wrong columns due to erroneous human input. The answer method is able to identify that while your solution will return that it found that item anyway when it’s in the wrong column. Besides human error, \n newline characters mixed with imperfect SQL queries can also cause values filled in wrong columns, so always looking at the column where you expect things to be is a more robust solution. However hardcoding 10 may not be good too because what if you wanted to insert a column before that which means 10 is not refering to the column you want anymore? Therefore, using named columns/named tuples are better in this case as you can reference the correct item through name no matter how information shifts around.

Question for you:

  1. What if you maintain your looping order (meaning over content_ratings.keys in outer loop, then over apps_data inside), but fix the column you look at and replace the inner loop through apps_data with in instead? This requires you to extract all the available apps_data in that specific column.
  2. If you did 1, there is really no need to repeatedly extract all the apps data for the same column for each outer loop of the dictionary. What if you produced this list of values first, then went through the dictionary? What savings does that offer?
  3. How can you use https://docs.python.org/3.7/library/collections.html#collections.Counter to simplify this problem? (very useful dictionary manipulator)

Everything you see in this problem are looping methods. When you move on to numpy and pandas libraries you will be exposed to vectorized methods, which produce the same operations but much faster without looping.

3 Likes

As commented by @hanqi, strongly recommend to use Counter.

Without using Counter and with code given by the original post, it shows how to approach the problem of counting the values. By using a Counter object, it shows exactly what the problem wants that is the count of values. Using Counter significantly helps to improve the code readability.

If you choose not to use Counter, then suggest to use defaultdict. You don’t need to initialize each time for a new count for this value.
An example on defaultdict:

colors = ['red', 'green', 'red', 'blue', 'green', 'red']
d = {}
# basic way to loop through a dictionary
for color in colors:
    if color not in d:
        d[color] = 0
    d[color] += 1

If colors is a long list, then if color not in d is a bottleneck.

>>> d
Out[32]:
{'blue': 1, 'green': 2, 'red': 3}

Next level improvement, use get method

d = {}
for color in colors:
    d[color] = d.get(color, 0) + 1
>>> d
Out[33]:
{'blue': 1, 'green': 2, 'red': 3}

To further improve the code, use defaultdict.

from collections import defaultdict
#default value for integers are 0
d = defaultdict(int) 
for color in colors:
    d[color] += 1
1 Like