Data Science Path Lessons Questions

I frequently have questions about the lessons and don’t know of any other place to get help, so I’m writing this post as a living Q&A for anyone going through the Data Science lessons.

Dictionaries and Frequency Tables

First question:

In section 10 on Looping over Dictionaries, how does Python loop over a dictionary using a variable that was not defined? For example:

content_ratings = {'4+': 4433, '12+': 1155, '9+': 987, '17+': 622}
total_number_of_apps = 7197
for rating in content_ratings:

The lessons call for this rating variable to be searched for in the dictionary, but the variable was never defined. I can’t wrap my head around why this works.

Thanks for the help and please let me know if there is a better way to get coaching like this related to the lessons

In section 11, “Keeping the Dictionaries Separate”, it seems to me that the lesson is confusing the key and the value. For example:

content_ratings = {'4+': 4433, '12+': 1155, '9+': 987, '17+': 622}
total_number_of_apps = 7197
c_ratings_proportions = {}
c_ratings_percentages = {}
for key in content_ratings:
    proportion = content_ratings[key] / total_number_of_apps

Isn’t the idea to transform the values into proportions? This seems to be equating a proportion to the key (the content rating of, say, ‘4+’) / 7197. ‘4+’ / 7197 doesn’t make sense…

In this lesson, how does Python find “size” in the size_freq dictionary if the dictionary hasn’t been populated yet?

i.e.

if size in size_freq:
    size_freq[size] += 1

How does “if size in size_freq” come before the command that populates the table with sizes?

Hi chase, you can ask in this community as and when you run into difficulties. I find that easier than holding a list of questions in a backlog.

Could you try markdown when writing code? For single-line code, you can surround it with backticks (top left of the keyboard) like this. For blocks of code, surround them with triple backticks (each on its own line), ```. This keeps the indentation of if and for blocks correct and thus more readable.

For question 1, the rating variable is assigned repeatedly during iteration: each pass of the loop assigns the next key of the content_ratings dictionary to rating. It does not have to be defined before the loop starts. I haven’t looked at the mission, but if the instruction was worded as “search for rating”, that would be strange, because in computer science “search” usually means looking for a specific item in a collection (ordered or unordered). Here I wouldn’t call it search, just iteration, since there isn’t a particular item to look for. You could still argue it is a search if you add a conditional in the body of the for loop, e.g. if rating == '4+': do_something(). In that case it is a linear search.
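
A minimal sketch of that behaviour, using the dictionary from the question. The seen list and the found variable are hypothetical names added only to make the loop’s assignments visible; they are not part of the lesson.

```python
content_ratings = {'4+': 4433, '12+': 1155, '9+': 987, '17+': 622}

# `rating` is never defined beforehand: the for statement itself
# assigns the next dictionary key to it on every pass.
seen = []  # hypothetical list, only used to record each assignment
for rating in content_ratings:
    seen.append(rating)

print(seen)    # ['4+', '12+', '9+', '17+']
print(rating)  # 17+  (the name still holds the last key after the loop)

# Adding a condition turns plain iteration into a linear search.
found = None
for rating in content_ratings:
    if rating == '9+':
        found = content_ratings[rating]
        break

print(found)   # 987
```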

For question 2, what is the difference between content_ratings[key] and key? This understanding will solve your question.

For question 3, size_freq[size] = anything is the syntax for assigning a value to an existing/new key to the dictionary. One practical example is emulating sklearn.impute.SimpleImputer by looping through columns of a dataframe, and saving the mean value of each column into a dictionary. This dictionary would begin empty and in each iteration be populated with the key being column name and value being the mean of the column.
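
As a rough sketch of that idea without pandas or sklearn, assume the “dataframe” is just a dictionary mapping column names to lists of numbers. The columns data and the column_means name are made up for illustration.

```python
from statistics import mean

# Hypothetical stand-in for a dataframe: column name -> list of values.
columns = {
    'price': [1.0, 2.0, 3.0],
    'rating': [4.5, 3.0, 4.5],
}

column_means = {}  # starts empty
for name in columns:
    # Assigning to a key that does not exist yet creates it.
    column_means[name] = mean(columns[name])

print(column_means)  # {'price': 2.0, 'rating': 4.0}
```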

On why if size in size_freq comes before the assignment: it is logically more convenient for us to think in positives than negatives. If you wanted to reverse it, it would become

if size not in size_freq:
    size_freq[size] = 1
else:
    size_freq[size] += 1

Notice that in is now not in.
Because it is one word shorter and positive conditions are easier to reason about, the former style is preferred.
However, there are times when the not in pattern is preferred. As a sidetrack to probability, there are the concepts of A and A_complement. It may be more convenient to specify the outcomes/events you don’t want than the ones you do want, because there is less to write. E.g. for the event “single digit less than 8”, you can write not in [8, 9] instead of in [0, 1, 2, 3, 4, 5, 6, 7].
The above explains why you might order it that way.
Why is it OK to order it that way? Because the code will not try to access the value of a key unless that key (and, implicitly, its value) is already in the dictionary. The if size in size_freq check avoids that access, which would raise a KeyError if the key were missing. If the key is in the dictionary, increment its value; if not, take the else path, which adds the key with an initial value.
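
To make this concrete, here is a small sketch that starts from an empty dictionary. The sizes list is hypothetical data, not taken from the lesson’s dataset.

```python
sizes = [100, 200, 100, 300, 100]  # hypothetical values

size_freq = {}  # empty: no key has been added yet
for size in sizes:
    if size in size_freq:
        # Only reached when the key already exists, so += is safe.
        size_freq[size] += 1
    else:
        # First time this size is seen: create the key.
        size_freq[size] = 1

print(size_freq)  # {100: 3, 200: 1, 300: 1}

# Without the membership check, incrementing a missing key fails.
unguarded = {}
try:
    unguarded[100] += 1
    raised = False
except KeyError:
    raised = True

print(raised)  # True
```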

As you get familiar with this if-else, you may move to the get method of Python dictionaries to make use of Pythonic features: https://www.geeksforgeeks.org/get-method-dictionaries-python/ . Even more advanced is collections.defaultdict: https://docs.python.org/3.7/library/collections.html#collections.defaultdict.
That is not to say if-else is useless. It is much more explicit about what’s going on, which helps readability. It also makes git diff, which operates line by line, clearer.
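
For comparison, a sketch of the same frequency count written with dict.get and with collections.defaultdict. The sizes list and the freq_get / freq_default names are hypothetical.

```python
from collections import defaultdict

sizes = [100, 200, 100, 300, 100]  # hypothetical values

# dict.get returns the default (0 here) when the key is missing.
freq_get = {}
for size in sizes:
    freq_get[size] = freq_get.get(size, 0) + 1

# defaultdict(int) creates a missing key with int() == 0 on first access.
freq_default = defaultdict(int)
for size in sizes:
    freq_default[size] += 1

print(freq_get)            # {100: 3, 200: 1, 300: 1}
print(dict(freq_default))  # {100: 3, 200: 1, 300: 1}
```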

Hi Blakeley,

We don’t need to declare variables in Python. In other languages like Java, we have to declare the variable used in a for loop:

In Java: for (int i = 0; i < 5; i++)

Python does not require declarations, so the loop variable is created by the for statement itself, and the loop runs once for each key in content_ratings.

rating takes the first key from content_ratings, then on the second pass it takes the second key, and so on.

Hi hanqi,

Thanks for the tip on markdown.

That resolves question one. I also found this line in lesson 10 that I had overlooked originally:

When we iterate over a dictionary with a for loop, the looping is done by default over the dictionary keys

Had to read that twice before it clicked

Question 2: What is the difference between content_ratings[key] and key?

I thought that [] implied index, which then would make content_ratings[key] puzzling to me.
Also, given that looping is done by default over dictionary keys, I’m not sure what key really is in and of itself.
So I don’t know the difference between the two.

Question 3:

size_freq[size] = anything is the syntax for assigning a value to an existing/new key to the dictionary.

This is helpful, but I am still confused about part 2 of your answer. My question is why does if size in size_freq come before populating the dictionary with size?

Follow-on Q: Would the following code go on to populate the dictionary size_freq, and if so, how?

for row in apps_data[1:]:
    size = row[2]

If row[2] is every integer on the 2nd indexed row in the apps_data list, does the loop then populate each integer in this row to new dictionary keys? Would size then serve as a variable name that merely refers to key?

If so, then why use if and not for?

Thanks for your patience :raised_hands:

Yes, [] can mean index for a list. However, for objects like dictionaries that are not accessed by position, putting positional numbers inside doesn’t make sense. [] is shorthand for __getitem__ and __setitem__ (https://stackoverflow.com/questions/43627405/understanding-getitem-method). Classes implement these methods to support Python’s [] syntactic sugar for getting and setting items. If it appears on the left-hand side of an assignment, like content_ratings[key] = 2, it calls __setitem__; if on the right-hand side, like value = content_ratings[key], it calls __getitem__.
So the difference is that key is some hashable object (for now you can think of it as an immutable, unchanging object, because dictionary keys must be hashable), while content_ratings[key] uses this key to look in the content_ratings dictionary and get the value corresponding to the key.
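
A short sketch of the distinction, completing the loop from question 2. Storing the result in c_ratings_proportions[key] is my assumption about where the lesson is heading; the same_value variable is hypothetical and only shows that [] and __getitem__ agree.

```python
content_ratings = {'4+': 4433, '12+': 1155, '9+': 987, '17+': 622}
total_number_of_apps = 7197

c_ratings_proportions = {}
for key in content_ratings:
    # key is the label, e.g. '4+'.
    # content_ratings[key] is the count stored under it, e.g. 4433.
    value = content_ratings[key]
    same_value = content_ratings.__getitem__(key)  # what [] does on the right-hand side
    # On the left-hand side, [] stores a value under the key.
    c_ratings_proportions[key] = value / total_number_of_apps

print(list(c_ratings_proportions))  # ['4+', '12+', '9+', '17+']
print(round(c_ratings_proportions['4+'], 2))  # 0.62
```

So the division is always count / total, never '4+' / 7197.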

Why would you think the dictionary must have the size key populated before it can be checked for the existence of size? If that were the case, the if would serve no purpose: we would already know size is in size_freq, so there would be no point in checking. It is because there are 2 possible scenarios, size is in and size is not in, that we have to use if.
It is perfectly fine to check a set/list/dictionary for objects that you know are not there currently, just in case some day they are. The code will not break just because the object checked is not inside; the contents of the if block simply will not run when the if condition evaluates to False.
There are 2 branches in this if-else, so it doesn’t really matter whether you place the if block logic or the else block logic first; the if condition must be checked either way. It matters more when the code becomes if-elif-elif-...-else (3 or more branches), where you can optimize by placing the more common conditions first, so that the following elif conditions don’t have to be checked once an earlier one evaluates to True.
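
A tiny sketch of checking for something that is not there yet. The value 100 and the found / also_in_list names are arbitrary illustrations.

```python
size_freq = {}  # nothing has been added yet

found = 100 in size_freq   # False, and no error is raised
also_in_list = 100 in []   # False as well

if 100 in size_freq:
    size_freq[100] += 1    # skipped, because the condition is False
else:
    size_freq[100] = 1     # this branch runs and adds the key

print(found, also_in_list, size_freq)  # False False {100: 1}
```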

Your 2 line code snippet will not populate size_freq because there isn’t any size_freq.__setitem__(key,value) or size_freq[key] = value (syntactic sugar for the former) used. size is just a variable used to store row[2] so you don’t have to type row[2] repeatedly when you want to do something with it in the body of the for loop. More importantly, it’s so row[2] indexing operation does not have to be repeatedly called (if you use row[2] multiple times).
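
Here is a sketch of the difference, using a made-up apps_data with a header row and the size at index 2 of each row. The rows and values are hypothetical.

```python
# Hypothetical data: first row is the header, skipped by apps_data[1:].
apps_data = [
    ['id', 'name', 'size_bytes'],
    ['1', 'App A', 100],
    ['2', 'App B', 200],
    ['3', 'App C', 100],
]

size_freq = {}

# The two-line snippet alone: size is reassigned once per row,
# but nothing is ever written into size_freq.
for row in apps_data[1:]:
    size = row[2]

print(size_freq)  # {}

# With the if/else in the loop body, each row updates the dictionary.
for row in apps_data[1:]:
    size = row[2]  # one value per row: the element at index 2
    if size in size_freq:
        size_freq[size] += 1
    else:
        size_freq[size] = 1

print(size_freq)  # {100: 2, 200: 1}
```

The for does the repetition over rows; the if only decides, for the current row, whether to increment an existing key or create a new one.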

I didn’t do the mission so I may be wrong, but from your earlier question, it looks like size is indeed used to populate size_freq. It would be more correct to say key refers to size than size refers to key. key is just a looping variable. It is a throwaway variable (unless you code at more difficult levels): once the loop is done, you never have to remember its contents or use it.