Data Science Path Lessons Questions

I frequently have questions related to the lessons and don’t know of any other place to get help, thus I’m writing this post as a living Q&A for anyone going through the Data Science Lessons.

Dictionaries and Frequency Tables

First question:

In section 10 on Looping over Dictionaries, how does Python loop over a dictionary using a variable that was not defined? For example:

content_ratings = {'4+': 4433, '12+': 1155, '9+': 987, '17+': 622}
total_number_of_apps = 7197
for rating in content_ratings:

The lessons call for this rating variable to be searched for in the dictionary, but the variable was never defined. I can’t wrap my head around why this works.

Thanks for the help and please let me know if there is a better way to get coaching like this related to the lessons

In section 11, “Keeping the Dictionaries Separate”, it seems to me that the lesson is confusing the key and the value. For example:

content_ratings = {'4+': 4433, '12+': 1155, '9+': 987, '17+': 622}
total_number_of_apps = 7197
c_ratings_proportions = {}
c_ratings_percentages = {}
for key in content_ratings:
    proportion = content_ratings[key] / total_number_of_apps

Isn’t the idea to transform the values into proportions? This seems to be equating a proportion to the key (the content rating of say ‘4+’) / 7197. ‘4+’ / 7197 doesn’t make sense…

In this lesson, how does Python find “size” in the size_freq dictionary if the dictionary hasn’t been populated yet?

i.e.

if size in size_freq:
    size_freq[size] += 1

How does “if size in size_freq” come before the command that populates the table with sizes?

Hi chase, you can ask on this community as and when you meet some difficulties. I find that easier than holding a list of questions in the backlog.

Could you try markdown when writing code? For a single line of code, you can surround it with backticks (top left of the keyboard) `like this`. For blocks of code, surround them with triple backticks (both on their own new lines), ```. This keeps the indentation of `if` and `for` blocks correct and thus more readable.

For question 1, the `rating` variable is assigned anew on every iteration: each pass through the loop binds `rating` to the next key of the `content_ratings` dictionary. It does not have to be defined before the loop starts. I haven’t looked at the mission, but if the instruction was worded as "search for `rating`", that would be strange, because in computer science "search" usually means looking for a specific item in a collection (ordered or unordered). Here I wouldn’t call it search, just iteration, since there isn’t a particular item to look for. You could still argue it is a search if you add a conditional in the body of the for loop, e.g. `if rating == '4+': do_something()`. That would be the linear search technique.
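
A minimal sketch of that behaviour, using the dictionary from the question. The list `seen_keys` is only an illustrative helper and is not part of the lesson.

```python
content_ratings = {'4+': 4433, '12+': 1155, '9+': 987, '17+': 622}

seen_keys = []  # illustrative helper, not from the lesson

# `rating` does not exist before this line. The for statement creates it
# and rebinds it to the next dictionary key on every pass.
for rating in content_ratings:
    seen_keys.append(rating)

print(seen_keys)  # ['4+', '12+', '9+', '17+']
print(rating)     # 17+ (the last key that was assigned)
```

After the loop finishes, `rating` still holds the last key, which shows it is an ordinary variable that the loop assigned for you.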

For question 2, what is the difference between `content_ratings[key]` and `key`? Understanding that will answer your question.

For question 3, `size_freq[size] = anything` is the syntax for assigning a value to an existing/new key to the dictionary. One practical example is emulating `sklearn.impute.SimpleImputer` by looping through columns of a dataframe, and saving the mean value of each column into a dictionary. This dictionary would begin empty and in each iteration be populated with the key being column name and value being the mean of the column.
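
A rough sketch of that idea in plain Python. This is not how `SimpleImputer` is implemented; the `columns` dictionary of lists is a made-up stand-in for a dataframe, and `column_means` is a hypothetical name.

```python
from statistics import mean

# Hypothetical stand-in for a dataframe: column name -> list of values,
# with None marking a missing entry.
columns = {
    'price': [1.0, None, 3.0],
    'rating': [4.0, 5.0, None],
}

column_means = {}  # starts empty
for name in columns:
    present = [v for v in columns[name] if v is not None]
    # Assigning to a key that is not there yet creates it.
    column_means[name] = mean(present)

print(column_means)  # {'price': 2.0, 'rating': 4.5}
```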

As for why `if size in size_freq` comes before the assignment: it is more natural for us to think in positives than in negatives. If you reversed the check, it would become

```
if size not in size_freq:
    size_freq[size] = 1
else:
    size_freq[size] += 1
```

Notice that `in` is now `not in`.
Since it takes one word less to type and positives are easier to think about, the former style is preferred.
However, there are times when the `not in` pattern is preferred. As a sidetrack into probability, there are the concepts of `A` and `A_complement`. It may be more convenient to specify the outcomes/events you don’t want than the ones you do want, because there is less to write. E.g. writing `not in [8,9]` when `in [0,1,2,3,4,5,6,7]` also works for the event `single digits less than 8`.
That explains why the branches are ordered that way.
Why is it OK to order them that way? Because the code never tries to access the value of a key unless that key (and, by implication, its value) is already in the dictionary. The `if size in size_freq` check guards against that access, which would raise `KeyError` if the key were missing. If the key is in, increment it; if not, control moves to the `else` branch, which adds the key with a starting value.
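
To make the comparison concrete, here is a small sketch that builds the same frequency table with both orderings. The `sizes` list is made-up sample data, and `size_freq_reversed` is a name chosen only for this illustration.

```python
sizes = [100, 200, 100, 300, 100]  # made-up sample values

# Style 1: check for the key first, increment if it is present.
size_freq = {}
for size in sizes:
    if size in size_freq:
        size_freq[size] += 1
    else:
        size_freq[size] = 1

# Style 2: the reversed check.
size_freq_reversed = {}
for size in sizes:
    if size not in size_freq_reversed:
        size_freq_reversed[size] = 1
    else:
        size_freq_reversed[size] += 1

print(size_freq)                        # {100: 3, 200: 1, 300: 1}
print(size_freq == size_freq_reversed)  # True
```

On the very first pass the dictionary is empty, so the membership check is simply false and the key gets created with a count of 1.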

As you get familiar with this if-else, you may move on to the `get` method of Python dictionaries to make use of Pythonic features: https://www.geeksforgeeks.org/get-method-dictionaries-python/ . Even more advanced is `collections.defaultdict`: https://docs.python.org/3.7/library/collections.html#collections.defaultdict.
That is not to say if-else is useless. It is much more explicit about what’s going on, which helps readability. It also makes `git diff`, which operates line by line, clearer.
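
For reference, a short sketch of the two alternatives on made-up data (`sizes`, `freq_get` and `freq_default` are illustrative names). `dict.get(key, default)` returns the default when the key is missing, and `defaultdict(int)` supplies 0 for any key it has not seen yet.

```python
from collections import defaultdict

sizes = [100, 200, 100, 300, 100]  # made-up sample values

# dict.get: read the current count, or 0 if the key is missing.
freq_get = {}
for size in sizes:
    freq_get[size] = freq_get.get(size, 0) + 1

# defaultdict(int): a missing key starts at int(), which is 0.
freq_default = defaultdict(int)
for size in sizes:
    freq_default[size] += 1

print(freq_get)                        # {100: 3, 200: 1, 300: 1}
print(dict(freq_default) == freq_get)  # True
```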

Hi Blakeley,

We don’t need to declare variables in Python. In other languages, such as Java, the variable used in a for loop has to be declared:

In Java: `for (int i = 0; i < 5; i++)`

Python is dynamically typed, so no declaration is needed; the `for` statement creates the variable itself. The loop runs until it has gone through every entry in `content_ratings`.

`rating` takes the first key from `content_ratings`, then on the second pass it takes the second key, and so on.

Hi hanqi,

Thanks for the tip on markdown.

That resolves question one. I also found this line in lesson 10 that I had overlooked originally:

When we iterate over a dictionary with a for loop, the looping is done by default over the dictionary keys

Question 2: What is the difference between `content_ratings[key]` and `key`?

I thought that `[]` implied index, which then would make `content_ratings[key]` puzzling to me.
Also, given that looping is done by default over dictionary keys, I’m not sure what `key` really is in and of itself.
So I don’t know the difference between the two.

Question 3:

`size_freq[size] = anything` is the syntax for assigning a value to an existing/new key to the dictionary.

This is helpful, but I am still confused about part 2 of your answer. My question is why does `if size in size_freq` come before populating the dictionary with `size`?

Follow-on question: would the code below go on to populate the dictionary `size_freq`, and if so, how?

```
for row in apps_data[1:]:
    size = row[2]
```

If `row[2]` is every integer on the 2nd indexed row in the `apps_data` list, does the loop then populate each integer in this row to new dictionary keys? Would `size` then serve as a variable name that merely refers to `key`?

If so, then why use `if` and not `for`?

Yes, `[]` can mean an index for a list. However, dictionaries are looked up by key rather than by position, so putting positional numbers inside doesn’t make sense. `[]` is shorthand for `__getitem__` and `__setitem__` (https://stackoverflow.com/questions/43627405/understanding-getitem-method). Any class can implement these methods to make use of Python’s `[]` syntactic sugar for getting and setting items. If it appears on the left-hand side of an assignment, like `content_ratings[key] = 2`, it calls `__setitem__`; if on the right-hand side, like `value = content_ratings[key]`, it calls `__getitem__`.
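
A toy sketch of that dispatch. `LoggingStore` is a hypothetical class written only for this illustration; it records which special method the `[]` syntax ended up calling.

```python
class LoggingStore:
    """Toy container (hypothetical) that records which special method ran."""

    def __init__(self):
        self._data = {}
        self.calls = []

    def __setitem__(self, key, value):
        self.calls.append('set')
        self._data[key] = value

    def __getitem__(self, key):
        self.calls.append('get')
        return self._data[key]


store = LoggingStore()
store['4+'] = 4433   # [] on the left of = calls __setitem__
value = store['4+']  # [] on the right of = calls __getitem__

print(value)        # 4433
print(store.calls)  # ['set', 'get']
```
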
So the difference is `key` is some hashable object (you can just understand it as an immutable, unchanging object for now, because dictionary keys must be hashable), while `content_ratings[key]` is using this key to look in the `content_ratings` dictionary to get the value corresponding to the `key`.
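
Applied to the code from the question, a minimal sketch looks like this. The variable `count` is added here only to make the lookup visible; the assumption is that the lesson stores each proportion under the same key.

```python
content_ratings = {'4+': 4433, '12+': 1155, '9+': 987, '17+': 622}
total_number_of_apps = 7197

c_ratings_proportions = {}
for key in content_ratings:
    # key is the label, e.g. '4+'.
    # content_ratings[key] is the number stored under it, e.g. 4433.
    count = content_ratings[key]
    c_ratings_proportions[key] = count / total_number_of_apps

print(c_ratings_proportions)
```

So the division is `4433 / 7197`, never `'4+' / 7197`; the key is only used to find the number and to label the result.
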
Why would you think dictionary must have `size` key populated before it can be checked for the existence of `size`? If that was indeed the case, the `if` serves no purpose, because we already know `size` must be in `size_freq`, so there is no point in checking if `size` is in. It’s because there are 2 possible scenarios, `size is in` and `size is not in` that we have to use `if`.
It is perfectly fine to check a set/list/dictionary for objects that you know are not there currently, just in case some day in the future they are. The code will not break just because the object being checked is not inside; the contents of the `if` block simply will not run when the `if` condition evaluates to False.
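
A quick sketch of the difference between checking for a key and accessing it. The key `'100'` and the flag `raised` are made up for the illustration.

```python
size_freq = {}

# Membership checks on empty containers just return False.
print('100' in size_freq)  # False
print(5 in [])             # False
print(5 in set())          # False

# Accessing a missing key is what fails.
raised = False
try:
    size_freq['100'] += 1
except KeyError:
    raised = True

print(raised)  # True
```
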
There are 2 branches here in this if else, so it doesn’t really matter whether you place the if block logic first or the else block logic first, the `if` condition must be checked either way. It does matter more when code becomes `if-elif-elif ....-else`,(3 or more branches) where you can optimize by placing more common conditions first so the following `elif` won’t have to be checked once the 1st `if` evaluates to True.
Your 2 line code snippet will not populate `size_freq` because there isn’t any `size_freq.__setitem__(key,value)` or `size_freq[key] = value` (syntactic sugar for the former) used. `size` is just a variable used to store `row[2]` so you don’t have to type `row[2]` repeatedly when you want to do something with it in the body of the for loop. More importantly, it’s so `row[2]` indexing operation does not have to be repeatedly called (if you use `row[2]` multiple times).
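
Here is a sketch of the contrast. The rows in `apps_data` are made up (the real list comes from the course dataset); note that `row[2]` is the single element at index 2 of the current row.

```python
# Made-up rows standing in for the course dataset.
apps_data = [
    ['id', 'name', 'size'],  # header row, skipped by apps_data[1:]
    ['1', 'App A', '100'],
    ['2', 'App B', '200'],
    ['3', 'App C', '100'],
]

size_freq = {}

# The two-line snippet on its own: it only reads values.
for row in apps_data[1:]:
    size = row[2]
print(size_freq)  # {}

# Adding the if/else is what writes into the dictionary.
for row in apps_data[1:]:
    size = row[2]
    if size in size_freq:
        size_freq[size] += 1
    else:
        size_freq[size] = 1
print(size_freq)  # {'100': 2, '200': 1}
```

The `for` does the walking through rows, while the `if` makes a yes/no decision about one `size` at a time, which is why both are needed.
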
I didn’t do the mission so may be wrong, but from your earlier question, it looks like `size` is indeed used to populate `size_freq`. It would be more correct to say `key` refers to `size` than `size` refers to `key`. `key` is just a looping variable. It is a throwaway variable (unless you code at more difficult levels) whose contents you never have to remember, or use, once the loop is done.