Tag Association Question

Screen Link: https://app.dataquest.io/m/469/guided-project%3A-popular-data-science-questions/8/relations-between-tags


I’m hoping someone can explain the logic behind a section of code in the solution for the Popular Data Science Questions guided project.

Here is the code I’m curious about. It fills in a DataFrame of associations between all the tags, which are stored as lists in a column of another DataFrame:

associations.fillna(0, inplace=True)

for tags in questions["Tags"]:
    associations.loc[tags, tags] += 1

(Full solution: https://github.com/dataquestio/solutions/blob/master/Mission469Solutions.ipynb, line 17)

As I mentioned, these tags are in lists within the “Tags” column. My question is how the final line works. Given a list of tags such as
['machine-learning', 'regression', 'linear-regression', 'regularization'], every pairing of two tags from the list is counted.

I now understand what this line of code does, but I would never have thought to write something like this, and I’m not entirely sure why it works. I would have expected an error, since both the row indexer and the column indexer are lists. There must be something I don’t fully understand about lists, or about this specific case, that lets it operate on each tag within the list.

Any thoughts or understanding would be greatly appreciated!

I’m wondering the same thing now. My instinct was that it should be something like this:

for row in questions["Tags"]:
    for tags in row:
        associations.loc[tags, tags] += 1

So basically two for loops, one to get the row that contains the list and another to get the strings within the list. Neither is working for me at this moment. I must be missing something.

I found something on Stack Overflow that creates a co-occurrence matrix like this. It’s much more complex and I don’t fully understand it, but it is working for me. Here it is if you want to check it out. It’s the third answer down if you sort by votes (which is the default):

I actually looked at the DQ answer after I found the stack overflow solution and was surprised at how simple the DQ solution was in comparison.


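Your block of code is almost right, but there is one small error in the nested “for” loop: using the same single tag as both the row label and the column label only increments that tag’s own diagonal cell. Instead of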

associations.loc[tags,tags] += 1

You have to use

associations.loc[tags,row] += 1

So your final code should look like this:

for row in questions["Tags"]:
    for tags in row:
        associations.loc[tags, row] += 1

This will return the same answer as:

for tags in questions["Tags"]:
    associations.loc[tags, tags] += 1

I was also stuck on this problem, but apparently pandas can replace the nested for loop with the above syntax, which is pretty powerful!
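
For anyone who wants to verify this, here is a minimal sketch that runs the three versions discussed in this thread on a tiny made-up dataset. The `tag_lists` Series is a hypothetical stand-in for questions["Tags"], and `empty_matrix` is a helper invented for this example. Neither comes from the Dataquest solution.

```python
import pandas as pd

# Hypothetical stand-in for questions["Tags"]: each element is a list of tags.
tag_lists = pd.Series([
    ["machine-learning", "regression"],
    ["regression", "regularization", "machine-learning"],
])
all_tags = sorted({tag for tags in tag_lists for tag in tags})


def empty_matrix():
    """Return a fresh all-zero tag-by-tag DataFrame (illustrative helper)."""
    return pd.DataFrame(0, index=all_tags, columns=all_tags)


# Version 1: same single tag for row and column, so only the diagonal changes.
diagonal_only = empty_matrix()
for row in tag_lists:
    for tag in row:
        diagonal_only.loc[tag, tag] += 1

# Version 2: one tag as the row label, the whole list as the column labels.
nested = empty_matrix()
for row in tag_lists:
    for tag in row:
        nested.loc[tag, row] += 1

# Version 3: the whole list as both row and column labels, no inner loop.
vectorized = empty_matrix()
for tags in tag_lists:
    vectorized.loc[tags, tags] += 1

print(diagonal_only)
print(nested)
print(vectorized)
```

On this data, versions 2 and 3 produce the same matrix, while version 1 leaves every off-diagonal cell at 0. This assumes no tag appears twice within a single question's list.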


df.loc[] “will access a group of rows / columns by label”. You can use it to assign values to entire rows and columns, and also to the specific cells where a target row (or rows) meets a target column (or columns). That is what is going on in that line of code: when both indexers are lists, .loc selects every cell where one of the listed rows meets one of the listed columns. So it doesn’t loop through the row. It uses the labels to access those rows and columns in the matrix and then adds to the values there (which start at 0). Below are some examples you can try:

Example DataFrame:

import pandas as pd

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']
example = pd.DataFrame(0, index=labels, columns=labels)  # all-zero integer frame


Making some lists with our labels of interest:

ex_1 = ['b']
ex_2 = ['a', 'c']
ex_3 = ['d', 'e', 'g']
ex_4 = ['f', 'h', 'j', 'k']

Accessing some combinations of these lists:

example.loc[ex_1, ex_1]


example.loc[ex_2, ex_4]


example.loc[ex_3, ex_3]
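
As a quick way to see what these selections return, here is a small self-contained sketch. It rebuilds the all-zero frame with pd.DataFrame(0, ...), which is an assumption made here for brevity, and inspects one of the selections.

```python
import pandas as pd

labels = list("abcdefghijk")
example = pd.DataFrame(0, index=labels, columns=labels)

ex_2 = ['a', 'c']
ex_4 = ['f', 'h', 'j', 'k']

# Two lists select the block where the listed rows meet the listed columns.
block = example.loc[ex_2, ex_4]
print(block.shape)          # (2, 4)
print(list(block.index))    # ['a', 'c']
print(list(block.columns))  # ['f', 'h', 'j', 'k']

# A single row label with a list of columns returns a Series instead.
row_slice = example.loc['c', ex_4]
print(type(row_slice).__name__)  # Series
```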


Assigning some values to the df:

example.loc[ex_1, ex_1] = 5
example.loc[ex_2, ex_2] = 5
example.loc[ex_3, ex_3] = 5
example.loc[ex_4, ex_4] = 5
example.loc[ex_1, ex_2] += 10
example.loc[ex_4, ex_1] += 20
example.loc['c', ex_4] += 50
example.loc[['i', 'k'], ['d', 'e']] += 100
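
To see the combined effect, the sketch below reruns the assignments above on a fresh all-zero frame and prints a few cells. The frame is built with pd.DataFrame(0, ...) and the first four assignments are folded into a loop. Both are choices made for this sketch and not part of the original example.

```python
import pandas as pd

labels = list("abcdefghijk")
example = pd.DataFrame(0, index=labels, columns=labels)

ex_1 = ['b']
ex_2 = ['a', 'c']
ex_3 = ['d', 'e', 'g']
ex_4 = ['f', 'h', 'j', 'k']

# Set each group's own block to 5, as in the first four assignments.
for group in (ex_1, ex_2, ex_3, ex_4):
    example.loc[group, group] = 5

# Add to cells where rows from one group meet columns from another.
example.loc[ex_1, ex_2] += 10
example.loc[ex_4, ex_1] += 20
example.loc['c', ex_4] += 50
example.loc[['i', 'k'], ['d', 'e']] += 100

print(example.loc['b', 'b'])  # 5
print(example.loc['b', 'a'])  # 10
print(example.loc['a', 'b'])  # 0, since row a / column b was never touched
print(example.loc['c', 'f'])  # 50
print(example.loc['k', 'd'])  # 100
print(example)
```

Note that row labels and column labels are not interchangeable: example.loc[ex_1, ex_2] changes row b at columns a and c, and leaves rows a and c at column b untouched.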