Guided Project: Popular Data Science Questions

Hello! I’m stuck on step 8 of this Guided Project, which is about finding relationships between pairs of tags and among multiple tags. My question is about the first lines of cells 17 and 18 of the project’s solution:

Cell 17: After creating the associations df and filling the NaN values with 0, it iterates over questions['Tags'], and for each list of tags it adds 1 to the matching cells. As I understand it, the result is a data frame with a bunch of 0s and a diagonal of 1s (where each tag matches itself).

At the first line of cell 18, a new df is created (relations_most_used) using associations.loc[most_used.index, most_used.index], and the df comes out filled with counts of how many times each index tag was related to each column tag. I didn’t get it! It seems like it is iterating all over again, but to me it is only creating a new df with the index and columns labeled as most_used. So where do these values (the output of cell 18) come from?

Thanks for the attention!!


I decided to recreate a mini version of this work to show how it works:

If you run questions['Tags'].head() you get this:

image

Next, create a set of the unique tags that appear in the lists above.

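# Tag lists of the first five questions
abc = questions['Tags'].head()

# Flatten the lists into one list of tags, then deduplicate with a set
abc_tags = []
for a_list in abc:
    for item in a_list:
        abc_tags.append(item)
abc_tags = set(abc_tags)
abc_tags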

image

Once you have the unique tags, create an n x n matrix (one row and one column per tag) that will hold the relationships between the tags in questions['Tags']:

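# Recent pandas versions reject a set as index/columns, so pass a sorted list
tag_labels = sorted(abc_tags)
assoc = pd.DataFrame(index=tag_labels, columns=tag_labels)
assoc.fillna(0, inplace=True)

# Each `tag` is a list of tags; add 1 to every cell where both the row
# and the column belong to that list
for tag in questions['Tags'].head():
    assoc.loc[tag, tag] += 1
assoc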

Now we will look at this table and investigate the relationships between the tags:

image

Pick one tag and find it either among the rows or among the columns of the table. Reading along that row (or column), you can trace how often that tag appears together with each of the other tags. I highlighted some of these cells in yellow.

You can see that machine learning x machine learning has a value of 3. This is because machine learning occurs three times in the five rows shown above.

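So if a question has the tags [machine learning, data-mining], then assoc.loc[tag, tag] += 1 selects the whole 2 x 2 block for those tags and adds 1 to every cell in it. It works like this: assoc.loc[machine learning, machine learning] += 1, assoc.loc[machine learning, data-mining] += 1, assoc.loc[data-mining, machine learning] += 1 and assoc.loc[data-mining, data-mining] += 1.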
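Here is a minimal, self-contained sketch of that block update, in case it helps to run it outside the project. The three tag lists are made up for illustration (they are not the real questions data), and I build the table with zeros directly instead of using fillna:

```python
import pandas as pd

# Hypothetical mini data set: each entry is the tag list of one question.
tag_lists = [
    ["machine-learning", "data-mining"],
    ["machine-learning", "python"],
    ["machine-learning"],
]

# One row and one column per unique tag, all counts starting at 0.
all_tags = sorted({tag for tags in tag_lists for tag in tags})
assoc = pd.DataFrame(0, index=all_tags, columns=all_tags)

for tags in tag_lists:
    # Using the same list for rows and columns selects the whole
    # len(tags) x len(tags) block, so each tag paired with itself and
    # each pair of tags (in both orders) goes up by one.
    assoc.loc[tags, tags] += 1

print(assoc)
```

The diagonal ends up holding how many questions use each tag, and the off-diagonal cells hold how many questions use both tags together, which is why the table is symmetric.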

The style_cells function is only there to format the pandas dataframe by adding colors to cells.
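On the cell 18 part of the question: associations.loc[most_used.index, most_used.index] does not count anything again. It only picks out the rows and columns for the most used tags from the table that was already filled in cell 17. A small sketch of that selection is below. The numbers in associations and the contents of most_used are invented stand-ins, and I am assuming only that most_used is indexed by tag name:

```python
import pandas as pd

tags = ["machine-learning", "python", "data-mining", "r"]

# Hypothetical counts standing in for the already-filled `associations` table.
associations = pd.DataFrame(
    [[3, 1, 1, 0],
     [1, 2, 0, 0],
     [1, 0, 1, 0],
     [0, 0, 0, 1]],
    index=tags,
    columns=tags,
)

# Stand-in for `most_used`: only its index (the tag names) matters here.
most_used = pd.Series([3, 2], index=["machine-learning", "python"])

# Pure label-based selection: keep the rows and columns named in the index.
relations_most_used = associations.loc[most_used.index, most_used.index]
print(relations_most_used)
```

The values in relations_most_used are copied straight from associations, so whatever counts appear in the output of cell 18 were produced by the loop in cell 17.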


I was also having a hard time understanding this step. Thanks for the explanation, @monorienaghogho.
