Project 1: App profiles. Getting 9311 records instead of 9659 after duplicate removal (step 5)

I am getting a data set length of 9311 instead of 9659 after cleaning duplicates from the data. I have been over and over the functions and cannot find the issue, so I’m requesting another set of eyes to help me spot it. Thank you!

Dictionary function
This function generates a dictionary with the max reviews. When this function is complete, the dictionary is 9659 entries long, as expected:

def create_review_count_dictionary(name_column_num, review_column_num, dataset):
    review_count_dict = {}

    for app in dataset[1:]:
        app_name = app[name_column_num]
        if app_name not in review_count_dict:
            review_count_dict[app_name] = float(app[review_column_num])
        else:
            if float(review_count_dict[app_name]) < float(app[review_column_num]):
                review_count_dict[app_name] = app[review_column_num]

    return review_count_dict

Duplicate Removal function
Here’s where the issue comes in. After running this code (which was copied from the solution notebook with small modifications), the cleaned set contains only 9311 records, so 348 rows are missing from the clean data set:

def remove_duplicate_entries(name_column_num, review_column_num, dataset):
    android_clean = []
    already_added = []

    reviews_max = create_review_count_dictionary(0, 3, android)

    for app in dataset[1:]:
        name = app[0]
        n_reviews = float(app[3])

        if (reviews_max[name] == n_reviews) and (name not in already_added):
            android_clean.append(app)
            already_added.append(name)
    return android_clean

I did a lot of trial and error with your code and compared it to the results I got from my own project. It was a good exercise to see another way of doing the same thing! When I compared the dictionary created by the create_review_count_dictionary function with the one from my code, I noticed there were exactly 348 differences. Looking at the else branch, the value being stored, app[review_column_num], wasn’t converted to float, so I changed that bit to see what would happen:

else:
    if float(review_count_dict[app_name]) < float(app[review_column_num]):
        review_count_dict[app_name] = float(app[review_column_num])

After making that change the dictionaries were the same, and the cleaned set returns the expected number of rows.
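
To spell out why the missing conversion matters: once the else branch stores the raw string, the later check `reviews_max[name] == n_reviews` compares a str with a float, and in Python those are never equal, so no row for that app passes the test. Here is a minimal sketch with made-up rows (the app names, counts, and the helper names `build_max_reviews` and `dedupe` are invented for illustration, not taken from the real data set or the solution notebook):

```python
# Toy rows: [app name, review count as text]. Invented data, for illustration only.
rows = [
    ["App A", "100"],
    ["App A", "250"],   # duplicate with more reviews -> triggers the else branch
    ["App B", "40"],
]

def build_max_reviews(rows, convert_in_else):
    """Build {name: max reviews}; optionally skip float() when updating the max."""
    max_reviews = {}
    for name, reviews in rows:
        if name not in max_reviews:
            max_reviews[name] = float(reviews)
        elif float(max_reviews[name]) < float(reviews):
            max_reviews[name] = float(reviews) if convert_in_else else reviews
    return max_reviews

def dedupe(rows, max_reviews):
    """Keep one row per app: the row whose review count equals the stored max."""
    cleaned, seen = [], set()
    for row in rows:
        name, n_reviews = row[0], float(row[1])
        if max_reviews[name] == n_reviews and name not in seen:
            cleaned.append(row)
            seen.add(name)
    return cleaned

buggy = build_max_reviews(rows, convert_in_else=False)
fixed = build_max_reviews(rows, convert_in_else=True)

print(buggy)                     # {'App A': '250', 'App B': 40.0}
print(len(dedupe(rows, buggy)))  # 1 -> "App A" is silently dropped
print(len(dedupe(rows, fixed)))  # 2
```

The dictionary still has the right number of keys either way, which is consistent with it being 9659 entries long in the original post. Only apps whose max was last set in the else branch end up with a string value, and those are the ones that go missing from the cleaned list.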


I can’t thank you enough, April! I was banging my head all day over this one!

If you are going through the course too and you want to compare projects, feel free to email me at (removed) :slight_smile:

Sure. I’m in the SQL section of the analyst path right now (slogging through it…). Feel free to DM me any time.


Can’t figure out how to DM lol!

You know, I said that and didn’t know straightaway how to do it either. :smiley:
You can either click on a person’s name within a thread and use the blue message button that pops up in the window (I sent you a message as a test), or go to your profile page, where there’s a blue “New Message” button that lets you type a user’s name and send a message.


Here’s a gif to help - great explanation!


Thanks! I don’t have a blue message button. I assume because I’m still on the free content and haven’t paid yet.

I’m trying to modify the functions that you have listed here. I have tried everything and am still coming up with 9311 when I remove the dupes and put the remaining rows into the cleaned_data list. Thanks for the help.

Mike

Hi @mik123je, welcome to the community!

Would you mind either copy/pasting your code or uploading a copy of your notebook instead of the screenshot? I’d like to help you troubleshoot, and it would be easier if the code weren’t trapped in an image. Here’s a quick gif on how to format the code when you copy and paste:

In the meantime, I noticed within your remove_duplicates() function, in the if-statement, you are referencing reviews_max (if (reviews_max[name] == n_reviews)). I think you meant max_reviews from a few lines up? This might not solve your problem but it popped out at me.
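
For anyone wondering how a wrong name can produce a wrong count instead of a NameError: in a notebook, a variable defined in an earlier cell stays alive, so a function can quietly read that older global. The sketch below assumes `reviews_max` was left over from an earlier cell; whether that was the case here isn't visible in the thread, and the rows are made up for illustration:

```python
# Made-up rows: [app name, review count as text].
rows = [["App A", "100"], ["App A", "250"], ["App B", "40"]]

# Pretend an earlier notebook cell left this (wrong) dictionary behind.
reviews_max = {"App A": "250", "App B": 40.0}

def remove_duplicates(dataset):
    # Correct dictionary, built locally under a different name.
    max_reviews = {}
    for name, reviews in dataset:
        max_reviews[name] = max(max_reviews.get(name, 0.0), float(reviews))

    cleaned, seen = [], set()
    for row in dataset:
        name, n_reviews = row[0], float(row[1])
        # Typo: this reads the stale global `reviews_max`, not the local `max_reviews`.
        if reviews_max[name] == n_reviews and name not in seen:
            cleaned.append(row)
            seen.add(name)
    return cleaned

print(len(remove_duplicates(rows)))  # 1, although there are 2 unique apps
```

Restarting the kernel and running all cells from the top is a quick way to flush out leftovers like this, since a name that only exists from an old cell will then raise a NameError.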

I’m sorry about the screenshot. I hope I did the copy/paste method correctly.
You were right about the reviews_max/max_reviews. I made the change and now I get the 9659.
I should have walked away and come back to it, but I was getting frustrated and tried everything, along with deleting most of the original code. I saved it and sure enough the same line was the culprit.
Thank you!!
Mike

def reviews_dict(dataset, name_col, reviews_col):
    # Map each app name to its highest review count (stored as a float).
    max_reviews = {}

    for app in dataset:
        app_name = app[name_col]
        if app_name not in max_reviews:
            max_reviews[app_name] = float(app[reviews_col])
        else:
            if float(max_reviews[app_name]) < float(app[reviews_col]):
                max_reviews[app_name] = float(app[reviews_col])

    return max_reviews

print(len(reviews_dict(google, 0, 3)))

def remove_duplicates(dataset, name_col, reviews_col):
    cleaned_data = []
    app_names = []

    # Use the function's own arguments instead of the global list and fixed column numbers.
    max_reviews = reviews_dict(dataset, name_col, reviews_col)

    for app in dataset:
        name = app[name_col]
        n_reviews = float(app[reviews_col])

        if (max_reviews[name] == n_reviews) and (name not in app_names):
            cleaned_data.append(app)  # keep the whole row, not just the name
            app_names.append(name)

    return cleaned_data

dupes_gone = remove_duplicates(google, 0, 3)
print(len(dupes_gone))

I’m glad you got it figured out! Sometimes when looking at lines of code it gets hard to pick those things out, so it’s good to step back or get another set of eyes on it. Good luck on the rest of your project!

If we are talking about debugging our own code, this technique is good :slight_smile:
