iOS: Found 2 duplicates, Reduced non-English apps by another 20

Basics_EL.ipynb (83.6 KB)

Looking to improve. For a start, I'd like to know whether there are: 1) mistakes to amend, and 2) better ways to code. https://app.dataquest.io/m/350/guided-project%3A-profitable-app-profiles-for-the-app-store-and-google-play-markets/13/most-popular-apps-by-genre-on-google-play

Thank you!




Hi,

  1. The length of the header row and the length of the columns in the list have no
    logical link with each other.
  2. To check for duplicates correctly, you must compare not only the app name but also the app's version, size and rating. In the iOS case, the duplicates have different versions, sizes and ratings and are formally different applications. In the Android case, the name, rating, size and version of each duplicate are equal.
  3. Why do you delete only row [10472]? Are you sure that the Google store dataset contains only one rating value greater than 5? Why don't you check all the rating values? If somebody reverse-sorts the Google store dataset by name and rating, what value will be in row 10472? Are you sure that the Google store dataset from Dataquest used in our project and the dataset from Kaggle have the same row numbering?
    In some discussions, people train a model on the Google store data without removing the NaN values and then write a scientific report about it ))
  4. Removing non-English apps: are you sure that no application name consists of just one or two characters, or is a combination of one Latin and one Asian character? How will our filter work in that case?
    Trust but verify!
    Best regards, Vadim Maklakov

Hi Vadim,

Thank you very much for your feedback!

I just started learning Python early this month, hoping to change path after years in teaching, financial advisory and journalism. This guided project is the very first I've completed to date, and I nervously posted it on a platform that feels like alien terrain to me. Hence, I'm absolutely grateful that you took the time and effort to help!

In my “baby-steps”, I held on to the project-solutions for dear life, to guide me in terms of overall composition as well as the directions to take, making small detours here and there with insights from the community-sharing.

Regarding the points you raised, my thought process is as follows (feel free to point out any mistakes and/or blind spots):

  1. Generally, a dataset’s header-row and its indexed column-titles dictate the data-body structure, i.e. all pieces of data should adhere to their respective titles’ column-positions.

Notwithstanding the possibility that information may be misplaced, checking the length of each row of data against that of the header will immediately highlight any abnormal row(s), which is what led me to row[10472].
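A minimal sketch of the row-length check described above, assuming the dataset is a list of lists with the header held separately. The name `find_malformed_rows` and the sample rows are hypothetical; this is not the notebook's `row_check()` function.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Made-up sample data, not taken from the real CSV files.
header = ["App", "Category", "Rating"]
rows = [
    ["App A", "GAME", "4.1"],
    ["App B", "19"],            # one field short: the category is missing
    ["App C", "TOOLS", "3.9"],
]

for index, row in find_malformed_rows(header, rows):
    print(index, row)
```

Because the function reports whatever index the bad row currently has, it does not depend on the row sitting at position 10472 in any particular copy of the data.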

  2. Thank you for pointing out that iOS duplicates sharing the same app name are not always the same app, and that Android apps are spared this complication. Indeed, each of the iOS “duplicates” I found has a different id from the rest, when there should only be one unique id for the same app. Unfortunately, there seem to be no such id details in Android’s data.

Using ‘HipChat - Chat Built for Teams’ as an Android example, the rating, size and version are 3.8, 20M and 4.1 & up respectively in both duplicate entries. Having said that, the guided solutions’ only criterion seems to be the app name; I didn’t see any other criteria used anywhere.
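One way to tell exact duplicates from same-name entries that differ elsewhere is to group rows by name and then compare the other fields. The sketch below assumes rows are lists and that the caller supplies the column positions; `classify_duplicates` and the column layout are hypothetical, and apart from the HipChat values quoted above the sample rows are made up.

```python
from collections import defaultdict


def classify_duplicates(rows, name_index, compare_indices):
    """Split repeated app names into exact duplicates and differing entries."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[name_index]].append(row)

    exact, differing = [], []
    for name, group in groups.items():
        if len(group) < 2:
            continue  # the name appears only once
        # One distinct tuple means every compared field matches.
        keys = {tuple(row[i] for i in compare_indices) for row in group}
        (exact if len(keys) == 1 else differing).append(name)
    return exact, differing


# Assumed layout: name, rating, size, version.
rows = [
    ["HipChat - Chat Built for Teams", "3.8", "20M", "4.1 & up"],
    ["HipChat - Chat Built for Teams", "3.8", "20M", "4.1 & up"],
    ["Example App", "4.0", "10M", "1.0"],   # made-up entry
    ["Example App", "4.5", "12M", "2.0"],   # same name, different details
    ["Unique App", "4.2", "5M", "1.1"],     # made-up entry
]

exact, differing = classify_duplicates(rows, 0, (1, 2, 3))
print("exact duplicates:", exact)
print("same name, different details:", differing)
```

Names in the second list would need a closer look before any row is dropped.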

I also wanted a deeper understanding of why and how different versions, sizes and ratings would turn the same app into different apps. Fortunately, I managed to find further explanation here.

  3. I had intentionally deviated from the solutions, preferring to create a function capable of combing both datasets, instead of simply going with the offered answer. Using the second set of links to retrieve both datasets, I opened and saved both CSV files to my computer to work locally in Jupyter Notebook.

Had I chosen to obtain the datasets via the links to Kaggle, or possibly experienced a dataset update, the row_check() function would most likely have shown a different row index from 10472. But isn’t that one of the reasons why we write a function to do so?

You also asked whether I’d checked for other instances of ratings > 5. I hadn’t (which is what I should have done instead of relying on the solutions’ explanation). However, I have since followed up with a frequency table for that column. Other than row[10472], where the missing category pushes its rating beyond the max of 5, all remaining ratings within the dataset are either ‘NaN’ values or <= 5.
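A sketch of a check that scans every rating rather than one known row, so it gives the same answer however the rows are sorted. It assumes ratings are stored as strings with missing values written as ‘NaN’; `suspicious_ratings` is a hypothetical name, the lower bound of 0 is an assumption, and the sample rows are made up.

```python
import math


def suspicious_ratings(rows, rating_index, max_rating=5.0):
    """Return (index, row) pairs whose rating is unparseable or out of range."""
    flagged = []
    for i, row in enumerate(rows):
        try:
            rating = float(row[rating_index])   # float("NaN") parses to nan
        except (ValueError, IndexError):
            flagged.append((i, row))            # not a number, or row too short
            continue
        if math.isnan(rating):
            continue                            # missing ratings are a separate question
        if not 0 <= rating <= max_rating:       # assumed valid range: 0 to 5
            flagged.append((i, row))
    return flagged


# Made-up sample data.
rows = [
    ["App A", "4.1"],
    ["App B", "19"],
    ["App C", "NaN"],
    ["App D", "5.0"],
    ["App E", "Free"],
]

for index, row in suspicious_ratings(rows, rating_index=1):
    print(index, row)
```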

‘NaN’ values made up about 14.65% of the total number of ratings, with the ‘Family’ and ‘Business’ categories having the most relative to other categories. I’ve yet to successfully replace all ‘NaN’ rating values with zero, the mean/median value or some other option. (I’ve started to explore pandas and numpy for this purpose.)
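The share of missing ratings and the per-category counts can be computed with the standard library alone. This sketch assumes missing ratings appear as the string ‘NaN’ (or an empty string) and that the caller knows the column positions; `missing_rating_summary` is a hypothetical name and the sample rows are made up, so the numbers it prints are not the real dataset's.

```python
from collections import Counter


def is_missing(value):
    """Treat 'NaN' (any case) and empty strings as missing."""
    return value.strip().lower() in ("nan", "")


def missing_rating_summary(rows, category_index, rating_index):
    """Return (percent of rows with a missing rating, counts per category)."""
    missing = Counter()
    for row in rows:
        if is_missing(row[rating_index]):
            missing[row[category_index]] += 1
    total_missing = sum(missing.values())
    share = 100 * total_missing / len(rows) if rows else 0.0
    return share, missing.most_common()   # categories sorted by count, descending


# Made-up sample data: category, rating.
rows = [
    ["FAMILY", "NaN"],
    ["FAMILY", "NaN"],
    ["BUSINESS", "NaN"],
    ["GAME", "4.5"],
    ["TOOLS", "3.9"],
]

share, by_category = missing_rating_summary(rows, 0, 1)
print(f"{share:.2f}% of ratings are missing")
print(by_category)
```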

  4. When removing non-English apps, I added the new criterion below; an app is removed when it is True:

(non_ascii <= 3 and non_ascii == len(string))

This condition didn’t affect the Android data, but it removed another 18 apps from the iOS data, namely:

[‘豆瓣’, ‘知乎’, ‘飞猪’, ‘大辞林’, ‘雨时’, ‘鬼とび’, ‘屠龙杀’, ‘ブリ猫’, ‘のび毛’, ‘币优铺’, ‘任务客’, ‘秒速’, ‘www’, ‘素飛び’, ‘謎解き’, ‘指神’, ‘和我信’, ‘針の穴’]

It appears that ‘www’, originally grouped under the ‘Games’ category, is no longer available. As for the rest, it becomes more apparent that they weren’t meant for English-speaking users.
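A sketch of how the extra criterion could sit inside a name filter. It assumes the base rule counts characters with `ord(ch) > 127` and rejects names with more than three of them; that base rule and the name `is_english` are assumptions, and only the added condition comes from the line quoted above.

```python
def is_english(name, max_non_ascii=3):
    """Heuristic check on an app name; returns False for likely non-English names."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)

    # Assumed base rule: too many non-ASCII characters.
    if non_ascii > max_non_ascii:
        return False

    # Added criterion: a short name made up entirely of non-ASCII characters.
    if name and non_ascii <= max_non_ascii and non_ascii == len(name):
        return False

    return True


for name in ["Instagram", "豆瓣", "Docs To Go™ Free Office Suite", "a豆"]:
    print(repr(name), is_english(name))
```

Note that a two-character name mixing one Latin and one Asian character, such as the made-up ‘a豆’, still passes this filter, so that case would need a separate rule if it matters.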

Thanks again, Vadim, for your acute observations and the additional learning that came with this reply.

Warmest regards,
Edgar


And we distort the real picture of the data when we fill the NaN values in a column with the mean or median… Do you think a car with autopilot would be safe with a 14.65% error in determining distance, velocity or acceleration? :grinning: :grinning: