Thank you very much for your feedback!
I just started learning Python early this month, hoping to change path after years in teaching, financial advisory and journalism. This guided project is the very first I’ve completed to date, and I nervously posted it on a platform that feels like alien terrain to me. Hence, I’m absolutely grateful you took the time and effort to help!
In my “baby-steps”, I held on to the project-solutions for dear life, to guide me in terms of overall composition as well as the directions to take, making small detours here and there with insights from the community-sharing.
Pertaining to the points you’d raised, my thought process is as follows (feel free to point out any mistakes and/or blind spots of mine):
- Generally, a dataset’s header-row and its indexed column-titles dictate the data-body structure, i.e. all pieces of data should adhere to their respective titles’ column-positions.
Notwithstanding the possibility that information may be mis-placed, checking the length of each row of data against that of the header will immediately highlight the row(s) with abnormality, which led me to row.
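A minimal sketch of that length check in plain Python. The name `find_odd_rows` and the tiny sample data are made up for illustration; this is not the exact function from my notebook.

```python
def find_odd_rows(dataset, header):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Hypothetical sample data: the second row is missing one field,
# so every later value sits one column too far to the left.
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '19'],
    ['App C', 'FAMILY', '3.9'],
]

for index, row in find_odd_rows(rows, header):
    print(index, row)
```

Because the function reports whatever indices it finds, it keeps working if the file is re-downloaded and the odd row has moved.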
- Thank you for pointing out that iOS duplicates sharing the same app name may not always mean they’re of the same app, and that Android apps are spared this complication. Indeed, each of the iOS “duplicates” I found has a different id from the rest, when there should only be one unique id for the same app. Unfortunately, there seems to be no such id details in Android’s data.
Using ‘HipChat - Chat Built for Teams’ as an Android example, the rating, size and version of its two duplicate entries are identical: 3.8, 20M and 4.1 & up respectively. Having said that, the guided solutions’ criterion seems to be the app name too; I didn’t see any other criterion used anywhere.
I also wanted a deeper understanding of why and how different versions, sizes and ratings would turn the same app into different apps. Fortunately, I managed to find further explanation here.
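One way to tell whether entries sharing a name are exact copies or differ in some field is to group rows by name and compare them. A rough sketch only: the function names, the column layout and the ‘Example App’ rows are invented for illustration, and only the HipChat values come from the data mentioned above.

```python
from collections import defaultdict


def group_by_name(dataset, name_index):
    """Map each app name to the list of rows that carry it."""
    groups = defaultdict(list)
    for row in dataset:
        groups[row[name_index]].append(row)
    return groups


def split_duplicates(dataset, name_index=0):
    """Separate duplicated names whose rows are identical from those whose rows differ."""
    identical, differing = [], []
    for name, rows in group_by_name(dataset, name_index).items():
        if len(rows) < 2:
            continue  # the name appears once, so it is not a duplicate
        if all(row == rows[0] for row in rows):
            identical.append(name)
        else:
            differing.append(name)
    return identical, differing


# Assumed layout: [name, rating, size, android_version]
sample = [
    ['HipChat - Chat Built for Teams', '3.8', '20M', '4.1 & up'],
    ['HipChat - Chat Built for Teams', '3.8', '20M', '4.1 & up'],
    ['Example App', '4.0', '10M', '4.0 & up'],  # hypothetical
    ['Example App', '4.2', '11M', '4.0 & up'],  # hypothetical
    ['Unique App', '4.5', '5M', '5.0 & up'],    # hypothetical
]

identical, differing = split_duplicates(sample)
print('identical copies:', identical)
print('same name, different details:', differing)
```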
- I had intentionally deviated from the solutions, preferring to create a function capable of combing through both datasets instead of simply going with the offered answer. Using the second set of links to retrieve both datasets, I opened and saved both CSV files to my computer to work locally in Jupyter Notebook.
Had I obtained the datasets via the links to Kaggle, or run into a dataset update, the row_check() function would most likely have shown a row index other than 10472. But isn’t that one of the reasons why we write a function to do so?
You also asked if I’d checked for other instances of ratings > 5. I didn’t (which I should have done instead of relying on the solutions’ explanation). However, I have since followed up with a frequency table for that column. Other than the row that is missing its category, which pushes its rating beyond the max of 5, all remaining ratings within the dataset are either ‘NaN’ values or <= 5.
‘NaN’ values made up almost 14.65% of the total number of ratings, with the ‘Family’ and ‘Business’ categories having the most relative to other categories. I’ve yet to successfully replace all ‘NaN’ rating values with zero, the mean/median value or another option. (I’ve started to explore pandas and numpy for this purpose.)
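The check on the ratings column can be sketched roughly like this. It assumes the ratings arrive as strings (as they would from `csv.reader`), and the function name and sample values are made up; the real column is of course much longer.

```python
import math


def rating_summary(ratings):
    """Count ratings that are NaN, above 5, or within the valid range."""
    summary = {'nan': 0, 'above_5': 0, 'valid': 0}
    for raw in ratings:
        value = float(raw)  # float('NaN') parses to a NaN float
        if math.isnan(value):
            summary['nan'] += 1
        elif value > 5:
            summary['above_5'] += 1
        else:
            summary['valid'] += 1
    return summary


# Hypothetical sample, not the real column.
sample = ['4.1', 'NaN', '3.8', '19', '5.0', 'NaN', '4.4', '2.0']
summary = rating_summary(sample)
nan_share = summary['nan'] / len(sample) * 100

print(summary)
print(f'{nan_share:.2f}% of ratings are NaN')
```

NaN has to be tested with `math.isnan()` because a NaN value never compares equal to anything, including itself.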
- When removing non-English apps, I added the new criterion below; an app is also removed when it evaluates to True:
(non_ascii <= 3 and non_ascii == len(string))
This condition didn’t affect the Android data, but it removed another 18 apps from the iOS data, namely:
[‘豆瓣’, ‘知乎’, ‘飞猪’, ‘大辞林’, ‘雨时’, ‘鬼とび’, ‘屠龙杀’, ‘ブリ猫’, ‘のび毛’, ‘币优铺’, ‘任务客’, ‘秒速’, ‘ｗｗｗ’, ‘素飛び’, ‘謎解き’, ‘指神’, ‘和我信’, ‘針の穴’]
It appears that ‘ｗｗｗ’, originally grouped under the ‘Games’ category, is no longer available. As for the rest, it becomes more apparent that they weren’t meant for English-speaking users.
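For context, this is roughly how the added criterion could sit inside the English check. It is a sketch under assumptions: the function name `is_english` and the base rule (more than three non-ASCII characters means non-English) are my reading of the guided approach, not a copy of the solutions’ code.

```python
def is_english(string):
    """Heuristic: treat a name as English unless it looks non-English."""
    # Count characters outside the ASCII range (code points above 127).
    non_ascii = sum(1 for character in string if ord(character) > 127)

    if non_ascii > 3:
        return False  # assumed base rule: too many non-ASCII characters
    if non_ascii <= 3 and non_ascii == len(string):
        return False  # added criterion: short name made up entirely of non-ASCII characters
    return True


for name in ['Instagram', 'Docs To Go™ Free Office Suite', '豆瓣', 'ｗｗｗ']:
    print(name, is_english(name))
```

The added criterion only bites when every character in the name is non-ASCII, so names with one or two symbols such as ‘™’ or an emoji still pass. Note that an empty string also satisfies it and would be treated as non-English.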
Thanks again, Vadim, for your acute observations and the additional learning that comes with this reply!