Profitable App Profiles - Mo' Money Mo' Problems

Hello all! Please take a moment to review my second DQ project titled ‘Free App Profiles with High User Engagement’. Again, any feedback would be greatly appreciated, enjoy your weekend!

Profitable_App_Profiles.ipynb (845.5 KB)



Hi @shaun.oilund ,

Great great project. Let me start by saying even though this was a big project, your comments and section descriptions made this project easy to follow. I could understand each step and what you were trying to do. I enjoyed reading.

Great job adding sections and comments, and using charts and tables. You did a great job guiding the reader through your analysis.

You did a great job explaining the cleaning process, especially removing the duplicates (ln[9]).

The graphs and tables were very helpful. You could quickly see the top 10 apps.

I also feel like a hypocrite for making any suggestions because I know you spent a lot of time creating this great project. You did such a great job. So, please just feel free to ignore my suggestions because the project is really great.

  • In ln[7], consider appending the whole row instead of appending each column individually; in ln[12] you did append the row.

  • Did you consider ranking the installs separately and not by genre to see top installed apps?

  • Did you consider explaining why you used rating_count_total in place of installs in ln[37]?

Again, great project. The graphics were great and I loved the table of contents. I enjoyed reading. Your comments really helped to guide and understand your analysis process. Great job.


Hi Casandra, thank you very much for your awesome review and for taking the time to go through it all! It definitely took some effort to finish, and I’m glad you like the table of contents. It was a pain in the butt to build, but because of that I will never forget how to make one! :wink:

Now, let me see, why did I do some of the things I did? As for ln[7], you’re absolutely right, I could have just appended ‘row’ instead of listing all the columns like in ln[12]. I don’t know if this has ever happened to you before, but for some weird reason, if I removed the individual column names from the append I would get ‘str’ operand errors down the line and other strange stuff; the program just didn’t seem happy unless I did it that way. So, I said “Okay, computer, if this makes you happy, fine, let’s do it this way; ln[12] is cool with you, but ln[7] is a no go, we will move on then”.
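For what it’s worth, here is a minimal sketch of appending the whole row at once (the data and variable names are made up, not the actual notebook code). The ‘str’ operand errors usually come from later code doing arithmetic on values that are still strings, rather than from the append itself:

```python
# Hypothetical rows: [name, genre, user_rating, rating_count] -- all strings,
# as they come out of a CSV reader.
dataset = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
]

android_clean = []
already_added = []

for row in dataset:
    name = row[0]
    if name not in already_added:
        android_clean.append(row)   # append the whole row in one go
        already_added.append(name)

# A 'str' operand error would show up here if we summed row[3] without
# converting it to a number first, since the CSV values are strings:
total_reviews = sum(float(r[3]) for r in android_clean)
```

In other words, `append(row)` and `append([name, genre, ...])` behave the same as long as the fields themselves are unchanged; the type errors only surface once a string column is used in arithmetic.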

I did rank the installs separately across all genres in Table 2 and ran a kind of sensitivity test: I first pulled all apps with the most installs and user ratings greater than zero, then raised the user-rating threshold in increments. At 4.4 I had a list of 12 apps, and 4.5 gave me 8; I wanted a top 10, but then I realized I could just sort the list and take a slice of the top 10, so the rating threshold didn’t matter. By that point I just said, “carry on”.
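The sort-then-slice trick can be sketched like this (hypothetical app records, not the real dataset):

```python
# Hypothetical (name, user_rating, installs) records; not the actual data.
apps = [
    ('App A', 4.7, 500_000),
    ('App B', 4.4, 1_000_000),
    ('App C', 4.9, 250_000),
    ('App D', 4.5, 750_000),
]

# Sort by installs, descending, and slice the top N -- no need to tune a
# user-rating threshold to land on exactly N apps.
top_n = 3
top_by_installs = sorted(apps, key=lambda app: app[2], reverse=True)[:top_n]
```

The slice works the same whether the sort key is installs or user rating; whatever cutoff the threshold would have produced, `[:10]` guarantees exactly ten results (or fewer if the list is shorter).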

ln[37] was dealing with the Apple apps dataset, which doesn’t have install data available, so I just wanted to see the min/max of the rating count total to compare with my top 10 results. I could have gone with just the max rating count total there, but I decided to include the highest user rating too. I hope that answers your questions.
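For anyone reading along, the min/max check is essentially this (the rows and field order here are assumptions for illustration, not the actual Apple dataset):

```python
# Hypothetical rows: (track_name, rating_count_tot, user_rating).
# rating_count_tot stands in for installs, which the Apple data lacks.
apple_apps = [
    ('Facebook',   2_974_676, 3.5),
    ('Instagram',  2_161_558, 4.5),
    ('Tiny App',          12, 5.0),
]

# Max/min by total rating count, used as a rough proxy for popularity.
most_rated = max(apple_apps, key=lambda app: app[1])
least_rated = min(apple_apps, key=lambda app: app[1])
```

Using `key=` keeps the whole record, so the corresponding user rating comes along for free with the min/max result.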

Thanks again so much for your comments and questions, I really do appreciate the time taken to respond!

Enjoy the rest of the day Casandra!




Thanks for sharing your project. I will echo @Casandra_Hayward’s praise and say that your project is excellent. I especially like how you structured the project, the clear explanations for each step you took, and the extra effort you took for displaying the tables.

From a cursory look, the only nitpick I have is for this part:

The heading column names from both the Google and Apple dataset csv files were modified prior to importing so that the column names were consistent and in similar order. The Google dataset does have two additional column headers installs and sub_genre as shown below.

If there’s preprocessing of the raw data (the ones from Kaggle) not handled by the notebook, consider adding a link or two to the preprocessed files. If someone wants to replicate the results you get in your notebook, they’ll have to do extra processing of the raw data, instead of just downloading the raw data sets, modifying a few file paths, and simply running your notebook afterwards.

Also consider highlighting that when you’re trying to make the column headers from the two datasets consistent, it’s not just making sure ‘the column names were consistent and in similar order’, but also involves the removal of certain irrelevant columns such as lang.num from the Apple Store data set.

Again, a cursory look, so please correct me if my suggestion is incorrect.


Hi wanzulfikri, thank you very much for taking the time as well!

Regarding the preprocessed files, that’s a really good point, and I never thought about it that way until you mentioned it. The preprocessing of the original datasets from Kaggle really should have been documented in the report beforehand, or even made part of the data cleaning process. This would also clarify to the reader why I removed certain columns and then arranged them the way I did. Thanks again for pointing that out; it definitely will be on my mind the next time I work on a project!

Enjoy your weekend wanzulfikri!




That makes sense. Yeah, this is a big project, and your explanations about appending the row and using the rating count total make sense too.

Great job.


No worries @shaun.oilund.

The truth is I made the same mistake in my current personal project. When I tried to rectify the problem by adding notes on what had changed from the raw datasets, I couldn’t recall exactly what I did. There are 12 datasets and I modified almost all of them. A very painful mistake, but probably a necessary one.

Anyhow, enjoy your weekend as well.


Yes, mistakes will be made, and unfortunately some compound more than others; but big or small, they are all lessons learned!