Need Feedback: Analyzing NYC High School data

Hello everyone,

I hope you are doing great. I have finished working on this project and would appreciate feedback so I can fix my weaknesses and mistakes.

I enjoyed this project the most and learned a ton, since I had to work through a variety of concepts. It gave me a feel for how to approach a project and break it down into tasks.

However, I still have a few questions that I need to ask:

1- Should we standardize all columns by using lower()/upper() and using _? We did not follow the Python column naming convention. Was there any particular reason?
2- When making DBN unique in each row, how do we determine which column to deal with? For example, we dealt with the GRADE, PROGRAM TYPE, and CORE SUBJECT (MS CORE and 9-12 ONLY) columns in the class_size dataset.
3- How can we assign scores to schools based on sat_score and other attributes? I am struggling to understand this one.

Overall it was a great opportunity to learn from this project.

Thank you!

analyzing-nyc-high-school.ipynb (908.8 KB)



Hi @m.awon

What an amazing project! You have emphasised every stage of your work. The introduction, the background on the dataset (including research on the topic), the data sources, and the project overview have all been well explored. Your use of comments and docstrings is evident throughout, and the observations you make on the various outputs are very informative.

I really enjoyed the data cleaning section. How you handled DBN is interesting: you outlined the process for getting a unique DBN in the various datasets, and for creating a DBN from linked information in the datasets that lacked one. The same goes for combining the datasets; the walkthrough is excellent, including the dummy data frames you created. I think any reader will love reading through the whole process.

I also love how you handled missing data after combining the datasets. Creating the dummy data made the whole process easy to follow. The analysis section, the visualizations, and the embedded explanations are also very informative. Keep up the good work, mate.

I will try to answer the questions you raised to the best of my knowledge:

On your first question: your argument is valid, and it is always recommended to use the Python naming convention for columns, which is snake_case (lower case with underscores). But the whole dataset has something like over 2000 columns, which makes renaming them by hand impractical. It would still have been possible after the cleaning step. Remember that DQ doesn't impose any limitations on a guided project, so you could have gone ahead and standardized some of these columns.
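Renaming does not have to be done by hand, even with many columns. Here is a minimal sketch of one way to do it in pandas. The helper name to_snake_case is my own, and the toy frame only borrows a few column names from this thread, so treat it as an illustration rather than the project's method.

```python
import re
import pandas as pd

def to_snake_case(name):
    """Lower-case a column name and replace runs of non-alphanumerics with '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()

# Toy frame with column names in the style of the project's datasets
df = pd.DataFrame(columns=["DBN", "PROGRAM TYPE", "SAT Math Avg. Score"])

# Rename every column in one pass, however many there are
df.columns = [to_snake_case(c) for c in df.columns]
print(list(df.columns))
```

If two original names collapse to the same snake_case name, you would end up with duplicate column labels, so it is worth checking df.columns.is_unique afterwards.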

On your second question: I think the idea was to work with high schools only, and if you study the class_size dataset carefully, only the GRADE and PROGRAM TYPE columns contain information that can help identify high schools. So the best way to determine which columns to use whenever you want to make rows unique is to study and research the columns of that dataset thoroughly.
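To make that concrete, here is a rough sketch of how inspecting the columns can lead to one row per DBN. The data is a made-up stand-in for class_size, and the filter values "09-12" and "GEN ED" are assumptions on my part, so check them against value_counts() on the real dataset first.

```python
import pandas as pd

# Toy stand-in for class_size; the values are made up for illustration
class_size = pd.DataFrame({
    "DBN": ["01M292", "01M292", "01M292", "01M332"],
    "GRADE": ["09-12", "09-12", "06", "09-12"],
    "PROGRAM TYPE": ["GEN ED", "GEN ED", "GEN ED", "CTT"],
    "AVERAGE CLASS SIZE": [20.0, 24.0, 30.0, 18.0],
})

# Step 1: inspect the candidate columns before choosing filter values
print(class_size["GRADE"].value_counts())
print(class_size["PROGRAM TYPE"].value_counts())

# Step 2: keep only high-school, general-education rows (assumed values)
high_school = class_size[
    (class_size["GRADE"] == "09-12") & (class_size["PROGRAM TYPE"] == "GEN ED")
]

# Step 3: average the numeric columns so each DBN appears exactly once
unique_dbn = high_school.groupby("DBN", as_index=False).mean(numeric_only=True)
print(unique_dbn)
print(unique_dbn["DBN"].is_unique)
```

Filtering alone may still leave several rows per DBN (here, one per subject), which is why the last step aggregates.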

On your third question: looking closely at your sat_results dataframe, the information is already available. Every school has SAT Math Avg. Score, SAT Critical Reading Avg. Score, and SAT Writing Avg. Score, which you can sum the way you did to get sat_score. If that doesn't answer your question, help me understand it better. And by the way, which other attributes do you have in mind?
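For reference, the summing step could look roughly like this. The three column names come from sat_results as mentioned above; the DBNs, the scores, and the "s" placeholder for a suppressed value are invented for the example.

```python
import pandas as pd

sat_cols = [
    "SAT Math Avg. Score",
    "SAT Critical Reading Avg. Score",
    "SAT Writing Avg. Score",
]

# Toy stand-in for sat_results; "s" plays the role of a non-numeric entry
sat_results = pd.DataFrame({
    "DBN": ["01M292", "01M448", "01M450"],
    "SAT Math Avg. Score": ["404", "423", "s"],
    "SAT Critical Reading Avg. Score": ["355", "383", "s"],
    "SAT Writing Avg. Score": ["363", "366", "s"],
})

# Convert the text columns to numbers; anything unparseable becomes NaN
for col in sat_cols:
    sat_results[col] = pd.to_numeric(sat_results[col], errors="coerce")

# Sum the three sections; min_count keeps the total NaN unless all three are present
sat_results["sat_score"] = sat_results[sat_cols].sum(axis=1, min_count=len(sat_cols))
print(sat_results[["DBN", "sat_score"]])
```

Without min_count, a row with all three scores missing would get a total of 0, which would look like a real (and very low) score in later plots.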

Having said that, I have the following suggestions to make:

  • The outputs in cell [4] are not very readable. You could have used the display() function instead and added some spacing. For example, the code cell below renders a more readable, well-spaced output; you can try it out.
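from IPython.display import display
# Display the first five rows of each dataframe in the 'data' dictionary
for k in data:
    print(f'\n\033[31m{k} dataset\033[0m')  # red heading, then reset the colour
    display(data[k].head())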
  • Check the observation you made below cell [10], in the third paragraph. I think a word or two is missing between the words "numbers" and "missing". Similarly, in cell [16] you refer to a dataset named clean_size, which doesn't exist; I think you meant class_size.
  • Always consider adding supporting lines of code when an observation is not obvious to the reader. For instance, in one observation you stated that DBN values are unique across the entire sat_results dataframe, but I still don't understand how you arrived at this. You could have applied the value_counts() function to confirm it.
  • Always consider numbering your subheadings; this is advisable when you plan to have a list of subheadings under a main heading. In your case, a reader will find it hard to understand how you moved from 6 to 1 if they happen to miss the new subheading, "combining the data".
  • Your explanations of correlation are good, but the visual distinction between positive and negative correlation is not clear. When the points in a scatter plot trend in one direction, that is, when the two variables increase or decrease together, this implies a positive correlation; when one increases while the other decreases, we have a negative correlation. This was missing from your explanations.
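Two of the points above (backing up the DBN uniqueness claim, and telling positive from negative correlation) can be shown with a few lines of code. This is only a sketch on invented data: hours_studied and absence_rate are hypothetical columns that are not in the project, chosen so that one moves with sat_score and the other against it.

```python
import pandas as pd

# Toy stand-in for sat_results; all values are invented for illustration
sat_results = pd.DataFrame({
    "DBN": ["01M292", "01M448", "01M450", "01M458"],
    "sat_score": [1122.0, 1172.0, 1149.0, 1207.0],
    "hours_studied": [1.0, 3.0, 2.0, 4.0],   # hypothetical column
    "absence_rate": [0.4, 0.2, 0.3, 0.1],    # hypothetical column
})

# Supporting evidence for "DBN is unique": no value may occur more than once
dbn_counts = sat_results["DBN"].value_counts()
print(dbn_counts.max())               # largest number of rows sharing one DBN
print(sat_results["DBN"].is_unique)   # the same check as a single boolean

# Pearson r: the sign tells you the direction of the relationship
r_pos = sat_results["sat_score"].corr(sat_results["hours_studied"])
r_neg = sat_results["sat_score"].corr(sat_results["absence_rate"])
print(f"rise together -> r = {r_pos:.2f}")
print(f"move in opposite directions -> r = {r_neg:.2f}")
```

Printing the sign of r next to each scatter plot is a simple way to make the direction of the relationship explicit for the reader.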

Further Exploration

What if we chose to work with a different cohort, say the one for the year 2005? Do you think this would affect the findings? You could consider expanding on this as you progress with your learning. I think it would give a deeper understanding of the analysis.

Otherwise, congratulations on the good work, mate, and all the best in your upcoming projects.


Thank you for the words of encouragement and for taking the time to look at my work. Your feedback is very helpful, especially the way you pointed out my rookie mistakes and explained how I can improve them. I'll remember them.



This is great, @m.awon.
Happy coding!