Review for Analyzing NYC High School Data Needed

Hi everyone,

Just finished yet another guided project.

Analyzing NYC High School Data was a fun project, and I completely understand that there is so much more that could be done with it. But at the moment I have mostly just done what the guided project instructed.

I will be revisiting this project with improvements soon, so please let me know what you have added beyond the basic guidelines in your own projects. When I revisit it, I will take all your suggestions on board and make it a better project.

Looking forward to your suggestions and feedback. Thanks in advance.

Here is my last mission screen

P7_Guided_Project_Analyzing+NYC+High+School+Data.ipynb (687.0 KB)

Click here to view the Jupyter notebook file in a new tab

Hi @jithins123

Disclaimer: I tend to focus only on code, coding style, and visual output when reviewing guided projects. That being said, I think your overall code is easy to read and comprehend. It's great that you make use of code comments as well. The overall structure is good.

Some concrete feedback on your project:

  • Code comments: Having a blank line between a comment and a code line is rather unusual. More to the point, try not to describe the process you employ in your comments (this is already apparent from the code itself), but rather what the purpose is and maybe why you have chosen this solution (especially if others have failed). For example:
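
A purely hypothetical illustration of the difference (the column name is made up, and pandas is assumed to be imported as pd):

# process comment - restates what the code already says:
# convert the column to numeric
df['sat_math_avg_score'] = pd.to_numeric(df['sat_math_avg_score'], errors='coerce')

# purpose comment - explains why the step is needed:
# the raw scores arrive as strings with placeholder values, so coerce them to NaN
df['sat_math_avg_score'] = pd.to_numeric(df['sat_math_avg_score'], errors='coerce')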

  • I know that you followed the DQ instructions here, but it is still not ideal to leave column names in an inconsistent state. Having names with different capitalization, with trailing whitespace, or with a mix of whitespace and underscores will cause severe headaches sooner rather than later, because you always need to keep track of the column style mentally. Try to find a consistent approach across all data sets. I tend to work with no capitalization, no trailing whitespace, and underscores for multi-word column names. A simple helper function can go a long way.

def standardize_col_names(df):
    """Standardize formatting of dataframe columns"""
    df.columns = (df.columns.str.strip()
                            .str.replace(' ', '_')
                            .str.lower())

    return df
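As a purely hypothetical usage example (assuming, as in your notebook, that the data sets live in a dict of dataframes called data), the helper can then be applied to everything in one go:

data = {name: standardize_col_names(df) for name, df in data.items()}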
  • Cell 6, applying zfill(): This is also part of the instructions, but again not ideal. If there is a vectorized Pandas method (str.zfill() in this case), it doesn't make sense to write a function yourself and map it to the individual cells. Built-in, vectorized methods are always preferable (more concise and faster). You could instead do something along the lines of
data['class_size']['padded_csd'] = (data['class_size']['csd'].astype('str')
                                                             .str.zfill(2))
  • Cell 9, adding columns: For 3 columns it is arguably fine to add them via the + operator. However, using sum() leads to more concise code in my eyes. Note: Be careful here, because depending on the Pandas version, sum() handles missing values in the cells it sums up differently. In Pandas 1.1.1 you need to write data['sat_results'][cols].sum(axis=1, min_count=3) for this particular scenario; see the documentation for details. A sketch is shown below.
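
A minimal sketch of what I mean (the column names in cols are assumed from the project data and should already be numeric at this point):

cols = ['SAT Math Avg. Score',
        'SAT Critical Reading Avg. Score',
        'SAT Writing Avg. Score']

# min_count=3 makes sum() return NaN as soon as one of the three scores is missing,
# which matches the behavior of adding the columns with +
data['sat_results']['sat_score'] = data['sat_results'][cols].sum(axis=1, min_count=3)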

  • Cell 11: again the issue of mapping functions instead of using vectorized Pandas methods (also in cell 29). Alternative code using str.extract() for this problem:

data['hs_directory']['lat'] = (data['hs_directory']['location_1']
                               .str.extract(r'(\d{2}\.\d*)')
                               .astype('float'))

data['hs_directory']['lon'] = (data['hs_directory']['location_1']
                               .str.extract(r'(-\d{2}\.\d*)')
                               .astype('float'))
  • The use of the argument inplace=True is discouraged and there is a debate about deprecating it API-wide. Just reassign the output; an example is sketched below. For further details: StackOverflow
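
As a generic illustration (the column name here is hypothetical), reassigning instead of mutating in place looks like this:

# instead of: data['sat_results'].rename(columns={'DBN': 'dbn'}, inplace=True)
data['sat_results'] = data['sat_results'].rename(columns={'DBN': 'dbn'})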

  • Plots: Good use of seaborn! But please import all libraries in the first notebook cell and not somewhere in the middle. I would also highly encourage you to format all axis and column labels properly. In my experience, people who are not directly involved with coding tend to be annoyed by plots that just use variable names. Regarding the use of color (as said elsewhere): in most scenarios I would recommend avoiding double-coding of variables. In your barplots the x/y-value is also used for color (hue). Here you are in essence adding an additional dimension (level of complexity) without increasing the information presented. Yes, it does look nice at first glance, but it is not considered good practice in data visualization, so it is better to avoid it. See the sketch below.
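
For illustration only (the dataframe borough_scores and its columns are made up), a barplot with a single color and readable labels could look like this:

import matplotlib.pyplot as plt
import seaborn as sns

# one color instead of double-coding the value through hue
ax = sns.barplot(x='avg_sat_score', y='borough', data=borough_scores, color='steelblue')
ax.set_xlabel('Average SAT score')
ax.set_ylabel('Borough')
plt.show()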

I hope this list gives you some inspiration for revision. Let me know if you have further questions.

Best regards
htw

Hey Hannes, this is golden! Thank you so much for this very detailed review!

I have always wondered whether I was overdoing my code comments, and now I know that I might be. In my initial guided projects no one really said anything about those comments, so I thought it might be okay. Now that you've said it, it makes complete sense.
But I also want those comments to be super helpful for beginners like me, which is why I often repeat myself in them.

I completely agree with your column naming conventions. It is good to know more about good practices. I will keep it in mind for my future projects.

The vectorized zfill() solution looks so elegant. I will update this.

Also, thanks a lot for giving me more knowledge about inplace=True

Regarding importing libraries, I think I read somewhere that it is probably better to import a library only when it is actually required. Maybe that advice was about very lengthy programs.

And yes, I understand that I could have done plenty of plot formatting. I will keep that in mind as well.

Thanks a lot for your time, input, and feedback. No wonder you are a community champion. Thank you again, and I look forward to exchanging more ideas in the coming days.

@jithins123

About code comments: I guess you could say that there are different ‘styles’ here, depending on the overarching purpose. If, for example, you are writing a tutorial, then I would probably comment my code more extensively than when writing an analysis. More generally, there are some very competent people in the data science landscape claiming that if one follows good naming practices - names reflect the intent of functions and make clear what a variable stores - and writes readable code, then comments are almost not needed at all. While I personally think that this is going too far, I still find it useful at some point to think about what to comment and what to leave as plain code (because it is already self-explanatory), and to reevaluate your older scripts/notebooks.

Best
htw

Hi @htw
I think I also support providing code comments, because there are many small and seemingly trivial details I have learned from others’ code through their comments. I don’t think we can really assume what others do and don’t know.

I’d also like to know whether more lines of code (including code comments) affect the efficiency or run time of a program. I know from working on my blog that everyone advises you to minify CSS and JavaScript for faster website load times. Does something similar apply to a Python program as well?
