Best coding practices for data science

I just read https://towardsdatascience.com/top-10-coding-mistakes-made-by-data-scientists-bb5bc82faaee and wanted to check: is it really industry practice to use DAGs extensively, as his point 5 suggests? Also, in point 7 he encourages using assert for data validation, while https://dbader.org/blog/python-assert-tutorial says Don't Use Asserts for Data Validation because asserts can be disabled. Which is the better view? Any other comments on the 10 coding mistakes?

Oh good article! I personally love reading articles from people in the industry because it gives a unique perspective on what shapes their opinions. I will add that these are just opinions and not always the path everyone should follow. Let me explain.
Every company is different. I say every company because the practices change from company to company. It is always best to have a wide view, but also know that you will have to be flexible and open to doing things differently in each company you may join.
On to the article itself. Here is how it is done at my company, along with my viewpoint.
Point 1: I would say this is true for open source projects as a whole. Internal projects within a company may not share this viewpoint.
Point 2: This is an industry standard and even a basic software standard.
Point 3: I have mixed feelings about this point. Not because I disagree with it (I absolutely agree with it), but because if a given company does not have a style guide, this can happen: if no one knows better, you can't always expect a code review to catch it. Style guides help avoid this.
Point 4: I would say nearly every company I have been with follows this as a standard.
Point 5: This one varies widely across companies. We personally use Airflow, but every company does it differently. I would say it is best to know the basics of a few approaches (including writing functions) because it will help you acclimate faster to a given company's standard.
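Whatever scheduler a company uses (Airflow or anything else), the underlying idea is just a dependency graph of tasks. Here is a minimal sketch of that idea using only the standard library's `graphlib` (Python 3.9+); the task names are made up for illustration:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline tasks; names are illustrative only.
def extract():   print("extracting raw data")
def transform(): print("transforming rows")
def load():      print("loading to warehouse")

tasks = {"extract": extract, "transform": transform, "load": load}

# Each task maps to the set of tasks that must run before it.
deps = {"transform": {"extract"}, "load": {"transform"}}

# static_order() yields a valid execution order for the DAG.
order = list(TopologicalSorter(deps).static_order())
for name in order:
    tasks[name]()

print(order)  # ['extract', 'transform', 'load']
```

A real scheduler adds retries, scheduling, and parallelism on top, but the DAG skill transfers regardless of which tool a company has picked.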
Point 6: Mixed feelings about this one, honestly. As datasets grow in complexity this point is correct, but you must learn to crawl before you walk and walk before you run. Knowing the basics is critical, and for loops help with that, but as data grows you must be ready to switch gears and use NumPy and the other tools. This is where student collaboration comes in: team up with others to work on larger datasets so you get accustomed to these tools.
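To make the "switch gears" step concrete, here is a small sketch of the same computation written as a loop and as a vectorized NumPy expression (assuming NumPy is installed):

```python
import numpy as np

data = list(range(100_000))

# Loop version: fine while learning the basics.
squared_loop = []
for x in data:
    squared_loop.append(x * x)

# Vectorized version: one expression over the whole array, no Python loop.
arr = np.array(data)
squared_vec = arr * arr

# Both produce the same numbers; the vectorized form runs in optimized C
# and is typically much faster on large arrays.
assert squared_loop == squared_vec.tolist()
```

The translation is usually mechanical like this; the hard part is getting enough practice on real-sized data that reaching for the array form becomes a habit.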
Point 7: In most companies unit tests are standard. In every company I have worked at, unit tests have been required. Get used to that; I would say it is an industry standard.
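On the assert question from the original post: the two views are compatible. Plain `assert` is the norm inside unit tests, but it is unsafe for validating production data because running Python with the `-O` flag strips assert statements entirely. A small sketch of both patterns (function names are made up):

```python
def validate_age(age):
    # An explicit raise survives `python -O`; an assert here would not.
    if not isinstance(age, int) or age < 0:
        raise ValueError(f"invalid age: {age!r}")
    return age

# Inside a unit test, plain asserts are fine (this is the pytest style):
def test_validate_age():
    assert validate_age(30) == 30
    try:
        validate_age(-1)
    except ValueError:
        pass  # expected
    else:
        raise AssertionError("expected ValueError for negative age")

test_validate_age()
```

So: asserts for tests, explicit exceptions for validating data a pipeline actually receives.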
Point 8: Industry standard
Point 9: This changes from company to company. I have seen it both ways at different companies over the last 3 to 5 years. I would say get familiar with a few different ways.
Point 10: This is not an industry standard. I have seen large Fortune 100 companies use Apache Zeppelin. https://www.analyticsindiamag.com/jupyter-vs-zeppelin-a-comprehensive-comparison-of-notebooks/

I will close this out by saying this is just my take, and I am by no means the final word. Everyone has different perspectives. It is best to be as well rounded as you can and to familiarize yourself with the alternatives.
I hope this gives you some viewpoints to help, and that it addresses your post as well as possible.
Have a great weekend :slight_smile:

I suggest the original poster learn the best practices of software engineering; they cover most of the points listed in the article. Jupyter notebooks are good for prototyping, but they do not scale well to a production system.
