Could use some help formatting this project

LINK TO MY CODE

Screen Link:
I-94 for beautifying plots.ipynb (3.6 MB)

Background:

I have a project where the analysis is essentially finished. It’s the project to analyze the I-94 data between St. Paul and Minneapolis, Minnesota, in the westbound direction. I happen to live nearby, so I got pretty into this project.

I finished it once, then went back and cleaned the data a couple more times after discovering new issues, split the data into new segments after discovering new features in the data, performed a couple of additional analyses, and discovered a couple of additional patterns, trends, and correlations. Okay, it seems done to me.

I noticed that there was missing data between 2014 and 2015, so I split the data into 2012-2014 and 2015-2018. In the second segment of data (which I call the second era), December, January, and July have a lower average traffic volume than the other months. In the first era, only December and January have a lower traffic volume. So what made July change so dramatically between the two eras?

From my conclusion, the effects I found:
ASSESSMENT OF THE EFFECT IN JULY, JANUARY, AND DECEMBER

While there was a road closure July 22-24, 2016, and there seems to have been a similar closure on July 25, 2015, excluding these closures does not resolve the decrease in July traffic volume in 2015-2018.

The squall in the data is a tempting explanation, but is not a good one, since it takes place in May 2013 rather than July between 2015-2018.

Smoke was present on July 6, 2015; May 7, 2016; and August 18, 2018. The traffic volume was most remarkably low in May 2016. It would be challenging to claim that this accounted for the whole July effect in 2015-2018.

The low traffic on Independence Day itself cannot explain the drop in July traffic from the first era to the second, because the average traffic volume on Independence Day is higher in the second era.

July was cloudier in the second era, and clouds correlate negatively with traffic volume.

So far, I believe that the lower July traffic volume in 2015-2018 correlates most strongly with Friday traffic volumes in July. It may also be due to students, especially University of Minnesota students, not driving in the summer.

The question

However, I’m having some trouble writing it up neatly. There’s a lot here. I’m not sure how to emphasize the plots I want, or the relevant parts of the analysis. I’m not sure what parts should take the focus, or how to even make something take focus in a jupyter notebook. I’m not sure how to make my many lines of code less intrusive in the presentation.

Does anyone have any suggestions for how to make this a bit neater and more relevant? Also which plots I should focus on? I think probably the paragraph I excerpted above is the most relevant conclusion, but if some other part strikes you as more interesting, I’d definitely love to know!

Thanks so much for any help you can give me. The science and programming are quite achievable, but turning this into something presentable is quite a challenge for me!


Hi @sdorsher,

Yeah, this is a tough one: there’s a lot you could write up and visualise with plots, but giving equal weight to every explainable thing can dilute the impact of the most important findings.

The first thing to do is probably to go back to the introduction and review the questions you seek to answer. Even if you didn’t write them down in the introduction, you probably had some questions in mind that you were interested in answering. Also, since you’re very familiar with the highway, your questions are probably more elaborate than what Dataquest intended, so it’s worth giving your own questions an extra think. Maybe you’re interested in something different than what Dataquest intended, and since it’s your project, you can skip the guide and focus on whatever questions you feel are more pertinent. In other words, always go back to your own research questions.

Second, it might be worthwhile to prime the readers on what findings to focus on by adding an abstract or summary of results of sorts in the introduction. Focus first on the results for what you initially sought to answer, and then sprinkle in a curated set of incidental results that were not part of your original questions but are still worthwhile and interesting. Try to limit the number of findings in the summary; that will force you to pick only the most important ones.

Third, maybe have separate styles and quality for plots of varying importance. Have a standard style for most plots but go beyond standard for the important ones. That style sense is something you’ll have to develop on your own and one good method is to look at some plots by others that pique your interest and try to reproduce them in your project. Off the top of my head, in this forum we have @shaun.oilund and @anna.strahl who are both exceptional at making visually appealing and informative plots and both are also great at explaining their findings in a structured manner; have a look at their projects.

Fourth, one thing I’ve seen some people do is, instead of lumping all explanation of the results at the end of the project or in the conclusion (like the discussion section in many scientific papers), they separate the analysis into sections and have a mini-conclusion for each section. Since the mini-conclusions have already provided the necessary detailed explanation for each section, the main conclusion can be shorter and focus on being a brief recap of the whole project. You have something similar with the “Summary so far” section, but it covers multiple parts of the analysis, so it can get quite long, especially if it is written infrequently and accumulates multiple findings to summarise.

Fifth, for unwieldy code, encapsulating repeated code in functions can help a lot, especially for similar-looking plots with similarly written code. Another option is to extract that code and put it into a Python module. You can then just import the module and call the functions without exposing the intrusive code in Jupyter.
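
To make the fifth point concrete, here is a minimal sketch of wrapping a repeated plot in one function. The column names (date_time, traffic_volume), the function name plot_monthly_mean, the vehicles/hour unit, and the tiny made-up data frame are all assumptions for illustration, not taken from the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd


def plot_monthly_mean(ax, df, title, value_col="traffic_volume"):
    """Draw the mean of value_col per calendar month on the given axes."""
    monthly = df.groupby(df["date_time"].dt.month)[value_col].mean()
    ax.plot(monthly.index, monthly.values, marker="o")
    ax.set_title(title)
    ax.set_xlabel("Month")
    ax.set_ylabel("Mean traffic volume (vehicles/hour)")  # unit is assumed
    return monthly


# Tiny made-up frame so the sketch runs on its own
demo = pd.DataFrame({
    "date_time": pd.to_datetime(
        ["2013-01-05", "2013-01-20", "2013-07-04", "2016-07-04"]
    ),
    "traffic_volume": [3000, 3200, 2500, 2700],
})

# One call per era instead of two copies of the same plotting code
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
first_era = plot_monthly_mean(ax1, demo[demo["date_time"].dt.year <= 2014], "2012-2014")
second_era = plot_monthly_mean(ax2, demo[demo["date_time"].dt.year >= 2015], "2015-2018")
```

If the function were saved in a file next to the notebook (say a hypothetical plot_helpers.py), the notebook cell would shrink to an import plus the two calls.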

Hopefully that helps.


Hi @sdorsher,

First of all, I want to commend you on all the work and effort you put into your project. This is a big project with many elements, and you have done well putting it all together! I find one of the biggest challenges is trying to collect all your thoughts and present them to the reader in a clear and concise manner.

Layout wise, I would suggest using numbered sections and sub-sections to clearly break apart the analysis and conclusions; this will help carry the reader along. As @wanzulfikri recommended, I would consider writing out separate section and sub-section summaries. This can help organize your thoughts and then you can go back to those section summaries and pick out the main points you want to emphasize in the grand finale summary.

I do like including a lot of code comments in my projects, at times so many that they become redundant. For your project, I would suggest turning some of those longer code comments into a short markdown paragraph that explains your intentions, and then only including a short code comment if additional clarification is necessary.

I am also not a stranger to using ‘bold’; however, I would avoid ‘bolding’ entire sentences and maybe just use bold to emphasize a couple words in that sentence.

As part of your analysis (line [52]) I noted that your cut-off for rainfall per hour was 300 mm. I would revisit that cut-off; the only reason it caught my eye is that I have lived in some hurricane-prone areas, and the most I ever saw was maybe 200 mm over a 24-48 hour period.
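
One way to revisit the cut-off is to look at the largest values before picking a number. The sketch below uses a made-up series; the name rain_mm_per_hour, the values, and the 100 mm threshold are all assumptions for illustration and should be replaced by the notebook's own column and a limit checked against local weather records.

```python
import pandas as pd

# Made-up hourly rainfall values with one obviously implausible reading
rain = pd.Series([0.0, 0.0, 0.3, 2.5, 12.0, 55.0, 9999.0], name="rain_mm_per_hour")

# Inspect the distribution and the largest readings first
print(rain.describe())
print(rain.sort_values(ascending=False).head(3))

# Hypothetical cut-off; the right value depends on regional records
cutoff_mm_per_hour = 100
plausible = rain[rain <= cutoff_mm_per_hour]
```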

I know you said you are going to go back to format your plots, so here are just some things to keep in mind. You can pass figure size parameters to control the width and height of the chart: plt.figure(figsize=(width, height)). Include units in your x and y labels, and use intuitive plot titles. Consider softer color combinations that are easier on the reader’s eyes. If you have an x-axis with months of the year, for example, make sure the axis includes all months so it is easy to identify values on the plot. There are a lot of good resources out there; I just like to play around with formats, colors, shapes, sizes, and plot styles to see what I can create.
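
As a rough sketch of those formatting tips together (figure size, labelled axes with units, a softer color, and a tick for every month), something like the following could work. The monthly values are made up and the vehicles/hour unit is an assumption.

```python
import calendar

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

# Made-up monthly means, one value per calendar month
months = list(range(1, 13))
volumes = [3050, 3300, 3350, 3400, 3450, 3500, 3100, 3450, 3400, 3450, 3300, 3000]

fig, ax = plt.subplots(figsize=(10, 4))  # width, height in inches
ax.plot(months, volumes, marker="o", color="steelblue")

# Show every month on the x-axis, labelled by its abbreviation
ax.set_xticks(months)
ax.set_xticklabels([calendar.month_abbr[m] for m in months])

ax.set_xlabel("Month")
ax.set_ylabel("Mean traffic volume (vehicles/hour)")  # unit is assumed
ax.set_title("Mean westbound I-94 traffic volume by month (illustrative data)")
fig.tight_layout()
```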

I hope this helps and if you have any other questions please don’t hesitate to ask!


Hi!

I really like how thorough you are with investigating potential data quality issues. In particular I like how you identified which years/months/days/hours were missing and then used this information to help frame your analysis.

Here are my thoughts as I read through your notebook:

  • For Lines 7-15 (converting variable types) I would lump all of those into one code block since you are essentially doing the same “type” of cleaning multiple times. It makes it easier for a viewer to follow your workflow if each code block contains clustered (but similar) code.
  • Since you end up dropping the second and minute columns, would it be simpler to just not create them in the first place?
  • When you drop rows with fewer than 6 hours of data per day, I would recommend copying the original dataframe (.copy()) and creating a new one with a different name, just in case you want to compare the cleaned data with the original data. For your analysis of July in particular, I wonder if any of the rows you dropped due to low hour representation could have changed your conclusion.
  • When you display “num_hours_per_year” this would be a good place to include an exploratory visual with a bar chart. It could help you and your audience note any years with more/less missing data.
  • Really cool pivot table for num_days_per_month_and_year! My initial reaction is that it would be cool to view as a seaborn heatmap (but I always look for excuses to use heatmaps since I think they look so neat).
  • In line[40-42] you use a lot of comments in your code cells. This might be more readable as markdown.
  • “Drop all rows with temperatures below -30 F which is a reasonable lower bound for MN”: didn’t you do this already with -40 F? The thought process for filtering on temperature (F) twice isn’t clear. I personally would have picked a single cutoff and stayed with it.
  • For your holidays dataframe, you would have fewer repeat columns to remove if you made the initial holiday dataframe just the date and holiday columns (since none of the other columns affect the holiday status).
  • In Line[81], where you have a histogram of traffic volume for 2012-2014, why did you not repeat the histogram for the second era? It would be nice to be able to compare them side-by-side.
  • I love the 2x2 graph grid in Line[93]!
  • I like your investigation into why July has a drop in volume, but the side-by-side comparisons of 2012-14 and 2015-18 looked very similar to me. Why do the graphs look so similar overall when one has a drastic peak and the other doesn’t? I’d like to see more explanation/analysis as to what’s going on here.
  • For day_traffic_pivot I’d recommend using mean instead of sum (or using both to compare). Sum can be affected by missing datapoints more easily than mean.
  • The holiday traffic volume graph in line[199] is great!!!
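
Two of the points above (filtering on a copy, and mean versus sum in a pivot) can be sketched on a toy frame. The column names, the made-up values, and the 3-hour threshold (standing in for the notebook's 6-hour rule) are assumptions for illustration only.

```python
import pandas as pd

# Made-up hourly rows: Monday is missing an hour, Tuesday is complete
raw = pd.DataFrame({
    "day": ["Mon", "Mon", "Tue", "Tue", "Tue"],
    "hour": [8, 9, 8, 9, 10],
    "traffic_volume": [4000, 4200, 4000, 4200, 4100],
})

# Filter into a new, separately named copy so the original stays available
hours_per_day = raw.groupby("day")["hour"].transform("count")
cleaned = raw[hours_per_day >= 3].copy()

# Sum is pulled down by the missing Monday hour; mean is not
by_day_sum = raw.pivot_table(index="day", values="traffic_volume", aggfunc="sum")
by_day_mean = raw.pivot_table(index="day", values="traffic_volume", aggfunc="mean")
print(by_day_sum)
print(by_day_mean)
```

Here the sums suggest Monday is much quieter than Tuesday, while the means are identical, which is the effect of the missing hour rather than of the traffic itself.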

Overall I am inspired by your deep dive into answering interesting questions about the dataset’s anomalies and really enjoyed seeing your perspective on some elements I didn’t consider in my own analysis. In general I would recommend adding more exploratory graphs to help visualize relationships in each section instead of relying on text and tables since graphs are much easier to look at from an end-user perspective in a long analysis. I would also recommend making more use of markdown cell formatting between each major analytical theme so that your audience can follow the story along the way instead of getting a large block of text at the end.

Thanks for sharing!


Thank you all so very much! These are excellent suggestions! Because of Thanksgiving I haven’t had a chance to go back to this yet, but I will try to do so sometime soon. I feel like this part might take a little longer, between other commitments right now and the fact that it is really about ironing out the details. But I definitely will return to it! Thank you so much for the Community Champion recognition; it’s wonderful to know I’m making progress and on the right track! I also really appreciate your feedback!
