Getting NaNs using a left join, okay I'm skipping ahead using prior knowledge, but I'd love to know why

Screen Link:

My Code:

day_ymd=day.set_index(['year','month','day'])
holidays_midnight_ymd=holidays_midnight.set_index(['year','month','day'])
holiday_join=day_ymd.join(holidays_midnight_ymd, how = 'left', lsuffix='_day',rsuffix='_night')

What I expected to happen:

I expected there to be a list of holidays similar to this list

night['holiday'].unique()
array(['None', 'Columbus Day', 'Veterans Day', 'Thanksgiving Day',
       'Christmas Day', 'New Years Day', 'Washingtons Birthday',
       'Memorial Day', 'Independence Day', 'State Fair', 'Labor Day',
       'Martin Luther King Jr Day'], dtype=object)

in the holiday_night column afterward, rather than entirely NaN’s, and I expected there to be about 24*(5 to 7) values for each holiday since previously each holiday had around 5 to 7 counts, all at night. I also expected there to be some NaNs in the column, since the holidays table only contained days on which there was a holiday, and the days table also contained days on which there was not a holiday. I was planning to make a further cut after that point for days on which there was a holiday, and do some analysis about the weather, traffic, time of day, and other factors. Clearly holidays do not occur only at night.

What actually happened:

holiday_join['holiday_day'].unique()
array(['None'], dtype=object)
holiday_join['holiday_night'].unique()
array([nan], dtype=object)

I’m working on the I-94 project, with some additional exploration after completing the certificate, to make it ready for community review. It will be my first project I submit for community review. I’m trying to analyze the day/night holiday patterns. Only the holidays are only listed at midnight in this data set. I could probably split it by day and night using a for loop, or just consider the combined data set, and I may do that if I can’t get this to work. Either would be an okay approach.

Right now I have a dataframe with all daytime data with an index given by year, month, and day (numbers), which is called day_ymd. I also have a dataframe with all holiday data, where the holiday data is only given at midnight, in a dataframe called holidays_midnight_ymd. Again, the index is a three layer index with year, month, day in that order. The day_ymd data frame has many entries for every year,month,day set because there are many hours in a day. The holidays_midnight_ymd dataframe has unique indices.

I am doing a left join from day_ymd on holidays_midnight_ymd because I want there to be an entry for every hour in the day corresponding to a holiday, if one exists. I set the left suffix to “_day” and the right suffix to “_night”, because the left data is drawn from daytime data and the right data is drawn from midnight data.

I expected the columns with the suffix of “_night” to have non nan values. In particular, if the index in the holiday_join table is 12,11,22 for Thanskgiving, I expected all 24 entries associated with the 24 hours of that day to have data that could be matched in the holidays_midnight_ymd table and that would appear in the holiday_night column, because the key 12,11,22 matches the key for Thanksgiving for all 24 hours of the day in the day_ymd table and it’s a left join from that table. Instead I got NaN’s for all columns in the right table. I’m not sure why. That definitely shouldn’t have happened.

Can anyone explain what I did wrong, whether it was programming or conceptual?

Thanks.

A twitter friend named Sam Laki gave me some help (thanks so much to him!). I had a column in my daytime table called ‘date_0h’ that was a datetime object with the hour removed. This was intended to match the midnight time of the holidays table. At Sam’s suggestion, I added a similar column to the holidays table, then merged on this column. The critical step seemed to be using “left_on” and “right_on” keys words in my merge. I had been under the impression that if I merged with a matching index column, it would automatically use that column and merge based on indices, but that is not true. The left_on and right_on keywords are necessary.

So the solution was

holidays_datetime_merge=day_datetimekey.merge(holidays_tidy,how='left',left_on='date_0hr',right_on='date_0hr')
holidays_datetime_merge['holiday_y'].value_counts()
Martin Luther King Jr Day    122
Labor Day                    115
Thanksgiving Day              92
Christmas Day                 89
New Years Day                 80
Washingtons Birthday          79
Memorial Day                  71
Independence Day              71
State Fair                    65
Columbus Day                  63
Veterans Day                  63
Name: holiday_y, dtype: int64

This was the output I was expecting. Hurray!

1 Like