Boolean Indexing with NumPy Final Challenge means

Screen Link: Learn data science with Python and R projects

My Code:

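from statistics import mean

# taxi is the 2D NumPy array already loaded in the exercise.
# Average speed in mph, assuming column 7 is distance in miles and column 8 is duration in seconds.
trip_mph = taxi[:, 7] / (taxi[:, 8] / 3600)
# Keep only the trips with an average speed below 100 mph.
cleaned_taxi = taxi[trip_mph < 100, :]

mean_distance = cleaned_taxi[:, 7].mean()
mean_length = mean(cleaned_taxi[:, 8])
mean_total_amount = mean(cleaned_taxi[:, 13])

# Compare the two ways of computing the mean of the same column.
print("statistics.mean(mean_distance) = ", mean(cleaned_taxi[:, 7]))
print("array.mean(mean_distance) = ", cleaned_taxi[:, 7].mean())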

What I expected to happen:
I calculated the mean of the same array column using both statistics.mean() and numpy.ndarray.mean().
I expected the two values to be identical.

What actually happened:
The two means differ by a very small amount, in the last decimal digit:

statistics.mean(mean_distance) =  12.902684630738523
array.mean(mean_distance) =  12.902684630738525

I assume this is related to how the underlying methods calculate the mean. Can anyone explain this in more detail? Which mean() is more accurate?

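Both are accurate enough, and for almost all (if not all) intents and purposes they are the same. The two results differ only in the last significant digit (by about 2e-15), which is the scale of double-precision rounding error. The difference comes from how each function adds up the values: statistics.mean() accumulates the sum as exact fractions and rounds only once at the end, while NumPy adds the values in floating point, so small rounding errors can creep in along the way. Strictly speaking that makes statistics.mean() the more exact of the two, but not by an amount that matters here.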
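To make the floating-point point concrete, here is a minimal sketch using made-up values rather than the taxi data. The helpers naive_mean and exact_mean are hypothetical names introduced only for this illustration: the first adds the values one by one in floating point, the second sums them as exact fractions and rounds once at the end, which is roughly the approach statistics.mean() takes.

```python
from fractions import Fraction
from statistics import mean

# Made-up values for illustration only (not the taxi data).
values = [0.1] * 10


def naive_mean(data):
    """Hypothetical helper: add left to right in floating point, then divide."""
    total = 0.0
    for x in data:
        total += x  # each addition can round the running total
    return total / len(data)


def exact_mean(data):
    """Hypothetical helper: sum as exact fractions, round once at the end."""
    total = sum(Fraction(x) for x in data)
    return float(total / len(data))


print("naive float loop :", naive_mean(values))
print("exact fractions  :", exact_mean(values))
print("statistics.mean():", mean(values))
```

The naive loop should differ from the other two results only in the last digit, the same kind of gap seen in the question. This only illustrates the general effect; it does not reproduce NumPy's actual summation code.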

So, in my opinion, accuracy is not much to worry about here. However, NumPy is usually faster than the statistics module on arrays, iirc, so speed is a reasonable criterion if you have to pick one of them.

A couple of discussions that could be useful if you want to dig into more detail -