Use of the mean value instead of incorrect values in an array

Screen Link:
https://app.dataquest.io/m/290/boolean-indexing-with-numpy/6/assigning-values-in-ndarrays

In question 3 of this mission, we are asked to use the mean value of a column to replace incorrect data.
The provided solution uses a mean that is calculated including the incorrect data.
Wouldn’t it be better to calculate the mean without the incorrect data?
I realise the difference between the two means is very small here, but I was wondering what the best approach is in the real world. Is it worth the trouble of adding some extra code to compute the mean without the incorrect data, or is the mean provided in the answer good enough?

Question 3:
The values at column index 7 (trip_distance) of rows index 1800 and 1801 are incorrect. Use assignment to change these values in the taxi_modified ndarray to the mean value for that column.

Provided answer:

taxi_modified[1800:1802,7] = taxi_modified[:,7].mean()

In addition, I wrote the code below to compute the mean without the two incorrect values. However, I would like to know what a more elegant way of doing it would be.

mean = 0

# add up every value in column 7, skipping rows 1800 and 1801
for item in taxi_modified[:1800,7]:
    mean += item
for item in taxi_modified[1802:,7]:
    mean += item

# divide by the number of rows actually summed
mean = mean / (taxi_modified.shape[0] - 2)

print(mean)
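
Something like the following also seems to work as a shorter version (just a sketch using np.delete; I am not sure it is the idiomatic way):

import numpy as np

# np.delete returns a copy of column 7 with rows 1800 and 1801 removed,
# so the mean is computed only over the remaining values
clean_mean = np.delete(taxi_modified[:, 7], [1800, 1801]).mean()
print(clean_mean)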

Thank you for reading!

I don’t know the best approach yet, since I’m still learning, but as for your additional question: instead of for loops, you can do the following:

```
mean_7 = (sum(taxi_modified[:, 7]) - sum(taxi_modified[1800:1802, 7])) / (taxi_modified[:, 7].shape[0] - 2)
taxi_modified[1800:1802, 7] = mean_7
```

I think incorrect values should be ignored. For your second question, you can use numpy.r_ to concatenate two slices (the first from 0 to 1799 and the second from 1802 to 89559), like:

>>> taxi_modified[np.r_[:1800,1802:len(taxi_modified)], 7].mean()
12.667502623997859
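
To see what np.r_ produces, here is a tiny toy example (not the mission data):

>>> import numpy as np
>>> np.r_[:3, 5:8]          # concatenates the two ranges into one index array
array([0, 1, 2, 5, 6, 7])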

The good thing with pandas is that when aggregating data, it ignores missing data. Look at this example:

In [11]: a = pd.Series([1, 2, 3, 4, np.nan, np.nan])

In [12]: a
Out[12]:
0    1.0
1    2.0
2    3.0
3    4.0
4    NaN
5    NaN
dtype: float64

In [13]: a.mean()
Out[13]: 2.5

When calculating the mean, pandas ignored the missing data. This technique can also be used to impute missing data in a Series or DataFrame.
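
For example, a minimal sketch of that kind of imputation, reusing the Series a from above:

import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3, 4, np.nan, np.nan])
# mean() skips the NaN entries, and fillna() writes that mean into the missing slots
print(a.fillna(a.mean()))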

@info.victoromondi The difference in this particular case is that we are not replacing missing data. We are replacing incorrect data. Those two rows have some value for column 7.

So, calculating the mean including the incorrect values doesn’t seem like the best approach here. That’s what the question is about.

@PythonSattva I would say this depends on the incorrect values and what makes sense given the rest of the data.

If you change those two values yourself and set them to be really large, you will notice the mean changes drastically, which would be incorrect.

I changed those two values to be 100 times larger, so 988 and 860, and the mean changed to 12.688, which isn’t much of a change from the previous mean calculation.

Times 1000. Not that much difference either. ~12.87.

At 10000 times, it changed to ~14.7. That starts to differ a bit.
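
For reference, a rough sketch of that check (assuming taxi_modified is the ndarray from the mission):

# scale the two suspect values by increasing factors and watch how far
# the column mean drifts away from the original ~12.67
for factor in (100, 1000, 10000):
    col = taxi_modified[:, 7].copy()           # work on a copy so the data stays intact
    col[1800:1802] = col[1800:1802] * factor   # exaggerate the two incorrect rows
    print(factor, col.mean())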

If the difference in the calculated mean is small enough that it doesn’t impact the quality of the data for whatever task you are trying to solve, then it isn’t much of a problem in my view.

In this particular scenario, the values are trip_distance.

As a pure hypothetical (with not much thought put into it), if we were to use this mean to figure out, as a business, what price to charge per km or mile, then a difference of 2 km or miles could have an impact. A difference of 0.20 km or miles may or may not matter to the passenger or the taxi driver.

The above is not a concrete answer. Just my thought process on what might be an “acceptable” margin of error given the use-case and data. It does seem that it’s better to avoid including the incorrect values no matter what.

Thanks for introducing me to a new method.
I was expecting a slightly different result:
12.667502623997947

I’ve played around with numpy.r_; however, I did not manage to reproduce the above result.

The question below might deserve a separate topic about numpy.r_.

Why is the result significantly different if I remove len(taxi_modified)?

taxi_modified[np.r_[:1800,1802:], 7].mean()

In addition, I thought the end bound index is not included in the slice, and that to have it included I need to add one. However, I end up with an error when running the code below:

end_bound = len(taxi_modified) + 1
taxi_modified[np.r_[:1800,1802:end_bound], 7].mean()

Error:

IndexError: index 89560 is out of bounds for axis 0 with size 89560

I did the below experiment:

  • create an array [ 1 2 3 4 5 6 7 8 9 10]
  • slice the array by removing the number 5. Expected result: [ 1 2 3 4 6 7 8 9 10]
tab = np.array([1,2,3,4,5,6,7,8,9,10])

print(tab)

end = int(len(tab)+1)
tab = np.r_[:5,6:end]

print(tab)

tab = np.r_[1:5,6:end]

print(tab)

which gives me the below result:

[ 1  2  3  4  5  6  7  8  9 10]
[ 0  1  2  3  4  6  7  8  9 10]
[ 1  2  3  4  6  7  8  9 10]

I don’t understand why there is a 0 in the second array that I print.
In order to get the result that I expected, I had to slice by adding a 1: tab = np.r_[1:5,6:end]

Thanks for your reply. I haven’t learned pandas yet; I might get your point a bit later in my studies. Cheers!

Thanks for the contribution. I understand that there is no absolute answer and that it is more of a business analysis consideration. I appreciate you sharing your reasoning.

That is because of a precision difference.
See the different results between Python’s built-in sum and NumPy’s np.sum:

>>> p_sum = sum(taxi_modified[np.r_[0:1800,1802:len(taxi_modified)], 7])
>>> n_sum = np.sum(taxi_modified[np.r_[0:1800,1802:len(taxi_modified)], 7])
>>> p_sum, n_sum
(1134476.200000008, 1134476.2000000002)

If you use Python’s sum you will get your expected output:

>>> p_sum / (len(taxi_modified)-2)
12.667502623997947

while using NumPy’s np.sum:

>>> n_sum / (len(taxi_modified)-2)
12.667502623997859

gives a result similar to taxi_modified[np.r_[:1800,1802:len(taxi_modified)], 7].mean(), so I guess mean() internally uses np.sum, but I am not sure.


You don’t have to include the extra 1: since indexing starts at 0, a stop of len(taxi_modified) already covers everything up to the last element of the array.

Here again you don’t have to add the extra 1 to end. Second, np.r_ generates indices for the array, so after generating the indices you have to use them to index into the array. Also, you want to ignore the 5th value, which sits at index 4, so while generating the indices we should skip index 4. Hence it should be:

end = len(tab)                 # removed the extra `1`
tab = tab[np.r_[:4, 5:end]]    # index into `tab` with the generated indices, skipping index 4 (value 5)
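
Putting it together as a self-contained sketch:

import numpy as np

tab = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
end = len(tab)
# np.r_[:4, 5:end] builds the index array [0 1 2 3 5 6 7 8 9], i.e. it skips index 4 (value 5)
print(tab[np.r_[:4, 5:end]])   # -> [ 1  2  3  4  6  7  8  9 10]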

See the numpy.r_ documentation.

So np.r_[:1800, 1802:] is similar to concatenating np.arange(start=None, stop=1800) and np.arange(start=1802, stop=None), but I don’t know why it returns a result when stop=None in the second slice.