# Use of the mean value instead of incorrect values in an array

In question 3 of this mission, we are asked to use the mean value of a column to replace incorrect data.
The provided solution calculates that mean with the incorrect values still included.
Wouldn’t it be better to calculate the mean without using the incorrect data?
I realise the difference between the two means is very small here, but I was wondering what the best approach is in the real world. Is it worth the trouble of adding extra code to compute a mean without the incorrect data? Or is the mean in the provided answer good enough?

Question 3:
The values at column index `7` (trip_distance) of rows index `1800` and `1801` are incorrect. Use assignment to change these values in the `taxi_modified` ndarray to the mean value for that column.

```
taxi_modified[1800:1802, 7] = taxi_modified[:, 7].mean()
```
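One way to see the difference the two approaches make is to compute both means on a small synthetic column (the numbers below are made up for illustration, not the taxi data): `np.delete` returns a copy with the given rows dropped, so the mean can be taken over only the valid values.

```python
import numpy as np

# Synthetic stand-in for the trip_distance column; 900.0 and 850.0
# play the role of the two incorrect values.
col = np.array([1.5, 2.0, 3.0, 900.0, 850.0, 2.5])
bad_rows = [3, 4]

mean_with = col.mean()                          # incorrect values included
mean_without = np.delete(col, bad_rows).mean()  # incorrect values dropped

print(mean_with, mean_without)
```

With only two bad rows out of ~89,560, the two means of the real column barely differ, which is why the provided solution gets away with the simpler version.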

In addition, I wrote the code below to compute the mean without the two incorrect values. However, I would like to know a more elegant way of doing it.

```
mean = 0
for item in taxi_modified[:1800, 7]:
    mean += item
for item in taxi_modified[1802:, 7]:
    mean += item

# shape is a tuple, so the row count is shape[0]
mean = mean / (taxi_modified.shape[0] - 2)

print(mean)
```


I don’t know the best approach yet, since I’m still learning, but as for your additional question: instead of for loops, you can do the following:

```
mean_7 = (sum(taxi_modified[:, 7]) - sum(taxi_modified[1800:1802, 7])) / (taxi_modified[:, 7].shape[0] - 2)
taxi_modified[1800:1802, 7] = mean_7
```


I think the incorrect values should be ignored. And for your second question, you can use `numpy.r_` to concatenate two slices (first `0` to `1799`, second `1802` to `89559`), like:

```
>>> taxi_modified[np.r_[:1800, 1802:len(taxi_modified)], 7].mean()
12.667502623997859
```
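For anyone new to `numpy.r_`: it translates slice notation into a single index array, which can then be used for fancy indexing. A minimal sketch on made-up data:

```python
import numpy as np

# Concatenate two ranges of indices: 0..2 and 5..7
idx = np.r_[:3, 5:8]
print(idx)  # [0 1 2 5 6 7]

# Use the indices to average an array while skipping rows 3 and 4
values = np.arange(10, dtype=float)
print(values[idx].mean())  # 3.5
```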

The good thing with pandas is that when aggregating data, it ignores missing data. Look at this example:

```
In : a = pd.Series([1, 2, 3, 4, np.nan, np.nan])

In : a
Out:
0    1.0
1    2.0
2    3.0
3    4.0
4    NaN
5    NaN
dtype: float64

In : a.mean()
Out: 2.5
```

When calculating the mean, it ignored the missing data. This technique can also be used to impute missing data in a Series or DataFrame.
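For instance, the NaN-skipping mean can be fed straight back into `fillna` to impute the missing entries (a minimal sketch of the imputation mentioned above):

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3, 4, np.nan, np.nan])

# mean() skips NaN by default, so the gaps are filled with 2.5
filled = a.fillna(a.mean())
print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0, 2.5, 2.5]
```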


@info.victoromondi The difference in this particular case is that we are not replacing missing data. We are replacing incorrect data. Those two rows have some value for column `7`.

So, calculating the mean including the incorrect values doesn’t seem like the best approach here. That’s what the question is about.

@PythonSattva I would say this depends on the incorrect values and on what makes sense given the rest of the data.

If you change the values for those two rows yourself and set them to be really large, you will notice the mean change drastically, which would be incorrect.

I changed those two values to 100 times their size, so `988` and `860`, and the mean changed to `12.688`, which isn’t much of a change from the previous mean calculation.

Times 1000: not that much difference either, ~`12.87`.

At 10000 times, it changed to ~`14.7`. That starts to differ a bit.

If the difference in the mean is small enough that it doesn’t impact the quality of the data for whatever task you are trying to solve, then in my opinion it isn’t much of a problem.
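The sensitivity described above is easy to reproduce on synthetic data (the numbers here are made up, not the taxi dataset): inflating just two values by ever larger factors pulls the mean further and further from the clean value.

```python
import numpy as np

rng = np.random.default_rng(0)
col = rng.uniform(5, 20, size=1000)  # plausible trip distances
clean_mean = col.mean()

means = {}
for factor in (100, 1000, 10000):
    corrupted = col.copy()
    corrupted[:2] *= factor  # inflate two values, as in the experiment above
    means[factor] = corrupted.mean()

print(clean_mean, means)  # the drift grows with the factor
```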

In this particular scenario, the values are `trip_distance`.

As a pure hypothetical (with not much thought put into it): if we were to use this mean to figure out, as a business, what price to charge per km or mile, then a difference of 2 km or miles could have an impact. A difference of 0.20 km or miles may or may not matter to the passenger or the taxi driver.

The above is not a concrete answer, just my thought process on what might be an “acceptable” margin of error given the use case and data. It does seem that it’s better to avoid including the incorrect values no matter what.


Thanks for introducing me to a new method.
I was expecting a slightly different result: `12.667502623997947`.

I’ve played around with `numpy.r_`; however, I did not manage to reproduce that result.

The question below might deserve another topic about `numpy.r_`.

Why is the result significantly different if I remove `len(taxi_modified)`?

```
taxi_modified[np.r_[:1800, 1802:], 7].mean()
```

In addition, I thought the end-bound index is not included in a slice, and that to include it I need to add one. However, I end up with an error when running the code below:

```
end_bound = len(taxi_modified) + 1
taxi_modified[np.r_[:1800, 1802:end_bound], 7].mean()
```

Error:

```
IndexError: index 89560 is out of bounds for axis 0 with size 89560
```

I did the below experiment:

- create an array `[ 1 2 3 4 5 6 7 8 9 10]`
- slice the array to remove the number 5. Expected result: `[ 1 2 3 4 6 7 8 9 10]`
```
tab = np.array([1,2,3,4,5,6,7,8,9,10])

print(tab)

end = int(len(tab)+1)
tab = np.r_[:5,6:end]

print(tab)

tab = np.r_[1:5,6:end]

print(tab)
```

which gives me the below result:

```
[ 1  2  3  4  5  6  7  8  9 10]
[ 0  1  2  3  4  6  7  8  9 10]
[ 1  2  3  4  6  7  8  9 10]
```

I don’t understand why there is a 0 in the second array that I print.
In order to get the result that I expected, I had to slice by adding a 1: `tab = np.r_[1:5,6:end]`.


Thanks for your reply. I haven’t learned pandas yet; I might get your point a bit later in my studies. Cheers!

Thanks for the contribution. I understand that there is no absolute answer and that it is more of a business-analysis consideration. I appreciate you sharing your reasoning.

That is because of a precision difference.
See the different results between Python’s `sum` and NumPy’s `np.sum`:

```
>>> p_sum = sum(taxi_modified[np.r_[0:1800,1802:len(taxi_modified)], 7])
>>> n_sum = np.sum(taxi_modified[np.r_[0:1800,1802:len(taxi_modified)], 7])
>>> p_sum, n_sum
(1134476.200000008, 1134476.2000000002)
```

If you use Python’s `sum`, you get your expected output:

```
>>> p_sum / (len(taxi_modified)-2)
12.667502623997947
```

while using NumPy’s `np.sum`:

```
>>> n_sum / (len(taxi_modified)-2)
12.667502623997859
```

gives a result similar to `taxi_modified[np.r_[:1800,1802:len(taxi_modified)], 7].mean()`, so I guess `mean` internally uses `np.sum`, but I am not sure.
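The guess is essentially right: NumPy's documentation notes that `np.sum` (and hence `mean`, which builds on it) uses pairwise summation along contiguous axes, which accumulates less rounding error than Python's strict left-to-right `sum`. A sketch, with `math.fsum` as the exactly rounded reference:

```python
import math
import numpy as np

arr = np.full(100_000, 0.1)

seq = sum(arr)          # left-to-right accumulation
pair = np.sum(arr)      # pairwise summation
exact = math.fsum(arr)  # exactly rounded reference

print(seq, pair, exact)
```

The pairwise result lands closer to the exact sum, which is why the two divisions above disagree in the last digits.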

You don’t have to include an extra `1`: since indexing starts at `0`, `len(taxi_modified)` as the stop already covers up to the last element of the array.

Here again you don’t have to add an extra `1` to `end`. Second, `np.r_` only generates indices for your array, so after generating them you have to use them to index `tab`. Also, you want to ignore the value `5`, which sits at index `4`, so the generated indices should skip `4`.
Hence it should be:

```
end = len(tab)               # no extra `1` needed
tab = tab[np.r_[:4, 5:end]]  # index into `tab`, skipping index 4 (the value 5)
```
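As a self-contained check, removing only the value `5` (at index `4`) looks like this:

```python
import numpy as np

tab = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Keep indices 0..3 and 5..9, skipping index 4, where the value 5 lives
keep = np.r_[:4, 5:len(tab)]
print(tab[keep])  # [ 1  2  3  4  6  7  8  9 10]
```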

The docs describe `np.r_` as translating slice objects to concatenation. So `np.r_[:1800, 1802:]` is similar to concatenating `np.arange(start=None, stop=1800)` and `np.arange(start=1802, stop=None)`, but I don’t know why it returns a result when `stop=None` in the second slice.
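A possible explanation, based on reading `numpy.lib.index_tricks` (worth verifying against your NumPy version): `np.r_` fills in a missing start with `0` and a missing step with `1`, but passes a missing stop through as `None` to `np.arange`, and `np.arange` treats a `None` stop like its one-argument form, i.e. the given start becomes the stop. If so, `np.r_[1802:]` quietly becomes `np.arange(1802)`, and `np.r_[:1800, 1802:]` double-counts rows `0`–`1801` instead of skipping rows `1800`–`1801`, which would explain the significantly different mean.

```python
import numpy as np

# The open-ended slice expands to np.arange(5, None, 1), which NumPy
# appears to treat as np.arange(5): indices 0..4, not "5 to the end".
print(np.r_[5:])
```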