Boolean indexing with NumPy, final challenge

Screen Link: https://app.dataquest.io/m/290/boolean-indexing-with-numpy/10/challenge-calculating-statistics-for-trips-on-clean-data

In this final challenge, and throughout the exercise to be honest, I have been confused by the first line of code you see below ( trip_mph = taxi[:,7] / (taxi[:,8] / 3600) ). From my understanding, and when I print it, I see that trip_mph is a one-dimensional array and that it is not connected to the “taxi” array. So how does using a boolean mask on it enable you to calculate data on columns that are in the “taxi” array? Wouldn’t you essentially just be masking the one-dimensional “trip_mph” array?

Please let me know if you have any thoughts / wisdom.

Thanks!

Answer Code:

trip_mph = taxi[:,7] / (taxi[:,8] / 3600)
cleaned_taxi = taxi[trip_mph < 100]

mean_distance = cleaned_taxi[:,7].mean()
mean_length = cleaned_taxi[:,8].mean()
mean_total_amount = cleaned_taxi[:,13].mean()

hey @alvand.hajizadeh

Hope this helps you. If not, it will most likely confuse you! Apologies at the outset :grimacing:

Take a very simple example: x, y = 5, 4. Here I am declaring two variables and assigning them values.
case1: print(x + y) results in 9
case2: print(x * y) results in 20
This happens because, when the print command runs, the variables are replaced by the values they hold.

Similarly, taxi[trip_mph < 100] gives the same result as taxi[(taxi[:,7] / (taxi[:,8] / 3600)) < 100], because trip_mph simply holds the values that expression produced.

So it’s not an interaction between two unrelated arrays. Instead, the taxi array gets filtered by values computed from its own columns: the boolean mask is applied and only the matching rows are kept.
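Here is a minimal sketch of that equivalence. The tiny two-column array is made up for illustration (the real taxi array keeps distance and trip length in columns 7 and 8), so treat the column numbers and values as assumptions.

```python
import numpy as np

# Hypothetical toy data: column 0 = distance (miles), column 1 = trip length (seconds).
# In the real taxi array these are columns 7 and 8.
taxi = np.array([
    [1.0, 360.0],   # roughly 10 mph
    [5.0, 60.0],    # roughly 300 mph, clearly bad data
    [2.0, 600.0],   # roughly 12 mph
])

# Computed once from taxi's own columns: one speed per row.
trip_mph = taxi[:, 0] / (taxi[:, 1] / 3600)

# Filtering with the named variable ...
with_variable = taxi[trip_mph < 100]
# ... selects the same rows as writing the expression inline.
inline = taxi[(taxi[:, 0] / (taxi[:, 1] / 3600)) < 100]

print(np.array_equal(with_variable, inline))  # True
```

Storing the speeds in trip_mph just saves retyping the expression; nothing is recalculated when the mask is applied.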

Screen 5 of the same mission may help you with this, or maybe this blog.


This is in fact very helpful! Thank you for the explanation, really appreciate it!


Does this imply that the expression used to define trip_mph is reapplied to taxi to produce cleaned_taxi? I.e., for each row of taxi, the division is reapplied so the rows get filtered? Why do I need trip_mph then?


I am finding it hard to absorb the effect of NumPy slicing followed by boolean indexing on a particular variable.
For example, in the final challenge, NumPy slicing is applied to taxi to get trip_mph (which is only supposed to store speeds). How come another variable, cleaned_taxi, which is a subset of trip_mph, is then used to fetch the distance and length columns and have methods applied to them?
Why does cleaned_taxi have those columns in the first place?
Could someone explain, please?

I am also stuck on this and extremely confused. Can someone help me understand this in simple (layman's) terms?

Hey @tyche2k and @aswadr093

I am also a student trying to learn. I might be wrong, but here is what I understood:

  1. Because the goal is to remove bad data, the variable trip_mph gives you the value needed to build a condition that returns True or False. In other words, you need something (like a number or a condition) that lets you decide whether a row is bad data. A filter criterion, if you will. (Bad data in this case == 100 mph or above.)
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)
  2. The variable cleaned_taxi stores an ndarray that has only those rows meeting the condition trip_mph < 100 (returning True = good data). For rows where taxi[:,7] / (taxi[:,8] / 3600) is 100 or greater, the condition returns False (bad data), so those rows are not stored in cleaned_taxi.
cleaned_taxi = taxi[trip_mph < 100]
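The two steps can be sketched on a made-up array to see the True/False values directly. The three-column layout and the numbers are hypothetical, and mask is just an illustrative name for the intermediate result.

```python
import numpy as np

# Hypothetical toy data: [distance in miles, trip length in seconds, total amount]
taxi = np.array([
    [3.0, 900.0, 15.5],   # about 12 mph
    [8.0, 120.0, 9.0],    # about 240 mph, bad data
    [1.5, 300.0, 7.25],   # about 18 mph
])

trip_mph = taxi[:, 0] / (taxi[:, 1] / 3600)  # 1D array of speeds, one per row
mask = trip_mph < 100                        # 1D array of True/False, one per row
cleaned_taxi = taxi[mask]                    # keeps only rows where mask is True

print(mask)          # one boolean per row of taxi
print(cleaned_taxi)  # the rows of taxi that passed the filter
```

The mask has one entry per row of taxi, which is why it can pick whole rows out of taxi even though trip_mph itself only holds speeds.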

I think what was confusing, at least for me, is that despite the conditions and new variables, you are NOT creating a new column…as one would expect to do if working in an MS Excel spreadsheet, for example.

Hope it helps and that a Community Moderator could double-check what I have tried to explain. :grimacing:


great explanation. It makes perfect sense. kudos

@aswadr093 @tyche2k The cleaned_taxi array is identical in structure to the taxi array. It contains all columns. However, it contains only the rows that satisfy the condition (taxi[:,7] / (taxi[:,8] / 3600)) < 100.
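One way to check this is to compare the shapes before and after filtering. The sketch below uses a made-up stand-in for taxi; the 15-column width and all values are assumptions, and only columns 7, 8 and 13 are filled in, matching the indices used in the answer code.

```python
import numpy as np

# Hypothetical stand-in for taxi: 4 trips, 15 columns (the column count is an assumption).
taxi = np.zeros((4, 15))
taxi[:, 7] = [2.0, 10.0, 4.0, 6.0]          # distance in miles
taxi[:, 8] = [600.0, 30.0, 1200.0, 900.0]   # trip length in seconds
taxi[:, 13] = [12.0, 50.0, 20.0, 25.0]      # total amount

trip_mph = taxi[:, 7] / (taxi[:, 8] / 3600)
cleaned_taxi = taxi[trip_mph < 100]

print(taxi.shape)          # (4, 15)
print(cleaned_taxi.shape)  # (3, 15): one row dropped, all columns kept

# Because every column survives, the usual column slices still work.
mean_distance = cleaned_taxi[:, 7].mean()
mean_total_amount = cleaned_taxi[:, 13].mean()
print(mean_distance, mean_total_amount)
```

Only the row count changes, so column 7 in cleaned_taxi is still distance and column 13 is still the total amount.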

thanks for explaining this- i was confused too!

can you explain what is meant by this?
image

I don’t understand what is meant by axis here and how the limitations of length affect the output in an ndarray.