Boolean Indexing with NumPy slide 5

Please can someone provide an alternative explanation to this? I don’t understand the (4) or the reference to the first and second axis

You have the following:

bool_1 = [True, False, True, True]

arr = np.array([
 [1, 2, 3],
 [4, 5, 6],
 [7, 8, 9],
 [10, 11, 12]
])

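So when you index the rows of the arr array like so: arr[bool_1], each value in bool_1 is paired with the row of arr at the same position:

True  -> [1, 2, 3]     (kept)
False -> [4, 5, 6]     (dropped)
True  -> [7, 8, 9]     (kept)
True  -> [10, 11, 12]  (kept)

Bear with the rough portrayal. But you can see how the rows that “match” to True remain after the index, whereas those that “match” to False don’t. This is boolean indexing.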

The resulting array will thus simply look like this:

([
 [1, 2, 3],
 [7, 8, 9],
 [10, 11, 12]
])

For the boolean indexing to work, the length of the boolean index (4) had to match the number of rows in arr (4), that is, the length of arr’s first axis. This is what was meant by “shape”.

That part I understand. However, what is the (4) referring to? It says bool_1’s shape (4) is not the same as arr’s second axis (3). The second axis I thought referred to the number of columns, however the result of bool_1 has the same number of columns but only 3 rows. Wouldn’t that mean it’s not the same shape as arr’s first axis?

The (4) is the length of bool_1. Since bool_1 is just a flat list of boolean values, its length is the only thing its shape has to describe.

When you do:
arr[bool_1]

You’re indexing the rows of arr (its first axis) with bool_1.
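
Here is a minimal sketch of the same thing in terms of shapes. The names bool_arr and result are only for illustration; they are not from the slide.

```python
import numpy as np

bool_1 = [True, False, True, True]
arr = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12]
])

bool_arr = np.array(bool_1)
print(bool_arr.shape)  # (4,): this is the "(4)" from the slide
print(arr.shape)       # (4, 3): first axis (rows) is 4, second axis (columns) is 3

# bool_1 is matched against the first axis, so its length must be 4
result = arr[bool_1]
print(result.shape)    # (3, 3): three True values, so three rows survive
```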

If you change the length of the bool_1 list, for instance by adding or removing a True value, you’ll see it no longer works.
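
A rough sketch of what "won’t work" looks like, assuming a reasonably recent NumPy version (older releases may behave differently). The names too_long and failures are made up for this example.

```python
import numpy as np

arr = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12]
])
bool_1 = [True, False, True, True]   # length 4
too_long = bool_1 + [True]           # length 5

failures = []

# Length 5 against the first axis (4 rows): lengths don't match
try:
    arr[too_long]
except IndexError as exc:
    failures.append("rows")
    print("rows:", exc)

# Length 4 against the second axis (3 columns): the case from the slide
try:
    arr[:, bool_1]
except IndexError as exc:
    failures.append("columns")
    print("columns:", exc)
```

Both attempts fail for the same reason: the length of the boolean list has to equal the length of the axis it is applied to.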

You can experiment using a different boolean list of length 3 to index arr by columns.

Try the following and see the output:

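import numpy as np

# One boolean per column of arr (3 columns)
bool1 = [True, False, True]

arr = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12]
])

# Keep every row (:) and only the columns where bool1 is True
print(arr[:, bool1])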

The column indexing above works because the number of columns of the array (3) matches the length of the bool1 list (3).

Yes, this is why indexing by column didn’t work in the provided example: bool_1 has length 4, while arr’s second axis (its columns) has length 3.

The shape of arr after the indexing doesn’t matter, if that’s what your question is, because by then the indexing has already occurred. The shapes need to match for the indexing itself.

The technical details look answered, so I will just share some big-picture stuff.

30 minutes here gives benefits for life: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing

Boolean indexing (sometimes called a boolean mask) is very important. It translates what you want, expressed as one or more conditions, into a sequence of True/False values that is applied to some data to get filtered data (usually with fewer rows, columns, or both). The mask does not have to be a list; it can be another array-like type, such as a NumPy array or a pandas Series.

The boolean mask may come from some other data source, or be produced by the data you are working with itself, such as s[s > 0] for filtering all positive values in the series s. Rather than hardcoding the list of True/False values yourself, you usually generate it with a comparison operator taking two operands: your data (most likely a particular column) and a value that you set yourself. You can then apply this mask to whatever data you like. Usually that is the same data the mask was produced from, but keep in mind that where the mask comes from and where it is applied are both open to manipulation. This matters when you work with large dataframes and it may be impossible to join two of them in RAM at once: you get the mask first (mask = the condition evaluated on df1, not df1[condition], which would return the filtered rows), del df1 to clear memory, then use df2[mask] to select.
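
A small sketch of both ideas, with made-up data. The column names score and name, and the assumption that df1 and df2 share the same index, are mine and only for illustration.

```python
import pandas as pd

# Mask produced from the data itself
s = pd.Series([-2, 3, 0, 5])
positive = s[s > 0]          # keeps 3 and 5

# Mask produced from one dataframe, applied to another
df1 = pd.DataFrame({"score": [10, 55, 80]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"name": ["Ann", "Bob", "Cy"]}, index=["a", "b", "c"])

mask = df1["score"] > 50     # boolean Series, one value per row of df1
del df1                      # the mask survives; df1 can be freed
selected = df2[mask]         # rows "b" and "c" of df2

print(positive)
print(selected)
```

Note that the mask here is the comparison result, not the filtered dataframe, and that it lines up with df2 only because both frames use the same index labels.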

Digressing in this paragraph: masks can be manipulated like sets too. For example, you may have multiple dataframes, each with a different index, and want to do some index alignment such as “find all the rows whose index appears in df1 but not in df2”. You can then use pandas.Index.difference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.difference.html
Here’s also introducing df.reindex (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html), which is super important for filling in missing values and index alignment, such as in time series applications where not every entity has the same sampling frequency or number of points (df.pivot_table is a good automatic index/column filling tool too).
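
To make that concrete, here is a minimal sketch with two tiny made-up dataframes. The column name v and the choice of 0 as a fill value are assumptions for the example.

```python
import pandas as pd

df1 = pd.DataFrame({"v": [1, 2, 3]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"v": [20, 30]}, index=["b", "d"])

# Index labels that appear in df1 but not in df2
only_in_df1 = df1.index.difference(df2.index)
rows_only_in_df1 = df1.loc[only_in_df1]

# Align df2 to df1's index: labels missing from df2 become NaN,
# and labels that exist only in df2 ("d") are dropped
aligned = df2.reindex(df1.index)

# Same alignment, but filling the gaps with a value of your choice
aligned_filled = df2.reindex(df1.index, fill_value=0)

print(list(only_in_df1))
print(aligned)
print(aligned_filled)
```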

These skills are important because in businesses, information comes from databases where each table represents a different entity, and you must read the tables into dataframes and df.merge them together to enrich the columns available for analyzing a certain object. Without auditing your index alignment you won’t have accurate row counts for analysis. Yes, pandas has an automatic index alignment feature, but it’s important to know these tools so you are clear about what pandas is and is not doing for you.
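
One possible way to audit a merge, sketched with hypothetical customers and orders tables (the table and column names are invented for this example). The indicator=True option of merge adds a _merge column saying where each row came from.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 4],
                       "amount": [10, 20, 30]})

# Left merge: keep every customer, attach orders where they exist
merged = customers.merge(orders, on="customer_id", how="left", indicator=True)

# Customer 1 has two orders, so the row count grows from 3 to 4
print(len(customers), len(merged))

# Customers with no matching order
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["name"].tolist())
```

Checking the row count and the _merge column after each merge is a cheap way to notice duplicated or unmatched keys before they distort an analysis.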

df.set_index and df.reset_index are also must-know tools, because they will preserve your sanity in the world of df.unstack/stack/pivot_table/melt/groupby.
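
A short sketch of how these fit together, using an invented sales table (the city, year and sales columns are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "B"],
    "year": [2020, 2021, 2020],
    "sales": [1, 2, 3],
})

# Move the identifying columns into the index
indexed = df.set_index(["city", "year"])

# Pivot the "year" level of the index into columns
wide = indexed["sales"].unstack("year")

# Turn the remaining index back into an ordinary column
flat = wide.reset_index()

print(wide)
print(flat)
```

City B has no 2021 row in the input, so that cell comes out as NaN in wide and flat.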

On to some machine learning applications: masks are very useful for selecting training/testing rows from the combined set of rows. It is faster to shuffle the mask (or the row index, or whatever identifier you want) and then use it to select the rows for the training/testing sets than to shuffle all the data. Here’s introducing the negated mask: if the training set is df[mask], the test set can simply be df[~mask], although you will more commonly see the array[:split_point], array[split_point:] slicing syntax.
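
A minimal sketch of the mask-based split, assuming an 80/20 split and a fixed seed chosen only to make the example repeatable:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# Build a mask with 8 True values (train) and 2 False values (test)
mask = np.zeros(len(df), dtype=bool)
mask[:8] = True

# Shuffle the small mask instead of shuffling the data itself
rng = np.random.default_rng(0)
rng.shuffle(mask)

train = df[mask]
test = df[~mask]   # the negated mask selects everything train did not

print(len(train), len(test))
```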

Besides train-test split, you can also apply this mask/~mask pattern when building a decision tree, splitting a dataset into left and right branches. Basically, it applies anywhere two groups are mutually exclusive.
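
As a rough illustration of the decision tree case, here is one split on a toy dataset. The feature index and threshold are hypothetical values picked by hand; a real tree would search for them.

```python
import numpy as np

X = np.array([
    [2.0, 1.0],
    [5.0, 3.0],
    [1.0, 4.0],
    [7.0, 0.5],
])
y = np.array([0, 1, 0, 1])

feature = 0       # hypothetical: split on the first column
threshold = 3.0   # hypothetical split point

# True where a row goes to the left branch
mask = X[:, feature] <= threshold

left_X, left_y = X[mask], y[mask]
right_X, right_y = X[~mask], y[~mask]

print(left_y, right_y)
```

Every row lands in exactly one branch, because mask and ~mask can never both be True for the same row.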